A practical guide for Pharma Quality, IT, and data teams building permanent AI-readiness in regulated environments.
The feedback is remarkably consistent: poor data is the #1 problem for enterprise AI applications. A survey among the top 20 Pharma companies found that 70% of respondents cite poor data as the main obstacle[1]. McKinsey puts the number at 67%[2]. The stakes are high even among the most advanced companies: xAI, a $250bn LLM company, is reportedly lagging its competitors because of low-quality training data. In Pharma, data quality is essential both for training quality and for processing accuracy, yet data reflects the layered history of people, systems and vocabulary. It is like trying to teach a child to read using books where every paragraph switches language.
In Pharma and other high-precision industries, this challenge is particularly acute for three reasons: the requirements on accuracy are much higher; technical terms and company-specific abbreviations are not well handled by generic AI; and there is a substantial legacy of paper archives, handwritten documents and spread out digital systems.
BCG's 2025 State of AI survey adds a human dimension: 72% of respondents cite lack of expertise with unstructured data as the #1 problem[3]. The issue is not only what shape the data is in, but how best to make it AI-ready and which resources need to be orchestrated to get there. In other words, it is also a people and a systems challenge.
Data scientists already report spending 45% of their time on data preparation and cleaning[4]. In the age of AI, this work is no longer limited to data science projects — it is a company-wide issue where thousands of people are creating large quantities of inputs and outputs every day. Cleaning up the archive is a good start, but it is only the beginning. Only 14% of organisations currently have their data ready for fast deployment of AI applications[5] — closing that gap is now one of the clearest opportunities for competitive differentiation in the industry. So, what does it take?
In this article
No ISO standard, no FDA guidance, no EMA framework has settled the question. What exists instead is a fragmented landscape of vendor definitions and analyst frameworks — which itself says something about the maturity of the field.
One helpful starting point is the established concept of FAIR data[6], which refers to research data that adheres to four principles — Findable, Accessible, Interoperable, and Reusable — designed to optimise the reuse of data by both humans and machines. The framework was built to increase scientific efficiency, transparency, and reproducibility. For AI applications, however, it is incomplete.
IBM offers one of the more accessible definitions: AI-ready data is "high-quality, accessible and trusted information that organizations can confidently use for artificial intelligence training and initiatives"[7], with key attributes including unified, governed, secure, and supported by human oversight.
Gartner, probably the most credible non-vendor voice, provides a more comprehensive framework[8] organised around seven dimensions:
Gartner further adds that AI-readiness is a project-specific topic. We agree — but would nuance that permanent data quality infrastructure is a major enabler that fast-forwards project-specific work. That distinction shapes the two-part structure below: the always-on layer first, then project specifics.
The World Bank[9] and Qlik[10] also provide relevant perspectives on governance and the organisational conditions needed to sustain data quality over time.
FAIR data provides the clearest starting point — but it was designed for scientific data sharing, not AI deployment in regulated environments. Our definition extends it accordingly.
AI-Ready Data in Pharma is FAIR-GRACE
| Letter | Principle | What it means in practice |
|---|---|---|
| F | Findable | Contextualised with rich metadata (context) and uniquely identifiable (UID), systems can locate and retrieve the right data reliably. |
| A | Accessible | Standardised access with authorisation and authentication — data is available to the right systems and people, without unnecessary friction. |
| I | Interoperable | Standardised formats and agreed definitions across systems. Jargon, abbreviations, and company-specific terminology resolved so generic AI models can interpret content correctly. |
| R | Reusable | Contextualised and traceable. The original source travels with the content and the transformation path is explainable and auditable (concept of Lineage). |
| G | Governed | Regulatory and ethical considerations managed through a company-wide framework — not left to individual projects. Includes Principles and named decision-makers with authority to approve or veto usages. |
| R | Representative | Data comes in sufficient quantity and reflects appropriate diversity and proportion, reducing the risk of wrong or biased AI outputs. |
| A | Authoritative | Source authority is characterised. A QA-reviewed analysis outweighs a third-party document; the latest approved version outweighs prior ones. FAIR records provenance — this evaluates trustworthiness. |
| C | Continuously prepared | AI-readiness is not a one-time project. Ongoing pipelines maintain quality as new data is created. Preparation can be automated, with new data labelled automatically once an initial training set has been annotated (cf. similar point from Gartner). |
| E | Ethically Shareable | PII is redacted, enabling use across teams and systems without compliance risk. Applies equally to competitively sensitive R&D data. FAIR covers access control — this is about removing risks entirely. |
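As a toy illustration of how several FAIR-GRACE dimensions could be checked before a record enters an AI pipeline, consider the minimal sketch below. The field names and gap messages are our own invention, not a standard schema; a real implementation would live in your metadata catalogue.

```python
from dataclasses import dataclass, field

@dataclass
class DocRecord:
    """Hypothetical metadata wrapper for one document entering an AI pipeline."""
    uid: str                # Findable: unique identifier
    source_system: str      # Reusable: lineage back to the original source
    doc_format: str         # Interoperable: standardised format (e.g. "pdf/a")
    approved_version: bool  # Authoritative: latest QA-approved version?
    pii_redacted: bool      # Ethically Shareable: PII removed?
    tags: list = field(default_factory=list)  # Findable: rich metadata

def readiness_gaps(rec: DocRecord) -> list:
    """Return the FAIR-GRACE dimensions this record still fails."""
    gaps = []
    if not rec.uid:
        gaps.append("Findable: missing UID")
    if not rec.source_system:
        gaps.append("Reusable: no lineage to source")
    if not rec.approved_version:
        gaps.append("Authoritative: not the approved version")
    if not rec.pii_redacted:
        gaps.append("Ethically Shareable: PII not redacted")
    return gaps

rec = DocRecord(uid="DOC-0042", source_system="LIMS", doc_format="pdf/a",
                approved_version=True, pii_redacted=False)
print(readiness_gaps(rec))  # → ['Ethically Shareable: PII not redacted']
```

Records with an empty gap list proceed automatically; anything else is routed to remediation, which is exactly the "continuously prepared" pipeline the table describes.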
Every AI use-case requires its own data preparation, but how fast and successfully that happens depends entirely on the quality of the foundation beneath it. Getting this layer right is the difference between deploying a new AI use-case in a few months versus a year; getting it wrong is a primary reason so many AI projects fail before reaching production. This does not remove the need for project-specific data preparation, but it creates a much higher state of ongoing readiness and significantly faster preparation for project-specific pipelines. Four areas to cover:
Recommended reading on data governance frameworks: see references [11][12][13].
This approach works well for building a general state of readiness, or for broad applications on top of a data lake. Most AI projects, especially agentic use-cases, need more specific, refined and controlled data to perform.
Let's dive into that!
With a solid infrastructure layer in place, each new AI project builds on a much higher baseline. The following six steps apply to every project, and the infrastructure layer compresses the time each one takes.
The data type determines the pipeline architecture.
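To make this concrete, a pipeline router could dispatch each input to a different sequence of processing steps based on its data type. The type names and step lists below are purely illustrative, not a prescribed taxonomy:

```python
# Illustrative mapping from data type to processing steps.
# Scanned documents need OCR first; born-digital ones do not.
PIPELINES = {
    "scanned_pdf": ["ocr", "layout_analysis", "entity_extraction"],
    "digital_pdf": ["text_extraction", "layout_analysis", "entity_extraction"],
    "spreadsheet": ["schema_mapping", "unit_normalisation"],
    "free_text":   ["language_detection", "terminology_resolution"],
}

def pipeline_for(data_type: str) -> list:
    """Pick the processing steps for a given data type, or fail loudly."""
    if data_type not in PIPELINES:
        raise ValueError(f"No pipeline defined for data type: {data_type}")
    return PIPELINES[data_type]

print(pipeline_for("scanned_pdf"))  # → ['ocr', 'layout_analysis', 'entity_extraction']
```

Failing loudly on unknown types matters in regulated environments: silently routing an unrecognised format through a default pipeline is how quality problems enter unnoticed.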
Understand the house before you clean it.
The most time-consuming step — and the one most directly correlated with model quality.
This is the quality foundation everything else depends on. While data preparation can be automated, grappling with the reality of the data and manually checking the content at the outset is the best way to anchor accuracy at scale.
Use this step to define your data quality standards based on the demands of the use-case:
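Those standards are most useful when they are executable. The sketch below shows one possible shape: named thresholds compared against measured metrics. The metric names and values are hypothetical placeholders, not recommended targets:

```python
# Hypothetical use-case-specific quality thresholds; values are placeholders.
STANDARDS = {
    "extraction_accuracy_min": 0.98,  # share of fields matching ground truth
    "completeness_min": 0.95,         # share of required fields populated
    "max_duplicate_rate": 0.01,       # share of duplicate records tolerated
}

def meets_standards(metrics: dict) -> dict:
    """Compare measured metrics against thresholds; return pass/fail per check."""
    return {
        "accuracy": metrics["extraction_accuracy"] >= STANDARDS["extraction_accuracy_min"],
        "completeness": metrics["completeness"] >= STANDARDS["completeness_min"],
        "duplicates": metrics["duplicate_rate"] <= STANDARDS["max_duplicate_rate"],
    }

result = meets_standards({"extraction_accuracy": 0.99,
                          "completeness": 0.93,
                          "duplicate_rate": 0.004})
print(result)  # → {'accuracy': True, 'completeness': False, 'duplicates': True}
```

Because the thresholds live in one named structure, different use-cases can carry different standards while sharing the same checking machinery.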
With a solid base model, begin compounding.
This is never a one-time task.
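In code terms, "never one-time" means monitoring quality as new batches arrive rather than certifying the archive once. A minimal sketch, assuming a per-batch quality score is already computed upstream (window size and threshold are illustrative):

```python
from collections import deque

class QualityMonitor:
    """Rolling check that data quality stays above a threshold over recent batches."""

    def __init__(self, threshold=0.95, window=5):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # keep only the most recent batches

    def record(self, batch_score: float) -> bool:
        """Log a batch's quality score; return True if the rolling average holds."""
        self.scores.append(batch_score)
        avg = sum(self.scores) / len(self.scores)
        return avg >= self.threshold

monitor = QualityMonitor()
print(monitor.record(0.97))  # → True: quality holding
print(monitor.record(0.80))  # → False: rolling average dips below 0.95
```

A `False` here would trigger the same remediation loop as the initial cleanup, which is the point: the pipeline that made the data AI-ready is the pipeline that keeps it that way.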
The data problem in pharma AI is solvable. It is, however, consistently underestimated — and consistently addressed too late, without enough involvement from the subject-matter experts who hold the implicit knowledge that is so essential for high-quality outcomes.
Organisations that build permanent data infrastructure place themselves among the 14% of companies currently ready for fast AI deployment. With a baseline of high-quality data and rigorous practices throughout the organisation, new initiatives become faster and more reliable — leading to new insights and better margins. This is where advantage compounds, with leaders able to operate meaningfully ahead of the pack.
Want to discuss how to prepare your data for AI readiness?
Acodis helps pharmaceutical and life sciences companies build permanent data infrastructure for scalable AI deployment.
Book a free 30-minute consultation →
[1] Senderovitz T., Weatherall J., Rochon J. et al. (DISRUPT-DS Roundtable). "Generative AI in pharmaceutical R&D: From large language models to AI agents to regulation." Drug Discovery Today, Vol. 31, Issue 1, January 2026. doi:10.1016/j.drudis.2025.104593
[2] McKinsey & Company. "The State of AI in 2024: Scaling enterprise AI adoption." McKinsey Global Institute Report, 2024.
[3] BCG. "Are You Generating Value from AI? The Widening Gap." BCG 2025 State of AI Survey. bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap
[4] Anaconda. "2021 State of Data Science Report." anaconda.com/state-of-data-science-2021
[5] Wipro Limited. "State of Data4AI 2025: Journeys to help CDAOs scale enterprise-level AI." Technical Report, 2025.
[6] GO FAIR Initiative. FAIR Principles. go-fair.org/fair-principles/
[7] IBM. "AI-Ready Data." ibm.com/think/topics/ai-ready-data
[8] Gartner. "AI-Ready Data." gartner.com/en/articles/ai-ready-data
[9] World Bank. "From Open Data to AI-Ready Data." blogs.worldbank.org/en/opendata/from-open-data-to-ai-ready-data
[10] Qlik. "AI-Ready Data." qlik.com/us/ai-ready-data
[11] DAMA International. DAMA-DMBOK: Data Management Body of Knowledge (2nd ed.). Technics Publications, 2017. The industry-standard reference for data management practices.
[12] Dataversity. "Data Governance Best Practices." dataversity.net — Practitioner-focused guidance on implementing governance programmes.
[13] IQVIA. "Navigating the Data Deluge Through Data Governance." IQVIA White Paper — Pharma-specific; notes that only 31% of pharma companies have a fully implemented data governance strategy.