By Acodis Pharma Team on March 17, 2026

The #1 Challenge for AI Applications in Pharma – Data – and How to Fix It

A practical guide for Pharma Quality, IT, and data teams building permanent AI-readiness in regulated environments.

The feedback is very consistent: poor data is the #1 problem for enterprise AI applications. A survey among the top 20 Pharma companies found that 70% of respondents rank poor data as the main obstacle[1]. McKinsey puts the number at 67%[2]. The stakes are high even among the most advanced companies: $250bn LLM company xAI is reportedly lagging its competitors because of low-quality training data. In Pharma, data quality is essential both for training quality and for processing accuracy — but data reflects the layered history of people, systems and vocabulary. It is like trying to teach a child to read using books where every paragraph switches language.

In Pharma and other high-precision industries, this challenge is particularly acute for three reasons: the requirements on accuracy are much higher; technical terms and company-specific abbreviations are not well handled by generic AI; and there is a substantial legacy of paper archives, handwritten documents and scattered digital systems.

BCG's 2025 State of AI survey adds a human dimension: 72% of respondents cite lack of expertise with unstructured data as the #1 problem[3]. The issue is not only what shape the data is in, but also how best to make it AI-ready and which resources must be orchestrated to get there. In other words, it's also a people and a systems challenge.

Data scientists already report spending 45% of their time on data preparation and cleaning[4]. In the age of AI, this work is no longer limited to data science projects — it is a company-wide issue where thousands of people are creating large quantities of inputs and outputs every day. Cleaning up the archive is a good start, but it is only the beginning. Only 14% of organisations currently have their data ready for fast deployment of AI applications[5] — closing that gap is now one of the clearest opportunities for competitive differentiation in the industry. So, what does it take?

1. Definition of AI-Ready Data

The diverse views from industry experts

No ISO standard, no FDA guidance, no EMA framework has settled the question. What exists instead is a fragmented landscape of vendor definitions and analyst frameworks — which itself says something about the maturity of the field.

One helpful starting point is the established concept of FAIR data[6], which refers to research data that adheres to four principles — Findable, Accessible, Interoperable, and Reusable — designed to optimise the reuse of data by both humans and machines. The framework was built to increase scientific efficiency, transparency, and reproducibility. For AI applications, however, it is incomplete.

IBM offers one of the more accessible definitions: AI-ready data is "high-quality, accessible and trusted information that organizations can confidently use for artificial intelligence training and initiatives"[7], with key attributes including unified, governed, secure, and supported by human oversight.

Gartner, probably the most credible non-vendor voice, provides a more comprehensive framework[8] organised around seven dimensions:

  • Quantification — sufficient data volume
  • Semantics and labeling — proper annotation and labeling
  • Quality — quality standards specific to the AI use case
  • Trust — reliable data sources and pipelines
  • Diversity — diverse sources to avoid bias
  • Lineage — transparency about data origins and transformations
  • AI technique specificity — different AI techniques have unique data requirements

Gartner further adds that AI-readiness is a project-specific topic. We agree — but would nuance that permanent data quality infrastructure is a major enabler that fast-forwards project-specific work. That distinction shapes the two-part structure below: the always-on layer first, then project specifics.

The World Bank[9] and Qlik[10] also provide relevant perspectives on governance and the organisational conditions needed to sustain data quality over time.

The Acodis Framework: FAIR-GRACE

FAIR data provides the clearest starting point — but it was designed for scientific data sharing, not AI deployment in regulated environments. Our definition extends it accordingly.

AI-Ready Data in Pharma is FAIR-GRACE

Each letter stands for a principle; here is what it means in practice:

  • F – Findable: Contextualised with rich metadata and uniquely identifiable (UID), so systems can locate and retrieve the right data reliably.
  • A – Accessible: Standardised access with authorisation and authentication – data is available to the right systems and people, without unnecessary friction.
  • I – Interoperable: Standardised formats and agreed definitions across systems. Jargon, abbreviations, and company-specific terminology are resolved so generic AI models can interpret content correctly.
  • R – Reusable: Contextualised and traceable. The original source travels with the content, and the transformation path is explainable and auditable (the concept of lineage).

  • G – Governed: Regulatory and ethical considerations managed through a company-wide framework, not left to individual projects. Includes principles and named decision-makers with authority to approve or veto usages.
  • R – Representative: Data comes in sufficient quantity and reflects appropriate diversity and proportion, reducing the risk of wrong or biased AI outputs.
  • A – Authoritative: Source authority is characterised. A QA-reviewed analysis outweighs a third-party document; the latest approved version outweighs prior ones. FAIR records provenance; this principle evaluates trustworthiness.
  • C – Continuously prepared: AI-readiness is not a one-time project. Ongoing pipelines maintain quality as new data is created; once an initial training set has been annotated, preparation and labelling can be automated (cf. the similar point from Gartner).
  • E – Ethically shareable: PII is redacted, enabling use across teams and systems without compliance risk. The same applies to competitively sensitive R&D data. FAIR covers access control; this principle is about removing the risk entirely.
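
To make the framework concrete, here is a minimal sketch of what a FAIR-GRACE-aligned document record could look like in code. The field names and values are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DocumentRecord:
    """Illustrative FAIR-GRACE record; all field names are assumptions, not a standard."""
    uid: str                                   # Findable: unique identifier
    title: str
    document_type: str                         # e.g. "SOP", "Batch Record", "CoA"
    source_system: str                         # Authoritative: where the source of truth lives
    version: str                               # Authoritative: latest approved version wins
    qa_reviewed: bool                          # Authoritative: QA-reviewed outweighs third-party
    access_roles: list[str] = field(default_factory=list)   # Accessible: who may read it
    lineage: list[str] = field(default_factory=list)        # Reusable: transformation path
    pii_redacted: bool = False                 # Ethically shareable: safe across teams
    metadata: dict[str, str] = field(default_factory=dict)  # Findable: rich context

record = DocumentRecord(
    uid="sop-0042-v7",
    title="Cleaning Validation Procedure",
    document_type="SOP",
    source_system="EDMS",
    version="7.0",
    qa_reviewed=True,
    access_roles=["quality", "manufacturing"],
    lineage=["scanned 2024-03-01", "OCR pass", "PII redaction pass"],
    pii_redacted=True,
    metadata={"site": "Basel", "language": "en"},
)
```

The point is not the specific fields but that identity, authority, lineage and shareability travel with the content itself.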

2. The Always-On Infrastructure Layer

Every AI use-case requires its own data preparation — but how fast and successfully that happens depends entirely on the quality of the foundation beneath it. Getting this layer right is the difference between deploying a new AI use-case in a few months versus a year; getting it wrong is the primary reason so many AI projects fail before reaching production. This does not remove the need for project-specific data preparation, but it creates a much higher state of ongoing readiness and significantly faster preparation for project-specific pipelines. Four areas to cover:

Establish a data governance framework

  • Cover both principles (how data should be managed) and decision-makers (who decides for what)
  • Segment clearly: which data is team-level only, and which is available company-wide — this single distinction prevents most access and compliance conflicts downstream
  • Keep it short — a few pages that people can actually read and remember; the most effective frameworks are the concise ones, iterated over time rather than perfected upfront 

Recommended reading on data governance frameworks: see references [11][12][13].

Define data segments and map your systems

  • Define key data types at team and company level — what data matters, to whom, and for what purpose
  • Map all data systems and assign a source of truth — know where authoritative data lives before any AI pipeline is built
  • Agree on a shared nomenclature — standard formats for dates, abbreviations, and naming conventions for the most frequently used items

Digitalise and structure the archive

  • Ensure all key documents are in accessible, digital form — this already enables simple "search and retrieve" AI applications
  • Define minimum required metadata for the key document types — for SOPs, Batch Records, CoAs, Clinical Studies, agree the top 10 minimum required data points per document type (a minimal check of these fields is sketched after this list)
  • Convert documents to structured data at scale — a mass extraction approach with low specificity requirements and reasonable noise tolerance; the goal is coverage, not perfection. Use automation tools such as Acodis to do this at scale with full traceability.
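
As a concrete illustration of the minimum-metadata point above, a validation check could look like the sketch below. The document types and field names are assumptions for illustration, not a prescribed schema:

```python
# Hypothetical "top 10 minimum metadata" lists per document type.
REQUIRED_METADATA = {
    "SOP": ["uid", "title", "version", "effective_date", "owner", "site",
            "language", "review_date", "approval_status", "supersedes"],
    "Batch Record": ["uid", "product", "batch_number", "manufacturing_date",
                     "site", "equipment", "operator", "qa_release",
                     "deviations", "expiry_date"],
}

def missing_fields(doc_type: str, metadata: dict) -> list[str]:
    """Return the required metadata fields that are absent or empty."""
    required = REQUIRED_METADATA.get(doc_type, [])
    return [f for f in required if not metadata.get(f)]

doc = {"uid": "br-2026-0117", "product": "Compound X", "batch_number": "B-4471"}
print(missing_fields("Batch Record", doc))  # the seven fields still to be captured
```

Running such a check over the whole archive gives a simple, trackable measure of metadata completeness per document type.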

Maintain security and compliance throughout

  • Encryption: all data encrypted at rest and in transit. For encryption during model usage (inference), consider homomorphic encryption, though it slows down processing considerably
  • Access control: access limited to authorised personnel with strict logs maintained
  • Access reviews: review permissions regularly; the biggest risks lie in dormant accounts that retain full access

This approach works well for building a general state of readiness or broad applications on top of a data lake. Most AI projects — especially agentic use-cases — need more specific, refined and controlled data to perform well.

Let's dive into that!

Discuss with Acodis how to prepare your data

Get a call back →

3. Project-Specific Data Preparation, Step by Step

With a solid infrastructure layer in place, each new AI project builds on a much higher baseline. The following six steps apply to every project — the infrastructure layer compresses the time each one takes.

Step 1 — Define the use case and the data you need

The data type determines the pipeline architecture.

  • Distinguish different types of data:
    • Semantic / unstructured content (documents, contracts, reports, emails): requires extraction, classification, and meaning-preserving transformations
    • Structured / time series data (sales figures, sensor readings, lab results): requires numerical consistency, timestamp integrity, and clear definitions
  • Identify the minimum data inputs required for the models to produce the desired output (a useful test: what would a newly hired expert need to do this job?)
  • Identify the implicit knowledge held by subject-matter experts and translate it into machine-understandable data: "36 (34 - 38)" means a target temperature of 36 with a ±2 degree tolerance (see the sketch below)
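
As a sketch of that last point, the "36 (34 - 38)" convention could be turned into machine-readable structure as follows; the function and field names are hypothetical:

```python
import re

# Matches "target (low - high)", e.g. "36 (34 - 38)"; the notation is the
# expert convention quoted above, the output fields are assumptions.
PATTERN = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*\(\s*(-?\d+(?:\.\d+)?)\s*-\s*(-?\d+(?:\.\d+)?)\s*\)\s*$"
)

def parse_setpoint(raw: str) -> dict:
    """Parse 'target (low - high)' into an explicit target with tolerance bounds."""
    match = PATTERN.match(raw)
    if not match:
        raise ValueError(f"Unrecognised setpoint format: {raw!r}")
    target, low, high = (float(g) for g in match.groups())
    return {"target": target, "low": low, "high": high,
            "tolerance": max(target - low, high - target)}

print(parse_setpoint("36 (34 - 38)"))
# {'target': 36.0, 'low': 34.0, 'high': 38.0, 'tolerance': 2.0}
```

Every such convention a subject-matter expert can name is a candidate for a small, testable transformation like this one.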

Step 2 — Map and consolidate your sources

Understand the house before you clean it.

  • Identify all relevant sources — databases, PDFs, CRMs, data warehouses
  • Extract content buried in complex formats — text, tables, and images locked inside PDFs (a minimal sketch follows this list)
  • Where possible, consolidate data locations to reduce the number of integrations and sources to maintain
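
For the extraction point above, a minimal sketch using a generic open-source PDF library (pdfplumber is one assumption among several possible tools) might look like this; production pipelines need far more robustness:

```python
import pdfplumber  # assumption: one of several open-source options

def extract_pdf(path: str) -> dict:
    """Pull raw text and tables out of a PDF, page by page, keeping the source path."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            pages.append({
                "page": page.page_number,
                "text": page.extract_text() or "",
                "tables": page.extract_tables(),
            })
    return {"source": path, "pages": pages}
```

Note that recording the source path alongside the content is already the beginning of lineage.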

Step 3 — Clean and transform into AI-ready formats

The most time-consuming step — and the one most directly correlated with model quality.

  • Clean: fix errors, remove duplicates, handle missing values, ensure consistency
  • Fix and interpret layout: rotated pages, document structure (core vs annex), structure of pages and elements (tables, footers, etc.)
  • Structure content: extract text, tables and images with content hierarchy
  • Normalise: standardise units and naming conventions
  • Capture metadata: source, date, document type, author — this context makes data significantly more useful for downstream AI models
  • Adapt the depth of refinement to the use-case: RAG-based use-cases work well with structured content and clean metadata; deeper use-cases, such as transformation into the FHIR standard, require more refinement
  • Adopt a consistent format: whether JSON, XML, or a specific XML dialect, uniformity is what matters (a minimal normalisation sketch follows this list)
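
To illustrate normalisation and a consistent output format, here is a minimal sketch; the unit aliases, date formats and JSON shape are illustrative assumptions, and real pipelines will carry many more rules:

```python
import json
from datetime import datetime

UNIT_ALIASES = {"degC": "°C", "deg C": "°C", "celsius": "°C"}   # assumed aliases
DATE_FORMATS = ("%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d")             # assumed source formats

def normalise_date(raw: str) -> str:
    """Coerce common date spellings into ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

def normalise_record(raw: dict) -> str:
    """Emit one uniform JSON shape regardless of the source system's quirks."""
    return json.dumps({
        "document_type": raw["type"].strip().upper(),
        "date": normalise_date(raw["date"]),
        "unit": UNIT_ALIASES.get(raw["unit"], raw["unit"]),
        "value": float(raw["value"]),
    }, ensure_ascii=False)

print(normalise_record({"type": "coa", "date": "17.03.2026", "unit": "degC", "value": "36"}))
# {"document_type": "COA", "date": "2026-03-17", "unit": "°C", "value": 36.0}
```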

Step 4 — Build your labelled baseline for data preparation

This is the quality foundation everything else depends on. Whilst data preparation can be automated, grappling with the reality of the data and manually checking the content at the outset is the best way to anchor accuracy at scale.

  • Manually label or verify a representative sample of 30–100 documents / data samples
  • Keep that work within 1–3 people to maintain consistency — variation in labelling is the fastest way to get low-performing models (a simple agreement check is sketched after this list)
  • Ensure the sample captures the full diversity of the underlying data and cases to be processed
  • If the scope is too wide, train separate data models rather than force one to cover everything
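
One simple way to check consistency between your 1–3 labellers is an agreement statistic such as Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, with made-up labels:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Agreement between two labellers, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["table", "table", "footer", "header", "table"]   # labeller 1
b = ["table", "footer", "footer", "header", "table"]  # labeller 2
print(round(cohens_kappa(a, b), 2))  # 0.69; worth a calibration session before scaling
```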

Use this step to define your data quality standards based on the demands of the use-case:

  • Define cleaning standards: how to handle errors, duplicates, missing values
  • Define a consistent structure: text, tables, images with content hierarchy
  • Define normalisation rules: standardised units and naming conventions
  • Define required metadata: source, date, document type, author, context
  • Adapt the depth of refinement and the output format as agreed in Step 3; uniformity across the pipeline is what matters
  • To make all of this easier, use a visual tool with automation features, such as Acodis

Step 5 — Scale up the data pipeline

With a solid base model, begin compounding.

  • Process a batch roughly 2× the size of your labelled sample, review and correct
  • Repeat — each cycle the model improves and batch sizes can grow significantly
  • Use F1 score to track improvements across cycles
  • Use confidence scores to route borderline predictions to human review (both are illustrated in the sketch after this list)
  • Stop scaling when marginal quality gains no longer justify the labeling cost
  • To do this at the scale of 100,000s or millions of documents, use a cost-effective tool such as Acodis
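
A minimal sketch of the F1 tracking and confidence-based routing mentioned above; the threshold is an assumption to be tuned against your labelled baseline:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; track this per extracted field across review cycles."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

REVIEW_THRESHOLD = 0.85  # assumption: tune on the labelled baseline from Step 4

def route(prediction: dict) -> str:
    """Send low-confidence predictions to human review; pass the rest through."""
    return "human_review" if prediction["confidence"] < REVIEW_THRESHOLD else "auto_accept"

print(f1_score(tp=90, fp=5, fn=10))                           # ~0.923 for this cycle
print(route({"field": "batch_number", "confidence": 0.72}))   # human_review
```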

Step 6 — Integrate your data layer and keep it current

This is never a one-time task.

  • Merge all sources into a single, unified data layer
  • Preserve metadata and lineage: record where data came from, how it was transformed, and when — essential for auditability and debugging at scale
  • Keep adding new data on an ongoing basis, according to the same principles
  • Maintain a "known truth" data  set — a fixed sample of known-correct outputs to test against

Conclusion

The data problem in pharma AI is solvable. It is, however, consistently underestimated — and consistently addressed too late, without enough involvement from the subject-matter experts who hold the implicit knowledge that is so essential for high-quality outcomes.

Organisations that build permanent data infrastructure place themselves among the 14% of companies currently ready for fast AI deployment. With a baseline of high-quality data and rigorous practices throughout the organisation, new initiatives become faster and more reliable — leading to new insights and better margins. This is where advantage compounds, with leaders able to operate meaningfully ahead of the pack.

Want to discuss how to prepare your data for AI readiness?

Acodis helps pharmaceutical and life sciences companies build permanent data infrastructure for scalable AI deployment.

Book a free 30-minute consultation →

Prefer email? Reach us at

References

[1] Senderovitz T., Weatherall J., Rochon J. et al. (DISRUPT-DS Roundtable). "Generative AI in pharmaceutical R&D: From large language models to AI agents to regulation." Drug Discovery Today, Vol. 31, Issue 1, January 2026. doi:10.1016/j.drudis.2025.104593

[2] McKinsey & Company. "The State of AI in 2024: Scaling enterprise AI adoption." McKinsey Global Institute Report, 2024.

[3] BCG. "Are You Generating Value from AI? The Widening Gap." BCG 2025 State of AI Survey. bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap

[4] Anaconda. "2021 State of Data Science Report." anaconda.com/state-of-data-science-2021

[5] Wipro Limited. "State of Data4AI 2025: Journeys to help CDAOs scale enterprise-level AI." Technical Report, 2025.

[6] GO FAIR Initiative. FAIR Principles. go-fair.org/fair-principles/

[7] IBM. "AI-Ready Data." ibm.com/think/topics/ai-ready-data

[8] Gartner. "AI-Ready Data." gartner.com/en/articles/ai-ready-data

[9] World Bank. "From Open Data to AI-Ready Data." blogs.worldbank.org/en/opendata/from-open-data-to-ai-ready-data

[10] Qlik. "AI-Ready Data." qlik.com/us/ai-ready-data

[11] DAMA International. DAMA-DMBOK: Data Management Body of Knowledge (2nd ed.). Technics Publications, 2017. The industry-standard reference for data management practices.

[12] Dataversity. "Data Governance Best Practices." dataversity.net — Practitioner-focused guidance on implementing governance programmes.

[13] IQVIA. "Navigating the Data Deluge Through Data Governance." IQVIA White Paper — Pharma-specific; notes that only 31% of pharma companies have a fully implemented data governance strategy.
