Acodis Life Science Document Automation | AI and Machine Learning

10 Pitfalls of Data Preparation for AI Applications in Pharma

Written by Acodis Pharma Team | 17.04.2026

A practical guide for Quality, Regulatory and data teams who want their AI projects to work.

Most pharma AI projects don't fail because the models are not powerful enough. They fail because of the human factor[1], wrong expectations and poor data. According to McKinsey, 67% of projects struggle because of insufficient or low-quality data.[2] It's easier to blame the data. But the real issue is how teams handle the data, not the data itself.

In regulated environments, the stakes are high. A model that performs inconsistently on batch records or CoAs doesn't just slow things down — it creates patient risks and sends teams back to square one.

In a previous article we set out our view on the right way to address this issue. Here, we cover the mistakes we most often see in data preparation and how to avoid them.

In this article

  1. The Human Factor Across the Organisation
  2. Process Pitfalls
  3. Document Quality Pitfalls

Part 1 — The Human Factor Across the Organisation

Pitfall #1 — Expectations that are both too high and too low

"You are getting MI6, you are not getting James Bond"

The expectation problem cuts both ways. Some teams believe AI will figure it out without domain-specific training — feed it the documents and it will understand them. When out-of-the-box performance disappoints, they overcorrect: AI doesn't work for this task. Both conclusions are wrong.

The right model is neither a search engine nor magic. Think of it as a smart new team member who needs proper onboarding and structured feedback. The investment is real — but once made, the model executes the same task at a scale and consistency no human team can match. The question is never "does it work?" It is "is it trained yet?"

Fix: Set expectations based on task complexity and training data quality — not demos or general benchmarks. Define your minimum accuracy threshold upfront, calibrated to the risk level of the use case. Track improvement across training cycles rather than judging on first output. Dropping a model after the first run is like firing a new hire after their first week.

Pitfall #2 — Assuming the model will "just understand" how things work

Ever tried to ask a QA expert from a German oncology site to "just" review a batch from a UK site focused on APIs?

A batch record from Site A and one from Site B may document the same process but use entirely different templates. A human expert bridges the gap quickly. A model trained on Site A's format will not perform well on Site B's out of the box.

The issue is not that variation exists — it is that people routinely underestimate how much variation exists, and assume the model will resolve it without training. Variation that is not represented in the training set becomes an error case.

Fix: Before defining your training set, map the document landscape — templates, source systems, and site variants for each document type. Ensure the training set covers that diversity proportionally. If the scope is too wide for a single model, train separate models.
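As a rough sketch of what "covers that diversity proportionally" can mean in practice: once the document landscape is mapped, a labelling budget can be split across template variants by a largest-remainder allocation, with a floor of one example per variant so rare formats are never absent from the training set. The variant names and counts below are purely illustrative.

```python
def allocate_labels(variant_counts: dict[str, int], budget: int) -> dict[str, int]:
    """Split a labelling budget across template variants in proportion to
    how often each variant appears in the archive, with at least one
    example per variant (largest-remainder rounding)."""
    total = sum(variant_counts.values())
    # Ideal fractional share per variant, floored at one example each.
    ideal = {v: max(1.0, budget * c / total) for v, c in variant_counts.items()}
    alloc = {v: int(x) for v, x in ideal.items()}
    # Hand any leftover budget to the largest fractional remainders.
    leftover = budget - sum(alloc.values())
    for v, _ in sorted(ideal.items(), key=lambda kv: kv[1] - int(kv[1]),
                       reverse=True)[:max(leftover, 0)]:
        alloc[v] += 1
    return alloc
```

If the floor of one example per variant exceeds the budget, that is itself a useful signal: the scope is too wide for a single model.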

Pitfall #3 — Undocumented implicit knowledge

In the TV series ER, paediatric doctor Ross shows that organisations often don't operate according to the theoretical rule book

Every quality team carries institutional knowledge that exists nowhere in writing. The letter "G" in the top-right corner of a form indicates a specific approval tier. Country code 28 means France in your legacy ERP. A temperature written as "36 (34–38)" means target 36°C with ±2°C tolerance. These are all things a model is unlikely to figure out on its own.

Domain experts apply this automatically, without realising they are doing it. A model trained without it produces technically correct outputs that are operationally wrong — invisible until something downstream breaks, or an auditor asks questions.

Fix: Ask: "What would a smart new hire misread here?" Capture implicit rules in writing, and include that knowledge in the training annotations or extraction logic. Update when you find new ones.
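Writing implicit rules down can be as literal as turning them into lookup tables and parsers. A minimal sketch using the two examples from this section (the legacy country code and the temperature-with-tolerance notation); the exact notation rules in your archive will differ.

```python
import re

# Implicit team knowledge made explicit: legacy ERP country codes.
LEGACY_COUNTRY_CODES = {"28": "France"}

# "36 (34–38)" means target 36 with an explicit 34–38 tolerance band.
TEMP_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?)\s*[–-]\s*(\d+(?:\.\d+)?)\)"
)

def parse_temperature(raw: str):
    """Interpret 'target (low–high)' temperature notation; return None
    if the string does not follow the convention."""
    m = TEMP_PATTERN.search(raw)
    if not m:
        return None
    target, low, high = (float(g) for g in m.groups())
    return {"target": target, "low": low, "high": high}
```

The point is not the code itself but that the rule now exists outside anyone's head, can be reviewed, and can be fed into training annotations or extraction logic.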

Part 2 — Process Pitfalls

Pitfall #4 — Skipping the manual work to "save time"

Ever tried to build an IKEA wardrobe by taking a few shortcuts on the prescribed steps?

The most common mistake: automating data pipelines before establishing a verified baseline. The reasoning is understandable — manual review is slow, and AI is supposed to replace it. We all just want to set it up and get the automation rewards at scale. But skipping the initial training is how you end up with a model that processes ten thousand documents confidently and incorrectly.

There is no shortcut to a baseline standard of quality, or to human control over the automation parameters. Before any pipeline can be trained, someone needs to work through a representative sample and build a view of "this is how we go from raw inputs to high-quality, reusable data."

Fix: It's ok to do things that don't scale — at first. Manually process 30 to 100 representative documents, document key decisions and cleaning rules to build a clear standard that can be replicated at scale. Time invested here is recovered many times over.

Pitfall #5 — Labelling chaos

It's all obvious… until you check. Ask three people on your team for the agreed definition of RFT.

What counts as a document "author"? The first drafter, all signatories, the QA reviewer, the originating department? Is "product description" the trade name, the unique ID, or an entire paragraph? These questions look trivial until three people answer them three different ways — and all three answers end up in the training set.

Inconsistent labelling is one of the fastest ways to degrade model performance, and it is nearly invisible in the data. The model trains on contradictory signals and learns nothing reliable.

Fix: Work in a small team and use alternating roles: one person labels, one watches, then switch. This surfaces tacit disagreements before they enter the training set. Document every decision and use it to onboard anyone who joins later. Consistency beats individual choice every time.
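The label-and-watch routine can be backed by a simple agreement check: have two people label the same document independently, then compare field by field. A minimal sketch (the field names are illustrative); for larger teams, a proper inter-annotator metric such as Cohen's kappa is worth the extra effort.

```python
def label_agreement(labels_a: dict, labels_b: dict):
    """Compare two annotators' labels field by field. Returns the
    agreement rate and the fields they disagree on, so disagreements
    can be resolved before they enter the training set."""
    fields = set(labels_a) | set(labels_b)
    disagreements = {
        f: (labels_a.get(f), labels_b.get(f))
        for f in fields
        if labels_a.get(f) != labels_b.get(f)
    }
    rate = 1 - len(disagreements) / len(fields)
    return rate, disagreements
```

Every entry in `disagreements` is a labelling rule that needs to be written down.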

Pitfall #6 — Expecting high performance from minimal data

Ever asked your latest recruit to perform complex tasks after 2 days on the job?

There is a persistent myth that modern LLMs need very little training data to perform on enterprise tasks. For general tasks, that is partly true. For high-precision extraction in a regulated pharma environment — defined fields from batch records, CoAs, or deviation reports — it is not.

Your documents contain specialised vocabulary, company-specific formats, and domain logic no general model has encountered. Fine-tuning for that specificity requires data. Adequate training volume typically ranges from 30 to several hundred labelled examples. Think of data as the treadmill and iron for AI models — there is no fitness without training.

Fix: Treat data gathering as a first-class project activity, not a later dependency. Set a minimum volume threshold and hold to it before evaluation begins.

See how Acodis helps teams get data preparation right

Get a call back →

Part 3 — Document Quality Pitfalls

Pitfall #7 — Encoding inconsistencies

Did you know there can be different types of PDFs? I didn't.

Two documents can look identical to a human but be read differently by a text extraction engine if they use different encoding standards. Metadata fields fail to normalise, matching logic breaks, and the cause is genuinely difficult to diagnose. This is common in older archives where documents originated from different authoring systems or scanning workflows.

Fix: Audit encoding standards during document discovery — before building any extraction logic. Standardise to UTF-8 where possible. Flag exceptions for separate handling; they are a small minority but cause disproportionate noise.

Pitfall #8 — Hidden content: white text and invisible characters

The Rheumatoid Arthritis of documents: you cannot see it, but the pain is there

Some documents contain text formatted in white on a white background. The page looks clean to any human reviewer. An extraction engine reading the content layer directly — rather than rendering visually via OCR — reads those characters as real content.

Two problems follow: data quality (hidden text corrupts extracted field values) and security (white text is a known vector for prompt injection, where adversarial instructions embedded in a document are acted on invisibly by a downstream LLM).

Fix: If you see unexplained output inconsistencies, hidden text is a possible source. For any new document, ensure templates don't include white-on-white characters. For security risks from external documents, run a hidden-text scan in your pre-processing pipeline.
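A minimal sketch of such a scan, assuming text spans extracted with a PDF library such as PyMuPDF, where each span carries its text and an sRGB fill colour; the `color` key and the exact-white threshold are assumptions about your extraction tool, and a production check would also catch near-white and text hidden behind shapes.

```python
WHITE = 0xFFFFFF  # sRGB white as a packed integer

def find_hidden_text(spans: list[dict]) -> list[str]:
    """Flag non-empty text spans rendered in white: invisible to a human
    reviewer, but real content to an engine reading the text layer."""
    return [
        span["text"]
        for span in spans
        if span.get("color") == WHITE and span.get("text", "").strip()
    ]

# With PyMuPDF, spans could be gathered roughly like this (untested sketch):
# import fitz
# page_dict = fitz.open("batch_record.pdf")[0].get_text("dict")
# spans = [s for b in page_dict["blocks"]
#            for l in b.get("lines", []) for s in l["spans"]]
```

Anything this scan returns deserves a manual look before the document enters a training set or an LLM pipeline.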

Pitfall #9 — Layout challenges: rotated scans, watermarks and funny stamps

A radiologist would struggle to reach the right conclusions from upside-down images with a big "confidential" watermark in the middle

Scanned pages may be rotated. Watermarks overlay text. Multi-column layouts get flattened, interleaving content from adjacent columns. Tables spanning multiple pages lose their headers. Each of these can corrupt an entire document's output — a single rotated page in a 40-page batch record can misalign structural cues across the rest of the document.

Fix: Apply layout-aware pre-processing as standard: deskewing for rotated scans, gamma correction for watermarks, and explicit layout detection for complex structures. Assume these issues exist in your archive. Don't wait to find out.
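One recurring trap inside the deskewing step: in a common OpenCV-based recipe, `cv2.minAreaRect` reports the text block's angle in a range of [-90, 0) (older OpenCV versions; 4.5+ changed the convention), and that raw angle must be mapped to an actual rotation before correcting the page. A sketch of that mapping, under the older-convention assumption:

```python
def deskew_rotation(rect_angle: float) -> float:
    """Map the angle reported by cv2.minAreaRect (assuming the pre-4.5
    convention of [-90, 0)) to the rotation, in degrees, that makes the
    text run horizontally."""
    if rect_angle < -45:
        # Rect was measured against the other edge; unwrap it.
        return -(90 + rect_angle)
    return -rect_angle

# Typical surrounding usage (untested sketch, requires OpenCV and NumPy):
# import cv2, numpy as np
# coords = np.column_stack(np.where(binary_image > 0))
# angle = deskew_rotation(cv2.minAreaRect(coords)[-1])
```

Getting this mapping wrong on one page is exactly how a single rotated scan corrupts the structural cues for the rest of a 40-page record.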

At Acodis, we have trained ML models specifically for this.

Pitfall #10 — Nomenclature inconsistencies

If you ask for "chips" in London and in New York, you are going to get very different things. For precise outcomes, we need to operate with the same dictionary!

Does 04-06-2025 refer to the 4th of June or to the 6th of April? A study report co-authored by a US scientist and a European QA manager can contain both conventions in the same document. Add inconsistent unit formats, abbreviated versus full product names, and mixed list structures — and you have a normalisation problem at scale.

The real danger: inconsistent nomenclature doesn't prevent extraction — it produces confident-looking output that is silently wrong. The model picks an interpretation. It won't flag the ambiguity. The reviewer trusting the output may not catch it either.

Fix: Establish a nomenclature standard before building any pipeline — dates, units, product identifiers, list structures. Enforce it in templates going forward. For historical archives, build normalisation logic for each known ambiguity.
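A minimal sketch of what normalisation logic for the date ambiguity above could look like: convert to ISO 8601 under a declared day-first or month-first convention, and separately flag dates where both readings are valid, since those are the ones a reviewer should double-check rather than trust.

```python
import re
from datetime import date

DATE = re.compile(r"\b(\d{2})-(\d{2})-(\d{4})\b")

def normalise_date(raw: str, day_first: bool):
    """Normalise a dd-mm-yyyy / mm-dd-yyyy date to ISO 8601.
    Returns (iso_date, ambiguous): a date is ambiguous when both
    readings are valid calendar dates, so only the declared
    convention decides the interpretation."""
    m = DATE.search(raw)
    if not m:
        return None, False
    a, b, year = (int(g) for g in m.groups())
    day, month = (a, b) if day_first else (b, a)
    # Both fields <= 12 (and different) means both readings are legal.
    ambiguous = a <= 12 and b <= 12 and a != b
    # A value > 12 landing in the month slot will raise here, which is
    # itself a signal that the declared convention is wrong.
    return date(year, month, day).isoformat(), ambiguous
```

The `ambiguous` flag is the important part: it turns silently-wrong output into a reviewable exception list.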

Conclusion

None of these pitfalls are exotic. They are standard features of real-world pharma document environments — and they are entirely manageable if you go in with your eyes open.

The organisations that deploy AI reliably in regulated environments are not the ones with the most advanced models. They are the ones that got the foundation right: solid ground truth, consistent labelling, representative data, clean documents, and domain knowledge captured before automation begins.

The preparation is not the obstacle to AI deployment. It is the deployment. Get it right, and the model becomes a genuine multiplier on your team's expertise. Get it wrong, and even the best model will confidently produce the wrong answer — at scale. If you want to understand what getting the foundation right looks like in practice, start here.

Want to avoid these pitfalls in your next AI project?

Acodis helps pharmaceutical and life sciences companies get data preparation right — from baseline to production pipeline.

Book a free 30-minute consultation →

References

[1] The 7th ISPE Pharma 4.0™ Survey: Digital Transformation cites #1 challenge for innovation as "non-adequate culture", which in our experience really means inadequate people and process management, but it's easier to blame culture.

[2] McKinsey & Company. "The State of AI in 2024: Scaling enterprise AI adoption." McKinsey Global Institute Report, 2024.