How a Leading HEOR Company Used Mendel to Eliminate OCR Data Loss, Human Abstraction Effort and HIPAA Compliance Risk from Their Phase IV FDA Submission Process

Karim Galil

UPDATES & NEWS — 2 MIN READ

Using Mendel’s Retina, Read, and Redact solutions, the Customer built a computer simulation of a patient journey, gaining the ability to experiment and adapt quickly within a computer environment.

About the Customer

The Customer is a leading healthcare company providing outcomes measurement and predictive analytics for value-based and personalized healthcare. By leveraging big clinical data, standardized outcomes measures, and artificial intelligence technology, the company delivers a robust approach to improving healthcare outcomes, powered by more precise information.

The quest for finding better data to develop more precise information is what brought them to Mendel.

Crawling, Walking, then Running with Mendel

In the process of FDA submission for a Phase IV study, the Customer had acquired over 90,000 images and over 157,000 RTF files with images from scanned/faxed medical reports (like pathology and cytology reports) as well as imaging data (like ultrasounds) that required digitization.

Mendel Retina: Can we do better than Big Tech OCR?

“No one got fired for hiring IBM,” as the saying goes. The Customer was considering prominent OCR solutions like Google’s OCR as a low-risk choice. The Mendel team knew they could do much better, but the results needed to speak for themselves.

We evaluated Mendel Retina against Google’s OCR system on 430 images containing nearly 150,000 words. The evaluation showed a 25% error rate for Google’s OCR compared to a 6% error rate for Mendel Retina. Table 1 shows detailed results in terms of Precision, Recall, and F-Scores.
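The article does not say how the word-level scores were computed. As a minimal sketch of one common approach, the snippet below compares OCR output with a reference transcription as multisets of words. The function name `word_scores` and the sample strings are assumptions made up for illustration; they are not the study's scoring method or data.

```python
from collections import Counter


def word_scores(reference: str, ocr_output: str) -> dict:
    """Score OCR output against a reference transcription at the word level.

    Assumption: words are compared as a multiset (order ignored), which is
    one common simplification and not necessarily how the study scored.
    """
    ref_words = Counter(reference.lower().split())
    ocr_words = Counter(ocr_output.lower().split())

    # Multiset intersection: each reference word can be matched at most once.
    matched = sum((ref_words & ocr_words).values())
    n_ref = sum(ref_words.values())
    n_ocr = sum(ocr_words.values())

    precision = matched / n_ocr if n_ocr else 0.0
    recall = matched / n_ref if n_ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example text, not taken from the study data.
reference = "squamous cell carcinoma of the cervix"
ocr_output = "squamous cell carcin0ma of the cervix"
scores = word_scores(reference, ocr_output)
print(scores)
```

In this toy case a single misread word ("carcin0ma") lowers both precision and recall, since the wrong word counts against the OCR output and the missing word counts against coverage of the reference.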

The results weren’t a fluke. Unlike conventional Optical Character Recognition (OCR) systems, which read one character at a time, Mendel Retina analyzes the meaning of the full sentence as well as the intent of the document while recognizing words and phrases. The result is higher Precision and Recall than character-level OCR. Translation: less data loss and fewer OCR errors.

Mendel is not limited to reading character by character. We’ve built an engine that can be asked highly specific questions, such as, “What are the modes of transmission of a virus?” and that understands that “no intrauterine infections have been recorded” is a relevant answer.

With a 6% error rate against 25% for Google’s OCR in the evaluation, the Customer chose Mendel Retina as their platform for digitizing over 250,000 documents containing scans and images.

Moving Beyond OCR. A Better Alternative to Human-only Data Abstraction?

When we asked what was next for this digitized data, the Customer’s plan was to use human abstractors to turn OCR’d content into analytics-ready data. This wasn’t a surprise. Traditionally, companies seeking abstraction have had only two choices: off-the-shelf Big Tech abstraction tech that lacks the clinical understanding needed to create rich data, or human abstraction, which is high quality but time and resource intensive. At Mendel, we believed that Mendel Read, an AI custom-built for clinical understanding through years and millions of dollars in R&D, could offer the Customer both the quality of human abstraction and the scale and cost-efficiency of automated solutions.

Recognizing the Customer’s unrelenting focus on data quality, Mendel needed to prove that AI would not be a compromise on data-richness in any way. So for a blind test, we randomly selected medical documents containing around 8,000 concept occurrences representing all concepts (we made sure that the rare concepts are represented, sometimes even by including all instances that did not exist in our AI training data). None of the files in the blind test set exist in the training data. We asked our QC team to manually label these instances and review each other’s work until we were satisfied that we reached a “ground truth.” In the following sections, we present two evaluations by comparing the ground truth to two outputs:

- Intrinsic evaluation, by comparing to Mendel Read output before human review.
- Extrinsic evaluation, by comparing to the output that was submitted to Customer.
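The two comparisons above can be sketched as follows. This is an illustration under stated assumptions, not Mendel's evaluation code: annotations are modeled as sets of hypothetical `(document_id, concept)` pairs, the helper name `prf` is invented here, and the labels are toy values rather than study results.

```python
def prf(ground_truth: set, predicted: set) -> tuple:
    """Return (precision, recall, f1) of predicted labels vs. ground truth."""
    true_pos = len(ground_truth & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels invented for illustration only.
ground_truth = {
    ("doc1", "Squamous Carcinoma of the Cervix"),
    ("doc2", "HPV Positive"),
    ("doc3", "Adenocarcinoma"),
}

# Hypothetical AI output before human review: one concept is wrong.
ai_output = {
    ("doc1", "Squamous Carcinoma of the Cervix"),
    ("doc2", "HPV Positive"),
    ("doc3", "Squamous Carcinoma of the Cervix"),
}

# Hypothetical delivered output after review: the error has been corrected.
delivered_output = set(ground_truth)

intrinsic = prf(ground_truth, ai_output)         # AI alone
extrinsic = prf(ground_truth, delivered_output)  # what the Customer receives
print("intrinsic:", intrinsic)
print("extrinsic:", extrinsic)
```

Scoring both outputs against the same ground truth separates the quality of the AI on its own from the quality of the final deliverable.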

Table 2: Mendel Read evaluation. The first column group lists the extracted endpoints while the second and third column groups show the results in terms of Precision, Recall, and F-Scores.

How did Mendel’s abstraction AI fare so much better than other technologies? Traditional approaches to extracting data points from medical text involve using taxonomies and a search engine in combination with regular expressions for textual pattern matching (e.g., Linguamatics and Averbis). These solutions, while called “clinical NLP” by the companies which offer them, actually fall under traditional Information Retrieval (IR) techniques.
Such approaches return many irrelevant results (false positives) and miss many relevant results (false negatives). In contrast, Mendel Read’s AI reads medical documents for a given patient and extracts “concepts” (data elements that are the study’s endpoints; e.g., “Squamous Carcinoma of the Cervix”). It also provides source document verification by pointing back to the exact location of the concepts (highlighting the source text) in the original de-identified documents.
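To make the false-positive problem concrete, here is a deliberately naive keyword matcher. It is a generic sketch and does not depict how any named product works internally; the pattern and the sample notes are made up for illustration. It also records character offsets, which is one simple way to point back to the location of a match in the source text.

```python
import re

# Hypothetical pattern: match the keyword anywhere, ignoring case.
PATTERN = re.compile(r"carcinoma", re.IGNORECASE)

# Made-up sample notes, not real patient data.
notes = [
    "Biopsy confirms squamous carcinoma of the cervix.",
    "No evidence of carcinoma in the sampled tissue.",
]

hits = []
for note_index, note in enumerate(notes):
    for match in PATTERN.finditer(note):
        # Keep the offsets so each hit can be traced back to its source text.
        hits.append((note_index, match.start(), match.end(), match.group()))

for note_index, start, end, text in hits:
    print(f"note {note_index} [{start}:{end}] -> {text!r}")
```

Both notes produce a hit, yet the second one is a negated finding. A matcher that only sees the keyword reports it as a positive, which is the kind of irrelevant result described above.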

With Mendel’s abstraction AI showing the precision and recall of human-only abstraction, the Customer decided to switch over to Read for all abstraction. With abstracted data available in minutes, the Customer ran 11 iterations of research in 5 days, a speed that would have been unimaginable with human-only abstraction.

Achieving HIPAA Compliance for PHI De-identification with Technology

The Customer was using their own workforce for patient de-identification. As the last piece of the puzzle in moving human effort away from the busy-work of data and towards actual research, the Customer turned to Mendel Redact for de-identification. Mirador Analytics reported 100% Precision and 99.85% Recall (99.93% F1-Score), exceeding the threshold for HIPAA compliance. Results are reported in the table below, copied from Mirador’s statistical verification report (the full report is available upon request).

Needs table from Mirador

Results and Aftermath

Mendel has transformed OCR, abstraction, and redaction for the Customer, eliminating thousands of hours and millions of dollars in spend on human-only efforts.
