Mendel Team
Gold evaluation

Creating Accurate Regulatory and Reference Data

Within the real-world evidence space, the generally accepted process for creating a regulatory-grade data set is to have two human abstractors work with the same set of documents and to bring in a third reviewer to adjudicate the differences. These datasets also serve a second purpose: they act as a reference standard against which the performance of human abstractors can be measured. Although this remains the industry standard, it is expensive, time-consuming, and difficult to scale.

At Mendel, we extend this framework and layer in AI to build our own reference set, as follows.

We start the same way as the industry standard: two human abstractors process the patient record independently (their averaged output is noted as H1), and a third human (R1) adjudicates the differences. Mendel builds on this standard by adding an abstraction layer that includes AI: we run the patient record through our AI pipeline and have a human audit and correct the AI output (Human+AI). Finally, a second reviewer (R2) adjudicates the differences between the human-only and Human+AI outputs.
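As a rough sketch, this layered workflow can be thought of as a set of abstractions per patient plus pairwise adjudications. The class and field names below are illustrative assumptions, not Mendel's internal implementation.

from dataclasses import dataclass

@dataclass
class Abstraction:
    """One set of variable values extracted from a patient record."""
    source: str    # e.g. "human_1", "human_2", "human_plus_ai"
    values: dict   # variable name -> extracted value

@dataclass
class AdjudicatedRecord:
    """Layers of abstraction and adjudication for a single patient."""
    patient_id: str
    human_1: Abstraction
    human_2: Abstraction
    r1: Abstraction             # adjudication of human_1 vs human_2
    human_plus_ai: Abstraction  # AI output audited and corrected by a human
    r2: Abstraction             # adjudication of r1 vs human_plus_ai

def disagreements(a: Abstraction, b: Abstraction) -> list:
    """Variables on which two abstractions disagree and need adjudication."""
    return [name for name, value in a.values.items() if b.values.get(name) != value]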

Our goals are to ensure quality both internally and for our customers. We use the gold set to measure the performance of our AI models across the test cohort, generate a quality report, and conduct multiple types of validation to ensure that the data is clinically useful, that it has been processed correctly, and that the AI models are not skewed.

During this process, we hypothesized that results from AI + Human collaboration would rival the results generated using the previously described regulatory reference model: two human abstractors adjudicated by a third.

The Evaluation: Does combining human and AI efforts lead to high data quality?

At the end of 2022, we conducted a series of evaluations across therapeutic areas to assess how our models perform. We wanted to explore whether combining human and AI efforts leads to higher data quality than the regulatory standard, and by how much.

In this experiment, we looked at a total of 140 patients across three therapeutic areas with the following sample sizes:

  • Breast - 40 patients
  • NSCLC - 40 patients
  • Colon - 60 patients

We calculated an F1 score to compare the performance of human-only abstraction (a single abstractor, or the average of two abstractors, noted as H1), two human abstractors with adjudication (R1), and the combination of one human and AI (Human+AI).

The F1 score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. We then compared the F1 scores for variables across therapeutic areas. 
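For reference, the F1 calculation itself is the standard formula; the snippet below is generic, not Mendel-specific code.

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: a precision of 0.90 and a recall of 0.85 give an F1 of about 0.874.
print(round(f1_score(0.90, 0.85), 3))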

Understanding the variance across variables

All approaches, whether human-only, adjudicated, or Human + AI abstraction, demonstrate variability in quality across data variable types. When we think about F1 performance, it helps to divide a patient's data variables into four groups:

  1. Variables that are highly complex for humans but easier for AI
    Ex. The variable is difficult to find due to the length of the record
  2. Variables that are highly complex for both humans and AI
    These variables can be difficult to extract because they are subjective
  3. Variables that are easy for both humans and AI to extract
    These variables have a clear interpretation
  4. Variables that are easier for humans but difficult for AI
    These variables may require leaps in reasoning

There is also the special case of compound data variables. These variables depend on multiple correct predictions and are difficult for both humans and AI.
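To make the compound case concrete, a compound variable is typically scored as correct only when every component prediction it depends on is correct; the variable names below are hypothetical.

# Hypothetical component-level correctness for one patient.
components_correct = {
    "stage_at_diagnosis": True,
    "diagnosis_date": True,
    "metastatic_site": False,
}

# The compound variable counts as correct only if all of its components are correct.
compound_correct = all(components_correct.values())
print(compound_correct)  # False: a single wrong component makes the compound variable wrong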

Let’s look at the variables specific to colon cancer.

Below, we compare the F1 scores of the Human+AI approach for colon cancer variables with the F1 scores of the human-only approach. The Human+AI F1 score is shown as the bar graph, and the human-only F1 score is plotted over it.
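A chart like this can be reproduced with a few lines of plotting code; the variable names and F1 values below are placeholders rather than the actual colon cancer results.

import matplotlib.pyplot as plt

# Placeholder variables and F1 scores; substitute the real per-variable results.
variables = ["Stage", "Histology", "Biomarkers", "Surgery"]
human_ai_f1 = [0.95, 0.93, 0.90, 0.88]    # hypothetical Human+AI scores
human_only_f1 = [0.88, 0.85, 0.84, 0.80]  # hypothetical human-only scores

fig, ax = plt.subplots()
ax.bar(variables, human_ai_f1, label="Human+AI")
ax.plot(variables, human_only_f1, marker="o", color="black", label="Human only")
ax.set_ylabel("F1 score")
ax.set_ylim(0, 1)
ax.legend()
plt.tight_layout()
plt.show()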

The Human+AI approach exceeds the quality of a human-only approach for every variable we studied. This is not surprising, since leveraging the AI output gives the human abstractor a significant advantage.

Does this hold up when looking at patient data that has been double abstracted and adjudicated?

Human+AI performs better than a double-abstracted and adjudicated data set

In the chart below we compare the average of the pooled results across breast, lung, and colon cancers against the gold standard reference set. 

The Human+AI approach has an F1 score of 92.2%, while the double-abstracted and adjudicated (R1) set has an F1 score of 87.8%. Both approaches score well against the gold set, but the Human+AI approach's F1 shows an increase of 4.77%.
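The post does not specify how the pooled average is computed; one common approach is to weight each cohort's F1 by its patient count, sketched below with placeholder per-cohort scores.

# Sample sizes from the study; the per-cohort F1 scores are placeholders.
cohorts = {
    "Breast": {"n": 40, "f1": 0.93},
    "NSCLC":  {"n": 40, "f1": 0.91},
    "Colon":  {"n": 60, "f1": 0.92},
}

total_patients = sum(c["n"] for c in cohorts.values())
pooled_f1 = sum(c["n"] * c["f1"] for c in cohorts.values()) / total_patients
print(f"Pooled F1 across {total_patients} patients: {pooled_f1:.3f}")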

In addition, the Human+AI approach requires ⅓ of the effort and cost of using three humans, making it inherently more scalable. In our next post, we will explore time savings.

Interested in learning more about this evaluation and Mendel’s process? Contact hello@mendel.ai.

Mendel is an end-to-end solution that uses the power of a machine and the nuanced understanding of a clinician to structure unstructured patient data at scale.
