Using artificial intelligence to go from raw pathology reports to the data warehouse.

Manually extracting pathology information from electronic medical records (EMR) is time-consuming, expensive and onerous. We’ve developed a better method, using hybrid autodidactic natural language processing and understanding for true artificial intelligence.

Pathology reports contain massive amounts of information critical to research. Most of that data, however, exists as “free text”: unstructured, raw and in a variety of formats (PDF, Word, EMR, tables, etc.). In this form, the data is essentially locked away. Extracting it manually requires highly educated and trained scientists, often MDs themselves, to carry out the lengthy and tedious process of data preparation.

A professional, highly trained abstractor uses a large codebook that standardizes pathology into three sets of findings: biopsy and surgical procedures; benign and malignant results; and associated laterality. Any given report can contain multiple procedures, results and lateralities. The problem quickly becomes complex, which makes large-scale pathology research based on manual abstraction impractical.
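To make the shape of that abstraction task concrete, here is a minimal Python sketch of how one report with several findings might be represented. The class names, field names and example values are hypothetical illustrations, not the actual codebook or ARE4's internal format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Laterality(Enum):
    """Side of the body a finding refers to (illustrative values)."""
    LEFT = "left"
    RIGHT = "right"
    BILATERAL = "bilateral"
    UNSPECIFIED = "unspecified"


@dataclass(frozen=True)
class Finding:
    """One procedure/result/laterality triple abstracted from a report."""
    procedure: str        # biopsy or surgical procedure (hypothetical label)
    result: str           # benign or malignant result (hypothetical label)
    malignant: bool       # True if the result is malignant
    laterality: Laterality


@dataclass
class AbstractedReport:
    """A single pathology report, which may carry several findings."""
    report_id: str
    findings: List[Finding] = field(default_factory=list)

    def has_malignancy(self) -> bool:
        # True if at least one finding in the report is malignant.
        return any(f.malignant for f in self.findings)


# Made-up example: one report, two procedures, two sides, mixed results.
report = AbstractedReport(
    report_id="example-001",
    findings=[
        Finding("core needle biopsy", "invasive ductal carcinoma", True, Laterality.LEFT),
        Finding("excisional biopsy", "fibroadenoma", False, Laterality.RIGHT),
    ],
)
```

Even this toy report needs two separate triples, which hints at why abstraction by hand does not scale to thousands of reports.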

This is exactly the sort of complexity our system is built to tackle. We call our machine-learning natural language processing and understanding system ARE4. It lets researchers run big data analyses on pathology data with an efficiency that was not previously available.

Medical Search Technologies (MST) has trained ARE4 on thousands of annotated pathology reports. We’ve extracted relevant tumor characteristics from unstructured medical text, imported them into a Neo4j graph database, and then used that database to discover and identify cohorts of patients with distinct characteristics.
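As a rough illustration of what cohort identification means once the findings are structured, the sketch below filters a handful of made-up rows in plain Python. It is an in-memory stand-in under stated assumptions, not a Neo4j query and not MST's implementation; the row layout and the `select_cohort` helper are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical structured rows, one per extracted finding (made-up data).
FINDINGS: List[Dict[str, str]] = [
    {"patient_id": "P1", "result": "malignant", "laterality": "left"},
    {"patient_id": "P1", "result": "benign", "laterality": "right"},
    {"patient_id": "P2", "result": "benign", "laterality": "left"},
    {"patient_id": "P3", "result": "malignant", "laterality": "right"},
]


def select_cohort(findings: List[Dict[str, str]], **criteria: str) -> Set[str]:
    """Return IDs of patients with at least one finding matching every criterion."""
    return {
        row["patient_id"]
        for row in findings
        if all(row.get(key) == value for key, value in criteria.items())
    }


# Patients with at least one malignant finding on the left side.
left_malignant = select_cohort(FINDINGS, result="malignant", laterality="left")
```

In a graph database the same question would be expressed as a query over patient and finding nodes rather than a loop over rows. The point is that it can only be asked at all once the free text has been turned into structured fields.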

Data goes from raw, unstructured pathology reports to the research data warehouse efficiently and elegantly.