There are large technical data warehouses that, while public, are difficult to use because they consist of vast amounts of “raw,” unstructured text. Two of these data warehouses are critically important to medicine and technology.
The Medicine Data Warehouse: MEDLINE & PubMed
MEDLINE is the U.S. National Library of Medicine (NLM) bibliographic database that contains more than 26 million references to journal articles in life sciences with a concentration on biomedicine. MEDLINE is the primary component of PubMed, the search tool made available by the National Center for Biotechnology Information. Data is added seven days a week; over 800,000 entries were added in 2015.
Understanding medical text using artificial intelligence raises a number of unique issues that make it an order of magnitude more difficult than understanding, say, a tweet or a Facebook post. Complications include excessive punctuation, abbreviations, tables of data, misspellings, named gene entities and much more.
The Technology Data Warehouse: USPTO
The United States Patent & Trademark Office (USPTO) also has a massive database of mostly unstructured text and graphics, with about 4 million patents in the system since 1976. This database is a veritable treasure trove of information, but until now it has been largely inaccessible except via crude and quite basic text searches. A significant limitation of simple text searches is that an enormous amount of patent information is contained in the text of diagrams and drawings, which such searches do not reach.
Natural Language Understanding with ARE4™
To aid researchers in studying these vast data warehouses, we’ve developed an extraordinary solution for natural language understanding based on our ARE4 natural language processing system. When this powerful system is deployed, it finds relationships between sentence elements, including subjects, verbs, objects, prepositional phrases and more.
We are different because our technology goes far beyond returning a list of results that may match a particular query. Instead, we create structured data from unstructured data, which can then be integrated with existing structured data to build a customized database.
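To make the idea of turning text into structured data concrete, here is a minimal sketch in Python. It is not ARE4 code and does not show ARE4 output: the relation rows, the `compound` table, and the function names are all hypothetical, chosen only to illustrate how subject-verb-object records extracted from text could sit alongside existing structured data in one queryable database.

```python
import sqlite3

# Hypothetical relations, shaped the way an extraction engine might emit them.
# These rows are invented for illustration; they are not ARE4 output.
relations = [
    ("doc-1", "aspirin", "inhibits", "cyclooxygenase"),
    ("doc-2", "metformin", "lowers", "blood glucose"),
]

# Hypothetical pre-existing structured data about the same entities.
compounds = [
    ("aspirin", "NSAID"),
    ("metformin", "biguanide"),
]


def build_database(relations, compounds):
    """Load extracted relations and existing records into one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relation (doc_id TEXT, subject TEXT, verb TEXT, object TEXT)"
    )
    conn.execute("CREATE TABLE compound (name TEXT PRIMARY KEY, drug_class TEXT)")
    conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?)", relations)
    conn.executemany("INSERT INTO compound VALUES (?, ?)", compounds)
    conn.commit()
    return conn


def relations_by_class(conn, drug_class):
    """Return extracted relations whose subject belongs to the given drug class."""
    rows = conn.execute(
        "SELECT r.subject, r.verb, r.object FROM relation AS r "
        "JOIN compound AS c ON c.name = r.subject "
        "WHERE c.drug_class = ? ORDER BY r.doc_id",
        (drug_class,),
    )
    return rows.fetchall()


conn = build_database(relations, compounds)
print(relations_by_class(conn, "NSAID"))
```

The join is the point of the sketch: once relations are rows rather than sentences, a question such as "what do the texts say about drugs in this class?" becomes an ordinary database query instead of a keyword search.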
We have two specialized versions of our versatile ARE4 engine. We’ve developed separate modules that plug into the ARE4 engine to handle the specialized requirements of each database: one optimized for PubMed and the other targeted to the USPTO system. Each uses customized application programming interfaces (APIs) for direct access to always up-to-date source data.
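For readers who want a feel for what programmatic access to PubMed looks like, the sketch below builds a search request for NCBI's public E-utilities `esearch` endpoint. This is an illustration only: it is not the ARE4 PubMed module, whose interfaces are not described here, and the helper name `pubmed_search_url` and its defaults are assumptions made for the example.

```python
from urllib.parse import urlencode

# NCBI's public E-utilities search endpoint for Entrez databases.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Build an esearch URL that asks PubMed for record IDs matching a query.

    Hypothetical helper for illustration; only the URL is constructed here,
    no network request is made.
    """
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # the query text
        "retmode": "json",      # ask for a JSON response
        "retmax": max_results,  # cap the number of IDs returned
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = pubmed_search_url("BRCA1 breast cancer", max_results=5)
print(url)
```

Fetching the resulting URL returns a list of PubMed identifiers, which is only the starting point: the records still have to be retrieved and their text analyzed before any structured data exists.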