Understand how topic modeling can predict the essence of documents and detect the most relevant keywords inside them.
Assume you’re a business consultant, and your lucrative contract is at stake. Your boss asks you to present your insights on 20 competitors’ annual reports to the board of directors. Showtime is in less than an hour. Luckily, you don’t need to read those reports.
Did you know that machines can uncover the topics of documents? Machines can read medical documents, legal papers, and business reports in mere seconds and spit out their essence, not just as an executive summary but also as a dynamic visual aid. Yes, that’s possible. In this blog post, you’ll learn how.
Keyword extraction is the automated process of recognizing the most frequent or unique keywords and expressions in a text. It is a natural language processing (NLP) technique that transforms unstructured data into structured data.
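To make the idea concrete, here is a minimal sketch of frequency-based keyword extraction in Python. It only illustrates the "most frequent keywords" part of the definition above. The stop-word list, the sample text, and the function name extract_keywords are assumptions made up for this example, not part of any particular tool.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list; real tools use far larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "our", "on", "we"}

def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop-words with their counts."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Invented sample text, for illustration only.
sample = (
    "The vaccine trial met its goals. The vaccine is safe, "
    "and revenue from the vaccine grew. Revenue is up."
)
keywords = extract_keywords(sample, top_n=2)
print(keywords)
```

The unstructured sentence goes in, and a structured list of (keyword, count) pairs comes out. Notice that this counter knows nothing about what the words mean.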
Topic modeling goes one step further. It is a machine learning technique that tracks how often a word is used in one document compared with a collection of others. Topic modeling algorithms can detect keywords that can then be searched, classified, and archived. Cool, right? But while statistical methods can guess the topic of some documents, they ignore the meaning of words, otherwise known as semantics.
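One common statistical way to compare a word's use in one document against several others is TF-IDF (term frequency times inverse document frequency). The sketch below is a plain-Python illustration of that weighting, not a full topic model and not a description of any specific product. The three sample documents are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return one {word: score} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # Frequent in this document, rare elsewhere -> high score.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Invented sample documents, for illustration only.
docs = [
    "the vaccine trial shows the vaccine stops the virus",
    "the firewall blocked the virus",
    "the board approved the dividend",
]
scores = tf_idf(docs)
top_word = max(scores[0], key=scores[0].get)
print(top_word)
```

A word such as "the" appears in every document, so its score drops to zero, while "vaccine" stands out in the first document. The score still treats "virus" as one word, whether it infects people or computers.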
We humans intuitively recognize the context of words, but how can computers be programmed to know context? A quick fix is to classify the document first. Once a document classifier knows that the context of an electronic health record is medicine, it can infer that the word “terminal” refers to the final stage of a disease, not a computer terminal. A simple document classifier can do the job, but only if the context of your text doesn’t change.
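The quick fix described above can be sketched in a few lines: classify the whole document first, then pick the word sense that belongs to that field. Everything here is hypothetical, including the cue words in FIELD_CUES, the sense table SENSES, and the function names. A real classifier would be trained on data rather than rely on a hand-written word list.

```python
import re
from collections import Counter

# Hypothetical cue words per field; a real classifier would be trained on data.
FIELD_CUES = {
    "medicine": {"patient", "diagnosis", "disease", "oncology", "treatment"},
    "ict": {"server", "command", "shell", "network", "login"},
}

# Hypothetical sense inventory for ambiguous words.
SENSES = {
    "terminal": {
        "medicine": "final stage of a disease",
        "ict": "computer terminal",
    },
}

def classify_document(text):
    """Pick the field whose cue words occur most often in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter()
    for field, cues in FIELD_CUES.items():
        hits[field] = sum(1 for w in words if w in cues)
    return hits.most_common(1)[0][0]

def resolve_sense(word, text):
    """Choose the sense of `word` that matches the document's field."""
    return SENSES[word][classify_document(text)]

record = "The patient received a terminal diagnosis after oncology treatment."
manual = "Open a terminal and run the login command on the server."
print(resolve_sense("terminal", record))
print(resolve_sense("terminal", manual))
```

Because the sketch assigns one field to the entire document, it shares the weakness mentioned above: it breaks down as soon as a single text mixes several contexts.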
Some words have the same meaning irrespective of context. For example, “COVID-19” always refers to the disease caused by a coronavirus first identified in humans in 2019. Other words can hold two or more meanings. The financial term “equity” is an ownership share in a company, while the social sciences word “equity” means fairness.
If you classified Pfizer’s 2021 annual report (10-K), you would soon realize it’s not just a business document. The report contains 26% legal terms, 18% business, 15% healthcare, 11% accounting, 6% finance, 5% HR, and a paltry 4% medicine. Multiple homonyms may show up on your screen. If you classified the report as purely medical, a computer “virus” would be classified as a pathogen, and a “cell” in a spreadsheet would be mistaken for the smallest structural unit of an organism. The result would be a faulty analysis, commonly known as “Garbage in. Garbage out.”
Most keyword extraction and topic modeling tools are no more than glorified word counters. Yet generic English words such as “bull,” “bear,” “long,” and “short” can refer to market sentiment in the investing world. Unsophisticated methods count every instance of the word “long” without distinguishing between its meanings in context. In the 2021 Pfizer annual report, the word “long” might mean a lengthy distance or time, or an investment that profits when a security rises in price. And what if the report discussed the effects of “long COVID,” reported promising “long-term” durability of the company’s COVID-19 vaccine, or explained that its latest COVID-19 treatment pill had a “long way to go”?
Besides, you may think classifying acronyms is an easy enough task: you could simply create a lookup table. However, some acronyms refer to multiple key terms. “IP” can mean “Internet Protocol” or “Intellectual Property.” Again, your topic modeling algorithm must know the context. If it fails to recognize it, it may confuse the two: “Our specialized lawyers help you protect your Internet Protocol.”
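Here is a rough sketch of how a lookup table could be extended with context, so that “IP” is expanded differently depending on the words around it. The table ACRONYMS, its cue words, and the function expand_acronym are all hypothetical and exist only for this example. A production system would rely on a trained model rather than on word overlap.

```python
import re

# Hypothetical lookup table: each expansion comes with cue words that
# suggest it. The cue lists are illustrative, not from any real taxonomy.
ACRONYMS = {
    "IP": {
        "Internet Protocol": {"network", "address", "router", "packet"},
        "Intellectual Property": {"lawyers", "patent", "trademark", "protect"},
    },
}

def expand_acronym(acronym, sentence):
    """Return the expansion whose cue words overlap most with the sentence.

    Returns None when no cue word matches, rather than guessing.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    best, best_overlap = None, 0
    for expansion, cues in ACRONYMS[acronym].items():
        overlap = len(words & cues)
        if overlap > best_overlap:
            best, best_overlap = expansion, overlap
    return best

legal = "Our specialized lawyers help you protect your IP."
network = "Each router assigns an IP address to the device."
print(expand_acronym("IP", legal))
print(expand_acronym("IP", network))
```

Returning None when nothing matches is a deliberate choice in this sketch: an unresolved acronym is easier to deal with than a confident wrong expansion.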
For this reason, GILO developed an AI model that recognizes the context in which words are used. It was trained on research papers, business reports, legal documents, and web articles covering 25 fields such as medicine, finance, and ICT.
GILO uses topic modeling together with taxonomies to help visualize documents. Our document classifier scans through a document, classifying keywords (with 80% to 90% accuracy) and counting each instance of those words. “Virus” can be classified under “Medicine/Infectious Diseases/Pathogens” or “ICT/Networks/Hacking.”
Our taxonomy holds 22,000 keywords. Each keyword is classified according to its field, subfield, and subject. In addition, 1,600 acronyms are tagged with their different fields. Our text visualization tool then makes analytics faster, better, and easier. Your keywords “come alive” in an interactive graph. You can drill down from field to subfield and from subfield to subject; you can see the percentage of keywords within each field; you can discover the type of document being classified, e.g., academic, business, legal, or web; finally, you’ll know the top 10 subjects being discussed in your document.
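To picture how a field/subfield/subject taxonomy turns keyword counts into percentages and a top-subjects list, consider the simplified sketch below. It is not GILO's implementation. The four taxonomy entries, the keyword counts, and the function summarize are invented for illustration; only the two paths “Medicine/Infectious Diseases/Pathogens” and “ICT/Networks/Hacking” echo the examples in the text.

```python
from collections import Counter

# Hypothetical taxonomy entries: keyword -> (field, subfield, subject).
TAXONOMY = {
    "pathogen": ("Medicine", "Infectious Diseases", "Pathogens"),
    "vaccine": ("Medicine", "Infectious Diseases", "Immunization"),
    "malware": ("ICT", "Networks", "Hacking"),
    "dividend": ("Finance", "Equities", "Shareholder Returns"),
}

def summarize(keyword_counts, top_n=10):
    """Return (percentage of keyword hits per field, top_n subjects)."""
    field_counts = Counter()
    subject_counts = Counter()
    for keyword, count in keyword_counts.items():
        if keyword not in TAXONOMY:
            continue  # words outside the taxonomy are ignored
        field, subfield, subject = TAXONOMY[keyword]
        field_counts[field] += count
        subject_counts[(field, subfield, subject)] += count
    total = sum(field_counts.values())
    field_share = {
        field: round(100 * count / total, 1)
        for field, count in field_counts.items()
    } if total else {}
    return field_share, subject_counts.most_common(top_n)

# Invented keyword counts for a single document.
counts = {"vaccine": 6, "pathogen": 2, "malware": 1, "dividend": 1, "unicorn": 5}
field_share, top_subjects = summarize(counts)
print(field_share)
print(top_subjects[0])
```

Keeping the full (field, subfield, subject) path as the key is what makes drilling down possible: the same counts can be grouped by the first element, the first two, or all three.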
Request a demo of our “Garbage in. Logic out.” apps if you want to transform your unstructured text into structured data.
© 2023 GILO Technologies, All rights reserved.