zoqabytes.blogg.se - Python gensim text compare online

As a historian, I use topic modelling to identify major themes in text corpora that I am not yet familiar with, and to compare the content of sources on a macro-level. The context provided by topic models can help researchers disambiguate word usage or identify words with similar meanings in large corpora. If “country” and “riding” often co-occur, the text is more likely to cover hunting or other outdoor activities. If “country” and “border” often co-occur in your corpus, the text might be about politics, or more specifically about international relations or migration.

Topic modelling is one of the central methods of Natural Language Processing (NLP), the “automatic manipulation of natural language, like speech and text, by software.” (Jason Brownlee: What Is Natural Language Processing?, in: Deep Learning for Natural Language Processing, 22nd September 2017) In its most basic form, a “topic” modelled by software is a group of words that frequently co-occur in texts, on the assumption that the frequency of co-occurrences marks out certain areas of meaning.

In April 2020, we started a series of case studies to introduce researchers working with historical sources to data analysis and data visualisation with Python. Today’s blog post covers topic modelling with the Python packages Gensim, spaCy, NLTK and scikit-learn.
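To make the idea of co-occurrence more concrete, here is a minimal sketch that counts how often two words appear in the same document. It uses only the Python standard library. The three-sentence toy corpus, the small stop word list and the function name `cooccurrence_counts` are made up for illustration. This is not how Gensim or a full topic model works internally; it only shows the kind of raw signal such models build on.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus, invented for this illustration.
documents = [
    "the country border was closed to migration",
    "riding through the country on a hunting trip",
    "talks on the country border and migration policy",
]

# A tiny, hand-picked stop word list (an assumption; real projects
# would typically take one from NLTK or spaCy).
STOPWORDS = {"the", "was", "to", "on", "a", "and", "through"}


def cooccurrence_counts(docs):
    """Count in how many documents each pair of words appears together."""
    counts = Counter()
    for doc in docs:
        # Lowercase, split on whitespace, drop stop words, and keep each
        # word only once per document.
        tokens = sorted({t for t in doc.lower().split() if t not in STOPWORDS})
        # Sorting makes every pair appear in a fixed (alphabetical) order.
        counts.update(combinations(tokens, 2))
    return counts


pair_counts = cooccurrence_counts(documents)
print(pair_counts[("border", "country")])   # documents containing both words
print(pair_counts[("country", "riding")])
print(pair_counts.most_common(3))
```

In this toy corpus, “country” appears with “border” in more documents than with “riding”, which is the kind of pattern that would pull “country” towards a politics-related topic.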