Corpus Analysis with spaCy
- Authors
Say you have a big collection of texts. Maybe you’ve gathered speeches from the French Revolution, compiled a bunch of Amazon product reviews, or unearthed a collection of diary entries written during the First World War. In any of these cases, computational analysis can be a good way to complement close reading of your corpus… but where should you start?
One possible way to begin is with spaCy, an industrial-strength library for Natural Language Processing (NLP) in Python. spaCy is capable of processing large corpora, generating linguistic annotations including part-of-speech tags and named entities, as well as preparing texts for further machine classification. This lesson is a ‘spaCy 101’ of sorts, a primer for researchers who are new to spaCy and want to learn how it can be used for corpus analysis. It may also be useful for those who are curious about natural language processing tools in general, and how they can help us to answer humanities research questions.
Reviewed by:
- Maria Antoniak
- William Mattingly
Learning outcomes
After completing this lesson, you will be able to:
- Upload a corpus of texts to a platform for Python analysis (using Google Colaboratory)
- Use spaCy to enrich the corpus through tokenization, lemmatization, part-of-speech tagging, dependency parsing and chunking, and named entity recognition
- Conduct frequency analyses using part-of-speech tags and named entities
- Download an enriched dataset for use in future NLP analyses
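To give a rough sense of what the enrichment and frequency steps above might look like, here is a minimal Python sketch. It is not the lesson's own code. The function names (`annotate`, `pos_counts`, `entity_label_counts`) and the dictionary layout are hypothetical, and the choice of spaCy's small English model `en_core_web_sm` is an assumption, not something the lesson summary specifies.

```python
from collections import Counter


def annotate(texts, model_name="en_core_web_sm"):
    """Run spaCy over an iterable of strings and keep a few annotations.

    Hypothetical helper: the model name is an assumption, and the model
    must be installed separately before this function can run.
    """
    import spacy  # imported lazily so the counting helpers work without spaCy

    nlp = spacy.load(model_name)
    rows = []
    for doc in nlp.pipe(texts):
        rows.append({
            # (surface form, lemma, coarse part-of-speech tag) per token
            "tokens": [(t.text, t.lemma_, t.pos_) for t in doc],
            # base noun phrases found via the dependency parse
            "noun_chunks": [chunk.text for chunk in doc.noun_chunks],
            # (entity text, entity label) per named entity
            "entities": [(ent.text, ent.label_) for ent in doc.ents],
        })
    return rows


def pos_counts(rows):
    """Count part-of-speech tags across all annotated texts."""
    return Counter(pos for row in rows for _, _, pos in row["tokens"])


def entity_label_counts(rows):
    """Count named-entity labels across all annotated texts."""
    return Counter(label for row in rows for _, label in row["entities"])
```

With the model installed (for example via `python -m spacy download en_core_web_sm`), calling `annotate` on a list of strings and passing the result to `pos_counts` or `entity_label_counts` would give simple frequency tables. The exact tags and entities returned depend on the model and its version.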
Check out this lesson on Programming Historian's website