Quick Success Data Science

Learn graphical text analysis with NLTK

Lee Vaughan

A sepia-colored photo of Sherlock Holmes examining a book with a magnifying glass.
Sherlock Holmes (by DALL-E3)

The Natural Language Toolkit (NLTK) ships with a fun feature called a dispersion plot that lets you show where a word occurs in a text. More specifically, it plots each occurrence of a word against its word offset, which is the number of words from the beginning of the corpus.

Here’s an example dispersion plot for the main characters in the Sherlock Holmes novel, The Hound of the Baskervilles:

A dispersion plot that uses vertical blue tick marks to indicate the occurrence of a word in a text.
Dispersion plot for major characters in “The Hound of the Baskervilles” (by author)

The vertical blue tick marks represent the locations of the target words in the text. Each row covers the corpus from beginning to end.
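To make the tick marks concrete, here's a minimal sketch of the data behind them: the word offsets of each target word. It uses only the standard library, and the regex tokenizer, the `word_offsets` helper name, and the sample sentence are assumptions made for illustration, not part of NLTK or of the project code. In NLTK itself, `Text.dispersion_plot()` draws one tick per matching offset.

```python
import re


def word_offsets(text, targets):
    """Map each target word to the token offsets where it occurs."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:  # case-sensitive match
            offsets[token].append(position)
    return offsets


# Made-up sample sentence, not a quote from the novel.
sample = "Holmes looked at Watson. Watson shrugged, and Holmes smiled."
print(word_offsets(sample, ["Holmes", "Watson"]))
# {'Holmes': [0, 7], 'Watson': [3, 4]}
```

Each list corresponds to one row of the plot, with the offsets giving the x-positions of that row's tick marks.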

If you’re familiar with The Hound of the Baskervilles — and I won’t spoil it if you’re not — then you’ll appreciate the sparse occurrence of Holmes in the middle, the late return of Mortimer, and the overlap of Barrymore, Selden, and the hound.

Dispersion plots also have more practical applications. For example, imagine you’re a data scientist working with paralegals on a criminal case involving insider trading. To find out whether the accused contacted board members just before making the illegal trades, you can load the accused’s subpoenaed emails as one continuous string and generate a dispersion plot to see whether the names appear close together.
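Reading juxtapositions off a plot is a visual check, but the same idea can be roughed out numerically. The sketch below is one possible approach, not an established method: it lists pairs of offsets where two names fall within a chosen number of words of each other. The `nearby_mentions` helper, the `window` parameter, and the names and text in the example are all hypothetical.

```python
import re


def nearby_mentions(text, name_a, name_b, window=50):
    """Return (offset_a, offset_b) pairs that lie within `window` words."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    pos_a = [i for i, token in enumerate(tokens) if token == name_a]
    pos_b = [i for i, token in enumerate(tokens) if token == name_b]
    return [(a, b) for a in pos_a for b in pos_b if abs(a - b) <= window]


# Made-up text: two close mentions, then a distant one 100 words later.
emails = "Alice emailed Bob on Monday. " + "filler " * 100 + "Bob sold shares."
print(nearby_mentions(emails, "Alice", "Bob", window=10))
# [(0, 2)]
```

The second mention of Bob sits at offset 105, well outside the 10-word window, so only the first pair is reported.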

Social scientists analyze dispersion plots to study language trends related to specific topics. By tracking the occurrence of terms like “climate change” or “gun control” in news articles, they can gain insights into which issues society prioritized over specific timeframes.
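One wrinkle with terms like “climate change” is that they span two tokens, so a word-by-word match won't find them. Here's a hedged sketch of one way to locate multi-word terms by comparing sliding windows of tokens. The `phrase_offsets` helper, the lowercasing step, and the sample text are assumptions made for illustration.

```python
import re


def phrase_offsets(text, phrase):
    """Return the token offsets where a multi-word phrase begins."""
    # Lowercase both sides so the match ignores case (an assumption).
    tokens = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    size = len(target)
    return [
        start
        for start in range(len(tokens) - size + 1)
        if tokens[start:start + size] == target
    ]


# Made-up sample text, not taken from a real news article.
news = (
    "Climate change dominated the news. "
    "Later coverage of climate policy ignored climate change."
)
print(phrase_offsets(news, "climate change"))
# [0, 11]
```

The occurrence of “climate” at offset 8 is skipped because it's followed by “policy” rather than “change.”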

In this Quick Success Data Science project, we’ll write the Python code that generated The Hound of the Baskervilles dispersion plot shown previously.

We’ll use a copy of the novel stored in this Gist. It originally came from Project Gutenberg, a great source for public domain literature. As recommended for natural language processing, I’ve stripped it of…
