Natural Language Processing (Stanford’s Named Entity Recognizer)

This tutorial was written by Katherine Walden, Digital Liberal Arts Specialist at Grinnell College. Tutorial instructions were co-authored by Sarah Purcell (L.F. Parker Professor of History, Grinnell College) and Papa Ampim-Darko, a student research assistant at Grinnell College.

This tutorial was reviewed by Gina Donovan (Instructional Technologist, Grinnell College).

This tutorial is adapted from Michelle Moravec’s History in the City Stanford NER tutorial and Rachel Buurma’s The Rise of the Novel Robinson Crusoe NER assignment.

Creative Commons License
Natural Language Processing (Stanford’s Named Entity Recognizer) is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Developed in 2006 by a team at Stanford University’s Natural Language Processing Group, the Stanford Named Entity Recognizer (NER) is a Java-based tool for recognizing and extracting named entities in an unstructured textual dataset. In this tutorial, we’ll use Stanford’s NER to identify named entities in The Interesting Narrative of the Life of Olaudah Equiano, Or Gustavus Vassa, The African, an autobiographical slave narrative published in 1789.


Data

1-Navigate to http://vivero.sites.grinnell.edu/files/ in a web browser, download the Equiano_Text.txt file, and save it to your Desktop. Open the file to confirm that it is the plain-text UTF-8 file downloaded from Project Gutenberg.

2-If the file is not already on your Desktop, copy it there. Then right-click the file and open it in Notepad, the native Windows text editing program.

3-Because the text file was downloaded from Project Gutenberg, it contains information at the beginning and end of the file that is not part of the original Equiano text. Delete these lines, re-save the file to your Desktop, and close Notepad.
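The manual cleanup in step 3 can also be scripted. The following is a minimal Python sketch, not part of the original tutorial. It assumes the file marks the start and end of the book with lines beginning "*** START OF" and "*** END OF", as Project Gutenberg files commonly do. The function names and the output file name are hypothetical. Check the markers against your own copy, and still skim the result by hand, because some files keep extra front matter after the start marker.

```python
from pathlib import Path

# Assumed marker prefixes; compare them with the top and bottom of your own file.
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"


def strip_gutenberg_boilerplate(text):
    """Return only the lines between the start and end marker lines."""
    lines = text.splitlines()
    start = 0           # default: keep from the first line
    end = len(lines)    # default: keep through the last line
    for i, line in enumerate(lines):
        if line.startswith(START_MARKER):
            start = i + 1   # the body begins after the start marker
        elif line.startswith(END_MARKER):
            end = i         # the body stops before the end marker
            break
    return "\n".join(lines[start:end]).strip() + "\n"


def clean_file(source, destination):
    """Read a UTF-8 text file, strip the boilerplate, and write a new file."""
    text = Path(source).read_text(encoding="utf-8")
    Path(destination).write_text(strip_gutenberg_boilerplate(text), encoding="utf-8")
```

For example, clean_file("Equiano.txt", "Equiano_clean.txt") would write a cleaned copy next to the original and leave the downloaded file untouched. If neither marker is found, the text is returned unchanged apart from trimmed leading and trailing blank lines.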


Opening Stanford’s NER

4-To install the NER on your own computer, visit the program’s Download page (https://nlp.stanford.edu/software/CRF-NER.html).

5-To open the program, navigate to the folder where the program is installed on your computer and double-click the Windows Batch File named ner-gui.

6-The program will launch two windows: a Command Prompt window, which runs the program through the command-line interface (CLI), and a graphical user interface (GUI) window. Both windows need to remain open for the program to run.
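The Command Prompt window appears because the GUI is a front end to a Java program, and the same program can be started without the GUI. The sketch below is an optional aside. It assembles such a command in Python using the class name and the -loadClassifier, -textFile, and -outputFormat options described in the README that ships with the NER. The function names are hypothetical. The sketch assumes that Java is on your PATH, that the install folder contains stanford-ner.jar, a lib folder, and a classifiers folder, and that 1 GB of memory is enough for your text.

```python
import os
import subprocess
from pathlib import Path
from typing import List


def build_ner_command(ner_dir: Path, classifier: str, text_file: Path) -> List[str]:
    """Assemble the java command line; nothing is executed here."""
    # Classpath: the main jar plus everything in lib/, joined with the
    # platform's path separator (";" on Windows, ":" elsewhere).
    classpath = os.pathsep.join(
        [str(ner_dir / "stanford-ner.jar"), str(ner_dir / "lib" / "*")]
    )
    return [
        "java", "-Xmx1g",                       # assumed memory limit
        "-cp", classpath,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", str(ner_dir / "classifiers" / classifier),
        "-textFile", str(text_file),
        "-outputFormat", "inlineXML",           # tags such as <PERSON>...</PERSON>
    ]


def run_ner(command: List[str], output_file: Path) -> None:
    """Run the command and save its standard output (the tagged text)."""
    result = subprocess.run(command, capture_output=True, check=True)
    # Write the raw bytes so no re-encoding happens on the Python side.
    output_file.write_bytes(result.stdout)
```

To use it, pass the install folder, the name of the classifier file chosen in step 13, and the path to your cleaned text file to build_ner_command, then hand the resulting list to run_ner along with an output path. Printing the list first is a quick way to check the paths before anything runs.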


Loading data into the NER

7-Erase the sample text in the GUI window.

8-Click File->Open File, and select the Equiano.txt file saved to your Desktop.

9-Text will appear in the GUI window once the file has loaded.


Identifying Classifiers in the NER

10-Next, we need to load a classifier, a model trained on sample text that tells the NER which kinds of entities to look for when analyzing our text.

11-Click Classifier->Load CRF from File and navigate to the classifiers folder at C:\Program Files (x86)\Stanford NER\classifiers.

12-Stanford’s NER includes classifier resource files with 3, 4, and 7 entity categories it can search for in your text.

13-Select the english.all.3class.distsim.crf.ser.gz file and click the Open icon. Three entity categories (organization, location, person), with color labels, will now display on the right-hand side of the GUI window.

14-Click the Run NER icon at the bottom of the GUI window to run the program.

15-You can see the results being generated by the NER program in the Command Prompt window.

16-The GUI window now shows the color-coded results of the NER program’s analysis.

17-The GUI gives you options to export your tagged file. Click File->Save Tagged File As. Navigate to your Desktop and save the file as Equiano_Tagged.txt.

18-You could also copy the text generated in the Command Prompt window and paste it into a text file to see a list with just the tagged entities.

19-Right-click the icon for the Equiano_Tagged.txt file and open it in Notepad.

20-You’ll see that the NER has added <LOCATION>, <PERSON>, or <ORGANIZATION> tags around the entities it recognized using the classifier loaded in step 13. We could use a combination of XML and XPath to isolate terms with particular tag categories.
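As one way to isolate tagged terms, here is a minimal Python sketch that pulls entities out of tagged text and counts them by category. It swaps in a regular expression for the XML and XPath approach mentioned in step 20, on the assumption that the saved file is plain text with inline tags and no single root element, which a strict XML parser would reject as-is. The function names are hypothetical, and the sample sentence is invented for illustration. It is not actual NER output.

```python
import re
from collections import Counter

# Matches <PERSON>...</PERSON>, <LOCATION>...</LOCATION>, and
# <ORGANIZATION>...</ORGANIZATION>. The backreference \1 requires the
# closing tag to have the same name as the opening tag.
TAG_PATTERN = re.compile(r"<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\1>", re.DOTALL)


def extract_entities(tagged_text):
    """Return (category, entity) pairs in the order they appear."""
    pairs = []
    for category, entity in TAG_PATTERN.findall(tagged_text):
        # Collapse line breaks and repeated spaces inside an entity.
        pairs.append((category, " ".join(entity.split())))
    return pairs


def count_entities(pairs, category):
    """Count how often each entity of one category occurs."""
    return Counter(entity for cat, entity in pairs if cat == category)


# Invented sample text in the same inline-tag style as the saved file.
sample = (
    "I was born in <LOCATION>Essaka</LOCATION>. "
    "<PERSON>Michael Henry Pascal</PERSON> took me to <LOCATION>England</LOCATION>, "
    "and later I returned to <LOCATION>England</LOCATION>."
)
pairs = extract_entities(sample)
print(count_entities(pairs, "LOCATION").most_common())
```

To apply this to your own results, read Equiano_Tagged.txt into a string (for example with open(..., encoding="utf-8")) and pass it to extract_entities. A frequency list like this makes it easier to spot the tagging errors raised in the reflection questions below, such as one person split across several spellings.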


Reflection questions
  • Scroll down and see how accurate the NER was in identifying and tagging elements in the text. What errors or problems do you notice? How would those errors impact analysis of this text?
  • How did the process of using the NER impact or shape your understanding of the original text?
  • What kinds of historical research questions could a tool like Stanford’s NER help address?
  • What types of questions would not be a good fit for Stanford’s NER?
  • What problems or limitations can you see for using Stanford’s NER as a tool for historical analysis?