Saturday, June 28, 2014

How to Use Stanford's NER and Extract Results


If you are having difficulties, read this re: updating the Java Development Kit.

1. Download the Stanford NLP NER package here.
2. Open the file named ner-gui.command.

3. This launches the program in two windows.




4. In the second window, erase the sample text. If your text is short enough, just cut and paste it in. If not, load it by clicking on file and either selecting it on your computer or entering the URL. The text will appear in the dialog box below. N.B. If you loaded from a URL, you can delete anything extra that got pulled in, like the header from Project Gutenberg. In my experience anything larger than 2MB (or about 500K words) hangs, so chunk if necessary.



5. Click on classifier, then click load from file. You will have to navigate to where you downloaded the NER files. Click on the classifiers subfolder to reveal the choices. You are looking for files that end in .gz. The differences are in the entities extracted, ranging from three (PERSON, ORGANIZATION, LOCATION) to seven (those three plus TIME, MONEY, PERCENT, DATE). Obviously, the more entities extracted, the longer it takes.

The entities to be extracted appear along the right side with their color labels. 


6. Click on the classifier tab again, and click RUN NER.
WAIT, because it hangs if your file is big, and then suddenly, voila, the magic happens (and continues to happen if you have a large file; it has taken as long as an hour to run some of my files).

When finished, you are left with these two windows.




7. Cut and paste the results from the terminal window (the list, not the color-coded results) into TextWrangler. N.B. At this point you will start to notice errors in the entities extracted. You have to decide whether to clean up or proceed with messy data. (Repeat as needed if you chunked, putting all the results into one TextWrangler window.)

8. Click on the Text tab, select Process Lines Containing, enter PERSON, and check the box for Copy to New Document.


9. Voila, there you have it: PERSONs in a nice tidy list. Repeat for as many entities as you extracted.
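
If you would rather script steps 8 and 9 than click through TextWrangler, here is a minimal Python sketch of the same idea: keep only the lines that contain a given entity label. The function name lines_containing and the sample lines are made up for illustration, and the real terminal output may be laid out differently, so treat this as a starting point rather than a finished tool.

```python
def lines_containing(text, label):
    """Return every line of `text` that contains `label` (case-sensitive).

    This mirrors a plain "process lines containing" search: it does not
    parse the NER output, it only filters whole lines.
    """
    return [line for line in text.splitlines() if label in line]


# Hypothetical sample, NOT real NER output; paste your own results instead.
sample = "Elizabeth Bennet PERSON\nLondon LOCATION\nMr. Darcy PERSON\n"

if __name__ == "__main__":
    for line in lines_containing(sample, "PERSON"):
        print(line)
```

Call it once per label (PERSON, LOCATION, and so on) to get one list per entity type. Because it is a plain substring match, a line that merely mentions the word PERSON would also be kept, just as it would be in the text editor.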

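The chunking mentioned in step 4 can also be scripted. The sketch below splits a text into pieces that each stay under a byte limit, breaking only between lines. The name chunk_text and the 2,000,000-byte default are assumptions based on the rough 2MB figure above, not anything provided by the NER package.

```python
def chunk_text(text, max_bytes=2_000_000):
    """Split `text` into chunks of at most `max_bytes` (UTF-8), breaking between lines.

    A single line longer than `max_bytes` is kept whole, so that one
    chunk can exceed the limit.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would pass the limit.
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Read your file into a string, pass it to chunk_text, and save each returned piece as its own file to load into the NER window one at a time. Joining the chunks back together gives the original text unchanged.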
