Monday, December 23, 2013

Corpus Linguistics for Historians



On January 2, 2014, at the American Historical Association preconference workshop Getting Started in Digital History*, I’ll be giving a session, Corpus Linguistics for Historians.  Below I explain why I think historians should take a look at corpus linguistics and how the software I use, AntConc, works.  I have to thank Heather Froehlich for introducing me to this methodology and patiently tutoring me through it over the past 2+ years. [Update: see also Advanced Corpus Linguistics for Historians, given the next year.]

While topic modeling is certainly the better known “distant” or “machine” reading methodology in the United States, corpus linguistics has a long history and is frequently used by scholars in other parts of the world.  While I’ve worked some with topic modeling, I’ve come to see some fairly distinctive benefits in corpus linguistics for historians.

Corpus linguistics is the analysis of language in a body of text (such as primary historical sources).  The texts comprise what is called the “corpus.” Computer-aided corpus linguistics looks for mathematical relationships between words in a body of texts.

Machine reading of sources provides two advantages for the historian:
1.  Machines can deal with a far larger volume of source material than the human brain can, anything from a hundred to hundreds of thousands of texts.
2.  Machines can find patterns and relationships that the human brain cannot.

Corpus linguistics software works with every word in a given corpus.  This is extremely valuable for the “small” words that the human brain tends to slide over when reading [these words are often called “stop words” in machine reading because the programming ignores them before analyzing a corpus].  These words, which have very high frequency in most corpora, are extremely valuable for the historian because they allow us to view complex relationships within corpora that looking only at nouns and verbs does not reveal.

The technical requirements for doing corpus linguistics are relatively low.  The two most popular software packages are free (AntConc and WordSmith).  I use AntConc, created by Laurence Anthony, which has the benefit of working on many operating systems (Mac, Windows, Linux).  It is extremely easy to install.

Creating a corpus is generally the most daunting obstacle that confronts the historian.  AntConc accepts four file formats: txt, xml, htm, and html.  Corpora can be loaded as one file or as multiple files, depending on the source material and the kind of questions you are asking.  Creating a corpus can range from very simple, if you are working with already digitized sources in the public domain, to extremely difficult, if you are working with handwritten archival documents.  Today my examples will come from an ongoing project that explores gender in the six-volume History of Woman Suffrage, which I have in 808 individual text files courtesy of Alexander Street Press. [Update: here is the published version.]
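For readers who want to see what "a corpus as multiple files" looks like outside AntConc, here is a minimal sketch in Python.  It is not part of AntConc; the function name load_corpus and the folder name in the comment are my own hypothetical choices, and it assumes the files are plain text in UTF-8.

```python
from pathlib import Path

def load_corpus(folder, extensions=(".txt",)):
    """Return {filename: text} for every matching file in `folder`."""
    corpus = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            # errors="replace" keeps going if a file has stray bytes
            corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus

# Hypothetical usage: corpus = load_corpus("suffrage_texts")
```

Keeping one entry per file preserves the option of asking questions file by file later, such as which items use a word most densely.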

AntConc performs seven different types of analyses, some of which are quite familiar to historians and others that require some explanation.

Overview of Tools (adapted from Laurence Anthony’s excellent documentation)
1.     Concordance Tool:
This tool shows search results in a 'KWIC' (KeyWord In Context) format. This allows you to search files and see results in one window.
2.     Concordance Plot Tool
This tool shows search results plotted in a 'barcode' format. This allows you to see the positions where search results appear in target texts.  If you loaded your corpus as separate files, you will see results expressed for each file.
3.     File View Tool
This tool shows the text of individual files. This allows you to investigate in more detail the results generated in other tools of AntConc.
4.     Clusters (NGrams):
This Clusters Tool shows clusters [words that appear directly juxtaposed in a corpus based on an N set by the user]. In effect it summarizes the results generated in the Concordance Tool or Concordance Plot Tool. The NGrams Tool, on the other hand, scans the entire corpus for 'N' (e.g. 1 word, 2 words, …) length clusters. This allows you to find common expressions in a corpus.
5.     Collocates:
This tool shows the collocates of a search term. Collocates are words that co-occur in a corpus at a greater frequency than random chance would predict.
6.     Word List:
This tool lists all the words in the corpus and presents them in a list that can be sorted in many ways, including frequency.   
7.     Keyword List:
This tool shows which words are unusually frequent (or infrequent) in the corpus in comparison with the words in a reference corpus. This is an extremely valuable tool for exploring different discourses.  This tool, however, is subject to some specific conditions, the most important of which is that the reference corpus must be bigger than the corpus being analyzed.

The documentation for AntConc is excellent.  User guides and YouTube videos can be found on the AntConc home page.

The following examples will show each of the 7 functions in use, based on my ongoing analysis of the History of Woman Suffrage.  My research questions revolve around the presence of gender in this corpus, even though the word “gender” appears nowhere in it.

I begin with the word list for the 154 files attributed to male authors.  I notice "women" at position 18 with n=1116.  The better you know your corpus, the better you will be at figuring out what is interesting or significant.  This result for "women" is not unexpected, of course, although it is slightly interesting that the plural "women" has a higher frequency than the singular "woman," given nineteenth-century usage.  I decide to go with "women" for our example.  Since it is an extremely obvious marker of gendered language, how it is being used in the corpus may be interesting.
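A word list is conceptually simple, and a rough sketch in Python may help show what the tool is counting.  This is my own illustration, not AntConc's code: the tokenize rule (lowercase, letters only) is an assumption, and the two sample sentences are invented.  Note that no stop words are removed, so the "small" words stay in the list.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed rule: lowercase, keep runs of letters (with internal apostrophes).
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def word_list(texts):
    """Return (word, count) pairs for all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common()

# Invented sample sentences, not taken from the corpus.
sample = ["Women are citizens.", "The women of the state are voters."]
for rank, (word, n) in enumerate(word_list(sample)[:5], start=1):
    print(rank, word, n)
```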

Concordance for "women" shows all 1116 occurrences.
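The KWIC layout itself can be sketched in a few lines of Python.  This is an illustration of the idea only; the function name kwic, the character-based context width, and the sample sentence are my own assumptions rather than anything AntConc specifies.

```python
import re

def kwic(text, term, width=30):
    """Return (left, hit, right) context strings for each whole-word match of `term`."""
    rows = []
    pattern = r"\b" + re.escape(term) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows

# Invented sample sentence, not taken from the corpus.
sample = "The women of Wyoming vote. Married women are taxed, and women are citizens."
for left, hit, right in kwic(sample, "women", width=20):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {hit} | {right}")
```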


Obviously this is too long a list to read through, so I look in the concordance plot and note which files have a high density.  This has the added benefit of showing me the distribution of the word "women" in the files.  This view, however, cannot be exported.  I skim to see which files are densest since, as would be expected, all 154 files contain at least 1 occurrence of "women."  I note that one file has 197 occurrences, so I click on it to open it in "file view."  I see it is a summary by Theodore Stanton, Elizabeth Cady Stanton's son, of his work The Woman Question In Europe.  This is definitely interesting to me, so I make a note of it.  I'd continue to explore these files, noting those that seem particularly dense in the usage of "women" as places to start for close reading.  I'd also note where "women" is not present at very high frequency.  What are those items talking about, then, in a volume about woman suffrage?
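The information behind a concordance plot is just a hit count per file plus the position of each hit.  Here is a minimal sketch in Python under my own assumptions: the corpus is a dictionary of filename to text, positions are expressed as fractions of the file's length, and the file names and sentences are invented.

```python
import re

def plot_data(corpus, term):
    """For each file: (name, hit count, hit positions from 0.0 = start to 1.0 = end)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    rows = []
    for name, text in corpus.items():
        positions = [round(m.start() / len(text), 2) for m in pattern.finditer(text)]
        rows.append((name, len(positions), positions))
    # Files with the most hits first, as when skimming the plot for places to start.
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Invented two-file corpus for illustration.
corpus = {
    "a.txt": "Men spoke. Men voted.",
    "b.txt": "Women spoke and women voted; women are citizens.",
}
for name, hits, positions in plot_data(corpus, "women"):
    print(name, hits, positions)
```

This ranks files by raw count, so long files will tend to rise to the top; dividing the count by file length would separate density from sheer size.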



Collocates, words that appear together in a corpus at a frequency greater than chance, are a far more interesting way to explore a corpus than word frequency.  The top collocate for "women" is "of," and "of women" is also the top cluster (this is not always the case; note that the 2nd and 3rd highest collocates are not the 2nd and 3rd highest clusters).  However, we need to look at the "stat" column to determine if the collocations are statistically significant [see the excellent explanation by Richard Xiao].  The value needs to be 3 or higher.  There are two interesting collocates with stat values higher than 3 and fairly high frequency, "are" and "married."  While I would definitely explore "married women," for our example I'm using "are."
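To make the "stat" column less of a black box, here is a rough sketch in Python of one common textbook form of the mutual information (MI) score, which compares how often a word is observed near the search term with how often chance alone would put it there.  I am assuming the stat in question is MI, and AntConc's exact calculation may differ from this one; the function names, the window size, and the sample sentence are all my own.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed rule: lowercase, keep runs of letters (with internal apostrophes).
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def collocates(tokens, node, span=1, min_freq=1):
    """Rank words found within `span` tokens of `node` by an MI score."""
    total = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Words to the left and right of this occurrence of the node.
            observed.update(tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span])
    rows = []
    for word, o in observed.items():
        if o < min_freq:
            continue
        # Co-occurrences expected by chance in a window of 2 * span words.
        expected = freq[node] * freq[word] * 2 * span / total
        rows.append((word, o, round(math.log2(o / expected), 2)))
    return sorted(rows, key=lambda r: r[2], reverse=True)

# Invented sample sentence, not taken from the corpus.
tokens = tokenize("married women are taxed and women are citizens and men are voters")
for word, o, mi in collocates(tokens, "women"):
    print(word, o, mi)
```

Scores of this kind tend to reward rare words, which is one reason to read the stat value alongside the raw frequency rather than on its own.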



Looking in the cluster view, I see that "women are" occurs 64 times, making it the 4th most frequent cluster of "women _____" ["are women" N=9].



Checking N-grams reveals that, by frequency, "women are" is in the top 100 bigrams in the corpus.
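The difference between N-grams (every sequence in the corpus) and clusters (only the sequences that contain the search term) can be shown in a short Python sketch.  This is my own illustration with an invented sentence; here a "cluster" is simplified to mean an N-gram with the search term in first position.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed rule: lowercase, keep runs of letters (with internal apostrophes).
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def ngrams(tokens, n=2):
    """Count every run of n adjacent words in the token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def clusters(tokens, term, n=2):
    """N-grams that begin with `term`, most frequent first."""
    return [(gram, c) for gram, c in ngrams(tokens, n).most_common() if gram[0] == term]

# Invented sample sentence, not taken from the corpus.
tokens = tokenize("women are citizens and women are voters and women vote")
print(ngrams(tokens).most_common(3))
print(clusters(tokens, "women"))
```

Setting n=3 gives trigrams in the same way.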


The concordance view and concordance plot of "women are" reveal how it is used and where it is used.  Again it has high distribution, appearing in 142 of the 154 files.  The densest file is again Stanton's (which is the longest file by a male author), so I note that and continue exploring.  I'm down to numbers that make it possible to read "women are" in context in all occurrences relatively quickly, and I could correlate author metadata for the items to determine if specific usages of "women are" occur in speeches versus, say, private letters.

Finally, I compare the usage of "women" in the files by male authors with its usage in the 294 items authored by females.  Keyness reveals that "women" is a statistically significant negative keyword, which means male authors used it less frequently than female authors.  This is definitely interesting, and I would then probably move to exploring all of the above in the female-authored items in my corpus.
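For those curious about what sits behind a keyness score, here is a minimal sketch in Python of the log-likelihood statistic often used for this kind of comparison.  I am assuming log-likelihood here; it may not match the exact settings behind my AntConc results, and the counts in the example are invented for illustration, not taken from my corpus.  The function name keyness and the convention of giving under-used words a negative sign are my own choices.

```python
import math

def keyness(freq_target, size_target, freq_ref, size_ref):
    """Signed log-likelihood for one word; negative means under-used in the target."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Counts we would expect if the word were equally common in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_target, expected_target), (freq_ref, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    ll *= 2
    under_used = freq_target / size_target < freq_ref / size_ref
    return -ll if under_used else ll

# Invented counts: a word seen 100 times in a 100,000-word target corpus
# and 300 times in a 100,000-word reference corpus.
print(round(keyness(100, 100_000, 300, 100_000), 2))
```

An absolute value above 3.84 corresponds to p < 0.05 on the usual chi-square table with one degree of freedom, so a large negative score is what a negative keyword looks like in numbers.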




Note that the process outlined relies on a back and forth between machine reading of the texts and close readings of the individual items.  I've keyed in on Stanton's essay already as an important male-authored text in the corpus that deals with gender.  This I probably could have done without the aid of a computer, based on its size and title alone.  However, while "women are" obviously doesn't provide all the answers I'd want about how male and female authors express gender in the six-volume History of Woman Suffrage, it gives me a place to start (79 occurrences of "women are" in male-authored files) and leads me to other places to look further (in the female corpus, following negative keywords).

In a full corpus linguistics analysis I'd repeat the above in many different ways.  Exploring how male and female authors use "women" and then "men," for example, gets at deeper aspects of gender in the corpus.  I would look for gendered pronouns, which is actually where I started the project with his/her, and terms of address (Mrs/Miss/Mr).  I'd explore trigrams of "women": "women and men," "women not men," and "women or men."  I could separate out the items by the very well known suffragists like Stanton and Anthony, and compare them to the 172 items by authors who appear only once in the corpus to see if differences emerged there.  I could parse the corpus by location of authors to get a view of potential regional variations.  All the while I'd be comparing patterns and shifts over time and space, which is what historians do.  I'd just be doing it with words I'd have ignored and with a volume of sources I never could have used.

* Per @sethdenbo the workshop is full, but you may ask to be placed on the waiting list.
