Monday, December 29, 2014

Advanced Corpus Linguistics for Historians

Two fun-filled hours of corpus linguistics at #aha2015 (@professmoravec) www.michellemoravec.com

I'm delighted to be giving one of the American Historical Association advanced workshops at the Getting Started in Digital History pre-conference sessions on January 2nd. I'll be introducing historians to the wonders of AntConc, as well as Wmatrix and Stanford's Named Entity Recognizer (NER). This is a follow-up to the beginner's workshop I did at AHA 2014.

If you want to follow along from home, we will be working with the first three volumes of the History of Woman Suffrage (download the txt files for volume I, volume II, and volume III). You will also need to download AntConc (link below).

If you want to know why historians should be interested in corpus analysis, check out this amazing conference that just took place in the UK: Exploring Historical Sources with Language Technology: Results and Perspectives. It includes talks by Tony McEnery and Kat Gupta, both of whom were amazing models during my apprenticeship in corpus linguistics with Heather Froehlich. Kat's work on suffrage was the inspiration for my more modest suffrage project. Paul Baker's work is also a fine model for historians and provided much inspiration for my analyses of gender in women's liberation periodicals, which I'm talking about in another session at the AHA.



AntConc (Heather Froehlich, Getting Started with AntConc)
Mechanics
loading files
tweaking settings
iterative loop with concordance
exporting results

Explore your corpus
Machine driven
word list
n-grams
keyness
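
For anyone following along from home who would rather script than click, here is a rough Python sketch of what these three machine-driven views compute. It is not how AntConc does it internally. The tokenizer is a deliberately crude assumption of mine, and log-likelihood is only one of several statistics commonly used for keyness. All function names are my own.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes.
    A crude tokenizer, assumed here only for illustration."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(tokens):
    """Frequency of each word type, like a word list view."""
    return Counter(tokens)


def ngrams(tokens, n=2):
    """Frequency of each run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score for a single word.
    a = frequency in the target corpus, b = frequency in the reference corpus,
    c = total tokens in the target, d = total tokens in the reference."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_target)
    if b > 0:
        score += b * math.log(b / expected_reference)
    return 2 * score
```

To try it, read one of the volume txt files into a string, pass it through `tokenize`, and call `word_list(tokens).most_common(20)` or `ngrams(tokens, 3).most_common(20)`. A word that is equally frequent in both corpora scores zero for keyness; the further its frequency departs from what the corpus sizes would predict, the higher the score.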

User driven 
clusters
collocates
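
The user-driven views all start from a search term you choose. As a minimal sketch (again my own illustration in Python, not AntConc's implementation), a concordance lists each hit with its surrounding words, clusters count the n-grams that contain the term, and collocates count the words that fall within a window around it. The tokenizer and the default window sizes are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes (crude, for illustration)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, width=4):
    """Key word in context: (left context, node, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def clusters(tokens, node, n=2):
    """Count only the n-grams that contain the search term."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(g for g in grams if node in g)


def collocates(tokens, node, span=4):
    """Count the words that appear within `span` tokens either side of the search term."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts
```

Raw collocate counts will be dominated by words like "the" and "of", which is why corpus tools usually rank collocates with a statistic rather than by frequency alone.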

If participants are interested, we will explore the following.
Tagging Your Corpora
NER (my instructions here on how to use)
Upload
Wrangling (Textwrangler)


Mechanics
Concatenate
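
If you have not concatenated files before, here is one possible way to do it in Python: read each txt file in order and write them all into a single file. The function name and the file names in the usage note are hypothetical, and I am assuming the files are UTF-8.

```python
from pathlib import Path


def concatenate(paths, out_path, encoding="utf-8"):
    """Write the contents of each file in `paths`, in order, into one file.
    A blank line is left between files so the last word of one file
    does not run into the first word of the next."""
    with open(out_path, "w", encoding=encoding) as out:
        for path in paths:
            text = Path(path).read_text(encoding=encoding)
            out.write(text.rstrip("\n") + "\n\n")
    return Path(out_path)
```

For example, `concatenate(["volume1.txt", "volume2.txt", "volume3.txt"], "suffrage_all.txt")` would produce one combined file, where the file names stand in for whatever you called your downloads.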

Wmatrix (Dan McIntyre and Brian Walker, Introduction to Wmatrix)
Upload for tagging
Explore
Extract


Quick thoughts on visualization (although really you need to come hear Fred Gibbs' talk, Between Text, Argument, and Data: Interpreting New Visualizations in History, Monday, January 5, 2015, 12:00 PM, Murray Hill Suite A, New York Hilton)

Software 
AntConc by Laurence Anthony: download (the tutorials on YouTube are quite helpful, as is the user group)
Stanford NER has a web interface, but you will need to download it to process a large corpus
Wmatrix by Paul Rayson: corpus analysis and comparison; has a web interface, but you need to pay for full functionality

Resources
Corpus Linguistics at Lancaster University, aka everything you need to know about CL
Corpus MOOC, excellent course, not scheduled again, but you can register to show interest
Corpus Linguistics at UCSB, especially good for explanation of stats by Stefan Th. Gries
BYU Corpus of Historical American English, balanced corpus, 400 million words 1809-2009
Stanford's Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources
Regex library
