Saturday, July 5, 2014

Distant Reading with Semantic Tagging

More fun with Visualizing Gender in the History of Woman Suffrage; start there if you want the full history of the project. I'm looking for gender in the History of Woman Suffrage, an 80-year, multi-volume scrapbook of sorts.

Endless suffrage



McEnery and Hardie (2012:1–2) define corpus linguistics as ‘a group of  methods for studying language… dealing with some set of machine-readable  texts…or corpus which is usually of a size which defies analysis by hand and  eye alone within any reasonable timeframe…corpora are invariably exploited using tools which allow users to search through them rapidly and reliably.’

Today I spent more time exploring gender in the History of Woman Suffrage using Wmatrix and semantic tagging.  I first examined the S2 semantic tags (People).



relative frequency of person: female (S2.1) and person male (S2.2) in HWS


tokens with People (S2) tags from HWS

Not surprisingly, female people appear most frequently. While it is interesting to see how female people were referred to, it is even more interesting to compare this usage to the Corpus of Historical American English (COHA). We already know that the normal pattern of more male references than female ones will be flipped due to the subject of the History of Woman Suffrage, but what else can this comparison tell us about suffrage discourse?



frequencies per million of man and men, woman and women in COHA and HWS

While it is not unexpected that the HWS would diverge from COHA in frequencies of female terms given the subject matter, there are other patterns that also mark it as distinct. Volume 2 is where the frequencies of woman and women are closest to each other in the HWS, and where the HWS is most similar to COHA. Volume 2 also contains the point at which women and woman flip frequencies. Women predominates for the remainder of the HWS. What does that mean? Are women more present in the text as a collective, and therefore plural noun, than as individuals? Does the content contain less about woman as a universal category (as in womankind) and more about actual women? This is particularly intriguing as COHA shows that woman is the more frequent term in all historical periods, coming close to parity with women only in the 1920s (when women achieve suffrage), after which the gap widens, and again in the 1980s (following the resurgence of feminist activism in the 1970s).

The shift from singular to plural for female persons is paralleled in the comparison of man and men in HWS. Man exhibits an even more linear decline, with only a slight leveling between volumes 3 and 4. This again suggests that perhaps there is a movement away from a universal rights discourse (see below for rights analysis) and, with it, away from these universal categories. Concordance reading would be necessary to determine whether the usages are as universals or as singular persons, as semantic tagging cannot pick that out. The relative steadiness of men is also intriguing as it suggests that discussion of male persons remains at more or less the same level over all 6 volumes.
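A quick aside on the "per million" numbers in the chart above, since HWS volumes and COHA decades are very different sizes: raw counts only become comparable once they are scaled by corpus size. Here is a minimal sketch of that normalization in Python. The function name and the counts are made up for illustration; they are not figures from the HWS or COHA, and this is not how Wmatrix or COHA compute things internally.

```python
def per_million(count, corpus_size):
    """Scale a raw count to occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Hypothetical numbers for one volume, NOT real HWS counts.
volume_tokens = 400_000
raw_counts = {"woman": 1_200, "women": 1_000}

for word, count in raw_counts.items():
    print(word, per_million(count, volume_tokens))
```

Once both corpora are on the same per-million scale, a volume of a few hundred thousand words can sit on the same axis as a COHA decade of many millions.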



The decline of all four terms in volumes 5 and 6 points to the distinctive nature of these two volumes. While volumes 1-4 covered chronological periods, volumes 5 and 6 both cover 1900-1920 but divide topically. Volume 5 focuses on the national suffrage movement through attainment of the 19th amendment. Volume 6 largely focuses on state suffrage campaigns as well as documentation of the international suffrage movement.

The decrease in the four terms got me wondering what was being covered in these two volumes, if not man, men, woman, and women. I wondered whether these volumes contained more individual personal names as the HWS shifted from argument about women and the vote to documenting the people who contributed to securing the franchise.

pronouns (Z8) and personal names (Z1)
by relative frequency for 6 volumes HWS


However, looking at the tags for personal names and pronouns, it appears that volume 5 has a peak for pronouns, but not personal names. In fact, personal pronouns reach their highest peak since volume 2 in volume 5, while personal names are lower than in the preceding two volumes.

Looking at what pronouns can tell us about gender in the HWS is something I'll take up using a different methodology, corpus analysis.  










Day 3


To do some more distant reading, today I looked in Wmatrix at multi-word expressions (MWE), recurrent combinations of words tagged by the software. Wmatrix also assigns multi-word expressions to semantic fields and calculates their relative frequency in the text. This makes MWE a fascinating way to look at how authors in the HWS wrote about subjects. By looking at Tag G, for government, I hoped to zoom in on how authors expressed demands for women’s rights. For reasons I won't go into here, X suffrage (i.e. woman suffrage, female suffrage) is not pulled as part of the multi-word expressions, so I'm left with the abstract ways demands were expressed.

Below are the relative frequencies by volume (X axis, left to right) of the semantic tags G1 and G2.





Next I narrowed to  MWE that appear in at least 2 of the 6 volumes of the History of Woman Suffrage
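For anyone wanting to redo this narrowing step outside Wmatrix, here is a rough sketch in Python. It assumes you have already exported the MWE relative frequencies per volume into a dictionary; the data structure, the function name, and the numbers are all my own invention for illustration, not a Wmatrix export format.

```python
from collections import defaultdict


def mwes_in_at_least(freqs_by_volume, min_volumes=2):
    """Return MWEs with a nonzero frequency in at least min_volumes volumes."""
    volumes_seen = defaultdict(set)
    for volume, freqs in freqs_by_volume.items():
        for mwe, freq in freqs.items():
            if freq > 0:
                volumes_seen[mwe].add(volume)
    return sorted(
        mwe for mwe, vols in volumes_seen.items() if len(vols) >= min_volumes
    )


# Hypothetical relative frequencies, NOT real HWS values.
freqs = {
    "vol1": {"human rights": 0.02, "civil rights": 0.01},
    "vol2": {"human rights": 0.01, "common law": 0.03},
    "vol3": {"equal rights": 0.02, "common law": 0.01, "house of commons": 0.01},
}

kept = mwes_in_at_least(freqs, min_volumes=2)
print(kept)
```

In this toy data only the expressions that show up in two different volumes survive the filter; anything confined to a single volume drops out.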





Human rights (green) and civil rights (red) peak in volumes 1 and 2, and fade to a relative frequency of 0 after that. Equal rights (yellow) peaks in volumes 3 and 4 and fades out by volume 6, while common law (orange) peaks in volume 2 and levels off in 3 and 4. These MWE reflect what historians refer to as a discourse of rights, broadly construed from women’s assertions of their rights to hold property to demands for political participation. Both human rights and civil rights ring quite modern due to their association with post-1945 social and political movements. They were, as indicated by their relative frequencies here, a not-uncommon way of “demanding” “the broad principles of human rights” and “securing” or “protecting” “the civil rights of women.” The continuing use of these MWE by subsequent movements reveals the ongoing ideological ties that run back to the Enlightenment, even as rights discourses change with historic circumstances.

By volume 3, the blues and purples that shade what historians have described as a political rhetoric become visible. Political parties (purple), general election (pink), and various suffrage organizations (blues) overrun the remaining common law and equal rights. The introduction of house of commons (magenta) reflects the internationalization of the movement for woman suffrage. The decrease in rights discourse in favor of political language comes from the well-documented shift among some activists to a single-issue movement focused on the attainment of woman suffrage.

However, looking at other tags in Wmatrix hints at the ways in which suffrage interwove with various other causes, but that analysis will have to wait for another day.








Day 2
This summer at the Berks I was lucky to get Ellen DuBois, noted historian of woman suffrage, to sit down with me for an hour to look at the preliminary results of Visualizing Gender in the History of Woman Suffrage. Ellen gave me lots of good questions to run with, including one about who is discussed in the HWS (as opposed to authoring items). Today I stumbled across this FAB-looking new book, The Myth of Seneca Falls: Memory and the Women's Suffrage Movement, 1848-1898 (Gender and American Culture) by Lisa Tetrault, in which she argues that Stanton and Anthony in the postbellum period “invented” women’s history as they crafted an origin narrative of their movement at Seneca Falls, and I decided to run with it. Tetrault argues that the fixing of Seneca Falls as the “beginning” positions Stanton and Anthony as “the” movement. Her history runs through the creation of the HWS, and in a great chapter, she carefully traces how SBA and ECS just as carefully crafted their historical account, positioning Seneca Falls at the center.

I got curious about appearances of Seneca Falls in the six-volume History of Woman Suffrage. Turns out there aren't many, and most are in volume 1, which covers the year the convention occurred, and in volume 4, which included the U.S. Centennial, a moment that Tetrault connects to suffragists' history making.

Seneca_falls in Wmatrix raw frequency by volume


Since Seneca Falls appears so infrequently, I began to think about what Ellen DuBois had asked me about who appears in HWS outside of authorship. She suggested looking at John Stuart Mill and Mary Wollstonecraft. Working through various software, I explored “named entities,” i.e. people’s names, in each of the six volumes of the History of Woman Suffrage. This turns out to be a surprisingly difficult problem, with two sub-issues. The first has to do with naming conventions. Elizabeth Cady Stanton might be referred to as Elizabeth Cady Stanton, Mrs. Stanton, Elizabeth Stanton, E.C. Stanton, or ECS.

Using AntConc, a concordancing program, I can search the most general referent, Stanton. In the entire History of Woman Suffrage, Stanton appears 1160 times. However, I know that both Stanton’s son and daughter were involved in the movement, and indeed scrolling through I can see both Harriot and Theodore. Pulling out clusters, which is to say 2 word phrases with Stanton in them, I find Marguerite Berry Stanton, Harriet Brown Stanton, and several others I’ve never heard of who share ECS’ surname.
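To make the naming problem concrete, here is a small sketch in Python of counting only the surface forms that plausibly point to ECS, rather than every Stanton. This is not what AntConc or Wmatrix do internally; the variant list, the function name, and the sample sentence are my own assumptions for illustration, and a pattern like Mrs. Stanton will still catch other Mrs. Stantons.

```python
import re
from collections import Counter

# Hypothetical variant list for one person, fullest forms first so the
# regex prefers the longest name when several alternatives could match.
VARIANTS = [
    r"Elizabeth Cady Stanton",
    r"Elizabeth Stanton",
    r"E\. ?C\. Stanton",
    r"Mrs\.? Stanton",
    r"E\. ?C\. ?S\.",
]
# (?!\w) stops a match from ending in the middle of a longer word.
PATTERN = re.compile("(?:" + "|".join(VARIANTS) + r")(?!\w)")


def count_variants(text):
    """Count each surface form of the name that appears in text."""
    return Counter(PATTERN.findall(text))


# Made-up sample text, not a quotation from the HWS.
sample = (
    "Mrs. Stanton spoke first. Elizabeth Cady Stanton and Theodore Stanton "
    "wrote letters. Mrs Stanton replied, signed E.C.S."
)
counts = count_variants(sample)
print(counts)
print("total ECS-style mentions:", sum(counts.values()))
```

Theodore Stanton is skipped because no variant matches him, which is the point: a bare search for Stanton would have counted him too.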

 

Software that pulls out names works better, but still has its difficulties. Different software recognizes the many variations in names differently. For example, NER run with Stanford’s NLP did not pull out Mrs. Stanton, and that is a huge problem as it is a very common form of address (the AntConc concordance shows 612 Mrs Stanton; although not all are ECS, most are). Looking at semantic tagging done with Wmatrix under “personal names,” Mrs. Stanton appears alongside the other variations, with the exception of E.C.S. (which appears in AntConc 55 times). John Stuart Mill, with his noun as surname, confounded the software. AntConc shows 76 mentions by his full name, yet Wmatrix did not pull out one. It found 52 instances of John Stuart, 4 Mr. John Stuart (NB I can add him to the lexicon to correct for this), Mr. Mill 21 times ....  Looking for Mary Wollstonecraft proves complicated as well, Harriet Martineau

Fortunately raw frequencies tell us very little, and what I'm really after is understanding who, relative to the other players, gets the top billing in HWS. Is it Stanton and Anthony? Looking under the Z1 semantic tag (personal names) in Wmatrix, I grabbed the top name for male and female and plotted them by relative frequencies for each volume.



This led to some surprises for me. Elizabeth Cady Stanton came out on top NONE of the times. Mott beat her out in Volume 1, SBA took top spot in Vol 2-4, and Carrie Chapman Catt in Vol 5-6. That Catt reached frequencies almost as high as Anthony’s peak was startling to me as well. Buhle and Buhle have characterized the narrative of the HWS as decreasingly personal, yet Catt in Vols 5-6 would seem to contravene that. I didn't even know who Mr Hoar was until I clicked through to the concordance (a pro-suffrage Senator from Massachusetts). His appearance, along with Mr. President, Wilson, and Roosevelt, points to the importance of male politicians in suffrage discourse, since men would have to vote to give women the right to vote. Wendell Phillips is far better known as an abolitionist, which highlights one of Tetrault's points: the origin story of Seneca Falls obscures other origin points, such as abolition in the 1830s.

It is sort of understandable that Anthony comes out on top most often. She was the most visible public advocate of woman suffrage who traveled tirelessly to campaign for the cause. Yet Stanton, even tethered to the home, was the pivotal thinker. I compared their relative frequencies over the volumes (by their most frequent appearance, not combined variations).



I’m not quite sure what is more startling to me in this graph, that ECS drops so low by vol 6 that her relative freq is 0, or the status SBA and ECS achieve by vol 4 and 5 (which they did not edit).  This brings me back to Tetrault and myth making in the HWS.  While intellectually historians have traced the influence of Wollstonecraft, Mill, and Martineau on American suffragists, the HWS does not reflect that, at least not in mentions of their names.



Day 1
top 49 semtags in History of Woman Suffrage, freq 8000, excluding the grammatical bin (corpus 2.7M, 2.5M tagged)





click baity data viz.
History of Woman Suffrage in 50(ish) Circles





Today I spent some time "distant reading" the History of Woman Suffrage with Wmatrix. Wmatrix is amazing software written by Paul Rayson that tags bodies of texts (corpora) for part of speech and with semantic tags. I used the latter today.

You can try it for free here on small amounts of text.  It is well worth the $80ish US I paid for the full version though.  Tutorial is here, although this assumes you have basic knowledge of corpus linguistics.  The interface is quite easy.  Upload a file and let it go.  Once Wmatrix does its thing, various relationships can be explored.


The UCREL semantic analysis system is a framework for undertaking the automatic semantic analysis of text. 

Wmatrix tags texts using a semantic lexicon.   It has 92% accuracy. There are 21 broad semantic tags. 



I tagged the entire corpus using the standard lexicon. This has some definite limitations for my subject matter. Firstly, it doesn’t take into account historical shifts in language usage, so some things are “mistagged,” and it doesn’t tag some important words like Mrs. or Miss. Additionally, the tagging is a little inconsistent when it comes to gender, my larger question about the History of Woman Suffrage. For example, there is a tag for unfeminine but not for unmasculine. I could fix all of the above, and will, once I learn how to customize the tool. Today though was all about the distant reading. I started by exploring tag S, Social Actions, States And Processes. Under Tag S are 68 subtags in 9 groups:


S1      Social Actions, States And Processes
S2      People
S3      Relationship
S4      Kin
S5      Groups and affiliation
S6      Obligation and necessity
S7      Power relationship
S8      Helping/hindering
S9      Religion and the supernatural





For the six-volume History of Woman Suffrage, Wmatrix tagged 340 types (190,534 total types) with _S tags. Below I visualized words with frequency greater than 100.

circle packing, in Raw, n=size, color semtag, label token



same info as above, but by Semtag

I found this all really interesting. I would have expected “no power” to be more present than "in power." Looking at the words that contributed to this semantic tag, I realized that I was interpreting the results incorrectly. Many of the words under this semantic tag referred to the suffrage movement itself (i.e. headquarters) while I was thinking more globally about overall power in society (although those words, like Senate, are present as well).


"Belonging to a group" also refers to the suffrage movement itself, which isn't unexpected, but I had to dig into the semtags "allowed" and "strong obligation or necessity" to figure out those tags. I felt a little silly when I realized that right and rights fall under this category, as do ratified and ratification. "Allowed" here comes from aspects of the texts that are about women being allowed to vote. S6, then, I conjectured, "strong obligation or necessity," must be the imperative rhetoric of women pressing their claim to the vote, and indeed I find up top in this category several modal verbs, should and must, in addition to the expected rhetoric, but also duty, obligation, promise, and patriotism.


Feeling in the swing of things, and having read some more of Paul Baker's really lovely book Using Corpora to Analyze Gender, I decided to have a go at the WHOLE History of Woman Suffrage semtags. YIKES, so that turns out to be a lovely 515 tags ranging from n 1-913,757. I worked with the top 49 semtags, which is those with an N greater than 8000. The tags for pronouns and personal names don't surprise me, as the History of Woman Suffrage is a highly personalized account of the movement, featuring memoirs and letters, as well as speeches and accounts of public meetings.




Of the other semtags, "A3 Existing" is quite interesting to me as I connect it in my mind to the sort of ontological justification for woman suffrage. I wonder if "entire maximum" N5.1+ might have some temporal aspect as the movement narrowed to suffrage? I'm also curious about what precisely in the text accounts for "Cause effect connection" A2.2.

I need to go back to the word freq semtag file, but for now it has been a long Saturday just doing this! Even with the steep learning curve I can see that semantic tagging will work much better as a form of distant reading for me than others that I've tried. I'm also curious to run each volume separately to see what the semtags might reveal about shifting discourse in the History of Woman Suffrage as the volumes were published from 1881 to 1922.


One of the functions of Wmatrix is comparing tag clouds for corpora.  Below is Vol 1 as compared to entire 6 vol HWS.  Mousing over not only gives frequency but also log likelihood, a stat useful for determining if a difference between the two corpora is coincidental.  Clicking on a tag reveals the concordance lines with the word that led to that tagging.   For example, clicking on general ethics I can see that references to temperance are a major chunk of this semantic tag, along with morals and principles.
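For the curious, here is a minimal sketch in Python of the two-corpus log likelihood statistic as it is usually described in corpus linguistics: compare the observed frequency of an item in each corpus with the frequency you would expect if it were spread evenly across both. I am assuming this is the same calculation behind the mouse-over number, but I have not checked how Wmatrix configures it. The function name and the counts are made up for illustration.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-corpus log likelihood for one item (word or semantic tag).

    freq1, freq2: observed counts of the item in each corpus.
    size1, size2: total tokens in each corpus.
    """
    total = freq1 + freq2
    # Expected counts if the item were spread evenly across both corpora.
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll


# Hypothetical counts, NOT real HWS figures.
score = log_likelihood(freq1=150, size1=1_000, freq2=100, size2=1_000)
print(round(score, 2))
```

A score above 3.84 corresponds to p < 0.05 and above 6.63 to p < 0.01 (chi-square with one degree of freedom), so larger values mean the difference is less likely to be down to chance. The statistic does not say which corpus has more of the item; for that you still compare the relative frequencies.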





As always my work is inspired by @heatherfro and @mixosaurus, especially The taint of militancy is not upon them: representations of suffragists, suffragettes and direct action in The Times, 1908-1914

NB I'm a historian borrowing a methodology from another discipline.  I'm very lucky that the people who work in corpus linguistics often help me out by tweeting advice or answering my questions.  However, far more than the normal academic disclaimer, I must note here, that any and all errors are mine and mine alone.  The lovely linguists whose work I follow have not reviewed this blog post!


