Tuesday, May 27, 2014

Maybe it's just me, but I'm pretty sure Google Ngram is a NoGram for scholars

Back prepping for another talk, I finally had time to run down the original publication on Ngram, as suggested by Tor Nordam last May. It looks like Ngram pulls from a subcorpus of Google Books; the "search in Google Books" results at the bottom of the page are not, as is often assumed, the source of the Ngram graph, but rather an alternative search method. As far as I can tell, there is no way to tell what sources contributed to an Ngram result. From a corpus linguistics point of view, that makes the Ngram pretty unhelpful except as a whiz-bang data viz. To get the data behind your Ngram go here.


I'm in the midst of prepping a talk introducing various digital methods to new folks. I figured I'd better tackle the issue of Google Ngram. I prepared the slide below and then tweeted it, but got few responses. I am offering a longer explanation here because, hey, maybe I'm totally wrong that Google Ngram is a No GO. I've been a Google Ngram skeptic for a long time, so I'm always surprised when I see academics using them (but if you are going to, try this interface instead of Google's).

While a few of the errors in Google Ngram are well known, such as incorrect metadata or OCR issues, I repeatedly bump up against an error no one seems to be talking about: what Ngram says is there is not.

Once I dive into the data looking for my n-gram, I'm finding books that do NOT contain the n-gram (as in two words directly adjacent to one another in the order XY). What I'm finding is co-occurrence, that is to say word X and word Y in proximity to one another (or in linguistic terms, I have collocates, not clusters).
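To make the distinction concrete, here is a minimal sketch of how one might check a passage for a true bigram versus a mere co-occurrence. Everything in it is my own illustration: the function names, the 10-token window, the choice to keep hyphenated words as single tokens, and the sample sentences (which I made up; they are not quotes from any of the books discussed here). It is not a description of how Google tokenizes or counts anything.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: hyphenated words such as "counter-cultural" are kept
    as one token. Google's own tokenizer may behave differently.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def adjacent_hits(tokens, x, y):
    """Return positions where x is immediately followed by y (a true bigram)."""
    return [
        i for i in range(len(tokens) - 1)
        if tokens[i] == x and tokens[i + 1] == y
    ]


def cooccurrence_hits(tokens, x, y, window=10):
    """Return (i, j) pairs where x and y sit within `window` tokens of
    each other but do NOT form the adjacent bigram "x y"."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != x:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if tokens[j] == y and j != i + 1:
                hits.append((i, j))
    return hits


# Made-up sample sentences, for illustration only.
bigram_passage = "Cultural feminism emerged in the 1970s."
nearby_passage = "The cultural assumptions that feminism challenged were many."
variant_passage = "She described a counter-cultural feminism."

for passage in (bigram_passage, nearby_passage, variant_passage):
    toks = tokenize(passage)
    print(
        passage,
        "| adjacent:", adjacent_hits(toks, "cultural", "feminism"),
        "| nearby only:", cooccurrence_hits(toks, "cultural", "feminism"),
    )
```

Only the first sample counts as the bigram. The second is the kind of near miss I keep finding in the results, and the third shows why a hyphenated variant should not be counted as the phrase under this tokenization.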

For example, I searched Google Ngram for "cultural feminism" (figure 1). [Note: I actually started by searching for the relative rates of radical feminism and cultural feminism, which I think is what is really seductive to scholars: showing relational diachronic language shifts.]

slide




figure 1 (click to get to Ngram interface)

I then clicked through to the first date range at the bottom of the page (1980-1987). On the second page of results I saw In A Different Voice and The Handmaid's Tale, both of which I was pretty confident did not contain cultural feminism.

When I clicked on In A Different Voice, I saw that the results are co-occurrences of cultural and feminism (figure 2). That is to say, cultural and feminism appear near but NOT adjacent to one another (I also looked at these pages in the text via Amazon).

figure 2

Just to be sure, I searched for cultural feminism using Google's advanced book search, specifying title = In A Different Voice (figure 3).


figure 3
Then I tried searching by the ISBN of the specified edition.
figure 4

The Handmaid's Tale yielded a similar result (figure 5).

figure 5
When I dug further I found that while the edition above was displayed, the etext was from a 2009 edition that includes Harold Bloom's intro, which contains cultural feminism.


I also noticed that variants such as counter-cultural feminism appeared. Generally the further down the results, the less reliable, as in The Heidi Chronicles script (figure 6).

figure 6

As I teach Women and War, I felt pretty certain I'd have noticed a phrase I've long researched in it (figure 7).

figure 7

It was originally the metadata errors that caused me to distrust Ngram, as in Linda Alcoff's 1988 article being cited in a work dated 1977 (figure 8).



figure 8

I haven't hand-checked every result (I do if I use Google Books as evidence in an argument); that is the seductive appeal of the Ngram: it searches lots and lots of books. Bottom line, most of the texts do contain cultural feminism (I can see it bolded in the snippet of text below the title) and are dated correctly. In terms of a very broad, very loose approximation of change over time, the Ngram might suffice, but I'd be wary of using it for anything precise, and I don't think I'd use it in a scholarly presentation without some pretty heavy disclaimers.



Got a great tweet from Jeff Sonstein pointing to a more sophisticated guide to doing ngram searches, which is FAB, but with the OCR & metadata issues I don't know that it is enough.

1 comment:

  1. I'm no expert, but I think the ngram database might be based on something completely different from the Google Books search algorithm. In a normal Google search, you can find a page which doesn't contain the phrase you searched for if other pages link to it in connection with that phrase. Maybe Google Books is the same?

    I'd guess the original paper [1] would have more information?

    1: Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden*. Quantitative Analysis of Culture Using Millions of Digitized Books. Science (Published online ahead of print: 12/16/2010)
