Monday, January 27, 2014

MOOCing in Public

Although the MOOC may be the ruination of higher education (not for a long time though, given the sparse 5% completion rate), I could not pass up the opportunity to learn from Tony McEnery, so I enrolled in his Corpus Linguistics MOOC through FutureLearn.

While I'm primarily interested in getting better at corpus linguistics, as a prof who teaches online but who has never been a student online, I figured a MOOC would give me some empathy for the students. Therefore, in this ongoing blog entry I'll reflect both on the MOOC experience and on what I'm learning about teaching online.

If reading this encourages you to learn more about corpus linguistics, the course is being repeated in Fall of 2014! An in-person summer school version is also offered, the UCREL Summer School 2014 (http://ucrel.lancs.ac.uk/), with an all-star cast of profs!

Week 8
I'm not even going to lie, I totally bailed on the last week of Corpus MOOC and finished two days after the official end of the course. I had too many big deadlines during week 8 and no time to get ahead (note to self: must encourage students to really ponder their schedules at START of course). The assignment was pretty awesome though: gendered uses of profanity in a sample corpus of spoken British English with speakers annotated for age, sex, location, and social class.

I chose to look at cunt, a word that I associate with a far less negative usage in the UK than in the US. The corpus created by Tony McEnery also has several different ways of measuring statistically significant differences between the sexes (the total number of speakers was too small to find any other significant differences in the additional identity factors). Fascinatingly the UK corpus showed no difference between male and female speaker uses! I feel like this would very much NOT be the case in a similar corpus of spoken American English, but there isn't a similar corpus (honestly, from my novice viewpoint Europe has far more corpora, although there are some amazing US corpora, such as those at BYU done by Mark Davies).

Week 7
MOOC is hard. No seriously! The last few weeks I’ve been that 11th hour student (note to self be kinder to students in this situation).

Although the topic of this week, learner corpora, didn’t seem to have any immediate applicability to my work, I persevered because I committed to finishing the MOOC, but I could also just hear Heather telling me to do it because I’d learn something anyway. And of course I did. The rhetorical analysis done by Lenko-Szymanska of essays written by Polish English students and American students is the sort of thing I do with my feminist corpora, using keywords and their context to understand different strategies for arguing a point of view. Similarly, the section on discourse analysis, and the use of “hedging” devices like modal verbs (such as "may", "might", or "could"), is going to be quite crucial to identifying feminist discourses (note to self explain to students why they should do stuff).

The assignment is to figure out how to teach modal verbs, but really I’m never going to do this and my corpus isn’t POS tagged sooo I’m being a bad student and skipping to a modified version of the second part since I really need to get work done on my corpus.
Look at the different modal verbs in English in any corpus available to you (BNC, Brown, LOB or any English corpus in CQPWeb). Check, when you look at a particular corpus, how modal verbs are annotated in that corpus – this will help you find them swiftly. What order would you introduce them in if you were teaching these verbs to a non-native speaker, and why? Look at the collocates of the verbs. Would knowing about the collocations change how you taught the words or viewed the importance of those words? Feel free to make any other observations about the behaviour of modal verbs in your data as you see fit.

First I have to cheat and make sure I really know what all the modal verbs are (can/could, may/might, must, will/would, and shall/should, sometimes need and ought). Quick google gets Paul Baker’s book section on the subject, which points me to the issue of relative strength in modality, i.e. must implies something stronger than might, should something stronger than may, etc. Baker also points out the diachronic usages of these words, i.e. shall and ought are less common now. Intriguingly he also hints at gendered connotations, for example how often do female pronouns occur in conjunction with prescriptive modal verbs as compared to gender neutral or male? I think I have my research question.

I’m going to use my suffrage corpus, because I need that analyzed by June (prior work  here). Research question will be two-fold – is there a difference in male and female uses of modal verbs, and do they collocate with gendered pronouns in these sub-corpora?

The answer to my first question is a resounding yes, as indicated by the table below.

word     keyness  LL
will     female   26.42
should   female   18.489
shall    female   15.15
may      female   12.719
would    female   10.273
could    male     6.3
might    male     5.499
must     -        -
can      -        -

This is not too surprising, as the women in the suffrage corpus are most definitely arguing a very specific POV and thus usage of modal verbs is high, but what about the relative strength of the modal verbs? Will, should, and shall are stronger than may, would, and could. The female keyness of these stronger modal verbs seems to suggest that they are crafting a very strong discourse using a very prescriptive sort of rhetoric. I will definitely be coming back to explore the usages here more carefully. Note: I’ve also realized how bad it is NOT to have a POS tagged corpus. Will is also a noun, and one that is an important part of the debates over married women’s property rights. I have no way of knowing if the “will” I’m finding here as very key is the verb or the noun. For that reason I’m omitting it from my collocation analysis for now.
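For anyone playing along at home, the LL column in a keyness table like this one is a log-likelihood score. Below is a minimal Python sketch of the standard two-corpus log-likelihood calculation. The function names and the counts in the usage line are hypothetical (they are not taken from the suffrage corpus), and AntConc's own implementation may differ in its details.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood for one word.

    freq_a / freq_b: how often the word occurs in corpus A / corpus B.
    size_a / size_b: total tokens in corpus A / corpus B.
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected counts if the word were spread evenly across both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def key_in(freq_a, size_a, freq_b, size_b):
    """Say which corpus uses the word relatively more often."""
    rate_a = freq_a / size_a
    rate_b = freq_b / size_b
    if rate_a == rate_b:
        return "neither"
    return "A" if rate_a > rate_b else "B"


# Hypothetical counts: "shall" 150 times in 200,000 tokens (A)
# versus 60 times in 150,000 tokens (B).
ll = log_likelihood(150, 200_000, 60, 150_000)
# With one degree of freedom, LL above 3.84 corresponds to p < 0.05.
print(round(ll, 2), key_in(150, 200_000, 60, 150_000), ll > 3.84)
```

The score itself has no direction, which is why the table needs a separate keyness column saying which sub-corpus the word is key in.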

On to collocates. I’m working here with some results I already have in keyness between the male and female corpora. I’m curious if the gendered differences I noted are also present in collocates for modal verbs. I leave the span as 5L 5R and set minimum frequency to 5, using MI > 3 (note I’d love a setting that allows a numerical value threshold for MI).

female corpus
word should (n) MI shall (n) MI could (n) MI might (n) MI
he 19 3.67 24 4.03 21 4.31 15 4.67
her 51 3.73 35 3.15 35 3.68 17 3.48
his 17 3.26 18 3.3 17 3.7
man 15 3.5 15 3.46 13 3.79
men 33 4.27 14 3 13 3.42
she 41 4.17 22 3.24 49 4.93 14 3.96
woman 64 3.89 40 3.68 16 2.89
women 38 3.89 28 2.6 27 3.14 17 3.32
husband 10 4.24


It is hard to even say what is happening. There are more collocates present in the female corpus, and this collocation of shall and husband is extremely interesting as it is the only significant collocate for husband. The collocational strengths, though, aren't very high.

male corpus
word should (n) MI shall (n) MI could (n) MI might (n) MI
he 7 3.01 5 2.53 13 5.17 5 4.73
her 10 2.46 19 3.39 9 3.58 5 3.67
his 6 2.93 7 3.16 6 4.2
man 14 3.86 6 2.64
men 6 2.65 10 3.4
she 15 3.72 21 4.21 5 3.4 5 4.34
woman 33 3.95 27 3.66 14 3.9
women 30 3.73 21 3.22 7 2.9


Some of the results for the male corpus are intriguing, such as he and could, and she and shall, but mostly I'm noting how there are fewer collocates in general.

I also start to worry about the frequency issue, since for the male corpus I’m dipping down to n > 5. Convinced by this chapter, I decided to re-run using the T score rather than MI, since the T score reflects frequency as well. I keep span and frequency the same, with T > 2 as the cutoff (roughly P < .05).
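To make the difference between the two measures concrete, here is a minimal Python sketch of MI and T score in their common textbook forms. It assumes the expected co-occurrence count is scaled by the window size (10 for a 5L 5R span). That scaling, the function names, and the numbers in the usage lines are all assumptions for illustration; AntConc's exact formulas may not match.

```python
import math


def expected_cooccurrence(node_freq, collocate_freq, corpus_size, span=10):
    """Co-occurrences expected by chance within the window (assumed scaling)."""
    return node_freq * collocate_freq * span / corpus_size


def mutual_information(observed, node_freq, collocate_freq, corpus_size, span=10):
    """MI: log2 of observed over expected. Rewards exclusivity, ignores frequency."""
    expected = expected_cooccurrence(node_freq, collocate_freq, corpus_size, span)
    return math.log2(observed / expected)


def t_score(observed, node_freq, collocate_freq, corpus_size, span=10):
    """T score: (observed - expected) / sqrt(observed). Rewards frequent pairs."""
    expected = expected_cooccurrence(node_freq, collocate_freq, corpus_size, span)
    return (observed - expected) / math.sqrt(observed)


# Hypothetical counts: node word 100 times, collocate 200 times,
# 16 co-occurrences, in a corpus of 100,000 tokens.
print(mutual_information(16, 100, 200, 100_000))
print(t_score(16, 100, 200, 100_000))

# Same ratio of observed to expected, but a quarter of the data:
# MI stays the same while the T score drops.
print(mutual_information(4, 50, 100, 100_000))
print(t_score(4, 50, 100, 100_000))
```

The second pair of calls is the point of switching measures: MI only looks at the ratio, so a pair seen a handful of times can score as high as a pair seen dozens of times, whereas the T score shrinks as the observed count shrinks.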

female corpus
word should (n) T shall (n) T could (n) T might (n) T
he 19 4.01 25 4.69 21 4.35 15 3.72
her 51 6.6 35 5.25 35 5.45 17 3.75
his 17 3.69 18 3.81 17 3.81
man 15 3.54 15 3.52 13 3.34
men 33 5.44 14 3.27 13 3.27
she 41 6.04 22 4.19 49 6.77 14 3.5
woman 38 5.67 40 5.83 16 3.46
women 64 7.46 28 4.45 17 4.6 17 3.71
husband 10 2.99




male corpus
word should (n) T shall (n) T could (n) T might (n) T
he 7 2.31 5 1.84 13 3.5 5 2.15
her 10 2.58 19 3.94 9 2.75 5 2.06
his 6 2.12 7 2.35 6 2.31
man 14 3.48 6 2.05
men 6 2.06 10 2.86 3 1.5
she 15 3.57 0 5 2.02 5 2.12
woman 33 5.37 27 4.78 14 3.5 4 1.77
women 30 5.06 21 4.09 7 2.29
husband 0 3 1.67
him 2 1.24 0


I now need to plot all of these to see what comes up where T and MI scores intersect but I'm out of time for this week!

Week 6

This week I was the worst student ever due to a series of illnesses (mine and family members’), a snow day, and overly ambitious writing goals (note to self stress to students that some weeks will be like this).

Our assignment

Using any corpus (e.g. the BNC, any English corpus from CQPWeb or Brown/LOB using Antconc) look for words ending ‘ly’. You may choose the method you use to search for this pattern.

How many are adverbs? Look at these words and try to categorise them into the types of adverbs discussed by Hanks. How easy is this process? Discuss with others how you searched for the pattern – did everybody do it the same way? Do not spend too long on this task – analyse one or two pages of results only.


At first I attempted to skip straight to the activity, but that wasn’t a good idea (note to self explain to students minimal amount that must be read first). I quickly skimmed all the transcripts. I think I love corpus pattern analysis because PATTERNS. Once again I couldn’t help but skip the suggested corpus to jump into my own research, since I just got a new corpus to play with (over 3000 files from feminist periodicals, 1.4m types, 6.1m tokens).

Oh why

So first thing: THIS IS HUGE. I have a lot of results (over 100K). Clearly this has been anticipated, as my instructions tell me to only look at a couple of pages of results (note to self this sort of instruction v. helpful).


I cave and try regex, and I'm thankful Tony has a cheat sheet on the web for THAT because I hate regex (note to self try to have as many cheat sheets as possible to aid students who want to experiment). Lord. OK, so skimming through I’m of course interested in historically because HISTORIAN, and I know I need one of the other class, so I select angrily (24) because FEMINISTS, and happily (52), giving the lie to the idea that feminists are a bunch of angry people.


I see lots of interesting results

historically 226
sadly 22
angrily 24
happily 52
luckily 22
fortunately 356


Here is where my corpora not being sorted into subcorpora (which I’m in the process of getting) is killing me, because I’m DYING TO KNOW where fortunately is coming from. I peek at the concordance plot and it is fairly evenly distributed across 299 files.


However, just sort of randomly looking at what is interesting really is not how this is meant to work (note to self really stress to students that playing is OK but following directions sometimes necessary). I suck it up and go back to the word list function, which I sort using the word ending function. I export, open as a txt file, and start scrolling, so long that my hand actually begins to cramp, then cut, paste, and over into excel. I see that I’ve got 2869 words that end in ly. Mercifully a few clicks gets me frequency order and now I can see what I’m working with.

I STILL end up with over 2800 words. However keeping only those with N > 99 got me down to 184, 10 of them with N > 1000. EVEN STILL, after sorting adverbs from adjectives (only), words that can be both (early, likely), and random nouns (family, since I'm in a feminist periodicals corpus), and by this point I'm googling to make sure I have the correct POS, I'm left with really, especially, particularly, finally, simply and clearly, all of which do seem to be opinion adverbs a la Hanks. I'm dealing with a corpus comprised of both activist and academic feminist periodicals and I wonder if these are characteristics of the latter's discourse, but since they aren't in sub-corpora yet I can't test this hypothesis.
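The export, scroll, and paste-into-excel routine above could also be done in a few lines of script. This is a minimal Python sketch, not what AntConc does internally. The regular expression, the function names, and the sample sentence are all made up for illustration, and the pattern deliberately stays as crude as the word ending sort: it happily returns adjectives and nouns like "early" and "family" along with the adverbs.

```python
import re
from collections import Counter

# Any run of letters ending in "ly". This cannot tell adverbs from
# adjectives ("early") or nouns ("family"); that still needs a human
# or a POS tagged corpus.
LY_WORD = re.compile(r"\b[a-z]+ly\b")


def ly_frequencies(text):
    """Count every word ending in 'ly', ignoring case."""
    return Counter(LY_WORD.findall(text.lower()))


def at_least(counts, minimum):
    """Keep words occurring at least `minimum` times, most frequent first."""
    return [(word, n) for word, n in counts.most_common() if n >= minimum]


# Hypothetical sample text; in practice this would be the corpus files
# read in and joined together.
sample = "Happily, the family was really early. Really!"
counts = ly_frequencies(sample)
print(counts)
print(at_least(counts, 2))
```

Raising the `minimum` argument does the same job as the frequency cut described above, trimming a list of thousands of types down to the ones worth sorting by hand.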

Week 5
I got a head start on week 5 since a bad cold stuck me in bed over the weekend.  As usual, the content was excellent, and even though I was working a day ahead of the official start to week 5, the online interaction and support was excellent, as I've come to expect from the MOOC (Note to self, make sure everything is set for ALL the course before the start).  This week's work involved online interfaces to existing corpora.  For most people, the google ngram is probably the most familiar example of this.

However, CL corpora come with far more bells and whistles (and I would add are far more reliable) than google books!   Our assignment was to use some of the metadata categories to compare male and female uses of colors.

I couldn't resist though making things far harder by attempting to do a diachronic analysis of pink across the three time periods delineated and by medium of publication and sex. HMMMM probably not such a good idea. However it was a LOT OF FUN (note to self stress to students that failure is an option, and often a good one). It turns out what I wanted to know (did the uses of pink shift or change in written corpora due to the rise of the women's movement (I hypothesized a decline) or breast cancer awareness (I anticipated a rise)?) was not really a good research question for this corpus. The experts, and I mean quite literally Tony McEnery running the course and Andrew Hardie, who is largely responsible for the corpora interfaces we used, both responded to my posts online (note to self it makes a HUGE difference to have this level of expert interaction while learning online, try to do this if possible).

I have no idea about other MOOCs (well yes I do clearly Cathy Davidson's Future Ed MOOC is excellent in terms of interaction with her), but this MOOC is like the best graduate seminar you ever took with experts in the field directly and specifically teaching you, and reading your work (because yes they even kindly read this blog and the other work I'm inspired to attempt), and correcting you!

Working with the corpora interface reminded me of why I hate google ngrams so much (counting stuff is seldom the most interesting aspect of anything, and shifts over time are fairly useless without context, and oh yes, there are major flaws in how google books pulls data), but I also remembered that Mark Davies at BYU has created a CL interface for google books. And I was off, with a weekend in bed, playing and attempting to pull out, in various ways, the origins of the term "women of color" and its relationship to other discourses such as "colored" (note to self sometimes the most important thing a student learns is what they really should be doing/researching).

Week 4
I'm super excited for this week which involves computer tagging of parts of speech and computer semantic analysis. The videos are excellent, but when it comes time to do my assignment I’m flummoxed.

My research corpus is in separate files, but the tools we are using take up to 100K words cut and pasted into a field on a web page. Hmmmm not excited about opening and combining that many files. So I grab 3 articles from Chrysalis as they are longer (for a total of just over 32K words). The tagging seems to work but the output is VERY hard to read. I try my trick from Antconc of moving files into excel, but unfortunately the columns don’t copy over so I can’t sort by the fields. Frustrating. Skimming over the 5000+ lines in excel is not going to cut it unfortunately, so I’m not really sure if the semantic tagging is useful or not. Since my corpus for research is coherent, I’m not actually sure I need semantic tagging. Clearly I should have used a smaller corpus (note to self stress parameters to students).

Meanwhile back to the POS tagging. I cannot clearly grasp how I import this into Antconc (note to self make sure to give students explicit instructions for every single step). I save as TXT and import. This helps some because the tags are pulling up in the word list, but it shows only the letters in POS tags (until I remember that I didn’t change the global settings to read ‘Letter,’ ‘Number,’ ‘Punctuation,’ ‘Symbol,’ and ‘Mark’ DOH). I notice the word her is still popping very high, which I’d expect in a feminist periodical, here as her_appge, the pre-nominal possessive pronoun, which means her + noun. Clusters and collocates reveal “name” and “daughters” but in a corpus this small the frequency is really low.

I return to look at “the” which I know will have highest frequency. Even so in a corpus of this size it is hard to find a statistically significant collocate with a frequency greater than 1. I do notice body and swagger, which are interesting.

Hmmm after working in antconc I decided to try semantic tagging again with a smaller corpus. I take one piece of 5K words. Again her comes up very frequently, specifically her_z8f, which I decode as pronoun, which duh. I reverse to see if anything has been identified as class C, which is art, or S7, which denotes power. I can’t seem to do that in antconc so I save the word list to output and pull that into excel, but apparently this is not my day since what normally works somehow merges a bunch of cells. I’m getting really frustrated, but I try searching just in the txt window and I’m able to see that yes, there are words tagged for the semantic areas I’m most interested in (note to self, this sort of exercise very hard to do online, might need synchronous online class).

I'm STILL not done with my homework, which is the fun task of running keyness on a piece of my own writing.

On a more theoretical issue I am now even more worried about my corpus. I have feminist periodicals from the late 1970s with articles ranging in size from newspaper to academic journal length. In addition to the disparate size of the items in my corpus I have some over representation in the newspaper, which was produced by a small group of women as compared to the academic journals, which will have virtually no overlapping authors.

Week 3

Yikes, I don't know if it was just my life (travel plus giant snow storm) or if the work in Corpus MOOC just got much harder in week 3, but it was a SLOG for me, even though the topic, discourse analysis, is one that I love, and the assignment, part of speech (POS) tagged texts, is exactly what I need to do to get at the grammar of women's liberation.

As my #moocinpublic tweets revealed, I pulled a shady student move and jumped right to the quiz. I was able to answer all of the questions with a few references to the PDFs of the lectures. I then moved on to Laurence Anthony's excellent videos explaining how to use AntConc to do clusters, Ngrams and working with POS tagged files. I do not have POS tagged files, but as usual, while live tweeting, Heather Froehlich reminded me that YES I needed to watch this anyway, Paul Rayson tweeted to tell me that creating a POS corpus was coming up, and Kat Gupta suggested a tool I might try to tag my corpus (note to self, when challenging students, 2x as important to have as much IRT support as possible).

Reading through the discussion forums, which are HUGE on Corpus MOOC I could tell the other MOOCers were also feeling the strain.  Straight up word frequency wasn't revealing much.  However my results from week 2 suggested a POS keyness, so I decided to run that instead.  I kept with Brown as the Corpus compared to LOB as the reference corpus.

NOTE if you are playing along at home, the POS tagging makes Antconc hang for a LONG time.  Do not despair.  Wait patiently for your results to pop up! (note to self, make sure to give students these sorts of tips so they do not get frustrated over the silly stuff)

The results are hard to read at first because the tags are there.  Still I didn't hide them because I really need to get familiar with them.   Instead I save the results to a TXT file which allows me to ZOOM

and voila there we have at #18 386 464.643 toward_ii
back to handy reference page of tags to find out what _ii indicates, and it is, as I suspected just a plain old preposition.  Now to look for "around" the other preposition I noticed as Key in week 2.  It is far lower in the #, but there is so much junk crowding up the list, I'm not concerned 62 298 92.549 around_ii.  Ok so I had already identified those both as prepositions on my own.  But I wonder what sorts of clusters they are falling into  around or toward what or whom? hmmm so unsurprisingly toward + an "article" is the most frequent cluster, but I also notice some intriguing pronouns, such as two types of "her" that I did not know existed possessive pronoun, pre-nominal (e.g. my, your, our)  and 3rd person sing. objective personal pronoun (him, her)
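Since the tagged files are just word_tag tokens, the "what follows toward" question can also be poked at with a short script. A minimal Python sketch follows. The sample line is invented, the lowercase tags are assumed to follow the CLAWS7 scheme (ii for general preposition, appge for pre-nominal possessive, ppho1 for objective him/her), and the function names are hypothetical, so treat it as a way to get familiar with the tags rather than a replacement for AntConc's cluster tool.

```python
from collections import Counter


def parse_tagged(text):
    """Split whitespace-separated word_tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore so the tag is always the final part.
        word, sep, tag = token.rpartition("_")
        if sep and word and tag:
            pairs.append((word.lower(), tag.lower()))
    return pairs


def tags_after(pairs, node):
    """Count the POS tags of whatever immediately follows `node`."""
    following = Counter()
    for (word, _tag), (_next_word, next_tag) in zip(pairs, pairs[1:]):
        if word == node:
            following[next_tag] += 1
    return following


# Invented sample in the style of a CLAWS7 tagged file.
sample = (
    "he_pphs1 rolled_vvd toward_ii her_ppho1 "
    "resentment_nn1 toward_ii her_appge father_nn1"
)
pairs = parse_tagged(sample)
print(tags_after(pairs, "toward"))
```

In the sample, "toward" is followed once by the objective her (ppho1) and once by the possessive her (appge), which is exactly the distinction between the two kinds of "her" described above.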

UPDATE: the extremely kind Andrew Hardie tweeted some advice to me re: keyness of toward. It is a UK/US dialect difference between toward and towards. I must again express my absolute amazement at the degree of engagement around #CorpusMOOC. I routinely get tweets from the lead prof Tony McEnery as well as AntConc developer Laurence Anthony. This is more like Corpus Master Class than a MOOC.
Honestly I must ponder that for a moment, then I realize it is the difference between her possessing a noun (or something) and her as a direct objecty type of thing (i.e. that is her book v. give the book to her). How that combines with toward, hmmm, I will need to go to the concordance view for that because this is really stretching my incredibly weak grasp of grammar (note to self, make sure a handy table such as the one provided here is available if possible). Turns out these are extremely key, but not very frequent, but at this point I'm looking for education not results. So for the possessive "her," "toward her father" appears twice and "toward her dowry" appears once. For the objective "her" there are also three results: "he rolled toward her" and "propelled the slave toward her" are interesting because in this sense (which I clearly got wrong) the toward implies movement in the direction of the her. Hmmm very interesting. The final example is an expression of an emotion, "resentment toward her."

ok, pressing on, I go to look at the titles and pronouns that I noted as negative keywords in LOB during week 2.  Mr and Sir appear as preceding noun of title  while she appears as 3rd person sing. subjective personal pronoun (he, she)

I think I know what this means, but I realize now looking at the concordance is the easiest way to grasp these grammatical terms. Ok so the Mr and Sir thing mean that they are, as expected, followed by a name.  She is more difficult because it is so numerous, so like I did with toward and around, I go to look at the clusters, most of which appear to be she + past tense of lexical verb (e.g. gave, worked) 
hmmm interesting, that grammatical construction shows women in "action" as we'd say in gender speak.  very interesting

Have been at this for almost 2 hours so I decide that this will have to be it for now.

Week 2
While doing work on Sunday, because I couldn't help myself, I got extremely excited about some new-to-me concepts (see early tweets in storify) in corpus linguistics as well as some actual numbers to use in interpreting corpus linguistics (which turn out to be much harder to come by than I expected)! I ended up highlighting the PDF transcripts of the videos to keep for future use. (note to self, multiple modalities better in online teaching).

Friday I returned to do the assignment, which was really interesting. First, CHEERS because this is the best supported MOOC ever [note to self try to find way to have 24/7 monitoring when teaching online, perhaps by giving credit to students each time they help another student?]! I live tweeted my assignment for this week's MOOC (I was running Brown G v LOB G) and immediately Amanda Potts tweeted back to be careful about results in a corpus so small. She suggested that I look for larger patterns such as part of speech or semantics.

I then re-ran full corpus against full corpus, using LOB as the ref. corpus. I was interested by the prepositions that are key in Brown (toward & around) as compared to negative keywords for LOB which included pronoun I, she & her (extremely interesting given frequency of those in general) as well as terms of address (mr & sir). These are extremely interesting to me in regards to my work on suffrage discourse. Does this mean LOB has a more formalized, yet also more personalized discourse? Hmmm yet "Mrs" pops pretty high on keyness for Brown, curiouser and curiouser. Of course, I've been well trained to understand that the lists can only get you so far, before you dive into KWIC to understand usage in texts!


Week 1


I got my first good dose of empathy when I read the work load estimate of 3 hours/week. That is significantly less than the 8-10 hours/week I expect my own students to dedicate to my online course, but still I panicked! I determined that I'd best follow my own advice to online students: figure out the times that fit your schedule and slot the course in as a commitment like you would a face to face class. Right now I'm planning to use Mondays from 12-1 and a chunk of Friday AM. I expect to be able to go fairly fast as I'm not new to corpus linguistics.

Today I jumped straight to the quiz [note to self, no one is looking at all that stuff at the start of the course. From now on prepare for the TL;DR student]. I was able to pass, but honestly I needed more than 1 try on some of the questions. A few pertain to areas of CL that I haven't explored or worked in, while a few others I'm ashamed to admit I already know how to do. I attempt to skim the introductions in the online discussion thread but I can already tell I'll be using twitter (#MOOCinPublic) to relate to the #corpusMOOC group [note to self if creating community among students find out preferred platform and group size]. I am fortunate that the software the #CorpusMOOC uses is the one I've been using so I get to skip a bunch of things, but I decide to download the new version. Much madness ensues as I repeatedly confuse the old version with the new. Finally I take the plunge and drag the old to the trash.

While I wait I skim through the FORTY ONE parts of WEEK ONE [YIKES]. There are a bunch that are advanced so I skim over those. Two are with the excellent Richard Xiao on translation. Since I'm not an ESL CL person I decide to skip those. Now we are down to 31 things for week 1. I'm going to watch the conversation with Geoffrey Leech later. I decide to check out the "discussion" question which turns out to be about Noam Chomsky. Honestly since this is optional, and I'm not looking to become a professional linguist, I decide to skip this as well. Excellent, down to twenty eight items. Sadly time to stop. Hopefully I can get back today to at least PLAN for Friday.

Quick skim reveals I missed this [note to self visually emphasize work to be done]

Practical activity - a question
Post your comments in the discussion below.
Take the LOB corpus and build a word list. Look at the top thirty words. How would you characterise these words? Do the same with the Brown corpus. Is it similar? Are there any differences between LOB and Brown? Feel free to concordance the words to inform your analysis.
If you have the time, do the same with the subsections of LOB and Brown. Might wordlists help to determine genre?

I ran the corpora through Antconc [the new interface is SWEET BTW]. I save the results and paste next to one another in excel in order to do the exercise. The most frequent words are quite similar. There are some minor variations in the order ("he" for example comes in at 12 v 11, but still higher than female or gender neutral pronouns). Coming at 30 for each is she and they, which is interesting but probably not of any great significance yet (gendered pronouns are an interest of mine). Genre from a raw word frequency list is not going to happen. However, since genre is also an interest of mine I continue by looking at the sub files. I select subcorpus G Belles lettres, biography, essays (75, 77) for size and variety. However looking at word frequency there really is not any significant variation here. In part this can be explained by the fact that I selected a large sub corpus that is significant in determining the overall corpus values. In part it can be explained by the fact that the most frequent words stay pretty standard. I wonder what will happen if I look at a very small corpus. Will it be different? Corpus M Science fiction (6, 6) is the smallest. Again I fire up Antconc but other than an intriguing "would" in Brown M, I'm really not seeing much. I am not that surprised as I've already done my own CL for genre and word frequency was somewhat helpful, but I had a very focused corpus of feminist manifestos and it was only after comparing A LOT of manifestos that I began to be able to see any genre markers.


the most valuable results have been word frequencies, which have generated two hypotheses so far
√ hypothesis one, woman/women most frequent word in things which should count as manifesto
√ hypothesis two, documents with male/man/men in top 10 are aimed at criticizing sexism, while those without male/man/men focus on women’s roles, status.
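The LOB versus Brown word list exercise, with its paste-side-by-side-in-excel step, can be sketched in Python too. This is only an illustration under simple assumptions: a crude tokenizer that keeps lowercase letters and internal apostrophes, hypothetical function names, and two toy sentences standing in for the real corpora.

```python
import re
from collections import Counter


def word_list(text):
    """Raw word frequencies, ignoring case (crude tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


def rank_table(counts, top_n=30):
    """Map each of the top_n words to its rank (1 = most frequent)."""
    ranked = counts.most_common(top_n)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def compare_ranks(counts_a, counts_b, top_n=30):
    """Side-by-side ranks; None means the word is outside that top_n."""
    ranks_a = rank_table(counts_a, top_n)
    ranks_b = rank_table(counts_b, top_n)

    def best_rank(word):
        # Sort by the better of the two ranks, then alphabetically.
        return (min(ranks_a.get(word, top_n + 1),
                    ranks_b.get(word, top_n + 1)), word)

    words = sorted(set(ranks_a) | set(ranks_b), key=best_rank)
    return [(w, ranks_a.get(w), ranks_b.get(w)) for w in words]


# Toy stand-ins for the two corpora.
corpus_a = word_list("the cat and the dog and the bird")
corpus_b = word_list("the the the she she he")
for row in compare_ranks(corpus_a, corpus_b, top_n=2):
    print(row)
```

With real corpora, the rows where one rank is None, or where the two ranks are far apart, are the ones worth taking to the concordance.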


I must admit that I've probably already spent 3 hours on #CorpusMOOC and I didn't even watch most of the videos since I already know how to use Antconc [note to self either way over-estimate or get some students to beta test for real times].  I'm curious about how long other people have taken and more than a little worried about future weeks!
