Friday, January 3, 2014

Gephi for the historically-inclined

Note: I started this blog post when I first finally got Gephi working for myself. I never finished it, but a request from THATCamp AHA prompted me to pick it back up.

Gephi is open-source software for visualizing and analyzing large network graphs. It does a nice whiz-bang job of making digital work look really impressive. Plus all the dots are pretty as they move (the dots are actually called nodes, lesson #1 complete).



I'm not an idiot. I've figured out concordance software and topic modeling programs, but DAMN if Gephi didn't make me feel like one. I found this post by Justin Briggs useful in explaining what the heck Gephi does. Amanda French's blog post (which I can't find online now) and Robin Davis's kind in-person walk-through when I first got started clarified enough for me over the summer to create my first to/from network (source/target). So in the spirit of paying it forward, here is the bare-bones summary. You should also check out Elena M Friot's recent blog post.

The secret to Gephi is that there is no secret to Gephi. It is kludgy. It crashes. It is not intuitive, and basically everyone just tweaks the settings until they get the visualization they want!

The good news? Built into Gephi are numerous explanatory notes, generally indicated by a small ? icon. These are crucial if you are not familiar with what the hell the settings mean (and there are a lot of settings).

The best way to learn Gephi is to simply start using it. You download it here, and Gephi resides on your laptop. Double-click the file to launch.

Make a very simple spreadsheet of a simple relationship (e.g. letters FROM one person to many people, or participants in conferences). The only limit is that one column MUST be labeled SOURCE and one TARGET. Note that this graph will only show the network between individuals and conferences, not between the individuals themselves; for that you need two separate spreadsheets. You can have other columns and use those to partition the data later, but Source and Target are mandatory. If you do this in Excel, save as comma-separated values (.csv) to import into Gephi.
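If you would rather script the spreadsheet than build it by hand, here is a minimal Python sketch of the same idea. The names ("Person A", "Conference 1"), the file name edges.csv, and the helper write_edge_csv are all made-up placeholders, not my actual data; the only part taken from the steps above is the Source/Target header.

```python
import csv

# Hypothetical placeholder rows: each row is one (person, conference) tie.
rows = [
    ("Person A", "Conference 1"),
    ("Person A", "Conference 2"),  # someone at two conferences gets two rows
    ("Person B", "Conference 1"),
]

def write_edge_csv(path, pairs):
    """Write (source, target) pairs to a CSV with a Source/Target header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target"])  # the two mandatory column labels
        writer.writerows(pairs)

write_edge_csv("edges.csv", rows)
```

Any extra columns you want for partitioning later would simply be added to the header row and to each tuple.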

Here is a spreadsheet of participants in four conferences. The Source is the person and the Target is the conference. If a person participated in more than one conference (like Audre Lorde, who is the subject of this network), then they are entered more than once.



After you launch Gephi, you have to enter your data in the "data table" tab. That is your .csv file. I know this makes no sense, but I swear it works: enter this as the edges (the connections, in other words), and Gephi will generate the nodes for you for this simple network. Make sure "generate missing nodes" is checked: "If you import the edge table first it will create a nodes table using the values in the two columns and you can click the “create missing nodes” box to tell it do so."
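To see why importing only the edges is enough, here is a small Python sketch of the idea behind generating missing nodes: every distinct value that appears in either column becomes a node. This is my own illustration, not Gephi's actual code, and the function name nodes_from_edges and the sample names are invented.

```python
def nodes_from_edges(edges):
    """Return each distinct label found in an edge list, in first-seen order."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for source, target in edges:
        seen.setdefault(source, None)
        seen.setdefault(target, None)
    return list(seen)

# Hypothetical placeholder edges (person -> conference).
edges = [
    ("Person A", "Conference 1"),
    ("Person A", "Conference 2"),
    ("Person B", "Conference 1"),
]

nodes = nodes_from_edges(edges)
print(nodes)  # ['Person A', 'Conference 1', 'Conference 2', 'Person B']
```

Three rows of edges yield four nodes, because "Person A" and "Conference 1" each appear twice but only count once.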




Once inside Gephi, click on the "graph" tab. You select a graph layout and click run, and voilà, your network.



Except it will likely look a mess. That is because within Gephi there are tons of variables that do all sorts of stuff. Working through this ppt was extremely helpful in determining how to tweak the settings and which graph layouts visualize which kinds of data best.

Within Gephi you can change the color of nodes, add labels, and partition the data by other attributes. You can also change individual nodes, as I did here to highlight the conferences, by selecting the node in the data table and editing the color.



To change the label to the color of the node, you have to click on the teeny tiny arrow in the lower right corner, which reveals the screen shown below.



Formulating data is the hardest part. There are video tutorials on the youtubes that are useful (including Gephi's), and I loved the videos in the Coursera course with Lada Adamic (before I dropped out of the course). Clement Levallois has a nice explanation of how to do the timeline function, which is obviously of total interest to the historically inclined (and which I've managed to do once). (A Tutorial – on dynamic networks - Clement Levallois)

I would also suggest, though, that people check out Raw, which has a far more user-friendly interface. It doesn't generate dynamic networks (i.e. the ability to see the pretty dots move in mesmerizing ways), just static visualizations, but it does some pretty complex things very easily. Before pouring a ton of time into Gephi, I'd seriously consider whether you need all the bells and whistles, or whether the gee-whiz factor is what is seducing you. Here is the same information as above in Raw, which I did by cutting and pasting the spreadsheet into the GUI, dragging two labels into two spots, and VOILA, done.



Coming soon, part 2: importing data via the spigot using Excel-to-CSV conversion, OHHHH, and finally some visualizations/network explorations for which Gephi is actually necessary....


Part II

The above networks are extremely simple to-from examples. In this case I am not convinced that Gephi is particularly necessary. However, what if you have more extensive data to work with?

Using the "spigot" plug-in you can start to do that. There are good instructions at that link with screen captures, so I'm going to write a briefer how-to than the one above.

Once you have installed the spigot plug-in, click under file and go down to import spigot.


Navigate to your spreadsheet. Since there is not a search function in this interface, try to save these files in an easy-to-remember place. I keep mine in SkyDrive, as this metadata file is huge.


As you can see, all of the labels for my columns remain. I've not renamed any of them "source" or "target" as I did in the first example on this blog. Now I can select any two columns to "network." This spreadsheet contains all the metadata associated with the files from the six-volume History of Woman Suffrage. Since my project explores gender in this work, I'll select the columns for genre of entry and author sex. Note that because my spreadsheet is in Excel I don't have to change any other settings. If you have your data in comma-separated value format (CSV), then you would change to "comma"; if it is tab delimited, then switch to "tab," etc.
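For anyone curious what "select any two columns" amounts to, here is a rough Python sketch, assuming a delimited text export rather than an Excel file. The column names (genre, author_sex), the sample rows, and the helper column_pairs are invented placeholders, not my real metadata; the delimiter argument plays the role of the "comma"/"tab" setting.

```python
import csv
import io

def column_pairs(text, source_col, target_col, delimiter=","):
    """Pair up two named columns from delimited text, one pair per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [(row[source_col], row[target_col]) for row in reader]

# Hypothetical tab-delimited export; the real spreadsheet has many more columns.
sample = (
    "title\tgenre\tauthor_sex\n"
    "Entry 1\tletter\tfemale\n"
    "Entry 2\tspeech\tmale\n"
)

pairs = column_pairs(sample, "genre", "author_sex", delimiter="\t")
print(pairs)  # [('letter', 'female'), ('speech', 'male')]
```

The untouched title column is simply ignored here, which is why nothing has to be renamed "source" or "target" first.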


The next screen provides you with three options. The first is going to happen whether you click it or not because we only have two columns. You may choose to do the second or third. The final page confirms what will be done with a little explanation.


The final screen before entering the graph interface itself alerts you to any errors in your dataset. Scroll through. If they are terrible, go back and fix them; if not, push on! (Remember, most data is messy data.)




Once in Gephi, click on the data table tab and select nodes (the circles in your network).  As you can see, Gephi has weighted my network by totaling up the genres and the authors' sex. 
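My understanding of the weighting is that repeated pairs get totaled into a single weighted edge. Here is a minimal Python sketch of that totaling, which is an assumption about the idea rather than a description of Gephi's internals; the genre and sex values are invented placeholders.

```python
from collections import Counter

# Hypothetical (genre, author sex) pairs, one per spreadsheet row.
pairs = [
    ("letter", "female"),
    ("letter", "female"),
    ("speech", "male"),
    ("letter", "male"),
]

# Each distinct pair becomes one edge; its weight is how often the pair occurs.
weights = Counter(pairs)

for (source, target), weight in weights.items():
    print(source, target, weight)
```

With these placeholder rows, the letter/female edge ends up with a weight of 2 and the other two edges with a weight of 1.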


What I get is a graph with nodes for ALL OF THE ABOVE, including male and female. If I hover over the node labeled MALE, its connections to the genres are highlighted. (I edited the node to make the label caps and the node size larger for better visibility. Do this by control-clicking on it in the data table and selecting "edit node.")



If I hover over the node labeled FEMALE, the networks between the genres for "Females" are highlighted. (Line weight = number of entries.)


I've also used the "ranking" function to set the edges to weight, which is kind of useful and kind of not. I have changed the lines between the nodes (the edges) to colors that reflect the frequency. However, the only way to see the "legend" is to use the little tiny button in the bottom left corner, and the legend does not display in the graph itself. You can screen capture it and cut and paste it alongside your graph.



The preset color scheme here isn't very useful for visualization purposes. To change it, I hover over color, where you can see the little arrows that set the colors. However, to change the palette you need to click on the strange little icon on the other side. Clicking on the color itself will get you an interface to change the palette further. I find this annoying because of course I want a different color for each # and have to tweak incessantly to get that. Left click will get you the "screenshot" mode here to download an image.


Again, although the ability to play with the colors is nice, I'm not sure it is particularly useful for exploring the network, especially given how complex Gephi can get (and how it tends to crash). Below is the same data in Raw.

