Last spring I had the pleasure of working with two undergraduate students, Taylor Lundeen and Catie Olson, enrolled in the University of Michigan’s School of Information. They worked on a capstone project on data visualization, using our Jane Addams digital edition databases. Anneliese Dehner, our web developer, helped out with some of the technical aspects of the collaboration.
One of the many great things about digital publication is that the information we create can be reused and repurposed in ways that we might not have thought of. Making our data available to researchers to explore has been one of our goals from the start of our work on Jane Addams, and with this investigation we have learned what we can do fairly easily, and what is more complex.
Accessing the Data
Our first step was to export a copy of our data so that Taylor and Catie could work on it. What they found worked best was an Omeka plugin (Omeka Rest API) that allowed them to export data in a format that worked well with data manipulation software.
Our ultimate goal is to have a utility on the digital edition that will enable users to download all or parts of the data for investigation.
One problem that reared its head immediately is that we have a very large dataset, and it is growing larger every day. This made it difficult to work with the whole set using the tools they had available.
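One way to cope with a dataset of this size is to pull it down a page at a time rather than all at once. The sketch below is only an illustration, not the code Taylor and Catie used. It assumes an Omeka Classic style REST API with an `items` endpoint that accepts `page` and `per_page` query parameters; the base URL, the `fetch_json` helper, and the `iter_items` function are hypothetical names made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode its JSON body (hypothetical helper)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_items(base_url, per_page=50, fetch=fetch_json):
    """Yield items one page at a time until the API returns an empty page.

    Assumes the endpoint is <base_url>/items and that it understands
    `page` and `per_page`; adjust both for your own installation.
    """
    page = 1
    while True:
        query = urlencode({"page": page, "per_page": per_page})
        batch = fetch(f"{base_url}/items?{query}")
        if not batch:
            break
        yield from batch
        page += 1
```

Because `iter_items` is a generator, a script can stop early, for example after the first few hundred items, which is handy when experimenting with a subset before committing to the whole export.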
Natural Language Processing
One of the approaches, which Catie worked on, was seeing what we could learn from analyzing the “Text” field in our database, where transcriptions are stored. This kind of analysis can track the frequency of words, or compare word usage over time. Eventually it could be used for topic modeling, where a digital tool tries to make sense of words that appear together. These groupings can uncover connections that we sometimes don’t expect.
An important step in working with our texts was data cleaning, the process by which HTML and special characters were cleaned out and text was split word by word. Then Catie built bar charts that displayed the most common words. She built a separate chart for each year to allow us to compare years to see what Addams was thinking and writing about.
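For readers curious what that cleaning step can look like, here is a minimal sketch in Python. It is not Catie's code: the regular expressions, the function names, and the assumption that each record arrives as a `(year, raw_text)` pair are all ours, chosen to keep the example short.

```python
import html
import re
from collections import Counter, defaultdict

TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")   # words, allowing one inner apostrophe


def clean_text(raw):
    """Strip HTML tags, decode entities such as &amp;, and lowercase."""
    no_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(no_tags)
    # Treat curly apostrophes like straight ones so "don’t" stays one word.
    return decoded.replace("\u2019", "'").lower()


def tokenize(raw):
    """Split a raw transcription into a list of lowercase words."""
    return WORD_RE.findall(clean_text(raw))


def counts_by_year(records):
    """Build one word-frequency Counter per year from (year, raw_text) pairs."""
    totals = defaultdict(Counter)
    for year, raw in records:
        totals[year].update(tokenize(raw))
    return totals
```

Calling `counts_by_year(records)[1915].most_common(20)` would then give the twenty most frequent words for 1915, which is the kind of list a per-year bar chart can be drawn from.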
The most obvious finding, to me, was that we needed to think about stop words: words that are excluded from the results because they are too common or have no analytical meaning. Articles like “a” and “the” are common stop words. We also had to consider “page,” which we use to signify the next page in our transcriptions, and, gulp, even “Hull House,” because we transcribed the letterhead that Jane Addams used. Other words like “Mrs.,” “Mr.,” and “Miss,” and salutations like “Dear,” are candidates for being pulled from the analysis.
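A stop list like this is easy to express in code. The sketch below is an assumption-laden illustration rather than what was actually run: the word and phrase lists simply collect the examples mentioned above, and `content_words` is a hypothetical helper name.

```python
import re

# Multi-word phrases are removed first, before the text is split into words.
STOP_PHRASES = ["hull house", "hull-house"]

# Single words drawn from the examples above; a real list would be longer.
STOP_WORDS = {"a", "an", "the", "page", "mr", "mrs", "miss", "dear"}


def content_words(text):
    """Return the lowercase words of `text` with stop phrases and words removed."""
    lowered = text.lower()
    for phrase in STOP_PHRASES:
        lowered = lowered.replace(phrase, " ")
    words = re.findall(r"[a-z]+", lowered)
    return [word for word in words if word not in STOP_WORDS]
```

One caution about this design: dropping “Hull House” everywhere also discards the places where Addams genuinely wrote about the settlement, not only the letterhead, so a project might prefer to strip the letterhead during transcription cleanup instead.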
We also got to see the frequency of that nemesis of editors – “illegible.” This comes up far more frequently than I would like, but I was gratified to see that in the years where we have proofread the texts, the frequency is much lower.
It will surprise no one that “peace” and “war” shot to the top in 1915.
In 1905, the most frequent words deal more with the plight of children and represent Addams’ work on child labor and welfare in Chicago.
Catie also worked on another way to show the content of Addams’ writings, plotting the frequency of a word over time. Similar to the Google n-gram viewer that can compare the frequency of words in Google Books over time, this gives you a sense of the chronology. We did not have the capacity at this point to allow users to type the words they want, but were able to produce n-grams for some of the most popular words.
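The idea behind such a chart can be sketched in a few lines of Python. This is our own illustration, not the code behind the charts: it assumes the texts arrive as `(year, text)` pairs that have already been cleaned, and it reports occurrences per 1,000 words so that years with more transcribed documents do not automatically look busier. The function name is hypothetical.

```python
import re
from collections import defaultdict


def frequency_over_time(records, word):
    """Return {year: occurrences of `word` per 1,000 words}, ordered by year."""
    target = word.lower()
    hits = defaultdict(int)     # times the word appears in each year
    totals = defaultdict(int)   # total words seen in each year
    for year, text in records:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {
        year: 1000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]  # skip years with no words to avoid dividing by zero
    }
```

Running this once per word of interest yields one line per word, which can then be plotted together or one at a time.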
Seen together, it is a little frightening, but on the live version on the site, you can select a single word to analyze.
The n-gram for “Illegible” shows the power of proofreading! When the data was downloaded for use, we had just finished proofreading 1915!
Social Network Analysis
Another approach was to see what we could learn from social network analysis. Using Omeka’s Item Relations plugin, we have been tracking relationships — mostly between documents and the people, organizations, and events that are mentioned in them. We also are building connections between people and organizations, tracking which people were members of which organizations, for example, or who participated in a specific event. We wondered whether the relationships between people and organizations might yield some interesting insights, or whether we could find other connections between people and the metadata gathered about them. Taylor was responsible for this project.
Our large dataset proved to be problematic for developing a meaningful social network based on shared connections. We think there is promise for this in the future by controlling which people are included in the network, but the sheer number of people and the number of common tags produced a daunting graph.
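To make the idea of controlling who is included more concrete, here is a small sketch of one possible approach; it is not something we have built. It assumes the relationships can be exported as `(person, organization)` pairs, links two people whenever they share an organization, and lets you restrict the network to a chosen set of people or to pairs with a minimum number of shared organizations. All names in it are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_membership_edges(memberships, people=None, min_shared=1):
    """Build weighted person-to-person edges from (person, organization) pairs.

    people:     optional set of names; anyone outside it is ignored.
    min_shared: keep only pairs sharing at least this many organizations.
    Returns {(person_a, person_b): shared_count} with names in sorted order.
    """
    members = defaultdict(set)
    for person, org in memberships:
        if people is None or person in people:
            members[org].add(person)

    weights = defaultdict(int)
    for group in members.values():
        for a, b in combinations(sorted(group), 2):
            weights[(a, b)] += 1

    return {pair: n for pair, n in weights.items() if n >= min_shared}
```

Raising `min_shared` or passing a short `people` list are two simple ways to thin out a graph that is too dense to read.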
Instead, Taylor created a geographical visualization of Addams’ social networks related to several topics. We used our tags for movements like “Woman Suffrage,” “Child Labor,” and “Peace” and plotted the geographic locations of the people tagged with them. Compare Addams’ Settlement Movement network and her Peace network below to see the expansion of her work internationally.
On the live version of these maps, you can zoom in and out and mouse over each dot to reveal the name of the activist.
It was amazing to see what two talented students could do in such a short period of time! The experience has helped us think more about how we want to make our data accessible, and has uncovered challenges that we need to think about. Our database is large and complex, and developing ways to limit queries is going to be important.
We are looking forward to working with other UMSI students and any digital humanists interested in advancing this work.