On April 26, 2016, Lindsay Van Tine and Rachel Buurma from the Early Novels Database (END) project team joined us at WORD LAB to talk about the project’s history and future. The END project is undergraduate-driven and works to create rich metadata for specific copies of early fictional works. The students engage in 1) cataloging (data collection) and 2) project development (data analysis), and they are paid for their work. They have a “40/60 day” where they devote time to learning about the field and their work as well as engaging in END tasks. This post covers more detail about the project as well as examples of student activities and Q&A.
On April 1, we were joined via Skype by Mark Ravina, professor of history at Emory University. Mark researches Japanese political history in the latter half of the 19th century. He told us about his recent text analysis work on 19th-century Japanese history, and also about his work with a student on the recent student protests involving race at various universities around the United States. We’ll be following up with Mark on his research in the Fall 2016 WORD LAB lineup.
Katie Rawson, WORD LAB co-organizer, presented on her current research and its background on January 27, the first WORD LAB of 2015. Her work focuses on the Southern Foodways Alliance (SFA) oral histories of food culture in the American South, and specifically on using topic modeling to analyze those narratives.
The SFA’s mission is to “document and celebrate food in the South” and also to engage in race reconciliation. It does this largely through collecting oral histories and making them (and transcripts in PDF format) freely available, as well as hosting films. Interestingly, Katie noted that the filmmaker who documents Southern food culture and workers is male, whereas the majority of the oral history interviewers are both white and female. In fact, she emphasized that they have particular aesthetics and particular stories they want to tell, although it’s also a question of what they can do with what is actually collected: they’re framing it, but they can’t control the content.
Katie was interested in analyzing the oral histories, which she went through for her dissertation, to discover themes in them – especially those related to gender roles, work, family, and business – that aren’t immediately apparent to a human reader.
She began her work by downloading all the oral histories and making them into plain text files, with little manual cleanup (they’re mostly easy to OCR and fairly standardized), then used Topic Modeling Tool (a GUI for MALLET) to do her analysis. She made her own list of stop words, which expanded greatly over time in specific ways; when trying to get past what “grouped them together already,” she found herself adding not just personal names and places but all food-related words to the stop list, making it longer and longer. This was an attempt to get at latent discourse that might gather the histories into different groups than a human would “by project” or “by theme,” or might show disparities within projects. (For example, topic modeling without this extensive stop word list just identifies projects like “how people talk about barbeque” – the ways in which SFA itself had already organized the histories.)
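The layered stop list described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Katie's actual workflow: she used Topic Modeling Tool, and the layer names, word lists, and sample sentence here are all hypothetical stand-ins for lists she built by hand.

```python
import re

# Hypothetical stop-word layers; the real lists were built by hand over time.
STOP_LAYERS = {
    "function": {"the", "a", "and", "of", "to", "in", "we", "my", "was"},
    "names": {"katie", "smith"},
    "places": {"mississippi", "memphis"},
    "food": {"barbecue", "oysters", "pork", "sauce"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def filter_tokens(tokens, layers):
    """Drop every token that appears in any of the named stop-word layers."""
    stop = set().union(*(STOP_LAYERS[name] for name in layers))
    return [t for t in tokens if t not in stop]

# Invented sample sentence, not taken from an SFA transcript.
sample = "My mother kept the books and we smoked pork in Memphis"
tokens = tokenize(sample)

# Only the ordinary function words removed.
basic = filter_tokens(tokens, ["function"])
# Names, places, and food words removed as well.
full = filter_tokens(tokens, ["function", "names", "places", "food"])

print(basic)
print(full)
```

Keeping each category as its own layer makes it easy to rerun the model with some layers switched off and compare what the topics look like at each stage.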
While Katie is getting something fascinating at the moment, it’s not the “language of running a business” and the “language of food and family” that she was hoping to uncover, and it’s also still not breaking up SFA’s premade sections. What’s the point of doing topic modeling, Katie wonders, if it just breaks the histories into the groups they came in to begin with? Maddie Wilcox suggested that Katie could look at “what happens when you remove each layer” – first personal names, then places, then food – and also take a look at the exceptions rather than the obvious project divisions. Elias Saba agreed, suggesting going back even before Katie started taking out stop words and looking at outliers: which histories don’t show up in the groups that we know they are “in”? Elias also wondered what other words could be removed – for example, everything that indicates “why they were in the interview in the first place” (as Katie put it) – and then Katie would have a list of words that people are or are not using, perhaps interesting in itself as an outline of language.
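One way to look for the outliers Elias mentioned is to compare each history's strongest topic with the topic that is typical of its project. The sketch below assumes that per-document topic weights have already been exported from the topic model. The document IDs, project labels, and weights are invented for illustration, and "dominant topic differs from the project's most common one" is just one possible definition of an outlier.

```python
from collections import Counter, defaultdict

# Hypothetical data: each history's SFA project and its topic weights.
docs = {
    "bbq_01": ("barbecue", [0.80, 0.15, 0.05]),
    "bbq_02": ("barbecue", [0.70, 0.20, 0.10]),
    "bbq_03": ("barbecue", [0.10, 0.25, 0.65]),
    "oys_01": ("oysters", [0.05, 0.90, 0.05]),
    "oys_02": ("oysters", [0.20, 0.60, 0.20]),
}

def dominant_topic(weights):
    """Return the index of the topic with the largest weight."""
    return max(range(len(weights)), key=weights.__getitem__)

def find_outliers(docs):
    """List documents whose dominant topic differs from their project's usual one."""
    by_project = defaultdict(list)
    for doc_id, (project, weights) in docs.items():
        by_project[project].append((doc_id, dominant_topic(weights)))

    outliers = []
    for members in by_project.values():
        # The most common dominant topic within this project.
        typical, _ = Counter(topic for _, topic in members).most_common(1)[0]
        outliers.extend(doc_id for doc_id, topic in members if topic != typical)
    return sorted(outliers)

outliers = find_outliers(docs)
print(outliers)
```

Here the third barbecue history is flagged because its strongest topic is not the one its project usually falls into. Those flagged histories are the ones worth rereading by hand.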
Katie also explained a specific analysis she did related to gender roles that used topic modeling, in a subset of histories involving oysters. Specifically, she first ran Topic Modeling Tool on the transcripts, then circled every time a word within a topic had to do with a family relationship, striking out some topics or words as she went, and also highlighted words that had to do with business. She looked at both word frequency and the relationship between words in topics, and how they were distributed. Katie went in with expectations but things broke down differently: rather than the discourse that was represented in an SFA film about the industry, the one that emerged from the histories was more about women’s work being empowering and interesting. Women are the ones who run the household and finances, and they can make something when men can’t because their income is more reliable. Even though there is still a divide, Katie found that the discourse here had a rich story to tell about how work is negotiated in the space of the oyster industry, and what the divide means isn’t as apparent as it seems.
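The circling and highlighting Katie did by hand could be roughly approximated by tagging each topic's top words against two word lists. This is a loose sketch, not her method: the family and business vocabularies and the topic word lists are hypothetical, and a fixed list cannot reproduce the judgment calls she made when striking out words or topics.

```python
# Hypothetical category vocabularies standing in for hand annotation.
FAMILY = {"mother", "father", "wife", "husband", "daughter", "son"}
BUSINESS = {"money", "sell", "price", "customers", "books", "pay"}

# Hypothetical top words for three topics from the oyster histories.
topics = {
    0: ["boat", "husband", "water", "son", "season"],
    1: ["wife", "books", "money", "pay", "customers"],
    2: ["shuck", "house", "price", "mother", "sell"],
}

def tag_topics(topics, categories):
    """Return, for each topic, the words that fall into each category."""
    return {
        topic_id: {
            name: [w for w in words if w in vocab]
            for name, vocab in categories.items()
        }
        for topic_id, words in topics.items()
    }

tagged = tag_topics(topics, {"family": FAMILY, "business": BUSINESS})

# Topics where family and business vocabulary appear together.
mixed = [t for t, hits in tagged.items() if hits["family"] and hits["business"]]

for topic_id, hits in tagged.items():
    print(topic_id, hits)
print(mixed)
```

Topics that mix both vocabularies are candidates for the kind of close reading described above, where family roles and business roles turn out to be intertwined.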
In the end, Katie still wondered about the efficacy of topic modeling in understanding the SFA’s oral histories, but as Brian Vivier pointed out, in at least one case she had found something new – despite having read almost every single interview already for her dissertation. Katie plans to continue her work, to hone the stop word list, and keep thinking about the applicability of topic modeling and how to make the methodology work for her material – and what that methodology might look like.
As a coda, the WORD LAB group raised the question of oral histories as a genre, and how they could be analyzed computationally as such. In addition, we all advocated making this kind of material – including Katie’s data such as plain text files – available freely so that it can be worked with more broadly. This is especially important in the case of oral history, which is less frequently studied as a genre, and in a less-studied geographic area such as the American South. We cheered on the recent release of slave narratives from DocSouth and hope that Katie’s work can contribute to the computational study of narratives of the South in this environment.