On April 26, 2016, Lindsay Van Tine and Rachel Buurma from the Early Novels Database (END) project team joined us at WORD LAB to talk about the project’s history and future. The END project is undergraduate-driven and works to create rich metadata for specific copies of early fictional works. The students engage in 1) cataloging (data collection) and 2) project development (data analysis), and are paid for their work. They have a “40/60 day” where they devote time to learning about the field and their work as well as engaging in END tasks. This post covers more detail about the project as well as examples of student activities and Q&A.
Our initial question was, does Underwood and Sellers’s argument stand up? While we were concerned about hand-picking data, one member suggested that “the sampling was sampling” and this was convincing. In addition, we felt that the random sample truly could be as random as possible given that HathiTrust’s coverage of 19th-century texts is better than its coverage of other periods. We discussed why that might be: deaccessioning, preservation, simply the sheer numbers of books produced in the 19th century as opposed to earlier, OCR accuracy, and copyright restrictions. Also, there was the procedure of digitization itself: the University of Michigan digitized its research collections, for example, rather than things in special collections, and this might have included many more 19th-century books.
We also asked what exactly significance and prestige mean. One member brought up the example of Wendell Harris’s article “Canonicity” and his argument that texts become part of the canon based on being part of the conversation; this would go along with the idea that things being reviewed in magazines are significant in that they are being talked about in the first place, whether positively or negatively. And even if things were being negatively reviewed, they still helped to shape literary conversation and thus “the canon” and what survived over time. One member also raised the question of doing sentiment analysis of some kind (whether yes/no or picking out significant words, as another member suggested) on the reviews and adding that data to the analysis.
A question was also raised by a Wharton member: with literary analysis of this kind, is it more about interpretation or explanation, and what is the outcome of the research? With business research, there would be an action suggested based on the conclusions. We ended up talking about how the paper is finding whether there is something at all to interpret or explain in the first place. The authors stop short of explaining or making generalizations, and emphasize the narrowness of their claims. (Whether one should take this approach or go out on a limb to make bigger, but potentially wrong, claims was also discussed at this point.) We also wondered: if they had success in prediction, does that mean there “is” an explanation somewhere of what makes things significant? Is there a latent pattern that exists, and that we might balk at recognizing, as humans?
Finally, we also discussed why the line was always going upward. It seems that this is because the works reviewed and random non-reviewed sample adhered more closely to the “standards,” whatever they are, over time for some reason. Again, we can speculate but not exactly explain what’s going on behind the scenes there.
One conclusion was that we buy the continuity or lack of change in standards over time. The reason is that we see this in other periods where there is a narrative of change, but in reality, much continuity: Meiji Japan (mid-to-late 19th century), and Lu Shi’s use of classical Chinese in his writing despite being at the vanguard of “modern” Chinese literature.
Addendum: Scott has been tinkering with this and is reducing the list of words and making better predictions, with between 100 and 400 words. The selection only looks at the training data, without seeing test data in advance. Depending on the training data it picks slightly different sets, but a core set of about 15 words always appears. The top of the list is “eyes” and if you use just the word “eyes” you get 63% accuracy! See Scott’s GitHub for the code and results.
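To make the idea of choosing words from the training data alone a bit more concrete, here is a minimal sketch. It is not Scott's code: the scoring rule (the gap in document rates between the two classes), the function names, and the toy sentences are all our own assumptions for illustration.

```python
from collections import Counter


def doc_freq(docs):
    """Count how many documents each word appears in."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.lower().split()))
    return counts


def select_words(train_docs, train_labels, k):
    """Pick the k words whose document rate differs most between the two
    classes. Only training documents are ever passed in here."""
    pos = [d for d, y in zip(train_docs, train_labels) if y == 1]
    neg = [d for d, y in zip(train_docs, train_labels) if y == 0]
    pos_df, neg_df = doc_freq(pos), doc_freq(neg)

    def gap(word):
        return abs(pos_df[word] / len(pos) - neg_df[word] / len(neg))

    vocab = set(pos_df) | set(neg_df)
    # Sort by descending gap, then alphabetically so ties are deterministic.
    return sorted(vocab, key=lambda w: (-gap(w), w))[:k]


def predict(doc, words):
    """Toy rule: label a document 1 if it contains any selected word."""
    tokens = set(doc.lower().split())
    return 1 if any(w in tokens for w in words) else 0


# Invented toy data: 1 = reviewed, 0 = random sample.
train_docs = ["her eyes met his", "dark eyes and rain",
              "the ship sailed north", "rain on the ship"]
train_labels = [1, 1, 0, 0]
test_docs = ["his eyes closed", "north of the harbor"]

words = select_words(train_docs, train_labels, k=1)
predictions = [predict(d, words) for d in test_docs]
```

The point of the structure is that test_docs never reaches select_words, so the word list cannot leak information from the documents we later score.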
We got a future potential WORD LAB project out of this discussion, so it was a very productive reading and session!
On April 1, we were joined via Skype by Mark Ravina, professor of history at Emory University. Mark researches Japanese political history in the latter half of the 19th century. He told us about his recent text analysis work involving 19th-century Japanese history, and also about his work with a student on the recent student protests involving race at various universities around the United States. We’ll be following up with Mark on his research in the Fall 2016 WORD LAB lineup. Continue reading March 1, 2016 WORD LAB – Mark Ravina
On Tuesday, we discussed “Digging for Gold with a Simple Tool: Validating Text Mining in Studying Electronic Word-of-Mouth (eWOM) Communication,” by Chuanyi Tang and Lin Guo (2013).
This article tackles the problem of text mining from a marketing perspective, testing whether text mining offers useful information in the study of eWOM (electronic word-of-mouth, aka online reviews). Tang and Guo conclude that while the Star Rating of an online review is the best predictor of people’s attitudes in the review, text mining can offer additional nuance.
Much of our conversation centered around the LIWC software used by Tang and Guo for their study. Essentially an amped-up text tagger, LIWC checks each word in a text against its range of dictionaries and produces a statistical breakdown of that text.
LIWC’s main strength seems to be its dictionaries, which are thoroughly researched and allow for somewhat sophisticated tagging of words by a range of features: parts of speech, emotions (positive or negative), and many categories including “Body,” “Ingestion,” “Time,” “Money,” “Religion”–over 400 different categories in all. In the case of online reviews, for example, Tang and Guo found that “Negations” and “Money” were both effective predictors. These dictionaries are, however, proprietary, and Christine pointed out the difficulty of accessing the full dictionaries in the latest version of the LIWC software.
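For readers who have not seen a dictionary-based tagger, the general mechanism can be sketched in a few lines. This is only an illustration of the idea: the two tiny word lists below are invented and are not LIWC's proprietary dictionaries, and the function name is our own.

```python
import re
from collections import Counter

# Toy category dictionaries, invented for illustration (NOT LIWC's lists).
CATEGORIES = {
    "negations": {"no", "not", "never"},
    "money": {"price", "refund", "cheap", "cost"},
}


def category_breakdown(text, categories=CATEGORIES):
    """Return the percentage of words in the text that fall in each category."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for word in words:
        for name, vocab in categories.items():
            if word in vocab:
                hits[name] += 1
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {name: 100.0 * hits[name] / total for name in categories}


breakdown = category_breakdown("Not worth the price")
```

Each review then becomes a short vector of category percentages, which is the kind of input a study could feed into a regression alongside the star rating.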
We tested LIWC on a segmented version of Dickens’s David Copperfield, as a good example of a coming-of-age story, but weren’t able to find strong trends. The whole paper was a very interesting counterpoint to previous work we’ve discussed on text mining in the humanities, where it’s not always so easy to validate the result.
Thanks to Christine Chou for suggesting the piece and taking the time to give us a great overview!
For this week’s WordLab discussion, we read David L. Hoover’s 2012 article, “The Tutor’s Story: A Case Study of Mixed Authorship” (English Studies 93:3, 324-339). In this article, Hoover looks at The Tutor’s Story, a novel begun by Victorian author Charles Kingsley and finished by his daughter Mary St Leger Kingsley Harrison, writing under the name Lucas Malet. Hoover uses this text as a test case to compare a range of authorship attribution methodologies, including Burrows’s Delta, Craig’s version of Burrows’s Zeta, and t-tests.
Hoover compares his results to an annotated version of the published text, discovered partway through his research, containing Malet’s own markings about which parts of the text are hers and which are Kingsley’s.
We spent most of our time today piecing out the specifics of the different methods. There seem to be two stages to the process–1) selecting the words which will compose the “fingerprint” of the author’s style, and 2) analyzing the statistical similarity of these words. More of Hoover’s explanation covers variations on the first part. For example, Delta uses the most frequent words in each text, while Zeta is based “not on the frequencies of words, but rather on how consistently the words appear” (329). Our consensus was that we were interested in reading more about this, and we may move to some of the original articles on the Delta method in future weeks. The stylo package for R also apparently has commands for Delta and Zeta which we could explore.
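As a way of pinning down the two stages for ourselves, here is a rough sketch of Burrows's Delta as it is usually described: take the most frequent words across the corpus, convert each text's relative frequencies to z-scores, and average the absolute differences between two texts. The function names, the tiny texts, and the choice of only three words are our own assumptions; Hoover's study uses far longer word lists and real novels.

```python
from collections import Counter
from statistics import mean, pstdev


def rel_freqs(text, vocab):
    """Relative frequency of each vocab word in a (non-empty) text."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in vocab]


def burrows_delta(texts, a, b, n_words=3):
    """Mean absolute difference of z-scores between texts[a] and texts[b]."""
    # Stage 1: the "fingerprint" is the n most frequent words in the corpus.
    all_counts = Counter(w for t in texts.values() for w in t.lower().split())
    vocab = [w for w, _ in all_counts.most_common(n_words)]
    freqs = {name: rel_freqs(t, vocab) for name, t in texts.items()}

    # Stage 2: z-score each word across the corpus and compare two texts.
    diffs = []
    for i in range(len(vocab)):
        column = [f[i] for f in freqs.values()]
        sd = pstdev(column)
        if sd == 0:
            continue  # word used at the same rate everywhere: no signal
        mu = mean(column)
        diffs.append(abs((freqs[a][i] - mu) / sd - (freqs[b][i] - mu) / sd))
    return mean(diffs) if diffs else 0.0


# Invented toy "texts"; "x" is a copy of "k" so their Delta should be zero.
texts = {
    "k": "the sea and the sky and the sea",
    "m": "a child and a dog and a cat",
    "x": "the sea and the sky and the sea",
}
same = burrows_delta(texts, "k", "x")
different = burrows_delta(texts, "k", "m")
```

A smaller Delta means the two texts use the chosen words at more similar rates, which is read as stylistic closeness.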
Big Picture Questions
Our discussion also brought up some larger conceptual issues around the question of authorship attribution. How are the results affected by what we choose to use as the master corpus? How much does an author’s style vary across different genres? (Malet’s only children’s book, Little Peter, is mentioned multiple times as disrupting the analysis, perhaps because of the smaller word range in a children’s book.) How is the relatively clean-cut choice between two potential authors different from the problems faced when we have more possible authors?
Significantly, Hoover’s results sometimes disagree with Malet’s markings, but it also is not entirely clear when Malet made those markings and how reliable they are. How confident do we need to be in the machine’s results before we start trusting the machine over the human?
Further Implications & Links
Beth Seltzer is a library postdoc at Penn Libraries, who works with Victorian detective fiction and is a recent Temple University graduate. You can find her on Twitter: @beth_seltzer. The title of her presentation is ‘Drood Resuscitated’!
The presentation is on her work on Charles Dickens’s last novel, The Mystery of Edwin Drood, which was serialized until and just a little after his death because he had written a few more installments before he died. Many questions remain unanswered at the end of the book, including questions about the characters and whether it was even going to be a detective novel in the first place.
Is classical Chinese a natural language? Artificial language?
According to Wikipedia (our most reliable source, “naturally”), a natural language is “unpremeditated” when it is used. When someone is truly fluent in classical Chinese, what happens in their brains? Do they conceptualize poems in spoken language as a mediation layer and then “translate” it into classical Chinese/literary Sinitic, or do they think in that layer directly?
If they do think in it, is that enough to make it a natural language?
How about coding? Is pseudocode natural language (or, shorthand for a natural language), whereas programming languages are artificial? Does anyone think in Python, for example, or is it always mediated by a natural language (or pseudocode, or NL-then-pseudocode) stage beforehand?
What might edge cases teach us, such as sign language (never a spoken language, always “spoken” physically)? What is writing or typing, if not a physical gesture? Sign language looks a lot like a “script” in how it functions, but it’s considered separate from it. There are also different “dialects” and it’s a “script” in the sense that it’s rendering a specific language. How about musical thoughts: are these thoughts the same as “linguistic” thoughts? How about thinking with one’s hands, as not only someone using sign language does but also master craftsmen/women who have “unmediated” gestures? But they are not necessarily communicating that way, which is what language does.
Where is the line between script and artificial language?
Just some unmediated thoughts from our OPEN LAB today while going through Natural Language Processing With Python!
Katie Rawson, WORD LAB co-organizer, presented on her current research and its background on January 27, the first WORD LAB of 2015. Her work focuses on the Southern Foodways Alliance (SFA) oral histories of food culture in the American South, and specifically on using topic modeling to analyze those narratives.
The SFA’s mission is to “document and celebrate food in the South” and also to engage in race reconciliation. It does this largely through collecting oral histories and making them (and transcripts in PDF format) freely available, as well as hosting films. Interestingly, Katie noted that the filmmaker who documents Southern food culture and workers is male, whereas the majority of the oral history interviewers are both white and female. In fact, she emphasized that they have particular aesthetics and particular stories they want to tell, although it’s also a question of what they can do with what is actually collected: they’re framing it, but they can’t control the content.
Katie was interested in analyzing the oral histories, which she went through for her dissertation, to discover themes in them – especially those related to gender roles, work, family, and business – that aren’t immediately apparent to a human reader.
She began her work by downloading all the oral histories and making them into plain text files, with little manual cleanup (they’re mostly easy to OCR and fairly standardized), then used Topic Modeling Tool (a GUI for MALLET) to do her analysis. She made her own list of stop words, which expanded greatly over time in specific ways; when trying to get past what “grouped them together already,” she found herself adding not just personal names and places but all food-related words to the stop list, making it longer and longer. This was an attempt to get at latent discourse that might gather the histories into different groups than a human would “by project” or “by theme,” or might show disparities within projects. (For example, topic modeling without this extensive stop word list just identifies projects like “how people talk about barbeque” – the ways in which SFA itself had already organized the histories.)
While Katie is getting something fascinating at the moment, it’s not the “language of running a business” and the “language of food and family” that she was hoping to uncover, and it’s also still not breaking up SFA’s premade sections. What’s the point of doing topic modeling if it’s just breaking the histories into the groups that they came in, in the first place, Katie wonders. Maddie Wilcox suggested that Katie could look at “what happens when you remove each layer” – first personal names, then places, then food – and also take a look at the exceptions rather than the obvious project divisions. Elias Saba agreed, suggesting going back even before Katie started taking out stop words, and looking at outliers: what are the ones that don’t show up in the groups that we know they are “in”? Elias also wondered what other words could be removed – for example, everything that indicates “why they were in the interview in the first place” (as Katie put it) – and then Katie would have a list of words that people are/are not using, perhaps interesting in itself as an outline of language.
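Maddie’s “remove each layer” suggestion could be prototyped before any topic modeling is run. The sketch below is only one possible reading of it: the layer contents, the sample sentence, and the function name are invented, and each stage’s output would still need to be fed into Topic Modeling Tool (or another tool) separately.

```python
def remove_layers(tokens, layers):
    """Yield (layer_name, remaining_tokens) after removing each stop word
    layer cumulatively, in the order given."""
    removed = set()
    for name, words in layers:
        removed |= set(words)
        yield name, [t for t in tokens if t not in removed]


# Invented example: first personal names, then places, then food words.
layers = [
    ("names", {"mary"}),
    ("places", {"apalachicola"}),
    ("food", {"oysters"}),
]
tokens = "mary sold oysters in apalachicola to pay the bills".split()

stages = list(remove_layers(tokens, layers))
```

Comparing the topics produced at each stage would show which layer is responsible for the histories falling back into their original project groups.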
Katie also explained a specific analysis she did related to gender roles that used topic modeling, in a subset of histories involving oysters. Specifically, she first ran Topic Modeling Tool on the transcripts, then circled every time a word within a topic had to do with a family relationship, striking out some topics or words as she went, and also highlighted words that had to do with business. She looked at both word frequency and the relationship between words in topics, and how they were distributed. Katie went in with expectations but things broke down differently: rather than the discourse that was represented in an SFA film about the industry, the one that emerged from the histories was more about women’s work being empowering and interesting. They are the ones who run the household and finances, and they can make something when men can’t because their income is more reliable. Even though there is still a divide, Katie found that the discourse here had a rich story to tell about how work is negotiated in the space of the oyster industry, and what the divide means isn’t as apparent as it seems.
In the end, Katie still wondered about the efficacy of topic modeling in understanding the SFA’s oral histories, but as Brian Vivier pointed out, in at least one case she had found something new – despite having read almost every single interview already for her dissertation. Katie plans to continue her work, to hone the stop word list, and keep thinking about the applicability of topic modeling and how to make the methodology work for her material – and what that methodology might look like.
As a coda, the WORD LAB group raised the question of oral histories as a genre, and how they could be analyzed computationally as such. In addition, we all advocated making this kind of material – including Katie’s data such as plain text files – available freely so that it can be worked with more broadly. This is especially important in the case of oral history, which is less frequently studied as a genre, and in a less-studied geographic area such as the American South. We cheered on the recent release of slave narratives from DocSouth and hope that Katie’s work can contribute to the computational study of narratives of the South in this environment.
Maddie Wilcox presented about her project “Retranslating Musical Comedy for Shanghai’s Left-wing Film Movement.” She framed her problem – a question of naming a genre – and told us about how she explored and began solving it.
A filmmaker named Yuan Muzhi in the 1930s called for a new kind of film to be made in China, using the term yinyue xiju, which translates directly as “musical comedies.” However, the kinds of film he was advocating for were not what one would classify as musical comedies based on Western definitions – a genre of films that were part of the foreign and homegrown film scene in China at the time. This genre is often translated as “sing-song pictures” (gechang jupian).
To begin exploring this question of genre names and translation, Maddie used the Shanghai Library database and the Media History Digital Library to explore date distribution of mentions of different genre names using full text searching. This ultimately led her to explore the term “operetta” as a better translation for Yuan Muzhi’s idea of musical comedy.
In the Q&A, we discussed different models for exploring the language around these films, including topic modeling and keywords in context. Ultimately, her research questions may be best served by moving between methods, which allow her to surface terms that she does not know about and then understand how and when they are being used.
We ended by exploring how the digital corpora she used were made and OCRed (a prelude to our discussion this past week). We discussed the lack of digitizing or providing metadata for advertisements. We examined exactly what was OCRed in the PDFs from the library, using a few methods.
In a lively series of experiments, using different tools on several machines (and the expertise of our Chinese readers), we uncovered what parts of the text had been OCRed, the quality of the OCR, and its arrangement. It was, not surprisingly, dirty; however, we were surprised at how much of the text appeared to not be OCRed at all. While it seemed that her results were still helpfully suggestive, we discussed what a difference having a fully OCRed text would make for keyword search.
This week we discussed Ryan Cordell’s Viral Texts project and paper “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers” (forthcoming). Ryan visited Penn the previous week and gave a talk and a workshop about this project, so we seized the opportunity to further discuss his work at WORD LAB reading club.
Our discussion focused on how Ryan’s techniques might be applicable to other projects that our members are interested in, after we went over some finer points of the article (including wondering which genres the “fast” and “slow” words appeared in, and how they filtered out long-form advertisements). Brian Vivier wondered if the way they’re generating overlapping n-grams would allow for comparison in classical Chinese texts without whitespace for word divisions: picking a 5-gram, for example, would certainly catch some words. However, we also questioned whether we’d pick up too many word fragments and if this noise would be too much for the analysis.
This technique could possibly be used, we thought, to compare documents in a large corpus for text reuse, just as Ryan did with “viral” newspaper reprints, which would be extremely likely in the case of classical Chinese texts. For example, we could find when precedents are invoked or imperial decisions are cited. Where are the patterns, and would the noise fall out if we are looking for patterns like this?
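To see what overlapping n-grams look like for text without whitespace, we can treat each character as a token. This is a minimal sketch of the idea only, not the Viral Texts pipeline: the function names and the choice to compare raw sets of 5-grams are our own assumptions, and the sample strings are two short fragments sharing a well-known phrase.

```python
def char_ngrams(text, n=5):
    """Overlapping character n-grams, ignoring any whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def shared_ngrams(a, b, n=5):
    """N-grams that occur in both texts: candidate points of reuse."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


doc_a = "學而時習之不亦說乎"
doc_b = "子曰學而時習之"

grams_a = char_ngrams(doc_a)
overlap = shared_ngrams(doc_a, doc_b)
```

Most of the 5-grams cut across word boundaries, which is exactly the noise we worried about; whether it falls out when only shared n-grams across many documents are kept is the open question.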
We also felt that the paper was hesitant about making concrete conclusions and hard statements, and discussed that there is a difference in rhetoric between science and humanities: science is more experimental, more exploratory, and more likely to write about failure (although that’s certainly often not the case); humanities papers, meanwhile, tend to make big claims and only after the author is certain their position is solid. Well, we are making generalizations, but given that it was a room full of mostly humanities people, the tone stuck out and was surprising to most of us.
Finally, we talked about the applicability of Cordell et al.’s ideas and techniques for other languages and time periods; for example, Maddie Wilcox brought up the similarities between antebellum America and Republican China in terms of printing instability, the spread of railroads, and the penetration of networks into rural areas. It would be interesting, too, to look at local republishing across genre, rather than geographically spread-out republishing. And how about networks based on who studied abroad at the same time, who graduated together, and literary societies in Republican China? Endless possibilities.
The practicality of such projects, and the ethics of using available texts, also came up. We talked about improving OCR and the questionable legality of scanning an entire microfilm series of colonial newspapers (gotten via ILL) or on CD as PDFs. It would be great to compare across languages with Japanese, Chinese, and Korean colonial newspapers, for example. But the quality of the OCR is so poor perhaps this is only a pipe dream. It’s hard to argue for “intellectually viable OCR improvement” – if only we could think of a project and a grant!
Aside: We also covered more ground on the Python tutorial dealing with the China Biographical Database API. It was anticlimactic: Molly found a bug in the API that made its JSON data invalid, and so was unable to process it. Still, she explained the way JSON data is accessed and what it looks like in a script – if only it worked!
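For anyone who missed the session, this is roughly what accessing JSON in a Python script looks like, including a check for the kind of invalid data we ran into. It is a sketch under assumptions: the payload and its field names are invented placeholders, not the database's real response format, and no API call is made here.

```python
import json


def parse_response(raw):
    """Parse a JSON string, returning None (with a message) if it is invalid."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        print(f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None


# Hypothetical payloads standing in for an API response.
good = '{"person": {"name": "Example Person", "aliases": ["A", "B"]}}'
broken = '{"person": {"name": "Example Person", '

data = parse_response(good)
name = data["person"]["name"]             # nested dictionaries use keys
first_alias = data["person"]["aliases"][0]  # lists use integer positions
bad = parse_response(broken)
```

Once parsed, JSON is just nested Python dictionaries and lists, so the hard part is usually knowing the structure of the response rather than the syntax for reading it.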