Today we were joined (virtually) by Dr. Paul Vierthaler of Leiden University, to talk about his work text mining Jin Ping Mei (a 17th century Chinese novel), in a presentation entitled “Intertextuality, Classification, and Late Imperial Chinese Literature.” Specifically, we looked at intertextuality via text reuse, and experimental attempts at authorship attribution along with text classification. He also taught us very clearly about PCA, a perennial challenge for WORD LAB participants! Read on for a detailed overview of our conversation.Continue reading Feb 21, 2019 WORD LAB: Paul Vierthaler
This week we discussed Ryan Cordell’s Viral Texts project and paper “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers” (forthcoming). Ryan visited Penn the previous week and gave a talk and a workshop about this project, so we seized the opportunity to further discuss his work at WORD LAB reading club.
Our discussion focused on how Ryan’s techniques might be applicable to other projects that our members are interested in, after we went over some finer points of the article (including wondering which genres the “fast” and “slow” words appeared in, and how they filtered out long-form advertisements). Brian Vivier wondered if the way they’re generating overlapping n-grams would allow for comparison in classical Chinese texts without whitespace for word divisions: picking a 5-gram, for example, would certainly catch some words. However, we also questioned whether we’d pick up too many word fragments and if this noise would be too much for the analysis.
This technique could possibly be used, we thought, to compare documents in a large corpus for text reuse, just as Ryan did with “viral” newspaper reprints, which would be extremely likely in the case of classical Chinese texts. For example, we could find when precedents are invoked or imperial decisions are cited. Where are the patterns, and would the noise fall out if we are looking for patterns like this?
We also felt that the paper was hesitant about making concrete conclusions and hard statements, and discussed that there is a difference in rhetoric between science and humanities: science is more experimental, more exploratory, and more likely to write about failure (although that’s certainly often not the case); humanities papers, meanwhile, tend to make big claims and only after the author is certain their position is solid. Well, we are making generalizations, but given that it was a room full of mostly humanities people, the tone stuck out and was surprising to most of us.
Finally, we talked about the applicability of Cordell et al’s ideas and techniques for other languages and time periods; for example, Maddie Wilcox brought up the similarities between antebellum America and Republican China in terms of printing instability, the spread of railroads, and the penetration of networks into rural areas. It would be interesting, too, to look at local republishing across genre, rather than geographically spread-out republishing. Finally, how about networks based on who studied abroad at the same time, who graduated together, and literary societies in Republican China? Endless possibilities.
The practicality of such projects, and the ethics of using available texts, also came up. We talked about improving OCR and the questionable legality of scanning an entire microfilm series of colonial newspapers (gotten via ILL) or on CD as PDFs. It would be great to compare across languages with Japanese, Chinese, and Korean colonial newspapers, for example. But the quality of the OCR is so poor perhaps this is only a pipe dream. It’s hard to argue for “intellectually viable OCR improvement” – if only we could think of a project and a grant!
Aside: We also covered more ground on the Python tutorial dealing with the Chinese Biographical Database API. It was anticlimactic: Molly found a bug in that made its JSON data invalid and so was unable to process it. Still, she explained the way JSON data is accessed and what it looks like in a script – if only it worked!