Feb 21, 2019 WORD LAB: Paul Vierthaler
Today we were joined (virtually) by Dr. Paul Vierthaler of Leiden University to talk about his work text mining Jin Ping Mei (a 17th-century Chinese novel), in a presentation entitled “Intertextuality, Classification, and Late Imperial Chinese Literature.” Specifically, we looked at intertextuality via text reuse, and experimental attempts at authorship attribution along with text classification. He also taught us very clearly about PCA, a perennial challenge for WORD LAB participants! Read on for a detailed overview of our conversation.
Jan 31, 2019 WORD LAB: Jeffrey Tharsen
Today we were joined by Dr. Jeffrey Tharsen of the University of Chicago to learn about his research and projects related to premodern Chinese digital philology and phonology.
March 1, 2016 WORD LAB – Mark Ravina
On April 1, we were joined via Skype by Mark Ravina, professor of history at Emory University. Mark researches Japanese political history in the latter half of the 19th century. He told us about his recent text analysis work on 19th-century Japanese history, and about a project with a student on the recent student protests involving race at various universities around the United States. We’ll be following up with Mark on his research in the Fall 2016 WORD LAB lineup.
Is classical Chinese a natural language? Artificial language?
According to Wikipedia (our most reliable source, “naturally”), a natural language is “unpremeditated” when it is used. When someone is truly fluent in classical Chinese, what happens in their brain? Do they conceptualize poems in spoken language as a mediating layer and then “translate” them into classical Chinese/literary Sinitic, or do they think in classical Chinese directly?
If they do think in it, is that enough to make it a natural language?
How about coding? Is pseudocode natural language (or, shorthand for a natural language), whereas programming languages are artificial? Does anyone think in Python, for example, or is it always mediated by a natural language (or pseudocode, or NL-then-pseudocode) stage beforehand?
What might edge cases teach us? Take sign language: never a spoken language, yet always “spoken” physically. Then again, what is writing or typing, if not a physical gesture? Sign language looks a lot like a “script” in how it functions: it has distinct “dialects,” and it renders a specific language. Yet it is considered separate from script. How about musical thoughts: are these the same as “linguistic” thoughts? And what of thinking with one’s hands, not only as sign language users do, but as master craftsmen and craftswomen with “unmediated” gestures do? The difference is that craftspeople are not necessarily communicating that way, and communication is what language does.
Where is the line between script and artificial language?
Just some unmediated thoughts from our OPEN LAB today while going through Natural Language Processing With Python!
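For anyone following along with the book, the flavor of exercise we were working through looks roughly like this. This is a minimal stand-in using only the Python standard library rather than NLTK itself, and the sample sentence is invented:

```python
from collections import Counter

# A tiny stand-in (standard library only, not NLTK itself) for the kind
# of exercise in Natural Language Processing with Python: split a text
# into tokens and count word frequencies.
text = "language is language whether spoken signed or written"
tokens = text.split()              # naive whitespace tokenization
freq = Counter(tokens)             # word -> count
most_common = freq.most_common(1)  # [('language', 2)]
```

NLTK’s own `FreqDist` behaves much like `Counter` here; the difference is that NLTK also ships real tokenizers, which matter once punctuation and contractions enter the picture.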
This week we discussed Ryan Cordell’s Viral Texts project and paper “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers” (forthcoming). Ryan visited Penn the previous week and gave a talk and a workshop about this project, so we seized the opportunity to further discuss his work at WORD LAB reading club.
Our discussion focused on how Ryan’s techniques might be applicable to other projects that our members are interested in, after we went over some finer points of the article (including wondering which genres the “fast” and “slow” words appeared in, and how they filtered out long-form advertisements). Brian Vivier wondered if the way they’re generating overlapping n-grams would allow for comparison in classical Chinese texts without whitespace for word divisions: picking a 5-gram, for example, would certainly catch some words. However, we also questioned whether we’d pick up too many word fragments and if this noise would be too much for the analysis.
This technique, we thought, could be used to compare documents in a large corpus for text reuse, just as Ryan did with “viral” newspaper reprints; such reuse is extremely common in classical Chinese texts. For example, we could find when precedents are invoked or imperial decisions are cited. Where are the patterns, and would the noise fall out if we looked for patterns like this?
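Brian’s overlapping-n-gram idea for unsegmented classical Chinese can be sketched in a few lines. This is our own minimal illustration, not the Viral Texts pipeline, and the two strings are invented stand-ins for classical Chinese passages:

```python
def char_ngrams(text, n=5):
    """Return the set of overlapping character n-grams in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def shared_ngrams(a, b, n=5):
    """N-grams appearing in both texts: candidate reused passages."""
    return char_ngrams(a, n) & char_ngrams(b, n)

# Two invented strings standing in for unsegmented classical Chinese;
# any shared 5-character run surfaces as a reuse candidate.
text_a = "天下之事分久必合合久必分"
text_b = "所謂天下之事分久必合云云"
overlap = shared_ngrams(text_a, text_b)
```

Note how this also demonstrates our worry: a single reused 8-character phrase produces four overlapping 5-grams, some of which straddle word boundaries, so a real analysis would need to decide how much of that is signal and how much is fragment noise.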
We also felt that the paper was hesitant to make concrete conclusions and hard statements, and discussed the difference in rhetoric between the sciences and the humanities: scientific writing is more experimental and exploratory, and more willing to report failure (though certainly not always); humanities papers, meanwhile, tend to make big claims, and only once the author is certain their position is solid. These are generalizations, of course, but in a room full of mostly humanities people, the paper’s tone stuck out and surprised most of us.
Finally, we talked about the applicability of Cordell et al.’s ideas and techniques for other languages and time periods; for example, Maddie Wilcox brought up the similarities between antebellum America and Republican China in terms of printing instability, the spread of railroads, and the penetration of networks into rural areas. It would be interesting, too, to look at local republishing across genres, rather than geographically spread-out republishing. And how about networks based on who studied abroad at the same time, who graduated together, and literary societies in Republican China? Endless possibilities.
The practicality of such projects, and the ethics of using available texts, also came up. We talked about improving OCR and the questionable legality of scanning an entire microfilm series of colonial newspapers (obtained via ILL), or of those distributed on CD as PDFs. It would be great to compare across languages with Japanese, Chinese, and Korean colonial newspapers, for example, but the quality of the OCR is so poor that perhaps this is only a pipe dream. It’s hard to argue for “intellectually viable OCR improvement” – if only we could think of a project and a grant!
Aside: We also covered more ground on the Python tutorial dealing with the Chinese Biographical Database API. It was anticlimactic: Molly found a bug that made the API’s JSON data invalid, and so she was unable to process it. Still, she explained how JSON data is accessed and what it looks like in a script – if only it had worked!
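The pattern Molly walked through looks roughly like this. Since the actual API response was broken, this uses an invented payload (the field names and values here are not the real CBDB schema), purely to show how parsed JSON is navigated in a script:

```python
import json

# Invented sample payload standing in for an API response; the real
# CBDB JSON has a different (and, that day, invalid) structure.
raw = '{"person": {"name": "Wang Anshi", "born": 1021, "offices": ["chief councilor"]}}'

data = json.loads(raw)                        # parse JSON text into Python objects
name = data["person"]["name"]                 # JSON objects become nested dicts
first_office = data["person"]["offices"][0]   # JSON arrays become lists
```

If the server had returned invalid JSON, `json.loads` would have raised a `json.JSONDecodeError` at the first line of parsing, which is essentially where our tutorial stalled.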