Today we were joined (virtually) by Dr. Paul Vierthaler of Leiden University, to talk about his work text mining Jin Ping Mei (a 17th century Chinese novel), in a presentation entitled “Intertextuality, Classification, and Late Imperial Chinese Literature.” Specifically, we looked at intertextuality via text reuse, and experimental attempts at authorship attribution along with text classification. He also taught us very clearly about PCA, a perennial challenge for WORD LAB participants! Read on for a detailed overview of our conversation.Continue reading Feb 21, 2019 WORD LAB: Paul Vierthaler
For this week’s WordLab discussion, we read David L. Hoover’s 2012 article, “The Tutor’s Story: A Case Study of Mixed Authorship.” (From English Studies 93:3, 324-339, 2012). In this article, Hoover looks at The Tutor’s Story, a novel by Victorian author Charles Kingsley, finished by his daughter Mary St Leger Kingsley Harrison, writing under the name Lucas Malet. Hoover uses this text as a test case to compare a range of authorship attribution methodologies, including Burrows’s Delta, Craig’s version of Burrows’s Zeta, and t-tests.
Hoover compares his results to an annotated version of the published text, discovered partway through his research, containing Malet’s own markings about which parts of the text are hers and which are Kingsley’s.
We spent most of our time today piecing out the specifics of the different methods. There seem to be two stages to the process–1) selecting the words which will compose the “fingerprint” of the author’s style, and 2) analyzing the statistical similarity of these words. More of Hoover’s explanation covers variations on the first part. For example, Delta uses the most frequent words in each text, while Zeta is based “not on the frequencies of words, but rather on how consistently the words appear” (329). Our consensus was that we were interested in reading more about this, and we may move to some of the original articles on the Delta method in future weeks. The R-stylo package also apparently has commands for Delta and Zeta which we could explore.
Big Picture Questions
Our discussion also brought up some larger conceptual issues around the question of authorship attribution. How are the results affected based on what we choose to use as the master corpus? How much does an author’s style vary based on different genres? (Malet’s only children’s book, Little Peter, is mentioned multiple times as disrupting the analysis, perhaps because of the smaller word range in a children’s book.) How is the relatively clean-up choice between two potential authors this different from the problems faced when we have more possible authors?
Significantly, Hoover’s results sometimes disagree with Malet’s markings, but it also is not entirely clear when Malet made those markings and how reliable they are. How confident do we need to be in the machine’s results before we start trusting the machine over the human?
Further Implications & Links
Our OPEN LAB time on October 29 was split into two parts: discussion of an article on authorship attribution, and step one of reading a CSV file, calling a web API, and then rewriting the file in Python. You can find the article here:
Verifying the authorship of Saikaku Ihara’s work in early modern Japanese literature; a quantitative approach Literary & Linguistic Computing, first published online September 29, 2014 doi:10.1093/llc/fqu049 (9 pages)
Our discussion centered on first the article’s assumptions and methodology, and then authorship attribution and its role in general. One point of contention is that the article attempted to differentiate one epistolary work from the rest of an assumed body of Ihara Saikaku’s works, but did not take into account the major stylistic differences between epistolary writing and the writing of typical fiction during the Edo period (1600-1868) in Japan. Thus, the stylistic differences that led the authors to suspect that Saikaku did not write this particular work could also simply just be due to the difference in genre. We thought that more work on genre could be a productive and interesting direction for this kind of research.
The Python tutorial covered reading in a CSV file using the unicodecsv library, how to import libraries in general, and how to access items in a list. It also demonstrated how to construct a URL and call the Chinese Biographical Database web API through urlopen(). Stay tuned for reading data from the API at the next OPEN LAB.