Today we were joined (virtually) by Dr. Paul Vierthaler of Leiden University, to talk about his work text mining Jin Ping Mei (a 17th-century Chinese novel), in a presentation entitled “Intertextuality, Classification, and Late Imperial Chinese Literature.” Specifically, we looked at intertextuality via text reuse, and experimental attempts at authorship attribution along with text classification. He also taught us very clearly about PCA, a perennial challenge for WORD LAB participants! Read on for a detailed overview of our conversation.
Paul set out two main goals at the start: computationally identifying intertextuality (as in the Viral Texts project), and machine-classifying text, with the ultimate goals of understanding history in “untrustworthy” pseudo-historical texts, and trying to identify the author of Jin Ping Mei. How is this linked to intertextuality, then?
Corpus-level Automated Intertextuality Detection
Paul explained first how quasi-histories can show how information spread or was shared over bodies of texts: they reported on recent events of their time, and because they reused a good deal of text they are useful for understanding a given period’s “cultural imagination.” The case of Wei Zhongxian’s (1568-1627) death is interesting: before 1627 there are few sources about him, but novels and pseudo-histories (yeshi) appeared right after his death, with many more following and dense intertextuality among them. So Paul was interested in looking at this case at scale, and turned to Donald Sturgeon’s Chinese Text Project, the Kanseki Repository, and the now-defunct Daizhige to search for Wei’s name across large corpora. He was mentioned in at least 118 texts from those corpora between 1600 and 1800, too much for a human to read, so Paul turned to “corpus-level automated intertextuality detection” to investigate “how information traverses a system like this”: what stories get told in the same way, and what is being copied?
What kind of system could realistically take in thousands of Chinese texts and show where they copied each other? Paul ended up adapting BLAST, an algorithm developed by bioinformatics researchers for DNA sequence alignment, generalizing it to an arbitrarily large vocabulary. (It took a lot of work, and his code will be available in a Cultural Analytics article.) First, he defined “meaningful reuse” (how many characters in common, and how similar do the strings have to be?), settling on 10-character strings with 80% similarity. Then, after identifying every instance of text reuse that meets those criteria, Paul visualized it as a network where nodes are documents and edges are weighted by the amount of shared text.
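The matching criterion itself (10-character strings at 80% similarity) can be illustrated with a short sketch. This is not Paul’s BLAST adaptation, which uses a far faster seed-and-extend strategy; it is a naive quadratic scan, using only the standard library, that shows what the threshold means. The function name and inputs are hypothetical.

```python
from difflib import SequenceMatcher

def find_reuse(doc_a, doc_b, seed_len=10, threshold=0.8):
    """Naive illustration of the reuse criterion: slide a 10-character
    window over doc_a and flag every window of doc_b that matches it
    at >= 80% similarity. (A real system, like Paul's BLAST adaptation,
    avoids this all-pairs comparison; this only shows the matching rule.)"""
    matches = []
    for i in range(len(doc_a) - seed_len + 1):
        seed = doc_a[i:i + seed_len]
        for j in range(len(doc_b) - seed_len + 1):
            window = doc_b[j:j + seed_len]
            ratio = SequenceMatcher(None, seed, window).ratio()
            if ratio >= threshold:
                matches.append((i, j, ratio))
    return matches
```

Summing the matched spans for each pair of documents would give the edge weights for the network described above.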
The network showed which texts had the most shared text with each other, such as an unofficial history quoting from an official one. But who is quoting whom? It is impossible to tell from this method alone, so Paul also took into account bibliographic information when it was available. The next challenge, then, was to try to computationally identify the source!
Text Classification and Stylometry
Paul used machine learning for text classification to try to identify quote sources, assuming that “a quote will look more like its source than the text that copies it.” Here, he used stylometry not just for authorship attribution, but to ask “what document was a sentence written in?”
In order to investigate this, Paul tokenized his texts by unigram (because it is currently too difficult to tokenize by “word” in Classical Chinese), then took a frequency measurement for every unigram. Each document was then represented by a list of frequencies, a vector space model. In this way Paul moved away from the typical literary sense of “style” to a quantitative measurement of variation as his working definition of style. This lets us treat documents as points in a space of, in practice, hundreds or thousands of dimensions. Paul created an interpretive visualization using PCA to reduce the dimensions and then see differences among texts. While you lose some information this way, you gain abstract axes on which you can plot the documents.
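The vector space model step can be sketched in a few lines. This is a minimal stdlib-only illustration of the representation described above (one character per token, relative frequencies over a shared vocabulary), not Paul’s code; the function name is hypothetical.

```python
from collections import Counter

def unigram_vectors(docs, vocab=None):
    """Represent each document as relative unigram (single-character)
    frequencies over a shared vocabulary. Since Classical Chinese is
    hard to segment into words, each character is its own token."""
    counts = [Counter(doc) for doc in docs]
    if vocab is None:
        vocab = sorted({c for ctr in counts for c in ctr})
    vectors = []
    for ctr in counts:
        total = sum(ctr.values()) or 1
        vectors.append([ctr[c] / total for c in vocab])
    return vocab, vectors
```

These vectors are what a dimensionality-reduction step like PCA would then project down to two or three interpretable axes.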
Paul plotted unigrams to see how they are influencing the documents, “reading” the characters to understand what the axes mean. (For example, a gradient from Classical to vernacular, or time/place words to abstract concepts.) From this, he built his text classifier to predict what book a quote comes from, or who wrote that book.
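“Reading” an axis means inspecting how strongly each unigram loads on it. As a minimal stand-in for full PCA (which in practice would come from a library like scikit-learn), the leading principal axis can be found by power iteration on mean-centered vectors, using only the standard library; each entry of the resulting unit vector is one unigram’s loading. This is an illustrative sketch, not Paul’s implementation.

```python
import math

def first_component(vectors, iters=200):
    """Power iteration for the leading principal component of
    mean-centered document vectors. The returned unit vector holds
    one loading per dimension (i.e., per unigram), which is what
    gets 'read' to interpret a PCA axis."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    X = [[v[j] - means[j] for j in range(d)] for v in vectors]
    w = [1.0 / (j + 1) for j in range(d)]  # deterministic starting vector
    for _ in range(iters):
        # repeatedly multiply w by X^T X (the covariance matrix up to scale),
        # then renormalize; w converges to the top eigenvector
        s = [sum(X[i][j] * w[j] for j in range(d)) for i in range(n)]
        w = [sum(X[i][k] * s[i] for i in range(n)) for k in range(d)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        w = [x / norm for x in w]
    return w
```

Sorting the vocabulary by loading on an axis is what surfaces gradients like Classical-to-vernacular at the two ends.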
In order to choose the machine learning algorithm for the classifier, Paul used KETTLE, a genetic algorithm in Python that runs many ML models and optimizes them, and ended up choosing gradient descent. He used this to classify first, then used PCA to refine the model further. For sections at least 50 characters long, he is getting 98% accuracy, and a neural network might do even better! For a harder problem, Paul experimented with comparing the Water Margin and Jin Ping Mei, which share roughly their first five chapters (JPM copied WM). He started by working with “known problems”: removing the shared sections, training and comparing, then doing PCA. After this, he got 90% accuracy overall but only 70% when predicting a quote’s origin. What might be throwing it off? The classifier finds the most similar document in the corpus, of course, but that is not necessarily the quote’s actual origin. (However, Paul is interested in in-corpus similarity, so for his purposes that’s not a problem.)
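The core assumption (“a quote will look more like its source than the text that copies it”) can be demonstrated with a much simpler classifier than Paul’s: assign a quote to whichever document’s unigram profile it most resembles by cosine similarity. This stdlib-only sketch is a stand-in, not his trained model, and it has exactly the limitation noted above: the most similar document need not be the true origin.

```python
from collections import Counter

def profile(text):
    """Relative unigram frequencies of a text."""
    total = len(text) or 1
    return {c: n / total for c, n in Counter(text).items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[k] * q.get(k, 0.0) for k in p)
    norm = (sum(v * v for v in p.values()) ** 0.5
            * sum(v * v for v in q.values()) ** 0.5)
    return dot / norm if norm else 0.0

def predict_source(quote, corpus):
    """Assign a quote to the document in `corpus` (a dict of
    title -> text) whose unigram profile it most resembles."""
    q = profile(quote)
    return max(corpus, key=lambda title: cosine(q, profile(corpus[title])))
```

A real pipeline would of course train on held-out sections and evaluate accuracy, as Paul did with the 50-character sections.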
So, these are methods for tracing information through a corpus – why are they useful for Jin Ping Mei authorship attribution? Paul wondered if there was a stylistic “signal” once shared text is set aside; the reused text comes from many different genres with no apparent logic in the PCA graph – “only a few contribute to the creation of the novel” and much text reuse is “background noise” such as aphorisms. (Paul looked at commonly reused quotes that tie the corpus together as a “textual environment.”)
For Jin Ping Mei, Paul looked at other Ming writing as “source material” to see what percent of it appeared in the novel. He was trying to figure out “what’s quoting what” to see if he could use this information to infer an author from that person’s other writing.
Distilling Jin Ping Mei
In order to create a more refined authorship classifier, Paul worked on “distilling” documents so they contained no “source material” (i.e., reused text). In this case, Jin Ping Mei becomes “more like your average novel” without the source material, and this relates back to the claim that “intertextuality makes the novel.” In the end, Paul believes his “distillation” of the novel is an “algorithmic recreation of contemporary reading practices.” He also compared different editions of the novel to see what editors removed, changed, and inserted, trying to get at what editors were pulling out to “fix” the novel in their editions. Ultimately, this is a “reapproach [to] Chinese print history and literary development with a new set of glasses.”
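A simplified version of this “distilling” step can be sketched as follows: mark every character of the novel that sits inside a span shared verbatim with any source text, then keep only the unmarked characters. Paul’s actual criterion also allows approximate matches (80% similarity), which this stdlib-only sketch omits; the function name is hypothetical.

```python
def distill(novel, sources, seed_len=10):
    """Return the novel with reused text removed: any character inside
    a seed_len-character span that also appears verbatim in one of the
    source texts is dropped. (A simplification of 'distilling', which
    in Paul's work also tolerates near-matches.)"""
    reused = [False] * len(novel)
    for src in sources:
        # all exact seed_len-grams in this source
        grams = {src[i:i + seed_len] for i in range(len(src) - seed_len + 1)}
        for i in range(len(novel) - seed_len + 1):
            if novel[i:i + seed_len] in grams:
                for j in range(i, i + seed_len):
                    reused[j] = True
    return "".join(c for c, r in zip(novel, reused) if not r)
```

Running the distilled text back through the vector space model is what lets one ask whether the novel, minus its source material, looks “more like your average novel.”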
Going forward, Paul plans to rethink how he understands the “remix culture” that these Ming books existed in, and understand the system of his corpus as a whole. After the presentation we had a lively discussion with questions about semantic vs. syntactic intertextuality (the former being more computationally intensive, and Paul referenced Donald Sturgeon’s work on allusion), the reason Paul is going with text reuse instead of allusion (what the algorithm misses as “shared” is itself stylometrically significant), why this particular corpus and text, choosing a ML algorithm, and identifying authors across genres (as well as whether author or genre has a stronger signal – it’s the latter).
One important point that came out of the discussion related back to our conversations about corpus-building and digitization last year. Paul is frustrated at not being able to look at multiple editions at scale because digital editions are often “flattened” and “create an ideal text that never existed.” It’s something to keep thinking about as we build our data sets: how much does what we are working with differ from what readers (and editors) saw when they encountered these texts?
Many thanks to Paul for a fascinating conversation, and inspiringly clear and educational presentation!