Today we were joined by Dr. Jeffrey Tharsen of the University of Chicago to learn about his research and projects related to premodern Chinese digital philology and phonology.
Jeff began by asking: What does “digital philology” mean, in terms of what he works on? “The machines are stupid,” he reminds us, but we’re getting “kind of” good at teaching them, thus laying a groundwork for doing digital philology. Using NaNoGenMo as an example, Jeff talked about training machines for a kind of understanding, then for generating text based on that. (Per Jeff, the romance novels written by neural networks “can be mindblowingly incisive in their juxtapositions”!)
Jeff is currently working on several projects in Chinese digital philology. One is the Textual Optics Lab (TOL) at Chicago, which provides an interface for analysis of potentially millions of texts. He gave us a demo of this system using the word for “emperor,” and showed how exploring texts through this kind of interface can help us build philological arguments. Obviously, it does not substitute for those arguments, but rather provides us with a powerful tool for assessing texts and amassing evidence. The TOL provides the following functions: concordance, keyword-in-context (KWIC), collocations, and time series. It can be used not just with Chinese but with any language; Chicago’s TOL, for example, also has English and Japanese corpora pre-loaded. (In fact, we recently spoke with Hoyt Long, who has been working on augmenting the Aozora Bunko Japanese corpus for use in this interface.)
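For readers who have not used a keyword-in-context view before, here is a minimal Python sketch of the idea. It is not how the TOL implements KWIC; the function name `kwic`, the `width` parameter, and the sample string (a made-up sentence, not a quotation from any corpus) are all assumptions for illustration.

```python
def kwic(text, keyword, width=5):
    """Return (left, keyword, right) context triples for each hit.

    Hypothetical helper for illustration; context is measured in characters,
    which suits unsegmented Chinese text.
    """
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]   # up to `width` chars before the hit
        right = text[end:end + width]              # up to `width` chars after the hit
        hits.append((left, keyword, right))
        start = text.find(keyword, start + 1)      # continue searching after this hit
    return hits


# Invented sample sentence, used only to exercise the function.
sample = "秦王政立號為皇帝。皇帝曰：朕為始皇帝。"
hits = kwic(sample, "皇帝", width=3)
for left, kw, right in hits:
    print(f"{left:>3} [{kw}] {right}")
```

Lining up every occurrence with its neighbors like this is what lets a reader scan usage patterns quickly; a real system would run the same kind of lookup against an indexed corpus rather than a single string.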
Jeff mentioned Donald Sturgeon’s Chinese Text Project as another example of this idea: an interface for an aggregated body of sources. The CTP also provides APIs for connecting other tools with it, such as the MARKUS reading and text analysis platform.
Jeff has also been aggregating information in order to reconstruct the historical sounds of characters and to look at things like rhyme in early Chinese poetry. He created the Digital EDOC etymological dictionary to bring together many sources (via a database on his site, not APIs) for analyzing early phonology, and gave us a demo of this dictionary as well.
Some of the questions he fielded were about working with Classical Chinese for NLP. Is it harder than other languages, and if so, what are its particular attributes? Jeff answered that in some respects Chinese is not as hard, because you can work with unigrams (single characters) and there is no conjugation, case, etc. However, tokenization is really hard, especially with two-, three-, and four-character combination words, proper names, and so on. Where are the borders of the “words”? Drawing them is extremely hard in Classical Chinese and, per Jeff, no tokenizer exists for it as of now. Thus, this language is both easier and more difficult to work with, in different ways.
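To make the contrast concrete, the sketch below counts character n-grams in Python. It is an illustration only, not Jeff’s method: the function name `char_ngrams` and the small punctuation set are assumptions, and the sample is the opening of the Analects with modern punctuation added.

```python
import re
from collections import Counter

# Assumed, deliberately small set of punctuation marks to split on.
PUNCT = "，。；：？！、"


def char_ngrams(text, n):
    """Count character n-grams without letting them span punctuation."""
    counts = Counter()
    for segment in re.split(f"[{re.escape(PUNCT)}]", text):
        for i in range(len(segment) - n + 1):
            counts[segment[i:i + n]] += 1
    return counts


sample = "學而時習之，不亦說乎。有朋自遠方來，不亦樂乎。"
unigrams = char_ngrams(sample, 1)   # trivial: every character is a unit
bigrams = char_ngrams(sample, 2)    # only candidates, not confirmed words
print(bigrams.most_common(3))
```

Unigram counts fall out directly because each character is its own unit. The bigram counts, by contrast, only surface recurring candidates such as 不亦; deciding which of them are actually “words” is the tokenization problem described above, and frequency alone does not settle it.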
A WORD LAB participant also asked about the future of Digital EDOC. Per Jeff, Donald Sturgeon has been asking for an API to connect their projects. But currently, Jeff’s site operates entirely on its own database, rather than being a composition of live queries from other sites’ APIs. It does, however, link to other sites for more information about different aspects of characters in the search results.
Finally, Jeff is also interested in poetry’s meter vs. number of syllables, and refers us to Poemage at Utah for an English example of this kind of analysis.