On April 1, we were joined via Skype by Mark Ravina, professor of history at Emory University. Mark researches Japanese political history in the latter half of the 19th century. He told us about his recent text analysis work on 19th-century Japanese history, as well as a project with a student on the recent student protests involving race at various universities around the United States. We’ll be following up with Mark on his research in the Fall 2016 WORD LAB lineup.
Basics and Context
He’s been using the same tools for several different projects, and moving between close and distant reading – although he wants to go beyond a kind of simple computational linguistics. In 1870s Japanese political documents, old words take on new meanings and we find a number of neologisms. For example, classical Chinese terms change meaning over time and find new usages. We find a kind of hybrid language in these documents, with different discourses in competition. Mark is also interested in how language maps power in these documents.
The sources that Mark works with (petitions to the Emperor in the 1870s) have been neglected. He’s had some student employees scan the documents from a reprint series (typeset in modern type; important because 19th-century documents cannot really be OCRed), and put a priority on getting metadata into the database as well.
Mark has been using FileMaker Pro in conjunction with R to analyze the documents. He’s been looking at semantically-rich terms, where they appear, and what other words they tend to appear with. For example, the word “jiyū 自由”, which now means “freedom,” at this time had the connotation of “wanton” or “reckless”/”selfish.” It’s important to take a look at how the words were actually being used at the time, rather than going with current (and anachronistic) definitions, and Mark’s analysis can give us a way to find out the historical definitions and connotations of these key words.
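Mark’s analysis runs in R; as a purely illustrative sketch of the idea of tracking which words a term like “jiyū” tends to appear with, here is a minimal co-occurrence counter in Python (the documents, tokens, and window size are all invented for the example, and real petitions would first need Japanese tokenization):

```python
from collections import Counter

def cooccurrences(documents, target, window=5):
    """Count tokens appearing within `window` tokens of `target`.
    Toy sketch: assumes pre-tokenized documents (lists of tokens)."""
    counts = Counter()
    for tokens in documents:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

# Hypothetical mini-corpus, already tokenized
docs = [
    ["jiyū", "wanton", "conduct"],
    ["reckless", "jiyū", "behaviour"],
]
print(cooccurrences(docs, "jiyū"))
```

Ranking the resulting counts across the corpus is one simple way to surface a word’s period-specific connotations from its actual company, rather than from a modern dictionary definition.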
Concretely, Mark has had a work study student scan the documents to PDFs, then uses Mac Automator to clean up page numbers and headers. They then run the documents through the Japanese OCR software eTypist. Finally, Mark puts the text of the OCRed PDFs, along with metadata for each document, into a custom FileMaker Pro database. One challenge has been capturing semantically rich formatting like carriage returns: line breaks were inserted so that references to the Emperor would appear at the top of a new line as a sign of respect, but R reads each carriage return as the start of a new element in a list. Mark ended up preserving one version of the text with the carriage returns and one version with all of the text run together.
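The two-versions approach can be sketched in a few lines; this is an illustrative Python analogue (the snippet of text is invented, and Mark’s actual pipeline is in R, where `readLines()` similarly yields one element per line):

```python
# Keep two renderings of each OCRed petition: one preserving the
# honorific line breaks (a break before a reference to the Emperor
# shows respect), and one joined for word-level analysis.
raw = "臣ら謹んで\n天皇陛下に奏す"  # hypothetical snippet; "\n" precedes the Emperor reference

with_breaks = raw.split("\n")    # analogous to R's readLines(): one element per line
joined = raw.replace("\n", "")   # single string for tokenizing / counting

print(with_breaks)
print(joined)
```

Keeping both forms means the honorific layout stays recoverable while the joined text can still be fed to tools that would otherwise misread each break as a document boundary.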
These documents have been held up as the beginning of democracy in Japan. One finding from the research so far is that there was a sense that “you can tell the government what you think about anything” in the explosion of petitions in the 1870s — for example, the petitions include things such as “women shouldn’t blacken their teeth” in addition to what we’d consider more political or meaningful topics from our perspective now. The petitions run up to about 1881-1882, when the government declared that it would “give” the people a parliament — which would presumably give the people an alternative outlet to communicate with the government and gain representation.
The metadata for each document includes geography (not the neatest data — the petition may be from one place but talk about another), as well as author information. One challenge has been that many authors use pseudonyms, and Mark has set up a separate table in FMP that captures the alternative names, then joins it with the main tables. Other metadata includes whether the author was a member of an association; year (converted to Western years in R); the archive the petition came from; and prefecture.
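The pseudonym lookup Mark describes — a separate table of alternative names joined back to the main tables in FileMaker Pro — can be illustrated with a toy Python sketch (all names and fields here are invented for the example):

```python
# Hypothetical petition records and an alias table mapping pen names
# to a canonical author name, mimicking the FMP join Mark set up.
petitions = [
    {"id": 1, "author": "Ōi Kentarō"},
    {"id": 2, "author": "Bakumatsu Rōnin"},   # invented pseudonym
]
aliases = {"Bakumatsu Rōnin": "Ōi Kentarō"}   # alias -> canonical name

# The "join": resolve each author through the alias table,
# falling back to the name as written when no alias is known.
for p in petitions:
    p["canonical_author"] = aliases.get(p["author"], p["author"])

print([p["canonical_author"] for p in petitions])  # -> ['Ōi Kentarō', 'Ōi Kentarō']
```

Resolving names to a canonical form this way lets petitions by the same person be grouped together even when they were signed differently.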
OCR and Machine Learning
There are some major OCR problems that Mark encountered even with modern typeset documents; specifically he notes the notation 々 for signaling repeated characters, and the confusion between the character for “2” (二) and the katakana syllable “ni” (ニ). He used machine learning (the RWeka package) to train a classifier to recognize the difference between the two. He gave RWeka 50 sentences that he had proofread, and was surprised to find that it gave over 90% accuracy training on just this data. The characters before and after the syllable/character are enough to distinguish between the number and the particle. So, context was enough to train the software to make the distinction well, even with a small training set. Of course, this worked much better on documents that had been scanned and OCRed accurately to begin with.
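The core idea — that the immediately surrounding characters are enough to tell the numeral from the particle — can be shown with a toy Python analogue of the classifier (this is not Mark’s RWeka setup, and the tiny labeled “training” contexts are invented):

```python
from collections import Counter, defaultdict

# Invented training examples: (char before, char after) -> correct reading,
# standing in for Mark's 50 proofread sentences.
train = [
    (("十", "月"), "二"),   # as in 十二月 -> numeral
    (("第", "条"), "二"),   # as in 第二条 -> numeral
    (("京", "於"), "ニ"),   # as in 東京ニ於テ -> particle
    (("所", "付"), "ニ"),   # as in ...ニ付 -> particle
]

# Count how often each context character co-occurs with each label.
counts = defaultdict(Counter)
for (before, after), label in train:
    counts[("before", before)][label] += 1
    counts[("after", after)][label] += 1

def classify(before, after):
    """Vote by the label counts of the before- and after-characters;
    fall back to the numeral reading for unseen contexts."""
    score = Counter()
    for feat in [("before", before), ("after", after)]:
        score.update(counts[feat])
    return score.most_common(1)[0][0] if score else "二"

print(classify("十", "月"))  # -> 二 (numeral context)
print(classify("京", "於"))  # -> ニ (particle context)
```

Even this crude voting scheme shows why a small proofread sample can go a long way: each ambiguous glyph’s neighbors carry most of the disambiguating signal.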
One WORD LAB participant wondered if this procedure would also work on the “long S” in 18th-century texts. Could the right algorithm theoretically take any text from any period and improve the character recognition?
Another participant had a question about how well R tokenizes Japanese texts, and also if Mark had split up the documents or worked with one huge corpus, and how R had handled it if the latter. He noted that R does a fairly good job tokenizing, and that the petitions aren’t too large, so he hasn’t run into problems. The participant had been trying to work with a 1.2 GB text corpus and having no success with Topic Modeling Tool and Voyant Server (which are both Java programs) so was wondering if R would work better.
As for the student project Mark worked on, they scraped student protest demands from various schools as well as university responses. You can find the visualizations of the words used by students and administrators at their site.