November 11 OPEN LAB

This week we discussed Ryan Cordell’s Viral Texts project and paper “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers” (forthcoming). Ryan visited Penn the previous week and gave a talk and a workshop about this project, so we seized the opportunity to further discuss his work at WORD LAB reading club.

Our discussion focused on how Ryan’s techniques might be applicable to other projects that our members are interested in, after we went over some finer points of the article (including wondering which genres the “fast” and “slow” words appeared in, and how they filtered out long-form advertisements). Brian Vivier wondered if the way they’re generating overlapping n-grams would allow for comparison in classical Chinese texts without whitespace for word divisions: picking a 5-gram, for example, would certainly catch some words. However, we also questioned whether we’d pick up too many word fragments and if this noise would be too much for the analysis.

This technique could possibly be used, we thought, to compare documents in a large corpus for text reuse, just as Ryan did with “viral” newspaper reprints, which would be extremely likely in the case of classical Chinese texts. For example, we could find when precedents are invoked or imperial decisions are cited. Where are the patterns, and would the noise fall out if we are looking for patterns like this?

We also felt that the paper was hesitant about making concrete conclusions and hard statements, and discussed that there is a difference in rhetoric between science and humanities: science is more experimental, more exploratory, and more likely to write about failure (although that’s certainly often not the case); humanities papers, meanwhile, tend to make big claims and only after the author is certain their position is solid. Well, we are making generalizations, but given that it was a room full of mostly humanities people, the tone stuck out and was surprising to most of us.

Finally, we talked about the applicability of Cordell et al’s ideas and techniques for other languages and time periods; for example, Maddie Wilcox brought up the similarities between antebellum America and Republican China in terms of printing instability, the spread of railroads, and the penetration of networks into rural areas. It would be interesting, too, to look at local republishing across genre, rather than geographically spread-out republishing. Finally, how about networks based on who studied abroad at the same time, who graduated together, and literary societies in Republican China? Endless possibilities.

The practicality of such projects, and the ethics of using available texts, also came up. We talked about improving OCR and the questionable legality of scanning an entire microfilm series of colonial newspapers (gotten via ILL) or on CD as PDFs. It would be great to compare across languages with Japanese, Chinese, and Korean colonial newspapers, for example. But the quality of the OCR is so poor perhaps this is only a pipe dream. It’s hard to argue for “intellectually viable OCR improvement” – if only we could think of a project and a grant!

Aside: We also covered more ground on the Python tutorial dealing with the Chinese Biographical Database API. It was anticlimactic: Molly found a bug in that made its JSON data invalid and so was unable to process it. Still, she explained the way JSON data is accessed and what it looks like in a script – if only it worked!

October 29 OPEN LAB

Our OPEN LAB time on October 29 was split into two parts: discussion of an article on authorship attribution, and step one of reading a CSV file, calling a web API, and then rewriting the file in Python. You can find the article here:

 

Ayaka Uesaka and Masakatsu Murakami

Verifying the authorship of Saikaku Ihara’s work in early modern Japanese literature; a quantitative approach Literary & Linguistic Computing, first published online September 29, 2014 doi:10.1093/llc/fqu049 (9 pages)

Our discussion centered on first the article’s assumptions and methodology, and then authorship attribution and its role in general. One point of contention is that the article attempted to differentiate one epistolary work from the rest of an assumed body of Ihara Saikaku’s works, but did not take into account the major stylistic differences between epistolary writing and the writing of typical fiction during the Edo period (1600-1868) in Japan. Thus, the stylistic differences that led the authors to suspect that Saikaku did not write this particular work could also simply just be due to the difference in genre. We thought that more work on genre could be a productive and interesting direction for this kind of research.

The Python tutorial covered reading in a CSV file using the unicodecsv library, how to import libraries in general, and how to access items in a list. It also demonstrated how to construct a URL and call the Chinese Biographical Database web API through urlopen(). Stay tuned for reading data from the API at the next OPEN LAB.

October 14 OPEN LAB

Laura Gibson, from the Annenberg School of Communication, led our discussion of “The Battle for ‘Trayvon Martin’: Mapping a Media Controversy Online and Offline.”  The authors collected a range of media with mentions of “Trayvon Martin” (and the common misspelling “Treyvon Martin”) from twitter, blogs, online media outlets, newspapers, and television in order to understand the media ecosystem around the killing. They used MIT’s Media Cloud to produce much of their evidence. We were especially interested how they created the data set and how other scholars might use their infrastructure. Laura’s research group’s is currently working with them. We discussed the promising avenues of sharing mined data — in this case, the identification of certain newspaper articles — and the  legal complications of actually gathering the text for subsequent research.  We also examined the various methods and tools the researchers used to analyze their evidence.

In the last part of the session, Molly led a fabulous Python tutorial based on Brian Vivier presentation the previous week.  We walked through object types, ways of easily creating paths for an API, and frameworks for querying and organizing imported CSVs.

Reading:
Erhardt Graeff, Matt Stempeck, and Ethan Zuckerman, “The battle for ‘Trayvon Martin’: Mapping a media controversy online and off–line,” First Monday, vol. 19: 2 – 3, February 2014
http://firstmonday.org/ojs/index.php/fm/article/view/4947/3821