Today we took a look at teaching “global” DH, specifically in terms of text analysis/mining methods, and had a discussion about several elements of that issue: how to teach, what to teach, and why to teach about digital humanities in languages other than English.
Our initial question was whether Underwood and Sellers’s argument stands up. While we were concerned about hand-picking data, one member suggested that “the sampling was sampling,” and this was convincing. In addition, we felt that the random sample could be as random as possible, given that HathiTrust’s coverage of 19th-century texts is better than its coverage of other periods. We discussed why that might be: deaccessioning, preservation, the sheer number of books produced in the 19th century as opposed to earlier, OCR accuracy, and copyright restrictions. There was also the procedure of digitization itself: the University of Michigan, for example, digitized its research collections rather than things in special collections, and this might have included many more 19th-century books.
We also asked what exactly significance and prestige mean. One member brought up the example of Wendell Harris’s article “Canonicity” and his argument that texts become part of the canon based on being part of the conversation; this would go along with the idea that things being reviewed in magazines are significant in that they are being talked about in the first place, whether positively or negatively. And even if things were being negatively reviewed, they still helped to shape literary conversation and thus “the canon” and what survived over time. One member also raised the question of doing sentiment analysis of some kind (whether yes/no or picking out significant words, as another member suggested) on the reviews and adding that data to the analysis.
A question was also raised by a Wharton member: with literary analysis of this kind, is it more about interpretation or explanation, and what is the outcome of the research? With business research, there would be an action suggested based on the conclusions. We ended up talking about how the paper is finding out whether there is anything to interpret or explain in the first place. The authors stop short of explaining or making generalizations, and emphasize the narrowness of their claims. (Whether one should take this approach or go out on a limb to make bigger, but potentially wrong, claims was also discussed at this point.) We also wondered whether their success in prediction means there “is” an explanation somewhere of what makes things significant. Is there a latent pattern that exists, and that we might balk at recognizing, as humans?
Finally, we also discussed why the line was always going upward. It seems that this is because the works reviewed and random non-reviewed sample adhered more closely to the “standards,” whatever they are, over time for some reason. Again, we can speculate but not exactly explain what’s going on behind the scenes there.
One conclusion was that we buy the continuity or lack of change in standards over time. The reason is that we see this in other periods where there is a narrative of change, but in reality, much continuity: Meiji Japan (mid-to-late 19th century), and Lu Shi’s use of classical Chinese in his writing despite being at the vanguard of “modern” Chinese literature.
Addendum: Scott has been tinkering with this and is making better predictions with a reduced list of between 100 and 400 words. Word selection looks only at the training data, without seeing the test data in advance. Depending on the training data it picks slightly different sets, but a core set of about 15 words always appears. The top of the list is “eyes,” and if you use just the word “eyes” you get 63% accuracy! See Scott’s GitHub for the code and results.
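For readers who want to see what “selecting words from the training data only” can look like, here is a minimal sketch. It is not Scott’s code: the scoring rule (difference in document rate between the two classes), the voting classifier, and the toy texts are all assumptions made for illustration, and both classes must be present in the training set.

```python
from collections import Counter


def doc_frequency(docs):
    """Count how many documents each word appears in."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.lower().split()))
    return counts


def select_words(train_docs, train_labels, k):
    """Pick the k words whose document rate differs most between classes.

    Only training documents are consulted, so the test set stays unseen.
    Returns (word, gap) pairs; a positive gap favors class 1.
    """
    pos = [d for d, y in zip(train_docs, train_labels) if y == 1]
    neg = [d for d, y in zip(train_docs, train_labels) if y == 0]
    pos_df, neg_df = doc_frequency(pos), doc_frequency(neg)

    def gap(word):
        return pos_df[word] / len(pos) - neg_df[word] / len(neg)

    vocab = set(pos_df) | set(neg_df)
    ranked = sorted(vocab, key=lambda w: (-abs(gap(w)), w))
    return [(w, gap(w)) for w in ranked[:k]]


def predict(text, selected):
    """Each selected word found in the text votes for the class it favors."""
    words = set(text.lower().split())
    score = sum(1 if g > 0 else -1 for w, g in selected if w in words)
    return 1 if score > 0 else 0


# Hypothetical toy data: 1 = reviewed, 0 = random sample.
train_docs = [
    "her eyes met his",
    "dark eyes and a sigh",
    "the rate of trade",
    "a rate of tax",
]
train_labels = [1, 1, 0, 0]

selected = select_words(train_docs, train_labels, k=2)
guess_reviewed = predict("blue eyes", selected)
guess_random = predict("rate of change", selected)
```

Rerunning the selection on different training splits and intersecting the resulting lists is one way to see which words show up every time.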
We got a future potential WORD LAB project out of this discussion, so it was a very productive reading and session!
Is classical Chinese a natural language? Artificial language?
According to Wikipedia (our most reliable source, “naturally”), a natural language is “unpremeditated” when it is used. When someone is truly fluent in classical Chinese, what happens in their brains? Do they conceptualize poems in spoken language as a mediation layer and then “translate” it into classical Chinese/literary Sinitic, or do they think in that layer directly?
If they do think in it, is that enough to make it a natural language?
How about coding? Is pseudocode natural language (or, shorthand for a natural language), whereas programming languages are artificial? Does anyone think in Python, for example, or is it always mediated by a natural language (or pseudocode, or NL-then-pseudocode) stage beforehand?
What might edge cases teach us? Take sign language: never a spoken language, always “spoken” physically. What is writing or typing, if not a physical gesture? Sign language looks a lot like a “script” in how it functions, but it’s considered separate from it. There are also different “dialects,” and it’s a “script” in the sense that it’s rendering a specific language. How about musical thoughts: are these the same as “linguistic” thoughts? How about thinking with one’s hands, as not just someone using sign language does but also master craftsmen/women who have “unmediated” gestures? But they are not necessarily communicating that way, which is what language does.
Where is the line between script and artificial language?
Just some unmediated thoughts from our OPEN LAB today while going through Natural Language Processing With Python!
This week we discussed Ryan Cordell’s Viral Texts project and paper “Infectious Texts: Modeling Text Reuse in Nineteenth-Century Newspapers” (forthcoming). Ryan visited Penn the previous week and gave a talk and a workshop about this project, so we seized the opportunity to further discuss his work at WORD LAB reading club.
Our discussion focused on how Ryan’s techniques might be applicable to other projects that our members are interested in, after we went over some finer points of the article (including wondering which genres the “fast” and “slow” words appeared in, and how they filtered out long-form advertisements). Brian Vivier wondered if the way they’re generating overlapping n-grams would allow for comparison in classical Chinese texts without whitespace for word divisions: picking a 5-gram, for example, would certainly catch some words. However, we also questioned whether we’d pick up too many word fragments and if this noise would be too much for the analysis.
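To make the n-gram idea concrete for unsegmented text, here is a minimal sketch of overlapping character n-grams. It is our own illustration, not the Viral Texts code, and the sample line is just a well-known classical phrase chosen for the example.

```python
def char_ngrams(text, n=5):
    """Return every overlapping run of n characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Illustrative sample with no whitespace between words.
sample = "學而時習之不亦說乎"
grams = char_ngrams(sample, n=5)
```

Because the window slides one character at a time, some 5-grams will line up with real words or phrases and others will straddle word boundaries, which is exactly the noise we were worried about.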
This technique could possibly be used, we thought, to compare documents in a large corpus for text reuse, just as Ryan did with “viral” newspaper reprints, which would be extremely likely in the case of classical Chinese texts. For example, we could find when precedents are invoked or imperial decisions are cited. Where are the patterns, and would the noise fall out if we are looking for patterns like this?
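A rough sketch of how such a comparison might work: index every n-gram by the documents it appears in, then keep the document pairs that share at least a few n-grams. The function name, the threshold, and the toy documents are assumptions for illustration; the paper’s actual pipeline is more involved than this.

```python
from collections import defaultdict
from itertools import combinations


def shared_ngram_pairs(docs, n=5, min_shared=2):
    """Map each pair of document ids to the character n-grams they share.

    Pairs sharing fewer than min_shared n-grams are dropped, which is a
    crude way of letting incidental overlaps fall out as noise.
    """
    index = defaultdict(set)  # n-gram -> ids of documents containing it
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add(doc_id)

    pairs = defaultdict(set)  # (id_a, id_b) -> shared n-grams
    for gram, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            pairs[(a, b)].add(gram)

    return {pair: grams for pair, grams in pairs.items()
            if len(grams) >= min_shared}


# Hypothetical toy corpus (English for readability).
docs = {
    "a": "the emperor approved the memorial today",
    "b": "it is said the emperor approved the memorial",
    "c": "rain fell on the fields",
}
candidates = shared_ngram_pairs(docs, n=5, min_shared=2)
```

Raising `min_shared` or `n` trades recall for less noise, which is one way to test whether the patterns survive once the incidental matches are filtered out.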
We also felt that the paper was hesitant about making concrete conclusions and hard statements, and discussed that there is a difference in rhetoric between science and humanities: science is more experimental, more exploratory, and more likely to write about failure (although that’s certainly often not the case); humanities papers, meanwhile, tend to make big claims and only after the author is certain their position is solid. Well, we are making generalizations, but given that it was a room full of mostly humanities people, the tone stuck out and was surprising to most of us.
Finally, we talked about the applicability of Cordell et al.’s ideas and techniques for other languages and time periods; for example, Maddie Wilcox brought up the similarities between antebellum America and Republican China in terms of printing instability, the spread of railroads, and the penetration of networks into rural areas. It would be interesting, too, to look at local republishing across genre, rather than geographically spread-out republishing. And how about networks based on who studied abroad at the same time, who graduated together, and literary societies in Republican China? Endless possibilities.
The practicality of such projects, and the ethics of using available texts, also came up. We talked about improving OCR and the questionable legality of scanning an entire series of colonial newspapers held on microfilm (gotten via ILL) or on CD as PDFs. It would be great to compare across languages with Japanese, Chinese, and Korean colonial newspapers, for example. But the quality of the OCR is so poor that perhaps this is only a pipe dream. It’s hard to argue for “intellectually viable OCR improvement.” If only we could think of a project and a grant!
Aside: We also covered more ground on the Python tutorial dealing with the Chinese Biographical Database API. It was anticlimactic: Molly found a bug in the API that made its JSON data invalid, so she was unable to process it. Still, she explained how JSON data is accessed and what it looks like in a script. If only it worked!
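For anyone following along at home, here is a minimal sketch of accessing parsed JSON in a script and of failing gracefully when the data is invalid. The field names and both payloads are made up for illustration; they are not the database’s actual response format.

```python
import json


def parse_payload(raw):
    """Return (data, None) on success, or (None, message) if the JSON is invalid."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as err:
        message = f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        return None, message


# Hypothetical payloads: the second has a stray trailing comma.
good = '{"person": {"name": "Example Name", "aliases": ["A", "B"]}}'
bad = '{"person": {"name": "Example Name", }}'

data, error = parse_payload(good)
# Parsed JSON objects become dicts and arrays become lists.
first_alias = data["person"]["aliases"][0]

broken, broken_error = parse_payload(bad)
```

Reporting the line and column of the failure at least tells you where the response went wrong, even if the fix has to happen on the server side.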
Our OPEN LAB time on October 29 was split into two parts: discussion of an article on authorship attribution, and step one of reading a CSV file, calling a web API, and then rewriting the file in Python. You can find the article here:
“Verifying the authorship of Saikaku Ihara’s work in early modern Japanese literature; a quantitative approach.” Literary & Linguistic Computing, first published online September 29, 2014. doi:10.1093/llc/fqu049 (9 pages).
Our discussion centered first on the article’s assumptions and methodology, and then on authorship attribution and its role in general. One point of contention is that the article attempted to differentiate one epistolary work from the rest of an assumed body of Ihara Saikaku’s works, but did not take into account the major stylistic differences between epistolary writing and the writing of typical fiction during the Edo period (1600-1868) in Japan. Thus, the stylistic differences that led the authors to suspect that Saikaku did not write this particular work could simply be due to the difference in genre. We thought that more work on genre could be a productive and interesting direction for this kind of research.
The Python tutorial covered reading in a CSV file using the unicodecsv library, how to import libraries in general, and how to access items in a list. It also demonstrated how to construct a URL and call the Chinese Biographical Database web API through urlopen(). Stay tuned for reading data from the API at the next OPEN LAB.
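Here is a minimal sketch of those steps for anyone who missed the session. It is not the tutorial script: it uses Python 3’s standard csv module (which handles Unicode text directly) instead of unicodecsv, and the base URL, the query parameters `name` and `o`, and the CSV column name are placeholders to be swapped for the real ones.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder endpoint, not the real API address.
BASE_URL = "https://example.org/cbdbapi/person.php"


def read_names(csv_file, column="name"):
    """Read one column from a CSV file object with a header row."""
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader]


def build_url(name):
    """Build a query URL; urlencode percent-encodes non-ASCII names."""
    return BASE_URL + "?" + urlencode({"name": name, "o": "json"})


def fetch(url):
    """Call the web API and return the raw response text (not run here)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# An in-memory CSV stands in for a file opened with open(path, encoding="utf-8").
csv_text = "name,note\n王安石,example row\n蘇軾,example row\n"
names = read_names(io.StringIO(csv_text))
first_name = names[0]  # items in a list are accessed by index
urls = [build_url(n) for n in names]
```

Building the URL with `urlencode` rather than string concatenation keeps names with non-ASCII characters safe to send.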
Laura Gibson, from the Annenberg School for Communication, led our discussion of “The Battle for ‘Trayvon Martin’: Mapping a Media Controversy Online and Offline.” The authors collected a range of media with mentions of “Trayvon Martin” (and the common misspelling “Treyvon Martin”) from Twitter, blogs, online media outlets, newspapers, and television in order to understand the media ecosystem around the killing. They used MIT’s Media Cloud to produce much of their evidence. We were especially interested in how they created the data set and how other scholars might use their infrastructure. Laura’s research group is currently working with them. We discussed the promising avenues of sharing mined data (in this case, the identification of certain newspaper articles) and the legal complications of actually gathering the text for subsequent research. We also examined the various methods and tools the researchers used to analyze their evidence.
In the last part of the session, Molly led a fabulous Python tutorial based on Brian Vivier’s presentation the previous week. We walked through object types, ways of easily creating paths for an API, and frameworks for querying and organizing imported CSVs.
Erhardt Graeff, Matt Stempeck, and Ethan Zuckerman, “The battle for ‘Trayvon Martin’: Mapping a media controversy online and off–line,” First Monday, vol. 19, no. 2, 3 February 2014.