Today we took a look at teaching “global” DH, specifically in terms of text analysis/mining methods, and had a discussion about several elements of that issue: how to teach, what to teach, and why to teach about digital humanities in languages other than English.
We started by comparing two “non-English DH” courses: one that took a survey view, covering a wide variety of methods and topics, and another that focused specifically on NLP and texts, but for a wider range of languages. We characterized these approaches as “hands on” versus “brains on” (Katie’s words), although we don’t mean to imply that a lab-style course doesn’t use students’ brains. It’s a question of how we structure our approach to exploration: “questions” versus “methods.” This issue came up last week in our discussion of Andrew Piper’s Enumerations (Introduction and Chapter 5), when we asked how best to approach experimenting with DH in our fields, and how to communicate the value and results to colleagues who aren’t familiar with computational methods, and who might be very skeptical of them or simply not care. We didn’t decide that one approach is better than the other, but recognized that there are options to choose from when teaching anything, especially a DH course.
Moving on to course content: we agreed that DH, text analysis, and computing broadly are by and large focused on and dominated by the Anglophone world, and generally assume that texts are representable in ASCII. Off-the-shelf tools for non-English texts can be rare, underdeveloped, or difficult to get running (and that’s when they don’t require scripting or programming knowledge to get working in the first place). Documentation is often scanty (for example, in Molly’s experience with MeCab for Japanese and Stanford CoreNLP for Chinese), and almost never in English. Molly recounted cobbling together multiple blog posts in Japanese, her own experimentation, and the “official” website in order to write her own English documentation for some elements of MeCab.
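That ASCII assumption is easy to demonstrate: any pipeline that quietly coerces text to 7-bit ASCII passes English through unchanged and mangles nearly everything else. A minimal, stdlib-only Python sketch (the `ascii_safe` helper and sample strings are ours, purely for illustration):

```python
def ascii_safe(text):
    """Return True if every character fits in 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in text)

print(ascii_safe("topic model"))     # True: plain English survives
print(ascii_safe("トピックモデル"))    # False: Japanese requires Unicode
print(ascii_safe("café"))            # False: even Romance languages break
```

Even this toy check shows why “non-English” ends up lumping Spanish in with Chinese: from the pipeline’s point of view, both fail the same test.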
Most of our session was spent in small groups attempting to practice some NLP tasks on various languages. We had a difficult time getting any software to run on Spanish, Portuguese, Chinese, or Japanese texts, and what did run was either very slow or produced inaccurate results. (Specifically, Spanish POS tagging was slow, and Japanese NER was inaccurate.) We never got the Stanford POS Tagger for Chinese to run at all (Molly and Scott recounted how it took them nearly half an hour to get it working in NLTK for Molly’s course in 2018, and of course they hadn’t documented what they did!), and Scott only got the POS tagger for Spanish to run in a reasonable amount of time after about 45 minutes of tinkering and searching online. We all got very frustrated and gave up on the tutorials we were following, breaking instead for a big-picture discussion provoked by our experiences.
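One root cause we kept bumping into: many off-the-shelf tools assume space-delimited input, an assumption that holds for English and Spanish but not for Japanese or Chinese, which is why a separate segmenter like MeCab has to run before anything else can. A quick stdlib-only Python illustration of the gap (the example sentences are ours):

```python
english = "the cat sat on the mat"
japanese = "猫がマットの上に座った"  # roughly: "the cat sat on the mat"

# Whitespace tokenization, the default in many tools:
print(english.split())   # six tokens, one per word
print(japanese.split())  # one giant "token": there are no spaces to split on
```

Every downstream step (POS tagging, NER, counting) inherits this failure, which is part of why our Japanese and Chinese experiments were so much harder to get off the ground.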
Scott asserted that monolingual topic modeling is “inherently Anglocentric,” although we didn’t reach any conclusion about the difference between mono- and polylingual topic modeling. But it’s interesting to think about, and since David Mimno’s paper on polylingual topic modeling is on our (long) list of suggested things to do in WORD LAB, maybe we’ll have a chance to discuss this in more depth in the future.
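Part of what makes standard topic modeling feel Anglocentric is its preprocessing: the conventional first step is a bag of words built by lowercasing and splitting on whitespace, which quietly presupposes an English-like orthography. A hedged, stdlib-only sketch of that step (the `bag_of_words` helper and sample strings are ours, not from any particular toolkit):

```python
from collections import Counter

def bag_of_words(doc):
    # The conventional preprocessing step before topic modeling:
    # lowercase, then split on whitespace.
    return Counter(doc.lower().split())

# English yields a usable term distribution:
print(bag_of_words("models of topics topics of models"))

# An unsegmented Japanese document collapses into a single "term",
# leaving a topic model nothing meaningful to count:
print(bag_of_words("トピックのモデルとモデルのトピック"))
```

The model itself is language-agnostic; it is the assumptions baked into the steps before it that privilege English.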
The question that came up in the end was: why teach “non-English” DH? Is there something that unifies both Romance and non-Western languages, beyond the fact that they aren’t representable in ASCII alone? And, in Molly’s 2018 case, what “unites” East Asian DH? Is there even such a thing? In a recent presentation on that 2018 seminar, Molly raised just this issue, concluding that what ties “East Asian DH” (China/Japan/Korea) together are the challenges of languages that require Unicode representation and have no spaces between words. But then what of Korean (which does have spaces)? And why group together China, Japan, and Korea, when each language and nation has different challenges, infrastructure, and resources? Is there a point to putting Spanish, Russian, and Chinese in the same environment, just because they all “have issues”? No one had a real answer for this, but it led us to another interesting point…
…the amazing course idea that Katie had of “an Axis power WWII course.” We dubbed it “Global Fascism” and Scott in particular was very enthusiastic. Wouldn’t it be amazing to have a multilingual course like this, joined by a specific theme and historical period, where none of the instructors or students could understand 100% of the content? Could computational methods indeed help us understand WWII better from this perspective? Perhaps we’ll never know!