Laura McGrath of Michigan State University (@lbmcgrath) joined us this week via Skype to talk about measuring “literary novelty,” a project she is working on with Devin Higgins and Arend Hintze. The former is MSU’s data librarian, and the latter is a professor at MSU who specializes in computational approaches to genetics and biology. They have also collaborated with the HathiTrust Research Center. The presentation focused on introducing the collaborative project, then on considering how to move into its second stage in the near future.

Laura introduced the project as applying data structures from biology and genetics to literary texts, and envisions it in conversation with other projects such as Viral Texts. It is inherently comparative: when we find literary novelty (“something new”), there must also be something “old.” As Laura put it, “novelty is a culturally driven metric.” She and her team are interested in literary change and form as “novelty,” and see it as something always shifting and not really measurable in itself.
“Machine reading genetic material is not significantly different from machine reading literary texts,” Laura told us. When looking at genetic material, we use a hash table to keep track of whether a piece of data is already part of the set: in other words, whether we have seen it yet or not. Can this kind of data analysis help us look at literary novelty?
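The hash-table idea from genomics can be sketched in a few lines. This is a minimal illustration, not the team’s actual code; the example sequence and the window size `k` are arbitrary choices for demonstration:

```python
def seen_kmers(sequence, k=4):
    """Scan a sequence and flag each k-mer as novel (True) or already seen (False)."""
    seen = set()   # the hash table: k-mers encountered so far
    flags = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        flags.append(kmer not in seen)  # True on first occurrence only
        seen.add(kmer)
    return flags

# A repeated motif stops being "novel" the second time the scan meets it:
flags = seen_kmers("GATTACAGATTACA", k=4)
# first 7 windows are new; the last 4 repeat earlier windows
```

Because set membership is a constant-time hash lookup, the same approach scales from short gene sequences to full-length novels.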
The team first scans through texts in order:
- Track repetition of small intervals (a fixed 12-character window, or “12-mer”); a degree of repetition is built in, but variation is also visible
- Data they’d usually “clean out” (spaces, punctuation, stop words, etc.) is kept for this process
- Assign 1 (have not seen: novel) or 0 (have seen: not novel) to each window
- Assign the text a score based on the resulting sequence of 1s and 0s
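The steps above can be sketched as follows. This is a hedged reconstruction from the description, not the team’s implementation: the raw text is scanned with a fixed 12-character window, nothing is cleaned out, and the summary score shown here (share of novel windows) is just one plausible way to collapse the 1/0 sequence:

```python
def novelty_signal(text, k=12):
    """Slide a fixed k-character window over the raw text (spaces and
    punctuation kept) and mark each window 1 (not yet seen) or 0 (seen)."""
    seen = set()
    signal = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        signal.append(0 if window in seen else 1)
        seen.add(window)
    return signal

def novelty_score(signal):
    """One possible summary score: the fraction of windows that were novel."""
    return sum(signal) / len(signal)

# A self-repeating text goes from novel to not-novel as the scan proceeds:
sig = novelty_signal("rose is a rose is a rose", k=12)
```

Note that because punctuation and spacing are kept, two sentences with identical vocabulary but different typography still register as different windows.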
“Repetition as something that is itself novel.”
Under a bag-of-words approach, Gertrude Stein would register as “not innovative” (because of a small, repeated vocabulary), but here the team is measuring usage differently.
Every text has a negative slope and repeats itself over time, because each text is assessed against its own beginning rather than compared with other texts.
They’re thinking about the types of novelty this measurement could help uncover, treating novelty in a complex and nuanced way. Low points in the slope should be read as repetition used as innovative form, not as an absence of novelty (peaks and valleys are both significant). They’re looking at degrees of variation. How can we compare texts to each other on a large scale, according to these measurements?
Michael Levenson talks about “intratextual novelty.” Laura is interested in a “holistic sign system” with novels “recycling their own material” — the Bloom factor can find moments of texts echoing or reinventing themselves.
Now: Intratextual. Next: Larger Historical/Cultural Conversations.
They’re starting with Andrew Piper’s Novel 450 corpus (from McGill TxtLab). The slope helps us understand how quickly novelty “decays” (how often the text repeats or reinvents itself), while r² is the degree of formal variation. What does the measurement reveal, and what are its limitations? If it’s useful for identifying High Modernism, how does this show how it developed over the 20th century and intervene in our understanding?
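The talk doesn’t specify how the slope and r² are computed, but one plausible sketch is an ordinary least-squares fit of the novelty rate against position in the text: the slope captures how fast novelty decays, and r² captures how well a straight line describes that decay (low r² meaning lots of formal variation, peaks and valleys). All names here are illustrative assumptions:

```python
def linear_fit(xs, ys):
    """Ordinary least squares over (xs, ys): return (slope, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, r2

def decay_profile(signal, bins=5):
    """Average novelty rate in successive segments of a 1/0 novelty signal."""
    size = max(1, len(signal) // bins)
    return [sum(signal[i:i + size]) / len(signal[i:i + size])
            for i in range(0, len(signal), size)]

# A steadily declining novelty rate yields a negative slope and r² near 1;
# a text full of peaks and valleys would keep the slope but lower the r².
slope, r2 = linear_fit(list(range(5)), [1.0, 0.8, 0.6, 0.4, 0.2])
```

On this reading, two texts could share a slope (they “use up” novelty at the same rate) while differing sharply in r², which is what would let the measurement separate steady decay from formally varied repetition.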
The team has received a HathiTrust Advanced Collaborative Support Grant to access in-copyright HathiTrust data via a virtual machine for their next, larger-scale analysis. They are excited to see other uses for the measurement as well, and how other scholars might reuse it!