
February 23, 2016 Article Discussion: Text Mining Online Reviews

On Tuesday, we discussed “Digging for Gold with a Simple Tool: Validating Text Mining in Studying Electronic Word-of-Mouth (eWOM) Communication,” by Chuanyi Tang and Lin Guo (2013).

This article tackles the problem of text mining from a marketing perspective, testing whether text mining offers useful information in the study of eWOM (electronic word-of-mouth, aka online reviews). Tang and Guo conclude that while the Star Rating of an online review is the best predictor of people’s attitudes in the review, text mining can offer additional nuance.

Much of our conversation centered around the LIWC software used by Tang and Guo for their study. Essentially an amped-up text tagger, LIWC checks each word in a text against its range of dictionaries and produces a statistical breakdown of that text.

LIWC’s main strength seems to be its dictionaries, which are thoroughly researched and allow for somewhat sophisticated tagging of words by a range of features: parts of speech, emotions (positive or negative), and many topical categories including “Body,” “Ingestion,” “Time,” “Money,” and “Religion” (over 400 different categories in all). In the case of online reviews, for example, Tang and Guo found that “Negations” and “Money” were both effective predictors. These dictionaries are, however, proprietary, and Christine pointed out the difficulty of accessing the full dictionaries in the latest version of the LIWC software.
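To make the dictionary-lookup idea concrete, here is a minimal Python sketch of this kind of tagging. It is not LIWC's actual implementation: the two tiny word lists, the name `TOY_DICTIONARY`, and the function `category_percentages` are all invented for illustration, since the real dictionaries are proprietary. It only shows the general shape of the process: check each word against each category list, then report each category as a percentage of all words.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries, made up for this example.
# The real LIWC dictionaries are proprietary and far larger.
TOY_DICTIONARY = {
    "negation": {"no", "not", "never", "none"},
    "money": {"price", "cost", "cheap", "refund", "money"},
}


def category_percentages(text, dictionary=TOY_DICTIONARY):
    """Return, for each category, the percentage of words in `text` it covers."""
    # Very rough tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocab in dictionary.items():
            if word in vocab:
                counts[category] += 1
    total = len(words)
    return {
        category: (100.0 * counts[category] / total if total else 0.0)
        for category in dictionary
    }


review = "Not worth the price. I will never buy this again and I want a refund."
result = category_percentages(review)
for category, pct in result.items():
    print(f"{category}: {pct:.1f}% of words")
```

A study like Tang and Guo's would then treat these per-review percentages as predictor variables alongside the star rating.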

We tested LIWC on a segmented version of Dickens’s David Copperfield, as a good example of a coming-of-age story, but weren’t able to find strong trends. The whole paper was a very interesting counterpoint to previous work we’ve discussed on text mining in the humanities, where it’s not always so easy to validate the result.

Thanks to Christine Chou for suggesting the piece and taking the time to give us a great overview!

January 19, 2016 Article Discussion: Authorship Case Studies

For this week’s WordLab discussion, we read David L. Hoover’s 2012 article, “The Tutor’s Story: A Case Study of Mixed Authorship.” (From English Studies 93:3, 324-339, 2012). In this article, Hoover looks at The Tutor’s Story, a novel by Victorian author Charles Kingsley, finished by his daughter Mary St Leger Kingsley Harrison, writing under the name Lucas Malet. Hoover uses this text as a test case to compare a range of authorship attribution methodologies, including Burrows’s Delta, Craig’s version of Burrows’s Zeta, and t-tests.

Hoover compares his results to an annotated version of the published text, discovered partway through his research, containing Malet’s own markings about which parts of the text are hers and which are Kingsley’s.

The Methods

We spent most of our time today working through the specifics of the different methods. There seem to be two stages to the process: 1) selecting the words which will compose the “fingerprint” of the author’s style, and 2) analyzing the statistical similarity of these words. More of Hoover’s explanation covers variations on the first part. For example, Delta uses the most frequent words in each text, while Zeta is based “not on the frequencies of words, but rather on how consistently the words appear” (329). Our consensus was that we were interested in reading more about this, and we may move to some of the original articles on the Delta method in future weeks. The stylo package for R also apparently has commands for Delta and Zeta which we could explore.
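As a way of pinning down the two stages, here is a rough Python sketch of Delta as it is usually described: take the most frequent words across a reference corpus, convert each text's word frequencies to z-scores, and score a disputed text by its mean absolute z-score difference from each candidate. This is our own simplified illustration, not Hoover's procedure or the stylo implementation. The function names (`burrows_delta`, `tokenize`, `relative_freqs`), the default of 150 words, and the choice to ignore words whose frequency does not vary across the reference texts are all assumptions.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Rough tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on an empty text
    return [counts[w] / total for w in vocab]


def burrows_delta(known_texts, disputed_text, n_words=150):
    """known_texts maps a label to a text; returns label -> Delta distance.

    A lower Delta means the disputed text is stylistically closer.
    """
    tokenized = {label: tokenize(t) for label, t in known_texts.items()}

    # Stage 1: the "fingerprint" is the most frequent words in the corpus.
    corpus_counts = Counter()
    for tokens in tokenized.values():
        corpus_counts.update(tokens)
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]

    freqs = {label: relative_freqs(toks, vocab) for label, toks in tokenized.items()}

    # Stage 2: standardize each word's frequency across the reference texts.
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]

    def zscores(row):
        # Assumption: words with no variation across texts contribute nothing.
        return [(f - m) / s if s > 0 else 0.0 for f, m, s in zip(row, means, sds)]

    disputed_z = zscores(relative_freqs(tokenize(disputed_text), vocab))
    return {
        label: mean(abs(a - b) for a, b in zip(zscores(row), disputed_z))
        for label, row in freqs.items()
    }


# Toy example with invented sentences standing in for two authors' known work.
known = {
    "A": "the the the cat sat on the mat and the dog",
    "B": "a a a bird flew over a tree and a hill",
}
deltas = burrows_delta(known, "the the cat and the dog sat on the mat")
print(deltas)
```

Real studies use whole novels and hundreds of words; with only two short reference texts the z-scores are crude, so this is useful only for seeing how the pieces fit together.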

Big Picture Questions

Our discussion also brought up some larger conceptual issues around the question of authorship attribution. How are the results affected by what we choose to use as the master corpus? How much does an author’s style vary across different genres? (Malet’s only children’s book, Little Peter, is mentioned multiple times as disrupting the analysis, perhaps because of the smaller word range in a children’s book.) How is the relatively clean-cut choice between two potential authors different from the problems faced when we have more possible authors?

Significantly, Hoover’s results sometimes disagree with Malet’s markings, but it also is not entirely clear when Malet made those markings and how reliable they are. How confident do we need to be in the machine’s results before we start trusting the machine over the human?

Further Implications & Links

We touched on a couple of similar problems, including Andrew Piper’s prediction of the Giller Prize. We also discussed the recent discovery of Dickens’s annotated set of the journal All the Year Round, which settled a ten-year computer textual analysis project trying to determine who wrote what in the anonymous journal.
We also noted SocSciStatistics.com, a tool for running statistical analyses (and helping you figure out which statistical method is the best in your situation).