Christine Chou, a WORD LAB regular, and Steve Kimbrough joined us from Wharton to talk about text analytics and business intelligence. They discussed three current projects, the first of which is the topic of a recent paper. Below are brief summaries and Molly’s notes from the discussion.
First Project: Developing Indicators with Text Analytics
They recently wrote a paper, “On Developing Indicators with Text Analytics: Exploring Concept Vectors Applied to English and Chinese.” The main questions were whether, using small corpora, they could generate high-quality, vocabulary-based classifiers for competitive intelligence; what kinds of words make useful predictors; and how the resulting patterns of language vary between languages.
The question was whether one could measure innovation based on vocabulary.
Here is how they proceeded. Hand coding was out, because human labor across thousands of documents would be infeasible. Machine learning? No, because one would need labeled training data. Instead, Steve and Christine chose an “external approach” – vocabularies drawn from external sources (concept vectors). Their basic vocabularies included March’s exploitation (the familiar) vs. exploration (the distant and new) (1991), and Pennebaker’s Linguistic Inquiry and Word Count (LIWC) (2011), which distinguishes content from function words. Their level of analysis (the entities) was firms, and their documents were annual reports, used as a proxy for the firms. For document collection, they used annual reports from top-ranked firms headquartered in the USA and Taiwan. Upon collecting the documents, they converted PDF to text, losing some information (such as graphics) in the process.
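The vocabulary-counting idea can be sketched in a few lines. The word lists below are illustrative stand-ins, not the authors’ actual concept vectors, and the function name is invented for this example:

```python
import re
from collections import Counter

# Illustrative word lists in the spirit of March (1991); these are NOT
# the authors' actual concept vectors.
EXPLORATION = {"search", "variation", "risk", "experimentation", "discovery"}
EXPLOITATION = {"refinement", "efficiency", "selection", "implementation", "execution"}

def concept_features(text):
    """Count concept-vector hits per 1,000 tokens in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    explore = sum(counts[w] for w in EXPLORATION)
    exploit = sum(counts[w] for w in EXPLOITATION)
    return {"exploration_per_1k": 1000 * explore / total,
            "exploitation_per_1k": 1000 * exploit / total}

report = "Our strategy emphasizes efficiency and refinement, with some experimentation."
print(concept_features(report))
```

Each document becomes a small feature vector of normalized vocabulary counts, which can then feed a classifier.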
The first result of the inquiry was that yes, the indicators from March and LIWC were pretty good. They also found that both function and content words (included in LIWC) have predictive power in these corpora. Finally, there was meaningful overlap between the important content- and function-word predictors in English and Chinese: they found overlapping patterns in both languages.
Conclusion and Next Steps
First, they found that more externally validated concept vectors are needed. They would like to investigate methods other than classification trees for developing classifiers (e.g., regression). And they would like to analyze their unexpected findings: why do some words play such an important role in prediction? Some words take on meanings in context that differ from the original vocabulary, so they can be predictive in surprising ways.
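To make the classification-tree idea concrete, here is the simplest possible tree, a one-split “decision stump,” trained on made-up word-frequency features. This is a generic illustration of the technique, not the authors’ actual method or data:

```python
# A one-split "decision stump": the simplest classification tree.

def train_stump(X, y):
    """Pick the (feature, threshold, orientation) split that minimizes
    misclassifications on the training data."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for left_label in (0, 1):
                preds = [left_label if row[f] <= t else 1 - left_label
                         for row in X]
                errs = sum(p != yi for p, yi in zip(preds, y))
                if best is None or errs < best[0]:
                    best = (errs, f, t, left_label)
    _, f, t, left_label = best
    return lambda row: left_label if row[f] <= t else 1 - left_label

# One hypothetical feature: "exploration" word hits per 1,000 tokens.
X = [[0.8], [0.7], [0.6], [0.2], [0.1], [0.15]]
y = [1, 1, 1, 0, 0, 0]  # 1 = labeled innovative (illustrative labels)

stump = train_stump(X, y)
print(stump([0.75]))  # classifies a new document
```

A regression-based alternative would instead fit a continuous score on the same features, which is one of the directions they mention wanting to explore.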
Second Project: Using Vocabularies to Diagnose Firms
Steve and Christine would like to diagnose CSR – Corporate Social Responsibility. One question: are companies using CSR to distract customers from a lack of innovation? So they want to use the innovation vocabularies to try to predict CSR.
Fortune magazine listed the most and least socially responsible corporations from 2006 to 2012; they collected those companies’ annual reports from 2003 to 2012, for a total of 475 documents.
Drawing the two studies together, they were surprised by how much information the concept vectors captured.
Third Project: Product Intelligence
- “Value Chain Improvement” – how can a product be improved by incorporating other designs, products, and services?
- “Product Application Discovery” – given an intermediate product, where can it be used to improve existing product value chains? (For example, DuPont is all intermediate products – what can they be used for? What are some new applications?)
Steve is looking at a UN specification that classifies all products and services (4 levels, about 20,000-50,000 leaves). How about a database that lets you search for – for example – products and services that care about moisture barriers, if you’re Tyvek? Their workflow is as follows:
- Identify a topic
- Develop a vocabulary
- Get a corpus/corpora
- Obtain a taxonomy
Once they have a taxonomy, they classify the documents according to it (map the documents to the taxonomy), which yields a “categorized document base” (CDB); Steve has a patent on this. Each CDB is a small corpus.
- Put in a query
- Choose a taxonomy/taxonomies
- Map corpus to taxonomy(ies)
- Score or report on the nature of hits, ranked by “activity” (“category relevance reports”) – show the top five or so results for the category
- Some “pleasantly surprising” results every time
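The map-and-score steps above can be sketched roughly as follows. The taxonomy labels, keyword vocabularies, and overlap-count scoring are all invented for illustration and are far simpler than an actual CDB:

```python
# Toy sketch of mapping documents to taxonomy categories and ranking
# the categories by "activity" (here, simple vocabulary-overlap counts).
# All labels and keywords below are made up.

TAXONOMY = {
    "Moisture barriers": {"moisture", "barrier", "membrane", "waterproof"},
    "Packaging": {"packaging", "wrap", "container"},
    "Insulation": {"insulation", "thermal", "foam"},
}

def category_report(docs, top_n=5):
    """Score each category by keyword hits across docs; report the top few."""
    scores = {cat: 0 for cat in TAXONOMY}
    for doc in docs:
        words = set(doc.lower().split())
        for cat, vocab in TAXONOMY.items():
            scores[cat] += len(words & vocab)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]

docs = ["A breathable membrane acts as a moisture barrier",
        "New foam insulation improves thermal performance",
        "Protective wrap for moisture control"]
print(category_report(docs))
```

The ranked output plays the role of the “category relevance report”: for a query, it shows which parts of the taxonomy are most active in the hits.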
Here’s an example. In the tobacco settlement in the US (roughly 1998), all of the discovery documents are now public. A researcher took all the speeches by one CEO, analyzed them with content analysis, and traced the language use before, during, and after the settlement. (It followed the pattern of “deny” -> “own up” -> a new strategy of “we’re a great company.”) But this was just one company and one CEO.
Steve and Christine want to look at more than just one. They got the corporate reports and, using concept vectors, got qualitatively the same results – the different themes of before, during, and after. (Not entirely surprising.) It was an interesting use of the CDB that complements the new-product-discovery application.
“There you go!” The conclusion is to “categorize more richly.”
New Project: Now for something completely different . . .
Working with Jessa Lingel (Annenberg) and Johanna Schacht (former WORD LAB regular and Wharton visiting researcher), Steve is looking at how Penn can “be a good citizen” in moving Pennovation to Gray’s Ferry. Specifically, they’re looking at food deserts. Given the number of food deserts in Philly and their locations, can we develop algorithms that recommend where to place a few grocery stores? Of course, Gray’s Ferry is somewhere to start, as it’s certainly a food desert. Steve then introduced Social Mapping and Design (involving Robin Clark in Penn’s linguistics department).
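As one illustration of what such a placement algorithm might look like (not the project’s actual method), a greedy facility-location heuristic repeatedly adds the candidate site that most reduces total population-weighted distance to the nearest store. All coordinates and weights below are made up, not Philadelphia data:

```python
# Greedy facility location: pick k store sites, one at a time, each time
# choosing the candidate that most lowers total weighted travel distance.
# Block coordinates, populations, and candidate sites are invented.

def greedy_placement(blocks, candidates, k):
    """blocks: list of (x, y, population); candidates: list of (x, y)."""
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def cost(sites):
        # Total population-weighted distance to the nearest chosen site.
        return sum(pop * min(dist((x, y), s) for s in sites)
                   for x, y, pop in blocks)

    chosen = []
    for _ in range(k):
        best = min((c for c in candidates if c not in chosen),
                   key=lambda c: cost(chosen + [c]))
        chosen.append(best)
    return chosen

blocks = [(0, 0, 100), (1, 0, 100), (10, 10, 100), (11, 10, 100)]
candidates = [(0, 0), (10, 10), (20, 20)]
print(greedy_placement(blocks, candidates, k=2))
```

With two clear population clusters, the heuristic picks one site per cluster; real inputs would come from GIS layers of food-access and demographic data.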
Economics is impoverished in tools for making sense of what’s going on; we should use Social Mapping and Design. Steve would like to start with Philadelphia and bring in data from multiple sources to understand social well-being. We need a system to bring it all together for access, comparison, and so on. He envisions a broad-ranging project to build up data and make it effectively available for a variety of purposes.
Specifically, Steve is interested in using linked open data, to maximally exploit the data, and to create a mash-up of many different sources of data. He hopes for rich data and to pull in everything that becomes available. It will be GIS-based with text, but text isn’t very GIS-locatable… In any case it will be a lot of work but it’s doable.
(The Q&A was rather short before turning into a freewheeling discussion not recorded here.)
- What about involving economic historians?
- There’s the issue of change over time in GIS/geography in general as well.