Chapter 2


2. We interpreted the second part of the question, about "word types," to mean "unique words in the text." The generator expression in the second part works as follows: iterate over the list of words in the variable persuasion and convert each word to lowercase. Wrapping this in set() then creates a collection with no duplicates; lowercasing first ensures that capitalized and lowercase spellings are not counted as different types (otherwise Python would treat "The" and "the" as distinct items). Finally, we take the length of the set.

import nltk
persuasion = nltk.corpus.gutenberg.words("austen-persuasion.txt")
len(persuasion)                                  # number of words (tokens)
len(set(w.lower() for w in persuasion))          # number of unique words ("word types")

3. To sample text from two genres of the Brown Corpus, we tried:

from nltk.corpus import brown
brown.words(categories=['humor', 'mystery'])

To access sample text from the Brown (or Web text) corpus readers by file ID:

sampletext = brown.words(fileids=['ca01', 'ca02'])
print(sampletext[:5])

4. Use the state_union corpus reader to count occurrences of 'men', 'women', and 'people' in each State of the Union address. What has happened to the usage of these words over time?

import nltk
from nltk.corpus import state_union as su
cfd = nltk.ConditionalFreqDist(
	(target, fileid[:4])                 # condition: target word; sample: year taken from the fileid
	for fileid in su.fileids()
	for word in su.words(fileids=fileid)
	for target in ["men", "women", "people"]
	if word.lower().startswith(target))
cfd.plot()

The plot shows many peaks and valleys. "People" spiked in the mid-1990s, while "women" was barely used in 1953-1955 and 1970-1975. In 1969, "men" and "women" were used more often while "people" decreased; in 1963 and 1964, "men" was used almost as much as "people," the only time this happens on the chart. Not surprisingly, mentions of women rarely exceed those of men.
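
To check particular years numerically rather than reading them off the plot, the same conditional frequency distribution can be tabulated. This is a minimal sketch; the sample years shown are assumptions and depend on which State of the Union files are present in your NLTK data.

cfd.tabulate(conditions=["men", "women", "people"],
             samples=[str(year) for year in range(1945, 2006, 10)])   # counts per decade-ish year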

5. In this case, "branch" is the meronym and "tree" is the holonym.

member_meronyms() - relates a group to its members (e.g. a tree is a member of a forest)
part_meronyms() - relates a whole to its parts (e.g. a finger is a part of a hand)
substance_meronyms() - relates a thing to the substance it is made of (e.g. macaroni is made of wheat)
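
A quick way to see these relations in WordNet, using the tree.n.01 example from the chapter (the exact synsets returned depend on your WordNet version):

from nltk.corpus import wordnet as wn
tree = wn.synset('tree.n.01')
tree.part_meronyms()        # parts of a tree: trunk, limb, stump, ...
tree.substance_meronyms()   # what a tree is made of: heartwood, sapwood
tree.member_holonyms()      # what trees are members of: forest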

6. What problems might arise from the 'translate' object? What would you suggest to improve it?

The problem is that the mapping does not distinguish between parts of speech (verb, noun) or word senses (01, 02, etc.). We would suggest tying word senses and parts of speech to the translations, rather than mapping bare words with no distinguishing context. The set of candidate senses could also be narrowed by looking at the local topic.
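
To illustrate, the translate object from the chapter is just a plain dict built from the Swadesh wordlists, so each foreign word maps to a single English string with no part-of-speech or sense information attached. A minimal sketch:

from nltk.corpus import swadesh
fr2en = swadesh.entries(['fr', 'en'])    # (French, English) word pairs
translate = dict(fr2en)
translate['chien']                       # 'dog' -- no indication of POS or sense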

7. Strunk & White recommend that "however," used at the start of a sentence, should mean "in whatever way" or "to whatever extent," not "nevertheless." Use the concordance tool to study actual usage of this word.

import nltk
nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt')).concordance('however', lines=200)
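
Since the advice concerns "however" at the start of a sentence, it can also help to count sentence-initial occurrences directly. A rough sketch; the count depends on NLTK's sentence tokenization of the corpus:

from nltk.corpus import gutenberg
emma_sents = gutenberg.sents('austen-emma.txt')
sum(1 for sent in emma_sents if sent[:1] == ['However'])   # sentences beginning with 'However'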

8.

import nltk
from nltk.corpus import names
cfd = nltk.ConditionalFreqDist(
    (fileid, name[0])                 # condition: male.txt / female.txt; sample: initial letter
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()
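
To read off exact counts for particular initials instead of eyeballing the plot, the distribution can be tabulated; a minimal sketch (note that initials in the names corpus are uppercase):

cfd.tabulate(samples=['A', 'E', 'K', 'S'])   # counts of names starting with these letters, by file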

9. We compared the vocabulary and vocabulary richness of Moby Dick and Paradise Lost from the Gutenberg corpus.

Number of words:

from nltk.corpus import gutenberg
len(gutenberg.words('milton-paradise.txt'))
len(gutenberg.words('melville-moby_dick.txt'))

Number of unique words ignoring case:

len(set(w.lower() for w in gutenberg.words('milton-paradise.txt')))
len(set(w.lower() for w in gutenberg.words('melville-moby_dick.txt')))

Vocabulary richness of one of the texts (words per word type; a higher value means each word type is reused more often):

round(len(gutenberg.words('milton-paradise.txt')) / len(set(w.lower() for w in gutenberg.words('milton-paradise.txt'))))

(We questioned the point of doing this. What does lexical diversity actually tell us? Is it an informative measure?)

When investigating genre, we decided to see whether we could predict genre from this measurement. We found lexical diversity to be uninformative: poetry collections scored anywhere from 5 to 12 and novels anywhere from 12 to 26, so the measure was not predictive of genre (literary form).
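
One way to compute the same ratio for every Gutenberg text is a short loop like the sketch below; the rounded values depend on the corpus version and tokenization.

from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    words = gutenberg.words(fileid)
    ratio = len(words) / len(set(w.lower() for w in words))   # words per word type
    print(fileid, round(ratio))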

12.

import nltk
from nltk.corpus import cmudict
entries = cmudict.entries()
count = len(entries)                      # total number of (word, pronunciation) entries
print(count)
words = set()
for word, pron in entries:
    words.add(word)
unique = len(words)                       # number of distinct words
print(unique)
lexical_diversity = unique / count        # distinct words per entry; < 1 because some words have several pronunciations
print(lexical_diversity)


13. 79.67 percent of the noun synsets have no hyponyms. We think this is because hyponyms are more specific, and there are more specific words than there are broader words.

from nltk.corpus import wordnet as wn
len(list(wn.all_synsets('n')))

82115

i = 0
for element in wn.all_synsets('n'):
	if element.hyponyms():          # count noun synsets that have at least one hyponym
		i += 1
i

16693

1 - (16693 / 82115)

0.7967119283931072

14. We did this, but we think it would be more useful to generate lists that separate the definition, hyponyms, and hypernyms (a sketch of that variant follows the example output below).

from nltk.corpus import wordnet as wn

def syn_def(a):
	# concatenate the synset's definition with the definitions of its hypernyms and hyponyms
	definitions = wn.synset(a).definition()
	for element in wn.synset(a).hypernyms():
		definitions += (' ' + element.definition())
	for element in wn.synset(a).hyponyms():
		definitions += (' ' + element.definition())
	return definitions


And this:

syn_def('cat.n.01')

Returns this:

u'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats any of various lithe-bodied roundheaded fissiped mammals, many with retractile claws any domesticated member of the genus Felis any small or medium-sized cat resembling the domestic cat and living in the wild'
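
A sketch of the variant suggested above, returning the definitions in separate lists instead of one concatenated string (syn_def_split is a hypothetical name):

from nltk.corpus import wordnet as wn

def syn_def_split(a):
	# return the synset's own definition plus separate lists of
	# hypernym and hyponym definitions
	s = wn.synset(a)
	return {'definition': s.definition(),
	        'hypernyms': [h.definition() for h in s.hypernyms()],
	        'hyponyms': [h.definition() for h in s.hyponyms()]}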


15.

import nltk
from nltk.corpus import brown
freqdist = nltk.FreqDist(w.lower() for w in brown.words())
words = []
for word in freqdist:
	if freqdist[word] >= 3:          # keep words occurring at least three times
		words.append(word)
print(len(words))
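
The same result can be had with a list comprehension over the frequency distribution:

words = [w for w in freqdist if freqdist[w] >= 3]
print(len(words))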

16.

import nltk
from nltk.corpus import brown

def lexical_diversity(text):
	# proportion of distinct word types among all tokens
	return len(set(text)) / float(len(text))

diversities = {}
for cat in brown.categories():
	diversities[cat] = lexical_diversity(brown.words(categories=cat))
print(diversities)

17.

import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords
stoplist = stopwords.words('english')
text = brown.words(categories='news')
frequencies = nltk.FreqDist(w.lower() for w in text)
for word in stoplist:
    if word in frequencies:
        frequencies.pop(word)        # drop stopwords from the distribution
top_fifty = frequencies.most_common(50)
print(top_fifty)
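
Alternatively, stopwords can be filtered out while building the frequency distribution; if punctuation should be excluded as well, an isalpha() check can be added (a sketch):

frequencies = nltk.FreqDist(w.lower() for w in text
                            if w.lower() not in stoplist and w.isalpha())
top_fifty = frequencies.most_common(50)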

18.

import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords
stoplist = stopwords.words('english')
text = brown.words(categories='news')
text = [w.lower() for w in text]
bigrams = list(nltk.bigrams(text))
filtered_bigrams = []
for pair in bigrams:
    if pair[0] not in stoplist and pair[1] not in stoplist:   # keep only bigrams with no stopword
        filtered_bigrams.append(pair)
frequencies = nltk.FreqDist(filtered_bigrams)
top_fifty = frequencies.most_common(50)
print(top_fifty)

You could also do a list comprehension:

filtered_bigrams = [pair for pair in bigrams if pair[0] not in stoplist and pair[1] not in stoplist]

19.

import nltk
from nltk.corpus import brown

labeled_words = []
for genre in brown.categories():
    for word in brown.words(categories=genre):
        labeled_words.append((genre, word))      # (condition, sample) pairs

cfd = nltk.ConditionalFreqDist(labeled_words)
cfd.tabulate(conditions=brown.categories(), samples=['tree', 'bush', 'bird'])

21. Our strategy was to count every pronunciation code that starts with a vowel letter as one syllable, and on that basis to count the number of syllables in the Genesis text included in NLTK.

import nltk
from nltk.corpus import genesis
prondict = nltk.corpus.cmudict.dict()
kjv = genesis.words('english-kjv.txt')
vowels = 'AEIOU'                        # CMU vowel phones all begin with one of these letters
syllable_count = 0
for word in kjv:
    word = word.lower()                 # cmudict keys are lowercase
    if word in prondict:
        word_syllables = 0
        for ph in prondict[word][0]:    # use the first listed pronunciation
            if ph[0] in vowels:
                word_syllables += 1
        syllable_count += word_syllables
syllable_count

22.

inputText = 'Define a function hedge(text) which processes a text and produces a new version with the word like between every third word.'
def hedge(text):
    # walk through the string character by character, inserting 'like ' after every third space
    outputText = []
    count = 0
    for ch in text:
        outputText.append(ch)
        if ch == ' ':
            count += 1
            if count == 3:
                outputText.append('like ')
                count = 0
    return ''.join(outputText)

run = hedge(inputText)
print(run)
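
A shorter alternative is to operate on a list of words rather than a raw string; this is a sketch (hedge_words is a hypothetical name; the text would first be split into tokens, e.g. with split() or nltk.word_tokenize):

def hedge_words(words):
    # insert 'like' after every third word in a list of tokens
    out = []
    for i, w in enumerate(words, 1):
        out.append(w)
        if i % 3 == 0:
            out.append('like')
    return out

print(' '.join(hedge_words(inputText.split())))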