Monday, April 2, 2012

HW 10: Word Cloud

Word Cloud for Pride and Prejudice
Word clouds, like the one on the right that I made with Wordle, are created by first scanning a text and figuring out which words are most common in that text but not in others, and then printing those words in a font size proportional to how often they appear. For this homework you will write a program that finds the most frequent words in a text and prints out how many times each one appears.

Your program will:
  1. Read in a text file, one word at a time.
  2. Clean up each word by throwing away any non-letter characters, such as " . , ! and ?, and converting it to lowercase.
  3. If the word is not a stop word, keep track of how many times it appears in the file.
  4. Print out the number of unique words you found, followed by the 20 most frequent words along with the number of times each appears in the file.
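Steps 1 and 2 could be sketched roughly as follows. This is only one possible approach, not the required one: the class name `WordCleaner` and the helper `clean` are made up for illustration, and the sketch reads from a fixed string rather than a file to stay short. In your program the `Scanner` would wrap a `File` instead.

```java
import java.util.Scanner;

class WordCleaner {
    // Hypothetical helper (not part of the assignment): keeps only the
    // letters of a raw token and converts them to lowercase.
    static String clean(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Scanner.next() returns one whitespace-separated token at a time.
        // Assumption: a string stands in for the input file here.
        Scanner in = new Scanner("\"Well,\" said Mr. Bennet!");
        while (in.hasNext()) {
            String word = clean(in.next());
            // A token made only of punctuation cleans to the empty string.
            if (!word.isEmpty()) {
                System.out.println(word);
            }
        }
    }
}
```

Note that a token with no letters at all, such as a lone dash, becomes an empty string after cleaning, so you will probably want to skip those before counting.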

Here is the output of the program for Pride and Prejudice:
There are 6898 unique words.
mr 783
elizabeth 594
such 393
darcy 371
mrs 343
much 328
more 326
bennet 293
miss 283
one 266
jane 263
bingley 257
know 239
before 229
herself 224
though 221
never 220
soon 216
well 212
think 211

And here is the output for Sense and Sensibility:
There are 7531 unique words.
elinor 616
mrs 525
marianne 488
more 403
such 359
one 317
much 287
herself 249
time 237
now 230
know 228
dashwood 224
though 213
sister 213
edward 210
miss 209
well 209
think 205
mother 200
before 198

The list of stop words you will use is:
private static final String[] stopWordsList = {
  "a","able","about","after","all","almost","also","am","among","an",
  "and","any","are","as","at","be","because","been","but","by","can",
  "cannot","could","dear","did","do","does","either","else","ever",
  "every","for","from","get","got","had","has","have","he","her","hers",
  "him","his","how","however","i","if","in","into","is","it","its","just",
  "least","let","like","likely","may","me","might","most","must","my",
  "neither","no","nor","not","of","off","often","on","only","or","other",
  "our","own","rather","said","say","says","she","should","since","so",
  "some","than","that","the","their","them","then","there","these","they",
  "this","tis","to","too","twas","us","very","wants","was","we","were","what",
  "when","where","which","while","who","whom","why","will","with",
  "would","yet","you","your"};

Your program will use a HashMap to keep track of the counts. You might also want to use a HashSet.
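As a rough illustration of how the two collections might fit together, here is a minimal sketch. The class `WordCounter` and its methods `count` and `top` are hypothetical names chosen for this example, and the tiny stop list in `main` stands in for the full `stopWordsList`. It assumes the words have already been cleaned, and it is a sketch of one possible design rather than the expected solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WordCounter {
    // Counts every non-empty word that is not in the stop set.
    static Map<String, Integer> count(List<String> words, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            if (w.isEmpty() || stopWords.contains(w)) {
                continue; // HashSet.contains is a fast membership test
            }
            Integer old = counts.get(w);
            counts.put(w, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Returns up to n entries, highest count first. The order among words
    // with equal counts is not specified, since HashMap has no fixed order.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Assumption: a short stand-in for the real stop word array.
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        List<String> words = Arrays.asList("the", "darcy", "mr", "darcy", "and", "mr", "darcy");
        Map<String, Integer> counts = count(words, stop);
        System.out.println("There are " + counts.size() + " unique words.");
        for (Map.Entry<String, Integer> e : top(counts, 20)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The HashSet is used for the stop words because checking membership in a set is much cheaper than scanning the array for every word in a novel; the full array can be loaded the same way, with `new HashSet<>(Arrays.asList(stopWordsList))`.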

This homework is due Monday, April 9 at noon in the dropbox at dropbox.cse.sc.edu.
