We want to find all the unique words in these texts, that is, the set of all words that appear at least once in the text. You will read the words using Scanner.next(). As you noticed in a previous lab, the .next() method only stops at whitespace, which means that it will read things like "house." and "don't!" as one word. This presents a problem when trying to find the set of unique words, because "house." and "house" would be counted as two different words. We can fix this problem by cleaning the words we read: removing any non-letter characters from them and making them lower case. A simple way of doing that is by using the following method:
/**
 * Clean up word by removing any non-letter characters from it and making it lower case
 * @param word
 * @return the cleaned up word
 */
public String cleanWord(String word) {
    // It is OK to use a regular String with + to create the new string, in CSCE145.
    // However, StringBuilder is much faster in this case because it does not
    // create a new String each time we append.
    StringBuilder w = new StringBuilder();
    char[] chars = word.toLowerCase().toCharArray();
    for (char c : chars) {
        if (Character.isLetter(c))
            w.append(c);
    }
    return w.toString();
}
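To see what cleaning does to the tokens that Scanner.next() returns, here is a small sketch. The class name CleanWordDemo and the helper cleanTokens are made up for illustration, and cleanWord is made static only so that main can call it. Skipping tokens that clean to an empty string is my own assumption, not something the lab asks for, and it could change your counts by one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; the name is not part of the lab.
class CleanWordDemo {

    // Same logic as the cleanWord method above, made static for this demo.
    static String cleanWord(String word) {
        StringBuilder w = new StringBuilder();
        for (char c : word.toLowerCase().toCharArray()) {
            if (Character.isLetter(c)) {
                w.append(c);
            }
        }
        return w.toString();
    }

    // Hypothetical helper: reads whitespace-separated tokens from text
    // and returns the cleaned ones, in order.
    static List<String> cleanTokens(String text) {
        List<String> cleaned = new ArrayList<>();
        Scanner in = new Scanner(text);
        while (in.hasNext()) {
            String word = cleanWord(in.next());
            // Assumption: drop tokens such as "--" that clean to "".
            if (!word.isEmpty()) {
                cleaned.add(word);
            }
        }
        in.close();
        return cleaned;
    }

    public static void main(String[] args) {
        // Prints [the, house, dont, she, said]
        System.out.println(cleanTokens("The House. -- \"Don't!\" she said"));
    }
}
```

Notice that "Don't!" becomes "dont": the apostrophe is not a letter, so it is removed along with the punctuation.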
Your code for this lab will read and clean all the words from both Pride and Prejudice and Sense and Sensibility into a HashSet of words for each book. It will then print out how many unique words there are in each, as well as the number of words in one that are not in the other. The output I got is:

Numer of unique words in Pride and Prejudice: 7016
Numer of unique words in Sense and Sensibility: 7649
Number of Words in Sense and Sensibility but not in Pride and Prejudice : 3070
Number of Words in Pride and Prejudice but not in Sense and Sensibility: 2437
They are:
chambermaid quadrille holding affords luckless extractions quarrelsome yesthere halfhours surmise undervaluing furnish bribery httpgutenbergorglicense
...and so on...your ordering might be different from mine...
jenkinson george breakfasted hedge musicbooks coquetry term boulanger halfamile distrusted cluster culprit phillipses
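One possible way to get the "in one but not in the other" counts is to copy one set and remove everything that appears in the other. The sketch below is only an illustration, not the required solution: the class name UniqueWordsSketch and the helpers uniqueWords and difference are invented here, and it reads two short strings instead of the books. In your lab you would give the Scanner the book files instead.

```java
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical sketch class; the name is not part of the lab.
class UniqueWordsSketch {

    // Same logic as the cleanWord method from the lab, made static for this sketch.
    static String cleanWord(String word) {
        StringBuilder w = new StringBuilder();
        for (char c : word.toLowerCase().toCharArray()) {
            if (Character.isLetter(c)) {
                w.append(c);
            }
        }
        return w.toString();
    }

    // Reads every token from the scanner, cleans it, and collects the results in a set.
    // A set keeps only one copy of each word, so its size is the number of unique words.
    static Set<String> uniqueWords(Scanner in) {
        Set<String> words = new HashSet<>();
        while (in.hasNext()) {
            words.add(cleanWord(in.next()));
        }
        return words;
    }

    // Returns the words that are in a but not in b, without changing a or b.
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a); // copy first, since removeAll modifies the set
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Two short stand-in texts; the real lab reads the two novels.
        Set<String> first = uniqueWords(new Scanner("It is a truth universally acknowledged."));
        Set<String> second = uniqueWords(new Scanner("It is a fact; universally known!"));

        System.out.println("Unique words in first: " + first.size());   // 6
        System.out.println("Unique words in second: " + second.size()); // 6
        System.out.println("In first but not in second: " + difference(first, second).size()); // 2
        for (String word : difference(first, second)) {
            System.out.println(word); // HashSet order is not guaranteed
        }
    }
}
```

Copying the set before calling removeAll matters: calling first.removeAll(second) directly would destroy the first set, and you still need it to compute the difference in the other direction.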
As always, your lab is due on dropbox.cse.sc.edu.
1 comment:
I just updated the output of my program. Turns out I was using Emma instead of Sense and Sensibility, ooops.