Monday, April 2, 2012

Lab 21: Disjoint Sets

For this lab you will extract the set of unique words in two files and then determine the set of words that is in one file but not the other. We will be using Pride and Prejudice and Sense and Sensibility.

We want to find all the unique words in these texts, that is, the set of all words that appear at least once in a text. You will read the words using Scanner.next(). As you noticed in a previous lab, the .next() method only stops at whitespace, which means that it will read things like "house." and "don't!" as one word. This presents a problem when trying to find the set of unique words.
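To see the problem concretely, here is a minimal sketch that runs Scanner.next() over a short string instead of a file. The class name ScannerTokens and the helper tokens() are made up for this illustration and are not part of the lab.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not part of the lab code.
public class ScannerTokens {
    // Collects every token that Scanner.next() returns for the given text.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Scanner in = new Scanner(text);
        while (in.hasNext()) {
            result.add(in.next());
        }
        in.close();
        return result;
    }

    public static void main(String[] args) {
        // Only whitespace separates tokens, so punctuation stays attached.
        System.out.println(tokens("My house. I don't!")); // [My, house., I, don't!]
    }
}
```

Here "house." and "house" would count as two different words if both were added to a set as they are.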

We can fix this problem by cleaning the words we read: removing any non-letter character from them. A simple way of doing that is by using the following method:
    /**
     * Clean up word by removing any non-letter characters from it and making it lower case
     * @param word the word to clean up
     * @return the cleaned up word
     */
    public String cleanWord(String word) {
        // It is OK to use a regular String with + to create the new string, in CSCE145.
        // However, StringBuilder is much faster in this case because it does not
        // create a new String each time we append.
        StringBuilder w = new StringBuilder();
        char[] chars = word.toLowerCase().toCharArray();
        for (char c : chars) {
            if (Character.isLetter(c)) {
                w.append(c);
            }
        }
        return w.toString();
    }
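
As a quick check of what cleanWord does to typical tokens, the sketch below copies the method into a small stand-alone class. The class name CleanWordDemo is invented for this example, and the method is made static only so that main can call it directly.

```java
// Hypothetical demo class, not part of the lab code.
public class CleanWordDemo {
    // Same logic as cleanWord above, made static for this stand-alone demo.
    static String cleanWord(String word) {
        StringBuilder w = new StringBuilder();
        for (char c : word.toLowerCase().toCharArray()) {
            if (Character.isLetter(c)) {
                w.append(c);
            }
        }
        return w.toString();
    }

    public static void main(String[] args) {
        System.out.println(cleanWord("House."));     // house
        System.out.println(cleanWord("don't!"));     // dont
        System.out.println(cleanWord("half-hours")); // halfhours
        // A token with no letters at all becomes the empty string.
        System.out.println(cleanWord("1813").isEmpty()); // true
    }
}
```

Note that hyphens and apostrophes are removed too, so the pieces of a hyphenated word are joined together, and that a token made only of digits or punctuation cleans to an empty string.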

Your code for this lab will read and clean all the words from both Pride and Prejudice and Sense and Sensibility into a HashSet of words for each book. It will then print out how many unique words there are in each, as well as the number of words in one that are not in the other. The output I got is:
Number of unique words in Pride and Prejudice: 7016
Number of unique words in Sense and Sensibility: 7649
Number of Words in Sense and Sensibility but not in Pride and Prejudice: 3070
Number of Words in Pride and Prejudice but not in Sense and Sensibility: 2437
They are:
chambermaid
quadrille
holding
affords
luckless
extractions
quarrelsome
yesthere
halfhours
surmise
undervaluing
furnish
bribery
httpgutenbergorglicense
...and so on...your ordering might be different from mine...
jenkinson
george
breakfasted
hedge
musicbooks
coquetry
term
boulanger
halfamile
distrusted
cluster
culprit
phillipses
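
The "in one but not in the other" step is a set difference. One possible way to compute it, shown here on two tiny hand-made sets rather than the real books, is to copy the first set and call removeAll with the second. The class name SetDifference and the helper difference() are invented for this sketch; your own solution may be organized differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, not part of the lab code.
public class SetDifference {
    // Returns the elements of a that are not in b, leaving both inputs unchanged.
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a); // copy, so that a is not modified
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> first = new HashSet<>(Arrays.asList("pride", "and", "prejudice"));
        Set<String> second = new HashSet<>(Arrays.asList("sense", "and", "sensibility"));

        Set<String> onlyInFirst = difference(first, second);
        System.out.println(onlyInFirst.size());                // 2
        System.out.println(onlyInFirst.contains("prejudice")); // true
        System.out.println(onlyInFirst.contains("and"));       // false
        System.out.println(first.size());                      // 3, the original set is untouched
    }
}
```

HashSet makes no promise about iteration order, which is why the words may print in a different order on your machine.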

As always, your lab is due on dropbox.cse.sc.edu.

1 comment:

jmvidal said...

I just updated the output of my program. Turns out I was using Emma instead of Sense and Sensibility, ooops.