Case Study CS00005: Counting words

The problem: Given some text, count the number of occurances of each word.

The solution:

  1. Read the text from the file
  2. Find the indices of each unique word
  3. List the unique words
  4. Report the number of occurrances of each word by descending count

Note: You can cut and paste these code fragments into the code pane of the Glee interpreter and experiment as you go along to see the actual operations live.

The Glee code:

$$ Read the file
'c:\Glee\Website\CaseStudies\' #fc => path;
'AesopsFables.txt'#file path => f;f[]=>t;
13#asc =>nl;

$$ Read an article from the text.
t \ (t @== ``&'FABLE:') <-_1 => t;
t[t @& ^&'The Peacock and Juno'< ]< =>text;

$$ Convert to Lower Case and segment by word
text %\ \& =>w;

$$ Report total and unique word counts
(w#) ' total words' $;(w & => u#) ' unique words' $;

$$ Report the unique words
'UNIQUE WORDS ARE: ' u ,,$;
'-----------------------'$;
w ``& => i; i /# => cnts;
'Words Used Once:'(cnts *= 1 \+) $;
'-----------------------'$;
'Word Counts:'$;

$$ Sort word items by descending count
i[cnts ``<]=> i;

$$ Loop through indices producing report
:for(i){{
  i< => i; i# =>n;
  w[i <-]':' n ' time'('s'[n *> 1])nl
  }} $;

The Output:
57 total words
45 unique words
UNIQUE WORDS ARE:
a addition and attractions be before bird but cannot content desiring everything fable favourite first have he her his in juno lot nightingale of once one other out peacock persisted petition placed pointed refused request said she that the to voice was when with your

-----------------------
Words Used Once:35
-----------------------
Word Counts:
a:3 times
juno:3 times
and:2 times
...
   ...
was:1 time
when:1 time
with:1 time
your:1 time

The play-by-play:

  1. $$ Read the file
    'c:\Glee\Website\CaseStudies\' #fc => path;
    'AesopsFables.txt'#file path => f;f[]=>t;
    13#asc =>nl;

    $$ Read an article from the text.
    t \ (t @== ``&'FABLE:') <-_1 => t;
    t[t @& ^&'The Peacock and Juno'< ]< =>text;

    This code is fashioned from CS00004.
     
  2. $$ Convert to Lower Case and segment by word
    text %\ \& =>w;

    The %\ with a string argument reads %Convert to \Lower case. The \& with a string argument reads \Segment by &Words. The result we =>assign to w.
     
  3. $$ Report total and unique word counts
    (w#) ' total words' $;(w & => u#) ' unique words' $;

    In w# , the #counts the number of items in the sequence (in our case it is words in the article). In (w & => u#), the &Returns unique items in the sequence. These are our unique words.
     
  4. $$ Report the unique words
    'UNIQUE WORDS ARE: ' u ,,$;
    '-----------------------'$;
    w ``& => i; i /# => cnts;

    In w ``& => i, the ``& Yields indices of  items in w which are identical. The result is a sequence where each item is the indices for each unique item. Using the first index in the item, we can index into the original sequence and learn which original item it stands for. This is much harder to say than it is to see. Experiment and look at the results. In i /# => cnts; , the /# counts the contents of  each item i . These are our word counts since each item contains the indices to all occurrence of a particular word.
     
  5. 'Words Used Once:'(cnts *= 1 \+) $;
    '-----------------------'$;
    'Word Counts:'$;

    $$ Sort word items by descending count
    i[cnts ``<]=> i;


    In (cnts = 1 \+) , the \+ reduces by summation the bit vector produced by marking * where equal 1=1cnts *= 1 . There is a 1 for every occurrence of cnts equal to 1 and a 0 otherwise. In i[cnts <<] => i , the ``< yields a indices of descending grade vector for cnt. This is a vector of indices. When we index out of i with i[cnts <<] => i; we are sorting the sequence of indices according to the number of indices in each item. Thus we are sorting by descending occurrence of words.
     
  6. $$ Loop through indices producing report
    :for(i){{
      i< => i; i# =>n;
      w[i <-]':' n ' time'('s'[n *> 1])nl
      }} $;

    The :for(i){...} loop delivers an item of i to the insulated block{{...}} by indexing out each element of i  in turn. Since indexing is used and we are indexing out of a sequence, the object delivered is a one item sequence. In i< => i; , the i< discloses the item and we assign => i; directly back to i . If this is uncomfortable to you, you can assign to a variable of your choice (e.g. idx). When in an insulated block{{...}}, objects outside can be seen but not changed. So here we see and disclose the i from outside, but when we assign it, we are creating a new i in the block's namespace. From that point on, the i we see is this new "local" version ... not the external one.

    In w[i <-]; , the <- takes the first element of i and we use it to index out of "w[i <-]" our word list w

    In ('s'[n>1]) , we are indexing out of a string containing only the letter 's'. We do this with a bit which is 1 if n (which is the number of occurrences of the word) is > Greater Than 1 . This, in ' time'('s'[n>1]), adds the "s" which English likes appended for plurals.

    In w[i <-]':' n ' time'('s'[n *> 1])nl we just strand our objects and implicitly form a sequence. Such a sequence will naturally display itself by converting each item to a string and then catenate the strings. We do no explicit formatting work at all.

This completes the example. To better understand these operators and other things you can do with them, consult the operator pages according to the type of data you see being operated on.