Case Study CS00005: Counting words
The problem: Given some text, count the number
of occurances of each word.
The solution:
- Read the text from the file
- Find the indices of each unique word
- List the unique words
- Report the number of occurrances of each word by descending count
Note: You can cut and paste these code fragments into the code pane of
the Glee interpreter and experiment as you go along to see the actual
operations live.
The Glee code:
$$ Read the file
'c:\Glee\Website\CaseStudies\' #fc => path;
'AesopsFables.txt'#file path => f;f[]=>t;
13#asc =>nl;
$$ Read an article from the text.
t \ (t @== ``&'FABLE:') <-_1 => t;
t[t @& ^&'The Peacock and Juno'< ]< =>text;
$$ Convert to Lower Case and segment by word
text %\ \& =>w;
$$ Report total and unique word counts
(w#) ' total words' $;(w & => u#) ' unique words' $;
$$ Report the unique words
'UNIQUE WORDS ARE: ' u ,,$;
'-----------------------'$;
w ``& => i; i /# => cnts;
'Words Used Once:'(cnts *= 1 \+) $;
'-----------------------'$;
'Word Counts:'$;
$$ Sort word items by descending count
i[cnts ``<]=> i;
$$ Loop through indices producing report
:for(i){{
i< => i; i# =>n;
w[i <-]':' n ' time'('s'[n *> 1])nl
}} $;
The Output:
57 total words
45 unique words
UNIQUE WORDS ARE:
a addition and attractions be before bird but cannot content desiring
everything fable favourite first have he her his in juno lot nightingale of
once one other out peacock persisted petition placed pointed refused request
said she that the to voice was when with your
-----------------------
Words Used Once:35
-----------------------
Word Counts:
a:3 times
juno:3 times
and:2 times
...
...
was:1 time
when:1 time
with:1 time
your:1 time
The play-by-play:
- $$ Read the file
'c:\Glee\Website\CaseStudies\' #fc => path;
'AesopsFables.txt'#file path => f;f[]=>t;
13#asc =>nl;
$$ Read an article from the text.
t \ (t @== ``&'FABLE:') <-_1 => t;
t[t @& ^&'The Peacock and Juno'< ]< =>text;
This code is fashioned from CS00004.
- $$ Convert to Lower Case and segment by word
text %\ \& =>w;
The %\ with a string argument reads
%Convert
to \Lower case. The \&
with a string argument reads \Segment by &Words. The
result we =>assign to w.
- $$ Report total and unique word counts
(w#) ' total words' $;(w & => u#) ' unique words' $;
In w# , the #counts the number
of items in the sequence (in our case it is words in the article). In
(w & => u#), the
&Returns
unique items in the sequence. These are our unique words.
- $$ Report the unique words
'UNIQUE WORDS ARE: ' u ,,$;
'-----------------------'$;
w ``& => i; i /# => cnts;
In w ``& => i, the
``& Yields
indices of items in w which
are identical. The result is a sequence where each item is the indices for each
unique item. Using the first index in the item, we can index into the original
sequence and learn which original item it stands for. This is much harder to
say than it is to see. Experiment and look at the results. In
i /# => cnts; ,
the /# counts the
contents of each item i .
These are our word counts since each item contains the indices to all
occurrence of a particular word.
- 'Words Used Once:'(cnts *= 1 \+) $;
'-----------------------'$;
'Word Counts:'$;
$$ Sort word items by descending count
i[cnts ``<]=> i;
In (cnts = 1 \+) , the
\+ reduces by
summation the bit vector produced by marking
* where equal 1=1cnts *= 1 . There is a 1 for every occurrence of
cnts equal to 1 and a 0
otherwise. In i[cnts
<<] => i , the ``< yields a indices
of descending grade vector for cnt. This is a vector of indices. When we index
out of i with i[cnts <<] => i; we are
sorting the sequence of indices according to the number of indices in each
item. Thus we are sorting by descending occurrence of words.
- $$ Loop through indices producing report
:for(i){{
i< => i; i# =>n;
w[i <-]':' n ' time'('s'[n *> 1])nl
}} $;
The :for(i){...} loop delivers an item of
i to the insulated block{{...}} by indexing out each element of
i in turn. Since indexing is used
and we are indexing out of a sequence, the object delivered is a one item
sequence. In i< => i; ,
the i< discloses the item and we assign => i; directly back to
i . If this is uncomfortable to you, you
can assign to a variable of your choice (e.g. idx). When in an
insulated block{{...}}, objects outside
can be seen but not changed. So here we see and disclose the
i from outside, but when we assign it,
we are creating a new i in the block's
namespace. From that point on, the i we
see is this new "local" version ... not the external one.
In w[i <-]; , the
<- takes the
first element of i and we use it
to index out of "w[i <-]"
our word list w.
In ('s'[n>1]) , we are indexing out of
a string containing only the letter 's'.
We do this with a bit which is 1
if n (which is the number of
occurrences of the word) is > Greater Than
1 . This, in ' time'('s'[n>1]), adds the "s" which English likes appended for
plurals.
In w[i <-]':' n ' time'('s'[n *> 1])nl
we just strand our objects and implicitly form a sequence. Such a
sequence will naturally display itself by converting each item to a string and
then catenate the strings. We do no explicit formatting work at all.
This completes the example. To better understand these operators and other
things you can do with them, consult the operator pages according to the type of
data you see being operated on.