GLEE Case study: CS00007: Sort Mixed Char and Numeric

Case Study CS00007:

The problem: Convert the following data to numeric values. Then sort the original data in ascending and then descending order displaying each.

The Data:

40 GB;120Gb; 0.9 gb; 256MB;1000mb; .5 mb; 512kb;

The solution:

Segment the data string into a sequence seq breaking on ";"
Condition the data by removing blanks and converting to uppercase;
Capture the numeric portion in amt;
Convert the character portion to a scale factor;
Multiply amt by scale (10 raised to the matching power) to get size;
Reorder the sequence by grading the sizes vector.

Note: You can cut and paste these code fragments into the code pane of the Glee interpreter and experiment as you go along to see the actual operations live.

The Glee code:

'40 GB;120Gb;0.9 gb;256MB;1000mb;.5 mb;512kb;'~ %/ =>data; 'gb' 'mb' 'kb'=>cs; 9 6 3=>ns; data \|';'=>seq; 'cvt'#pgm{ x< =>x; x[x*|'0123456789.'].num => amt; x\& =>c; 10^(ns[cs`c]) => scale; amt*scale => size}; :for(seq[.x]){x[.x]cvt}< => sizes; 'Sizes (MB):'(sizes/1e6)$; 'Ascending: '(seq[sizes ``>],,)$; 'Descending:'(seq[sizes ``<],,)$;

The Output:
Sizes (MB):40000 120000 900 256 1000 0.5 0.512
Ascending: .5MB; 512KB; 256MB; 0.9GB; 1000MB; 40GB; 120GB;
Descending:120GB; 40GB; 1000MB; 0.9GB; 256MB; 512KB; .5MB;

The play-by-play:

['40 GB;120Gb;0.9 gb;256MB;1000mb;.5 mb;512kb;'~ %/ =>data;]:
Beginning with the raw data we remove blanks (~) and convert to uppercase (%/) before saving (=>) in data.
['gb' 'mb' 'kb'=>cs;9 6 3 =>ns;]:
Create a sequence of strings for three popular byte scale sizes ('gb' 'mb' 'kb'=>cs;) and the corresponding powers of 10 (9 6 3=>ns;).
[data \|';'=>seq;]:
Break the string of data into a sequence (data \|) using the semicolon as the delimiter (';') and save the result (=>seq;).
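For readers who do not have the Glee interpreter at hand, the conditioning and segmenting steps can be sketched in Python. This is only an analogue, not Glee: the names raw, data and seq are chosen to mirror the walkthrough, and the sketch drops the delimiter, whereas the Glee output above appears to keep the semicolon on each element.

```python
# Python analogue of the conditioning and segmenting steps (not Glee).
raw = '40 GB;120Gb;0.9 gb;256MB;1000mb;.5 mb;512kb;'

# Remove blanks and convert to uppercase, as (~) and (%/) are described to do.
data = raw.replace(' ', '').upper()

# Break on the semicolon. The trailing ';' leaves an empty last piece,
# which this sketch discards.
seq = [piece for piece in data.split(';') if piece]

print(seq)
# ['40GB', '120GB', '0.9GB', '256MB', '1000MB', '.5MB', '512KB']
```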
['cvt'#pgm{...};]:
Create the conversion program detailed next.
[x< =>x;]:
Each element arrives as a single-element sequence, so we disclose it (x<) to get a fundamental data object, which we assign back to x (=>x;). Reusing the name works because this assignment is local to the program block.
[x[x *| '0123456789.'].num => amt;]:
To get the numeric part of each element we index out numeric characters. We do this by first marking them (x *| '0123456789.') and then using the resulting bit vector to do the indexing (x[...]). This leaves a string result to which we apply the (.num) method making it numeric. This we assign (=> amt;).
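The mark-then-index idea can be sketched in Python as follows. It is an analogue under assumptions, not Glee: the list named mask stands in for the bit vector, and float() stands in for the (.num) method.

```python
# Python analogue of marking numeric characters and indexing with the marks.
x = '0.9GB'
numeric_chars = '0123456789.'

# One boolean per character: True where the character is numeric.
mask = [ch in numeric_chars for ch in x]

# Keep only the marked characters, then convert the remaining string to a number.
amt = float(''.join(ch for ch, keep in zip(x, mask) if keep))

print(mask)  # [True, True, True, False, False]
print(amt)   # 0.9
```

The same steps handle an element such as '.5MB', because a string like '.5' converts cleanly to 0.5.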
[ x \& =>c;]:
(x \&) breaks our string into a sequence of words which we save (=>c).
[10^(ns[cs`c]) => scale;]:
We get quite a bit of work done in this little statement. We're raising 10 (10^) to the power returned by (ns[cs`c]). Here we're indexing (ns) (the numeric scale factors) at the position where (c) matches its corresponding entry in (cs) (the character scale factors). We determine this with (cs`c), which finds the index of (c) in (cs). That index we then use in (ns[...]) to find the corresponding numeric scale. This is applied as our power of ten (10^(ns[cs`c])) which we save (=>scale).
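The parallel-lookup pattern in this statement can be sketched in Python. This is an analogue, not Glee. One assumption to flag: the sketch stores the unit names in uppercase so that they match the uppercased data, since how Glee's index-of treats case is not stated here.

```python
# Python analogue of the parallel lookup: find the unit in cs, reuse the index in ns.
cs = ['GB', 'MB', 'KB']   # character scale factors (uppercase is an assumption)
ns = [9, 6, 3]            # matching powers of ten

c = 'MB'                  # unit taken from an element such as '256MB'
position = cs.index(c)    # index of c within cs
scale = 10 ** ns[position]

print(position, scale)    # 1 1000000
```

Multiplying the numeric part by this scale then gives the size in bytes, for example 256 * 1000000 for '256MB'.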
[amt*scale => size};]:
No explanation needed here except that since this statement is not followed by a semicolon before closing the block (}), (size) becomes the result of our conversion. We really didn't have to give it a name and could have used just (amt*scale};) instead.
[:for(seq[.x]){x[.x]cvt}< => sizes;]:
Here we direct the work. The (:for) loop takes our (seq) sequence which, when in the context of a control structure argument ((...)) and indexed with a field ([.x]), becomes the name of the element passed into the block. This element is selected implicitly by the (:for). Note: this "indexing" must be last as earlier indexing is taken as simple indexing or namespace creation. Any Glee statement may appear in the (:for) argument ... if you're careful.
[{x[.x]cvt}]:
This is the body of the (:for) control loop. We take the (x) and by indexing it with a field ([.x]) we get a namespace containing the element as named. This name wants to be the name used by the (cvt) routine and is delivered into its namespace (in this case, they are the same name (x) ). The program (cvt) is "called" by delivering a namespace as its left argument. Thus (cvt) can be viewed as a monadic operator in this context.
[:for(...){... cvt}< => sizes;]:
Each pass through the (:for) loop gives us the result from the (cvt) program. These results are collected as a sequence with one element per (:for) loop iteration. In other words, we get a sequence of numeric sizes. We want a vector of numbers, and since our result is a homogeneous sequence (i.e. all numerics), disclosing it (<) yields a numeric vector which we save (=>sizes;).
['Sizes (MB):'(sizes/1e6)$;]:
This just converts our sizes to megabytes (sizes/1e6) and displays them preceded by an annotation ('Sizes (MB):'). The ($) forces a display of the result it finds on its left.
['Ascending: '(seq[sizes ``>],,)$;]:
We grade sizes (``>), getting the indices that will put it in ascending order, and use those to index into our sequence (seq[sizes ``>]). This gives us the ascending sort we want. The result is a sequence. Our sequence layout operator (,,) supplies us with separating spaces.
['Descending:'(seq[sizes ``<],,)$;]:
More of the same for our descending presentation.
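Grading, as used in both presentations, produces the indices that would sort a vector rather than the sorted vector itself. A Python analogue (not Glee) of grading sizes and using the result to reorder seq might look like this. The names up and down are hypothetical, and the elements are shown without their trailing semicolons.

```python
# Python analogue of grading: compute sort indices, then index the original data.
seq = ['40GB', '120GB', '0.9GB', '256MB', '1000MB', '.5MB', '512KB']
sizes = [40e9, 120e9, 0.9e9, 256e6, 1000e6, 0.5e6, 512e3]

# Indices that put sizes in ascending and in descending order.
up = sorted(range(len(sizes)), key=lambda i: sizes[i])
down = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)

# Index the original strings with the grade vectors and join with spaces.
print('Ascending: ', ' '.join(seq[i] for i in up))
print('Descending:', ' '.join(seq[i] for i in down))
# Ascending:  .5MB 512KB 256MB 0.9GB 1000MB 40GB 120GB
# Descending: 120GB 40GB 1000MB 0.9GB 256MB 512KB .5MB
```

Because the indices are applied to seq and not to sizes, the original text of each element is what gets displayed, which is what the problem statement asks for.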

This completes the example. To better understand these operators and other things you can do with them, consult the operator pages according to the type of data you see being operated on.