General: This section is about Glee's power for dealing with character and string data. Glee tries to get you results, not hassle you about details. Toward this end, Glee character data comparisons use a Glee compare. A Glee compare is a liberal set of rules for comparing string data. First, it ignores case. Next, it treats all punctuation and non-printable characters as whitespace. Finally, it ignores extraneous whitespace. It does this for both left and right arguments. By default, Glee uses Glee compare rules. Glee compares take longer than exact compares, and often an exact compare is what is needed. For these cases Glee has operators (containing the symbol "=") that use exact compares.
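The three rules above can be modeled in a few lines of Python. This is only a sketch of the rules as stated here, not Glee's implementation; the names glee_normalize and glee_equal are hypothetical, and the exact set of characters Glee counts as punctuation is an assumption (Python's string.punctuation is used as a stand-in).

```python
import string


def glee_normalize(s):
    """Model of a Glee compare: fold case, blank out punctuation and
    non-printable characters, then drop extraneous whitespace."""
    out = []
    for ch in s.lower():
        # Assumption: "punctuation" means the ASCII punctuation set.
        if ch in string.punctuation or not ch.isprintable():
            out.append(" ")
        else:
            out.append(ch)
    # split() with no argument collapses runs of whitespace and trims the ends.
    return " ".join("".join(out).split())


def glee_equal(left, right):
    """Liberal compare: both arguments are normalized before comparing."""
    return glee_normalize(left) == glee_normalize(right)


def exact_equal(left, right):
    """Exact compare: no normalization at all."""
    return left == right
```

Under this model, glee_equal("Hello,   World!", "hello world") is true while exact_equal on the same pair is false, which is the trade-off the paragraph describes: the liberal compare does more work per comparison.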
This whole section on character operators is still very experimental. There are glaring omissions (like find and replace) which will come along soon but may be documented in a general indexed assignment context. It may turn out that this current operator set is impractical or confusing to use. Further, much of what is done with characters is context sensitive. Are we ignoring white space, special characters, non-printable characters, case, or all the permutations of these? Soon, these context considerations need to be addressed.
I have only guessed at a limited set of facilities I think will
allow me to do real work. New facilities may allow the Glee
programmer to specify the context through arguments (e.g. strings are objects
and have properties, of which case and other compare state can be manipulated
at run time). I may also be able to specify context through compound operators
created with adaptors (e.g. a " *&.~ " operator for mark
string and eat). Right now this is pretty ugly to me. All of these issues will
be worked out when I start presenting case studies of real life problems. The
framework on which I have built Glee will accommodate whatever I
decide. It's the deciding that is problematic!
Upper case: With so many
operators, choice of symbols for operations becomes an issue. I have tried to
choose symbols in a consistent style, and this is an example.
Conversion to upper case is a character by character
operation on the elements. For such operations, I have chosen the (%)
symbol. Picture it as an old fashioned clothes wringer or a pair of rollers
reforming steel plate. I'm reforming the elements. In the case of Upper Case, I
follow this with the (/) symbol. Picture this as a "ramp
up". I'm ramping up the characters to upper case with the (%/)
operator.
Lower case: As discussed for
Upper Case, I try to choose pictorially descriptive symbols for forming the
operators. Here (%\) ramps the string elements down to lower case.
Mark for Glee
compare: By default, Glee assumes you want to make liberal
compares of string objects. This means case, redundant whitespace and special
characters are ignored. This monadic operator marks objects to its left for
Glee compares. Subsequent comparison operations interrogate this
marking to deliver the appropriate behavior. In the verbose display, a colon
":" signifies the liberal Glee compare
marking.
Mark for exact compare:
To override the default Glee compare marking and obtain exact
compares, use this operator. It marks the object to its left for exact
comparison. Subsequent comparison operations interrogate this marking to
deliver the appropriate behavior. In the verbose display, an equal sign
"=" signifies the exact compare marking.
Equal To: A Glee
compare is performed on the elements. Where they match, 1 is returned.
Otherwise 0 is returned. If you want an exact compare, precede the test with
"@==". This marks the object telling subsequent operations
to perform exact rather than the more liberal and computationally expensive
Glee compare.
Mark String: Here we mark over the span of
a matching Glee compare. I have chosen the (&)
symbol because to Glee it generally means all. The result
is where the left and right arguments match for all characters in the
right argument string. Notice in the third example there is no
Glee match. This is because the dash (-) is being ignored for
Glee compares. But the blank in the right argument is not being
ignored. Hyphenated and contracted words are problematic. This is the most
generally desired solution.
Mark String Exact: Here, we
mark over the span of a string matching exactly. This marking operator is
faster than using the Glee compare version (*&). You
will use it when you know exactly what you're looking for (case, whitespace,
text, etc.) or when you have to find special characters in context with text.
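As an illustration of "marking over the span" of an exact match, here is a small Python model. It is a sketch under assumptions, not Glee's implementation: the function name mark_string_exact is hypothetical, the result is modeled as a list of 0/1 values the same length as the left argument, and overlapping matches are all marked.

```python
def mark_string_exact(text, pattern):
    """Return a 0/1 list with 1s over every exact occurrence of pattern."""
    marks = [0] * len(text)
    if not pattern:
        return marks
    start = text.find(pattern)
    while start != -1:
        # Mark the whole span covered by this occurrence.
        for i in range(start, start + len(pattern)):
            marks[i] = 1
        # Step by one so overlapping occurrences are also found.
        start = text.find(pattern, start + 1)
    return marks
```

Because the match is exact, mark_string_exact("ABC", "bc") marks nothing, while mark_string_exact("abcabc", "bc") marks both occurrences.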
Mark characters: Here I have
chosen the marking symbol (*) and the "or"
(|) symbol. I am marking where the first or the second
or the third or ... characters in the right argument are found in
the left argument. Since individual characters are being compared (typically
special delimiters like "," and ".") I do
an exact match on these characters. Thus, letter matches are case sensitive.
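The "first or second or third" reading can be modeled as a membership test per character. The sketch below is Python, not Glee, and the name mark_chars is hypothetical; it assumes the result is a 0/1 list aligned with the left argument.

```python
def mark_chars(text, chars):
    """Mark each position of text whose character appears in chars.

    The comparison is exact, so letters are matched case-sensitively.
    """
    wanted = set(chars)
    return [1 if ch in wanted else 0 for ch in text]
```

For example, mark_chars("a,b.c", ",.") marks the comma and the period only, and mark_chars("Aa", "a") marks only the lower-case letter.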
Mark Words Start: Marks
the beginning of words. Words are substrings beginning with an alphabetic (i.e.
preceded by a non-alphabetic). They are ended by the beginning of the next
word.
Mark Strings
Start: Marks the beginning of strings using a Glee compare.
Mark Words End: Marks the
end of words. Words end with an alphabetic followed by a non-alphabetic.
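The word-start and word-end rules given for Mark Words Start and Mark Words End can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions not stated in the text: the start of the text counts as "preceded by a non-alphabetic", and the end of the text counts as "followed by a non-alphabetic".

```python
def mark_words_start(text):
    """1 where an alphabetic character follows a non-alphabetic one."""
    marks = []
    prev_alpha = False  # assumption: start of text acts as a non-alphabetic
    for ch in text:
        is_alpha = ch.isalpha()
        marks.append(1 if is_alpha and not prev_alpha else 0)
        prev_alpha = is_alpha
    return marks


def mark_words_end(text):
    """1 where an alphabetic character is followed by a non-alphabetic one."""
    n = len(text)
    marks = []
    for i, ch in enumerate(text):
        # assumption: end of text acts as a non-alphabetic
        next_alpha = i + 1 < n and text[i + 1].isalpha()
        marks.append(1 if ch.isalpha() and not next_alpha else 0)
    return marks
```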
Mark Strings End: Marks
the end of strings using a Glee compare.
Mark pairs: It is common to
have to parse out strings delimited by pairs and deal with them separately.
This operator generates the bit vector that helps you do that. The right
argument is a string containing delimiters in pairs. Glee starts
marking when it sees a pair begin and stop marking when it sees it end. It does
this pair by pair and ors the result.
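A Python model of this pair-by-pair marking might look like the sketch below. It is not Glee's implementation, and the name mark_pairs is hypothetical. Two points are assumptions: the delimiters themselves are included in the marked span, and an unclosed pair keeps marking to the end of the text.

```python
def mark_pairs(text, pairs):
    """Mark spans enclosed by delimiter pairs such as "()[]".

    pairs is read two characters at a time: opener, then closer.
    Each pair is scanned separately and the results are or-ed together.
    """
    marks = [0] * len(text)
    for k in range(0, len(pairs) - 1, 2):
        opener, closer = pairs[k], pairs[k + 1]
        inside = False
        for i, ch in enumerate(text):
            if not inside and ch == opener:
                inside = True
                marks[i] = 1          # assumption: opener is marked
            elif inside:
                marks[i] = 1          # or-ing: a 1 is never cleared
                if ch == closer:
                    inside = False    # assumption: closer is marked
    return marks
```

The same loop handles pairs whose opener and closer are the same character, such as a pair of double quotes.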
Catenate: Joins two strings
end to end to produce a new string.
Index of chars: Returns the
index of the first occurrence of characters on the right in the string on the
left. An exact comparison is made. Glee comparison rules are not
used.
Contains chars: Returns
a 1 element bit vector. If the right argument contains any of the characters in
the left argument using liberal Glee comparison rules, the result
is true. Otherwise it is false. This symbol is made up of the (^)
symbol (as in housed under that little roof) to symbolize containment.
It then uses the (|) symbol for any. So ^| reads
contains any in this context. Since the liberal Glee
comparison changes non-printables and punctuation to blanks, this operator is
only useful for doing alphanumeric character and special symbol compares.
Contains String: The
(&) symbol meaning all is used in the operator for finding
strings containing other strings. The ( ^& ) operator is one of
the most powerful operators in the character operator suite. It can be used
when scanning logs and removing clutter. For example, log[log ^&
'robots.txt'~]=>log would remove web log lines generated by some
groping bots.
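The log-cleaning idea can be modeled in Python. This is a sketch of the behavior described, not a translation of the Glee expression; the names normalize and contains_string are hypothetical, the sample log lines are made up for illustration, and the liberal compare is approximated by folding case and treating every run of non-alphanumeric ASCII characters as a single blank.

```python
import re


def normalize(s):
    """Approximate a liberal compare: fold case, blank out everything
    that is not a letter or digit, and collapse the blanks."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()


def contains_string(line, needle):
    """True when the normalized line contains the normalized needle."""
    return normalize(needle) in normalize(line)


# Illustrative data only.
log = [
    "GET /index.html 200",
    "GET /robots.txt 200",
    "GET /about.html 200",
]

# Keep only the lines that do not contain the string.
log = [line for line in log if not contains_string(line, "robots.txt")]
```

After the last statement, only the two lines that do not mention robots.txt remain in log.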
Contains Exact: The
(=) symbol meaning exact is used in the operator for finding
strings containing other strings exactly. The ( ^= ) operator is
faster than the ( ^& ) because it makes only simple decisions ...
are the substrings exactly the same or not. If you're trying to locate strings
in lines and you know exactly what you're looking for, this operator will mark
the lines for you.
Segment CRLF: Often text is
delimited into lines by combinations of CR (carriage return) and LF (line feed)
characters. This operator recognizes these characters and returns a sequence,
each element of which is a string of the text. The operator recognizes CR,
CRLF, LF, and LFCR as single delimiters. It sees LFLF and CRCR as two separate
delimiters. The first line is captured as if it had a leading CRLF. However,
Glee will not add the CRLF to the first line of text. On all other lines, the CRLF
is always found at the beginning of the string.
Segment and eat CRLF:
Typically when dealing with text as lines you don't want the CRLF in the
way. This operator eats the CRLF characters as it builds the sequence.
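The delimiter rules for Segment CRLF and its eating variant can be modeled with a regular expression in Python. This is a sketch, not Glee's implementation; the name segment_crlf and the eat flag are hypothetical, and the handling of a trailing delimiter (it produces a final, possibly empty, element) is an assumption.

```python
import re

# CRLF and LFCR are tried first so each counts as one delimiter;
# CRCR and LFLF fall through to the single-character cases and count as two.
_LINE_DELIM = re.compile(r"\r\n|\n\r|\r|\n")


def segment_crlf(text, eat=False):
    """Split text into lines.

    eat=False: each delimiter stays at the beginning of the line it starts.
    eat=True:  delimiters are dropped from the result.
    """
    pieces = []
    start = 0
    for m in _LINE_DELIM.finditer(text):
        pieces.append(text[start:m.start()])
        start = m.end() if eat else m.start()
    pieces.append(text[start:])
    return pieces
```

With eat=False the first element has no leading delimiter and every later element begins with one, which matches the description of where the CRLF is found.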
Segment Delimiter: When
you need to segment text at points other than CRLF, this is the operator to
use. Notice the delimiter belongs to the previous string. The first example
illustrates this. (You naturally expect the "," and
"." to go with the phrase and the sentence respectively.)
This is different than segmenting with indices or bit vectors. In those cases,
as shown in the second example, the marked position is the beginning of the
string. Otherwise you would have the first letter of marked words included with
the previous word.
Segment and eat
Delimiter: This operator consumes the delimiters used in segmenting.
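The difference between keeping and eating delimiters can be sketched in Python. The name segment_delimiter and the eat flag are hypothetical, and dropping an empty trailing segment is an assumption; this models the described behavior rather than Glee's implementation.

```python
def segment_delimiter(text, delimiters, eat=False):
    """Split text after each delimiter character.

    eat=False: the delimiter stays at the end of the previous string.
    eat=True:  the delimiter is consumed.
    """
    pieces = []
    current = []
    for ch in text:
        if ch in delimiters:
            if not eat:
                current.append(ch)
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:  # assumption: no empty element after a final delimiter
        pieces.append("".join(current))
    return pieces
```

For "Hello, world. Bye" with delimiters ",." the non-eating form yields "Hello," and " world." with their punctuation attached, followed by " Bye".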
Segment Index: The result
of any method creating indices or a bit vector can be used for segmenting the
text.
Segment Index and
Eat: This operator consumes (eats) the indexed characters when it builds the
sequence of strings.
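Segmenting by indices or by a bit vector can be modeled in Python as below. The names bits_to_indices and segment_index are hypothetical, positions are 0-based here purely because that is Python's convention (Glee's index origin is not stated in this section), and each marked position is treated as the beginning of a new string.

```python
def bits_to_indices(bits):
    """Convert a 0/1 vector into the list of marked positions."""
    return [i for i, b in enumerate(bits) if b]


def segment_index(text, starts, eat=False):
    """Cut text so that each position in starts begins a new string.

    eat=True drops the character at each marked position.
    """
    cuts = sorted(set(i for i in starts if 0 <= i < len(text)))
    pieces = []
    begin = 0
    for cut in cuts:
        if cut > 0:
            pieces.append(text[begin:cut])
        begin = cut + 1 if eat else cut
    pieces.append(text[begin:])
    return pieces
```

Marking the blank in "one two" gives "one" and " two" without eating, and "one" and "two" with eating.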
ASCII: If the left argument is
numeric (integer 0..255), a character string is returned representing ASCII
characters corresponding to the numbers in the vector. If a number is out of
range, it is taken modulo 256. Numbers are coerced to integers. If the left
argument is a string, a numeric vector representing those characters from the
ASCII table is returned.
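Both directions of the ASCII operator can be modeled with Python's chr and ord. The name ascii_op is hypothetical, and coercing to an integer by truncation (Python's int) is an assumption, since the text does not say how Glee rounds.

```python
def ascii_op(arg):
    """Numbers -> string, or string -> list of character codes."""
    if isinstance(arg, str):
        return [ord(ch) for ch in arg]
    # Coerce each number to an integer (assumption: truncation),
    # then wrap out-of-range values modulo 256.
    return "".join(chr(int(n) % 256) for n in arg)
```

For example, 328 wraps to 72, so ascii_op([328]) gives the same character as ascii_op([72]).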
Base (%>) and
Representation (%<): In the string domain, the dyadic base
operator converts the string to contain only the characters in the right
argument. This is helpful to convert the string to only transmittable
characters. The Representation operator (commonly called "rep")
reverses the process reconstituting the original string. The right argument for
both is a string of valid characters for the result (or in the
input in the case of Rep). In the case of Rep, any invalid characters in
the left argument (i.e. not in the right argument) are ignored. This is useful
when transmission or display adds characters like linefeeds, carriage returns,
and spacing. This makes Base and Rep useful for including ciphered text along
with unciphered text (as in an email message) and reconstituting it on receipt.
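The round trip can be illustrated with a Python sketch. The encoding Glee actually uses is not given in this section, so the scheme below is an assumption chosen only to show the idea: each character code is written as a fixed number of digits in the base given by the length of the alphabet, and decoding skips any character that is not in the alphabet. The names base, rep and digits_per_char are hypothetical.

```python
def digits_per_char(k):
    """Number of base-k digits needed to cover the codes 0..255."""
    n, span = 1, k
    while span < 256:
        span *= k
        n += 1
    return n


def base(text, alphabet):
    """Rewrite text using only the characters in alphabet."""
    k = len(alphabet)
    if k < 2 or len(set(alphabet)) != k:
        raise ValueError("alphabet needs at least 2 distinct characters")
    width = digits_per_char(k)
    out = []
    for code in text.encode("latin-1"):  # codes 0..255 only
        digits = []
        for _ in range(width):
            code, d = divmod(code, k)
            digits.append(alphabet[d])
        out.extend(reversed(digits))     # most significant digit first
    return "".join(out)


def rep(encoded, alphabet):
    """Reverse base(); characters not in alphabet are ignored."""
    k = len(alphabet)
    width = digits_per_char(k)
    values = [alphabet.index(ch) for ch in encoded if ch in alphabet]
    chars = []
    for i in range(0, len(values) - width + 1, width):
        code = 0
        for d in values[i:i + width]:
            code = code * k + d
        chars.append(chr(code))
    return "".join(chars)
```

With the alphabet "0123456789ABCDEF", base("Hi", ...) gives "4869", and rep still recovers "Hi" from "48\r\n 69" because the inserted line break and blank are not in the alphabet and are skipped.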