Commentary: Operators; Character

General: This section is about Glee's power for dealing with character and string data. Glee tries to get you results, not hassle you about details. Toward this end, Glee character data comparisons use a Glee compare by default. A Glee compare is a liberal set of rules for comparing string data. First, it ignores case. Next, it treats all punctuation and non-printable characters as whitespace. Finally, it ignores extraneous whitespace. It does this for both left and right arguments. Glee compares take longer than exact compares, and often an exact compare is what is needed. For these cases Glee has operators (containing the symbol "=") that use exact compares.
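The three Glee compare rules above can be approximated outside Glee. The following is a minimal sketch in Python, not Glee's own implementation. The function names are hypothetical, and treating every character other than an ASCII letter or digit as "punctuation or non-printable" is an assumption.

```python
import re

def glee_normalize(text):
    """Approximate the liberal compare rules (an assumption, not Glee's code)."""
    # Rule 1: ignore case.
    text = text.lower()
    # Rule 2: treat punctuation and non-printable characters as whitespace.
    # Assumption: anything that is not an ASCII letter or digit counts.
    text = re.sub(r"[^a-z0-9]", " ", text)
    # Rule 3: ignore extraneous whitespace (collapse runs, trim both ends).
    return " ".join(text.split())

def glee_equal(left, right):
    """Liberal compare: normalize both arguments, then compare."""
    return glee_normalize(left) == glee_normalize(right)

def exact_equal(left, right):
    """Exact compare: no normalization, so it is cheaper."""
    return left == right
```

Under these assumptions, glee_equal("Hello,  World!", "hello world") is true while exact_equal on the same arguments is false. The extra pass over both strings is also why the liberal compare costs more.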

This whole section on character operators is still very experimental. There are glaring omissions (like find and replace), which will come along soon but may be documented in a general indexed assignment context. It may turn out that the current operator set is impractical or confusing to use. Further, much of what is done with characters is context sensitive. Are we ignoring whitespace, special characters, non-printable characters, case, or all the permutations of these? Soon, these context considerations need to be addressed.

I have only guessed at a limited set of facilities that I think allow me to do real work. New facilities may allow the Glee programmer to specify the context through arguments (e.g. strings are objects and have properties, of which case and other compare state can be manipulated at run time). I may also be able to specify context through compound operators created with adaptors (e.g. a " *&.~ " operator for mark string and eat). Right now this is pretty ugly to me. All of these issues will be worked out when I start presenting case studies of real-life problems. The framework on which I have built Glee will accommodate whatever I decide. It's the deciding that is problematic!

Upper case: With so many operators, choice of symbols for operations becomes an issue. I have tried to choose symbols in a consistent style. This is an example. Conversion to upper case is a character-by-character operation on the elements. For such operations, I have chosen the (%) symbol. Picture it as an old-fashioned clothes wringer or a pair of rollers reforming steel plate. I'm reforming the elements. In the case of Upper Case, I follow this with the (/) symbol. Picture this as a "ramp up". I'm ramping up the characters to upper case with the (%/) operator.

Lower case: As discussed for Upper Case, I try to choose pictorially descriptive symbols for forming the operators. Here (%\) ramps the string elements down to lower case.

Mark for Glee compare: By default, Glee assumes you want to make liberal compares of string objects. This means case, redundant whitespace and special characters are ignored. This monadic operator marks objects to its left for Glee compares. Subsequent comparison operations interrogate this marking to deliver the appropriate behavior. In the verbose display, a colon ":" signifies the liberal Glee compare marking.

Mark for exact compare: To override the default Glee compare marking and obtain exact compares, use this operator. It marks the object to its left for exact comparison. Subsequent comparison operations interrogate this marking to deliver the appropriate behavior. In the verbose display, an equal sign "=" signifies the exact compare marking.

Equal To: A Glee compare is performed on the elements. Where they match, 1 is returned. Otherwise 0 is returned. If you want an exact compare, precede the test with "@==". This marks the object, telling subsequent operations to perform an exact compare rather than the more liberal and computationally expensive Glee compare.

Mark String: Here we mark over the span of a matching Glee compare. I have chosen the (&) symbol because to Glee it generally means all. The result is where the left and right arguments match for all characters in the right argument string. Notice in the third example there is no Glee match. This is because the dash (-) is being ignored for Glee compares, but the blank in the right argument is not. Hyphenated and contracted words are problematic. This is the most generally desired solution.

Mark String Exact: Here, we mark over the span of a string matching exactly. This marking operator is faster than using the Glee compare version (*&). You will use it when you know exactly what you're looking for (case, whitespace, text, etc.) or when you have to find special characters in context with text.

Mark characters: Here I have chosen the marking symbol (*) and the "or" (|) symbol. I am marking where the first or the second or the third or ... characters in the right argument are found in the left argument. Since individual characters are being compared (typically special delimiters like "," and "."), I do an exact match on these characters. Thus, letter matches are case sensitive.
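As an illustration of the bit vector this kind of marking produces, here is a minimal sketch in Python rather than Glee. The function name mark_chars is hypothetical. It sets a 1 wherever a character of the left argument is one of the characters in the right argument, using an exact (case-sensitive) match as described above.

```python
def mark_chars(text, chars):
    """Return a 0/1 list marking positions in text whose character is in chars."""
    wanted = set(chars)  # exact match: "a" and "A" are different characters
    return [1 if ch in wanted else 0 for ch in text]

marks = mark_chars("a,b.c", ",.")
print(marks)  # [0, 1, 0, 1, 0]
```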

Mark Words Start: Marks the beginning of words. Words are substrings beginning with an alphabetic character (i.e. one preceded by a non-alphabetic character). They are ended by the beginning of the next word.

Mark Strings Start: Marks the beginning of strings using a Glee compare.

Mark Words End: Marks the end of words. Words end with an alphabetic character followed by a non-alphabetic character.
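The word start and word end rules can be sketched as follows, in Python rather than Glee. The function names are hypothetical. Two assumptions are made that the text does not state: the start of the text counts as "preceded by a non-alphabetic", and the end of the text counts as "followed by a non-alphabetic".

```python
def mark_word_starts(text):
    """1 where an alphabetic character follows a non-alphabetic one (or the text start)."""
    marks = []
    prev_alpha = False  # assumption: the text start acts like a non-alphabetic
    for ch in text:
        is_alpha = ch.isalpha()
        marks.append(1 if is_alpha and not prev_alpha else 0)
        prev_alpha = is_alpha
    return marks

def mark_word_ends(text):
    """1 where an alphabetic character is followed by a non-alphabetic one (or the text end)."""
    marks = []
    for i, ch in enumerate(text):
        next_alpha = i + 1 < len(text) and text[i + 1].isalpha()
        marks.append(1 if ch.isalpha() and not next_alpha else 0)
    return marks

print(mark_word_starts("hi you"))  # [1, 0, 0, 1, 0, 0]
print(mark_word_ends("hi you"))    # [0, 1, 0, 0, 0, 1]
```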

Mark Strings End: Marks the end of strings using a Glee compare.

Mark pairs: It is common to have to parse out strings delimited by pairs and deal with them separately. This operator generates the bit vector that helps you do that. The right argument is a string containing delimiters in pairs. Glee starts marking when it sees a pair begin and stops marking when it sees it end. It does this pair by pair and ORs the results.
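The pair-by-pair marking can be sketched like this, in Python rather than Glee. The function name mark_pairs is hypothetical, and whether the delimiters themselves are included in the marked span is not stated above; this sketch assumes they are.

```python
def mark_pairs(text, pairs):
    """Mark spans enclosed by delimiter pairs; pairs is a string like "()[]"."""
    result = [0] * len(text)
    # Walk the delimiter string two characters at a time: (open, close).
    for k in range(0, len(pairs) - 1, 2):
        open_ch, close_ch = pairs[k], pairs[k + 1]
        inside = False
        for i, ch in enumerate(text):
            if not inside and ch == open_ch:
                inside = True
                result[i] = 1  # assumption: the opening delimiter is marked
            elif inside:
                result[i] = 1  # marks up to and including the closing delimiter
                if ch == close_ch:
                    inside = False
    # Writing 1s into one shared vector ORs the per-pair results together.
    return result

print(mark_pairs("a(b)c", "()"))  # [0, 1, 1, 1, 0]
```

Because the open test comes first and the close test only applies while inside, a pair whose two delimiters are the same character (such as a pair of double quotes) toggles marking on and off as expected.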

Catenate: Joins two strings end to end to produce a new string.

Index of chars: Returns the index of the first occurrence of characters on the right in the string on the left. An exact comparison is made. Glee comparison rules are not used.

Contains chars: Returns a 1-element bit vector. If the right argument contains any of the characters in the left argument using liberal Glee comparison rules, the result is true. Otherwise it is false. This symbol is made up of the (^) symbol (as in housed under that little roof) to symbolize containment. It then uses the (|) symbol for any. So ^| reads "contains any" in this context. Since the liberal Glee comparison changes non-printables and punctuation to blanks, this operator is only useful for doing alphanumeric character and special symbol compares.

Contains String: The (&) symbol meaning all is used in the operator for finding strings containing other strings. The ( ^& ) operator is one of the most powerful operators in the character operator suite. It can be used when scanning logs and removing clutter. For example, log[log ^& 'robots.txt'~]=>log would remove web log lines generated by some groping bots.
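The log-filtering idea can be illustrated outside Glee. This is a minimal sketch in Python under stated assumptions: the helper names are hypothetical, and the liberal compare is approximated by lowercasing, turning every character other than an ASCII letter or digit into a blank, and collapsing blanks.

```python
import re

def _normalize(text):
    """Approximation of the liberal compare rules (an assumption, not Glee's code)."""
    return " ".join(re.sub(r"[^a-z0-9]", " ", text.lower()).split())

def contains_string(line, needle):
    """True if the line contains the needle under the approximated liberal rules."""
    return _normalize(needle) in _normalize(line)

def drop_matching(lines, needle):
    """Keep only the lines that do NOT contain the needle."""
    return [line for line in lines if not contains_string(line, needle)]

log = ["GET /robots.txt 200", "GET /index.html 200"]
log = drop_matching(log, "robots.txt")
print(log)  # ['GET /index.html 200']
```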

Contains Exact: The (=) symbol meaning exact is used in the operator for finding strings containing other strings exactly. The ( ^= ) operator is faster than the ( ^& ) because it makes only simple decisions ... are the substrings exactly the same or not. If you're trying to locate strings in lines and you know exactly what you're looking for, this operator will mark the lines for you.

Delete All Blanks: .

Delete All Characters: .

Delete Extraneous Blanks: .

Delete Extraneous Chars: .

Delete Leading Blanks: .

Delete Leading Chars: .

Delete Trailing Blanks: .

Delete Trailing Chars: .

Segment CRLF: Often text is delimited into lines by combinations of CR (carriage return) and LF (line feed) characters. This operator recognizes these characters and returns a sequence, each element of which is a string of the text. The operator recognizes CR, CRLF, LF, and LFCR as single delimiters. It sees LFLF and CRCR as two separate delimiters. The first line is captured as if it had a leading CRLF. However, Glee will not add the CRLF to the first line of text. On every other line, the CRLF is found at the beginning of the string.

Segment and eat CRLF: Typically when dealing with text as lines you don't want the CRLF in the way. This operator eats the CRLF characters as it builds the sequence.
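The delimiter rules for the two CRLF operators can be sketched as follows, in Python rather than Glee. The function name and the eat flag are hypothetical. The regular expression tries the two-character forms first so that CRLF and LFCR count as one delimiter, while LFLF and CRCR fall through to two single-character matches. How Glee treats text that begins or ends with a delimiter is not stated above, so the behavior of this sketch in those cases is an assumption.

```python
import re

# Two-character delimiters are listed first so they win over the single ones.
_LINE_DELIM = re.compile(r"\r\n|\n\r|\r|\n")

def segment_crlf(text, eat=False):
    """Split text into lines; each delimiter stays at the start of the next line unless eaten."""
    segments = []
    start = 0
    for match in _LINE_DELIM.finditer(text):
        segments.append(text[start:match.start()])
        # Keep the delimiter at the beginning of the next segment, or skip it.
        start = match.end() if eat else match.start()
    segments.append(text[start:])
    return segments

print(segment_crlf("a\r\nb\nc"))            # ['a', '\r\nb', '\nc']
print(segment_crlf("a\r\nb\nc", eat=True))  # ['a', 'b', 'c']
```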

Segment Delimiter: When you need to segment text at points other than CRLF, this is the operator to use. Notice the delimiter belongs to the previous string. The first example illustrates this. (You naturally expect the "," and "." to go with the phrase and the sentence respectively.) This is different from segmenting with indices or bit vectors. In those cases, as shown in the second example, the marked position is the beginning of the string. Otherwise you would have the first letter of marked words included with the previous word.

Segment and eat Delimiter: This operator consumes the delimiters used in segmenting.
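The rule that a delimiter belongs to the previous string, and the "eat" variant that drops it, can be sketched as follows in Python rather than Glee. The function name and the eat flag are hypothetical, and dropping an empty trailing segment is an assumption.

```python
def segment_delimiter(text, delims, eat=False):
    """Split text after each delimiter character; the delimiter ends its segment."""
    segments = []
    current = []
    for ch in text:
        if ch in delims:
            if not eat:
                current.append(ch)  # the delimiter stays with the previous string
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:  # assumption: no empty segment after a final delimiter
        segments.append("".join(current))
    return segments

print(segment_delimiter("Hi, there. Bye", ",."))            # ['Hi,', ' there.', ' Bye']
print(segment_delimiter("Hi, there. Bye", ",.", eat=True))  # ['Hi', ' there', ' Bye']
```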

Segment Index: The result of any method creating indices or a bit vector can be used for segmenting the text.

Segment Index and Eat: This operator consumes (eats) the indexed characters when it builds the sequence of strings.
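Segmenting with a bit vector, where each marked position begins a new string, can be sketched as follows in Python rather than Glee. The function name and the eat flag are hypothetical, and skipping empty segments is an assumption not stated in the text.

```python
def segment_by_marks(text, marks, eat=False):
    """Start a new segment at every marked position; optionally drop the marked character."""
    segments = []
    current = []
    for ch, bit in zip(text, marks):
        if bit:
            if current:  # assumption: empty segments are not emitted
                segments.append("".join(current))
            # The marked character begins the new segment, unless it is eaten.
            current = [] if eat else [ch]
        else:
            current.append(ch)
    if current:
        segments.append("".join(current))
    return segments

# Marks at the word starts of "one two": each word keeps its first letter.
print(segment_by_marks("one two", [1, 0, 0, 0, 1, 0, 0]))            # ['one ', 'two']
# Mark on the blank, eaten while segmenting.
print(segment_by_marks("one two", [0, 0, 0, 1, 0, 0, 0], eat=True))  # ['one', 'two']
```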

ASCII: If the left argument is numeric (integers 0..255), a character string is returned representing the ASCII characters corresponding to the numbers in the vector. If a number is out of range, it is taken modulo 256. Numbers are coerced to integers. If the left argument is a string, a numeric vector representing those characters from the ASCII table is returned.
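The two directions of this conversion can be sketched as follows, in Python rather than Glee. The function name ascii_convert is hypothetical. Coercing numbers to integers by truncation is an assumption, since the text does not say how the coercion rounds, and the string direction assumes every character has a code in 0..255.

```python
def ascii_convert(arg):
    """Numbers -> string, or string -> list of character codes."""
    if isinstance(arg, str):
        # String in: one code per character.
        return [ord(ch) for ch in arg]
    # Numbers in: coerce to int (truncation is an assumption), then wrap modulo 256.
    return "".join(chr(int(n) % 256) for n in arg)

print(ascii_convert([72, 105]))  # Hi
print(ascii_convert("Hi"))       # [72, 105]
print(ascii_convert([328]))      # H, because 328 modulo 256 is 72
```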

Base (%>) and Representation (%<): In the string domain, the dyadic Base operator converts the string to contain only the characters in the right argument. This is helpful for converting the string to only transmittable characters. The Representation operator (commonly called "rep") reverses the process, reconstituting the original string. The right argument for both is a string of valid characters for the result (or for the input in the case of Rep). In the case of Rep, any invalid characters in the left argument (i.e. not in the right argument) are ignored. This is useful when transmission or display adds characters like linefeeds, carriage returns, and spacing. This makes Base and Rep useful for including ciphered text along with unciphered text (as in an email message) and reconstituting it on receipt.
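The text does not say which encoding Glee uses for Base and Rep, so the following Python sketch is only one scheme with the stated properties, not Glee's algorithm. It assumes each input character has a code in 0..255 and writes every code as a fixed number of digits drawn from the alphabet. The function names are hypothetical. The decoder skips any character outside the alphabet, which is what lets the text survive added linefeeds and spacing.

```python
def _digits_per_byte(alphabet):
    """Smallest k such that len(alphabet) ** k covers all 256 byte values."""
    if len(alphabet) < 2 or len(set(alphabet)) != len(alphabet):
        raise ValueError("alphabet needs at least 2 distinct characters")
    k, span = 1, len(alphabet)
    while span < 256:
        span *= len(alphabet)
        k += 1
    return k

def base(text, alphabet):
    """Encode text using only characters from alphabet (fixed width per byte)."""
    k, n = _digits_per_byte(alphabet), len(alphabet)
    out = []
    for byte in text.encode("latin-1"):  # assumption: codes are in 0..255
        digits = []
        for _ in range(k):
            byte, d = divmod(byte, n)
            digits.append(alphabet[d])
        out.extend(reversed(digits))  # most significant digit first
    return "".join(out)

def rep(encoded, alphabet):
    """Decode text produced by base(), ignoring characters not in alphabet."""
    k, n = _digits_per_byte(alphabet), len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    values = [index[ch] for ch in encoded if ch in index]  # invalid chars skipped
    out = bytearray()
    for i in range(0, len(values) - k + 1, k):
        byte = 0
        for d in values[i:i + k]:
            byte = byte * n + d
        out.append(byte)  # valid for any input that base() produced
    return out.decode("latin-1")

HEX = "0123456789abcdef"
print(base("Hi", HEX))         # 4869
print(rep("48\r\n 69", HEX))   # Hi
```

Note that noise characters that happen to be in the alphabet would corrupt the result in this scheme, so the alphabet should avoid characters that transmission is likely to insert.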
