Case Study CS00011: Base64 encoding and decoding

The problem: In e-mail and other transfer mechanisms, some characters in the ascii set are reserved for special purposes. If they appear in the message there is a problem. The most common technique employed is the Base64 coding. This method converts 3 original characters to 4 transmitted characters. On the receiving end the process is reversed. This example shows how this encoding and decoding is accomplished in Glee code

The solution:

  1. Create the 64 character string of valid transmission characters.
  2. Encode a source string into transmittable string.
  3. Decode the received string back to the source string.

Note: You can cut and paste these code fragments into the code pane of the Glee interpreter and experiment as you go along to see the actual operations live.

The Glee code:

'A'..'Z','a'..'z','0'..'9','+/'=>b64;
'EncB64'#pgm{b64[t\(t# ` %3*=0)#asc %>256%<64@& ->4< +1]=>t;t,'='<-*(t# %/3)};
'DecB64'#pgm{t&&b64=>t;b64 @== `t-1\(t# ` %4*=0)%>64%<256@& ->3< #asc};
'Now is the time for all good men.'=>t$;
t[.t]EncB64=>ct$;  ct[.t]DecB64$;

The Output:
Now is the time for all good men.....
Tm93IGlzIHRoZSB0aW1lIGZvciBhbGwgZ29vZCBtZW4uLi4uAAAu==
Now is the time for all good men.......

Note: This algorithm really doesn't produce the industry standard Base64 conversion unless the source is an exact multiple of 3 characters. In fact, it doesn't even work unless this restriction is met. The problem lies in the last three characters produced by EncB64. The problem is the absolutely necessary ->4 on encoding and the ->3 for each set on decoding. If there are less than three remaining characters it inserts 0's to fill. The real algorithm does not do this in the last three bytes. It works on a bit by bit basis. I discovered this when I tried to compare speeds of Glee and hardcoded C++. The C++ ran nearly 400 times faster (and produced the correct result regardless of character count). In internet-land, Base64 is a necessity. So I added to the string type the C++ optimal methods .B64 (base 64) and .R64 (representation 64). The code above becomes simply:


'Now is the time for all good men.....'=>t$;
t.B64=>ct$;  ct.R64$;

The C++ coded Output:
Now is the time for all good men.....
Tm93IGlzIHRoZSB0aW1lIGZvciBhbGwgZ29vZCBtZW4uLi4uLg==
Now is the time for all good men.....

The play-by-play:

  1. 'A'..'Z','a'..'z','0'..'9','+/'=>b64;. Valid characters for Base64 are are upper-case alpha; lower-case alpha; the numerals; and the + and / characters. Glee produces the continguous character sequences using its ranging operator ..(e.g. 'A'..'Z','a'..'z','0'..'9'). We save these in b64.
  2. 'EncB64'#pgm{b64[t\(t# ` %3*=0)#asc %>256< %<64@& ->4< +1],('='%%4)<-*(t# %/3) };. The encode function indexes into the b64 character string. Building up the index is the essence of the operation. The technique is explained in the following steps
  3. (t# @= ` %3*=0): t# counts the characters in the text . ` generates a numeric index vector (i.e. 1 2 3 ... up to t#); %3 does the modulo producing 1 2 0 1 2 0 ... .*=0 Marking where the zeros are gives us a bit vector we can use to segment the text t \(bit vector ). This breaks our text into a sequence of three character strings. #asc operates on the elements of the sequence in the string converting the characters to numeric form. %> 256 then converts these three-number vectors into a single number base 256. This result is a sequence. %<64@& ->4< +1 represents the numbers in base 64 producing a sequence of 4 element numeric vectors which we disclose. The result is in origin 0 and we get the desired origin 1 by adding 1. b64[numeric index vector] gives us the character string we're looking for.
  4. ,('=')<-*(t# %/3) pads strings which are not exact multiples of 3 with the = character. ,'='<-* 3) catenates up to 2 ='s onto the string for padding. t# %/3 counts the string characters and rounds up to the next multiple of 3. string <-* number takes the proper number of characters from the front of the string dragging on the necessary number of = pad characters from the back.
  5. 'DecB64'#pgm{t&&b64=>t;b64 @== `t-1\(t# ` %4*=0)%>64%<256@& ->3< #asc}. This is the decoding program.
  6. It filters "t" t&&b64=>t; to include only B64 characters. Sometimes spaces and linefeeds are added to the text in transmittal and this removes them.
  7. b64 @== `t-1 finds each character in the B64 list and returns its index. Notice we must do this in exact mode (@== `) because B64 contains upper and lower case characters. This is the left argument to the segmentation operator ( \ .
  8. We form the right argument (t# ` %4*=0) by generating a bit vector with ones at the slicing points. This is done using the same technique employed in encoding ... only this time we are slicing by 4 characters rather than three.
  9. %>64%<256@& ->3< #asc. Once sliced we convert the 4 digit vectors to base 64 and then represent them in base 256. We assure that each set contains three digits@& ->3; disclose to obtain a numeric vector from our sequence; and convert back to ascii.< #asc.

This completes the example. To better understand these operators and other things you can do with them, consult the operator pages according to the type of data you see being operated on.