Case Study CS00011: Base64 encoding and decoding
The problem: In e-mail and other transfer mechanisms,
some characters in the ascii set are reserved for special purposes. If
they appear in the message there is a problem. The most common technique
employed is the Base64 coding. This method converts 3 original characters to 4
transmitted characters. On the receiving end the process is reversed. This
example shows how this encoding and decoding is accomplished in Glee
code
The solution:
- Create the 64 character string of valid transmission characters.
- Encode a source string into transmittable string.
- Decode the received string back to the source string.
Note: You can cut and paste these code fragments into the code pane of
the Glee interpreter and experiment as you go along to see the actual
operations live.
The Glee code:
'A'..'Z','a'..'z','0'..'9','+/'=>b64;
'EncB64'#pgm{b64[t\(t# ` %3*=0)#asc %>256%<64@& ->4<
+1]=>t;t,'='<-*(t# %/3)};
'DecB64'#pgm{t&&b64=>t;b64 @== `t-1\(t# `
%4*=0)%>64%<256@& ->3< #asc};
'Now is the time for all good men.'=>t$;
t[.t]EncB64=>ct$; ct[.t]DecB64$;
The Output:
Now is the time for all good men.....
Tm93IGlzIHRoZSB0aW1lIGZvciBhbGwgZ29vZCBtZW4uLi4uAAAu==
Now is the time for all good men.......
Note: This algorithm really doesn't produce the
industry standard Base64 conversion unless the source is an exact multiple of 3
characters. In fact, it doesn't even work unless this restriction is met. The
problem lies in the last three characters produced by EncB64. The
problem is the absolutely necessary ->4 on encoding and the
->3 for each set on decoding. If there are less than three
remaining characters it inserts 0's to fill. The real algorithm does not do
this in the last three bytes. It works on a bit by bit basis. I discovered this
when I tried to compare speeds of Glee and hardcoded C++. The C++
ran nearly 400 times faster (and produced the correct result regardless of
character count). In internet-land, Base64 is a necessity. So I added to the
string type the C++ optimal methods .B64 (base 64) and .R64
(representation 64). The code above becomes simply:
'Now is the time for all good men.....'=>t$;
t.B64=>ct$; ct.R64$;
The C++ coded Output:
Now is the time for all good men.....
Tm93IGlzIHRoZSB0aW1lIGZvciBhbGwgZ29vZCBtZW4uLi4uLg==
Now is the time for all good men.....
The play-by-play:
- 'A'..'Z','a'..'z','0'..'9','+/'=>b64;. Valid
characters for Base64 are are upper-case alpha; lower-case alpha; the numerals;
and the + and /
characters. Glee produces the continguous character
sequences using its ranging operator ..(e.g. 'A'..'Z','a'..'z','0'..'9'). We save these in
b64.
- 'EncB64'#pgm{b64[t\(t# ` %3*=0)#asc %>256<
%<64@& ->4< +1],('='%%4)<-*(t# %/3) };. The encode
function indexes into the b64 character
string. Building up the index is the essence of the operation. The technique is
explained in the following steps
- (t# @= ` %3*=0):
t# counts the characters in the text .
` generates a numeric index vector (i.e.
1 2 3 ... up to t#);
%3 does the modulo producing 1 2 0 1 2 0
... .*=0 Marking where the zeros are
gives us a bit vector we can use to
segment the text t \(bit vector ).
This breaks our text into a sequence of three character strings.
#asc operates on the elements of the
sequence in the string converting the characters to numeric form.
%> 256 then converts these
three-number vectors into a single number base 256. This result is a sequence.
%<64@& ->4< +1 represents
the numbers in base 64 producing a sequence of 4 element numeric vectors which
we disclose. The result is in origin 0 and we get the desired origin 1 by
adding 1. b64[numeric index vector]
gives us the character string we're looking for.
- ,('=')<-*(t# %/3) pads strings
which are not exact multiples of 3 with the = character. ,'='<-* 3) catenates up to 2 ='s onto the string for padding.
t# %/3 counts the string characters and
rounds up to the next multiple of 3. string
<-* number takes the proper number of characters from
the front of the string dragging on the necessary number of
= pad characters from the back.
- 'DecB64'#pgm{t&&b64=>t;b64 @==
`t-1\(t# ` %4*=0)%>64%<256@& ->3< #asc}. This is
the decoding program.
- It filters "t" t&&b64=>t; to include only B64
characters. Sometimes spaces and linefeeds are added to the text in transmittal
and this removes them.
- b64 @== `t-1 finds each character in
the B64 list and returns its index. Notice we must do this in exact mode (@== `) because B64 contains upper and lower
case characters. This is the left argument to the segmentation operator ( \ .
- We form the right argument (t# `
%4*=0) by generating a bit vector with ones at the slicing points.
This is done using the same technique employed in encoding ... only this time
we are slicing by 4 characters rather than three.
- %>64%<256@& ->3<
#asc. Once sliced we convert the 4 digit vectors to base 64 and
then represent them in base 256. We assure that each set contains three
digits@& ->3; disclose to obtain a
numeric vector from our sequence; and convert back to ascii.< #asc.
This completes the example. To better understand these operators and other
things you can do with them, consult the operator pages according to the type of
data you see being operated on.