Case Study CS00015: Interpolating data and resampling

The problem: You are collecting data that you need to normalize for statistical analysis. The data is collected over arbitrary time frames, each defined by a start time, an end time, and a value for the indicated interval. Intervals do not overlap, and some intervals are missing. You need to produce a data stream with a regular interval and no voids. A Glee solution to this case was requested by Felix. He has a telecom application in which call-request and call-enabled data are collected from network elements, and he needed an efficient method of filling in missing data and resampling it for his analysis.

The solution:

  1. Place the cells in ascending time order
  2. Convert the cell value to a cell rate value (value/cell interval)
  3. Interpolate the rates at cell boundaries and desired interval indices
  4. Sum scan the rate times the interval (i.e. integrate)
  5. Index out by desired resampling indices
  6. First difference to determine the value of each resampled cell
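
For readers without Glee at hand, the six steps can be sketched end-to-end in plain Python. This is an illustration, not the Glee implementation; in particular, the rate assigned to a cell end that borders a gap (the average of the adjacent cell rates) and the rate used inside gaps (the next cell's rate) are assumptions inferred from the example output below.

```python
import math

def resample(s, e, v, g):
    """Resample cells (starts s, ends e, values v) onto a regular
    grid of granularity g, filling gaps by interpolation (sketch)."""
    r = [v[i] / (e[i] - s[i]) for i in range(len(v))]   # 2. cell rates
    srs = math.floor(min(s) / g) * g                    # sampling range start
    sre = math.ceil(max(e) / g) * g                     # sampling range end
    si = list(range(srs, sre + g, g))                   # sample indices
    sidb = sorted(set(s) | set(e) | set(si))            # combined indices

    def rate_at(p):                                     # 3. interpolated rate
        for i in range(len(s)):
            if s[i] <= p < e[i]:
                return r[i]                             # inside a cell
        for i in range(len(e)):
            if p == e[i]:                               # cell end at a gap:
                return (r[i] + r[i + 1]) / 2 if i + 1 < len(r) else r[i]
        for i in range(len(s)):
            if p < s[i]:                                # in a gap or before the
                return r[i]                             # data: next cell's rate
        return r[-1]                                    # past the last cell

    iv = [rate_at(p) for p in sidb]
    ii = [0] + [sidb[k] - sidb[k - 1] for k in range(1, len(sidb))]
    ca, acc = [], 0.0                                   # 4. integrate (sum scan)
    for width, rate in zip(ii, iv):
        acc += width * rate
        ca.append(round(acc))                           # rounded, like Glee's %-
    ri = [sidb.index(p) for p in si]                    # 5. index out samples
    sa = [ca[k] for k in ri]
    return [(si[k - 1], si[k], sa[k] - sa[k - 1])       # 6. first difference
            for k in range(1, len(sa))]
```

With the example data, `resample([2,10,15,31,33], [6,15,30,33,45], [3,4,6,2,8], 5)` yields the sample values 4 4 2 2 2 3 4 3 4 shown in the output below.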

The Glee Code:
25=>w;
'Cell Start: '->w   (2 10 15 31 33 => s)$;    
'Cell End: '->w     (6 15 30 33 45 => e)$;
'Cell Value: '->w   (3  4  6  2  8 => v)$;  
'Cell Rate: '->w               (v/(e-s)=>r)$;
'Resampling granularity: '->w  (5=>g)$;
'Sampling range start: '->w    (s<- /g%\ *g => srs)$;
'Sampling range end: '->w      (e-> /g %/ *g => sre)$;
'Samples: '->w                 (sre-srs/g%\ =>ns)$;
'Sample indices: '->w          (0..ns*g +srs=>si)$;
'Data&Sample indices: '->w     (s e si < & =>sidb)$;
'Interpolated rates: '->w      (s e r #@` sidb => iv)$;
'Interpolated intervals: '->w  (sidb -- =>ii)$;
'Interpolated aggregates: '->w (ii*iv=>ia)$;
'Cumulative aggregates: '->w   (ia/+ %- =>ca)$;
'Result indices: '->w          (sidb`si=>ri)$;
'Sampled aggregates: '->w      (ca[ri]=>sa)$;
'Sampled results:'->w          $;
'Sample start: '->w            (si   ->_1 @&(%* ->3))$;
'Sample end: '->w              (si+g ->_1 @&(%* ->3))$;
'Sample value: '->w            (sa-- <-_1 @&(%* ->3))$;

The Output:
             Cell Start: 2 10 15 31 33
               Cell End: 6 15 30 33 45
             Cell Value: 3 4 6 2 8
              Cell Rate: 0.75 0.8 0.4 1 0.666667
 Resampling granularity: 5
   Sampling range start: 0
     Sampling range end: 45
                Samples: 9
         Sample indices: 0 5 10 15 20 25 30 35 40 45
    Data&Sample indices: 0 2 5 6 10 15 20 25 30 31 33 35 40 45
     Interpolated rates: 0.75 0.75 0.75 0.775 0.8 0.4 0.4 0.4 0.7 1 0.666667 0.666667 0.666667 0.666667
 Interpolated intervals: 0 2 3 1 4 5 5 5 5 1 2 2 5 5
Interpolated aggregates: 0 1.5 2.25 0.775 3.2 2 2 2 3.5 1 1.33333 1.33333 3.33333 3.33333
  Cumulative aggregates: 0 2 4 5 8 10 12 14 17 18 20 21 24 28
         Result indices: 1 3 5 6 7 8 9 12 13 14
     Sampled aggregates: 0 4 8 10 12 14 17 21 24 28
         Sampled results:
           Sample start:   0  5 10 15 20 25 30 35 40
             Sample end:   5 10 15 20 25 30 35 40 45
           Sample value:   4  4  2  2  2  3  4  3  4

The play-by-play:

  1. 25=>w;
    'Cell Start: '->w   (2 10 15 31 33 => s)$;    
    'Cell End: '->w     (6 15 30 33 45 => e)$;
    'Cell Value: '->w   (3  4  6  2  8 => v)$;  
    'Cell Rate: '->w               (v/(e-s)=>r)$;
    'Resampling granularity: '->w  (5=>g)$;
    This is the input data for the example. The caption display width is held in "w". The right take of the captions causes them to be right justified. Stranding the caption with the parenthesized expression and displaying it with "$" is a quick way to document each software sculpturing step. In real life, once the algorithm is proved, it is usually packaged into a Glee function. The "cell rate (r)" is the rate at which values accumulate over the cell interval. We need this derivative value for resampling the cells.
    The output:
                 Cell Start: 2 10 15 31 33
                   Cell End: 6 15 30 33 45
                 Cell Value: 3 4 6 2 8
                  Cell Rate: 0.75 0.8 0.4 1 0.666667
     Resampling granularity: 5
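
The rate computation is plain division: value over interval width. In Python terms (a sketch of the same setup):

```python
s = [2, 10, 15, 31, 33]   # cell starts
e = [6, 15, 30, 33, 45]   # cell ends
v = [3, 4, 6, 2, 8]       # cell values
# The rate is the value accumulated per unit of interval width.
r = [v[i] / (e[i] - s[i]) for i in range(len(v))]
print(r)  # [0.75, 0.8, 0.4, 1.0, 0.6666666666666666]
```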
  2. 'Sampling range start: '->w    (s<- /g%\ *g => srs)$;
    'Sampling range end: '->w      (e-> /g %/ *g => sre)$;
    'Samples: '->w                 (sre-srs/g%\ =>ns)$;
    'Sample indices: '->w          (0..ns*g +srs=>si)$;
    'Data&Sample indices: '->w     (s e si < & =>sidb)$;

    We create the sampling indices by making them span the input data and fall on granularity points. For the algorithm to work, we need to combine these "desired output" indices with the "input data" indices, removing duplicates. This gives us the needed interpolation indices "sidb".
    The output:
       Sampling range start: 0
         Sampling range end: 45
                    Samples: 9
             Sample indices: 0 5 10 15 20 25 30 35 40 45
        Data&Sample indices: 0 2 5 6 10 15 20 25 30 31 33 35 40 45
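
The same index construction in Python terms (a sketch; `math.floor` and `math.ceil` stand in for the roundings applied to the range start and end):

```python
import math

s = [2, 10, 15, 31, 33]
e = [6, 15, 30, 33, 45]
g = 5                                   # resampling granularity
srs = math.floor(min(s) / g) * g        # sampling range start -> 0
sre = math.ceil(max(e) / g) * g         # sampling range end   -> 45
si = list(range(srs, sre + g, g))       # sample indices 0 5 ... 45
# Merge the data indices with the sample indices, removing duplicates.
sidb = sorted(set(s) | set(e) | set(si))
```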
  3. 'Interpolated rates: '->w      (s e r #@` sidb => iv)$;
    In one quick step, the interpolation operator "#@`" (read the glyph as number (value) at index) interpolates the cells at all indices "sidb". The operator recognizes "cell interpolation" by the three-item sequence (s: start; e: end; r: rate) given on the left. The result is returned as the interpolation vector "iv".
    The output:
    Interpolated rates: 0.75 0.75 0.75 0.775 0.8 0.4 0.4 0.4 0.7 1 0.666667 0.666667 0.666667 0.666667
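
The "#@`" operator is built into Glee, and its exact rule isn't spelled out here. The sketch below uses a rule inferred from the printed rates (e.g. 0.775 at index 6 and 0.7 at index 30, the averages of the adjacent cell rates) and reproduces this output; treat the gap-handling branches as assumptions.

```python
s = [2, 10, 15, 31, 33]
e = [6, 15, 30, 33, 45]
r = [0.75, 0.8, 0.4, 1.0, 8 / 12]       # cell rates from step 1

def rate_at(p):
    for i in range(len(s)):
        if s[i] <= p < e[i]:            # inside a cell (start inclusive):
            return r[i]                 # that cell's rate
    for i in range(len(e)):
        if p == e[i]:                   # at a cell end bordering a gap:
            return (r[i] + r[i + 1]) / 2 if i + 1 < len(r) else r[i]
    for i in range(len(s)):
        if p < s[i]:                    # in a gap or before the data:
            return r[i]                 # next cell's rate (assumption)
    return r[-1]                        # past the last cell

sidb = [0, 2, 5, 6, 10, 15, 20, 25, 30, 31, 33, 35, 40, 45]
iv = [rate_at(p) for p in sidb]
```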
  4. 'Interpolated intervals: '->w  (sidb -- =>ii)$;
    'Interpolated aggregates: '->w (ii*iv=>ia)$;
    'Cumulative aggregates: '->w   (ia/+ %- =>ca)$;
    The interpolated intervals are the first differences of the indices. Each interval "ii" times its interpolated rate "iv" produces the interpolated aggregate for that interval. Notice the intervals break on both the original data boundaries and our desired sample boundaries. The sum scan "/+" integrates, saving an intermediate value for each interval. I round these to the nearest integer with the "%-" rounding operator.
    The output:
     Interpolated intervals: 0 2 3 1 4 5 5 5 5 1 2 2 5 5
    Interpolated aggregates: 0 1.5 2.25 0.775 3.2 2 2 2 3.5 1 1.33333 1.33333 3.33333 3.33333
      Cumulative aggregates: 0 2 4 5 8 10 12 14 17 18 20 21 24 28
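
This integration step in Python terms (a sketch; `round` stands in for Glee's "%-"):

```python
sidb = [0, 2, 5, 6, 10, 15, 20, 25, 30, 31, 33, 35, 40, 45]
iv = [0.75, 0.75, 0.75, 0.775, 0.8, 0.4, 0.4, 0.4, 0.7,
      1.0, 8 / 12, 8 / 12, 8 / 12, 8 / 12]   # interpolated rates
# First difference of the indices gives each interval's width.
ii = [0] + [sidb[k] - sidb[k - 1] for k in range(1, len(sidb))]
# width * rate is the aggregate per interval; the running sum
# integrates, rounded to the nearest integer at each step.
ca, acc = [], 0.0
for width, rate in zip(ii, iv):
    acc += width * rate
    ca.append(round(acc))
```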
  5. 'Result indices: '->w          (sidb`si=>ri)$;
    'Sampled aggregates: '->w      (ca[ri]=>sa)$;
    'Sampled results:'->w          $;
    'Sample start: '->w            (si   ->_1 @&(%* ->3))$;
    'Sample end: '->w              (si+g ->_1 @&(%* ->3))$;
    'Sample value: '->w            (sa-- <-_1 @&(%* ->3))$;

    We're only interested in results at the sample indices. The "sidb`si=>ri" step gives us the result indices of interest. Using these indices, we extract the results we want from the cumulative aggregates "ca". The results are then displayed: first the start indices (si->_1) and next the end indices (si+g->_1). I haven't implemented Glee's formatting operator yet, so I use the trick of converting to a string and doing a right take to align the output (@&(%* ->3)) ... crude, but it works for now. Since our result is cumulative, we take the first difference to obtain the amount contributed by each cell (sa-- <-_1). The various drops you see from the beginning and end of the vectors align the results for presentation.
    The output:
             Result indices: 1 3 5 6 7 8 9 12 13 14
         Sampled aggregates: 0 4 8 10 12 14 17 21 24 28
             Sampled results:
               Sample start:   0  5 10 15 20 25 30 35 40
                 Sample end:   5 10 15 20 25 30 35 40 45
               Sample value:   4  4  2  2  2  3  4  3  4
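
Steps 5 and 6 in Python terms (a sketch; note that Python indexes from 0, while the "Result indices" in the Glee output above are 1-origin):

```python
sidb = [0, 2, 5, 6, 10, 15, 20, 25, 30, 31, 33, 35, 40, 45]
si = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]          # sample indices
ca = [0, 2, 4, 5, 8, 10, 12, 14, 17, 18, 20, 21, 24, 28]
# Locate each sample index within the combined index vector ...
ri = [sidb.index(p) for p in si]                     # 0-origin here
sa = [ca[k] for k in ri]                             # sampled aggregates
# ... and take the first difference to recover each cell's value.
rows = [(si[k - 1], si[k], sa[k] - sa[k - 1]) for k in range(1, len(sa))]
```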

This completes the example. It gives a flavor of the techniques employed when doing vector programming. To realize Glee's processing efficiencies, you need to consider the problem in vector, rather than scalar or atomic, form. Once you're used to it, you will produce much more readable code than with the corresponding classical scalar coding style. To better understand these operators and other things you can do with them, consult the operator pages according to the type of data being operated on.