Case Study CS00015: Interpolating data and resampling
The problem: You are collecting data which you
need to normalize for statistical analysis. The data is collected over
arbitrary time frames defined by a start and end time and a value for the
indicated interval. Intervals do not overlap. Some intervals are missing. You
need to produce a data stream with a regular interval and no voids. A
Glee solution to this case was requested by Felix. His is a
telecom application where call request and call enabled data are collected from
network elements. He needed an efficient method of filling in missing data and
resampling it for his analysis.
The solution:
- Place the cells in ascending time order
- Convert the cell value to a cell rate value (value/cell interval)
- Interpolate the rates at cell boundaries and desired interval indices
- Sum scan the rate times the interval (i.e. integrate)
- Index out by desired resampling indexes
- First difference to determine value of resampled cells
$$ The Glee Code:.
25=>w;
'Cell Start: '->w (2 10 15 31 33 => s)$;
'Cell End: '->w (6 15 30 33 45 => e)$;
'Cell Value: '->w (3 4 6 2 8 => v)$;
'Cell Rate:
'->w
(v/(e-s)=>r)$;
'Resampling granularity: '->w (5=>g)$;
'Sampling range start: '->w (s<- /g%\ *g => srs)$;
'Sampling range end: '->w (e-> /g %/ *g
=> sre)$;
'Samples:
'->w
(sre-srs/g%\ =>ns)$;
'Sample indices: '->w
(0..ns*g +srs=>si)$;
'Data&Sample indices : '->w (s e si < &
=>sidb)$;
'Interpolated rates: '->w (s e r #@` sidb
=> iv)$;
'Interpolated intervals: '->w (sidb -- =>ii)$;
'Interpolated aggregates: '->w (ii*iv=>ia)$;
'Cumulative aggregates: '->w (ia/+ %- =>ca)$;
'Result indices: '->w
(sidb`si=>ri)$;
'Sampled aggregates: '->w (ca[ri]=>sa)$;
'Sampled results:'->w
$;
'Sample start:
'->w
(si ->_1 @&(%* ->3))$;
'Sample end:
'->w
(si+g ->_1 @&(%* ->3))$;
'Sample value:
'->w (sa--
<-_1 @&(%* ->3))$;
The Output:
Cell Start: 2 10 15 31 33
Cell End: 6 15 30 33 45
Cell
Value: 3 4 6 2 8
Cell Rate: 0.75 0.8 0.4 1 0.666667
Resampling granularity: 5
Sampling range start: 0
Sampling range end: 45
Samples: 9
Sample indices: 0 5 10 15 20
25 30 35 40 45
Data&Sample indices : 0 2 5 6 10 15 20 25 30 31 33 35 40 45
Interpolated rates: 0.75 0.75 0.75 0.775 0.8 0.4 0.4
0.4 0.7 1 0.666667 0.666667 0.666667 0.666667
Interpolated intervals: 0 2 3 1 4 5 5 5 5 1 2 2 5 5
Interpolated aggregates: 0 1.5 2.25 0.775 3.2 2 2 2 3.5 1 1.33333 1.33333
3.33333 3.33333
Cumulative aggregates: 0 2 4 5 8 10 12 14 17 18 20 21 24 28
Result indices: 1 3 5 6 7 8 9
12 13 14
Sampled aggregates: 0 4 8 10 12 14 17 21 24 28
Sampled results:
Sample
start: 0 5 10 15 20 25 30 35 40
Sample
end: 5 10 15 20 25 30 35 40 45
Sample
value: 4 4 2 2 2 3 4
3 4
The play-by-play:
- 25=>w;
'Cell Start: '->w (2 10 15 31 33 => s)$;
'Cell End: '->w (6 15 30 33 45 => e)$;
'Cell Value: '->w (3 4 6 2 8 => v)$;
'Cell Rate:
'->w
(v/(e-s)=>r)$;
'Resampling granularity: '->w (5=>g)$;.
This is the input data for the example. The caption display width is held in
"w". The right take of the captions causes them to be right
justified. Stranding the caption with the parenthesized expression and
displaying it with "$" is a quick way to document each
software sculpturing step. In real life, once the algorithm is proved, it is
usually packaged into a Glee function. The "cell rate
(r)" is the rate at which values accumulate over the cell
interval. We need this derivative value for resampling the cells.
The output:
Cell Start: 2 10 15 31 33
Cell End: 6 15 30 33 45
Cell
Value: 3 4 6 2 8
Cell Rate: 0.75 0.8 0.4 1 0.666667
Resampling granularity: 5
- 'Sampling range start: '->w
(s<- /g%\ *g => srs)$;
'Sampling range end: '->w (e-> /g %/ *g
=> sre)$;
'Samples:
'->w
(sre-srs/g%\ =>ns)$;
'Sample indices: '->w
(0..ns*g +srs=>si)$;
'Data&Sample indices: '->w (s e si < &
=>sidb)$;
We create the sampling indices by making them span the input data and break on
granularity points. For the algorithm to work, we need to combine these
"desired output" indices with the "input data" indices,
removing duplicates. This gives us the needed interpolation indices
"sidb"
The output:
Sampling range start: 0
Sampling range end: 45
Samples: 9
Sample indices: 0 5 10 15 20
25 30 35 40 45
Data&Sample indices: 0 2 5 6 10 15 20 25 30 31 33
35 40 45
- 'Interpolated rates:
'->w (s e r #@` sidb => iv)$;
In one quick step, the interpolation operator "#@`" (read
the glyph as number (value) at index) interpolates the cells at all
indices "sidb". The operator recognizes "cell
interpolation" by the three item sequence (s: start; e: end; r: rate)
given on the left. The result is returned as the interpolation vector
"iv"
The output:
Interpolated rates: 0.75 0.75 0.75 0.775 0.8 0.4 0.4
0.4 0.7 1 0.666667 0.666667 0.666667 0.666667
- 'Interpolated intervals: '->w (sidb --
=>ii)$;
'Interpolated aggregates: '->w (ii*iv=>ia)$;
'Cumulative aggregates: '->w (ia/+ %- =>ca)$;
The interpolated intervals are the first difference of the indices.
This interval "ii" times the interpolated rate
"iv" produces the interpolated aggregate for the interval.
Notice the intervals are the original data intervals and our desired sample
index intervals. By doing the sum scan "/+", we integrate
and save intermediate values for each interval. I round these to the nearest
integer with the "%-" rounding operator.
The output:
Interpolated intervals: 0 2 3 1 4 5 5 5 5 1 2 2
5 5
Interpolated aggregates: 0 1.5 2.25 0.775 3.2 2 2 2 3.5 1 1.33333 1.33333
3.33333 3.33333
Cumulative aggregates: 0 2 4 5 8 10 12 14 17 18 20 21 24 28
- 'Result indices:
'->w
(sidb`si=>ri)$;
'Sampled aggregates: '->w (ca[ri]=>sa)$;
'Sampled results:'->w
$;
'Sample start:
'->w
(si ->_1 @&(%* ->3))$;
'Sample end:
'->w
(si+g ->_1 @&(%* ->3))$;
'Sample value:
'->w (sa--
<-_1 @&(%* ->3))$;
We're only interested in results at the sample indices. The
"sidb`si=>ri" step gives us the result indices of
interest. Using these indices we extract the results we want from the
cumulative aggregates "ca". The results are then displayed:
First the start indices (si->_1) and next the end indices
(si+g->_1). I haven't implemented Glee's formatting
operator yet so I use the trick of converting to a string and doing the right
take to align the output (@&(%* ->3)) ... crude, but it works
for now. Since our result is cumulative, we need to get the first difference to
obtain the amount contributed by each cell (sa-- <-_1 ). The
various drops you see from the beginning and ending of the vectors aligns the
results for presentation.
The output:
Result indices: 1 3 5 6 7 8 9 12 13 14
Sampled aggregates: 0 4 8 10 12 14 17 21 24 28
Sampled results:
Sample
start: 0 5 10 15 20 25 30 35 40
Sample
end: 5 10 15 20 25 30 35 40 45
Sample
value: 4 4 2 2 2 3 4
3 4
This completes the example. It gives a flavor of the techniques employed
when doing vector programming. To realize Glee's processing
efficiencies you need to consider the problem in vector, rather than scalar or
atomic, form. Once you're used to it, you will produce much more readable code
than with the corresponding classical scalar coding style. To better understand
these operators and other things you can do with them, consult the
operator pages according to the type of
data you see being operated on.