Contenu | Menu

The previous tests confirmed that direct and predictable access to the data is always faster than indirect access with code branches.

This time we try to improve the code for the general 2D convolution routine. Three versions were compared for the convolution of a 2048x2048 image with a 5x5 kernel:

I used the same simple profiling tools as before, with gprof:

measures

without compiler optimization

 %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name    
39.33     25.36    25.36       10     2.54     2.54  _convol_macro
33.48     46.95    21.59       10     2.16     2.16  _convol_old
25.20     63.20    16.25       10     1.62     1.62  _convol_pointer
 1.99     64.48     1.28                             main
 0.00     64.48     0.00       30     0.00     2.11  convol

with compiler optimization

 %   cumulative   self              self     total           
time   seconds   seconds    calls   s/call   s/call  name    
46.85     10.47    10.47       10     1.05     1.05  _convol_macro
26.35     16.36     5.89       10     0.59     0.59  _convol_old
24.43     21.82     5.46       10     0.55     0.55  _convol_pointer
 2.37     22.35     0.53                             main
 0.00     22.35     0.00       30     0.00     0.73  convol

So, what?

These measures were obtained for a compilation with gcc-4.2. The results are similar with gcc-3.4; with gcc-2.95, the optimization of the "old" version seems inefficient, but the overall impression is the same.

alternatives?

I tried an orthogonal approach, looping first on the kernel, then on the image; as the kernel is supposed to be (much) smaller than the image, it would result in long and regular inner loops, which I expected to be more efficient. But then we can't use a 64-bit accumulator. Is this really needed?

Finally, the results were not convincing, because the loops needed lots of temporary variables, probably more than what the CPU registers allow. I keep it aside, it may be useful later.

miscellaneous remarks

integer indexes

I noticed that with gcc 4.2 on an x86_32 CPU, indexing the loops with an unsigned integer makes the code 30% slower than when we index the loops with a vanilla signed integer. Too me, it looks like a gcc optimization bug.

register starvation

x86 architecture has only (!) 7 integer registers and it's easy to need more for nested loops with variable limits. A few extra registers would be of great help, and are likely to increase the code efficiency, but the x86 CPU family won't provide them.

restrict keyword

In C99, the restrict keyword provides hints to the compiler; it ensures (from the programmer's brain) that two pointers will never alias the same data, allowing better optimizations, and saving some registers. This may improve out pointer-based loops. But it's C99-only...

unrolling the loops

Partial unrolling of the loops usually uses more efficiently the modern CPUs superscalar and pipelining features. This is achieved by replacing a single-instruction loop

for (i = 0; i < imax; i++)
    do_stuff_with(i);

by a bunch of similar instructions inside the loop; for example, with 8 instructions in the loop

for (i = 0; i < imax % 8; i++)
     do_stuff_with(i);
for (i = imax % 8; i < imax; i += 8)
{
     do_stuff_with(i);
     do_stuff_with(i + 1);
     do_stuff_with(i + 2);
     do_stuff_with(i + 3);
     do_stuff_with(i + 4);
     do_stuff_with(i + 5);
     do_stuff_with(i + 6);
     do_stuff_with(i + 7);
}

The downside is that this is largely architecture-dependant. ATLAS uses build-time heuristics to do these optimisations.