The previous tests confirmed that direct and predictable access to the data is always faster than indirect access with code branches.
This time we try to improve the code for the general 2D convolution routine. Three versions were compared for the convolution of a 2048x2048 image with a 5x5 kernel:
- a naive convert+macro based routine
- the routine from megawave 3
- a modified version, using only pointer arithmetics in the inner loop
I used the same simple profiling tools as before, with gprof:
measures
without compiler optimization
% cumulative self self total
time seconds seconds calls s/call s/call name
39.33 25.36 25.36 10 2.54 2.54 _convol_macro
33.48 46.95 21.59 10 2.16 2.16 _convol_old
25.20 63.20 16.25 10 1.62 1.62 _convol_pointer
1.99 64.48 1.28 main
0.00 64.48 0.00 30 0.00 2.11 convol
with compiler optimization
% cumulative self self total
time seconds seconds calls s/call s/call name
46.85 10.47 10.47 10 1.05 1.05 _convol_macro
26.35 16.36 5.89 10 0.59 0.59 _convol_old
24.43 21.82 5.46 10 0.55 0.55 _convol_pointer
2.37 22.35 0.53 main
0.00 22.35 0.00 30 0.00 0.73 convol
So, what?
- Well, the first remark is that the macro version is slow. This doesn't
mean that the macro access is slow (it is very close to the syntax
used in the "old" version), because the loops design is poor, in fact,
with lots of
ifbranches. It breaks the CPU prediction and the insctrucion pipelining (see the "unrolling" note, hereafter). - then the
pointercode is faster. 25% faster without compiler optimizations, only 5% faster with optimization. This advantage was observed for various image sizes, but it's not very significant. My guess is that the gain in pointer arithmetics (less variables and less operations in the inner loop) were compensated by the pointer aliasing problems (see the "restrict" note, hereafter).
These measures were obtained for a compilation with gcc-4.2. The results are similar with gcc-3.4; with gcc-2.95, the optimization of the "old" version seems inefficient, but the overall impression is the same.
alternatives?
I tried an orthogonal approach, looping first on the kernel, then on the image; as the kernel is supposed to be (much) smaller than the image, it would result in long and regular inner loops, which I expected to be more efficient. But then we can't use a 64-bit accumulator. Is this really needed?
Finally, the results were not convincing, because the loops needed lots of temporary variables, probably more than what the CPU registers allow. I keep it aside, it may be useful later.
miscellaneous remarks
integer indexes
I noticed that with gcc 4.2 on an x86_32 CPU, indexing the loops with
an unsigned integer makes the code 30% slower than when we index the
loops with a vanilla signed integer. Too me, it looks like a gcc
optimization bug.
register starvation
x86 architecture has only
7 integer registers
and it's easy to need more for nested loops with variable limits. A
few extra registers would be of great help, and are likely to increase
the code efficiency, but the x86 CPU family won't provide them.
restrict keyword
In C99, the restrict keyword provides hints to the compiler; it
ensures (from the programmer's brain) that two pointers will never
alias the same data, allowing
better optimizations,
and saving some registers. This may improve out pointer-based loops. But
it's C99-only...
unrolling the loops
Partial unrolling of the loops usually uses more efficiently the modern CPUs superscalar and pipelining features. This is achieved by replacing a single-instruction loop
for (i = 0; i < imax; i++)
do_stuff_with(i);
by a bunch of similar instructions inside the loop; for example, with 8 instructions in the loop
for (i = 0; i < imax % 8; i++)
do_stuff_with(i);
for (i = imax % 8; i < imax; i += 8)
{
do_stuff_with(i);
do_stuff_with(i + 1);
do_stuff_with(i + 2);
do_stuff_with(i + 3);
do_stuff_with(i + 4);
do_stuff_with(i + 5);
do_stuff_with(i + 6);
do_stuff_with(i + 7);
}
The downside is that this is largely architecture-dependant. ATLAS uses build-time heuristics to do these optimisations.