SIMD instructions are additional instructions, available on some processors, that work on multiple elements of data at once instead of on single elements like the core instruction set does. Much of the additional data throughput possible on modern ARM and x86_64 CPUs is only reachable when using these instructions. Older CPUs do not support them, and software that uses them crashes there.
babl takes the approach of hand-rolled alternatives for specific instruction sets and conversions, which get included in a cached just-in-time benchmark that figures out which implementation to use. GEGL had some early attempts at using vector types and conditional dispatch in each operation's process() for the different available instruction sets. The maintenance overhead of this was much larger than the observed gains. More than a decade later there are many more SIMD instruction sets, and compilers have gotten better at making use of these instructions.
When developing ctx, I initially had an AVX2 implementation of the core solid color filling operation, which is the most used compositing operation in 2D vector graphics. During benchmarking I started noticing that my C code for handling the unaligned data at the start and end, when tasked with doing all of the processing and compiled with -O3, performed better than my own mediocre AVX2 implementation of the same.
I then removed the AVX2 code which had been hard and fun to make work, simplified the build and focused on making the code perform well when compiled with -march=native -mtune=native -O2 -ftree-vectorize.
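As an illustration of the kind of loop this relies on, here is a minimal sketch of a solid color source-over fill written as plain C. It is not the actual ctx code: the function name fill_solid_over_rgba8, the div255 helper and the premultiplied RGBA8 layout are assumptions made for the example. The loop body has no branches, which is the shape auto-vectorizers tend to handle well, though whether a given compiler vectorizes it depends on the compiler version and flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rounded division of a 16-bit product by 255. */
static inline uint32_t div255 (uint32_t v)
{
  return (v + 127) / 255;
}

/* Hypothetical sketch, not ctx's implementation: composite one
 * premultiplied RGBA8 color over `count` destination pixels, with the
 * source scaled by an 8-bit coverage value per pixel.
 * Assumes src is premultiplied (each color component <= alpha). */
void fill_solid_over_rgba8 (uint8_t *restrict       dst,
                            const uint8_t           src[4],
                            const uint8_t *restrict coverage,
                            int                     count)
{
  assert (count >= 0);
  for (int i = 0; i < count; i++)
    {
      uint32_t cov = coverage[i];
      /* effective source alpha for this pixel, and its complement */
      uint32_t inv = 255 - div255 (src[3] * cov);
      for (int c = 0; c < 4; c++)
        {
          /* source-over: result = src * cov + dst * (1 - src_alpha * cov) */
          dst[4 * i + c] = (uint8_t) (div255 (src[c] * cov) +
                                      div255 (dst[4 * i + c] * inv));
        }
    }
}
```

Compiled with flags like the ones mentioned above, a loop of this shape leaves the choice of instructions to the compiler, so the same source can serve several instruction sets.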
In ctx the compositing is implemented by the rasterizer dispatching to work functions with this prototype:
apply_coverage (CtxRasterizer *rasterizer, /* contains information about source/compositing mode etc. */
                uint8_t       *dst,        /* pointer to destination pixels */
                uint8_t       *src,        /* possibly unused */
                int            x0,
                uint8_t       *coverage,   /* alpha coverage for the pixels */
                int            count);     /* number of pixels */
The core of ctx is u8 and float implementations of this function that generically handle the Porter-Duff compositing modes and the SVG blending modes, for colors, gradients and textures. The implementation ends up being loops over pixels with many branches inside, which is not what compilers like, whether the code is SIMD or not.
For RGBA8/BGRA8 there are separate implementations that achieve the same as the generic code, as well as hand-rolled implementations with direct switch dispatch inside the rasterizer for some cases, like spans of full coverage.
For GRAYA8 (and indirectly GRAY8), which is used for computing clipping masks, there isn't a manual implementation of the compositing. But for the common case of full coverage alpha=1.0 spans, fast paths exist for all formats directly in the rasterizer.
If GRAY8/GRAYA8, RGBAF or CMYKAF processing is important, it is possible to turn on inlining of the implementation of the normal blending mode. This makes the compiler compile bits of the generic code with the compositing, blending and fragment options fixed, which, when the compiler is doing its job well, causes the branches inside the loops to disappear. Compiling takes longer, but yields efficient code.
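The effect of fixing options through inlining can be sketched as follows. This is a simplified illustration, not ctx's code: the BlendMode enum, the function names and the opaque GRAY8 destination with coverage used as alpha are all assumptions chosen to keep the example short. The generic function switches on the blend mode for every pixel; the wrapper passes the mode as a compile-time constant, so a compiler that inlines the call can fold the switch away and leave a branch-free loop.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__GNUC__)
#define ALWAYS_INLINE static inline __attribute__((always_inline))
#else
#define ALWAYS_INLINE static inline
#endif

/* Hypothetical blend modes for this sketch only. */
typedef enum { BLEND_NORMAL, BLEND_MULTIPLY } BlendMode;

/* Generic code: the switch on `mode` sits inside the pixel loop. */
ALWAYS_INLINE void
composite_generic_gray8 (uint8_t *dst, uint8_t src,
                         const uint8_t *coverage, int count,
                         BlendMode mode)
{
  assert (count >= 0);
  for (int i = 0; i < count; i++)
    {
      uint32_t d = dst[i];
      uint32_t blended;
      switch (mode)
        {
          case BLEND_MULTIPLY: blended = (src * d + 127) / 255; break;
          case BLEND_NORMAL:
          default:             blended = src;                   break;
        }
      /* interpolate between destination and blended value by coverage */
      uint32_t cov = coverage[i];
      dst[i] = (uint8_t) ((d * (255 - cov) + blended * cov + 127) / 255);
    }
}

/* Specialized entry point: `mode` is a constant here, so after
 * inlining the compiler can drop the switch from the loop. */
void composite_normal_gray8 (uint8_t *dst, uint8_t src,
                             const uint8_t *coverage, int count)
{
  composite_generic_gray8 (dst, src, coverage, count, BLEND_NORMAL);
}

/* Fallback entry point: `mode` is only known at runtime. */
void composite_gray8 (uint8_t *dst, uint8_t src,
                      const uint8_t *coverage, int count,
                      BlendMode mode)
{
  composite_generic_gray8 (dst, src, coverage, count, mode);
}
```

Each specialized wrapper adds another copy of the loop to the binary, which is the compile time and code size cost traded for the faster inner loop.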
When building the ctx library/terminal, the compositing and the rasterizer code are built multiple times, with different compiler flags in each pass. When initializing, ctx hot-swaps a handful of pointers to choose between the implementations.
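A minimal sketch of such pointer swapping at initialization could look like the following. The names fill_span, fill_baseline, fill_avx2_build and init_dispatch are hypothetical, and the two fill functions are stand-ins defined in one file; in a multi-pass build they would be the same source compiled in separate passes (for example with and without -mavx2, which is an assumption about the flags) under different symbol names. The feature check uses the GCC/clang builtin __builtin_cpu_supports, which is only available on x86, so other targets fall back to the baseline here.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*FillFn) (uint8_t *dst, uint8_t value, int count);

/* Stand-in for the pass compiled with baseline flags. */
static void fill_baseline (uint8_t *dst, uint8_t value, int count)
{
  assert (count >= 0);
  for (int i = 0; i < count; i++)
    dst[i] = value;
}

/* Stand-in for the pass compiled with AVX2 enabled; in this sketch
 * the body is identical, only the symbol differs. */
static void fill_avx2_build (uint8_t *dst, uint8_t value, int count)
{
  assert (count >= 0);
  for (int i = 0; i < count; i++)
    dst[i] = value;
}

/* The pointer callers go through; defaults to the safe baseline. */
FillFn fill_span = fill_baseline;

static int cpu_has_avx2 (void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx2") != 0;
#else
  return 0; /* unknown target: stay on the baseline build */
#endif
}

/* Called once at startup to pick the implementation for this CPU. */
void init_dispatch (void)
{
  if (cpu_has_avx2 ())
    fill_span = fill_avx2_build;
}
```

Because the check happens once, the per-call cost is a single indirect call, and code built for newer instruction sets is never executed on a CPU that lacks them.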
After having success with this in ctx I have done the same to GEGL - with speedups expected from this on both x86 and ARM.
Having inlined implementations of all blend modes would be possible. For now, development is focused on making the default cases fast, as well as on making the rest correct.
clang surprises by nearly halving the runtime of the source-over solid fill rect compared to gcc, almost doubling an already impressive performance.