ctx internals

rasterizer

The ctx vector rasterizer is an active edge table scanline rasterizer, with per-scanline choice of three different rasterization strategies.

The best case is when all edges crossing the scanline are steeper than 45 degrees, in which case each coverage span has a single pixel of AA at its start and at its end.

The second best case is when all portions of AA in a scanline are linear ramps and no edges intersect. These scanlines consist of three types of spans: opaque, AA gradient and no coverage.

Scanlines where the number of active edges changes within the scanline are rasterized with 15-level vertical oversampling. This is the fallback when the other strategies fail, and it is expensive: it is the algorithm's worst-case scanline. The rasterizer would be improved by using the area sum instead for this type of scanline, as well as for scanlines with many edges.
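The per-scanline choice between the three strategies can be sketched roughly as below. This is a minimal illustration and not the code in ctx: the ScanEdge record, the Strategy enum and choose_strategy are hypothetical names, and treating crossing edges as a reason to fall back to oversampling is an assumption made for the sketch.

```c
#include <stdbool.h>

typedef enum {
  STRATEGY_STEEP,        /* all edges steeper than 45 degrees        */
  STRATEGY_LINEAR_RAMPS, /* AA portions are linear ramps             */
  STRATEGY_OVERSAMPLE    /* fallback: vertical oversampling          */
} Strategy;

/* Hypothetical edge record for one scanline (not a ctx type). */
typedef struct {
  float x_top;    /* x where the edge enters the scanline           */
  float x_bottom; /* x where the edge leaves the scanline           */
  bool  partial;  /* edge starts or ends inside this scanline       */
} ScanEdge;

static Strategy choose_strategy (const ScanEdge *edges, int n)
{
  bool all_steep = true;
  for (int i = 0; i < n; i++)
  {
    /* the set of active edges changes within the scanline */
    if (edges[i].partial)
      return STRATEGY_OVERSAMPLE;

    /* two edges cross if their order differs at top and bottom
       (assumed here to also require the fallback) */
    for (int j = i + 1; j < n; j++)
      if ((edges[i].x_top - edges[j].x_top) *
          (edges[i].x_bottom - edges[j].x_bottom) < 0.0f)
        return STRATEGY_OVERSAMPLE;

    /* moving more than one pixel in x per scanline means the edge
       is shallower than 45 degrees */
    float dx = edges[i].x_bottom - edges[i].x_top;
    if (dx < 0.0f) dx = -dx;
    if (dx > 1.0f)
      all_steep = false;
  }
  return all_steep ? STRATEGY_STEEP : STRATEGY_LINEAR_RAMPS;
}
```

The ordering of the checks reflects the text: the cheap strategies are only valid when no edge forces the fallback.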

For an introduction to how scanline rasterization and vertical oversampling work, which might prove useful in understanding the above explanation, see How the stb_truetype Anti-Aliased Software Rasterizer v2 Works.
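As a rough illustration of vertical oversampling, the sketch below accumulates 15 sub-scanlines into 8bit coverage for one output scanline. It is a simplification under stated assumptions: SubSpan and accumulate_coverage are hypothetical names, each sub-scanline has a single covered span, and span ends are snapped to whole pixels (no horizontal AA).

```c
#include <stdint.h>
#include <string.h>

#define SUBSAMPLES 15 /* vertical oversampling level from the text */

/* Hypothetical: pixels [x0, x1) are covered in one sub-scanline. */
typedef struct { int x0, x1; } SubSpan;

/* Sum SUBSAMPLES sub-scanline spans into per-pixel coverage.
   Each sub-scanline contributes 255 / 15 = 17, so a pixel covered
   by all 15 sub-scanlines ends up at exactly 255. */
static void accumulate_coverage (const SubSpan spans[SUBSAMPLES],
                                 uint8_t *coverage, int width)
{
  memset (coverage, 0, (size_t) width);
  for (int s = 0; s < SUBSAMPLES; s++)
  {
    int x0 = spans[s].x0 < 0 ? 0 : spans[s].x0;
    int x1 = spans[s].x1 > width ? width : spans[s].x1;
    for (int x = x0; x < x1; x++)
      coverage[x] += 255 / SUBSAMPLES;
  }
}
```

The resulting coverage buffer is what a compositing function would then consume as per-pixel alpha.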

Render targets handled natively are 8bit sRGBA (RGBA8), floating point scRGB (RGBAF), 8bit and floating point grayscale with alpha (GRAYA8 and GRAYAF), and floating point CMYK (CMYKAF). Integration points are catered for in API and protocol for color management, which will be done with babl. The formats RGB332, RGB565, RGB565_BYTESWAPPED, CMYKA8, RGB8, GRAY1, GRAY2, GRAY4, GRAY8 and GRAYF are handled by converting processed scanlines back and forth to one of the supported targets. BGRA8 is handled by swapping components in the compositing source.
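Converting scanlines back and forth could look something like the following for RGB565 and RGBA8. This is a sketch with hypothetical function names, assuming native-endian RGB565 with red in the top bits; the bit replication used to expand 5 and 6 bit components is one common choice and not necessarily what ctx does.

```c
#include <stdint.h>

/* Expand RGB565 pixels to opaque RGBA8, replicating the high bits
   into the low bits so that full intensity maps to 255. */
static void rgb565_to_rgba8 (const uint16_t *src, uint8_t *dst, int count)
{
  for (int i = 0; i < count; i++)
  {
    uint16_t p = src[i];
    uint8_t  r = (p >> 11) & 0x1f;
    uint8_t  g = (p >> 5) & 0x3f;
    uint8_t  b = p & 0x1f;
    dst[i * 4 + 0] = (uint8_t) ((r << 3) | (r >> 2));
    dst[i * 4 + 1] = (uint8_t) ((g << 2) | (g >> 4));
    dst[i * 4 + 2] = (uint8_t) ((b << 3) | (b >> 2));
    dst[i * 4 + 3] = 255;
  }
}

/* Pack RGBA8 pixels back to RGB565 by dropping the low bits;
   alpha is ignored. */
static void rgba8_to_rgb565 (const uint8_t *src, uint16_t *dst, int count)
{
  for (int i = 0; i < count; i++)
    dst[i] = (uint16_t) (((src[i * 4 + 0] >> 3) << 11) |
                         ((src[i * 4 + 1] >> 2) << 5) |
                          (src[i * 4 + 2] >> 3));
}
```

With this pair a scanline can be expanded, composited as RGBA8 and packed again; pixels that are not touched round-trip unchanged.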

ctx supports grayscale, RGB and CMYK color models, all of which can be used and freely mixed while drawing. Conversion to the device/compositing representation is done during rasterization/rendering. Conversion between ICC matrix profiles for RGB spaces is currently supported at this point when babl support is built in. Without babl, a hard-coded set of primaries known to match the specific display used would be nice for microcontroller use.

The default RGB color space for both device and user is sRGB. Thus code from elsewhere specifying sRGB colors will work as expected. By adding an RGB matrix display profile in /tmp/ctx.icc, the SDL, KMS and fbdev backends use the display space instead of sRGB for compositing.

optimization vs binary size

ctx is designed from the beginning to act as a software GPU for modern microcontrollers, some of which are more powerful than the PCs of the mid 90s. RAM usage can be tuned to be as small as ~7kb of RAM + 42kb of code + 12kb of fontdata, with further size reductions possible by reducing what is built. There are abstractions in ctx for working with external retained framebuffers, where ctx keeps track of what is rendered and only issues redraws for the part of the framebuffer that is changing. The same hashing of drawing commands is used to manage multiple render threads in the SDL and KMS backends.
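The idea of hashing drawing commands to find which parts of a retained framebuffer need redrawing can be sketched as follows. Everything here is hypothetical and heavily simplified: the Cmd record, the split into vertical strips, and the use of FNV-1a are assumptions for illustration, not how ctx stores or hashes its commands.

```c
#include <stdint.h>
#include <stddef.h>

#define TILES 4 /* hypothetical: framebuffer split into 4 vertical strips */

/* Hypothetical drawing command: fill columns [x0, x1) with a color. */
typedef struct { int x0, x1; uint32_t color; } Cmd;

/* 32bit FNV-1a, continuing from a previous hash value. */
static uint32_t fnv1a (uint32_t hash, const void *data, size_t len)
{
  const uint8_t *p = data;
  for (size_t i = 0; i < len; i++)
  {
    hash ^= p[i];
    hash *= 16777619u;
  }
  return hash;
}

/* Hash the commands touching each tile and compare with the hashes
   stored from the previous frame. Returns a bitmask where bit t is
   set when tile t needs to be redrawn, and updates prev in place. */
static unsigned update_tiles (uint32_t prev[TILES], const Cmd *cmds,
                              int n, int tile_width)
{
  unsigned dirty = 0;
  for (int t = 0; t < TILES; t++)
  {
    uint32_t h   = 2166136261u; /* FNV offset basis */
    int      tx0 = t * tile_width;
    int      tx1 = tx0 + tile_width;
    for (int i = 0; i < n; i++)
      if (cmds[i].x0 < tx1 && cmds[i].x1 > tx0)
      {
        h = fnv1a (h, &cmds[i].x0, sizeof cmds[i].x0);
        h = fnv1a (h, &cmds[i].x1, sizeof cmds[i].x1);
        h = fnv1a (h, &cmds[i].color, sizeof cmds[i].color);
      }
    if (h != prev[t])
      dirty |= 1u << t;
    prev[t] = h;
  }
  return dirty;
}
```

Because each tile's hash depends only on the commands touching it, the same bitmask could be used to hand unchanged tiles to nobody and changed tiles to separate render threads.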

font data size:    18kb  (A sans font subsetted to only ASCII,
                                latin1 ~= 33kb )
RGBA8 rasterizer:  ~46kb (compiled with -Os, can triple in size with -O3)
ctx parser:        24kb  (not needed for direct use from C, but also
                          on embedded this can be useful for ease of
                          integration with other languages or directly
                          using ctx+microcontroller+display as a serial
                          display, the ctx parser is also an SVG path data
                          parser.)

The RAM requirements are small. By tuning the engine to have only a couple of save/restore states, and paths with fewer than 256 edges, the total RAM footprint of the rasterizer can be as low as ~5kb on 32bit platforms when a display with retained framebuffer is used. The parser for the ctx protocol needs an additional 1kb.

compositing

portable C with autovectorization

SIMD instructions are additional instructions available on some processors that work on multiple elements of data at once, instead of on single elements of data like the core instruction set. Much of the additional data throughput possible on modern ARM and x86_64 CPUs is only possible when using these instructions. On older CPUs that lack them, these instructions crash the software instead.

babl takes the approach of hand-rolled alternatives for specific instruction sets and conversions, which get included in a cached just-in-time benchmark to figure out which implementation to use. GEGL had some early attempts at using vector types and conditional dispatch in each operation's process() for different available instruction sets. The maintenance overhead of this was much larger than the gains seen. More than a decade later there are many more SIMD instruction sets, and compilers have gotten better at making use of these instructions.

surprise performance

When developing ctx, I initially had an AVX2 implementation of the core solid color filling operation, which is the most used compositing operation in 2D vector graphics. During benchmarking I started noticing that my C code for handling the unaligned data at the start and end, when tasked with doing all the processing, performed better compiled with -O3 than my own mediocre AVX2 implementation of the same.

I then removed the AVX2 code which had been hard and fun to make work, simplified the build and focused on making the code perform well when compiled with -march=native -mtune=native -O2 -ftree-vectorize.

ctx approach to compositing code

In ctx the compositing is implemented by the rasterizer dispatching to work functions with this prototype:

void apply_coverage (CtxRasterizer *rasterizer, /* contains information about
                                                   source/compositing mode etc.  */
                     uint8_t *dst,      // pointer to destination pixels
                     uint8_t *src,      // possibly unused
                     int x0,
                     uint8_t *coverage, // alpha coverage for the pixels
                     int count);        // number of pixels

The core of ctx is u8 and float implementations of this function for generically handling the Porter-Duff compositing modes and the SVG blending modes, for colors, gradients and textures. The implementation of this ends up being loops over pixels with many branches inside; not what compilers like - whether the code is SIMD or not.
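To make the shape of such a work function concrete, here is a minimal sketch of source-over compositing of a solid color into RGBA8 with the same argument order as the prototype. It is not taken from ctx: SketchRasterizer stands in for the real CtxRasterizer, the function name is made up, and both the stored color and the destination are assumed to be premultiplied.

```c
#include <stdint.h>

/* Hypothetical stand-in for the rasterizer state a work function reads. */
typedef struct {
  uint8_t color[4]; /* premultiplied RGBA source color (assumption) */
} SketchRasterizer;

/* Source-over of a solid color, modulated by per-pixel coverage.
   src and x0 are unused for a solid color source. */
static void sketch_source_over_solid (SketchRasterizer *rasterizer,
                                      uint8_t *dst,
                                      uint8_t *src,
                                      int x0,
                                      uint8_t *coverage,
                                      int count)
{
  (void) src;
  (void) x0;
  for (int i = 0; i < count; i++, dst += 4)
  {
    unsigned cov = coverage[i];
    /* source alpha scaled by coverage, rounded */
    unsigned sa = (rasterizer->color[3] * cov + 127) / 255;
    for (int c = 0; c < 4; c++)
    {
      unsigned s = (rasterizer->color[c] * cov + 127) / 255;
      /* result = source + destination * (1 - source alpha) */
      dst[c] = (uint8_t) (s + (dst[c] * (255 - sa) + 127) / 255);
    }
  }
}
```

A loop like this has no branches inside, which is what makes the solid color case a good candidate for autovectorization; the generic code has to decide per pixel between compositing modes, blend modes and source types.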

For RGBA8/BGRA8 there are separate implementations that achieve the same as the generic code, as well as hand-rolled implementations with direct switch dispatch for some cases, like spans of full coverage, inside the rasterizer.

For GRAYA8 (and indirectly GRAY8), which is used for computing clipping masks, there isn't a manual implementation of the compositing. But for the common case of full coverage alpha=1.0 spans, fast paths exist for all formats directly in the rasterizer.

If doing GRAY8/GRAYA8, RGBAF or CMYKAF processing is important, it is possible to turn on inlining of the implementation of the normal blending mode. This causes the compiler to end up compiling bits of the generic code with the compositing, blending and fragment options fixed, which when the compiler is doing its job well causes the branches inside loops to disappear. Compiling takes longer - but yields efficient code.
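The effect of inlining generic code with fixed options can be illustrated with a small sketch. The names are hypothetical and the blend modes are reduced to two on single-component pixels; always_inline is a GCC/clang attribute, so this assumes one of those compilers.

```c
#include <stdint.h>

/* Hypothetical, reduced set of blend modes. */
typedef enum { BLEND_NORMAL, BLEND_MULTIPLY } BlendMode;

/* Generic loop: the switch sits inside the loop, one branch per pixel. */
static inline __attribute__((always_inline)) void
blend_generic (uint8_t *dst, const uint8_t *src, int count, BlendMode mode)
{
  for (int i = 0; i < count; i++)
    switch (mode)
    {
      case BLEND_NORMAL:
        dst[i] = src[i];
        break;
      case BLEND_MULTIPLY:
        dst[i] = (uint8_t) ((dst[i] * src[i] + 127) / 255);
        break;
    }
}

/* Specialization: mode is a compile-time constant after inlining, so
   the compiler can drop the switch and is left with a plain loop. */
static void blend_normal (uint8_t *dst, const uint8_t *src, int count)
{
  blend_generic (dst, src, count, BLEND_NORMAL);
}

/* Fallback for every other mode, keeping the per-pixel branch. */
static void blend_any (uint8_t *dst, const uint8_t *src, int count,
                       BlendMode mode)
{
  blend_generic (dst, src, count, mode);
}
```

Each specialization is another copy of the loop in the binary, which is the compile time and code size cost the paragraph above refers to.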

multipass

When building the ctx library/terminal for x86_64, the compositing and the rasterizer code are built multiple times with different compiler flags in each pass. When initializing, ctx hot-swaps a handful of pointers to choose between implementations.
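A single-file approximation of this pointer swapping is sketched below. It is an assumption-laden illustration, not the ctx build: instead of compiling the same source in several passes, it uses the GCC/clang target attribute to compile one function for AVX2, and picks it at startup with __builtin_cpu_supports. All function names are hypothetical.

```c
#include <stdint.h>

typedef void (*FillFunc) (uint8_t *dst, uint8_t value, int count);

/* Built with the default flags; runs on any CPU. */
static void fill_baseline (uint8_t *dst, uint8_t value, int count)
{
  for (int i = 0; i < count; i++)
    dst[i] = value;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* Same source, but the compiler may use AVX2 when vectorizing it. */
__attribute__((target("avx2")))
static void fill_avx2 (uint8_t *dst, uint8_t value, int count)
{
  for (int i = 0; i < count; i++)
    dst[i] = value;
}
#endif

/* The pointer callers go through; starts at the safe implementation. */
static FillFunc sketch_fill = fill_baseline;

/* Called once at startup to swap in the best available build. */
static void sketch_init (void)
{
#if defined(__x86_64__) && defined(__GNUC__)
  __builtin_cpu_init ();
  if (__builtin_cpu_supports ("avx2"))
    sketch_fill = fill_avx2;
#endif
}
```

Callers only ever use sketch_fill, so the check for CPU features happens once rather than on every call.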

After having success with this in ctx I have done the same to GEGL - with speedups expected from this on both x86 and ARM.

Having inlined implementations of all blend modes would be possible. For now development is focused on making the default cases fast, as well as on making the rest correct.

clang surprises by managing to double the already impressive source-over solid fill rect performance of gcc, nearly halving the runtime.