the ctx rasterizer

The ctx vector rasterizer is an active edge table scanline rasterizer with adaptive vertical oversampling. Oversampling only occurs on scanlines where an edge starts or ends, or where an edge is closer to horizontal than a threshold. For a description of how a traditional scanline rasterizer with vertical oversampling works, see the introduction of How the stb_truetype Anti-Aliased Software Rasterizer v2 Works.
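The per-scanline decision described above can be sketched as follows. This is a minimal illustration and not the ctx implementation: the `Edge` struct, the function name `vertical_samples_for_row` and the parameters `full_aa` and `max_dx_per_dy` are hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Hypothetical edge record for this sketch, not the ctx data structure.
 * Edges are assumed to be oriented so that y0 <= y1. */
typedef struct {
  float x0, y0;   /* start point */
  float x1, y1;   /* end point   */
} Edge;

/* Returns the number of sub-scanlines to sample for pixel row y:
 * full_aa when an edge starts or ends inside the row, or when an edge
 * crossing the row is closer to horizontal than max_dx_per_dy permits;
 * otherwise a single sample is considered enough. */
static int
vertical_samples_for_row (const Edge *edges, size_t n_edges, int y,
                          int full_aa, float max_dx_per_dy)
{
  float top    = (float) y;
  float bottom = (float) (y + 1);

  for (size_t i = 0; i < n_edges; i++)
    {
      const Edge *e = &edges[i];

      if (e->y1 <= top || e->y0 >= bottom)
        continue;                 /* edge does not touch this row */

      if (e->y0 > top || e->y1 < bottom)
        return full_aa;           /* edge starts or ends inside the row */

      /* the edge spans the whole row here, so dy >= 1 */
      float dy = e->y1 - e->y0;
      float dx = e->x1 - e->x0;
      if (dx < 0.0f)
        dx = -dx;

      if (dx > max_dx_per_dy * dy)
        return full_aa;           /* shallow, near-horizontal edge */
    }
  return 1;
}
```

Rows crossed only by steep edges get one sample, which is where the speedup over non-adaptive oversampling would come from in this sketch.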

As can be seen in the test renders at the bottom of this page, the adaptive renders produce results similar to the non-adaptive ones, but they are faster. It makes sense to render interactive animations with lower AA settings and rerender at full quality when the UI settles.

At the moment the rasterizer generates high quality 8bit masks, which also work well for the floating point targets. Full floating point coverage is desirable, and would be best achieved with a renderer similar to v2 of the stb_truetype rasterizer rather than with the simpler to implement extension of the current approach to u16.

Render targets handled natively are 8bit sRGBA (RGBA8), floating point scRGB (RGBAF), 8bit and floating point grayscale with alpha (GRAYA8 and GRAYAF), and floating point CMYK (CMYKAF). Integration points for color management, which will be done with babl, are catered for in the API and protocol. The formats RGB332, RGB565, RGB565_BYTESWAPPED, CMYKA8, RGB8, BGRA8, GRAY1, GRAY2, GRAY4, GRAY8 and GRAYF are handled by converting processed scanlines back and forth to one of the supported targets.
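As an illustration of converting a scanline back and forth, here is a minimal sketch for RGB565 and RGBA8. The function names are hypothetical, and the bit layout (red in bits 15..11, green in 10..5, blue in 4..0, native byte order) is an assumption for the example rather than a statement about how ctx stores RGB565.

```c
#include <stdint.h>

/* Expand an RGB565 scanline to opaque RGBA8.
 * Assumed layout: red in bits 15..11, green in 10..5, blue in 4..0. */
static void
rgb565_to_rgba8 (const uint16_t *src, uint8_t *dst, int count)
{
  for (int i = 0; i < count; i++)
    {
      uint16_t p  = src[i];
      uint8_t  r5 = (uint8_t) ((p >> 11) & 0x1f);
      uint8_t  g6 = (uint8_t) ((p >> 5) & 0x3f);
      uint8_t  b5 = (uint8_t) (p & 0x1f);

      /* replicate the high bits into the low bits so that the maximum
       * 5 or 6 bit value maps to 255 */
      dst[i * 4 + 0] = (uint8_t) ((r5 << 3) | (r5 >> 2));
      dst[i * 4 + 1] = (uint8_t) ((g6 << 2) | (g6 >> 4));
      dst[i * 4 + 2] = (uint8_t) ((b5 << 3) | (b5 >> 2));
      dst[i * 4 + 3] = 255;
    }
}

/* Pack an RGBA8 scanline back to RGB565 by truncation; alpha is dropped. */
static void
rgba8_to_rgb565 (const uint8_t *src, uint16_t *dst, int count)
{
  for (int i = 0; i < count; i++)
    {
      uint16_t r5 = (uint16_t) (src[i * 4 + 0] >> 3);
      uint16_t g6 = (uint16_t) (src[i * 4 + 1] >> 2);
      uint16_t b5 = (uint16_t) (src[i * 4 + 2] >> 3);
      dst[i] = (uint16_t) ((r5 << 11) | (g6 << 5) | b5);
    }
}
```

With bit replication on the way in and truncation on the way out, pixels that compositing leaves untouched survive the round trip unchanged.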

The compositing is written generically for N components in u8 or float. Pre-processor acrobatics are used to enable the compiler to inline and eliminate code. Work is in progress on an AVX2 version of the generic u8 code; eventually SIMD through intrinsics will also be attempted for floating point, but it is less pressing there since autovectorization works better on float than on u8.

static void
__ctx_u8_porter_duff (CtxRasterizer         *rasterizer,
                     int                    components,
                     uint8_t *              dst,
                     uint8_t *              src,
                     int                    x0,
                     uint8_t *              coverage,
                     int                    count,
                     CtxCompositingMode     compositing_mode,
                     CtxFragment            fragment,
                     CtxBlend               blend);
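To illustrate what compositing written generically over N components can look like, here is a minimal sketch of source-over with normal blending in u8. It is not the ctx code: the function names are hypothetical, and it assumes premultiplied alpha stored as the last component and a single solid source color.

```c
#include <stdint.h>

/* (a * b) / 255 with rounding, for a and b in 0..255 */
static inline uint8_t
mul_div_255 (uint32_t a, uint32_t b)
{
  uint32_t t = a * b + 128;
  return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* Source-over of one solid, premultiplied color onto count pixels, for
 * any number of components. Alpha is assumed to be the last component. */
static inline void
source_over_u8 (int components, uint8_t *dst, const uint8_t *src,
                const uint8_t *coverage, int count)
{
  for (int i = 0; i < count; i++)
    {
      uint8_t cov   = coverage[i];
      uint8_t src_a = mul_div_255 (src[components - 1], cov);
      uint8_t inv   = (uint8_t) (255 - src_a);

      for (int c = 0; c < components; c++)
        {
          unsigned v = mul_div_255 (src[c], cov) + mul_div_255 (dst[c], inv);
          dst[c] = (uint8_t) (v > 255 ? 255 : v);  /* guard invalid input */
        }
      dst += components;
    }
}

/* Thin wrappers with a constant component count give the compiler the
 * chance to inline the generic loop and unroll it per format. */
static void
source_over_rgba8 (uint8_t *dst, const uint8_t *src,
                   const uint8_t *coverage, int count)
{
  source_over_u8 (4, dst, src, coverage, count);
}

static void
source_over_graya8 (uint8_t *dst, const uint8_t *src,
                    const uint8_t *coverage, int count)
{
  source_over_u8 (2, dst, src, coverage, count);
}
```

The wrappers stand in for the pre-processor approach: the loop is written once, and each format gets a specialized copy when the compiler inlines it with a constant `components`.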
       

Overrides for RGBA8 with source-over compositing and normal blending are provided separately, with optional compile-time AVX2 acceleration. All the inner-loop functions share the following prototype:

void ctx_composite_pixels (CtxRasterizer *rasterizer,
                           uint8_t       *dst,
                           uint8_t       *src,
                           int            x0,
                           uint8_t       *coverage,
                           int            count);
        

The following are two images from the test suite used for evaluating and verifying the possible antialiasing modes.

15x17 adaptive CTX_ANTIALIAS_BEST

15x17

17x15 adaptive CTX_ANTIALIAS_DEFAULT

17x15

51x5 adaptive CTX_ANTIALIAS_GOOD

51x5

85x3 adaptive CTX_ANTIALIAS_FAST

85x3

1x1 CTX_ANTIALIAS_NONE

In the following graphic, we want symmetric artifacts, and can observe the differences in vertical fidelity by the quantization in the near horizontal spokes. The horizontal fidelity can be examined in the vertical spokes - by counting how many steps the gradient gets.

15x17 adaptive CTX_ANTIALIAS_BEST

15x17

17x15 adaptive CTX_ANTIALIAS_DEFAULT

17x15

51x5 adaptive CTX_ANTIALIAS_GOOD

51x5

85x3 adaptive CTX_ANTIALIAS_FAST

85x3

1x1 CTX_ANTIALIAS_NONE

optimization vs binary size

ctx is designed from the beginning to act as a software GPU for modern microcontrollers, some of which are more powerful than the PCs in the mid 90s. Different optimization settings of the compiler give a wide range for the performance/binary size tradeoff. But the small footprint in RAM and ability to use a shared read-only display list makes the ctx rasterizer core also useful in multi-threaded rendering.

font data size:    12186 bytes (A sans font subsetted to only ASCII)
RGBA8 rasterizer:  38869 bytes (-Os ~38kb - with many features disabled ~30kb
                                -O0  90-631kb
                                -O2  57-114kb
                                -Ofast/-O3  76-181kb, size (and compiletime)
                                bump due to SIMD tree-vectorizer)

ctx parser:        16384 bytes (not needed for direct use from C, but also
                                on embedded this can be useful for ease of
                                integration with other languages or directly
                                using ctx+microcontroller+display as a serial
                                display.)

Even more aggressive optimization, with exponential compile time, has been tested by forcing inlining and separate compilation of all combinations of blend, composite and image source modes. Building ctx then takes 5-6 minutes and the resulting binary is close to a megabyte; performance gains exist but in practice they are near negligible.

The RAM requirements are small. By tuning the engine to have only a couple of save/restore states and paths with fewer than 256 edges, the total RAM footprint of the rasterizer can be as low as ~5kb on 32bit platforms; the parser for the ctx protocol needs an additional 1kb. Where the framebuffer is too large to fit in RAM, the allocation needed for scanline(s) must be weighed against the RAM needed for the renderstream. Commands take a multiple of 9 bytes; there is code and provisions for runtime compacting of the renderstream in prior git revisions.
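The trade-off between scanline buffers and renderstream can be put into numbers with a small sketch. The `Entry` layout below (one command byte followed by 8 payload bytes) is an assumption made to arrive at 9 bytes per entry, and `entries_that_fit` is a hypothetical helper, not part of ctx.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 9 byte renderstream entry: one command byte followed by
 * 8 bytes of payload. Packing avoids padding to a 12 or 16 byte size. */
#pragma pack(push, 1)
typedef struct {
  uint8_t code;
  union {
    float    f[2];
    uint32_t u32[2];
    uint16_t u16[4];
    uint8_t  u8[8];
  } data;
} Entry;
#pragma pack(pop)

_Static_assert (sizeof (Entry) == 9, "entries must be 9 bytes");

/* Number of entries that fit in RAM after reserving space for the
 * rasterizer state and for the given number of scanline buffers. */
static size_t
entries_that_fit (size_t ram_budget, size_t rasterizer_state,
                  int width, int bytes_per_pixel, int scanlines)
{
  size_t scan = (size_t) width * (size_t) bytes_per_pixel
                * (size_t) scanlines;
  if (rasterizer_state + scan >= ram_budget)
    return 0;
  return (ram_budget - rasterizer_state - scan) / sizeof (Entry);
}
```

For example, with a 32768 byte budget, 5120 bytes of rasterizer state and a 240 pixel wide display at 2 bytes per pixel, one scanline buffer leaves room for 3018 entries, while 16 scanlines leave room for 2218.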

is ctx fast yet?

This table contains fills of 1024px with various fill sources for all of ctx's supported targets, with cairo included as a frame of reference. Small changes, like switching between inlining, forced inlining and no inlining, can lead to performance fluctuations.

Ctx implements all combinations of compositing and blending modes, but the code path for normal with source over has received the most attention, both with optional AVX2 acceleration and without.

             color                       sAtop                       overlay                     aTop overlay
format       a=1.0 a=0.75  lgrad  rgrad  a=1.0 a=0.75  lgrad  rgrad  a=1.0 a=0.75  lgrad  rgrad  a=1.0 a=0.75  lgrad  rgrad
RGBA8 AVX2    3809   1996    281    181    145    126     92     92    100     88     56     56    103     97     56     56
RGBA8          799    765    217    167    141    122     90     87    115    102     56     56    117    103     57     56
cairo         2585   1148    115     66    644    560    115     69
GRAYA8        1052    651     84     54    264    258     78     54    187    178     63     52    188    178     63     52
GRAYAF         506    506     74     73    594    596     73     72    229    229     55     55    231    229     54     54
RGBAF          513    671    214    160    507    537    176    148    129    126     52     50    119    117     50     48
CMYKAF         532    405     42     54    379    381     50     58    145    144     36     39    148    148     36     39
CMYKA8         133    131     38     44    129    127     40     44     81     80     30     32     80     82     30     33
GRAY1          268    240     58     45    157    154     55     45    126    122     46     43    126    122     46     43

some observations on the above

To see how this compares with other renderers, compare with the numbers in blend2d's benchmarks.

When targeting a microcontroller whose framebuffer is small enough to be kept as RGBA8 instead of RGB565, use RGBA8 and convert on the fly when copying out, for best performance.

For floating point, compilers do a better job of autovectorizing code; SIMD implementations for normal + source-over do however still make sense.