ctx rasterizer

Small footprint
Can be tuned for microcontrollers down to ~7kb of RAM, 30kb of code and 12kb of font data. Combined with an immediate mode UI that can be re-run, a framebuffer covering one or a few scanlines is sufficient. More RAM permits more flexible arrangement of more components using the CTX protocol.
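The scanline-sized framebuffer idea can be sketched as follows. This is a minimal illustration and not the ctx API: `render_in_bands`, the callback types, the demo callbacks and the 1 byte per pixel buffer are all assumptions made for the example. The drawing code is simply re-run for each band, and each finished band is handed to a flush callback before the buffer is reused.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical callback types for this sketch; not part of the ctx API. */
typedef void (*draw_fn)  (uint8_t *band, int width, int y0, int rows, void *user);
typedef void (*flush_fn) (const uint8_t *band, int width, int y0, int rows, void *user);

/* Render a frame of `height` rows through a buffer holding only `band_rows`
 * scanlines, re-running the drawing code once per band.
 * Returns the number of bands rendered. */
static int
render_in_bands (uint8_t *band, int width, int height, int band_rows,
                 draw_fn draw, flush_fn flush, void *user)
{
  int bands = 0;
  if (band_rows <= 0)
    return 0;
  for (int y0 = 0; y0 < height; y0 += band_rows)
  {
    int rows = height - y0 < band_rows ? height - y0 : band_rows;
    memset (band, 0, (size_t) width * (size_t) rows); /* 1 byte per pixel assumed */
    draw  (band, width, y0, rows, user);   /* immediate mode: redraw for this band */
    flush (band, width, y0, rows, user);   /* e.g. send the rows to the display */
    bands++;
  }
  return bands;
}

/* Demo callbacks: draw a diagonal line, count what gets flushed. */
typedef struct { int rows_flushed; unsigned lit_pixels; } DemoState;

static void
demo_draw (uint8_t *band, int width, int y0, int rows, void *user)
{
  (void) user;
  for (int row = 0; row < rows; row++)
  {
    int x = y0 + row;                      /* pixel (y, y) of the full frame */
    if (x < width)
      band[row * width + x] = 255;
  }
}

static void
demo_flush (const uint8_t *band, int width, int y0, int rows, void *user)
{
  DemoState *state = (DemoState *) user;
  (void) y0;
  state->rows_flushed += rows;
  for (int i = 0; i < width * rows; i++)
    if (band[i])
      state->lit_pixels++;
}
```

With an 8 pixel wide, 5 row frame and a 2 row band buffer, the drawing code runs three times and only 16 bytes of pixel storage are needed.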
Portable
The core rasterizer is C99 code that also compiles as C++, with a simple make-based build system and only optional dependencies, and thus should run on most CPU architectures.

ctx supports grayscale, RGB and CMYK color models, all of which can be used and freely mixed while drawing. Conversion to the device/compositing representation is done during rasterization/rendering. Conversion between ICC matrix profiles for RGB spaces is currently supported when babl support is built in. For microcontroller use without babl, it would be nice to have a hard-coded set of primaries known to match the specific display used.

The default RGB color space for both device and user is sRGB. Thus code from elsewhere specifying sRGB colors will work as expected. When an RGB matrix display profile is added in /tmp/ctx.icc, the SDL, DRM and fbdev backends use the display space instead of sRGB for compositing.

TODO: color manage conversions between CMYK and RGB

The ctx vector rasterizer is an active edge table scanline rasterizer with adaptive vertical oversampling. The oversampling only occurs for scanlines where edges are closer to horizontal than a threshold, or where edges start or end. For a description of how a traditional scanline rasterizer with vertical oversampling works, see the introduction of How the stb_truetype Anti-Aliased Software Rasterizer v2 Works.
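The per-scanline decision described above can be sketched like this. It is an illustration of the stated rule only, not the code in ctx: the `Edge` struct, the function name and the `max_dx_per_row` threshold parameter are hypothetical, and edges are assumed to be stored with `y0 <= y1`.

```c
#include <stdbool.h>

/* Hypothetical edge representation for this sketch, with y0 <= y1. */
typedef struct { float x0, y0, x1, y1; } Edge;

/* Returns true when the scanline covering [y, y+1) should be vertically
 * oversampled: an edge starts or ends inside it, or an edge crossing it
 * moves more than max_dx_per_row pixels in x per pixel in y (that is, it
 * is closer to horizontal than the threshold). */
static bool
scanline_needs_oversampling (const Edge *edges, int n_edges, int y,
                             float max_dx_per_row)
{
  for (int i = 0; i < n_edges; i++)
  {
    const Edge *e = &edges[i];
    if (e->y1 <= (float) y || e->y0 >= (float) (y + 1))
      continue;                               /* edge not active on this scanline */
    if (e->y0 > (float) y || e->y1 < (float) (y + 1))
      return true;                            /* edge starts or ends in here */
    float dy = e->y1 - e->y0;                 /* >= 1, edge spans the whole scanline */
    float dx = e->x1 - e->x0;
    if (dx < 0.0f)
      dx = -dx;
    if (dx > max_dx_per_row * dy)
      return true;                            /* too close to horizontal */
  }
  return false;                               /* one sample row is enough */
}
```

Scanlines for which this returns false can be rasterized with a single sample row, which is where the speedup of the adaptive mode would come from.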

As can be seen in the test renders at the bottom of this page, the adaptive renders produce results similar to the non-adaptive ones, but they are faster. It makes sense to render interactive animations with lower AA settings and re-render at full quality when the UI settles. Due to bitrot, the adaptive part of rendering has been turned off in favor of 3 or 5 levels of vertical supersampling in ctx; the renders on this page were made when it was working well.

At the moment the rasterizer generates high quality 8bit masks. This also works well in floating point, but full floating point is desirable and would be best achieved with a renderer similar to v2 of the stb_truetype rasterizer. The current approach would be much slower if made to produce satisfactory results for u16.

Render targets handled natively are 8bit sRGBA (RGBA8), floating point scRGB (RGBAF), 8bit and floating point grayscale with alpha (GRAYA8 and GRAYAF) and floating point CMYK (CMYKAF). Integration points are catered for in API and protocol for color management, which will be done with babl. The formats RGB332, RGB565, RGB565_BYTESWAPPED, CMYKA8, RGB8, BGRA8, GRAY1, GRAY2, GRAY4, GRAY8 and GRAYF are handled by converting processed scanlines back and forth to one of the supported targets.

The compositing is written generically for N components in u8 or float. Pre-processor acrobatics are used to enable the compiler to do inlining and code elimination. Work is in progress on an AVX2 version of the generic u8 code; eventually SIMD through intrinsics will also be attempted for floating point, but this is less pressing there since autovectorization works better on float than u8.

static void
__ctx_u8_porter_duff (CtxRasterizer         *rasterizer,
                     int                    components,
                     uint8_t *              dst,
                     uint8_t *              src,
                     int                    x0,
                     uint8_t *              coverage,
                     int                    count,
                     CtxCompositingMode     compositing_mode,
                     CtxFragment            fragment,
                     CtxBlend               blend);
       

Overrides for RGBA8 for sourceOver and normal are provided separately, with compile-time optional AVX2 acceleration. The prototype of all the inner-loop functions is:

void ctx_composite_pixels (CtxRasterizer *rasterizer,
                           uint8_t       *dst,
                           uint8_t       *src,
                           int            x0,
                           uint8_t       *coverage,
                           int            count);
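
To illustrate what such an inner loop does, here is a minimal source-over sketch for RGBA8. It is not the ctx implementation: the function and helper names are hypothetical, the rasterizer and x0 arguments are left out, the source is assumed to be a single solid color, and both source and destination are assumed to use premultiplied (associated) alpha.

```c
#include <stdint.h>

/* Hypothetical helper: round (a * b / 255) for 8-bit values. */
static inline uint8_t
mul_u8 (uint8_t a, uint8_t b)
{
  unsigned t = (unsigned) a * b + 128;
  return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* Source-over of one premultiplied RGBA8 color `src` onto `count` pixels of
 * `dst`, modulated by the 8-bit coverage mask from the rasterizer. */
static void
rgba8_source_over_sketch (uint8_t *dst, const uint8_t *src,
                          const uint8_t *coverage, int count)
{
  for (int i = 0; i < count; i++, dst += 4)
  {
    uint8_t cov = coverage[i];
    if (cov == 0)
      continue;                                   /* pixel not touched */
    uint8_t inv = (uint8_t) (255 - mul_u8 (src[3], cov)); /* 1 - src alpha * coverage */
    for (int c = 0; c < 4; c++)
    {
      unsigned v = (unsigned) mul_u8 (src[c], cov) + mul_u8 (dst[c], inv);
      dst[c] = (uint8_t) (v > 255 ? 255 : v);     /* clamp guards non-premultiplied input */
    }
  }
}
```

Compositing opaque white over opaque black with coverage values 0, 128 and 255 leaves the first pixel black, turns the second into mid gray (128) and the third into white, with alpha staying at 255 throughout.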
        

The following are two images from the test suite used for evaluating and verifying the possible antialiasing modes.

15x17 CTX_ANTIALIAS_BEST

17x15 CTX_ANTIALIAS_DEFAULT

51x5 CTX_ANTIALIAS_GOOD

85x3 adaptive CTX_ANTIALIAS_FAST

1x1 CTX_ANTIALIAS_NONE

In the following graphic we want symmetric artifacts. Differences in vertical fidelity can be observed as quantization in the near-horizontal spokes. Horizontal fidelity can be examined in the vertical spokes by counting how many steps the gradient has.

15x17 CTX_ANTIALIAS_BEST

17x15 CTX_ANTIALIAS_DEFAULT

51x5 CTX_ANTIALIAS_GOOD

85x3 CTX_ANTIALIAS_FAST

1x1 CTX_ANTIALIAS_NONE

optimization vs binary size

ctx is designed from the beginning to act as a software GPU for modern microcontrollers, some of which are more powerful than the PCs of the mid 90s. Different compiler optimization settings give a wide range for the performance/binary size tradeoff. The small footprint in RAM and the ability to use a shared read-only display list also make the ctx rasterizer core useful in multi-threaded rendering.

font data size:    12186 bytes (A sans font subsetted to only ASCII)
RGBA8 rasterizer:  38869 bytes (-Os ~38kb - with many features disabled ~30kb
                                -O0  90-631kb
                                -O2  57-114kb
                                -Ofast/-O3  76-181kb, size (and compiletime)
                                bump due to SIMD tree-vectorizer)

ctx parser:        16384 bytes (not needed for direct use from C, but also
                                on embedded this can be useful for ease of
                                integration with other languages or directly
                                using ctx+microcontroller+display as a serial
                                display.)

Even more aggressive optimization, with exponential compile time, has been tested by forcing inlining and separate compilation of all combinations of blend, composite and image source modes. Building ctx then takes 5-6 minutes and the resulting binary is close to a megabyte; performance gains exist but in practice they are near negligible.

The RAM requirements are small. By tuning the engine to have only a couple of save/restore states, and paths with fewer than 256 edges, the total RAM footprint of the rasterizer can be as low as ~5kb on 32bit platforms; the parser for the ctx protocol needs an additional 1kb. Where the framebuffer is too large to fit in RAM, the allocation needed for scanline(s) must be weighed against the RAM needed for the renderstream. Commands take a multiple of 9 bytes; there is code/provisions for runtime compacting of the renderstream in prior git revisions.
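The tradeoff between scanline buffer and renderstream can be put into numbers with a small helper. The 9 byte entry size comes from the text above; the function names and the example display size and RAM budget are hypothetical and only serve to illustrate the arithmetic.

```c
#include <stddef.h>

/* From the text: commands take a multiple of 9 bytes. */
enum { ENTRY_BYTES = 9 };

/* Bytes needed for a buffer of `scanlines` rows (hypothetical helper). */
static size_t
scanline_buffer_bytes (int width, int bytes_per_pixel, int scanlines)
{
  return (size_t) width * (size_t) bytes_per_pixel * (size_t) scanlines;
}

/* Bytes used by a renderstream of `entries` 9-byte entries. */
static size_t
renderstream_bytes (size_t entries)
{
  return entries * ENTRY_BYTES;
}

/* Number of 9-byte entries that fit in `ram_budget` bytes once the
 * scanline buffer has been reserved. */
static size_t
renderstream_capacity (size_t ram_budget, size_t scanline_bytes)
{
  if (scanline_bytes >= ram_budget)
    return 0;
  return (ram_budget - scanline_bytes) / ENTRY_BYTES;
}
```

For example, four scanlines of a 240 pixel wide RGB565 display take 240 * 2 * 4 = 1920 bytes, so an assumed budget of 8192 bytes leaves room for (8192 - 1920) / 9 = 696 entries. Doubling the number of scanlines to eight would reduce that to 483 entries.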

is ctx fast yet?

This table contains the time it takes to fill a 512x512px buffer with an inscribed circle, using various fill sources, for all of ctx' supported pixel format target encodings. The tests were done on battery. The race between ctx and cairo is close enough that for solid RGBA8 fills ctx is faster on battery and cairo/pixman is faster on AC (probably related to available CPU frequency boosting).

format         color a=1.0  color a=0.75  lgrad  rgrad  sAtop a=1.0  sAtop a=0.75  sAtop lgrad  sAtop rgrad
cairo                  350           462   4155   6216         1202          1278         4028         6078
RGBA8                  258           303   1207   1787         2033          2256         3975         3521
RGBA8 nosimd           481           573   1020   1031         2025          2340         4222         3632
BGRA8                  347           401   1375   1944         2134          2337         4109         3601
GRAYA8                 558           804   4917   7322         1199          1212         4727         7041
GRAYAF                 544           535   5899   4946          646           647         5693         4795
RGBAF                  646           884   1473   2018         1010          1030         1529         2014
CMYKAF                 574           910   8701   5823         1344          1273         8553         5997
CMYKA8                3392          3685  11570   8601         4123          4088        11930         8833
GRAY1                 1538          1811   5843   8029         1902          1899         5494         7823
GRAY2                 1745          1986   5616   8215         2269          2270         5841         8208
GRAY4                 1462          1696   5650   8179         1983          1976         5563         7909
GRAY8                  598           859   4605   7131         1125          1124         4743         7043
GRAYF                  643           632   5860   5014          759           761         5822         4906
RGB8                   435           479   1405   2008         2210          2429         4239         3705
RGB332                 974           994   1929   2523         2737          2959         4805         4577
RGB565                1013          1074   2118   2674         2817          2980         4779         4215
RGB565_BS              908           957   1889   2461         2922          2946         5061         4461
CMYK8                 2485          2825  10440   7440         3237          3026        10050         7227

Smaller numbers are better; numbers are the time to render one frame in microseconds.

some observations on the above

All of these tests are single threaded. In normal interactive use ctx contexts would use parallel rendering; the rasterization happens twice but the compositing work is fully parallelized.

As seen with the variation with AA=5 and AA=3, tuning the cost of the rasterizer down makes ctx compare OK with cairo. The visual fidelity of the default (and high quality) AA in cairo is equivalent to ctx' AA=15, which is the default value used if not specified. There is still room for refactoring to improve the performance and memory access patterns of the rasterizer.

When targeting a microcontroller whose framebuffer is small enough that it can be kept as RGBA8 instead of RGB565, one should use RGBA8 and convert on-the-fly when copying out. This keeps the number of intermediate conversions between RGB565 and RGBA8 down and gives better performance; some of this might later be done automatically by ctx. Doing this also has the advantage of higher quality compositing, similar to how 16bit RGB could be used to get linear compositing alternates for RGBA8 formats.
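The on-the-fly conversion when copying out could look roughly like the sketch below. The function names are hypothetical and not part of ctx. The sketch assumes the RGBA8 framebuffer is opaque, so alpha is simply dropped, and it truncates the low bits without dithering. The `byteswap` flag corresponds to displays that expect the RGB565_BYTESWAPPED layout.

```c
#include <stdint.h>

/* Pack 8-bit R, G, B into RGB565 by keeping the top 5, 6 and 5 bits. */
static inline uint16_t
rgb8_to_rgb565 (uint8_t r, uint8_t g, uint8_t b)
{
  return (uint16_t) (((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert `count` RGBA8 pixels to RGB565 while copying them out, so that
 * all compositing can stay in RGBA8. Alpha is ignored (opaque assumed). */
static void
copy_out_rgb565 (const uint8_t *rgba, uint16_t *out, int count, int byteswap)
{
  for (int i = 0; i < count; i++)
  {
    uint16_t p = rgb8_to_rgb565 (rgba[i * 4 + 0],
                                 rgba[i * 4 + 1],
                                 rgba[i * 4 + 2]);
    if (byteswap)
      p = (uint16_t) ((p >> 8) | (p << 8));   /* swap the two bytes */
    out[i] = p;
  }
}
```

Because the conversion happens once per pixel at copy-out time, no precision is lost to repeated RGB565 round trips during compositing.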