GPU RENDERING

SIMD Parsing

SIMD-accelerated Rust parser. Scans 16-32 bytes per vector instruction. Your ANSI escape sequences never had it so good.

Questions this answers

  • tmux very slow output in less
  • Why is terminal output slow even on a fast machine?
  • How fast can a terminal emulator parse ANSI escape codes?
  • Terminal bottleneck parsing escape sequences during large output
  • Fastest ANSI parser for terminal emulators

How it works

Terminal emulators must parse every byte of output from the PTY, scanning for ANSI escape sequences (CSI, OSC, DCS, and more) that control cursor movement, colors, and screen manipulation. Traditional parsers examine one byte at a time through a state machine. Chau7's parser, written in Rust, uses ARM NEON and x86 SSE/AVX SIMD intrinsics to scan 16 to 32 bytes simultaneously.

The SIMD fast path loads a chunk of PTY data into a vector register and tests every byte in it against the ESC character (0x1B) and the C1 control range with a handful of vector instructions, rather than one comparison per byte. If no escape characters are found in the chunk, the entire block is dispatched as printable text without touching the state machine. When an escape is detected, the parser falls back to a scalar state machine for that sequence only, then resumes SIMD scanning.
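
The fast path described above can be sketched as follows. This is a minimal illustration, not Chau7's actual code: the function names `find_first_special` and `split_printable` are hypothetical, the C1 test uses the 0x90-0x9F range quoted in the FAQ, and only an SSE2 path is shown (other targets fall through to the scalar loop).

```rust
// Hypothetical sketch: names and structure are illustrative only.

/// Returns the index of the first byte that needs the scalar state machine
/// (ESC 0x1B or a raw byte in 0x90..=0x9F), or `data.len()` if there is none.
pub fn find_first_special(data: &[u8]) -> usize {
    let mut i = 0;

    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::*;
        // SSE2 is part of the x86_64 baseline, so no runtime detection is needed.
        unsafe {
            let esc = _mm_set1_epi8(0x1B);
            let hi_nibble = _mm_set1_epi8(0xF0u8 as i8);
            let c1_tag = _mm_set1_epi8(0x90u8 as i8);
            while i + 16 <= data.len() {
                // Unaligned 16-byte load from the input buffer.
                let chunk = _mm_loadu_si128(data.as_ptr().add(i) as *const __m128i);
                let is_esc = _mm_cmpeq_epi8(chunk, esc);
                // (b & 0xF0) == 0x90 holds exactly for bytes 0x90..=0x9F.
                let is_c1 = _mm_cmpeq_epi8(_mm_and_si128(chunk, hi_nibble), c1_tag);
                // One bit per byte lane; bit 0 corresponds to the first byte.
                let mask = _mm_movemask_epi8(_mm_or_si128(is_esc, is_c1)) as u32;
                if mask != 0 {
                    return i + mask.trailing_zeros() as usize;
                }
                i += 16;
            }
        }
    }

    // Scalar loop: handles the tail, and the whole input on other targets.
    while i < data.len() {
        let b = data[i];
        if b == 0x1B || (0x90..=0x9F).contains(&b) {
            return i;
        }
        i += 1;
    }
    data.len()
}

/// Splits the input into a run that can be dispatched as text and the
/// remainder that starts at the first special byte.
pub fn split_printable(data: &[u8]) -> (&[u8], &[u8]) {
    data.split_at(find_first_special(data))
}
```

A caller would hand the first slice from `split_printable` straight to the text path, run the scalar state machine on the start of the second slice until the sequence ends, and then call `split_printable` again on whatever is left.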

This approach achieves throughput measured in gigabytes per second on Apple Silicon. At that rate the parser consumes data faster than the PTY can deliver it, so the PTY read syscall and kernel buffer copy are the actual bottleneck, not the parser itself.

Why it matters

Parsing is the first stage of the terminal pipeline and sets the throughput ceiling for everything downstream. Traditional byte-by-byte parsers create a bottleneck that no amount of GPU rendering can compensate for. Chau7's parser uses Rust SIMD intrinsics to test 16-32 bytes per vector comparison instead of one. The SIMD width depends on the hardware: 16 bytes with NEON or SSE2, 32 bytes with AVX2.

Frequently asked questions

Does SIMD parsing work on Intel Macs?

Yes. The Rust parser compiles with both ARM NEON (Apple Silicon) and SSE2/AVX2 (Intel) backends, selected at compile time. Both paths process 16+ bytes per iteration.
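
Compile-time backend selection of this kind is usually expressed with `cfg` attributes. The sketch below is an assumption about how it could look, not the parser's real source: the `BACKEND` constant and its (name, bytes per iteration) layout are hypothetical.

```rust
// Hypothetical constant: (backend name, bytes scanned per iteration).
// Exactly one definition survives compilation for any given target.

#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
pub const BACKEND: (&str, usize) = ("AVX2", 32);

#[cfg(all(target_arch = "x86_64", not(target_feature = "avx2")))]
pub const BACKEND: (&str, usize) = ("SSE2", 16);

#[cfg(target_arch = "aarch64")]
pub const BACKEND: (&str, usize) = ("NEON", 16);

// Fallback for any other architecture: plain byte-at-a-time scanning.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub const BACKEND: (&str, usize) = ("scalar", 1);
```

With this pattern the AVX2 branch is only chosen when the build enables that feature, for example through `-C target-feature=+avx2` or a suitable `-C target-cpu` setting; a default x86_64 build selects the SSE2 branch.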

How does SIMD handle multi-byte UTF-8 sequences?

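The SIMD scanner looks for escape introducers (0x1B, 0x90-0x9F), not character boundaries. ESC (0x1B) is an ASCII byte, and every byte of a multi-byte UTF-8 sequence is 0x80 or above, so ESC can never appear inside a multi-byte character and needs no special handling. The C1 range is different: 0x90-0x9F lies inside the UTF-8 continuation-byte range (0x80-0xBF), so a raw match there may simply be the tail of a multi-byte character and has to be resolved with UTF-8 context rather than treated as an escape outright. Full UTF-8 decoding happens in a subsequent stage.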
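
The byte ranges involved can be checked directly. The sketch below uses hypothetical helper names (`is_introducer`, `introducer_hits`) and applies the introducer test from this answer to the UTF-8 encoding of a string. It shows that ESC is only found where a real ESC character exists, while some continuation bytes do fall in 0x90-0x9F.

```rust
/// True if the byte is one of the escape introducers named in the text:
/// ESC (0x1B) or a raw byte in 0x90..=0x9F.
pub fn is_introducer(b: u8) -> bool {
    b == 0x1B || (0x90..=0x9F).contains(&b)
}

/// Returns (offset, byte) for every byte of the UTF-8 encoding of `s`
/// that a raw byte-level scan would flag.
pub fn introducer_hits(s: &str) -> Vec<(usize, u8)> {
    s.bytes()
        .enumerate()
        .filter(|&(_, b)| is_introducer(b))
        .collect()
}

/// True if every byte of a multi-byte character is 0x80 or above,
/// which is why ESC (0x1B) cannot occur inside one.
pub fn all_bytes_high(c: char) -> bool {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).bytes().all(|b| b >= 0x80)
}
```

For instance, U+00E9 encodes as 0xC3 0xA9 and produces no hits, whereas U+0418 encodes as 0xD0 0x98 and its second byte is flagged, so a scanner that matches raw 0x90-0x9F bytes has to let the scalar path confirm the match.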

What throughput does SIMD parsing achieve?

On Apple M-series chips, the parser sustains 4-6 GB/s on pure printable text and 1-2 GB/s on escape-heavy output like colored compiler diagnostics. In practice, PTY bandwidth (~800 MB/s) is the limiting factor.