C# port of Refterm with satisfying performance:
💻 https://github.com/warappa/refterm (csharp folder)
To test performance, use the new type <filepath> command, or build the C project and use splat2.exe <filepath>. For example type
On the Windows Terminal GitHub repository a discussion unfolded about the poor rendering performance of Windows Terminal.
Casey Muratori opened an issue describing the results of his terminal benchmark program TermBench.
TermBench writes sequences of colored text to the console (VT-100). He found that using colored text in the console made performance drop by a factor of 40. A modern CPU and RAM should be able to handle this without much overhead.
🤨 Diverging Views
The Microsoft developers claimed that rendering, not parsing, was the root issue, but Casey disagreed. With his game development experience he knew that rendering could not be the bottleneck, and he provided some more details.
After some back and forth, the Microsoft developers claimed that fast rendering with complex layouts (e.g. Arabic text) could not be made much more performant, unless someone did a doctoral research project.
This is where Casey exited the discussion saying:
"When we’re at the stage when something that can be implemented in a weekend is described as "a doctoral research project", and then I am accused of "impugning the reader" for describing something as simple that is extremely simple, we’re done."
Then he set out to do a high performance console renderer – in a weekend.
And he delivered.
⭐ Presenting: Refterm
Casey’s focus was on rendering performance, but he expanded the scope a bit to address some concerns about "shortcuts", so he implemented a simple but not fully fledged terminal example.
You can find the original source code on GitHub.
With Refterm you can run regular programs like ping, but at its core Refterm was only built to show the huge opportunities of improved rendering techniques. Expect some shortcomings in other areas.
In this video Casey showcases Refterm’s superior performance:
The output of a child process or direct user input is read continuously.
When a chunk of characters arrives, it gets stored in a scroll-back-buffer. Right after that the characters get parsed into lines, considering the current cursor.
The state of the cursor determines to which line they belong and if the characters are colored or blinking or have other special attributes.
After some time (~16ms) the lines get laid out, which means the currently visible lines get parsed into glyphs. A glyph is one or more characters drawn as "one", such as an emoji or Arabic text. These glyphs get matched against already drawn glyphs in a glyph-map texture. This texture caches all glyphs ever encountered by the program. If a glyph was not already drawn, a free spot on the texture is found and the glyph is drawn there with DirectWrite.
All glyphs get an index on the texture. This data is then fed to simple vertex and pixel shaders, which then know where to copy the glyph from and where to put it on the screen.
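The glyph lookup described above can be sketched roughly as follows. This is a minimal illustration, not Refterm's or the port's actual code: the class GlyphAtlasCache and its members are hypothetical names, the texture is assumed to be a simple grid of equally sized cells, and the cache never evicts anything.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: maps a glyph (one or more chars drawn as one unit)
// to a fixed cell in a grid-shaped glyph texture.
public sealed class GlyphAtlasCache
{
    private readonly Dictionary<string, int> _slotByGlyph = new Dictionary<string, int>();
    private readonly int _columns;
    private readonly int _capacity;

    public GlyphAtlasCache(int columns, int rows)
    {
        _columns = columns;
        _capacity = columns * rows;
    }

    // Returns the slot index for the glyph. wasCached is false when the caller
    // still has to rasterize the glyph into that slot (e.g. with DirectWrite).
    public int GetOrAddSlot(string glyph, out bool wasCached)
    {
        if (_slotByGlyph.TryGetValue(glyph, out int slot))
        {
            wasCached = true;
            return slot;
        }

        if (_slotByGlyph.Count >= _capacity)
            throw new InvalidOperationException("Atlas is full; a real cache would evict an entry here.");

        slot = _slotByGlyph.Count; // next free cell
        _slotByGlyph.Add(glyph, slot);
        wasCached = false;
        return slot;
    }

    // Converts a slot index into the cell coordinates a shader would sample from.
    public (int X, int Y) SlotToCell(int slot) => (slot % _columns, slot / _columns);
}
```

Only the small slot index has to travel to the GPU per screen cell; the expensive rasterization happens once per distinct glyph.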
- High Performance Rendering
Rendering 1 GB of output in ~0.7 s (1.314 GB/s) vs. Windows Terminal taking ~207.772 s (0.004 GB/s), which makes Refterm about 300x faster.
For rendering colored, blinking etc. text in console
- Complex Text Layout
Using built-in Uniscribe Windows API which also supports mixing RTL and LTR texts (mixing English with Arabic), complex layouts can be constructed.
Every glyph Refterm detects is drawn once to a texture. Vertex and pixel shaders then copy it from there to the right spot in the window’s client area.
- Colored Emojis
Only with Page-Up and Page-Down keys.
🔣 C# Port
I was amazed by the actual performance gain together with the features. After playing around a little bit, and seeing that it was not "that much" code, I started to wonder if something like this would be possible with C#.
The thought stuck with me, so I forked the repository and started the port.
This port is a rather direct port to see how the C code maps to C# code. Don’t expect super idiomatic C# OOP code.
At first I just copied over the functions one by one and converted the C-specific parts to C#. But soon I realized that C# and C are further apart than I suspected. Memory allocation and memory access (pointer arithmetic!) were extremely different from what I was used to in C#.
💾 Memory Safety
C# is memory-safe by default. This means that all access to memory is bounds-checked. The advanced pointer arithmetic found in this code would need to be translated into safe memory access.
Using DirectX was a steep learning curve, but at least SharpDX came to the rescue. SharpDX is a wrapper over the raw APIs in an idiomatic C# style.
Also: how do you debug shaders if something does not work? And you can bet I was sitting in front of an empty black screen due to some mix-ups for longer than I am willing to admit 🤦‍♂️.
Debugging shaders works great with ALT + F5, the Graphics Debugger. This lets you capture a frame, open it, select a pixel and see exactly the steps of how it was composed.
This also revealed why I was seeing a black screen: I had mixed up the source and destination textures, and I was sending the data to the shaders in the wrong data layout.
📰 Text Layouting
Special text layout with Windows’ Uniscribe was a new topic for me. This was required to handle complex texts like Arabic characters correctly.
Calling Windows APIs which require special data types was not so new to me, but also not familiar.
As soon as I was able to run the port I realized: it was slow. Not like Windows-Terminal-Slow (🤭), but still an order of magnitude slower compared to the original.
Well, I knew I had stripped out an SSE2-accelerated simple/complex text classification algorithm and replaced it with something more "straightforward". And I knew I was copying some byte arrays around. So this was my way to more performance: through uncharted land.
✂️ Spans and Memories
Span<T> and Memory<T> (introduced with .NET Core 2.1 and available in .NET 5) are deeply integrated, performant memory access concepts that reach all the way down to the runtime.
Before them, most lower-level APIs in .NET required some copying around if you just wanted a slice of an array instead of the whole thing: you had to copy the slice into a new array.
Unfortunately, allocations and performance are on opposing sides.
The idea behind Span<T> and Memory<T> is that you just need a "view" onto a slice of memory you already have.
Arrays in C# are easily convertible to Spans. Just call byteBuffer.AsSpan(offset, length) and you’re good to go.
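To make the "view" idea concrete, here is a minimal sketch. The class SpanSliceDemo and its methods are hypothetical names chosen for illustration; only AsSpan and Fill are actual .NET APIs.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class SpanSliceDemo
{
    // Sums the bytes in a window of the buffer without copying that window.
    public static int SumSlice(byte[] buffer, int offset, int length)
    {
        ReadOnlySpan<byte> view = buffer.AsSpan(offset, length);
        int sum = 0;
        foreach (byte b in view)
            sum += b;
        return sum;
    }

    // Writing through a span changes the underlying array: it is a view, not a copy.
    public static void FillSlice(byte[] buffer, int offset, int length, byte value)
    {
        buffer.AsSpan(offset, length).Fill(value);
    }
}
```

Neither method allocates a new array. The span only carries a reference into the existing buffer plus a length, and every access through it is still bounds-checked.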
⏩ Hardware Intrinsics
With Span<T> and Memory<T> in place, profiling still showed that my code was spending a lot of time enumerating the input buffer and checking it for complex text sections which require more sophisticated handling.
It turns out: comparing huge buffers byte-by-byte is slow. Who knew? 🤦‍♂️
In the original code this was handled in batches with some strange methods. In fact those were SSE2-accelerated methods, which process 16 bytes at a time.
.NET offers such hardware intrinsics too (they were added in .NET Core 3.0).
Sse2, with support from MemoryMarshal, was the .NET equivalent.
With those in place I was able to further improve the performance.
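As a rough sketch of the technique, the following scans a buffer 16 bytes at a time for the first non-ASCII byte. It is an illustration under assumptions, not the port's actual classification code: the class ComplexTextScanner and the "byte >= 0x80 means complex" rule are simplifications made up for this example.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical scanner, for illustration only.
public static class ComplexTextScanner
{
    // Returns the index of the first byte >= 0x80, or -1 if all bytes are ASCII.
    public static int IndexOfNonAscii(ReadOnlySpan<byte> data)
    {
        int i = 0;
        if (Sse2.IsSupported)
        {
            // Reinterpret the bytes as 16-byte vectors without copying.
            ReadOnlySpan<Vector128<byte>> blocks = MemoryMarshal.Cast<byte, Vector128<byte>>(data);
            for (int b = 0; b < blocks.Length; b++)
            {
                // MoveMask gathers the top bit of each of the 16 bytes into an int.
                int mask = Sse2.MoveMask(blocks[b]);
                if (mask != 0)
                    return b * 16 + BitOperations.TrailingZeroCount(mask);
            }
            i = blocks.Length * 16;
        }

        // Scalar fallback, also used for the tail that does not fill a whole vector.
        for (; i < data.Length; i++)
        {
            if (data[i] >= 0x80)
                return i;
        }
        return -1;
    }
}
```

The Sse2.IsSupported check lets the same code run on hardware without SSE2, where it simply falls back to the byte-by-byte loop.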
⚡ Port Performance
So, time to see where we stand.
Overall I’m very pleased with the performance.
On my machine (Ryzen 5800X), Refterm can output a 1.77GB file in 2.7 seconds.
My C# port takes 4.9 seconds, which is about 56% of Refterm's speed. For a managed port I consider this a win! 🎉
❓ Why The Performance Gap?
Where does the remaining gap originate from? A profiler-session revealed: the encoding.
About 30% of the overall time is spent converting the bytes into chars.
I’m no C expert, but C’s char is exactly 1 byte wide, whereas in .NET a char is a 2-byte UTF-16 code unit. So where the original can copy the child process’ output directly to its buffer, the port has to convert it to Unicode and pay the encoding tax.
If we subtract this performance penalty, the rest of the .NET overhead gets to just about 22%. Actually pretty nice (but still room for improvements 😉).
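The conversion step described above looks roughly like this. It is a minimal sketch that assumes the child process emits UTF-8 and that each chunk ends on a character boundary; the class EncodingTaxDemo and the method DecodeChunk are hypothetical names.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EncodingTaxDemo
{
    // Decodes UTF-8 bytes into a caller-provided char buffer and returns the
    // number of chars written. Throws if the destination is too small.
    public static int DecodeChunk(ReadOnlySpan<byte> utf8Bytes, Span<char> destination)
    {
        return Encoding.UTF8.GetChars(utf8Bytes, destination);
    }
}
```

Decoding into a reused buffer avoids allocations, but every byte still has to be inspected and widened, which is work the C version never does. For chunks that can end in the middle of a multi-byte sequence, a stateful System.Text.Decoder would be the safer choice.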
📦 Missing LRU Cache
The original version uses a least-recently-used hash cache for resolving glyphs. This may be an area which could yield a few extra percentage points.
📝 Code Style
As stated before, this port wants to make the C and C# code comparable, to see how one maps to the other. A more idiomatic C# code/refactoring could maybe lead to a better performance, but this is not guaranteed.
This was a very intense and exhausting, yet educational and satisfying experience.
I’ve learned a lot about how "standard" code impacts performance and how much it can be improved by hardware intrinsics, Span<T> and friends. I also learned how to translate pointer arithmetic to memory-safe code, and what to consider when calling native API methods. And finally how to use and debug DirectX.
It will take some time, and more pet projects, to fully absorb the lessons learned, and I’m looking forward to it!
Look out for further posts about the new concepts I learned.