Updated Sep 22 2022 :: by Alex Yumashev

Here I am, writing about performance optimization again. I'm a performance junkie. Constantly monitoring and investigating bottlenecks for our SaaS helpdesk webapp is my favorite thing ever. And I'm proud to say that with thousands of clients, even some really big ones, our app's backend process rarely goes higher than 5-6% CPU. Even during the peak load times which happen around 2-3pm UTC - the time when "Americas are waking up while Europe is still very active".

See, the biggest reason we obsess over performance are deploys. To spare the boring details, when one updates an ASP.NET Core app, the old process is shut down, and HTTP requests are being queued up until the new process is fully launched. This can result in tens of thousands of requests rushing in once the new process is up. And for 10-20 seconds the app becomes slow. Really slow. Now that is the biggest reason we pay so much attention to performance (specifically, startup performance) - so that our customers do not notice any hiccups when we deploy a new version.

And recently I found a perfect candidate to optimize

.NET's "MemoryCache"

TL;DR "MemoryCache" sucks

We cache a lot of stuff in memory, to spare a database call whenever we can. We use Redis as an "L2" cache provider, but for "L1" we use in-memory caching primitives offered by .NET. Basically Microsoft offers two caches: System.Runtime.Caching.MemoryCache and Microsoft.Extensions.Caching.MemoryCache.

They both suck and ruin performance.

  • MemoryCache is slow
  • It comes with bloat (for example, performance counters that you can't disable)
  • It uses string keys only (so you have to allocate a string every time you address an object)
  • Not generic (causes boxing/unboxing for primitive value-types like int or DateTime
  • Uses black magic and non-obvious heuristic to evict items under "memory pressure"
  • Non-atomic writes and reads that lead to Cache stampede
  • Did I mention it is slow?

Enter FastCache

What I needed is basically a "concurrent dictionary" with expiring items, that wil be fast, lightweight, atomic and generic.

So this weekend I did what every programmer would do, I wrote my own cache with blackjack and hookers. Please welcome: FastCache. No, honestly I did google for alternatives, but haven't found anything that really fits my needs. After all, in-memory cache efficiency is not something you worry about at the beginning of your software company's journey (which is where most companies are). I guess that explains why there's so little offerings in this space.

Making a fast cache

My two biggest challenges were:

  1. How to work with date/time efficiently when evicting expiring items?
  2. How to make writes atomic?

"DateTime.Now" is slow

Check this benchmark out. It does nothing but getting "current system time"

|          Method |     Mean |    Error |   StdDev |
|---------------- |---------:|---------:|---------:|
|    DateTime_Now | 93.55 ns | 1.516 ns | 0.083 ns |
| DateTime_UtcNow | 19.19 ns | 1.479 ns | 0.081 ns |

DateTime.UtcNow is a little bit faster because it does not have to account for time zones (another good reason to use UtcNow in your code!). But still not fast enough. If I'm going to check whether a cached item is expired I need something much faster. Check this out:

|          Method |      Mean |     Error |    StdDev |
|---------------- |----------:|----------:|----------:|
|    DateTime_Now | 92.530 ns | 3.2421 ns | 0.1777 ns |
| DateTime_UtcNow | 19.037 ns | 0.6913 ns | 0.0379 ns |
|    GetTickCount |  1.751 ns | 0.0245 ns | 0.0013 ns |

Yes. How about that. 1 nanosecond, baby. This is Environment.TickCount. It is limited to int data type though, which is 2.4 billion milliseconds. But hey, if I target .NET 6 there's the TickCount64 which is equally fast on 64-bit processors!

"Atomicness"

The second biggest challenge is "atomicness". When it comes to atomicness, the biggest trick is to check "item exists" and "item not expired" conditions in one go.

When an item was "found but is expired" - we need to treat this as "not found" and discard the item. For that we either need to use a lock so that the the three steps "exist? expired? remove!" are performed atomically. Otherwise another thread might jump in, and ADD a non-expired item with the same key while we're still evicting the old one. And we'll be removing a non-expired key that was just added.

Or - instead of using locks we can remove by key AND by value. So if another thread has just rushed in and shoved another item with the same key - that other item won't be removed.

Basically, instead of doing this

lock {
	exists?
	expired?
	remove by key!
}

We now do this

exists? (if yes the backing dictionary returns the value atomically)
expired?
remove by key AND value

Here's how we do it. If another thread chipped in while we were in the middle of checking if it's expired or not, and recorded a new value - we won't remove it.

Why lock is bad?

Locks suck because they add an extra 50ns to benchmark, so it becomes 110ns instead of 70ns which sucks. So - no locks then!

Overall speed test

Benchmarks under Windows

|               Method |      Mean |     Error |   StdDev |   Gen0 | Allocated |
|--------------------- |----------:|----------:|---------:|-------:|----------:|
|     FastCacheLookup  |  67.15 ns |  2.582 ns | 0.142 ns |      - |         - |
|    MemoryCacheLookup | 426.60 ns | 60.162 ns | 3.298 ns | 0.0200 |     128 B |
|   FastCacheAddRemove |  99.97 ns | 12.040 ns | 0.660 ns | 0.0254 |     160 B |
| MemoryCacheAddRemove | 710.70 ns | 32.415 ns | 1.777 ns | 0.0515 |     328 B |

Benchmarks under Linux (Ubuntu, Docker)

|               Method |        Mean |      Error |    StdDev |   Gen0 | Allocated |
|--------------------- |------------:|-----------:|----------:|-------:|----------:|
|      FastCacheLookup |    94.97 ns |   3.250 ns |  0.178 ns |      - |         - |
|    MemoryCacheLookup | 1,051.69 ns |  64.904 ns |  3.558 ns | 0.0191 |     128 B |
|   FastCacheAddRemove |   148.32 ns |  25.766 ns |  1.412 ns | 0.0253 |     160 B |
| MemoryCacheAddRemove | 1,120.75 ns | 767.666 ns | 42.078 ns | 0.0515 |     328 B |

Wow, a 10x performance boost on Linux. I love my job.

P.S. The title of this post is a hat tip to Sam Saffron.


'How I learned to stop worrying and wrote my own memory-cache' was written by Alex Yumashev
Alex Yumashev
Alex has founded Jitbit in 2005 and is a software engineer passionate about customer support.


Subscribe comments Tweet