We just submitted a PR to Microsoft to make .NET (and Jitbit Helpdesk) faster

by Alex Yumashev · Updated Mar 13 2024

There's nothing quite as satisfying as identifying a dangerous performance issue and squashing it like a bug. That's exactly what we did recently at Jitbit, and we're pretty excited about it. We even submitted a pull request to Microsoft to share our solution with the ASP.NET Core community, and - yay! - they approved the PR and even agreed to backport the fix to the current .NET release. Which means that the upcoming .NET 8.0.3 will be faster, more reliable and more secure.

The Discovery

It all started with a routine performance review of our application. We noticed that our CPU usage was higher than expected, especially during peak times. This was puzzling because we had already optimized our codebase extensively. After some investigation, we pinpointed the issue to the SignalR component of ASP.NET Core, specifically to the code that runs when clients disconnect.

The Diagnosis

SignalR is a fantastic WebSocket library for real-time web functionality, and it's a critical part of our helpdesk software. However, we found that whenever a connection drops, SignalR iterates over the entire concurrent dictionary of "groups" to remove the connection from all groups it belonged to.

Think of "groups" as "chat rooms". Say you're building an issue tracking system and multiple users are looking at the same issue. In SignalR these users are all grouped into a... "group". And whenever the issue updates, the server needs to trigger a UI refresh by sending a message to all the group members. So far so good.

But here's the thing. When one (just one) user disconnects, the current SignalR architecture iterates over the entire collection of all groups in the system, which is... slow. With tens of thousands of "groups" and a high rate of disconnects, this process was consuming a significant amount of CPU resources.

Under a typical load we had around 150k connections in 50k groups, averaging about 3 connections per group. A quick benchmark revealed that a single pass over the group dictionary took 28ms. With a disconnect rate of about 100 per second, this meant we needed 2800ms of CPU time per second. On a 4-core machine (4000ms of CPU time available per second) this accounts for 70% of CPU usage, and it doesn't take much to push it to 100% during bursts of disconnects.
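The arithmetic above is easy to re-run with your own numbers. Here is a minimal Python sketch that uses only the figures quoted in this post. The function name `cpu_share` is made up for this example, and the model assumes every disconnect costs one full scan.

```python
# Back-of-the-envelope check of the figures quoted above.
SCAN_MS = 28               # measured time for one pass over the group dictionary
DISCONNECTS_PER_SEC = 100  # observed disconnect rate
CORES = 4                  # cores on the machine


def cpu_share(scan_ms, disconnects_per_sec, cores):
    """Fraction of total CPU capacity spent scanning groups on disconnect."""
    busy_ms_per_sec = scan_ms * disconnects_per_sec   # CPU time needed each second
    capacity_ms_per_sec = 1000 * cores                # CPU time available each second
    return busy_ms_per_sec / capacity_ms_per_sec


share = cpu_share(SCAN_MS, DISCONNECTS_PER_SEC, CORES)
print(f"{SCAN_MS * DISCONNECTS_PER_SEC} ms of CPU per second, {share:.0%} of {CORES} cores")
```

Plug in your own scan time and disconnect rate to see how close you are to saturating the box.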

The Risks

The worst part is what happens when you reload the app during deploys: all clients disconnect at once. And since most apps work behind a reverse proxy, reloading the proxy has the same effect. Imagine editing your nginx.conf and reloading the server. For every client out of 100k there's an iteration over a 50k-long dictionary, which means BILLIONS AND BILLIONS of iterations (100k x 50k = 5 billion).
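To put a rough number on the mass-disconnect case, here is a small Python estimate built from the figures in this post. It assumes every one of the 100k disconnects pays the full 28ms scan and that the work spreads perfectly across 4 cores. Both are simplifications, so treat the result as an order-of-magnitude figure, not a measurement.

```python
# Rough cost of 100k clients disconnecting at once (an estimate, not a benchmark).
CLIENTS = 100_000   # connections dropped at the same time
GROUPS = 50_000     # entries in the group dictionary
SCAN_MS = 28        # one full scan, taken from the benchmark quoted earlier
CORES = 4

entries_visited = CLIENTS * GROUPS           # dictionary entries touched in total
cpu_seconds = CLIENTS * SCAN_MS / 1000       # total CPU time if every scan costs 28ms
wall_seconds = cpu_seconds / CORES           # best case: perfectly parallel on all cores

print(f"{entries_visited:,} entries visited")
print(f"{cpu_seconds:.0f}s of CPU time, at least {wall_seconds:.0f}s of wall-clock time")
```

Under these assumptions, that's several minutes of every core doing nothing but dictionary scans.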

The dangerous part is that this even introduces a DDoS vector. Go find a .NET app that uses websockets, open 100k connections and disconnect them all at once. Boom, the app is down. Rinse and repeat.

The Solution

We realized that the issue stemmed from the fact that, unlike its predecessor "ASP.NET SignalR", the "ASP.NET Core SignalR" didn't track group memberships per connection. Instead, it relied on scanning all groups in the application to find and remove disconnected connections. To address this, we took inspiration from our workaround and the RedisHubLifetimeManager implementation in SignalR.

We modified the DefaultHubLifetimeManager to make every connection track its own group memberships in a HashSet. This way, when a client disconnects, there's no need to iterate through a massive dictionary of all groups. We only need to remove the connection from the groups that are specific to it. This change not only streamlined the disconnection process but also significantly sped up application shutdown scenarios, such as recycling the application pool on IIS/Windows, gracefully ending the Kestrel process on Linux, or switching nginx-proxy between application instances in a blue-green deployment.
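The difference between the two approaches is easier to see in code. The sketch below is written in Python purely for illustration. It is not the SignalR source, and the class names `ScanningGroups` and `TrackingGroups` are made up for this example. It only counts how many groups each strategy has to touch when a single connection drops.

```python
from collections import defaultdict


class ScanningGroups:
    """Old behaviour as described above: no per-connection bookkeeping."""

    def __init__(self):
        self.groups = defaultdict(set)   # group name -> connection ids
        self.groups_visited = 0          # counter for the demo only

    def add(self, conn_id, group):
        self.groups[group].add(conn_id)

    def disconnect(self, conn_id):
        # Every group in the application is visited, member or not.
        for members in self.groups.values():
            self.groups_visited += 1
            members.discard(conn_id)


class TrackingGroups:
    """New behaviour: each connection remembers the groups it joined."""

    def __init__(self):
        self.groups = defaultdict(set)     # group name -> connection ids
        self.groups_of = defaultdict(set)  # connection id -> group names
        self.groups_visited = 0            # counter for the demo only

    def add(self, conn_id, group):
        self.groups[group].add(conn_id)
        self.groups_of[conn_id].add(group)

    def disconnect(self, conn_id):
        # Only the groups this connection actually belongs to are visited.
        for group in self.groups_of.pop(conn_id, ()):
            self.groups_visited += 1
            members = self.groups[group]
            members.discard(conn_id)
            if not members:
                del self.groups[group]     # drop groups that became empty


def populate(manager, n_groups=1000, per_group=3):
    """Fill a manager with n_groups groups of per_group connections each."""
    for g in range(n_groups):
        for c in range(per_group):
            manager.add(f"conn-{g}-{c}", f"group-{g}")


old, new = ScanningGroups(), TrackingGroups()
populate(old)
populate(new)
old.disconnect("conn-0-0")
new.disconnect("conn-0-0")
print(old.groups_visited, new.groups_visited)  # prints "1000 1"
```

The trade-off is a little extra memory per connection (one small set of group names) in exchange for a disconnect cost that depends on how many groups that connection joined, not on how many groups exist in the whole app.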

The Pull Request

We wrote an in-house workaround (a hacky "lifetime manager" class that tracks groups), but since we believe that this optimization can benefit the wider ASP.NET Core community, we've also submitted a pull request to Microsoft with our changes.

Instead of iterating over ALL the connection groups, which is slow and dangerous, on disconnect we only remove the connection from the groups it actually belongs to. Similar to RedisHubLifetimeManager, the new built-in DefaultHubLifetimeManager now makes every connection track its own groups.

The Impact

Implementing this solution in our Jitbit Helpdesk software has led to a noticeable improvement in performance and reliability. Our CPU usage during peak times has dropped significantly, ensuring a smoother experience for our clients and their customers. We're proud to have contributed to the ASP.NET Core community, bringing this optimization to a wider audience.

Conclusion

In the end, this journey was a reminder of the importance of continual performance monitoring and optimization. Trace your apps, find bottlenecks, fix them. It also highlighted the power of community and open-source collaboration. By sharing our solution, we hope to help other developers facing similar challenges, making the web a faster and more efficient place for everyone.

So, here's to faster apps, happier users, and a more robust ASP.NET Core! Cheers!