Cool Regex performance hacks I bumped into

by Alex Yumashev · Mar 5 2018

The first part of this post applies only if you run ASP.NET on Windows, you can simply skip to the regex hack if not interested, but I strongly recommend you read the whole post.

Debugging ASP.NET CPU spikes

About three weeks ago we started seeing unusual CPU load spikes on our customer service app production server (the cloud-hosted version). We do have a super useful tool (originally written by mister Sam Saffron who kindly allowed me to modify it a little bit and whom I'd like to thank here: sir, this little gem of yours continues to be a life saver in my little company) showing the exact call stacks that use up all the CPU ticks

But the problem is you have to "attach" this tool to the process at the exact moment the CPU load spikes up. And "catching" that moment can be... uhm... hard.

Even though we've set up AWS to text us when CPU load gets higher than 75% and stays there for 5 consecutive minutes (a must have for anyone who runs a business-critical SaaS app by the way) the time it takes to run to my laptop, remote-login to the sever, get to the command-line, attach to the overloaded process etc etc... CPU load gets back to normal. Darn.

Luckily my cofounder has written a short Windows .BAT script that does this automatically - if the CPU load gets higher than 75% the script finds the web-application process ID, launches the tool, attaches it to the PID and dumps the call-stack data to a text file (if you're going to use the script, change the "OurApplicationPool" to the actual user account that runs your app pool).

@Echo OFF

SET /A "MAXUSAGE=75"

For /F %%P in ('wmic cpu get loadpercentage ^| FINDSTR "[0-9]"') do (
echo %%P
    IF %%P GTR %MAXUSAGE% (
        for /f "tokens=1,2 delims= " %%A in ('tasklist /FI "USERNAME eq OurApplicationPool" /fo table /nh') do cpu-analyzer.exe %%B >> log_%date:~-4,4%%date:~-7,2%%date:~-10,2%.txt
    )
)

Then we have simply set up a job in Windows Task Scheduler that runs every 5 minutes (I know, but we needed a quick and dirty solution).

In our case - it was Regex

After some digging through the logs and call stacks it turned out most of the CPU ticks were spent on Regex.Search. Our helpdesk app uses A LOT of regular expressions both on the back and front ends. For example, when converting markdown and "BbCode" (used for internal storage) to "normal" HTML (used for UI and emails).

Long story short, after hours of blindly trying different regex patterns, testing and benchmarking here's what I found:

If you want to match a rule that goes "something is not preceded with an 'A'" don't ever use code like this: (^|[^a])something (the first group basically says "it's either the beginning of the text OR not the letter A") that most people find more readable compared to a "regex lookbehind", including myself. But it turns out regex "lookbehinds" and "lookaheads" are almost 3-4 times faster than patterns involving "beginning/end of string" especially when working with large stings. In our case replacing (^|[^a]) with an (?<!a) (which literally means "not preceded with") has made our code 3 times faster. I wrote a quick benchmark test that does 10000 search-replace iterations in both .NET and JavaScript regex engines (we use both) and the results were:

`(^\|[^a])something`	1346ms (avg)
`(?<!a)something`	389ms (avg)

Whoa, that's more than 3 times faster! Obviously, the longer the input text you have, the bigger the impact.

Cool. But can we improve it even further? Is there a way to make the "not preceded with an A" rule even faster?

Turns out there is! Unfortunately there are not many regex performance analysis on the net, so (again) I was on my own here... But fortunately I somehow remembered reading a cool HackerNews post couple of years ago that suggests a great way to optimize rules like "not preceded with" or "not surrounded with" or "not ends with" and similar. Basically the trick goes like this:

asomething|(something)

Nice, huh? If you want to match "something not preceded with an A" simply match what you DON"T WANT and then ignore overall matches, looking only at the "Group 1"! The benchmarks:

`(?<!a)something`	389ms (avg)
`asomething\|(something)`	16ms (avg)

Holy sh*t! That's almost a hundred times faster than my original code!