Updated Sep 30 2020 :: by Alex Yumashev

We have just spent the last 24 hours fixing a nasty bug in production.

But first, here's a tricky question for you: how long do you think this code will run?

for (int i = 0; i < 1000000; i++)
   Thread.Sleep(1);

It would seem that the answer is obvious - a million milliseconds, which is about 17 minutes.

WRONG.

The correct answer is - more than 4 hours.

Wait, what?! Bear with me ;-)

The nasty production bug story

We have a background worker on the backend - a huge while loop that runs through an array of millions of elements and does all kinds of in-memory checks and manipulations on them.

But we don’t want the CPU to get stuck at 100% during this loop and choke the server, do we? We want the server to stay alive and kicking.

So what does the average Joe programmer do? That's right - Joe adds a short pause into the cycle and goes home.

if there are any game developers reading this - I can already hear you giggling and reaching for the popcorn

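Here's the thing: "pauses" AKA Delay() AKA Sleep() in most operating systems are based on timers, and the default resolution of these timers is coarse. On Windows it is typically around 15ms (about 15.6ms, one tick of a 64Hz clock). You cannot count on a 1 millisecond pause - by default you will get closer to 15.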
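A quick way to check this on your own machine is to time a batch of 1ms sleeps. Here is a minimal sketch in Python (the function name average_sleep_ms and the sample count are made up for illustration). The number it prints depends on the OS and its current timer resolution, so no output is shown:

```python
import time

def average_sleep_ms(requested_ms=1.0, samples=200):
    """Return the average wall-clock duration, in ms, of a sleep of requested_ms."""
    start = time.perf_counter()
    for _ in range(samples):
        time.sleep(requested_ms / 1000.0)  # time.sleep takes seconds
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / samples

if __name__ == "__main__":
    print(f"asked for 1 ms, got {average_sleep_ms():.2f} ms on average")
```

If the result is far above 1, every "tiny" pause in a loop costs that much instead.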

So on a large array with a million elements, we get 15ms * 1000000 / 1000 / 60 / 60 = 4.17 hours - that's more than four hours.
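The same arithmetic as a tiny Python sketch, comparing what the code asks for (1ms per step) with what a 15ms timer delivers. The helper name loop_duration_hours is hypothetical, and the 15ms figure is the rough value used above, not a measurement:

```python
def loop_duration_hours(iterations, ms_per_sleep):
    """Total time spent sleeping, converted from milliseconds to hours."""
    return iterations * ms_per_sleep / 1000 / 60 / 60

expected = loop_duration_hours(1_000_000, 1)   # what the code asks for
actual = loop_duration_hours(1_000_000, 15)    # what a 15 ms timer delivers

print(f"expected: {expected * 60:.1f} minutes")  # expected: 16.7 minutes
print(f"actual:   {actual:.2f} hours")           # actual:   4.17 hours
```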

And coming back to work in the morning, what does our Joe-the-programmer see? His loop is still running from yesterday. The job that used to take 7 minutes (although it kept the CPU at 100%) now takes half a day to finish. In a "relaxed" mode though.

Everything is broken and customers start creating the "huh?!" tickets.

and the gamedevs reading this - laugh viciously

Because in the game development world this happens all the time - a loop like this is called a "busy loop" or a "tight loop" - and you learn not to rely on timers.

But how do we throttle properly?

1. Use multimedia timers or timers from OpenGL/DirectX (overkill)

2. Throttle every N-th step, not every step (inelegant and stinks)

3. Dump "pauses" completely and use the magic Thread.Yield call - which is a "polite" way to share resources and tell the operating system "hey, I'm still busy, but if you really need this, slow me down and let other threads do the work" (and this is the best way)

The 100% CPU load won't go away, but now it's not an issue - everything is fast and responsive.
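To make options 2 and 3 from the list above concrete, here is a rough sketch in Python. The function names, the every=1000 step and the 1ms pause are illustrative assumptions, not what runs in production, and time.sleep(0) is used as the stand-in for a yield, as the Python entry in this post's language list suggests:

```python
import time

def process_with_sparse_sleep(items, handle, every=1000, pause_s=0.001):
    """Option 2: pay for a real, timer-based pause only on every N-th item."""
    for index, item in enumerate(items, start=1):
        handle(item)
        if index % every == 0:
            time.sleep(pause_s)

def process_with_yield(items, handle):
    """Option 3: no timed pause at all, just offer the CPU after every item."""
    for item in items:
        handle(item)
        time.sleep(0)  # zero-length sleep, treated here as a yield

if __name__ == "__main__":
    data = list(range(10_000))
    for worker in (process_with_sparse_sleep, process_with_yield):
        results = []
        start = time.perf_counter()
        worker(data, results.append)
        print(f"{worker.__name__}: {time.perf_counter() - start:.3f} s")
```

With option 2 the loop still sleeps, just a thousand times less often. With option 3 it never waits on a timer, so the timer resolution stops mattering.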

Thread.Yield is available in many languages:

C#: Thread.Yield

C++: std::this_thread::yield

Win32: SwitchToThread

Java: Thread.yield

Go: runtime.Gosched (I think...)

Visual Basic: DoEvents (kidding! ...although not really)

Python: time.sleep(0) (on Windows it's time.sleep(0.0001) - don't ask me why... because Python...)

(by the way, it's not just Python - quite a few system libraries are smart enough to translate sleep(0) into a yield, including .NET, POSIX and WinAPI)

etc. Google your favorite language.
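For what it's worth, here is how such a call might be wrapped in Python. This is only a sketch: yield_cpu and busy_sum are made-up names. os.sched_yield exists in the standard library on Linux and most Unix builds but not on Windows, hence the fallback to the zero-length sleep from the list above. How much either call helps other Python threads depends on the interpreter and its own thread switching rules.

```python
import os
import time

def yield_cpu():
    """Offer the rest of this thread's time slice to other runnable threads."""
    if hasattr(os, "sched_yield"):  # available on Linux and most Unix builds
        os.sched_yield()
    else:                           # e.g. Windows: fall back to a zero-length sleep
        time.sleep(0)

def busy_sum(values):
    """A stand-in for the big in-memory loop: do the work, yield after each step."""
    total = 0
    for value in values:
        total += value
        yield_cpu()
    return total

if __name__ == "__main__":
    print(busy_sum(range(1000)))  # 499500
```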

The moral of the story, I guess, is that even trivial things get super tricky at scale. Things that were nice and simple when we had just launched our little SaaS get complicated once thousands of companies are using your stuff. But that's a nice problem to have, I guess.

And while we're at it, here's a little video of me from yesterday:


'How long do you think this code will run?' was written by Alex Yumashev
Alex founded Jitbit in 2005 and is a software engineer passionate about customer support.

