This post is for our hosted helpdesk customers only. I figured we owe you an explanation of what's been happening here over the last week.
I'm writing this at 4:00am CET after a sleepless night and several days of struggle, but it's OK, I'm probably the happiest man in central Europe right now. We have finally resolved the latest challenge we had: occasional connectivity issues between our cloud servers.
Last weekend we did a regular "stop-start" of the web-server. Since we are hosted on AWS (Amazon), this actually means that your server is restarted on a new physical machine (which is a good thing, by the way - you have to swap hosts from time to time to land on fresh hardware). But after a while we started seeing occasional errors: the web-server was having trouble connecting to our database servers.
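For those curious what a "stop-start" means in practice: for an EBS-backed instance it's a full stop followed by a start, not a reboot - on start, AWS typically places the instance on a different physical host. A rough sketch with a made-up instance ID:

```shell
# A "stop-start" (not a reboot!) of an EBS-backed EC2 instance.
# On start, AWS usually lands the instance on fresh physical hardware.
# The instance ID below is hypothetical.
aws ec2 stop-instances --instance-ids i-0aaaabbbbccccdddd
aws ec2 wait instance-stopped --instance-ids i-0aaaabbbbccccdddd
aws ec2 start-instances --instance-ids i-0aaaabbbbccccdddd
```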
The actual error messages varied from "weird" to "total nonsense". Like this:
An error occurred while communicating with the remote host. The error code is 0x80070006. ---> System.Runtime.InteropServices.COMException (0x80070006): The handle is invalid.
(0x80131904): A network-related or instance-specific error occurred: 0 - An operation was attempted on something that is not a socket.)
What?! "Something is not a socket." I can't... Even... I mean, come on you stupid tin box. Give me something we can work with.
We even posted a question on Stack Overflow, which is still unanswered, by the way.
Meanwhile, our customers started seeing weird things. Some of them reported short network outages, some saw delays in importing email messages into the helpdesk, and some complained that the app was sluggish and unresponsive. While Vlad was handling the customer-success side, calming our clients down, Max, Serge and I were scratching our heads trying to find a solution.
More restarts didn't help. Nor did upgrading Amazon "EC2 management" services to the latest version.
I even tried disabling TCP offloading, which is something Amazon recommends when you see network issues. No effect.
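On a Windows box this boils down to a few `netsh` commands in an elevated prompt. A sketch - the exact set of features worth disabling depends on the driver, so double-check against Amazon's current guidance before copying:

```shell
:: Disable common TCP offloading features globally (Windows, elevated prompt).
:: These are the usual suspects when PV network drivers misbehave on EC2.
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=disabled
netsh int ip set global taskoffload=disabled
```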
The issue was clearly network-related. And it was clearly software, not hardware. Updating the "Citrix PV" drivers to the latest version didn't help (no surprise, this "Citrix" thing is unsupported these days anyway)...
I finally decided we needed to upgrade to AWS's own PV network drivers. But the docs said that a server might go down for up to 35 minutes (!) during the upgrade process! We can't afford that. Our customers can't afford that. We either had to wait until the weekend, or come up with something else.
That's when another bunch of error notifications landed in my inbox. About 800, to be precise. In fact, here's a screenshot of my Gmail inbox:
F*ck it. We need to do something RIGHT NOW. And here's where the beauty of running in the cloud came shining through.
The idea was right in front of us the whole time. Too busy tuning all these stupid TCP parameters and buffers, I was too dumb to see it! It's a freaking cloud after all - we just need to spawn a duplicate server, an exact copy of the one having problems, do what we need, and then simply switch the IP address over to the new server. No one should even notice.
After 4 hours of launching new AWS instances, creating disk snapshots, attaching them to the new server, and reconfiguring firewalls... Then upgrading the actual drivers, reinstalling the TCP/IP stack (just in case), resetting it with `netsh winsock reset`... And finally switching the IP address to the new server. Voila. It works, it even works faster than ever. And no errors.
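For the curious, the whole dance maps onto a handful of AWS CLI calls. A rough sketch with made-up IDs (our real setup involves more volumes and firewall rules):

```shell
# Snapshot the data volume of the ailing server (all IDs below are hypothetical).
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "helpdesk data, pre-migration"

# Create a fresh volume from that snapshot in the new instance's availability zone...
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
    --availability-zone eu-west-1a

# ...and attach it to the replacement instance.
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 \
    --instance-id i-0aaaabbbbccccdddd --device xvdf

# Finally, re-point the Elastic IP at the new instance. Clients keep using
# the same address and never notice the switch.
aws ec2 associate-address --allocation-id eipalloc-0123456789abcdef0 \
    --instance-id i-0aaaabbbbccccdddd
```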
And you guys haven't even noticed.
I would like to apologize to you folks that we didn't resolve this sooner. Sorry.
Also, we're setting up a bunch of automation scripts that spawn new servers on their own, keep network drivers up to date automatically, etc. etc. We'll do our best to make sure you never see this again.
Over and out. I'm gonna have a 100-hour sleep now. Don't worry, an SMS will wake me up if something goes wrong.
P.S. None of the incoming tickets/emails/replies have been lost.
by Alex. CEO, founder