Our hosted help desk app was down for more than half an hour today. The outage affected all the services including the email-pickup modules and the web-app. During the outage anyone accessing the helpdesk app was getting a 503 "Service unavailable" error.
We screwed up. Sorry about that.
So what went wrong? Lots of things, actually. I should probably stop bragging about how cool our setup is.
1. Report-scheduling engine crash. Like any help desk app, we have a reports module. Some of these reports can be scheduled to be sent to our users via email every month. This email-sending module crashed.
2. No isolation from the web-app. The email-sending thread was running in the same context as the web-app (a design error), which caused the web-app's Windows process (w3wp.exe) to become sluggish and eventually crash.
3. App-pool watchdog fault. The service that watches our "worker processes" restarted the web-app a couple of times but eventually shut down (subject of further investigation).
4. (The stupid one.) The monitoring service ran out of SMS-messaging credits. It'd be funny if it weren't so sad. Most of the downtime was caused by us having no idea what was happening.
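Item 4 is the kind of failure an alerting fallback chain is meant to catch. Here is a minimal sketch of the idea in Python, not our actual monitoring code: the `send_alert` function and the channel names are hypothetical, and each channel is assumed to be a plain callable that raises when delivery fails (for example, because the account has no credits left).

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: a channel is a (name, deliver) pair, where deliver()
# takes the alert text and raises an exception if it cannot send it.
Channel = Tuple[str, Callable[[str], None]]


def send_alert(message: str, channels: Iterable[Channel]) -> Optional[str]:
    """Try each channel in order; return the name of the first that works."""
    for name, deliver in channels:
        try:
            deliver(message)
            return name
        except Exception as exc:  # any failure means "try the next channel"
            print(f"alert channel {name!r} failed: {exc}")
    return None  # every channel failed; the caller should treat this as critical
```

With a primary channel that raises and a backup that works, `send_alert` returns the backup's name, so a single provider running dry no longer means silence.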
God bless Amazon. Before we even started investigating and fixing anything, we spawned a new instance of the server, pointed it to the helpdesk database engine, swapped the IP-address mapping - and everything was back up. What can I say, I love the cloud.
The email-sending module has been (a) fixed to prevent future crashes, with proper network error handling, and (b) isolated from the web-app processes. We have also fixed the watchdog services and set up two more backup SMS-alerting services for critical helpdesk modules.
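To give a rough idea of what "proper network error handling" can look like, here is a small Python sketch. It is an illustration under assumptions, not the code we shipped: `send_scheduled_reports`, the `send` callable and the retry settings are all hypothetical. The point is that a network failure while sending one report gets logged and retried, and at worst that one report is skipped, instead of an unhandled exception taking the whole process down.

```python
import logging
import time
from typing import Callable, Iterable, List

log = logging.getLogger("report_mailer")  # hypothetical logger name


def send_scheduled_reports(
    reports: Iterable[str],
    send: Callable[[str], None],  # hypothetical: sends one report, raises on failure
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> List[str]:
    """Send every report; return the ones that still failed after retrying."""
    failed = []
    for report in reports:
        for attempt in range(1, attempts + 1):
            try:
                send(report)
                break  # sent, move on to the next report
            except OSError as exc:
                # OSError covers socket errors, timeouts and smtplib.SMTPException.
                log.warning("sending %s failed (attempt %d/%d): %s",
                            report, attempt, attempts, exc)
                if attempt < attempts:
                    time.sleep(delay_seconds)
        else:
            # The inner loop never hit `break`: give up on this report only.
            failed.append(report)
    return failed
```

Isolation is the other half. A loop like this would run in its own process or scheduled job rather than on a thread inside the web-app worker, so that even a bug it does not catch cannot drag w3wp.exe down with it.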
This was totally our bad. If you were affected by the downtime, please contact our support and we'll find a way to settle it (service extension, discounts, etc.). Sorry we let you down, guys.
PS: No data was lost, no emails were missed, no worries.