Today's hardware failure

by Alex Yumashev · Dec 18 2017

This is a short post explaining what happened today. We're still very busy repairing our helpdesk servers, but I think we owe everyone an explanation and an apology.

Long story short: the database cluster for our SaaS Helpdesk app physically crashed today.

Good news: since we run in the cloud using Amazon's EBS drives (basically, network disks), no data was lost. We switched the data disks to new servers and were back up and running in approximately 30-40 minutes (it varied from region to region).
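
For readers wondering what "switching the data disks" involves: an EBS volume can be detached from a failed instance and attached to a replacement in the same Availability Zone. Below is a minimal sketch of that idea using boto3's EC2 client. It is an illustration only, not the actual recovery procedure used here; the function name, the IDs and the device name are all hypothetical placeholders.

```python
def move_data_volume(ec2, volume_id, failed_instance_id, new_instance_id,
                     device="/dev/sdf"):
    """Move an EBS volume from a failed instance to a replacement one.

    `ec2` is a boto3 EC2 client (or any object with the same methods).
    All IDs and the device name are placeholders for illustration.
    """
    # Force the detach: a crashed host cannot unmount the disk cleanly.
    ec2.detach_volume(VolumeId=volume_id,
                      InstanceId=failed_instance_id,
                      Force=True)
    # Block until EC2 reports the volume as "available" (fully detached).
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the same volume, with its data intact, to the new server.
    ec2.attach_volume(VolumeId=volume_id,
                      InstanceId=new_instance_id,
                      Device=device)
    # Block until the volume is "in-use", i.e. attached to the new instance.
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
```

With real credentials this would be called as `move_data_volume(boto3.client("ec2"), "vol-...", "i-...", "i-...")`. The volume still has to be mounted inside the new server afterwards, which this sketch does not cover.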

Bad news: the search functionality is very sluggish and, basically, non-functional. To restore it, we will have to rebuild the full-text search indexes later today, and we will need to take the farm down for about 20 minutes to do that.

We kept a snapshot of the burned servers' root disks so we can examine the logs and investigate what exactly happened (and how we can prevent this in the future), but that's not our priority right now. We will also update our emergency-recovery procedure so we can get back online faster next time (hopefully, there won't be any "next time").

MORE UPDATES COMING: I will keep updating this post with more info as we go. Also, please follow our "status Twitter" account, @jitbithelpdesk, for live updates.

UPDATE: as we have already notified you inside the admin panel, the app will be taken down today at 7 pm EST for 15-20 minutes to repair the search module.

UPDATE: we spent an hour repairing the search and rebuilding the index. The servers are back up and running, and the search is faster than ever. Tomorrow we'll start examining the original logs of the dead server to understand what happened.