Today we had our longest downtime in seven years. Jitbit Helpdesk and all other services, including our website, were down for a full hour.
Around noon EST, all our services stopped responding. Our team gets text messages on their phones when something is wrong, so we were immediately on the case.
We tried to log in to our server remotely but failed. After that, we tried to reboot the server using the Amazon AWS console – no luck. We had no other choice but to "force stop" the server.
The server went to the "stopping" state and got stuck there. After ten minutes, we decided to back up the disks and spin up an identical replacement server (one of the advantages of cloud hosting). Once the new server was finally up, we attached the hard disks from the old server to it, and that was it.
We are still investigating, but we strongly suspect a hardware failure: otherwise, a 100% identical server would not have started either. The fact that our server was in the "stopping" state for a full hour and there were no errors in the software logs also indicates that something was wrong on Amazon's side, namely the particular underlying hardware that our servers were running on.
We are definitely not trying to shift the blame onto them. Amazon has been a great and very reliable host for us for the last three years. But we couldn't find any evidence that the outage was caused by our software, by Windows or by an external attack. We could be wrong, and we are contacting Amazon right now.
I can't promise that downtimes won't happen in the future, because they will. But we've spent a lot of time making sure that we won't lose your data when they happen. We have a pretty solid backup plan, and we can restore everything quickly. We have also developed an improved procedure to restore everything much, much faster than we did today.
We have also put a lot of effort into a system that prevents us from losing your support emails during downtime. Emails can be lost only if multiple independent things fail at once, and that's highly unlikely.
We are extremely sorry that this has happened. We know that you rely on our app, and we've let you down today. To our new customers who have been with us for only a month and have already seen a major downtime like this: I want to assure you that this is not normal. We won't let you down again.
You all have been extremely supportive today, and we are really thankful to have you as customers. Thanks a lot for your understanding.
by Max, co-founder