Our SaaS helpdesk was a little sluggish this Sunday, and I thought I owed our customers an explanation of what was going on. So here's a quick & dirty raw post with the technical details.
1. Many customers have asked that our search look through ticket replies by default, not just the ticket subject + body. We have that functionality, but the checkbox is disabled by default, because searching within replies is tough. The "comments" database table (that's what we historically call replies here - "comments") is now 100 GB. We have a separate full-text index on that table that lets us search for text in it.
2. Since we don't want to search through all the "comments" - just the human-originated ones - we need to index only the "non-system" entries in that table. That's why we created a so-called "materialized view" (AKA an "indexed view") of that table that filters the system entries out. That view is "only" 40 gigabytes. Still huge, but we can work with that.
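For the curious, here's roughly what such an indexed view looks like in SQL Server. This is a hypothetical sketch - the table, column, and index names are my assumptions, not our actual schema:

```sql
-- Sketch of an indexed ("materialized") view keeping only human-written
-- comments. Names (Comments, IsSystem, etc.) are illustrative assumptions.
CREATE VIEW dbo.vwHumanComments
WITH SCHEMABINDING
AS
    SELECT CommentId, TicketId, Body
    FROM dbo.Comments
    WHERE IsSystem = 0;  -- drop the system-generated entries
GO

-- The unique clustered index is what actually materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vwHumanComments
    ON dbo.vwHumanComments (CommentId);
GO

-- Full-text index on top of the materialized view
-- (assumes a default full-text catalog already exists)
CREATE FULLTEXT INDEX ON dbo.vwHumanComments (Body)
    KEY INDEX IX_vwHumanComments;
```

The `WITH SCHEMABINDING` bit is required before SQL Server will let you put a clustered index on a view.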
3. Here's the thing: mixing full-text search criteria with relational database queries is slow. Here's what I mean: if you tell the server "find me entries that contain 'shipment'" - that works really fast. Lightspeed, actually. But when you say "gimme entries where the text contains 'shipment' AND the 'type' column equals 4" - that's where it gets really, really, REALLY slow. It causes so-called "key lookups", which are guaranteed to give your server a heart attack on a 100 GB dataset.
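In T-SQL terms, the two query shapes look like this (again, column names are assumed for illustration):

```sql
-- Fast: a pure full-text predicate, answered by the full-text index alone
SELECT CommentId
FROM dbo.Comments
WHERE CONTAINS(Body, 'shipment');

-- Slow: the extra relational filter forces a key lookup into the base
-- table for every full-text match - deadly on a 100 GB table
SELECT CommentId
FROM dbo.Comments
WHERE CONTAINS(Body, 'shipment') AND TypeId = 4;
```

The full-text engine hands back a list of matching keys, and then the regular engine has to go fetch each row just to check `TypeId = 4`.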
4. One way around this is to include some metadata right in the text that's being indexed. So your entry "Sarah needs shipment asap" becomes "typeId-4 Sarah needs shipment asap". It's ugly, but it works just fine.
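With the metadata baked into the text, the filter becomes just another full-text term, so the whole query stays inside the full-text index. A sketch (note: how "typeId-4" gets tokenized depends on the word breaker - the hyphen typically splits it, so the phrase form is used here):

```sql
-- Both conditions are now full-text predicates - no key lookups
SELECT CommentId
FROM dbo.Comments
WHERE CONTAINS(Body, '"typeId-4" AND shipment');
```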
5. But to add this data at our scale we would have to re-create our "view", re-create the clustered index on it (that's what makes it "materialized"), and then re-build the full-text index on that materialized view.
6. That's what we tried to do on Sunday. But it all got stuck (100 GB, oh dear), so after an hour of our servers being overwhelmed by these operations we decided to abort it all.
7. But you can't just abort stuff in a database cluster. The database engine has to "roll back" every tiny change it has made, otherwise the database becomes corrupt. So it took another 40 minutes to roll everything back, while our team was speeding up our disk drives at the same time (that's the beauty of running in the cloud - we were upgrading our AWS storage to faster disks on the fly, how cool is that?).
8. OK, everything was aborted, and here's what we decided to do, since we only had hours before Australia and NZ woke up: we re-indexed replies for the last year only, and full-text search within those replies now works faster than ever. The "last year" bit is just temporary, of course.
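One non-obvious wrinkle with the "last year only" approach: an indexed view can't call `GETDATE()` (indexed views must be deterministic), so the cutoff has to be a fixed literal you bump later. A hypothetical sketch, with assumed names and an arbitrary cutoff date:

```sql
-- Temporary "recent replies only" indexed view. The date literal is a
-- placeholder - indexed views can't use GETDATE(), so the cutoff is fixed.
CREATE VIEW dbo.vwRecentComments
WITH SCHEMABINDING
AS
    SELECT CommentId, TicketId, Body
    FROM dbo.Comments
    WHERE IsSystem = 0
      AND CreatedDate >= '20240101';  -- roughly "one year back"
GO
```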
But that's not all.
See, weekends and holidays are the busiest time for our devops team, because - yay - an outage window! The database changes were just the tip of the iceberg. We also did this:
PS. Unfortunately, that last change is pending a reboot, because we need to update the AWS NVMe drivers. But don't worry, that's OK because: 1) everything is fully operational anyway, 2) we'll wait until most of our customers go to sleep, and 3) it seems it's Memorial Day in the US today (and Spring bank holiday in the UK), which makes it easier.
Alex founded Jitbit in 2005 and is a software engineer passionate about customer support.