How we send 22000 emails every hour

by Alex Yumashev · Updated Oct 2 2019

Our help desk software sends and receives about 22,000 (22 thousand) emails per hour.

That's more than half a million emails every day.

During US daytime it peaks at 25-30 emails per second and goes down to almost zero on a late Sunday evening - but only for 1-2 hours because Australians.

Here's a cute completely irrelevant story, feel free to skip. The phrase "Because Australians!" has actually become an inside joke/catchphrase in our company. See, Aussies and Kiwis (New Zealanders) wake up before everyone else (even before Japan). Basically they are the reason SaaS businesses don't have an outage/maintenance window. And a couple of times when we were about to put a server down for maintenance on a late Sunday evening (EU time), someone would shout "Wait! Don't do that!" - "What... Why?" - "Because Australians!"

So after 4 or 5 times this phrase is now used as the ultimate "42" answer to why everything has to be up and running 24/7. I even have a tiny map of Australia in my office. Neither I nor anyone on the team has ever been to Australia or New Zealand, but hey, we're always thinking about you, mates [wink]

Anyway, back to our enormous amounts of email. Half a million per day.
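
For the curious, here's the back-of-the-envelope math behind those numbers as a tiny Python sketch. The inputs are the figures quoted above; the variable names are just made up for illustration.

```python
# Back-of-the-envelope throughput math using the figures quoted above.
EMAILS_PER_HOUR = 22_000      # average, from the text
PEAK_PER_SECOND = (25, 30)    # US-daytime peak range, from the text

emails_per_day = EMAILS_PER_HOUR * 24
avg_per_second = EMAILS_PER_HOUR / 3600

# How much "hotter" the peak runs compared to the hourly average.
peak_to_avg = tuple(p / avg_per_second for p in PEAK_PER_SECOND)

print(f"per day:        {emails_per_day:,}")
print(f"avg per second: {avg_per_second:.1f}")
print(f"peak / average: {peak_to_avg[0]:.1f}x to {peak_to_avg[1]:.1f}x")
```

So the average works out to roughly 6 emails per second, and the daytime peak is about 4-5 times that.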

We looked at paid PaaS providers like Mailgun, SendGrid, Amazon SES... And all of them would cost several thousand dollars a month at this scale. And since we're not a sexy funded startup from California, but a boring customer-funded business, we can't afford this bill.

The boring solution

That is why we finally settled on a... tiny AWS instance running Postfix under Ubuntu. It has 2 GB of memory, costs around $9/month, and the CPU load rarely goes beyond 15%.

But we had to learn everything about email. Every damn bit.

SMTP, relays, MX-records, SPF verification, DKIM signatures, spam filtering, reverse-DNS, blacklisting, "Internet Message Format", throttling, MIME-encoding, email-antiviruses, bounces, headers, log-parsing, rate-limiting, greylisting, RFC standards (and that no one follows them)... And of course - about IP address reputation (more on that later).

Our Postfix installation handles both inbound and outbound emails.

Outbound email

Our ticketing app sends an outbound message whenever "something happens" - like when a new support ticket arrives, it notifies helpdesk agents. Or notifies a user whenever an agent has responded to their ticket. Etc.

Our setup is pretty straightforward - the Postfix machine simply acts as an SMTP server that listens for incoming connections from the web app on port 587 (black arrows on the diagram). Connections are allowed from within our internal VPC network only ("Virtual Private Cloud") and are additionally protected by a username/password, just in case. After receiving an email, Postfix forwards it "into the wild" over a TLS-encrypted connection.
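
To make the app-to-Postfix hand-off a bit more concrete, here's a minimal Python sketch of what the sending side could look like. This is an illustration, not our production code: the host name, addresses, and function names are all made up, and it assumes the relay offers STARTTLS on the submission port.

```python
import smtplib
import ssl
from email.message import EmailMessage

# Hypothetical values - not real settings.
RELAY_HOST = "postfix.internal.example"
RELAY_PORT = 587  # submission port, as described above


def build_notification(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text notification email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_relay(msg: EmailMessage, username: str, password: str) -> None:
    """Hand the message to the relay over an authenticated connection."""
    with smtplib.SMTP(RELAY_HOST, RELAY_PORT, timeout=10) as smtp:
        # Assumption: the relay advertises STARTTLS; drop this line if it doesn't.
        smtp.starttls(context=ssl.create_default_context())
        smtp.login(username, password)
        smtp.send_message(msg)


# Build a message; send_via_relay() is not called here because the host is fictional.
msg = build_notification(
    "helpdesk@example.com",
    "agent@example.com",
    "New ticket #123",
    "A new support ticket has arrived.",
)
print(msg["Subject"])
```

The app only ever talks to the one relay; retries, queueing and delivery to the outside world are Postfix's job.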

Inbound email

In addition to sending email out, we handle a lot of inbound email. All our customers are provided with a xxx@xxx.jitbit.com mailbox, with an option to add their own mailboxes via IMAP/POP/EWS (yeah, more acronyms). Our Postfix machine is listed in a wildcard MX record for our domain and accepts all email sent to *@*.jitbit.com on port 25.

The inbound emails are then routed to a local PHP script (yes, THAT unsexy) because that's the only way Postfix can "execute something" when an email arrives. The PHP script then does some basic parsing and feeds the messages to the web app via an internal HTTP/REST endpoint (blue arrows on the diagram).
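
Here's a rough Python stand-in for the "basic parsing" step, just to show the idea (ours is PHP, and this is not its actual code). The sample message, the field names in the payload and the function name are assumptions. A script run from a Postfix pipe transport would read the raw message from stdin instead of using a hard-coded sample.

```python
import json
from email import message_from_bytes, policy

# Made-up sample; a piped script would use sys.stdin.buffer.read() instead.
SAMPLE = (
    b"From: Jane <jane@example.com>\r\n"
    b"To: support@acme.example.com\r\n"
    b"Subject: Printer on fire\r\n"
    b"\r\n"
    b"Please help.\r\n"
)


def parse_inbound(raw: bytes) -> dict:
    """Extract the few fields a ticketing app would need from a raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer the plain-text part, fall back to HTML if that's all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg["From"] or ""),
        "to": str(msg["To"] or ""),
        "subject": str(msg["Subject"] or ""),
        "body": body_part.get_content().strip() if body_part else "",
    }


payload = parse_inbound(SAMPLE)
# A real script would POST this JSON to the app's internal endpoint;
# here we just print it.
print(json.dumps(payload, indent=2))
```

Keeping the script this dumb is deliberate: all the real logic lives in the web app, and the script is just a pipe with a bit of parsing.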

But before that the inbound email is scanned for viruses (ClamAV), spam-filtered (Rspamd), checked against blacklists (Spamhaus), SPF-verified, FQDN-verified, throttled (if abusive) etc. etc. etc.
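
If you've never seen how a DNS blacklist check works under the hood, here's a small Python sketch of the mechanics. It only shows the lookup convention (reverse the IPv4 octets, append the blacklist zone, ask DNS) and is not how our server is configured. The function names are made up, and the IP is from a documentation-only range.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a DNS blacklist."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone has an A record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False  # no such record (or the lookup itself failed)


# Only the name is built here; is_listed() would need network access.
print(dnsbl_query_name("192.0.2.10", "zen.spamhaus.org"))
```

A real check should also look at which address comes back, since blacklists use different return codes for different kinds of listings and for lookup errors.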

Monitoring

We use:

Postfix's built-in logging system and cat, zcat, grep to navigate the logs
qshape for reviewing the mail queue and bottleneck analysis
AWS's built-in alerts whenever the server slows down, like when the CPU load goes higher than X% for more than a minute or when the disk usage jumps abnormally (I think it's called "CloudWatch Alarms" or something, Amazon is terrible with names) - which usually means someone's flooding or spamming
An in-house developed watchdog that simply sends a dummy email to a disposable Mailinator mailbox every 5 minutes to check that things still work. If not - it sends a message to admins via SMS, Telegram, WhatsApp and Slack saying (quote) "email is fucking down". It also tries to "fix" things on its own by restarting some daemons, "recycling" web-application pools on Windows-powered machines etc. - until a human shows up.
Another in-house developed watchdog that periodically queries the database for things like "incoming tickets rate per user" and "email rate per company account" etc. You can't even begin to imagine the amounts of inbound and outbound spam we're dealing with (yes, outbound too - people register trial accounts and find creative ways to abuse them).
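
To give a feel for the second watchdog, here's a minimal Python sketch of an "email rate per company account" check. Everything in it is hypothetical: the table name, the columns, the threshold and the in-memory SQLite database are stand-ins for illustration, not our real schema.

```python
import sqlite3

# Hypothetical threshold, purely for illustration.
MAX_EMAILS_PER_HOUR = 100

db = sqlite3.connect(":memory:")
# Hypothetical table: one row per sent email, sent_at is a unix timestamp.
db.execute("CREATE TABLE outbound_emails (account TEXT, sent_at INTEGER)")

now = 1_700_000_000  # fixed "current time" so the example is reproducible
rows = [("acme", now - i) for i in range(150)]          # 150 emails in a few minutes
rows += [("globex", now - i * 60) for i in range(30)]   # 30 emails over half an hour
db.executemany("INSERT INTO outbound_emails VALUES (?, ?)", rows)


def accounts_over_limit(conn, now_ts, limit):
    """Return (account, count) pairs that sent more than `limit` emails in the past hour."""
    return conn.execute(
        """
        SELECT account, COUNT(*)
        FROM outbound_emails
        WHERE sent_at > ?
        GROUP BY account
        HAVING COUNT(*) > ?
        ORDER BY COUNT(*) DESC
        """,
        (now_ts - 3600, limit),
    ).fetchall()


suspects = accounts_over_limit(db, now, MAX_EMAILS_PER_HOUR)
print(suspects)
```

What happens with the suspects (throttle, suspend, alert a human) is a separate decision; the query just surfaces them.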

IP reputation

Now this is the tricky part. You can't just launch a new server and start sending hundreds of thousands of messages out of the blue - your IP will be blocked. You have to build up reputation gradually, otherwise Gmail, Outlook.com, Hotmail, Yahoo (still exists, yeah), Apple, AOL, ProtonMail etc. will simply ban your IP for good.

This is not documented anywhere, but it's a de facto standard that email admins don't tell anyone about. Gradually build up your volume. Start by sending internal error alerts to your team. Then move some of your email marketing campaigns. Then some system messages. And so on.
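
"Gradually" is vague on purpose, because there's no official formula. Purely as an illustration, here's a Python sketch that generates a warm-up plan with daily caps growing by a fixed factor. The starting volume and the 1.5x growth factor are numbers I made up for the example, not a recommendation from any mailbox provider.

```python
def warmup_schedule(start_per_day: int, target_per_day: int, growth: float = 1.5) -> list[int]:
    """Daily send caps that grow geometrically until the target volume is reached."""
    if start_per_day <= 0 or growth <= 1:
        raise ValueError("start must be positive and growth must be > 1")
    caps = []
    cap = start_per_day
    while cap < target_per_day:
        caps.append(cap)
        # Always move forward by at least one, even for tiny starting values.
        cap = max(cap + 1, int(cap * growth))
    caps.append(target_per_day)
    return caps


# Hypothetical plan: from 500 emails/day up to roughly half a million.
schedule = warmup_schedule(start_per_day=500, target_per_day=528_000)
print(f"{len(schedule)} days to reach full volume")
print(schedule[:5], "...", schedule[-1])
```

In practice you'd also watch bounce and complaint rates along the way and slow down if they climb, rather than follow the plan blindly.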

It is also a good idea to keep more than one IP address "warmed up" like that, just in case your server gets banned. Also, keep an eye on your server via Gmail's Postmaster Tools to see Google's spam stats on you. Yahoo and Hotmail have similar services too, I believe.

You will probably have to contact your hosting provider too (AWS, Azure, DigitalOcean etc.) to let them know you're about to send massive amounts of email (otherwise - ban), and ask them to add a "reverse-DNS" record for your IP (otherwise - ban).
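
Once the reverse-DNS record is in place, it's worth verifying it yourself. Here's a small Python sketch of a "forward-confirmed reverse DNS" check: the IP should resolve to a hostname, and that hostname should resolve back to the same IP. The function names are made up, the IP is from a documentation-only range, and the check as written covers IPv4 only.

```python
import ipaddress
import socket


def ptr_name(ip: str) -> str:
    """The DNS name where the reverse (PTR) record for an IP lives."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_dns_matches(ip: str) -> bool:
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False  # no PTR record, or the hostname doesn't resolve
    return ip in addresses


# Only the PTR name is computed here; reverse_dns_matches() needs network access.
print(ptr_name("192.0.2.10"))
```

Run reverse_dns_matches() against your sending IP after the provider confirms the record is set.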

Build vs Buy

This is a classic "make vs buy" problem, with one exception: you're not actually writing any new software, you're just setting up existing software. It takes some time, but the field is pretty well documented. Email has been around for decades. There's plenty of information out there (although some websites look like they're from the 90s - that's because they are - and at times I feel the docs are intentionally cryptic, no offense).

Setting it all up is a one-time expense. It can take hours, days or even weeks to get it working, but this is knowledge that stays with you (and the company) forever. After that it's pretty low maintenance, with a super-cheap operating cost (the "$9 a month" mentioned above).

The "make" option drawbacks can be minimized by being hosted in the cloud. Like, even if your server goes down you can spawn a replica within seconds and switch the IP address to the cloned machine with just a couple of clicks - and then take your time to investigate what happened.

The "buy" route is much more expensive but provides the blessing of ignorance. Everything is managed for you. You don't have to learn or do anything - just sit back, relax and pray.