Wednesday, August 31, 2011

Replacing sendmail and /var/spool/mail/mqueue with Gearman

I am obsessed with making things faster without spending an extraordinary amount of time reinventing the wheel. I ran into an issue with sending mail that I couldn't work around with a pure sendmail configuration. I needed to write the body of the mail, have it queued, and then have sendmail send it. The way sendmail works, I couldn't force mail to queue first and send later without ugly hacks like blocking outbound port 25 connections. I tried all the common tweaks to make sendmail send faster, such as

  • Disabling sendmail's DNS lookups
  • Using options such as FEATURE(`accept_unresolvable_domains') and FEATURE(`nocanonify')

but none worked. In fact, I was generating so much traffic that sendmail would refuse to accept any more commands, and so I lost emails. Let me describe the setup and its growth rate to explain the problem in more detail. Sendmail is configured to dump all mail to SendGrid, a cloud provider that handles email delivery for you. There is no need for reverse DNS lookups or any other checks, since SendGrid does them for me. All I need is to dump the mail to SendGrid as quickly as possible.

I found that sendmail is not fast enough to dump mail to SendGrid. I could have used another mail transfer agent, but I didn't want to configure more things. The environment is very simple. PHP talks to localhost port 25 to hand the generated email to sendmail. Sendmail will try to connect to SendGrid and, once connected, block the client (Apache) until the mail is sent. The average send time is 1 second, so the Apache server blocks for 1 second per email sent. Blocking is especially bad in this case, since in a social application a single action by a logged-in user can contact thousands of users. Imagine that one user-generated event contacts 5000 users. If each email takes 1 second, that process takes 5000 seconds (about 83 minutes) to finish. Now that the problem is defined, let's find a good solution that the sending can be handed off to.
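To make the blocking cost concrete, here is the same arithmetic as a tiny Python sketch. The function and constant names are mine, made up for illustration; the only inputs are the 1 second average send time and the 5000 recipients from the example above.

```python
# Back-of-the-envelope cost of sending synchronously through sendmail.
SECONDS_PER_EMAIL = 1.0   # average send time quoted in the post
RECIPIENTS = 5000         # one user-generated event


def blocking_seconds(recipients, seconds_per_email=SECONDS_PER_EMAIL):
    """Total time the request blocks when emails are sent one after another."""
    return recipients * seconds_per_email


total = blocking_seconds(RECIPIENTS)
print(f"{total:.0f} s, about {total / 60:.0f} minutes")
# prints "5000 s, about 83 minutes"
```

The cost grows linearly with the number of recipients, which is why the send has to move out of the request path.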

I picked Gearman. I now build the email and write it to a Gearman queue. Around 30 workers listen to the queue and dump the mail directly to smtp.sendgrid.net via the PHP library Swift Mailer. What was gained in doing this?
  • Contacting 5000 email recipients takes 200ms since I am writing to memory
  • Apache does not block for too long (220ms)
  • I do not use local disk since sendmail is bypassed; /var/spool/mail/mqueue is not used
  • I save on disk I/O and CPU cycles
  • I can now send at a higher rate with hardly any errors
  • I can scale with more workers
Yet with all these gains, I am currently exposed to losing email. If SendGrid goes down and the queue builds up, I can run out of memory. There are ways to work around this, such as setting up persistent storage with Gearman, so the exposure can be mitigated.
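The shape of the setup described above (the web request only enqueues, and a pool of workers does the slow sending) can be sketched in a few lines of Python. This is not my production code, which is PHP with Gearman and Swift. It is a minimal stand-in that assumes an in-process queue.Queue in place of the Gearman queue and a hypothetical deliver() function in place of the SMTP hand-off to SendGrid. All names here are made up for illustration.

```python
import queue
import threading
import time

NUM_WORKERS = 30              # worker count from the post
jobs = queue.Queue()          # in-memory stand-in for the Gearman queue
sent = []                     # delivered recipients, for illustration only
sent_lock = threading.Lock()


def deliver(recipient, body):
    """Hypothetical stand-in for the SMTP hand-off to SendGrid."""
    time.sleep(0.01)          # pretend network latency, shortened for the demo
    with sent_lock:
        sent.append(recipient)


def worker():
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        deliver(*job)
        jobs.task_done()


def enqueue_all(recipients, body):
    """What the web request does: queue the jobs and return right away."""
    start = time.perf_counter()
    for recipient in recipients:
        jobs.put((recipient, body))
    return time.perf_counter() - start


def run_demo(n=300):
    """Start the workers, enqueue n emails, wait for the queue to drain."""
    sent.clear()
    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    recipients = [f"user{i}@example.com" for i in range(n)]
    enqueue_time = enqueue_all(recipients, "hello")
    jobs.join()               # block until every queued email is delivered
    for _ in threads:
        jobs.put(None)        # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return enqueue_time, len(sent)


if __name__ == "__main__":
    enqueue_time, delivered = run_demo()
    print(f"enqueued in {enqueue_time * 1000:.1f} ms, delivered {delivered}")
```

The sketch also shows the exposure mentioned above: the queue lives only in memory, so if the workers stall the backlog grows without bound, and anything still queued is lost if the process dies.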

This image shows how large my Gearman queue gets. With 30 workers, it peaks at 400 jobs at any given time. As a result I am sending 300K emails a day, and each email event is no more than 1 minute old from when it was created.
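As a sanity check on those numbers, here is a rough calculation. It assumes each worker still needs about 1 second per send, which is the figure I measured through sendmail and not a measurement of the workers themselves, so treat the results as ballpark only.

```python
EMAILS_PER_DAY = 300_000
WORKERS = 30
PEAK_QUEUE = 400
SECONDS_PER_SEND = 1.0   # assumption: same average send time as through sendmail

avg_rate = EMAILS_PER_DAY / 86_400      # average emails per second over a day
capacity = WORKERS / SECONDS_PER_SEND   # emails per second the workers can clear
drain = PEAK_QUEUE / capacity           # seconds to empty a full peak queue

print(f"average load: {avg_rate:.2f} emails/s")
print(f"worker capacity: {capacity:.0f} emails/s")
print(f"time to drain peak queue: {drain:.1f} s")
```

Under that assumption the average load is about 3.5 emails per second against a capacity of 30, and a peak backlog of 400 drains in roughly 13 seconds, which fits with no email event being more than a minute old.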

