WashYourLyrics - Scalability

  • March 8, 2020
  • Introduction

    This is the story of how we helped WashYourLyrics.com stay online under the weight of traffic from Twitter.

    With the UK responding to the COVID-19 outbreak, the National Health Service (NHS) put out an official poster showing correct hand-washing technique, along with the suggestion that people sing two verses of “Happy Birthday” while washing to make sure they spend long enough cleaning.

    Meanwhile, William (@neoncloth) was writing a website to allow visitors to put their choice of lyrics to the NHS Infographic.

    🎶 I made a little site in 24h that generates hand washing instructions accompanied by lyrics from a song of your choice instantly - check it out!

    • William (@neoncloth) March 8, 2020

    This predictably went viral on Twitter (no pun intended). It was clear that the site was suffering under extreme load, with the services behind song poster generation taking anywhere between 100ms and 60s to respond.

    Tom, our CTO, reached out to William on Twitter, offering DevOps assistance with scaling the site if required.

    At 6pm, Matt Hancock MP, the Secretary of State for Health and Social Care, retweeted William’s tweet, driving further traffic to the site.

    Which brings us to around 8pm, when Tom’s assistance was requested.

    After a brief chat over Twitter DM, we added William to one of our Slack workspaces for a longer-form discussion with our entire team.

    • Tom: General DevOps wizard.
    • Noah: K8s genius and JavaScript expert.
    • Matt: Experienced sysadmin.

    We did a quick analysis of the tech stack. It’s all Kubernetes-based.

    • Frontend/API is a Node.js app.
    • Microprocessor, the image generation backend.
    • Redis
    • MongoDB, in a 3-node ReplicaSet
    • Cloudflare up front.
    • Mesa as the WebSocket broker handling pub/sub.

    It was immediately apparent that the primary database server was under extreme load.

    Despite having 3 nodes, all the traffic was seemingly only going to one server. The cause of this was excessive write-heavy database traffic. It appeared that read-write splitting wasn’t occurring.

    Tom changed the DB connection options, using the latest features available in the underlying driver, so that reads could go to the secondary nodes and writes only to the primary.

    Next, we investigated database indexing and added new multi-field indexes on frequently queried fields. This improved performance somewhat, but writes were still taking absolutely forever.

    Tom and Noah started a code-review of the application’s code, finding performance bugs and bottlenecks.

    We discovered an app-layer rate limit which, counter-intuitively, was generating one extra database lookup per POST request. This was removed after we suggested that rate limiting is best applied at the service mesh level or, since the application is already behind Cloudflare, at the edge, where a per-IP rate limit is simple to configure.
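
    To illustrate why the extra lookup was avoidable: a per-IP limit needs only a counter and a timestamp per client, and neither has to live in the primary database. The sketch below is a hypothetical fixed-window limiter held in process memory. The class name, limit, and window are assumptions for illustration, not William’s code, and on this site the equivalent rule belongs in Cloudflare rather than in the app.

```typescript
// Hypothetical fixed-window, per-IP rate limiter kept in process memory.
// Illustrative only: each app replica would count separately, which is one
// reason to enforce the limit at the edge instead of inside the app.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the IP is over its limit.
  // `now` is injectable so the behaviour can be checked without real clocks.
  allow(ip: string, now: number = Date.now()): boolean {
    const current = this.windows.get(ip);
    if (current === undefined || now - current.start >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      this.windows.set(ip, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.maxRequests) {
      return false;
    }
    current.count += 1;
    return true;
  }
}
```

    No database is touched on any request path here. A real deployment would also need to evict idle entries, which this sketch leaves out.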

    We also discovered that every time the Download Poster button was pressed, a database write occurred incrementing a counter. Under Tom’s guidance, this field was removed from the primary database.

    The download counter field was moved to Redis, which was already in the tech stack and is better suited to atomic increment operations. Ideally, this kind of metric would be gathered in a time-series database such as InfluxDB or Graphite.
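
    As a rough sketch of what that move looks like in code: Redis’s INCR command atomically adds one to the integer stored at a key and returns the new value, so counting a download becomes a single cheap operation. Everything named below (CounterStore, InMemoryCounterStore, recordDownload, the downloads: key prefix) is hypothetical, and the in-memory class only stands in for a Redis client so that the example runs on its own.

```typescript
// Minimal shape of the one Redis command this sketch relies on. INCR
// atomically adds 1 to the integer stored at `key` (a missing key counts
// as 0) and returns the new value.
interface CounterStore {
  incr(key: string): Promise<number>;
}

// In-memory stand-in so the example runs without a Redis server.
// Assumption: in production this would be a thin wrapper around a Redis client.
class InMemoryCounterStore implements CounterStore {
  private readonly values = new Map<string, number>();

  async incr(key: string): Promise<number> {
    const next = (this.values.get(key) ?? 0) + 1;
    this.values.set(key, next);
    return next;
  }
}

// Hypothetical logic behind the Download Poster button: one atomic
// increment in the counter store, and no write to the primary database.
async function recordDownload(store: CounterStore, posterId: string): Promise<number> {
  return store.incr(`downloads:${posterId}`);
}
```

    Keeping the handler behind a small interface like this would also make it straightforward to swap in a time-series backend later.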

    Noah recommended that William refactor the create function to limit the number of database writes that happen when a song is selected, essentially replacing create with findOrCreate. This had the desired effect, dropping database load from 250% to more like 50%.
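
    The difference between the two approaches can be sketched with a small in-memory store that counts writes. This is an illustration under assumptions, not the site’s actual code: the Song shape, the SongStore class, and the idea that the original create wrote on every selection are all hypothetical stand-ins for the real model.

```typescript
interface Song {
  id: number;
  artist: string;
  title: string;
}

// Tiny in-memory stand-in for the songs collection that counts writes.
class SongStore {
  writes = 0;
  private readonly byKey = new Map<string, Song>();

  private static key(artist: string, title: string): string {
    return `${artist}\u0000${title}`;
  }

  // Assumed original behaviour: every selection performs a write.
  create(artist: string, title: string): Song {
    const song: Song = { id: this.writes + 1, artist, title };
    this.byKey.set(SongStore.key(artist, title), song);
    this.writes += 1;
    return song;
  }

  // Read first, and only write when the song has not been seen before.
  findOrCreate(artist: string, title: string): Song {
    const existing = this.byKey.get(SongStore.key(artist, title));
    return existing ?? this.create(artist, title);
  }
}
```

    With popular songs being selected over and over, most calls to findOrCreate end in a read rather than a write. Against a real MongoDB collection, a single upsert (an update with upsert: true and $setOnInsert) would be one way to get the same effect without a race between the find and the create.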

    Huge thanks to @devopstom, @StrideNoah and @VMCloudyMatt from @TwinHelixDNA for helping out with scaling Wash Your Lyrics and identifying bottlenecks. The app is now so much more stable thanks to them!

    • William (@neoncloth) March 9, 2020

    With those changes in place, the site was once again stable, handling upwards of 1200 concurrent users with consistent response times around 100ms for image generation.

    This is one of the things we do best.

    We’re small and agile enough to just throw our time behind something that’s so important and interesting. We’re still in regular communication with William to ensure the site’s survival no matter what the internet throws at it.

    We’ve been in touch with DigitalOcean and Cloudflare, who are both keen to help support the sustainability of the site as a public service. We’re dedicating our time and effort to this for free to help raise awareness of simple health measures during these difficult times.

    Update:

    Cloudflare upgraded William’s domain to a Business Plan for a few months. We’re all very grateful as this means greater visibility of incoming traffic, plus the ability to have granular control and rate-limiting rules applied.

    We also discovered that MongoDB’s default readPreference sends all queries (including reads) to the primary node in a replica set, which is all sorts of What The Fuckery.

    Changing that to secondaryPreferred dramatically reduced the load on the primary MongoDB server and spread the queries far more evenly across all nodes.
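
    One place this setting can live is the connection string, where readPreference is a standard MongoDB option. The helper below is a minimal sketch that sets or replaces that option on an existing URI. The function name, host names, database name, and replica set name are made up for illustration, and it assumes the URI already includes a database path before any options.

```typescript
type ReadPreference =
  | "primary"
  | "primaryPreferred"
  | "secondary"
  | "secondaryPreferred"
  | "nearest";

// Returns `uri` with its readPreference option set to `preference`,
// replacing any existing readPreference and keeping other options as-is.
function withReadPreference(uri: string, preference: ReadPreference): string {
  const queryStart = uri.indexOf("?");
  const base = queryStart === -1 ? uri : uri.slice(0, queryStart);
  const query = queryStart === -1 ? "" : uri.slice(queryStart + 1);

  const kept = query
    .split("&")
    .filter((pair) => pair !== "" && pair.toLowerCase().indexOf("readpreference=") !== 0);
  kept.push(`readPreference=${preference}`);

  return `${base}?${kept.join("&")}`;
}

// Hypothetical three-node replica set, matching the shape described above.
const tunedUri = withReadPreference(
  "mongodb://db-0,db-1,db-2/lyrics?replicaSet=rs0",
  "secondaryPreferred",
);
console.log(tunedUri);
```

    The trade-off is that secondaries replicate asynchronously, so reads served from them can be slightly stale. That is usually fine for lookups like these, but worth keeping in mind for anything that must read its own writes.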

    The final piece in the puzzle was to eliminate case-insensitive regex searches from the pre-generation lookup, because they’re terrible for performance and generally can’t make effective use of indexes. Once those were gone, API response times were consistent and significantly lower, averaging around 200ms.
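
    A common way to drop a case-insensitive regex is to store a normalised copy of the searched field at write time and do an exact-match lookup on it, which an ordinary index can serve. The sketch below shows the idea with an in-memory map standing in for the indexed collection. The names normaliseKey, Poster, titleKey, and PosterIndex are hypothetical, and this is not necessarily how the site implemented the fix.

```typescript
// Normalise user-supplied text into a canonical lookup key:
// trimmed, lower-cased, with runs of whitespace collapsed.
function normaliseKey(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

interface Poster {
  title: string; // as the user typed it, kept for display
  titleKey: string; // normalised form, the field an index would cover
}

// In-memory stand-in for an indexed collection keyed on titleKey.
class PosterIndex {
  private readonly byKey = new Map<string, Poster>();

  save(title: string): Poster {
    const poster: Poster = { title, titleKey: normaliseKey(title) };
    this.byKey.set(poster.titleKey, poster);
    return poster;
  }

  // Exact-match lookup on the normalised key, in place of a
  // case-insensitive regex over the original title.
  find(title: string): Poster | undefined {
    return this.byKey.get(normaliseKey(title));
  }
}
```

    The cost is one extra field per document and a little work on every write, in exchange for lookups that no longer have to scan and pattern-match.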

    Statistics

    👐 Wash Your Lyrics: Day 2 Stats

    🔥 237,564 posters created (+ 148,037 / 265.3%)

    👤 420,139 unique users (+ 298,860 / 346.4%)

    🖥 8,103,669 server requests (+ 5,915,146 / 370.2%)