Netsight is the technical partner for a project that will deliver structured discussion forums to the general public, akin to Disqus. This is anticipated to be a very high-traffic site, with the potential for large numbers of concurrent users generating and viewing rapidly growing quantities of content.
The first stage of this project was to deliver scalable prototype software to confirm performance and likely ongoing costs. Our overall aim for testing was to deliver a site that represents real-world use while ensuring an average response time of less than one second per page.
Cloud-based hosting providers offer a simple, cost-effective means of dealing with the dynamic and potentially high load of this type of project.
For the application itself we decided to start testing with a simple off-the-shelf Django forum app for two reasons: 1) it partially resembles the type of content that we envisage being created in this project; 2) Django is quick to develop with and widely supported by cloud-based hosting services (including AWS, which we already use). Some minor modifications were made to the forum app to allow load-testing services to create and access random content. Although there are a number of solutions out there, we chose Heroku for the purposes of testing, as their Platform as a Service (PaaS) appeared popular and simple to use, as well as offering the easiest way to scale a deployment.
Setting up load testing
As I move into the slightly more technical description of the process we undertook, it may be helpful to define a few of the terms I’ll be using:
- Topic – The off-the-shelf app we chose uses the term Topic to refer to a thread or top level forum post
- Post – A comment / reply to a Topic
- Vote – A simple thumbs up / down of a Topic
- Node – I use node to refer to any of the following components in the deployment:
  - dyno (Heroku's term for its lightweight application containers, akin to a VM)
  - database (a Heroku PostgreSQL instance)
We then constructed typical user scenarios for the load testing service to run:
- View a topic on the site (picked at random)
- Comment on a topic on the site
- Vote on a topic on the site
- Create a new topic
The ratio of these scenarios was set to approximate the real world usage of a typical forum (for example, considerably more views than topic creations). We then ran these scenarios over increasing quantities of procedurally generated content.
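As an illustration of how such a weighted scenario mix can be driven, here is a minimal sketch in plain Python. The scenario names match the list above, but the specific weights are assumptions for illustration; our actual load-testing service configured the ratio in its own interface.

```python
import random
from collections import Counter

# Hypothetical weights approximating real-world forum usage:
# far more views than topic creations. The exact ratio is an assumption.
SCENARIOS = {
    "view_topic": 80,
    "comment_on_topic": 10,
    "vote_on_topic": 8,
    "create_topic": 2,
}

def pick_scenario(rng=random):
    """Choose the next scenario for a simulated user, weighted by ratio."""
    names = list(SCENARIOS)
    weights = [SCENARIOS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Over many iterations the mix converges on the configured ratio.
counts = Counter(pick_scenario() for _ in range(10_000))
```

Over 10,000 simulated picks, views will vastly outnumber topic creations, matching the intent of the ratio.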
To evaluate the overall testing process, we started with only 400 topics, with 4,000 posts across them and 10,000 votes across those posts. With these quantities of data, and with 50 concurrent users performing the scenarios above over the course of 10 minutes, we saw an average response time of 781ms. This was achieved with just the most basic Heroku package (1 dyno). We then ran the same test over 30 minutes (and this remained the test duration going forwards) to check whether results over the short duration were being served almost entirely from cache. This increased the average response time to 1.4s, which we still consider reasonable. Doubling the number of web workers to two brought the average response time back down to 812ms.
Stress testing with larger datasets, more complex user scenarios
After validating that the testing setup worked, we increased the dataset substantially to resemble a more established, growing real-world community.
Based on our research of numbers published by both Reddit and Disqus, we opted to go with approximately 400,000 topics, 4 million posts and 67 million votes. Up until this point all our tests were with anonymous users, so we also created 400,000 users and updated the scenarios to log in a random user before creating content (views remained anonymous, which is typical for this kind of site).
Not surprisingly, we initially saw failures in all of our test scenarios.
Looking into the detailed logs that Heroku provides, we found that the app was timing out while calculating the number of posts in a topic. Digging into the code of the off-the-shelf Django forum app, we found that on every post or topic creation, the total number of posts and topics for a forum was calculated by running a query over the entire post table. This is the kind of seemingly simple functionality that actually requires careful consideration when developing a scalable web app. Typically this kind of calculation over a large dataset might either be performed by a separate automated job, or asynchronously to realtime user interaction (as accurate post and topic counts can afford to be slightly delayed).
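Another common approach is to keep a denormalized counter that is bumped atomically alongside each insert, rather than recomputed with a full-table count. The sketch below uses the stdlib sqlite3 module purely for illustration (the real app used Django's ORM, and the table and column names here are hypothetical):

```python
import sqlite3

# Minimal sketch of a denormalized post counter. Table/column names are
# hypothetical; the real fix would live in the Django forum app's models.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE forum (id INTEGER PRIMARY KEY, post_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE post (id INTEGER PRIMARY KEY, forum_id INTEGER NOT NULL);
INSERT INTO forum (id) VALUES (1);
""")

def create_post(forum_id):
    # Insert the post and bump the cached counter in one transaction,
    # instead of re-running COUNT(*) over the entire post table.
    with conn:
        conn.execute("INSERT INTO post (forum_id) VALUES (?)", (forum_id,))
        conn.execute(
            "UPDATE forum SET post_count = post_count + 1 WHERE id = ?",
            (forum_id,),
        )

for _ in range(3):
    create_post(1)

count = conn.execute("SELECT post_count FROM forum WHERE id = 1").fetchone()[0]
```

Reading the counter is then a single-row lookup whose cost stays constant no matter how large the post table grows.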
Once a fix for this was in place, our scenarios returned to 100% test passes, although our average response time (with just the standard database node) had increased to an unacceptable 3.6s.
Looking through the Heroku metrics, we could see that the load on the PostgreSQL node was very high. Our next step, therefore, was to create what Heroku calls a “follower” database. This is effectively a read-only copy of the main database. Heroku then load balances read requests across both nodes.
After adding the follower database, we immediately saw an improvement in performance: we were now achieving an average response time of just under one second (our original objective).
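For readers curious how read/write splitting looks at the application layer, Django supports it via database routers. The sketch below is illustrative only (the `default` and `follower` alias names are assumptions, and in our setup Heroku handled the balancing for us); the class follows the shape of a Django router but is plain Python:

```python
import random

# Sketch of a read/write-splitting router in the style of Django's
# DATABASE_ROUTERS hook. Alias names ("default", "follower") are assumptions.
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        # Spread reads across the primary and the read-only follower.
        return random.choice(["default", "follower"])

    def db_for_write(self, model, **hints):
        # All writes must go to the primary; the follower is read-only.
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same data, so relations are always fine.
        return True
```

The key constraint is the one the follower imposes: reads can go anywhere, but every write must land on the primary.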
Conclusions and next steps
I am convinced that Heroku can do even better than this. Heroku has some very high-profile clients, including Macy's and Toyota Europe; websites such as these give Heroku quite a reputation and a proven track record.
Django itself can also go further. Disqus and Instagram (to name two proven high-performance sites) both run on Django and cope with billions of page views per month whilst remaining highly responsive.
The fact is: we have simply reached the limit of the off-the-shelf forum app. From briefly reading through the codebase, we could clearly see that it was never designed with scalability in mind. This was evident from the issue we ran into with timeouts, for example, as the off-the-shelf forum app repeatedly attempted to run a query over the entire post table when calculating the number of posts in a topic.
At this point it’s worth restating the modifications we made to the off-the-shelf app to allow it to be load tested. Most importantly, we added code to view random topics. This allowed the load-testing software to visit a static URL and get different content each time. To facilitate this, however, we had to explicitly disable caching of this URL. In the real world, web caching software (such as Varnish) serves repeated requests to the same URL directly, in a fraction of the time, instead of passing each request on to the application itself. This would have significantly reduced response times as well as allowing the application itself to handle many more concurrent requests.
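The random-topic endpoint itself was trivial, which is why it is worth spelling out its caching implications. A sketch of the core logic (framework plumbing omitted; the URL scheme and topic count are hypothetical, and in Django the view would additionally be wrapped in a decorator such as @never_cache):

```python
import random

# Hypothetical: topic IDs present in the pilot dataset (400 topics).
TOPIC_IDS = list(range(1, 401))

def random_topic_path(rng=random):
    # The static /random-topic/ URL redirects to a different topic URL on
    # each request. Because the same URL yields different content every
    # time, per-URL caching for it must be disabled by design.
    topic_id = rng.choice(TOPIC_IDS)
    return f"/topic/{topic_id}/"
```

This is exactly the pattern a cache like Varnish cannot help with, since the whole point of the URL is to return something different on every hit.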
This was a proof-of-concept, however, and it served its purpose.
When the real-world solution is implemented, we will build the core application from scratch using little third-party code (beyond the framework itself). This puts the development team in the best position to anticipate and methodically test for scalability issues throughout the design phase, reducing the likelihood of a problem arising later in development or in production.
Ultimately, we have confirmed our confidence that Django and Heroku are a great combination for high-traffic, scalable web applications. I’ve personally taken away two main points from this experience: the crucial importance of multi-layer caching, and the fundamental importance of owning and understanding the architecture of the application you intend to build.