Improved Site Speed and Web Crawler Management

Quick Summary: The website is running 20% faster across all pages (<2 seconds load time), especially book details pages. This was accomplished by prioritizing page load speed and fresh information for real visitors, rather than everyone (which included a LOT of web crawlers and bots).

Hey everyone, it’s been a while since we have talked on this blog, so I wanted to give you an update on the work that has occurred over the past few weeks. While nobody has complained (since the site was still loading pages in less than 3 seconds), I realized there was room for efficiency in the code that runs NovelRank. Why? Because at any given second there are 40-50 real people accessing the site, while many multiples of that number of pages are being accessed by robots: web crawlers, bots, RSS readers, widgets, etc. That’s a lot of information to process, and while NovelRank’s servers (~$100/month in costs) have been up to the task, their load was starting to get high.

So I went to work identifying the sources of the database server overload and came up with this list (highest to lowest):

Charting requests, especially for long-term data (more than 30 days)
Detailed stats for sales rank: minimum, maximum, average, etc. (refreshed automatically every 30 days)
CSV exports by bad bots who do not obey NOINDEX and NOFOLLOW
Activity tracking for individual books (how I can tell someone is still looking at it)

Smarter System

With this in hand, I first tackled the problem by making the system smarter. In all of the above areas, when a request comes in, the system checks whether the database is currently working REALLY HARD on something (i.e. a high number of queries or a high total number of minutes spent processing active queries). If it is working hard, a gentle message is displayed to the visitor asking them to try again in a few moments. This was critical in stopping a cascade effect where work piles on top of work, making all of the tasks take many times longer to complete because the resources are split between them. It also keeps the server from overloading itself while it is locking tables during the nightly backups (which are taken in case of critical failure).
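To make the idea concrete, here is a minimal sketch in Python. It is not the actual NovelRank code: the threshold values, the `DbLoad` structure, and the function names are all assumptions for illustration, and how the load numbers are read from the database is left out.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical thresholds; the real values are not given in this post.
MAX_ACTIVE_QUERIES = 50
MAX_ACTIVE_QUERY_MINUTES = 5.0

BUSY_MESSAGE = "The server is busy right now. Please try again in a few moments."


@dataclass
class DbLoad:
    """Snapshot of database load (how it is measured is assumed, not shown)."""
    active_queries: int
    active_query_minutes: float  # total minutes spent so far by active queries


def is_overloaded(load: DbLoad) -> bool:
    """True if either load signal is above its threshold."""
    return (
        load.active_queries > MAX_ACTIVE_QUERIES
        or load.active_query_minutes > MAX_ACTIVE_QUERY_MINUTES
    )


def handle_heavy_request(load: DbLoad, do_work: Callable[[], str]) -> str:
    """Run an expensive request only when the database has room for it."""
    if is_overloaded(load):
        # Refuse politely instead of piling more work onto a busy database.
        return BUSY_MESSAGE
    return do_work()
```

The point of checking before starting the work is that a refused request costs almost nothing, while an accepted one competes with everything already running.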

Web Crawlers and Bad Bots

The second approach was to deal with bots, and more importantly, bad bots. See, search engines have bots (i.e. web crawlers) that look throughout your website and allow you to show up in search results. This is a good thing! However, not all bots are created equal, and not all of them obey the language of robots (like NOINDEX and NOFOLLOW). On top of that, a robot may look at books on the site that no real person has viewed in months, falsely making those books look active and triggering a slew of extra behaviours. So, I increased awareness and detection of bots, choosing not to display charts to them at all (no use) and to serve them a cached version of the page (if available). They don’t need very fresh data, and it also speeds up their access to the pages (relevant for GOOD bots).

The added benefit of the bot changes is that activity tracking will be more accurate. This means that zombie books, books that are no longer actively being watched by a real person, can actually be identified. Translation: They can be deactivated, thus increasing the frequency of sales rank checks for books still being tracked by active users on the website.
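The bot handling described above could look roughly like this in Python. This is only a sketch under assumptions: the user-agent pattern is a deliberately short, hypothetical list (real detection needs more than this), and `cache`, `render_fresh`, and `record_activity` are stand-in names, not NovelRank's real functions.

```python
import re

# Hypothetical and deliberately short; real bot detection needs a longer list
# and usually more signals than the User-Agent header alone.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp|feed", re.IGNORECASE)


def is_bot(user_agent: str) -> bool:
    """Guess whether a request comes from a bot.

    Treating a missing User-Agent as a bot is an assumption of this sketch.
    """
    return not user_agent or bool(BOT_PATTERN.search(user_agent))


def serve_book_page(book_id, user_agent, cache, render_fresh, record_activity):
    """Serve a book details page, treating bots and real visitors differently."""
    if is_bot(user_agent):
        # Bots get the cached copy when there is one, never get charts,
        # and never count as activity on the book.
        cached = cache.get(book_id)
        if cached is not None:
            return cached
        return render_fresh(book_id, include_charts=False)

    # Only real visitors keep a book marked as actively watched.
    record_activity(book_id)
    return render_fresh(book_id, include_charts=True)
```

Because `record_activity` is only called for real visitors, a crawler walking through old pages no longer makes a zombie book look alive.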

Book Stats

The final problem was book statistics. These are the calculations of sales rank information displayed on the book’s details page that show lowest, highest, averages, and standard deviation for that book’s sales rank. Calculating these, due to the sheer amount of data in the database, takes effort on the processor side of the database. Thus, many months ago, this was relegated to occurring only 30+ days after the last time it was processed. This helped (a lot), but it was still determined by when visitors to the site requested the update (including bots) by visiting the page, meaning that there would be bursts of requests causing overload and slowing things down.
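For a sense of what is being calculated, here is a small Python sketch of the statistics and the 30-day check. The function names are made up for illustration, and using the population standard deviation is an assumption, since this post does not say which variant the site uses.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Optional, Sequence

REFRESH_INTERVAL = timedelta(days=30)


def needs_refresh(last_calculated: Optional[datetime], now: datetime) -> bool:
    """Stats are due if never calculated or if 30+ days have passed."""
    return last_calculated is None or now - last_calculated >= REFRESH_INTERVAL


def sales_rank_stats(ranks: Sequence[int]) -> dict:
    """Summarize one book's sales rank history (must be non-empty)."""
    return {
        "lowest": min(ranks),
        "highest": max(ranks),
        "average": mean(ranks),
        # Population standard deviation; an assumption of this sketch.
        "std_dev": pstdev(ranks),
    }
```

On a handful of values this is trivial. The cost comes from running it over a book's entire sales rank history, for many books at once, whenever a burst of page views arrives.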

Now, if the system detects that new statistics need to be calculated, it adds that book’s info to a queue. That queue has a little background worker that is gently tapped on the shoulder and asked to please process the information at its earliest convenience. Okay, maybe that’s a bit of a stretch, analogy-wise. What actually happens is that the system spawns a single worker script that works through the queue, one book at a time, and only if the server isn’t already overloaded (the same check described above for charts). The worker gets through the queue as fast as possible so it can take a break until the next batch comes in. This means that the page loads immediately for you (no waiting for it to get the new stats), and the next time you visit that page (after an hour or so), the new stats will be automatically available, instantly.
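Here is a minimal Python sketch of that queue-plus-single-worker idea. It is an illustration, not the real worker script: the `StatsQueue` class, skipping duplicate entries, and the in-memory "already running" flag are all assumptions (a real setup would more likely keep the queue in the database and use something like a lock file).

```python
from collections import deque
from typing import Callable


class StatsQueue:
    """Queue of books waiting for their statistics to be recalculated."""

    def __init__(self, recalculate: Callable[[int], None],
                 is_overloaded: Callable[[], bool]):
        self._pending = deque()
        self._queued = set()  # avoids queuing the same book twice (assumption)
        self._recalculate = recalculate
        self._is_overloaded = is_overloaded
        self._worker_running = False

    def enqueue(self, book_id: int) -> None:
        """Called from the page request; cheap, so the page loads immediately."""
        if book_id not in self._queued:
            self._queued.add(book_id)
            self._pending.append(book_id)

    def run_worker(self) -> int:
        """Process the queue one book at a time; return how many were done."""
        if self._worker_running:
            return 0  # only a single worker at a time
        self._worker_running = True
        processed = 0
        try:
            # Stop early if the server gets busy; leftovers wait for next run.
            while self._pending and not self._is_overloaded():
                book_id = self._pending.popleft()
                self._queued.discard(book_id)
                self._recalculate(book_id)
                processed += 1
        finally:
            self._worker_running = False
        return processed
```

A page view only ever calls `enqueue`, which is why the visitor never waits on the calculation itself.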

Neat, huh?

Final Thoughts and Added Benefits

With the statistics moved into the background, I can offer a feature in the future where users request an update at any time they wish, effectively adding their book to the queue for processing. This is one of the features I will be making available to PRO NovelRank accounts (more info coming soon!). Overall, by January 2013 the service should have better speed, stability, and availability than it has ever had in the last 3 years.
