Thoughts on Performance

Surely it is amusing that I write in here, instead of getting my own diary up to date? 🙂 Hehehe 😉 I’ll get back to it, honest 🙂

I just wanted to note down some thoughts on performance of Websites – for those readers that are building, running, or even just considering running their own.

When you start building dynamic interactive websites you see just how much flexibility you have, you literally can do anything that you want to (within the confines of the web ofcourse!) – it becomes very easy to make this, that, and everything else dynamic, that is – generated on the fly each time someone requests it. The advantage to dynamic pages is that they show effectively custom content, or content that is always up to the second up to date. The downside, is the load that this causes on the computer!

As an example, we had changed our error message pages to be dynamically generated so that we could offer things like the Genre selector and ensure they were always up to date, generated using the colours from the database and so forth (yes, in two commands to the server we can change the colour scheme of the entire site now) – however Steve spotted that these error pages are actually getting hit quite a bit for some odd reason. This is causing extra load on the server that is unnecessary – so he’s changed it now so that they are static html pages that are refreshed every 5 minutes, so they’re almost dynamic, but they come down at static speeds and causing little to no load on the server.

The lesson to be learned here is that dynamic pages are costly to the server and the moment that your site starts getting more and more hits you will see the server skyrocket its load through the roof (as was happening on Dear Diary for a while). The error pages weren’t causing the enormous load, but they were contributing to it.

The biggest cause of extra load, as Steve reported in his entry yesterday, is to do with database indexes. For the uninitiated here’s a quick expose on databases…

A database stores data, pure and simple. You define ‘tables’ that contain the actual data in ‘columns’ – think about something like Microsoft Excel (or any other Spreadsheet package) – you have columns and rows in that too. Databases are a lot more powerful for data storage than a spreadsheet, but the analogy works if you are familiar with spreadsheets 🙂

Think of a spreadsheet row as being one ‘record’ of data in the table and you’re not far off at all 🙂

Each table contains data for a particular area, for example we have a table containing all of the configuration information (colours, icons, menu bars etc), another containing all the entries, another containing all the comments and so on. Getting data into these tables is easy, getting it out is easy too – but if you do it wrong you can quickly make your server do lots of extra work.

So how do you get data out of databases quickly? Use a thing called an ‘index’. In just the same way that your address book has the letters of the alphabet down one side, so that you can quickly find the ‘P’s and from there find the address for someone whose name starts with a P, an Index allows the database to narrow the search down very quickly before it has to go into the main database (which is slower with more data in it).

To give you an idea, as Steve commented in his entry, there are now something like 43,000 entries in the entries table, that’s 43,000 records to search through. Let’s put this into perspective – when you go to someones diary, for example, my diary (username: neutronic), you will see that there are some calendars on the front page. How does it know which ones are filled in and which ones aren’t? It’s got to go look through those 43,000 records and find all my entries! Ouch!

What this means is that if that table doesn’t have the right indexes on it, or if the query is inefficient and doesn’t use the Indexes properly, it’s going to end up looking at each and every record to find the ones that match… Sadly, that’s exactly what it has been doing until yesterday – we didn’t quite have the indexes on that table correct so it couldn’t use them.

Our database software now tells us whenever a query takes more than 1 second to execute, a ‘slow query’. We can then ask the database to tell us how it would execute that query and invariably we will find that it is an index that isn’t quite right, or an aspect of the query that isn’t quite right. We tweak the query or the indexes until it seems to be fairly optimal and then put it back into the system to see how things improve.

That’s what we did with the entries table, fixed the indexes and voila, the machine is now running a lot smoother just with that fix. Why? Because now it just has to search through the index for records that match, and the index is an enormous amount smaller than the real table – for example, the index does not contain the entries themselves and that alone chops tens of megabytes from the size and thus increases the speed dramatically 🙂

We found a couple of other queries were taking in the order of 10 seconds to run (for example, the ‘last updated diaries’ panel) and going through some of the fixes in the new version of our database software (since we upgraded a little while back) we found that some things we tried to do earlier that didn’t work, now would – and within seconds Steve had that query running at around 0.1 seconds.

One final thing to note if you’re running a website is that when one thing runs slowly, everything else will slow down slightly because of it. It’s very much an exponential collapse. The entries query running slowly was effectively making it so that all queries were taking more than a second to return, in fact, on average 2-3 seconds. An analogy could be that if you are asked to add up two numbers you can do it fairly quickly. If I give you a factorial do work out, and then while you’re doing that give you the same simple addition to do, you’ll take longer to do that simple addition because you’re trying to work out the factorial.

So by speeding up the slowest queries the whole site has benefited as a result, the computer quite simply isn’t as busy anymore (by a long shot). Now, we continue to watch the ‘slow queries’ list to see what comes up and work on those that still continue to return slowly.

We still have a lot more to do to bring the site speed up, and right now I’m quite happy with the performance of it. Rewriting the site into PHP will allow us to develop more quickly than with Perl, it will allow a lot more cool and exciting new features to come in than we could achieve with Perl, and first and foremost, PHP runs an ENORMOUS amount faster than Perl. In light of the performance issues with the site (which are mostly resolved now, thanks to (mostly Steve’s) work on the queries yesterday) we have rescheduled our development plans and the PHP version of Dear Diary will begin development as soon as possible.

There are other issues to consider when thinking about performance – something called ‘normalisation’ on your tables can be good for optimising data storage but it can have a negative impact on performance because you have to do more work to get the data out – if you’re designing a database you should try to find a happy medium that gives you suitable normalisation but doesn’t cost you in performance any more than you can afford (for example, any tables that get accessed an enormous amount should aim to not be too normalised.

What is normalisation? Well, put simply normalisation is about not repeating data anywhere. For example, we have a users table that contains just purely user data, and we have a diaries table that contains purely diary information. It’s fairly normalised because there is no duplication (there’s a bit more to it than just duplication, but for the needs of this entry, that’s close enough) – a non normalised version would have one table with all the user data and diary data in one table. Why would that be bad? What if you wanted two diaries for one user? What about a user that doesn’t have
a diary? The first example would end up repeating the user information in two records – now, what happens when the user changes their settings? You have to update it in two places! The second example means that you have to cope with records that aren’t ‘complete’, ie. missing data because they don’t have a diary.

Hark! I hear you cry, but you can’t have two diaries with Dear Diary anyway! Not on a single username! Well, the User Interface doesn’t allow it but trust me when I say the database design does. Hopefully the PHP version will be able to offer out this functionality but the hardest part about adding anything new is the how and where of it. How does the user do it, where do you put the options, how does the user know how to operate it, all these questions and more must be answered. User Interfaces really are a tough thing to get right (as you can see from our customiser, which is most definitely not right 🙂 ).

Well, hopefully this has been a little informative and instructive to those of our users that aren’t savvy about Databases (99% of you I suspect ;)) – if you have any questions about this then by all means drop us a note, we’re more than happy to explain things (if only because as we explain it to others, we spot things that are right/wrong and that helps us too 🙂 ).