So, it probably goes something like this: you've got your software as a service, you're selling it, people are loving it, and you're building and releasing new functionality all the time. Things are going well, until they stop going well. You notice things are running a bit slow or, even worse, your customers notice. You log in to your servers and find what? The CPU is maxed out, your database is living in the page file, or you're close to running out of disk space and fragmented to hell. Either way everything is grinding to a halt and, as a result, you are boned. Obviously you sort this out ASAP, your customers get over it and everyone can relax again. Except that it's going to happen again, unless you can work out when and make sure you're ready for it next time. This is what happened to me about a year ago. It had actually been happening for some time, but that was when I started working on making sure it wouldn't happen again. I certainly don't know everything there is to know about this stuff, but I know enough to have survived since then without repeating the mistakes of the past. Here are my top tips, things I think you should do now if you aren't already.
1) Start measuring server performance
Measure the performance of your servers: CPU, memory, disk reads and writes, and free disk space. I use Hyperic for this; it's pretty simple to install and set up. I run the server component on a separate machine. You should at least check out their recommended specs, and be careful about installing it on a production server which is already stressed or key to the performance of your app. Monitoring like this lets you track the load on your servers over a long period of time and get an idea of where the next bottleneck might come from. You can also set alerts to notify you should something get out of hand, e.g. free disk space falling below 25%, CPU utilisation over 75% for more than 5 minutes, whatever. There's a free edition, and aside from a few UI peculiarities I've not had any serious issues with it in 12 months of use.
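If you want to see what alert rules like those amount to, here is a rough sketch in Python. It is not how Hyperic does it and doesn't use any Hyperic API; the function names and the sample format (a list of timestamp and value pairs) are my own assumptions, purely for illustration.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, threshold, min_duration):
    """Return True if the value stays above `threshold` for at least
    `min_duration` without dipping back under it.

    `samples` is a list of (timestamp, value) pairs sorted by time.
    This input format is an assumption made for the sketch.
    """
    run_start = None  # when the current run above the threshold began
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts
            if ts - run_start >= min_duration:
                return True
        else:
            run_start = None  # dipped below, so start counting again
    return False


def disk_space_low(free_bytes, total_bytes, min_free_fraction=0.25):
    """Return True if the free fraction of the disk is below the minimum."""
    return free_bytes / total_bytes < min_free_fraction
```

Feed it one CPU sample a minute and call `sustained_breach(samples, 75.0, timedelta(minutes=5))` to get the "over 75% for five minutes" rule. A single quiet sample resets the clock, which is usually what you want so that short bursts don't page you.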
2) Start measuring some application level metrics
For example, number of users on the system, number of widgets wangled, etc. Flickr monitors pictures uploaded and downloaded, for instance, which showed them that Sunday is when they get the most uploads, and Monday is when people back at work start browsing said photos. This is very useful as it gives you the background to how people use your service, and how that in turn drives the server measurements above. I'd been collecting this data long before making any use of it, as user logins and various actions were recorded in an audit trail. You want the data somewhere it's easily queried, so a database is ideal. You may well be able to get this data into Hyperic, although I've not tried. I currently use total user logins per day and then pick the peak day each week to represent that week. I'm interested in the peaks since it's those I need to plan for, and looking at data at a weekly resolution is fine for my needs.
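Picking the peak day for each week is easy to do in a few lines of code once you have daily totals. This is only a sketch: it assumes you've already pulled a mapping of date to total logins out of your audit trail, and the function name and the use of ISO weeks (Monday to Sunday) are my choices for the example.

```python
from collections import defaultdict
from datetime import date


def weekly_peaks(daily_logins):
    """Reduce daily login totals to one figure per week: the busiest day.

    `daily_logins` maps a `date` to the total logins on that day.
    Returns a dict mapping (iso_year, iso_week) to the peak daily total.
    """
    peaks = defaultdict(int)
    for day, count in daily_logins.items():
        iso = day.isocalendar()
        week = (iso[0], iso[1])  # ISO year and ISO week number
        peaks[week] = max(peaks[week], count)
    return dict(peaks)
```

Using the peak rather than the weekly average is deliberate. It's the busiest day that falls over first, so that's the number to plan around.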
3) Make some simple predictions
Now you've got some data, which may mean waiting a while after point 2 above, it's time to work out when the shit is next going to hit the fan. I've used Excel for this so far as a) I know how to get it to do what I want, and b) it works well enough for my current needs. I've worked primarily with my user login numbers, which predict load on my servers closely enough for me to get upgrades in at the right time. That's probably because I really only have the one server doing everything; if you've a number of more specialised servers you'll probably want to work out which application metric drives load on each one. Start simple though, then add more detail if you need to. What you are looking for here is a trend over time. I have peaks every 3 months and the pattern from one year to the next is very similar, so last year's figures predict this year's very closely if I simply multiply them up.
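I do this in Excel, but the "multiply last year up" idea looks roughly like this in Python. Treat it as a sketch under assumptions: both lists hold weekly peak logins lined up week for week, and the growth factor is taken as the ratio of this year's weeks so far to the same weeks last year. The function names are made up for the example.

```python
def growth_factor(this_year_so_far, last_year_same_weeks):
    """Ratio of this year's logins to last year's over the same weeks."""
    return sum(this_year_so_far) / sum(last_year_same_weeks)


def predict_rest_of_year(last_year, this_year_so_far):
    """Scale the remaining weeks of last year by the observed growth.

    `last_year` is a full list of weekly peak logins for last year.
    `this_year_so_far` covers the first weeks of this year, aligned
    with the start of `last_year`.
    """
    weeks_seen = len(this_year_so_far)
    factor = growth_factor(this_year_so_far, last_year[:weeks_seen])
    return [round(peak * factor) for peak in last_year[weeks_seen:]]
```

It only works if your year-on-year pattern really does repeat, as mine does. If the shape of your year changes, the predicted spikes will land in the wrong place.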
4) Work out when you're boned
It's as simple as this. You know what (CPU, memory or disk) is, or is likely to be, the bottleneck on a given server. You know that a given level of your application metric correlates roughly to some level of load on that server. You know roughly what increase in that metric gets you some measurable increase in load. You know where you think that metric will go over the next few months. So at what point will it push that server to a load it can't sustain? Make sure you get the relevant upgrade in before that happens, and be aware of how long it takes to actually get the upgrade in. Some upgrades are quicker than others, and some require more of your time than others, e.g. replacing a CPU with a faster one vs moving to a machine with the capacity to add a second CPU.
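Put together, the reasoning above can be sketched like this. The big assumption, and it is only an assumption, is that load scales roughly in a straight line with the metric, so a single "CPU percent per login" figure taken from a recent peak is good enough. The names, the 75% default limit and the week-based lead time are all illustrative.

```python
def first_week_over_limit(predicted_logins, cpu_per_login, cpu_limit=75.0):
    """Return the index of the first week whose predicted load exceeds
    the limit, or None if none of the predicted weeks do.

    `cpu_per_login` is CPU percent per login, e.g. a recent peak CPU
    figure divided by the logins on that day (assumes linear scaling).
    """
    for week, logins in enumerate(predicted_logins):
        if logins * cpu_per_login > cpu_limit:
            return week
    return None


def start_upgrade_by(breach_week, lead_time_weeks):
    """Latest week to start the upgrade so it lands before the breach."""
    if breach_week is None:
        return None
    return max(0, breach_week - lead_time_weeks)
```

So if you peaked at 60% CPU on a 400 login day, `cpu_per_login` is 0.15, and a predicted 520 login week comes out at 78%, which is over the line. Subtract however long the upgrade really takes and that's your deadline.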
A concrete example of all the above for me is CPU, which is often the driver behind our upgrades. Our logins grow about 70% every 12 months at the moment, with spikes every 3 months. Since CPU capacity can't be increased in small steps I don't worry about directly predicting utilisation. Instead I ask myself whether I think we'll get through the next three-monthly spike, based on current CPU utilisation and how high the spike in user logins is predicted to be. If I'm concerned I get the next upgrade; if not, I wait. With CPU specifically, I try to keep average utilisation under 75% over any significant amount of time. Significant isn't a very large amount of time when users are waiting for a response from your servers. This is based on the work of far smarter people than myself; there's a good write up here.
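To put rough numbers on that question, here is a small Python sketch. It assumes growth compounds smoothly through the year (mine doesn't quite, it comes in spikes) and that CPU scales in proportion to logins. The function names are invented for the example and the 75% limit is the figure mentioned above.

```python
def growth_over_months(annual_growth, months):
    """Growth multiplier over `months`, assuming smooth compounding.

    An `annual_growth` of 0.70 means logins grow 70% over 12 months.
    """
    return (1 + annual_growth) ** (months / 12)


def should_upgrade(current_peak_cpu, current_peak_logins,
                   predicted_spike_logins, cpu_limit=75.0):
    """Return True if the next spike is predicted to push CPU over the limit.

    Scales the current peak CPU percentage by the ratio of predicted
    spike logins to current peak logins (assumes proportional load).
    """
    projected_cpu = (current_peak_cpu * predicted_spike_logins
                     / current_peak_logins)
    return projected_cpu > cpu_limit
```

At 70% a year, `growth_over_months(0.70, 3)` comes out at a bit over 1.14, so even without the spikes you'd expect about 14% more logins each quarter. That's a handy sanity check on whatever your spike prediction says.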
That's it for now. Buy John Allspaw's book if you're really interested, or if you think you may need to get a handle on this stuff.