Usage tracking

This is about tracking the usage of individual resources in your environment - how busy a CPU is, how much disk space or RAM is in use, and so on.

Tracking the usage of a particular resource is useful in several respects; it allows you to:

  • predict a point at which the resource will be exhausted, e.g. no more free licenses
  • allocate proportional costs to each business department
  • check expectations against reality - it may be a warning sign if you see heavy usage of a resource that should be lightly used, or is not marked as critical
  • consolidate, where you see underutilised resources.

My belief is that we tend not to track these things closely enough, but that improving should be easy.


Most systems offer simple ways to monitor key measurements, yet often these go unused, and the data is rarely managed systematically. Sometimes the data is gathered automatically by monitoring, but used only for immediate alerts.

A simple example is checking the free space in a filesystem: if you do this only when you get a complaint from users, you will merely confirm that your disk is already 100% full. One step better is to get your monitoring to alert you when it is at 90% - this will help avoid nasty surprises. But when you get such an alert, you still don’t know the current rate of growth; whether it has been steadily increasing for weeks, or just spiked in the last few minutes.

So let’s plan to check free space in all our filesystems regularly, and store the results in a simple database. Not hard. But with this data we can inspect the history and extrapolate trends, which is useful both for managing a current problem and for avoiding future ones.
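As a rough illustration of "check regularly and store the results", here is a minimal Python sketch that uses only the standard library. The table name disk_usage, its columns, and the function name record_disk_usage are assumptions made for this example, not a prescribed schema.

```python
import shutil
import sqlite3
import time


def record_disk_usage(conn, paths, now=None):
    """Store one usage sample per path and return the rows written."""
    # Hypothetical schema: one row per filesystem per sampling run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS disk_usage ("
        "sampled_at INTEGER, path TEXT, total_bytes INTEGER, used_bytes INTEGER)"
    )
    sampled_at = int(time.time() if now is None else now)
    rows = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        rows.append((sampled_at, path, usage.total, usage.used))
    conn.executemany("INSERT INTO disk_usage VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return rows
```

Run from a scheduler such as cron, with a connection from sqlite3.connect() on a file of your choosing and a list of mount points, this builds up the history that a one-off check cannot give you.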

Examples of metrics you might want to track are:

  • disk usage
  • CPU load
  • memory usage
  • network traffic
  • transactions per minute
  • response time
  • number of logged-in users

Note that your metrics may overlap – if you have a filesystem shared between several applications, you may want to track disk space used by a particular application or user, as well as the space used overall.

The measurement interval will vary depending on the metric – it may be sufficient to check disk usage once per day (unless it is something particularly volatile, perhaps for temporary caching). Something more variable like CPU usage will need to be sampled much more frequently.

Where you sample extremely frequently, it may not be appropriate to keep every reading for an extended period, simply because of the amount of data involved. Where possible, try to aggregate figures into an average or a sum rather than take just one sample: if you measure CPU every 5 seconds it is better to derive and keep “average CPU utilisation per hour” rather than just taking one sample for each hour (which could be wildly unrepresentative).
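The aggregation step might look something like the sketch below. It assumes samples arrive as (Unix timestamp, value) pairs; the function name hourly_averages is invented for illustration.

```python
from collections import defaultdict


def hourly_averages(samples):
    """Collapse (unix_timestamp, value) samples into {hour_start: mean value}."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its hour.
        hour_start = int(timestamp) - int(timestamp) % 3600
        buckets[hour_start].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}
```

Once the hourly figures are stored, the raw 5-second readings can be discarded after whatever short retention period suits you.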

Usually the more volatile metrics (CPU, response time, …) are the sort of things you already observe via conventional monitoring anyway. Ideally you can just extract the samples you need from the monitoring system and aggregate as required.

Of course, you will drive all this data collection from your CMDB – thus ensuring that every component gets tracked as soon as it is added to the environment.

Why do all this?

Real-time monitoring should be handling the most immediate concerns – usage tracking is about painting the wider picture.

In the literal sense of “picture” … usage history should be viewable in graph form to make sense of the data.

In the event that a system starts getting heavily loaded, a quick review of performance data should show whether this is a one-off (something unusual is going on), the culmination of months of steady growth, or something that happens regularly (every Wednesday night we see…).

Ideally, there should also be some intelligent tools to extrapolate: “this disk will fill in about 3 months”. [See separate article on Usage Patterns.]
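One simple way to produce an estimate like this is to fit a straight line through the stored history and see where it crosses the capacity. The sketch below assumes growth is roughly linear, which will not always hold, and takes samples as (day number, amount used) pairs; days_until_full is a hypothetical name.

```python
def days_until_full(samples, capacity):
    """Estimate days from the latest sample until usage reaches capacity.

    Fits a least-squares straight line through (day, used) samples.
    Returns None when there is too little data or usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        return None  # all samples taken on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # no growth, so no projected fill date
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, full_day - last_day)
```

For example, three daily readings of 100, 110 and 120 GB against a 200 GB capacity give a growth rate of 10 GB per day and an estimate of 8 days remaining.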

Also, this data can form the basis of a microcharging regime. [See separate article on Microcharging.]

There are other side benefits, e.g. identifying overused or underused systems, or as a cross-check that load-balancing is working evenly. You may see that one user has far more stored data than average. It may help you identify complementary workloads that could be combined onto a single machine, such as a nighttime batch and an office-hours utility.
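The load-balancing cross-check could be as simple as comparing each node's average load with the mean for the pool. This sketch assumes you already have a per-node average for the period of interest; the 25% tolerance and the name uneven_nodes are arbitrary choices for illustration.

```python
def uneven_nodes(avg_load, tolerance=0.25):
    """Return {node: load / pool mean} for nodes outside the tolerance band.

    avg_load maps node name to its average load over some period.
    tolerance is the allowed deviation as a fraction of the pool mean.
    """
    if not avg_load:
        return {}
    mean = sum(avg_load.values()) / len(avg_load)
    if mean == 0:
        return {}
    return {
        node: load / mean
        for node, load in avg_load.items()
        if abs(load - mean) > tolerance * mean
    }
```

The same comparison works for spotting a user whose stored data is far above average: feed in per-user totals instead of per-node loads.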
