On Servers and Monitoring
One of the most important things to get right when setting up server infrastructure is monitoring. Proper server monitoring is your best friend and will easily pay back any time and money you invest in getting it right.
On the face of it, server monitoring seems like a fairly simple problem. Just decide what needs to be monitored, find a tool to do it, and set up alerts when something goes wrong. Right? Well… Sort of. There are a few issues to look out for when managing server monitoring infrastructure. Before we go too much further though, I’d like to clarify the difference between an alert and a metric. I believe this distinction will clear up a lot of disagreements between the ‘monitor everything’ and ‘don’t monitor too much’ camps.
Monitoring, Alerts, and Metrics⌗
Let’s start with monitoring. Monitoring is the overarching term for the systems you put in place to alert you when something in your infrastructure goes wrong and to let you efficiently track down the root cause of an issue, even after it’s happened.
The definition of monitoring already gives some clues as to the role of alerts. Alerts are the mechanism by which your monitoring system lets you and your team know when something is wrong. Alerts may be based on metrics, lines in a log file, an error from an application, or anything else. The important thing to bear in mind is that these are the things that send you an email/SMS/push notification or whatever contact mechanism you use.
Metrics, on the other hand, are silent items usually used for longer-term trends. These may include things like “number of errors in log file x”, CPU utilization, website response time, etc… These items are there if you need them, but should not intrude on your day-to-day life. Remember, you can have alerts based on metrics; in that case, the alert is what is interrupting you - not the metric. You may even have multiple alerts on a single metric.
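To make the distinction concrete, here’s a minimal Python sketch. The disk-usage metric and both thresholds are purely illustrative, not taken from any real tool: the metric is sampled silently on every run, while two separate alerts - a warning and a critical - are layered on top of that one metric.

```python
import shutil

def disk_used_percent() -> float:
    # The metric: sampled quietly and stored for long-term trends.
    usage = shutil.disk_usage("/")
    return usage.used / usage.total * 100

def alerts_for(value: float) -> list[str]:
    # Two alerts layered on the same metric; only these interrupt anyone.
    if value >= 90:  # illustrative threshold
        return ["CRITICAL: disk over 90% full"]
    if value >= 80:  # illustrative threshold
        return ["WARNING: disk over 80% full"]
    return []
```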
Alert Fatigue⌗
One of the most common issues in server monitoring is alert fatigue. Ever heard the story of the boy who cried wolf? Well, that applies to server monitoring too! How many times have you received an alert from a server, glanced at it, and thought to yourself, “Yeah, I know what that is, it’s not an issue”? I’m willing to bet most people have experienced this in at least a few workplaces.
This may not seem like a big issue at first. After all, you’re still being alerted to the real issues too; what harm are a few false alarms going to do, other than take up five minutes of your day? Well, much like the boy who cried wolf, pretty soon you will begin to ignore them. No matter how professional you are, I guarantee you: if you have too many false alerts coming in, you will miss an important one. I’ve seen all sorts of ‘workarounds’, none of which actually solved the problem and all of which ended up causing the users to miss real alerts - everything from filtering all alert emails into a folder to trying to selectively filter away the false positives (I’ll let you guess how that one went!). Don’t brush the problem under the carpet. Fix it!
Thinking that to yourself more than once or twice a month should be a humongous red warning sign that something is not right. Luckily, the fix is easy and comes in two parts. First, when adding a new alert, spend some time looking back through history to see when it would have triggered in the past. Ensure at this point that, based on historical data, it would only trigger when there’s an actual problem. If there are things such as short spikes that don’t constitute an actual issue, now is the time to base the alert on something like a rolling average (sketched below). Second, every (and I mean EVERY) time you get an alert that doesn’t constitute an actual issue, treat that as a high-priority bug. Why high priority? Because it is. Sure, that alert alone won’t cause too many issues, but what about when you have 5, 10, or 20 alerts that fire when they shouldn’t? Pretty soon alert fatigue sets in and your monitoring infrastructure might as well not exist. These bugs are important, and you’d be well served not to ignore them.
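As a rough illustration of that first part, here’s a hypothetical rolling-average check in Python. The window size, threshold, and CPU samples are all invented for the example; the point is that a single short spike never fires, but sustained load does.

```python
from collections import deque

class RollingAverageAlert:
    """Fire only when the average over the whole window crosses the
    threshold, so one short spike doesn't page anyone at 3 a.m."""

    def __init__(self, threshold: float, window: int = 5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history to judge yet
        return sum(self.samples) / len(self.samples) > self.threshold

alert = RollingAverageAlert(threshold=90.0, window=5)
for cpu in [40, 97, 45, 42, 41, 93, 94, 95, 96, 97]:
    if alert.observe(cpu):
        print(f"ALERT: sustained high CPU (latest sample {cpu}%)")
```

Run against those samples, the lone 97% spike early on stays silent; the alert only fires once the window average itself sits above 90%.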
Notice how different levels of alert didn’t come into this. Levels such as Critical, Warning, and Info (why?!) do have a use, but it’s not to distinguish between alerts to be acted upon and alerts not to be acted upon. Instead, they should be used to prioritize. If you have two servers, one with a critical alert and one with a warning alert, you know you need to work on the critical one first and the one with the warning second. Every alert should still represent a real-world issue that needs intervention, no matter the alert level. Let me repeat that, because it is important. Every alert should represent a real-world issue that needs intervention, no matter the alert level.
So, what do I monitor?⌗
A common question from new sysadmins. Now that we know the difference between Alerts and Metrics it becomes a little easier to answer because we can now split it into two parts. What alerts should I have, and what metrics should I keep?
Metrics⌗
Starting with metrics, my opinion is to keep as much as you can. You will usually be limited either by the cost of the infrastructure (disk space, etc.) or by the amount of time it takes to implement the metric collection. You should aim to keep at least six months’ worth of metrics with a granularity of at least one hour. If you do have your system aggregate older data, try to get it to keep the average, maximum, and minimum over each period. Keeping six months’ worth of data provides a good balance between being able to see long-term trends and keeping disk usage under control. Even longer periods can be useful for capacity planning, but aggregating the data down to daily values will usually suffice for that in my experience.
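If your tooling doesn’t aggregate for you, the idea is simple enough to sketch by hand. Here’s an illustrative Python function that buckets raw samples into hours, keeping the average, maximum, and minimum of each bucket (the bucket size and data shape are just examples, not any particular tool’s format):

```python
from statistics import mean

def downsample(samples: list[tuple[int, float]], bucket_seconds: int = 3600):
    """Aggregate raw (timestamp, value) samples into hourly buckets,
    keeping (average, max, min) so spikes survive the compression."""
    buckets: dict[int, list[float]] = {}
    for timestamp, value in samples:
        buckets.setdefault(timestamp // bucket_seconds, []).append(value)
    return {
        bucket * bucket_seconds: (mean(values), max(values), min(values))
        for bucket, values in sorted(buckets.items())
    }
```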
When starting out, lots of new sysadmins begin by monitoring the memory, CPU, and disk space of the server. Although at first this seems like a sensible thing to do, those aren’t necessarily the most useful attributes to monitor. When deciding what to monitor, start at the top of the stack and work down. For a website, the bare-bones monitoring is often an uptime measurement, then response time, then the number of lines in error logs. Having these in place allows you to know about any issue the customer would see, usually before they even see it themselves. In a professional environment, this provides a vital opportunity to notify support personnel of a problem and get out in front of it. A company that knows there’s a problem before a customer phones up looks infinitely more professional than one that has to rely on a customer telling them that the site is down.
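As a taste of what top-of-the-stack monitoring can look like, here’s a bare-bones uptime and response-time probe in Python using only the standard library. A real setup would use a proper monitoring tool and probe from multiple locations; the URL below is hypothetical.

```python
import time
import urllib.request

def probe(url: str, timeout: float = 10.0) -> tuple[bool, float]:
    """One uptime check: returns (is_up, response_time_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            up = 200 <= response.status < 400
    except OSError:  # covers URLError, HTTPError, timeouts, DNS failures
        up = False
    return up, time.monotonic() - start

# Hypothetical endpoint; record both values as metrics, alert on `up`.
up, seconds = probe("https://example.com/health")
```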
This high-level monitoring also provides data you can base your KPIs on. What the business cares about is ensuring that the site is as reliable as possible. If you can provide data clearly showing that downtime has dropped from five hours over the preceding 12 months to one hour over the past 12 months, you will be able to show the value of investing in monitoring in a way that management will understand and appreciate.
Once you’ve got the high-level monitoring sorted, you can begin thinking about monitoring some of the underlying resources. Now is the time to consider setting up metrics for CPU/memory usage, number of requests, etc… Many off-the-shelf tools have pre-configured monitoring for a wide range of system resources that will provide a really solid base for debugging issues that have happened in the past.
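If you do want to roll your own resource metrics, something like the following sketch works. It assumes the third-party psutil package, and the exact set of metrics is just an example:

```python
import time
import psutil  # third-party: pip install psutil

def sample_system_metrics() -> dict[str, float]:
    """One sample of the usual low-level resources, ready to ship to
    whatever metric store you use."""
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }
```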
Once the initial monitoring is set up, I tend to keep adding as many metrics as I can get away with within the monitoring system’s limitations. Bear in mind, however, that you may need to swap out metrics over time if you find one that’s more important or needed for an alert. That’s fine; your monitoring system is a living project and is never done. If it’s not evolving, you’re not doing it right.
Alerts⌗
Alerts are a little more nuanced than metrics. Whereas metrics sit in the background until you need them, so you lose nothing by adding more than you need, alerts will interrupt your day. Furthermore, as discussed above, too many false alerts will cause alert fatigue.
Just as with metrics, high-level customer-facing metrics are a good place to start creating alerts. The obvious first one: whenever the site goes down, raise an alert. You can then discuss with key stakeholders what acceptable levels of performance look like. You will probably want to express this in percentiles, which I won’t get into in depth here, but they are very useful for describing how fast the website should be for what proportion of visits. You can then create alerts based on these goals, knowing that whenever you fail to meet the acceptable levels of performance, that’s something to investigate.
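For the curious, here’s a quick Python illustration of the idea, using the nearest-rank percentile method. The response times and the one-second goal are invented for the example:

```python
import math

def percentile(sorted_samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value that pct% of samples sit at or below."""
    index = max(0, math.ceil(pct / 100 * len(sorted_samples)) - 1)
    return sorted_samples[index]

# Invented response times (seconds) and an invented goal agreed with
# stakeholders: 95% of requests should complete in under one second.
response_times = sorted([0.21, 0.25, 0.30, 0.31, 0.35, 0.42, 0.48, 0.55, 0.90, 2.10])
p95 = percentile(response_times, 95)
if p95 > 1.0:
    print(f"ALERT: p95 response time {p95:.2f}s exceeds the 1.0s goal")
```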
The next easy target is CPU/memory/etc… However, this isn’t as cut-and-dried as it may first seem. Let’s take CPU as a quick example: what is an acceptable level? For a server running a mostly memory-intensive load, your CPU usage may sit down in the 3-4% range, and anything over 50% may need looking into. On the other hand, a well-balanced server may have a usual CPU usage of around 70%, fluctuating up to 80% and down to 60%. Here you may set an alert at 90% or 95% to account for occasional spikes. Taking this to the extreme, if you have a server that is expected to be encoding videos at all times, CPU dropping below 95% may be a sign something isn’t right. The point of all this is that every workload is unique. Don’t just blindly accept the defaults provided by whatever system you are using; think about what your server metrics look like normally, and what they would look like if something weren’t right.
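One way to capture this is to make the expected band part of each alert’s configuration rather than a single global default. A small illustrative sketch, with hypothetical workload names and the numbers from the examples above:

```python
# Hypothetical workload names; (floor, ceiling) bands mirror the examples above.
CPU_BANDS: dict[str, tuple[float | None, float | None]] = {
    "memory-bound-db": (None, 50.0),  # normally 3-4%; over 50% is suspicious
    "balanced-web": (None, 90.0),     # normally ~70%, spiking to 80%
    "video-encoder": (95.0, None),    # should be pegged; *below* 95% is the anomaly
}

def cpu_alert(workload: str, cpu_percent: float) -> str | None:
    low, high = CPU_BANDS[workload]
    if low is not None and cpu_percent < low:
        return f"{workload}: CPU {cpu_percent:.0f}% below expected floor {low:.0f}%"
    if high is not None and cpu_percent > high:
        return f"{workload}: CPU {cpu_percent:.0f}% above expected ceiling {high:.0f}%"
    return None
```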
Just remember, it’s better to have no alert at all than one that causes alert fatigue. Alert fatigue can cause every alert from the monitoring system to be ignored, whereas a missing alert costs you only that one signal.
Parting Thoughts⌗
This post applies to all areas of monitoring, be it a Linux server, a Windows server, networking equipment, or anything else. The same basic rules and pitfalls apply. It was intentionally written to be tool- and system-agnostic, not as a how-to.
This post also doesn’t cover logging; proper and efficient logging could fill an entire book by itself. If you’re interested in moving on from appending lines to a file whenever something happens, look into “Structured Logging”. There’s also a whole rabbit hole around observability; @mipsytipsy on Twitter has some interesting views, and her tweets are a good starting point for learning more.
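As a parting taste of what structured logging means in practice, here’s a minimal Python sketch that emits one JSON object per log line using only the standard library; real setups usually reach for a dedicated structured-logging library instead.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line instead of free-form text."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.info("order processed")  # -> {"time": "...", "level": "INFO", ...}
```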