
Aug 21

Monitoring

With over 1000 servers and almost as many switches and other pieces of routing equipment across the university, one of the most important things is to make sure that everything is working as expected. There are many approaches we could take towards monitoring the services we run, from mostly manual to entirely automated. To save on man-hours, we want to automate as much of the monitoring as possible – this way, if everything is running as expected, we shouldn’t need to do anything, yet when something goes wrong, we get alerted as soon as possible.

Over the past two years, a large amount of effort has been put into improving our primary monitoring system, to keep the systems we support running smoothly. The old system was comparatively simple, performing only basic checks on each of our servers, such as ping, CPU usage and memory usage. This worked well for monitoring the health of the servers themselves, but didn’t provide any insight as to whether the service itself was working correctly (i.e. whether a website or application was actually running).

We have now replaced the old system with Icinga (an up-to-date fork of the popular Nagios), which can perform the same checks as the old system as well as many more, and this has already vastly improved our monitoring of services (as opposed to servers). Icinga also has a number of other advantages, as follows:

  1. Customisation: we are able to write our own checks – if a new piece of hardware or software needs monitoring and exposes an API, we don’t have to wait for proprietary checks to be developed; we can write something from scratch (there’s a sketch of what such a check can look like after this list). We have already made great use of this, and have developed a custom XML schema which our web developers can use to check connections to all parts of a website.
  2. Distributed monitoring: as we add more checks into our monitoring system, the load these checks cause (both in terms of CPU time and disk I/O to the database) increases – while we could add more resources to the server to account for this, that is only feasible up to a point. A much better solution, which Icinga supports, is distributing the checks across multiple servers, which all feed their data back to one central server. While this adds a significant amount of administration overhead, it means we can add many more checks without fear of overloading one particular server.
  3. Alerts: while the primary view we use is the web interface overview of all alerts, we are also able to get alerts by e-mail and SMS. Once again, the customisation aspect is incredibly useful, as we can configure certain services to send e-mails to a set group of people, depending on what the error is.
  4. Acknowledgements: a lot of the time, the alerts we get from the monitoring system are due to planned work, either upgrading or fixing the systems – when this happens, we want to acknowledge all errors with those systems until the work is complete, and Icinga allows us to do this with ease.
  5. Customised views: the web interface to Icinga allows us to group servers and the checks we run on them by the service they support. This allows us to immediately tell what service might be affected when a problem is noted, and also allows the various teams within iSolutions to keep closer track of the health of the services they support.
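
As a concrete illustration of the customisation point above, here is a minimal sketch of what a custom check can look like. Icinga (like Nagios) simply runs a script, reads the single status line it prints and looks at its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The URL and timeout below are placeholders for illustration – this is not our real configuration, nor the XML-driven website checks mentioned above.

```python
#!/usr/bin/env python3
"""Minimal sketch of a custom Icinga/Nagios-style check plugin.

Plugins follow a simple convention: print a one-line status message and
exit 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).  The URL and
timeout below are purely illustrative.
"""
import sys
import urllib.error
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

URL = "https://www.example.ac.uk/"   # hypothetical site to check
TIMEOUT = 10                         # seconds before giving up

try:
    with urllib.request.urlopen(URL, timeout=TIMEOUT) as response:
        status = response.status
except urllib.error.HTTPError as exc:
    print(f"CRITICAL - {URL} returned HTTP {exc.code}")
    sys.exit(CRITICAL)
except Exception as exc:
    print(f"CRITICAL - could not fetch {URL}: {exc}")
    sys.exit(CRITICAL)

if status == 200:
    print(f"OK - {URL} returned HTTP {status}")
    sys.exit(OK)

print(f"WARNING - {URL} returned unexpected HTTP {status}")
sys.exit(WARNING)
```

Anything that can print a status line and set an exit code can become a check, which is what makes supporting new hardware or software so straightforward.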

There are a number of different ways we could get monitoring information off servers – whether it’s SSH’ing in and running a command, installing an agent on every machine that reports back its status, or using a different interface which exposes this information, such as SNMP.

We have opted for SNMP, primarily because it comes installed with most operating systems, can handle a wider range of devices (such as servers, switches, wireless access points, etc.), and is easily configurable without installing anything else. It also doesn’t open up the security hole which would be presented by allowing the monitoring server to SSH to every server.
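
To give an idea of what an SNMP-based check involves, the sketch below polls a single value (the 1-minute load average from the standard UCD-SNMP-MIB) using the net-snmp command line tools. The host name, community string and threshold are placeholders, and a production check would be rather more careful, but the shape is the same: query an OID, compare it against a threshold, report a status.

```python
#!/usr/bin/env python3
"""Minimal sketch of polling a server over SNMP from the monitoring side.

Assumes the net-snmp command line tools are installed and that the target
runs snmpd with the (placeholder) community string "public".  The OID used
here is UCD-SNMP-MIB::laLoad.1, the 1-minute load average.
"""
import subprocess
import sys

HOST = "server01.example.ac.uk"                 # hypothetical server to poll
COMMUNITY = "public"                            # placeholder community string
LOAD_1MIN_OID = ".1.3.6.1.4.1.2021.10.1.3.1"    # UCD-SNMP-MIB::laLoad.1


def snmp_get(host, oid):
    """Return the value of a single OID as a string, using snmpget."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", host, oid],
        text=True,
    )
    return out.strip().strip('"')


if __name__ == "__main__":
    try:
        load = float(snmp_get(HOST, LOAD_1MIN_OID))
    except (subprocess.CalledProcessError, ValueError) as exc:
        print(f"UNKNOWN - SNMP query to {HOST} failed: {exc}")
        sys.exit(3)

    if load > 8.0:                              # illustrative threshold
        print(f"CRITICAL - load average {load} on {HOST}")
        sys.exit(2)
    print(f"OK - load average {load} on {HOST}")
    sys.exit(0)
```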

On the down side, this puts a lot of load on the monitoring server, rather than letting each server decide when to monitor itself and then report back.

As mentioned above, Icinga allows for distributed monitoring, which is something we make full use of. As well as the “master” server, which collates all the data and serves up the web interface, we have about ten more “slave” servers, which have individual configurations applied to them and each monitor a small subset of servers. They then feed all of this information back to the master server, which, all in all, significantly reduces the amount of work the master server has to do.
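
One common way Icinga supports this kind of feedback is via passive check results: a slave runs the check, and the result eventually arrives at the master as a PROCESS_SERVICE_CHECK_RESULT external command. The sketch below shows what submitting such a result looks like; it is illustrative of the mechanism rather than a description of our exact setup – in practice the slave-to-master transport is handled by dedicated tooling, and the command file path shown is just a commonly used default.

```python
#!/usr/bin/env python3
"""Rough sketch of submitting a passive check result to Icinga.

Icinga reads external commands from a named pipe; the path below is a
commonly used default and may differ on other installs.  A slave would
normally forward results over the network with dedicated tooling rather
than writing to this file directly; this just shows the command format:

    [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
"""
import time

COMMAND_FILE = "/var/lib/icinga/rw/icinga.cmd"   # assumed default path


def submit_result(host, service, return_code, output):
    """Write a PROCESS_SERVICE_CHECK_RESULT external command."""
    line = "[{ts}] PROCESS_SERVICE_CHECK_RESULT;{h};{s};{rc};{out}\n".format(
        ts=int(time.time()), h=host, s=service, rc=return_code, out=output,
    )
    with open(COMMAND_FILE, "w") as cmd_pipe:
        cmd_pipe.write(line)


if __name__ == "__main__":
    # Hypothetical result forwarded from a slave node.
    submit_result("server01.example.ac.uk", "Load Average", 0,
                  "OK - load average 0.42")
```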

This method has a number of problems associated with it – we also have to monitor the Icinga slave servers themselves to ensure they don’t go down, and we need to trust that the slaves will always be able to relay information back to the master server. We also have the issue of organising the configuration across all of these slave nodes, to ensure that the monitoring load is distributed evenly and that no server is monitored more than once.

This is where a tool called nconf has helped us enormously – it is a web interface tool for managing Icinga configuration across multiple servers. It allows us to easily add checks to a server, then assign a certain slave node to monitor that server. It will then generate a new set of configuration files to push to the slave server, copy them across and restart services seamlessly. This saves us the hassle of writing and deploying configuration files by hand, and allows us to make bulk changes very easily.
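
To make that “generate, push, reload” step concrete, here is a rough sketch of the sort of deployment nconf automates for us. The host name, paths and commands are placeholders rather than our actual configuration, and nconf’s own deployment scripts handle the real thing.

```python
#!/usr/bin/env python3
"""Illustrative sketch of deploying generated Icinga config to a slave.

Assumes SSH key access to the slave and that the generated object files
have already been written to GENERATED_DIR.  Host name, paths and service
commands are placeholders, not our real setup.
"""
import subprocess

SLAVE = "icinga-slave01.example.ac.uk"   # hypothetical slave node
GENERATED_DIR = "generated/slave01/"     # config generated for that slave
REMOTE_DIR = "/etc/icinga/objects/"      # where the slave reads config from


def deploy(slave, local_dir, remote_dir):
    # Copy the freshly generated object files across...
    subprocess.check_call(
        ["rsync", "-a", "--delete", local_dir, f"{slave}:{remote_dir}"])
    # ...check the configuration parses before it goes live...
    subprocess.check_call(
        ["ssh", slave, "icinga -v /etc/icinga/icinga.cfg"])
    # ...and reload Icinga so the new checks take effect.
    subprocess.check_call(["ssh", slave, "service icinga reload"])


if __name__ == "__main__":
    deploy(SLAVE, GENERATED_DIR, REMOTE_DIR)
```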

Using the methods above (primarily SNMP), we are able to monitor specific, expected variables, for example “Is the disk on this server full?”, as well as other checks such as “Does this website respond to HTTP requests?”. This doesn’t, however, alert us to any unexpected errors, such as a server losing one of its redundant power supplies (the server would still be responding fine, but it’s not a situation we want to stay in for long). While it would be possible to add a whole set of new checks to monitor “how many power supplies are active”, a lot of hardware instead offers SNMP traps.

These traps are triggered whenever certain events take place (a power supply failure being one example), or when certain thresholds are reached (for instance, the temperature of the server rising above the suggested maximum). They can be configured to be sent to a certain location, as long as it is set up to receive traps. Icinga can be set up to handle SNMP traps from servers, which we are currently in the process of doing, though the overhead of doing this is significant.

When a trap is received, very little information is sent along with it apart from the fact that a specific event happened – normally you have to look up what the error is before you can assess whether it’s important at all, as it could well be a simple informational trap. As such, we have to set up a database of trap IDs and their severities. For example, when we receive a trap saying “a power supply was lost”, we want to be alerted immediately, whereas a trap saying “everything is OK” can be ignored. The initial set-up of this mapping is a tedious process, as many manufacturers use their own sets of trap IDs to describe events specific to their hardware or software. However, the long-term gain is that we get alerted to errors which we otherwise might not even know to look for, well ahead of when they might start causing us problems.
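
To illustrate that mapping, the sketch below is the kind of handler that can be hung off net-snmp’s snmptrapd via its traphandle directive: snmptrapd passes the sending host, its address and the trap’s variable bindings on standard input, and the handler looks the trap OID up in a severity table to decide whether it needs to become an alert. The OIDs and severities shown are placeholders, not our real database.

```python
#!/usr/bin/env python3
"""Sketch of an SNMP trap handler with a severity lookup.

Intended to be wired into net-snmp's snmptrapd via a "traphandle"
directive: snmptrapd writes the sending host, its address, and one
"OID value" pair per line to the handler's standard input.  Depending on
snmptrapd's output options the OIDs may be numeric or symbolic; the
numeric placeholders in SEVERITY_MAP are purely illustrative.
"""
import sys

SNMP_TRAP_OID = ".1.3.6.1.6.3.1.1.4.1.0"   # snmpTrapOID.0, names the trap

# Placeholder mapping of trap OIDs to how urgently we care about them.
SEVERITY_MAP = {
    ".1.3.6.1.4.1.99999.1.1": "critical",   # e.g. "a power supply was lost"
    ".1.3.6.1.4.1.99999.1.2": "ignore",     # e.g. "everything is OK"
}


def main():
    lines = [line.strip() for line in sys.stdin]
    host, address = lines[0], lines[1]

    # Parse the "OID value" pairs into a dictionary of varbinds.
    varbinds = {}
    for line in lines[2:]:
        parts = line.split(None, 1)
        if len(parts) == 2:
            varbinds[parts[0]] = parts[1]

    trap_oid = varbinds.get(SNMP_TRAP_OID, "")
    severity = SEVERITY_MAP.get(trap_oid, "unknown")

    if severity == "ignore":
        return
    # A real handler would raise an Icinga alert here; this just logs it.
    print(f"{severity.upper()}: trap {trap_oid} from {host} ({address})")


if __name__ == "__main__":
    main()
```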

As well as Icinga, we have other monitoring tools which allow us to see how services change on a day-by-day basis. Our filestore and e-mail systems are among the many covered by our own in-house monitoring, which graphs various aspects of each system. These graphs allow us to see usage trends and spot anything out of the ordinary that simple threshold monitoring may not be able to pick up. There’ll be more on this in a future post!
