Jan 22

Yet more spam filtering…

I feel I’ve written a bit too much about spam filtering recently… but nonetheless, here’s one more, only with some graphs and maps! Attacks against our mail servers haven’t stopped since my last update – they’ve increased – but despite this, very little spam (I won’t go as far as to say none, as I know that’s not true!) has gone out from our mail servers. In my last post, I said that we’d had 190,203 attempts to send messages without authenticating. At the time, that seemed like a large number… but we’re now occasionally seeing that many attempts in a single day – we recently saw 47,040 attempts in a single hour. In the last four weeks, we’ve seen over 830,000 attempts!

Attempts at password brute-forcing continue too. Since November 9th, we’ve automatically blocked 189,400 (non-unique) IP addresses for brute-forcing, meaning we’ve seen close to a million brute-force attempts in that time period. As the graphs below show, December 22nd was a day when we were hit heavily: we blocked 2,207 IPs in a single hour that day, with a total of 23,761 IPs blocked over the whole day.

Banned IPs per Hour

Banned IPs per Day

The graphs also seem to show that the rate of blocks has slowed down since Christmas. Prior to Christmas, we were blocking on average 3,510 IPs a day; since Christmas it’s down to about 1,450. The data here is slightly skewed, however: to help combat the attacks on December 22nd, we started blocking IPs for longer. The effect is that each IP can’t make as many attempts per day or be blocked as many times per day, hence the reduction on the graph.

A valid question to ask here is “who are we blocking?” Well, the map below shows the countries in which we’ve blocked IPs. The grey countries have had nothing blocked at all, and the brighter the red, the more we’ve blocked. As you can see, there are very few countries in which we haven’t had to block an IP. (Disclaimer: the locations on this map are not necessarily the locations of the people responsible for sending the spam – they are more likely the locations of machines infected with malware being used to send out spam.)

Banned IP Map

Heat-map of countries who’ve had IPs blocked since November
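
(For the curious: producing a map like this is mostly a matter of resolving each banned IP to a country and counting. Below is a rough sketch of that aggregation step in Python – the input file name is hypothetical, and it assumes the freely available GeoLite2 country database read via the geoip2 library. The per-country counts are then fed to whatever mapping tool you like.)

    import collections
    import geoip2.database   # pip install geoip2
    import geoip2.errors

    # Hypothetical input: one banned IP per line, as dumped from our block logs.
    counts = collections.Counter()
    with geoip2.database.Reader("GeoLite2-Country.mmdb") as reader:
        with open("banned_ips.txt") as banned:
            for line in banned:
                ip = line.strip()
                if not ip:
                    continue
                try:
                    counts[reader.country(ip).country.iso_code] += 1
                except geoip2.errors.AddressNotFoundError:
                    counts["unknown"] += 1

    # Country code and count, ready to colour the map with.
    for country, total in counts.most_common():
        print(country, total)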

Thankfully – and this is possibly the most important thing here – the number of compromised accounts we’ve seen has decreased drastically. In my last post, I said we’d detected 28 compromised accounts in October. As the graph below shows, this dropped to eleven in November and only four in December, making it the quietest month in at least a year. In the week leading up to the Christmas break, we didn’t detect any compromised accounts at all, something we haven’t managed since March! All that said, January hasn’t been as good – we’ve reset 8 accounts so far this month. These are most likely the result of inbound phishing attempts, which we’ve seen a distinct increase in this month.

Password Resets per Week

Password Resets per Month

As a final bit of good news, all of our outbound mailservers still have a >95% SenderScore rating, with all our edge SMTP servers having a score of 100% now! Again, we’ve seen no new blacklistings – not one in four months now.

No post about spam would be complete without some negativity though. Some of the attacks we’ve seen in recent weeks have been a lot more manual (less “script-kiddie” stuff) than we usually see. A couple of the compromised accounts we detected were sending out low volumes of spam from our remote-access servers, having downloaded mass-mailer applications once they’d got there. One attack also tried to use University webspace to host phishing scams targeting us, complete with very convincing fake e-mail login pages. Very worrying, but we caught them in time.

Universities are a big target for attackers – lots of users (meaning a potential for a decent number of less computer-literate users), lots of accounts, lots of servers, lots of potentially vulnerable services, and high-bandwidth Internet connections with respected mail relays. The fact that universities are a high-profile target became even more apparent in an attack we saw recently: in one of the above cases of an attack on our remote-access servers, the connection to our server had come from the VPN of another UK-based university. The attacker was sending out phishing scams to other universities and, in another case, was downloading VPN access information from other universities using a different compromised account at that institution. We’ve naturally spoken to all the universities involved and advised them of what we’ve seen, and they’ve been very helpful in shutting these compromised accounts down.

It seems that we’re just about back on the winning side when it comes to outbound spam, but given that the attacks are increasing and getting more complex, there’s no time for us to get complacent.

Jan 14

More Statistics

Just over a year ago (in fact one of our first posts!) we posted some statistics about our estate and infrastructure. Here’s an update on where we are, plus some new stats:

  • Our data centre is now mostly virtual. We have 68 VMware servers and 4 Solaris virtualisation hosts, running nearly 1200 virtual machines. Including virtualisation hosts, but excluding HPC, we have fewer than 300 physical servers, meaning we are now more than 80% virtualised. Most of these physical systems are virtualisation hosts, or are at remote sites.
  • Our primary filestore cluster holds over 171TB of data (up 35% since March), of which 100TB is user files, 54TB is shared/resource data, 6TB is teaching and other data, and 11TB is backups.
  • The primary filestore cluster contains over 130 million files (a more than 50% increase since March), but it still holds true that 75% of them are smaller than 128KB.
  • Somewhat unsurprisingly, 31% of the files on the primary filestore cluster are Word Documents, JPEGs and PDFs.
  • Our VMware virtual infrastructure now has roughly 15TB of RAM (up ~50%) and 270TB of storage (more than double what we had at the time of the last post)
  • The 68 VMware hosts have 3.3 Terahertz of CPU capacity
  • During the day in term time, approximately 1600 people log in to Blackboard per hour. Weekends are much quieter with Saturday being the quietest day of the week. Wednesday is the quietest day of the working week with approximately 1400 people logging in per hour.
  • At peak times on an average week, there are now over 2,000 people logged in to SUSSED.
  • Since putting wireless into halls, we see more than 200,000 authentication requests to Eduroam per hour during peak times. Before term started, this peaked at around 80,000 requests per hour.

Nov 06

Spam Filtering Update

We’ve had a full month of our new edge SMTP service in place now, so I thought I’d do a very quick post taking a look at how well it has performed over the last four weeks. First, the positives:

  • 190,203 (!) attempts to send messages without authenticating
  • A sum total of 430,431 (!!!) attempts to brute force passwords
  • 2,022 IP addresses automatically banned (in addition to the 213,809 we block anyway)
  • The 2,022 IPs were blocked a total of 69,459 times
  • 28 detected compromised accounts – this is the fewest in a month since April
  • Only two compromised accounts in the last week of October – the fewest in a week since April 7th
  • No new blacklistings for any outbound mail server!
  • SenderScore reputation >95% on all outbound mail servers

It’s not all positive though – there are still some things to improve:

  • Approximately 559 spam messages managed to get through the new service, from about five spam attacks (this number is actually very low – previously, that many could have gone out in under half an hour)
  • A similar number of e-mails and attacks via other, non-SMTP methods
  • A number of false-positive detections on accounts. We’re working on fixing these now.

The outlook (somewhat unintentional pun) is positive though. Let’s see what November brings!


Oct 13

Spam Filtering

Junk mail – it’s irritating enough when it comes through your door, but it’s just as frustrating when it clutters up your e-mail inbox too. Sadly, there’s no one-stop solution for dealing with spam – it’s a constant war between the spammers and the spammed. As we deal with one wave of spam, the next wave is already being set up.

As we’ve mentioned in a previous post, whilst spam coming in to the University is irritating, it’s a disaster if we send it out. That’s what we’re focussing on in this post, as we’ve recently made some changes to help us detect and prevent outbound spam. To the uninitiated, this raises a very valid question: how and why are we sending out spam?

There are many sources of spam: virus/malware infected workstations (often operating as part of a botnet), spammer-bought VPSs, compromised servers, compromised web sites, and compromised user accounts, to name just a few. Some of these result in spam coming in to the University, which we attempt to block (with a relatively high degree of success). Compromised workstations, websites, servers and user accounts are all on the list of how we could send spam out. If a workstation is infected with malware, it can be used to try and send out spam. This one is fairly trivial to fix: we block port 25 (the port used for sending mail to other organisations) outbound almost everywhere. As such, an infected workstation can try and send spam to mail servers, but it won’t be able to connect. The same is true of the significant majority of our servers.
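
(As an aside, it’s easy to verify from a workstation that an outbound port 25 block is doing its job. A minimal sketch – the destination host below is just an illustrative placeholder, not one of our relays:)

    import socket

    TEST_HOST = "mail.example.org"   # placeholder external mail server
    TEST_PORT = 25

    try:
        # If the firewall block is working, this should time out or be refused.
        with socket.create_connection((TEST_HOST, TEST_PORT), timeout=5):
            print("Outbound port 25 is OPEN - this machine could talk SMTP directly")
    except OSError as exc:
        print("Outbound port 25 looks blocked:", exc)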

The source which has been the big problem for us in recent years is compromised user accounts. User accounts are typically taken control of in one of two ways: a brute-force attempt on the password – the spammer literally guesses the password – or the user falling for some kind of phishing scam. We’re pretty confident that a big portion of our compromised accounts fall into the latter category.

How does a compromised account cause us problems though? Once the details of a particular account are discovered, a malicious party has access to everything that user has. This is very concerning for many reasons: they can potentially access the user’s filestore, their e-mails, the University VPN, off-site journal articles, and most University services. With some more work they could potentially gain access to payroll details too. One of the services that they definitely have access to is our SMTP service (provided by Microsoft Exchange), which is what users can use to send e-mail. Herein lies the spam problem.

Once an account is compromised, the spammer connects to our SMTP service, authenticates using the username and password they’ve obtained and starts using (or rather, abusing) the service to send spam or phishing messages – this may be to University addresses (typically to trick people in to giving out more usernames and passwords) or to other organisations.

With Microsoft Exchange as the SMTP service provider, we’re limited in what we can do here. We can limit:

  • The number of connections from an IP address per minute
  • The number of e-mails from an account per minute
  • The number of recipients an e-mail contains

We’d limited all of these things, but none of these options prevents the spam from going out – they merely limit the rate at which it does. We have monitoring in place so we can easily see when an individual is sending a lot of mail – but spammers aren’t stupid. They’ll typically time their attacks overnight and at weekends, when IT staff are unlikely to be watching. We’d set up systems to alert us by text message when we saw a large amount of mail coming from a single address, which resulted in our admins being woken up multiple times per night to deal with spamming accounts. This went on for a while, with us scrambling to reset the passwords on compromised accounts at all hours of the night – naturally not a feasible long-term option. At certain points this year, we were resetting up to forty passwords a week (often more than ten a day!), which was understandably irritating to the users (especially those who had their account compromised more than once…) and was taking up significant amounts of time for our systems teams.
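
To give a flavour of the idea behind that monitoring (the real thing is rather more involved), here’s a minimal sketch: count authenticated submissions per account over a slice of the mail logs and flag anything unusually chatty. The log pattern shown is Postfix-style sasl_username logging and the threshold is made up – both are illustrative assumptions, not what we actually run.

    import collections
    import re
    import sys

    THRESHOLD = 200   # illustrative: flag accounts submitting more than this in the slice
    SASL_RE = re.compile(r"sasl_username=([^\s,]+)")

    counts = collections.Counter()
    for line in sys.stdin:            # e.g. pipe in the last hour of mail log
        match = SASL_RE.search(line)
        if match:
            counts[match.group(1)] += 1

    for user, total in counts.most_common():
        if total > THRESHOLD:
            print("ALERT:", user, "submitted", total, "messages - possible compromised account")

In a real deployment this sort of thing would run on a schedule with a sliding window and feed an SMS gateway rather than printing to a terminal.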

As such, after a few months of work, we’ve partially replaced the SMTP service provided by Microsoft Exchange with something we have much more control over. Having spent much time dealing with the spam issues, we became adept at spotting patterns within them, so we built something to deal with these patterns. The result is a custom-built solution utilising a wide variety of open-source packages running on Linux, with some in-house written filters and rules to block common things we’d seen in a wide variety of attacks. For security reasons, I’m not going to go in to a great deal of detail on exactly how all this works, but as a summary:

  • Connections to our SMTP service no longer necessarily go to Exchange; depending on a set of rules, they may go to our new service instead.
  • The connections to our new service go to one of four Postfix servers which handle the e-mail transaction and mail delivery.
  • These servers are set up to allow us to easily block by username, IP address, sender e-mail address, and a variety of other properties
  • We’ve implemented the use of blacklists that inform us of IPs that are known to have compromised machines on the end of them – we reject all mail from these addresses
  • We’ve put in SpamAssassin. This allows us to detect common spam patterns using a pre-defined set of rules. We’ve also added a lot of custom rules covering things we see trying to go via our mail servers.
  • We detect IPs that are trying to guess passwords. This was something we didn’t expect to trigger quite as often as it does… the result is that we now outright block large ranges of IP addresses from talking to our SMTP servers at all (there’s a small sketch of the idea after this list).
  • We have some custom-written mail filters (or “milters”) which detect certain other spam characteristics and perform logging
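
To illustrate the brute-force detection mentioned in the list above – without giving away the real rules – here’s a minimal sketch of the principle: count failed authentication attempts per client IP in the mail logs, and emit anything over a threshold for blocking. The log pattern is Postfix-style and the threshold is invented; treat it purely as an illustration.

    import collections
    import re
    import sys

    # Matches lines like: warning: unknown[203.0.113.5]: SASL LOGIN authentication failed
    FAILURE_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]: SASL \w+ authentication failed")
    MAX_FAILURES = 20   # illustrative threshold per log slice

    failures = collections.Counter()
    for line in sys.stdin:            # pipe in a slice of the mail log
        match = FAILURE_RE.search(line)
        if match:
            failures[match.group(1)] += 1

    for ip, total in failures.items():
        if total >= MAX_FAILURES:
            # In the real service this would feed the firewall / SMTP access tables.
            print(ip, total)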

All of this was quietly introduced several weeks back. Since then, the number of compromised accounts has dropped massively. In the week before this service went live, we reset 25 passwords (which was about average at the time). In the three weeks since, we’ve averaged about 5 per week – an 80% reduction. The much larger benefit, though, is that of all the compromised accounts we’ve detected, only one has succeeded in getting any spam out. The count of detected compromised accounts is also somewhat artificially inflated, as we’re now spotting compromised accounts that we previously wouldn’t have been able to. It’s also highlighted some recipient addresses which could well belong to the spammers themselves.

It’s not all good news though.

As I said at the start of this post, we deal with one wave of spam and the next wave comes along. We’ve oddly seen an increase in inbound spam since go-live (we have, once this month, blocked over 1.3 million messages in a single day!) – this may well be a coincidence, but there’s no guarantee. It could be that, having been stopped from sending spam directly from compromised accounts to our users, the spammers are falling back to delivering it via the more traditional route. Additionally, having been blocked from their original route of getting spam out from our network, they’ve already started on another method – using Outlook Web App. This route is trickier for them, but still feasible.

And so the war continues.

Jun 27

Exam Results Release

If you’re a student, it almost certainly hasn’t escaped your attention that this semester’s exam results were released recently. This is one of the busiest points in the year for a couple of our services, as almost all University of Southampton students try to access their results at the same time, whether by logging in to e-mail or to the self-service (Student Records) part of Banner. Banner is the system which contains all the information about our students, including what course and modules they are doing and, most notably in this case, their exam results. There is also an impact on our other services – most notably SUSSED, as a large portion of people use that as the route to access the other services.

Focussing on the continuing students’ results release on 26th June, let’s start by taking a look at the timeline of the day. The latest results were rolled into Banner (the “roll to academic history”, as it’s known) during the morning and then verified. Results were due to be available by e-mail at 2pm, and were to be released on Self-Service Banner at 4pm. This information was e-mailed to all students and posted on SUSSED several weeks before. Despite these times being publicised, that didn’t stop people logging in much, much earlier!

The e-mail campaign to send results to students is a big one – this Thursday we sent out 12,238 e-mails, so no small task. After much work in the last six months to optimise this, we’ve got it working quite quickly:

  • The e-mail campaign to send results was started at 13:00
  • These e-mails are held on the server so that we can verify the content definitely uses the right template and contains the correct results (this is the final check in a whole slew of checks throughout the process!)
  • After verifying results, we removed the hold on e-mail at 13:06. At this point, e-mails started to be delivered to students’ mailboxes
  • The e-mail campaign finished just before 13:15 – less than 15 minutes to generate all the e-mails
  • These e-mails then had to pass through Exchange, our Linux edge mail servers and off to Office365. This took until 13:26, at which point all e-mails had been dealt with and we investigated any failures (of which there were very few!)

The graphs below show mails delivered over time. The “Postfix Count” is the number generated and sent out from the server generating the mails. The “Exchange Count” is the number of these messages handled by Exchange (the next mail server the messages go to). The “Mailgate Count” is the number of these messages handled by our Linux edge mail servers for delivery to Office365 / other off-site mailboxes. This number will always be less than the Exchange count, as some continuing students still have on-site mailboxes in Exchange.

Mails Delivered

Mails Delivered against Time

Mails Delivered per Minute
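
The per-minute numbers behind these graphs come straight out of the mail logs. As a rough sketch of the counting step (assuming Postfix-style log lines, where a successful delivery is recorded with status=sent and the syslog timestamp starts the line):

    import collections
    import sys

    per_minute = collections.Counter()
    for line in sys.stdin:               # e.g. the campaign's lines grepped out of the mail log
        if "status=sent" in line:
            per_minute[line[:12]] += 1   # syslog timestamp up to the minute, e.g. "Jun 26 13:07"

    for minute in sorted(per_minute):
        print(minute, per_minute[minute])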

So, the entire process took less than half an hour. This is considerably better than the previous results release, where mails trickled out very slowly and took the best part of ten hours, and still much better than last year’s continuing students’ e-mail campaign, which took a little over two hours.

But what about Self-Service Banner? The statistics here are slightly skewed by the fact that we had presessional students starting on the same day as exam results release, so usage here is slightly elevated anyway. Over the 24 hour period from midnight on 26th June, there were 16,610 (!) logins to student records from 6,394 people – a not insignificant number. More than half of these occurred before midday – four hours prior to us releasing marks on to Self-Service Banner. Again, this may be skewed by presessional logins. At peak times (which were at 13:10 and 16:00), we were seeing 43 logins per minute, roughly one every 1.4 seconds. It’s also quite interesting from the access logs to see how desperate some people were to see their results – two students logged in 58 times each during the course of the day!

The graph below shows hits to certain pages in 10 minute intervals during the day:

Hits to Self-Service Banner

SUSSED was naturally also busy all day. It was interesting to see how quickly news spread about results being out (as 2pm had been publicised to students and they were all available by 13:25). By 13:30 we were seeing a very noticeable increase in sessions. We peaked at about 2300 sessions during the busiest period between 13:30 and 14:30. That said, there were noticeably more sessions all day (around 1200 average, compared to about 850 the day before), but again presessional students skew the data somewhat here.

It was also interesting to see the effect on Blackboard. Yesterday saw 6,134 logins to Blackboard compared to 2,298 the day before. Yet again, these stats are skewed by presessional students starting, but we did see a similar (though not as significant) increase the week before for the finalists marks release.

Quite why we have students logging in to Blackboard (or indeed logging in to Self-Service Banner hours before release) to get results is anyone’s guess. We saw a tweet from someone saying they couldn’t find their exam results on SUSSED hours before we’d even started the e-mail campaign. It may well be that we need to do more to communicate to students how and when they will get their results.

Feb 17

Blackboard Performance Update

It’s been two months since we moved Blackboard to the new data centre and performed the various hardware upgrades (see the post we made about that here), so it’s about time that we take a look at how the service has performed since it went live, and how it coped with the demand of the exams and the start of term. We’ll also take a look at usage patterns over the period.

Our own internal monitoring of the new platform shows that the average response time (the amount of time it takes to connect, request and receive the HTML content response) of the front page of Blackboard across all front-end web servers has been between 66.5 and 79.9 milliseconds since we put it live. As a comparison, the average response time of the old platform since monitoring began was between 478.8 and 687.6 milliseconds. This means we have cut response times by somewhere in the region of 83% to 90%! To phrase that slightly differently: response times are around TEN TIMES better.
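
For those wondering how we measure this: the monitoring repeatedly times a request to the Blackboard front page from each front-end. A stripped-down sketch of a single probe is below – the URL is a placeholder, and the real checks obviously record the result for graphing rather than just printing it.

    import time
    import requests   # pip install requests

    URL = "https://blackboard.example.ac.uk/"   # placeholder, not the real address

    start = time.perf_counter()
    response = requests.get(URL, timeout=10)    # connect, request and receive the HTML
    elapsed_ms = (time.perf_counter() - start) * 1000

    print(response.status_code, round(elapsed_ms, 1), "ms")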

As good as that is, it’s also good to note that we have seen more consistency in performance. Prior to the migration and upgrade, we saw Blackboard perform very differently throughout the course of the day. On the old platform, backups of the database and bulk operations to update users on courses used to cause Blackboard to grind to a halt and become very unresponsive – which we could see on the response-time graphs. Response times could end up at 20+ seconds for several minutes at a time. On the new platform, these operations don’t register any impact on our response-time graphs. The worst response time we’ve ever had on the new platform is a fraction over 10 seconds – and that has happened only once out of nearly half a million tests. In fact, out of all of those tests, only 82 have taken more than two seconds and only 492 have taken over a second – around 0.1%. The service currently has 100% uptime.

Google Analytics data shows us additional information regarding page load times. The improvement in responsiveness has helped cut page load times (so this includes the response time, as well as the amount of time to download all supporting content, e.g. scripts, stylesheets and images) by as much as 60%.

It was nice to see that these improvements were noticeable to our users too – several commented on Twitter that they’d noticed Blackboard was more responsive. The improvements were noticeable at our Malaysia campus (USMC) too. Despite the latency between Southampton and Malaysia, one member of staff told us:

…I am currently teaching at USMC and can confirm a distinct qualitative improvement in Blackboard response times. Although I can’t quantify the improvement, suffice to say it is responding faster here at USMC than it was when I last used it in Southampton!

So what about usage? The reason we waited until we were in semester two before writing this blog post was so that we could see the impact of exams on performance, and also because Blackboard would naturally be very quiet around Christmas (which is why we’d moved it at that point in the first place). In the last two weeks of 2013, we were seeing around 4000–6000 logins per day. Over the days around Christmas, this obviously dropped significantly, but it was interesting to note that there were still 1740 logins to Blackboard on Christmas Day! New Year’s Eve and New Year’s Day were busier than Christmas, with both days having just over 4600 logins.

Logins Per Day

Number of logins each day since go-live on December 15th

The busiest day for Blackboard since the upgrade was Monday 6th January, when over 20,000 people logged in – unsurprising, it being the first day of term. That first week was busy, with every day of the working week clocking up more than 18,000 logins. In terms of logins, Mondays average out as the busiest day of the week, with Saturdays being the quietest. The granularity of our login data goes down to the hour, so for the average day there are fewest logins between 4am and 5am, and most between 11am and 12pm, followed by a noticeable dip for lunchtime in the following hour. All this can be seen on the graphs below.

Average Hourly Logins against Hour of Day

Average Daily Logins against Day of Week

Google Analytics data similarly matches the above, additionally telling us that there have been just short of 8 million page views since go-live.

To summarise: all is going very well in the world of Blackboard performance and stability. The entire Blackboard support and infrastructure teams are meeting up several times in the next few weeks – once to discuss infrastructure, stability and performance, and to look at what we want to do to progress and any improvements we want to make. More importantly, we’ll be meeting up to discuss the next software upgrade to Blackboard, which will be happening in June this year – more on that later in the year!

Dec 15

More Blackboard Improvement!

As you’re probably aware, Blackboard was scheduled to be unavailable for four days recently for us to do some maintenance. There’s never a good time to take down Blackboard, but this date was chosen to minimise the impact to our users – most students will be off home over this weekend and looking forward to Christmas! But why was this maintenance period necessary? Earlier in the year, we wrote about how we’d made some changes to improve the performance of Blackboard and about how we fixed some issues we were seeing with stability. During this maintenance period, we did more. A lot more.

One of the main aims of this piece of work was to move the service from its current location to our brand new data centre. On the face of it, four days seemed like a lot of down time to move some servers from one place to another, but we weren’t simply lifting-and-shifting the kit. Also, moving the service to the data centre was only the tip of the iceberg – the service was completely rebuilt – new hardware, new OS and greatly upgraded specs.

In our last post on Blackboard, we discussed how we had a couple of Sun SPARC T5240s running the front-end webservers (load-balanced by some F5 NLBs), plus some LDOMs (Solaris virtual machines) on an Oracle SPARC T4-series box running the database, file server and collaboration server. The improved Blackboard service does not include any of this hardware – it is all virtualised on VMware. Our VMware hosts are 32-core systems with 128GB of RAM – and we have over 50 of them! As part of this move to VMware virtual machines, we also needed to change the OS. We could have opted for Solaris on x86… but the better solution was to move to Linux – Red Hat Enterprise Linux 6, to be precise.

The move to these VMs on the awesome host hardware is already a big boost to performance, but we’ve not stopped there. Previously we had two front-end webservers; we’re doubling this to four, along with the separate collaboration server. Each of the front-ends is a 4-core, 16GB system, individually much more powerful than the previous front-ends. These front-ends are then load-balanced by a pair of brand new F5 network load balancers. Behind all of this is a much-improved Oracle database server with 48GB of RAM.

Naturally, we did some testing in a pre-production environment to see exactly how much better this new platform is, and the outcome was quite impressive. To give you some kind of idea:

  • Installing a particular patchset (a selection of upgrades or bugfixes) for this version of Blackboard took around 25 minutes on the old platform. This new platform ran the same task in 7.5 seconds.
  • Restarting Blackboard services on the webservers on the old platform took around 12-15 minutes – we’re down to about a minute now.
  • An issue with Blackboard means that the first user to click the “Personalise Page” link on the existing system causes some high load on the front-ends. On the old environment, this could last for over 10 minutes, whilst on the new environment it recovers in about 20 seconds.
  • With requests for static content (e.g. images), we can now manage over 13,000 requests a second, a more than six times increase

The speed of the new production environment (and the amount of preparation done by the upgrade team beforehand!) meant we were actually able to complete all this work way ahead of schedule. Instead of Blackboard being down for the scheduled four days, it was only actually down for about 36 hours!

The proof of the new environment will come in the new year when everyone piles on to Blackboard for the semester one exams and then leading on in to semester two. We’ll be looking very closely at how the new environment handles it, and we’ll have more blog posts about it in the new year, as will our MLE team (whose excellent blog you can read at blog.soton.ac.uk/mle, as well as following them on Twitter, @uos_mle).

Aug 21

Monitoring

With over 1000 servers and almost as many switches and other pieces of routing equipment across the university, one of the most important things is to make sure that everything is working as expected. There are many approaches we could take towards monitoring the services we run, from mostly manual to entirely automated. To save on man-hours, we want to automate as much of the monitoring as possible – this way, if everything is running as expected, we shouldn’t need to do anything, yet when something goes wrong, we get alerted as soon as possible.

Over the past two years, a large amount of effort has been put into improving the primary monitoring system we have, to keep the systems we support running smoothly. The old system was comparatively simple, performing basic checks on each of our servers such as ping, CPU usage and memory usage. This worked well for monitoring the health of the servers themselves, but didn’t provide any insight as to whether the service itself was working correctly (i.e. whether a website or application was actually running).

We have now replaced the old system with Icinga (an up-to-date fork of the popular Nagios), which can perform the same checks as the old system as well as many more, and this has already vastly improved our monitoring of services (as opposed to servers). Icinga also has a number of other advantages:

  1. Customisation: we are able to write our own checks – if there is a new piece of hardware or software which needs monitoring, we don’t have to wait for proprietary checks to be developed; as long as there is an existing API, we can write something from scratch. We have already made great use of this, and have developed a custom XML schema which our web developers can use to check connections to all parts of a website (there’s a small sketch of a custom check after this list).
  2. Distributed monitoring: as we add more checks into our monitoring system, the load caused by these checks (both in terms of CPU time and disk I/O to the database) increases – while we could add more resources to the server to account for this, that is only feasible up to a point. A much better solution, which Icinga supports, is distributing the checks across multiple servers, which all feed their data back to one central server. While this adds a significant amount of administration overhead, it means we can add many more checks without fear of overloading one particular server.
  3. Alerts: While the primary view we use is the web interface overview of all alerts, we are also able to get alerts by e-mail and SMS. Once again, the customisation aspect is incredibly useful, as we can configure certain services to send emails to a set group of people, depending on what the error is.
  4. Acknowledgements: a lot of the time, the alerts we get from the monitoring system are due to planned work, either upgrading or fixing systems – when this happens, we want to acknowledge all errors with those systems until the work is complete, and Icinga allows us to do this with ease.
  5. Customised views: the web interface to Icinga allows us to group servers, and the checks we run on them, by the service they support. This allows us to immediately tell what service might be affected when a problem is noted, and also allows the various teams within iSolutions to keep closer track of the health of the services they support.
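
To give an idea of what a custom check looks like, here’s a minimal sketch of a plugin in the usual Nagios/Icinga style: it fetches a (hypothetical) XML status page, inspects it, and reports its verdict through the exit codes Icinga understands (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The URL and XML layout are invented for illustration – our real checks follow the schema mentioned in point 1.

    #!/usr/bin/env python
    import sys
    import xml.etree.ElementTree as ET
    import requests   # pip install requests

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
    STATUS_URL = "https://service.example.ac.uk/status.xml"   # hypothetical status page

    try:
        response = requests.get(STATUS_URL, timeout=10)
        response.raise_for_status()
        root = ET.fromstring(response.text)
    except Exception as exc:
        print("UNKNOWN - could not fetch or parse status page:", exc)
        sys.exit(UNKNOWN)

    # Hypothetical schema: <status><component name="database">ok</component>...</status>
    broken = [c.get("name", "unnamed") for c in root.findall("component") if c.text != "ok"]

    if broken:
        print("CRITICAL - components failing:", ", ".join(broken))
        sys.exit(CRITICAL)

    print("OK - all components reporting ok")
    sys.exit(OK)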

There are a number of different ways we could get monitoring information off servers – whether it’s SSHing in and running a command, installing an agent on every machine that reports back its status, or using a different interface which exposes this information, such as SNMP.

We have opted for SNMP, primarily due to the fact that it comes installed with most operating systems, can handle a wider range of devices (such as servers, switches, wireless access points, etc.), and is easily configurable without installing anything else. It also doesn’t open up the security hole which would be presented by allowing the monitoring server to SSH to every server.

On the downside, this puts a lot of load on the monitoring server, rather than allowing each server to decide when to monitor itself and then report back.
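
As a concrete example of the SNMP approach, the sketch below polls a single value from a server and compares it against thresholds, shelling out to the standard net-snmp snmpget tool. The host, community string and thresholds are placeholders; the OID is the 1-minute load average from the UCD-SNMP MIB, which assumes the target runs the usual net-snmp agent.

    #!/usr/bin/env python
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    HOST = "someserver.example.ac.uk"        # placeholder
    COMMUNITY = "public"                     # placeholder read-only community string
    LOAD_OID = ".1.3.6.1.4.1.2021.10.1.3.1"  # 1-minute load average (UCD-SNMP-MIB laLoad.1)
    WARN, CRIT = 4.0, 8.0                    # illustrative thresholds

    try:
        output = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, LOAD_OID],
            universal_newlines=True, timeout=10)
        load = float(output.strip().strip('"'))
    except Exception as exc:
        print("UNKNOWN - SNMP query failed:", exc)
        sys.exit(UNKNOWN)

    if load >= CRIT:
        print("CRITICAL - load average", load)
        sys.exit(CRITICAL)
    if load >= WARN:
        print("WARNING - load average", load)
        sys.exit(WARNING)
    print("OK - load average", load)
    sys.exit(OK)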

As mentioned above, Icinga allows for distributed monitoring, which is something we make full use of. As well as the “master” server, which collates all the data and serves up the web interface, we have about ten more “slave” servers, which have individual configurations applied to them and each monitor a small subset of servers. They then feed all of this information back to the master server, which, all in all, significantly reduces the amount of work the master has to do.

This method has a number of problems associated with it – we also have to monitor the Icinga slave servers to ensure they don’t go down, and we need to trust that the slaves will always be able to relay information back to the master server. We also have the issue of organising the configuration across all these slave nodes, to ensure that the monitoring load is distributed evenly and that no server is monitored more than once.

This is where a tool called nconf has helped us enormously – it is a web interface tool for managing Icinga configuration across multiple servers. It allows us to easily add checks to a server, then assign a certain slave node to monitor that server. It will then generate a new set of configuration files to push to the slave server, copy them across and restart services seamlessly. This saves us the hassle of writing and deploying configuration files by hand, and allows us to make bulk changes very easily.

Using the methods above (primarily SNMP), we are able to monitor specific, expected variables, for example “Is the disk on this server full?”, as well as other checks such as “Does this website respond to HTTP requests”. This doesn’t, however, alert us to any unexpected errors, such as a server losing one of its redundant power supplies (the server would still be responding fine, but it’s not a situation we want to stay in for long). While it would be possible to add in a whole set of new checks to monitor “how many power supplies are active”, what a lot of hardware offers is SNMP traps.

These traps are triggered whenever certain events take place (a power supply failure being one example), or when certain thresholds are reached (e.g. the temperature of the server is above the suggested maximum). They can be configured to be sent to a certain location, as long as it is set up to receive traps. Icinga can be set up to handle SNMP traps from servers, which we are currently in the process of doing, though the overhead of doing this is significant.

When a trap is received, very little information is sent along with it apart from the fact that a specific event happened – normally you have to look up what the error is before assessing whether it’s important at all, as it could well be a simple informational trap. As such, we have to set up a database of trap IDs and their severities. For example, when we receive a trap saying “a power supply was lost”, we want to be alerted immediately, whereas a trap saying “everything is OK” can be ignored. The initial set-up of this mapping is a tedious process, as many manufacturers use their own set of trap IDs to describe events specific to their hardware or software. However, the long-term gain is that we get alerted to errors which we otherwise may not even know to look for, well ahead of when they might start causing us problems.
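
The mapping itself is conceptually very simple – the tedium is in populating it. A toy sketch of the idea, with invented placeholder OIDs rather than real trap identifiers:

    # Toy mapping from trap OID to how loudly we want to be told about it.
    # The OIDs below are placeholders - the real table is per-vendor and much longer.
    TRAP_SEVERITY = {
        ".1.3.6.1.4.1.99999.1.1": "critical",   # e.g. "power supply lost"
        ".1.3.6.1.4.1.99999.1.2": "warning",    # e.g. "temperature above threshold"
        ".1.3.6.1.4.1.99999.1.3": "ignore",     # e.g. "everything is OK again"
    }

    def handle_trap(trap_oid, source_host):
        """Decide what to do with an incoming trap based on the severity table."""
        severity = TRAP_SEVERITY.get(trap_oid, "unknown")
        if severity == "ignore":
            return
        # Unknown traps are worth flagging too - they may be something we haven't mapped yet.
        print(severity.upper() + ": trap " + trap_oid + " from " + source_host)

    handle_trap(".1.3.6.1.4.1.99999.1.1", "webserver01")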

As well as Icinga, we have other monitoring tools which allow us to see how services change on a day-by-day basis. Our filestore and e-mail platforms are among the many systems covered by our own in-house monitoring, which graphs various aspects of how they are behaving. This allows us to see usage trends and spot anything out of the ordinary that simple threshold monitoring may not be able to pick up. There’ll be more on this in a future post!

May 23

Improving Blackboard

A few people may have noticed a short period last week when Blackboard was unavailable/unstable. We noticed a problem with the database server and have spent the past week working out what the problem is and how to fix it!

We’ll start with the structure of Blackboard – there are two “front-end” servers, which are what provide all of the content. These are both physical Sun T5240 servers running Solaris, and incoming sessions are balanced between them. There are three other servers, which are all virtual Solaris servers (LDOMs for the technically inclined!). One of these LDOMs runs a Blackboard installation (which we call the “collaboration” server), but this is also set up to run all of the scheduled tasks and a few other special Blackboard features which are rarely used. This server takes some of the load off the other two front-ends. Behind these three servers, we have one file server, which hosts all the course content, so both of the front-ends can access the files without holding a duplicate copy. Finally, we have a database server which runs Oracle and contains a very large amount of information about everything Blackboard does!

It is this database server that we believed to be causing the issues (as we were seeing problems with all aspects of both Blackboard front-ends), so we started doing various bits of investigation as to why we were seeing problems. One thing that was highlighted very quickly was that the database server was using 100% of its I/O subsystem (i.e. it was spending lots of time reading and writing to disk). This statistic is not something we monitor all the time, so we were unsure if this was a new problem, but we were fairly confident that it was at least one of the underlying causes of the stability and performance issues we were seeing.

We raised a number of support tickets with Blackboard and Oracle to see if there were any quick fixes we could do to reduce the amount of disk I/O that the database did. One of the solutions we came up with while waiting for a response was to add more memory to the server and dedicate it to the disk cache (the so-called ZFS ARC cache). As this is a virtual server, we could do this live without any downtime, and as soon as we did, we saw the I/O percentage on the server drop from (almost) 100% to an average of around 30% during the day (overnight is always much quieter, naturally, except for when we’re backing up), and page load times on Blackboard went from (on average) 0.8s to 0.2s. While this wasn’t in itself a major issue, it confirmed that the database I/O was also hindering the day-to-day usage of Blackboard, albeit only slightly.

Blackboard support have since come back to us with a number of other suggestions (and we’ve had other ideas of our own), such as moving some of the heavily used database logs to a separate disk (to stop them hindering other disk operations), and applying a patch to Blackboard which addresses an issue with recent versions of Safari. Blackboard patches are normally fairly low-impact, and there are hundreds which have been released to fix a large variety of issues; Blackboard generally suggest only applying patches when they recommend them, which is why we hadn’t applied this one earlier.

Recent versions of Safari – when you open a new tab – display a set of previews of your most commonly visited pages, and for most students, Blackboard is up there in the list of top sites. Safari, being well written and conscientious of bandwidth, as with all Apple products (blatant sarcasm – spot who dislikes Apple), automatically opens a new connection to each of those pages to generate the preview. Many other institutions running Blackboard noticed that these requests were causing unnecessary load (as each one generated a bunch of database queries). The patch we applied to Blackboard catches these requests and returns an error page, so while the thumbnail is now blank, the load on the front-ends and database should be reduced. Doing a quick count on the access logs for the past four days, we’re now seeing a slight decrease in the number of requests coming in and a slight increase in the number of errors being returned, which is what we’d expect. As yet, we haven’t seen any evidence that this has made a tangible difference to Blackboard, but it’s at least another item crossed off the list.
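
(That “quick count”, incidentally, is nothing more sophisticated than tallying status codes in the web server access logs – something like the sketch below, which assumes common/combined-format logs where the status code is the first field after the quoted request string.)

    import collections
    import sys

    statuses = collections.Counter()
    for line in sys.stdin:                 # pipe in the access log
        parts = line.split('"')
        if len(parts) < 3:
            continue
        fields = parts[2].split()
        if fields:
            statuses[fields[0]] += 1       # the HTTP status code

    for status, total in statuses.most_common():
        print(status, total)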

There’s still quite a bit of work left to be done until we’re happy we’ve done everything we can, though we are being very careful about how much we do in the next three weeks while exams are still on. We’re very reluctant to take Blackboard down at all, as we fully understand (we were students once too!) how much everyone across the University relies on it, but we also have to weigh up the chance that Blackboard may crash again during the exam period if we decide not to take any action. For now, we will be trying out a set of changes in our test environment and, once exams are over, applying these to our live environment.

Good luck to everyone in their exams!

Mar 13

Power outages

It appears this morning that there have been a series of power cuts affecting large parts of Southampton, including various University campuses. Estates have had an army of electricians restoring power to buildings, and iSolutions have been bringing up our switches and networking equipment as soon as we can. Phones and networking equipment have been affected across campus; luckily, our data centre has UPSs, which have kicked in and made sure our server infrastructure stays up!

More general information on the power cuts can be found on Scottish and Southern Energy’s website: http://www.ssepd.co.uk/CustomerService/PowerCuts/PowerTrack/

Update: Scottish and Southern Energy have confirmed this is due to a damaged cable on Burgess Road, and hope to restore stable service by midday
