May 23

Improving Blackboard

A few people may have noticed a short period last week when Blackboard was unavailable or unstable. We noticed a problem with the database server and have spent the past week working out what the cause is and how to fix it!

We’ll start with the structure of Blackboard – there are two “front-end” servers, which are what provide all of the content. These are both physical Sun T5240 servers running Solaris, and incoming sessions are balanced between them. There are three other servers, which are all virtual Solaris servers (LDOMs for the technically-inclined!). One of these LDOMs runs a Blackboard installation (which we call the “collaboration” server), which is also set up to run all of the scheduled tasks and a few other special Blackboard features which are rarely used. This server takes some of the load off the other two front-ends. Behind these three servers, we have one file server, which hosts all the course content, so that every front-end can access the files without holding a duplicate copy. Finally, we have a database server which runs Oracle and contains a very large amount of information about everything Blackboard does!

It is this database server that we believed to be causing the issues (as we were seeing problems with all aspects of both Blackboard front-ends), so we started investigating why. One thing that was highlighted very quickly was that the database server’s I/O subsystem was 100% busy (i.e. it was spending lots of time reading from and writing to disk). This statistic is not something we monitor all the time, so we were unsure if this was a new problem, but we were fairly confident that it was at least one of the underlying causes of the stability and performance issues we were seeing.
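For the technically-inclined, here is a minimal sketch of how a "percent busy" figure like this could be pulled out of disk statistics. It assumes output in the style of Solaris `iostat -xn`, where the `%b` column is the percentage of time a device is busy. The function names, the 90% threshold, the device names and the sample numbers are all made up for illustration; this is not the tooling we actually used.

```python
def busy_by_device(iostat_text):
    """Return {device: highest %b seen} from `iostat -xn`-style text.

    Columns are located by header name rather than position, so the
    parser only relies on the header row containing "%b" and "device".
    """
    busy = {}
    columns = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "%b" in fields and "device" in fields:
            columns = fields  # header row: remember the column names
            continue
        if columns is None or len(fields) != len(columns):
            continue  # banner line or anything else that is not a data row
        row = dict(zip(columns, fields))
        try:
            pct = float(row["%b"])
        except ValueError:
            continue
        busy[row["device"]] = max(pct, busy.get(row["device"], 0.0))
    return busy


def saturated(busy, threshold=90.0):
    """List devices at or above the (assumed) busy threshold."""
    return sorted(dev for dev, pct in busy.items() if pct >= threshold)


if __name__ == "__main__":
    # Invented sample data, laid out like `iostat -xn` output.
    sample = """\
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  120.0  340.0  960.0 2720.0  0.0  9.8    0.0   21.3   0  99 c0d0
    1.0    2.0    8.0   16.0  0.0  0.0    0.0    1.2   0   3 c0d1
"""
    print(saturated(busy_by_device(sample)))
```

Keeping the highest value seen per device means a short burst of saturation still shows up if several samples are fed in together; averaging instead would be the better choice for a day-long trend.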

We raised a number of support tickets with Blackboard and Oracle to see if there were any quick fixes we could apply to reduce the amount of disk I/O that the database did. One of the solutions we came up with while waiting for a response was to add more memory to the server and give it to the disk cache (the so-called ZFS ARC). As this is a virtual server, we could do this live without any downtime. As soon as we did, we saw the I/O percentage on the server drop from (almost) 100% to an average of around 30% during the day (overnight is always much quieter, naturally, except for when we’re backing up), and page load times on Blackboard went from (on average) 0.8s to 0.2s. While the slower page loads weren’t in themselves a major issue, the improvement confirmed that the database I/O was also hindering the day-to-day usage of Blackboard, albeit very slightly.

Blackboard support have since come back to us with a number of other suggestions (and we’ve had other ideas), such as moving some of the heavily used database logs to a separate disk (to stop them hindering other disk operations), and applying a patch to Blackboard which addresses an issue with recent versions of Safari. Blackboard patches are normally fairly low-impact, and hundreds have been released to fix a large variety of issues; Blackboard generally suggest applying patches only when they recommend them, which is why we hadn’t applied this one earlier.

Recent versions of Safari – when you open a new tab – display a set of previews of common pages, and for most students, Blackboard is up there in the list of top sites. Safari, being well written and conscientious about bandwidth, as with all Apple products (blatant sarcasm – spot who dislikes Apple), automatically opens a new connection to each of those pages to generate the preview. Many other institutions running Blackboard noticed that these requests were causing unnecessary load (as each one generated a bunch of database queries). The patch we applied to Blackboard catches these requests and returns an error page, so while the thumbnail is now blank, the load on the front-ends and database should have been reduced. Doing a quick count on the access logs for the past four days, we’re now seeing a slight decrease in the number of requests coming in, and a slight increase in the number of errors being returned, which is what we’d expect. As yet, we haven’t seen any evidence that this has made a tangible difference to Blackboard, but it’s at least another item crossed off the list.
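A quick count like the one described above could be sketched as follows. This assumes the access logs are in the widely used Common Log Format, and treats any HTTP status of 400 or above as an "error"; both are assumptions for illustration, as are the function name and the sample lines, and none of this is the exact script we ran.

```python
import re
from collections import Counter

# Captures the day and the HTTP status from a Common Log Format line,
# e.g. ... [01/Jan/2000:09:00:00 +0000] "GET /a HTTP/1.1" 200 512
LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}):[^\]]*\] "[^"]*" (\d{3})(?:\s|$)')


def count_by_day(lines):
    """Return (requests, errors): two Counters keyed by day string."""
    requests = Counter()
    errors = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not look like log entries
        day, status = match.group(1), int(match.group(2))
        requests[day] += 1
        if status >= 400:  # assumed definition of an error response
            errors[day] += 1
    return requests, errors


if __name__ == "__main__":
    # Invented sample lines; real logs would be read from a file instead.
    sample = [
        '10.0.0.1 - - [01/Jan/2000:09:00:00 +0000] "GET /a HTTP/1.1" 200 512',
        '10.0.0.2 - - [01/Jan/2000:09:00:01 +0000] "GET /b HTTP/1.1" 403 128',
        '10.0.0.3 - - [02/Jan/2000:10:00:00 +0000] "GET /a HTTP/1.1" 500 64',
    ]
    requests, errors = count_by_day(sample)
    for day in requests:
        print(day, requests[day], errors[day])
```

Comparing the per-day totals before and after the patch date would show the pattern we describe: fewer requests overall, with a larger share of them returning errors.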

There’s still quite a bit of work left to be done until we’re happy we’ve done everything we can, though we are being very careful with how much we do in the next three weeks while exams are still on. We’re very reluctant to take Blackboard down at all, as we fully understand (we were students once too!) how much everyone across the University relies on it, but we also have to weigh up the chance that Blackboard may crash again during the exam period when deciding not to take any action. For now, we will be trying out a set of changes in our test environment, and once exams are over, applying these to our live environment.

Good luck to everyone in their exams!

