Motivation
As we use our linked data and triplestores to drive more of our sites and services, it’s becoming apparent that many queries to the store are repeated with exactly the same parameters (especially for “list” type pages, which serve as jumping-off points to individual resources).
While 4store does some partial query caching, it makes sense to avoid hitting the store entirely for frequent queries against slow-changing data. Using a separate reverse proxy for this means that each application or site can be pointed at either the cached or the live store on an individual basis.
SPARQL queries are just HTTP GET requests, so using a tried and tested web cache looked promising.
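For example, a typical SPARQL SELECT over the protocol is just a GET with the query passed as a URL-encoded parameter, something like the following (the endpoint URL and query here are only illustrative):
curl -G 'http://sparql.example.org:8000/sparql/' \
     --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10'
Because it’s a plain GET, the full URL (endpoint plus encoded query string) can serve as the cache key.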
Varnish, Nginx, Squid and Apache’s own mod_cache all looked like viable options, but Nginx won out in the end, purely for its simplicity of setup and configuration (thanks also to Dan Smith for some advice).
Setting Up Nginx
The examples below assume that you’re running as root (or prefixing the commands with sudo). Exact locations may vary by Linux distribution.
1. Install Nginx
On Ubuntu, this was a simple case of running:
apt-get install nginx
2. Stop the nginx service (if it’s running)
service nginx stop
3. Disable the default site
We only want Nginx to act as a reverse proxy, so remove the symlink to the default site:
cd /etc/nginx/sites-enabled
rm default
4. Create a cache directory for the store
mkdir -p /var/cache/nginx/ts_cache
5. Make sure that the cache dir is writable by the proxy server
On Ubuntu, the default user for nginx is ‘www-data’. So run:
chown www-data:www-data /var/cache/nginx/ts_cache
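If you’re not sure which user your Nginx workers run as, the user directive near the top of the main config should tell you (the path shown is the Ubuntu default):
grep '^user' /etc/nginx/nginx.conf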
6. Create a config. file for nginx
cd /etc/nginx/sites-available
Create and edit a file named something like
001-ts_cache
The following is a minimal config. you’ll need to get up and running, though it’ll need tweaking according to your needs. Full details can be found in the proxy module section of the Nginx wiki.
proxy_cache_path /var/cache/nginx/ts_cache
                 levels=1:2
                 keys_zone=ts_cache:8m
                 max_size=1000m
                 inactive=10m;

server {
    listen 8001 default;
    server_name localhost;

    access_log /var/log/nginx/localhost.access.log;

    location / {
        proxy_pass http://sparql.example.org:8000/;
        proxy_cache ts_cache;
        proxy_cache_valid 200 302 10m;
        proxy_cache_valid 404 1m;
        proxy_cache_methods GET HEAD POST;
    }
}
To briefly explain the above:
This sets up a reverse proxy listening on localhost, port 8001. This acts as a cache for the real SPARQL endpoint, which is at http://sparql.example.org:8000/.
It caches any HTTP GET/HEAD/POST requests that result in an HTTP 200 or 302 status for 10 minutes, and caches 404s for 1 minute.
It will cache up to 1000MB of files, and delete old entries if it exceeds this (max_size=1000m). It will also delete cached files that haven’t been accessed for 10 minutes (inactive=10m).
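Optionally, if your Nginx version exposes the $upstream_cache_status variable, you can surface cache hits and misses as a response header by adding a line like the one below inside the location block (the header name is just a convention, not something Nginx requires):
add_header X-Cache-Status $upstream_cache_status;
A “HIT” value in that header then confirms a response came from the cache rather than the store.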
7. Enable the site config. created above
Create a symlink to the site config. in the sites-enabled directory:
cd ../sites-enabled
ln -s ../sites-available/001-ts_cache
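Before bringing the service back up, you can ask Nginx to sanity-check the configuration:
nginx -t
If there’s a problem with the config., this will report it before you attempt to start the service.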
8. Restart Nginx
service nginx start
If all is well, you should be able to start making requests to your proxy server (on http://localhost:8001/ in the example above). You can also keep an eye on things via the logs in /var/log/nginx/, and check that items are being added to the cache correctly under /var/cache/nginx/.
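As a quick sanity check, you can issue the same request twice against the proxy and time it; the second response should be noticeably faster, coming straight from the cache. The endpoint path and query below are only illustrative and will depend on your backend:
curl -sG 'http://localhost:8001/sparql/' \
     --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10' \
     -o /dev/null -w '%{time_total}\n'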
Note: At least with Joseki as a backend, you should also add
proxy_ignore_headers Cache-Control
to the config, as Joseki (at least when running within Jetty) sends “Cache-Control: no-cache” by default, which stops Nginx from caching the responses and makes the whole cache superfluous.
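With that added, the location block from the config. above would end up looking something like this:
location / {
    proxy_pass http://sparql.example.org:8000/;
    proxy_cache ts_cache;
    proxy_cache_valid 200 302 10m;
    proxy_cache_valid 404 1m;
    proxy_cache_methods GET HEAD POST;
    # don't let the backend's Cache-Control: no-cache disable caching
    proxy_ignore_headers Cache-Control;
}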