

GDPR preparations

I am not a lawyer. This blog post represents my own understanding and not an official view of my employer.

The key thing with GDPR is that we’re not going to be perfect, but we can certainly improve. I’ve been working on my team’s GDPR activities since January, and as our team deals with many small, innovative and legacy databases this was… interesting. I’ve audited 88 “information assets”, most of which contain information about people in some way or another.

The single most useful thing we’ve done so far is turn off (and erase) a few datasets and services that were not actually required. I’ve also identified about 4 datasets that I was pulling from source and pushing into systems that no longer exist or now get their data elsewhere.

I’ve been trying to boil things down to a few areas for people to be thinking about. This isn’t a complete list; it’s more about how to spend limited resources in the most appropriate way.

Why bother?

The obvious answer is “to avoid fines”, but the better answer is “to do right by our users”. We should minimise data breaches, but we shouldn’t concentrate only on avoiding them; we should also minimise the impact if they do happen. Not holding data inappropriately, and not keeping it in inappropriate locations or giving inappropriate access to it, can go a long way to reducing the actual harm caused. Keep this in mind and you’ll be on the right track.

Audit all the things!

We need to be aware of what systems we are responsible for that have personal data, what data it contains and why it’s used.

This chore is largely done for our team.

Start with high risks

A common problem I have with addressing GDPR is “what about… <obscure system that has a users table>”

Many of our systems have records of usernames or other IDs that caused actions, eg. approved something, updated a wiki page, updated a webpage. While these are technically in scope of GDPR, they are at the bottom of the TODO list.

For example: we’ve a few dozen sites using the Drupal CMS. Each site has under ten users. The user records may contain a person’s name, and they do contain their email address. They contain a list of what that person has edited on the site. There is also the implicit metadata that this list of users are the editors of the website. However, the risk of breach is low to medium. Drupal is a target for hackers, but generally not to steal this kind of data [citation required] (this is an admitted assumption on my part). The damage done by such a breach is also relatively low compared to leaking a larger list or more detailed information.

I find making these calculations difficult, because it feels like I’m saying such a breach doesn’t matter. It does, but other risks matter more and should receive more investment of resources, both to prevent them happening and to mitigate the consequences if they do happen. Which brings me nicely to:

Shut off anything unnecessary

This is work we should be doing right now!

Any system which is no longer needed should be shut off.

Systems that have been shut off for over a year should probably have their data erased entirely. It’s tempting to keep things for ever “just in case”. We have to stop doing that. If in doubt, agree it with management and/or the person doing the role that owns that data.

Some systems have unused tables or fields with data about people. These should just be removed.

Some systems have data feeds to/from other systems which provide more information about people than is required. Any unrequired fields should be identified and removed from the feed, rather than just ignored by the target system (which is what we sometimes lazily do now).

Remove unused cached copies of personal data.

A more subtle issue is who has access to data about other people. It’s easiest to remove data entirely, but if that’s not possible then consider how to restrict access to only people who need it. Does everyone need to see the fields with more sensitive information?

It’s worth telling our GDPR team when work like this is done, so they can note that the work was done, or note that the system is now off/erased.

Reminder; sensitive data

Breaches of data which can cause additional harm to the subject are treated as more serious. The official list is as follows:

  • the racial or ethnic origin of the data subject,
  • their political opinions,
  • their religious beliefs or other beliefs of a similar nature,
  • whether they are a member of a trade union (within the meaning of the Trade Union and Labour Relations (Consolidation) Act 1992),
  • their physical or mental health or condition,
  • their sexual life,
  • the commission or alleged commission by them of any offence, or
  • any proceedings for any offence committed or alleged to have been committed by them, the disposal of such proceedings or the sentence of any court in such proceedings.

I’ve been checking for any unexpected issues and found a few surprises:

  • membership of certain mailing lists could indicate trade union membership, sexuality, ethnic origin, religion.
  • “reason for leave”, ie. why someone was off work can include physical and mental health info.
  • I also read through every reason a student has had an extension to a coursework deadline, as this is recorded in the ECS Handin system. It’s a free-text field, but thankfully people have used it responsibly and just list where the authority for such an extension came from. Although there’s a batch that just say “volcano”, which is the coolest excuse for handing your coursework in late!

Data protection statements

This is another thing we should already be doing.

Any service which collects information from people who are not current members of the university should have a statement clearly saying how that data will be used. If you think you might want to use it for another purpose (eg. analysis) later, say so, but don’t be too vague. eg. If someone signs up for a university open day, are we going to add them to our database to send them more information on related topics, or keep this request on file for analysis? We probably are so we should say that.

EPrints lets people request an unavailable paper, and that logs the request and their contact info and passes it on to the creators of the paper. You know what? We probably do want to do some analysis on how that service is used, so we should say so up front. I’m thinking something like

“We may also use this information to help us understand and improve how this service is used. Other than passing your information to the creators of this work, we won’t share individual details outside the University of Southampton, but data aggregated by internet domain, country or organisation might be shared or published in the future.”

While most of the information our staff and students submit into our systems probably doesn’t need additional data protection sign-off, it may still be required if we’re going to use that data for something unexpected or not to do with their relationship to the university. eg. If we collected data on how email was used by our own members for service improvement, that probably doesn’t need a specific statement. If we were using it for a research project, then consent would be required. If in doubt, ask the GDPR office.

Data retention periods

For all retention, it’s a trade-off. It may harm people to keep the data too long. It may harm people not to keep it long enough.

The university has several key sets of people we have data about:

  • Students
  • Staff (and “Visitors” who are treated like staff on our IT systems)
  • Non staff and students who interact with us.
  • Research subjects (eg. people a research project collected data on)

Research projects’ data retention is usually very clear, and handled as part of ethics. As GDPR beds in, the GDPR principles should be incorporated, but consent is generally already given.

Data on interactions with the public (eg. open days, logged IP addresses, conference delegates) will all have an appropriate retention period but it’s not yet clear what these will be.

For data about staff and students, the retention period will either be years since the data was created, or years since the student graduated or the person left the university. Possibly it could be a period after they leave the job post.

What we should be doing right now is having a plan for how either of these retention policies could be implemented. I think it’s more likely that the years-since-data-creation method will be used for most things, as it’s so much simpler.
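As a sketch of how the years-since-data-creation policy might be implemented (the table and column names here are invented for illustration, not from any real system of ours):

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical retention policy: erase records more than N years old,
# measured from when the data was created.
RETENTION_YEARS = 6

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_data (id INTEGER, created TEXT)")

now = datetime(2018, 6, 1)
rows = [
    (1, (now - timedelta(days=3 * 365)).date().isoformat()),  # within retention
    (2, (now - timedelta(days=8 * 365)).date().isoformat()),  # past retention
]
conn.executemany("INSERT INTO person_data VALUES (?, ?)", rows)

# ISO dates compare correctly as strings, so a simple comparison works.
cutoff = (now - timedelta(days=RETENTION_YEARS * 365)).date().isoformat()
conn.execute("DELETE FROM person_data WHERE created < ?", (cutoff,))

remaining = conn.execute("SELECT id FROM person_data").fetchall()
print(remaining)  # only the record within the retention period survives
```

The point of having even a toy version like this is that it forces the question “is there a creation date on every record?”, which some of our older systems can’t answer.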

SARs: Subject Access Requests

It’s likely to become more common for people to ask what the organisation knows about them. Not all information is covered by this, but we should be ready for it.

What we should be doing right now:

Document the primary way people are identified in each system you are responsible for:

  • Staff/Student number
  • Email – the university provides everyone with several aliases to make this more complex
  • Username
  • ECS username – ECS had a different accounts system and 180 staff who’ve been here forever have a different username to their main uni one
  • UNIX UID
  • ORCID
  • An ID local to this system (if so, is that linked to any of the above? If not how are we going to identify that it’s the right person?)

Think about how the person, or an authorised person, could extract all their information from the system in a reasonable form (XML, JSON, ZIP, HTML, PDF…). For many of our systems this would currently be a set of manual SQL queries, but where possible these should be available as tools, both for people to get their own data and for admins to get anybody’s.
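A minimal sketch of what such an extraction tool could look like, assuming a relational store and a common `username` column across tables (both assumptions; real systems will differ, and table names here are made up):

```python
import json
import sqlite3

def export_subject_data(conn, username, tables):
    """Collect every row mentioning `username` from each listed table,
    returned as JSON. `tables` must be a trusted, hard-coded list, since
    table names can't be parameterised in SQL."""
    result = {}
    for table in tables:
        cur = conn.execute(
            "SELECT * FROM %s WHERE username = ?" % table, (username,)
        )
        cols = [d[0] for d in cur.description]
        result[table] = [dict(zip(cols, row)) for row in cur.fetchall()]
    return json.dumps(result, indent=2)

# Tiny demonstration with an invented table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (username TEXT, logged_in TEXT)")
conn.execute("INSERT INTO logins VALUES ('cjg', '2018-01-05')")
print(export_subject_data(conn, "cjg", ["logins"]))
```

The hard part in practice isn’t the query, it’s the identifier mapping from the list above: you’d first need to resolve staff number, email aliases, ECS username etc. down to whatever ID each system actually stores.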

These requests are coming. We don’t know exactly how, but it’s probable some people will make them just out of curiosity and ask for everything they possibly can. We need to keep the costs of these down.

Obviously, if responding to such a formal request, ensure that you only pass the data on to an appropriate person in legal who is handling the request. More formal methods are likely to evolve.

The right to be forgotten

Under the old DPA, people have always been able to demand that information held about them is correct. Under the new law they can also request to have information about them removed.

It seems very unlikely that current staff or students will make such a request about their current job or course, and unclear if that would be a reasonable thing to request. However someone could ask for information about a past course or post to be purged. It’s impossible to find every file or bit of paper, but quite likely we might be asked to remove them from a given system.

What we should be doing right now: considering how we would do this on systems we run, and if it’s a likely request, starting to implement features to enable this.

Finally, email mailto: links

This is a novel but simple way to reduce data breaches caused by people picking the wrong email alias from their address book. When someone clicks on a mailto: link in a webpage, it’s usually just in the format “acronymsoup@example.org”. However, you can write email addresses with the display name included, so that when people mail them, the human-readable name is saved into their local address book and they are less likely to be muddled about what they are sending to. Sometimes big lists of people have addresses with similar strings of characters to the email of a single role or office. This can cause data breaches.

Compare these two mailto links:
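(The original examples are truncated in this copy of the post; as an illustration, with made-up addresses and names, the two styles described above might look like this. Note that the display-name form relies on URL-encoding “Web Team Office &lt;acronymsoup@example.org&gt;”, and most, though not all, mail clients will honour it.)

```html
<!-- Bare address: the recipient's address book stores only a cryptic alias -->
<a href="mailto:acronymsoup@example.org">Email the team</a>

<!-- Address with display name included: the human-readable name travels
     with the address into the sender's local address book -->
<a href="mailto:Web%20Team%20Office%20%3Cacronymsoup@example.org%3E">Email the team</a>
```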

Continued…

Posted in Best Practice, Data, GDPR.


Maturation of an organisation’s open data service

Our open data service, data.soton.ac.uk, has been around for a long time now. Most of our sister services at other UK universities have been de-invested and are gone or limping along quietly. Currently, we still have a full time open data specialist, Dr Ash Smith.

What matters to our “decision makers” isn’t open data. It’s reputation, student satisfaction, sustainability and saving money. Our open data service enables most of these, with the possible exception of sustainability. I’ve been thinking about how to reframe what we actually have from these perspectives, rather than from an idealist early adopter viewpoint.

What we’ve actually built is a corporate knowledge graph that only contains public information, and primarily contains current information. Our archives actually have what the menu was for lunch in the staff restaurants, and student piazza, every single day for the last few years. Nobody cares. It’s just what is for lunch today that’s in the live knowledge graph (aka. SPARQL database aka triplestore).

What has open data done for us, anyway?

Having this information all in one place has enabled some valuable services to be produced by our own team. As it was already cleared as open data, there’s no hassle getting permission to use it to make a new service, even though it contains data from several different parts of the organisation.

The crown jewel of the services it enables is maps.soton.ac.uk, but there are a number of others. The actual “open data” can be viewed as just another service enabled by this knowledge graph. One of the easily missed but useful features is the ability to view and download simple lists of things, like buildings or parts of the organisation, and to view a page per, er, thing with a summary of the information available about that thing. Of these pages, the most valuable are those for rooms used for student teaching. These are linked to by the timetable system, so are a part of our information infrastructure now.

The open data has enabled several useful developments, primarily the excellent maps of campus produced by Colin Williams (PhD student) and Chris Baines (undergraduate). The problem with these maps is that they were so useful we needed to support them when their creators left, and the best approach for us was to nick all the good ideas but rebuild our map from scratch. The current map.southampton.ac.uk wouldn’t exist without their work, which only happened because the data was, and is, open, so they could play.

Another innovation Colin inspired was the augmentation of corporate data. Our university didn’t have a good database of building shapes (as a polygon of lat/long points), a reference lat/long per building, a photo of each building, etc. Colin started producing these as datasets which augmented the official data we get from Planon, and since then we’ve hired summer interns to maintain and improve them. This includes the lat/long and a photo of most building entrances, which wasn’t that much work for interns to create, and needs little curation as not many entrances move or change each year. Once the entrances have ID codes, we can then start to make more data about them, such as which entrance is best for getting to a lecture room.

Where we’ve seen less return on our investment is in providing resolvable URIs that give data on a single entity. These return RDF and the learning curve is too sharp for casual users. I’ve spoken to people using regular expressions to extract data from an RDF/XML page, and that is a mismatch between what our users need and what we provide.

Sadly, organisational open data services have not caught on. Yet. It’s still not normal, and I suspect Open Data is just starting its way up the “Slope of Enlightenment“. The recent work on UK Government open registers is a great example. It’s simple and knows what it’s there to do. It’s learned lessons from data.gov.uk and gov.uk, and it’s built on a really well designed API model; unless you look closely, you wouldn’t notice how simple and elegant it is. It’s a normal and sensible thing for any government to provide in the digital age. It provides official lists of things and the codes for those things. This is simple and valuable, like having a standard voltage for mains and the same shaped plugs, or the train tracks in Southampton and London being on the same gauge. It’s clearly good sense, but didn’t happen by luck.

Our work on the open data service has also taught us loads, and I’m proud to have helped lead a session at Open Data Camp in Belfast, which produced a document listing crowd-sourced solutions to improving open data services. A few years back, Alex Dutton (data.ox.ac.uk) and I produced a similar document listing our experiences in dealing with the challenges of setting up an open data service. I’m really proud of both of those. The meta-skill I’ve learned is to be more introspective, both as an individual and as a community, so we can work out what we’ve learned and share it effectively. Hey, like this blog post! Meta!

Where are we now?

Where we’ve stalled is that we now have all the corporate data that’s practical to get, so new datasets and services are becoming rarer. One of our more recent additions was a list of “Faith-related locations in Southampton“, which has value both to current students and to students considering moving to the city, but from a technical point of view was an identical dataset to the one listing “waste pickup points” for the university. With the exception that a picture of a place of worship is usually quite nice, and a picture of a bin store is… less so.

Over the summer of 2017 we had our intern, Edmund King (see this blog for his experiences), experiment with in-building navigation tools. The conclusion was that the work to create and maintain such information for the university estate was too expensive for the value it would provide. When we did real tests we discovered lots of quirks like “that door isn’t to be used by students” or “that internal door is locked at 5pm”, and these all massively complicate the costs of providing a good, useful in-building navigation planner. Nice idea, but it can’t be skunkworks, and that’s a perfectly good outcome.

As new datasets are getting rarer, we’ve been looking more at improving rather than expanding. Part of this has been work to harden each part of the service, and get it running on better-supported infrastructure. The old VMs Edward and Bella have lots of legacy and cruft. The names come from the fact Edward used to do all the SPARQL but then the SPARQL moved to Bella. I suggested Beaufort and Edythe as names for the new servers but that’s mostly got me funny looks.

Another part of our current approach is the shocking move to retire datasets! Now we’re focused on quality over quantity, the “locations of interest for the July 2015 open day” dataset needs to just go away. It’s not been destroyed, just removed from public view as not-very-helpful. There’s also a few other datasets that seemed a good idea at the time but are more harmful than useful as they are woefully out of date, like our “list of international organisations we work with” that’s about 6 years out of date.

Where do we go from here?

The biggest issue is “how do we move forward as a service”, or maybe even “should we?”. My current feeling is that yes, we should, but focusing on the knowledge graph to enable joined-up and innovative solutions, with open data as just another service depending on that, not the raison d’être for the project. Open data, done right, will continue to enable our staff and students to produce better solutions than we could have thought of, and which we can sometimes incorporate back into our offerings. Last year a student project, on the open data module, produced a Facebook chatbot you could ask questions about campus, and it would give answers based on your current location. eg. If you asked it “where can I get a coffee”, it would identify that “coffee” was a product or service in our database, look at points of service that provided it, filter out ones that were not currently open, and send you a list of suggestions starting with the one physically closest to you. I investigated the complexities of running it for real, and found it was a bit brittle, needing 3rd party APIs and lots of nursing to understand the different ways people ask questions. Also, there are big data protection implications in asking where people are and what they want in a machine readable way!

The point is that the open data stimulates innovation. Not as much as we’d like, and it doesn’t do our job as uni-web-support for us, it just helps us find ways to do it better.

Long term, I think the service needs to stop being a side-project. We should strip back everything that we can’t justify, and just have a knowledge graph be part of our infrastructure, like BizTalk. We then turn the things built on top of it into normal parts of IT infrastructure. Ideally the pages for services, rooms, buildings etc. would merge into the normal corporate website, but this raises odd issues. We have been asked what the value is in providing a page on a shed. For me, it’s obvious, and that makes me bad at explaining it.

We could keep a separate “innovation graph” database which included zany new ideas, and sometimes broke, but the core graph database should be far more strictly managed, with new datasets being carefully considered and tested that they don’t break existing services.

What does the future hold?

In the really long term, well structured, auto-discoverable open data should be the solution to the 773 frustration. If you look at the right-hand side of that diagram, almost everything is lists of structured information. That information isn’t special, either. It’s information many other organisations would provide, and with the same basic structure. One day maybe we can have nice discoverable formats for such information and get over using human-readable prose documents to convey it. We did a bit of work early on suggesting standards for such information from organisations, but this was trying to answer a question that nobody was yet asking. I still think that time will come, and when it does we’ll look back and laugh at how dumb 2018 websites were, still presenting information as lists in HTML. The middle ground is schema.org, with which I have a bit of a love-hate thing going. It’s excellent, but answering the wrong question. It helps you get your data into Google. I don’t want my data needlessly mediated by corporations, but I get that most people don’t really care so much about that.

The good news is once people have seen something done a sensible interoperable way it’s hard to go back. I can’t imagine people buying a house with just “Apple” sockets that didn’t fit normal appliances. Then again, computer systems are less compatible now than 10 years ago, so who knows for sure?

I’m optimistic that eventually we’ll achieve some sea-change moment in structured data that will be impossible to backtrack from. But such “luck” requires a lot of work, and we may fail many times before we succeed.

We didn’t quite change the world with data.southampton, but the by-products are valuable enough to easily have returned on the investment.

Posted in Open Data.


Exploiting the Bitcoin aftermath

I don’t fully understand the ins and outs of Bitcoin. What I do just about understand (correct me if I’m wrong!) is that the amount of processing it’s worth doing per day is a product of the price, and vice versa.

As Bitcoin mining is effectively an arms race to produce the most processing power per unit of electricity/money, there is now custom hardware for it. Graphics cards and “normal” supercomputers have their theoretical power measured in FLOPS (there are other, better measures; I think FLOPS is the BMI of supercomputing). The important point here is that it stands for “floating point operations per second”. A floating point number is basically the same as scientific notation: a number, and then the number of times to shift the decimal point left or right. Bitcoin mining uses hardware optimised specifically for integer operations, rather than floating point.
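To make the scientific-notation analogy concrete, Python can expose the significand-and-exponent structure of a float directly (in base 2 rather than base 10):

```python
import math

# math.frexp splits a float into (mantissa, exponent) such that
# value = mantissa * 2**exponent, with 0.5 <= mantissa < 1.
mantissa, exponent = math.frexp(6.25)
print(mantissa, exponent)        # 0.78125 3
print(mantissa * 2 ** exponent)  # 6.25, reconstructed exactly
```

Bitcoin mining (repeated SHA-256 hashing) needs none of this machinery, only integer shifts, adds and logical operations, which is why ASICs built for it are useless for FLOPS-style workloads and vice versa.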

Why does this matter?

Well, it seems reasonably likely that Bitcoin might crash and make the vast amount of bespoke hardware deployed to mine it suddenly unprofitable. The owners will have the choice of scrapping/mothballing it, or trying to monetise it in other ways, which would mean a sudden glut of processing power for integer tasks (unlike the more common FLOPS). Maybe that’s something we should be preparing for? Could it have a significant impact on commodity data processing? Maybe there’s going to be a window where teams with tasks that can run on such hardware can get very good deals… until everyone else catches up.

I’m not experienced with supercomputing, so this is just a back-of-the-envelope theory.

Is this a realistic scenario? Have I overestimated the significance of floating point vs integer hardware?

Posted in Bitcoin.


More conference spam

*** UPDATE : Tue 5th Dec ***

So, I heard back from one of the alleged organisers. They had received an email from the event which they had not responded to, and had no involvement beyond that! It sounds like this event might be more fake than I realised!

*** ORIGINAL POST ***

This is an open letter to the organisers of the “4th International Conference and Exhibition on Satellite & Space Missions“. I’ve googled the email addresses of some of this year’s organising committee and sent it to them directly too. I don’t object to getting the occasional mistargeted conference invite, but the claim that I’ve been carefully selected pissed me off. Doing the same thing two years running ticked me off enough to write this blog post.

* * *

To the organisers of 4th International Conference and Exhibition on Satellite & Space Missions,

Hi, I’ve just had an email “reminding” me of my invite as a plenary speaker to the above conference.

I’ve just had a look at the conference website and the people I am emailing are the first few people listed as this year’s organisers.

I’m concerned that the email I have been sent is potentially damaging to the reputation of those involved in this conference.

  • This is the first time I’ve been contacted about the 2018 event. I had a similar email for the 2017 event, also claiming to be a “followup” to an earlier email. Sending a first email claiming to be a followup isn’t “clever marketing”, it’s telling a lie to chivvy people into signing up for a scientific conference. It is just possible that, two years running, the first email was not delivered, but it seems unlikely.
  • I am not a Doctor. That title has to be earned, and I work with PhD students who spend years earning it. It’s not appropriate or respectful to just stick it on my name in a mail merge.
  • I have absolutely no papers relevant to “Satellite and its allied areas”. Last year I contacted the conference to ask exactly what subject from my profile they were interested in, and I was sent a default response repeating the statement that someone had “gone through your profile”. I was not told the reason for me being contacted.
  • All this makes the phrase “This is not a spam message, and has been sent to you because of your eminence in the field” inaccurate, and that’s being very generous.

The email has been published at
http://blog.soton.ac.uk/webteam/2017/12/04/satellite_2018/

The email I was sent:

Dear Dr.Christopher Gutteridge , We contacted you by email earlier, since we have not yet received any response from you regarding submission of abstract and your participation as a plenary speaker, we are taking the liberty of re-sending this mail as we are aware that you may be engaged in other activities or my E-mail may not have successfully reached you.

Meet 300+ featured Speakers at Satellite Conferences

Thanks & Regards,
Valentina Esther
Program Director | Satellite 2018
Kemp House, 152 City Road
London EC1V 2NX, UK

4th International Conference and Exhibition on Satellite & Space Missions
June 18-20, 2018 | Rome, Italy
Theme: “Shaping the Future with Latest Advancements in Satellite and Space Missions”
Meet world leading Space Researchers, Scientists, Delegates and Students from 50 different Countries & 5 Continents
A very good day to you.
The purpose of this letter is to invite you as a Speaker at upcoming “4th International Conference and Exhibition on Satellite & Space Missions” during June 18-20, 2018 at Rome, Italy. Satellite 2018 focuses on the theme “Shaping the Future with Latest Advancements in Satellite and Space Missions” which aims in gathering the eminent research communities catalyzing information exchange and networking between researchers and business entrepreneurs of diverse backgrounds fostering advancements in Satellite and Space research.We have gone through your profile and as per your eminence in the arena of Satellite and its allied areas; we would like to have your presence as a Speaker for Satellite-2018 which will definitely add a good impact to our conference. We believe all the scientists and young researchers, who are gathering, will be learnt from your research talk at the conference.

Conference website: Satellite Conferences

Please let us know your interest to be a Speaker at this international event and hence will be glad to provide you with more details for the participation confirmation.

For any information required, do not hesitate to ask for assistance.

We appreciate your time and consideration, and looking forward to hear positive response from you.

Thanks & Regards,
Valentina Esther
Program Director | Satellite 2018
Kemp House, 152 City Road
London EC1V 2NX, UK
Email: satellite@conferenceseries.net

Disclaimer: This is not a spam message, and has been sent to you because of your eminence in the field. If, however, you do not want to receive any email in future from Satellite 2018 then reply us with the subject “remove /opt-out”. We concern for your confidentiality.

To unsubscribe, click here

Posted in Conference Spam.


We need to talk about online advertising

We are getting very used to seeing obvious lies in advertising on reputable websites.

This worries me.

It’s much more serious than mere “clickbait”.

Some of the biggest lies are the ones which use your IP address to guess your city or country. This started out on more “disreputable” sites, where I would see adverts that “Women in Reading want to meet for sex”. Which was odd, as I was in Southampton, but my ISP was giving me IP addresses coded to Reading. It’s clearly a lie, but we accept it, and that acceptance is risky.

These days most websites don’t sell their own advertising, they use companies like Outbrain or Taboola to provide the adverts. The adverts you see are custom to you, based on both location and your own browsing history.

Advert in context on the page

If the Telegraph newspaper carries an advert containing a lie, it’s relatively easy to complain about. However, if the Telegraph website carries a false advert, who’s responsible? The advertiser? The Telegraph? The advertising platform? There’s no guarantee any of these are in the UK, and it’s not clear who’s responsible.

The most alarming advert I’ve seen is one for a “tactical flashlight” which has text something like “Police in Southampton recommend everyone carries one of these”.

Close up of the advert which appeared on the Telegraph website (also Daily Echo)

It links to a fake news article with $cityname in the URL, telling me that there’s been a rise in violent crime in $cityname and hence the police say you should buy the product. This is beyond immoral; it’s dangerous and almost certainly illegal. The fact that this advert still appears in various forms scares the hell out of me. The fact that some newspapers now muddle the advertising with the “other stories on this site” makes it harder to evaluate the source of information.

Seemingly “legit” news articles as advertising

If you visit “disreputable” parts of the web (porn, piracy etc.) you will get very used to popunders advertising a mix of sex sites, gambling, malware and financial scams (“The Brit Method”) etc. What I’ve noticed in my “research” looking at such sites, is that sometimes the pop-under window is just a news site with a story on. What the hell is going on there? Why is ibtimes.com trying to open innocuous windows on my browser with random stories from their site… are they hoping that getting it in my eyeline will get some social media links? I don’t know.

Your filter might be racist

There have been some worrying reports of targeted advertising on Facebook being used to offer something only to certain communities, or to aim political messages at certain ethnic groups. Remember, too, that location can be a proxy for race. If you’ve rough data on where different races live, filtering by postcode or even town can be quite creepy. Adverts don’t tell you “You are seeing this because you live in the white middle-class part of your city”.

It’s a trap

Some online advertising is for flat-out scams. The “get rich quick” scheme is alive and well in 2017. The first image on this page is a good example, but I’ve hit reload a bunch of times trying to find one today and for some reason can’t.

What I did find is adspider.io which looks like an interesting tool which tracks online adverts and what sites are showing them.

Here’s a nice example of a site with the hallmarks of a scam (fake comments, fake location heading). For fun I’ve linked to it with $cityname. I found this via an advert on a sports website.

Possible remedies for issues with online advertising

I’ve been trying to think what to do about this. What should lawmakers, ISPs and media be doing?

First of all, let’s put the ads on “disreputable” sites out of scope; they have no licence or “good name” to threaten.

But what can we do about ads on Wired, the Telegraph, or the Daily Echo (our local paper)?

Idea one: Advert identification codes

Every unique advert shown on a website should be assigned a unique code for that website or advertising platform. This would let people complain about something more concrete, rather than something entirely ephemeral.
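As a sketch of how such codes might be generated (this scheme is entirely hypothetical; no platform works this way today), a platform could hash the creative together with its landing page, so the same advert always yields the same short, quotable identifier:

```python
import hashlib

def advert_code(platform: str, creative: bytes, landing_url: str) -> str:
    """Derive a stable identification code for one advert.

    Hashing the creative bytes plus the landing URL means the same advert
    always gets the same code, so two people who saw it can compare notes.
    The scheme and inputs here are illustrative, not any real standard.
    """
    digest = hashlib.sha256()
    digest.update(platform.encode("utf-8"))
    digest.update(creative)
    digest.update(landing_url.encode("utf-8"))
    return digest.hexdigest()[:12].upper()

code = advert_code("example-platform", b"<creative image bytes>",
                   "https://example.com/landing")
```

A viewer could then quote the twelve-character code in a complaint, instead of trying to describe an image that has long since vanished from the page.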

Idea two: Personal advertisement log

A user should be able to click a link near the advert to get a list of every advert they have been shown from this source (website or ad platform) for the past N days; N is negotiable, but I’d suggest 90 days minimum. Each advert in this review should also tell them the data used to decide to show it. Actually, this would be nice anyway: have you ever seen a really cool ad on Facebook, only for a window to pop up over it, and by the time you’ve dealt with that the advert is gone forever? Every instance of every advert should have a unique URL, visible to anybody.

Idea three: Public advertisement log

This is more hardcore; but I think that EVERY advert shown to EVERY user in the past X days, along with the logic used to create/show it, should be made available to the public.

Idea four: Sort out how to complain and escalate complaints

Who should be held responsible when I visit the site of a UK company, hosted in Germany, using a USA advertising platform showing an advert for a Chinese company? This is tough, and I just don’t know, but we need to find a solution to these questions.

Right now, it’s too easy for a local news site to wash their hands and take no responsibility for the bad behaviour of the advertising platform they use. The best idea I have is that a UK standards agency could ban the use of non-compliant advertising platforms by UK companies.

Problems with these suggestions

Advertisement 72B0391CF0F shown because: Viewer in UK & Viewer searched for “Impotence cure”

It’s very hard to define who a user is in a way that lets them, and only them, see their own advertising history, even though we know that these companies know exactly who we are. If I search for something on the John Lewis website, I see related adverts on other sites the next day.

Another issue is that an advert could contain information that could not be made public because it is identifiable personal information. If the advert image contained the target viewer’s real name, you couldn’t publish it to the public along with the reasons it was shown. This could become an excuse not to publish anything: “people want personalised adverts” turns into a reason why disclosing adverts would breach someone’s privacy.

Conclusion

The above is based on my own experience browsing the web. Maybe you see different adverts to me? How would I know?

Anyhow, the current situation needs to change, and to do so we need concrete things to ask for. Am I unrealistic or am I not going far enough? What do you think?

Posted in Advertising.


50 years since the “Mother of All Demos”: What’s that got to do with the price of fish?

Demonstrating a user interface to manipulate structured data

So, we’ve been discussing ways to mark the 50th anniversary of the Mother of All Demos [Youtube, Wikipedia]. In this demo, Doug Engelbart demonstrated the tools that he and his team had built to make themselves smarter and more effective. Some of those tools would become household items. He was one of the most important inventors in history.

The anniversary is 9th December 2018 (so 13 months from now). There are some thoughts at http://doug-50.info/

Frode Hegland is head cheerleader for our discussions, and has asked us to think about where we can demonstrate and celebrate Doug’s ideas and vision, and how we can take them further.

So “augmenting human intellect”… how hard can that be?

Containers being transferred to a cargo ship at the container terminal of Bremerhaven by Hannes Grobe

What I’ve been thinking about is something I don’t have a perfect description of yet. It is about how humans interact with information; researchers and scientists most of all, but everyone else too. There’s an excellent blog post by Mia Ridge which outlines much of the problem with information in 2017. Our data is anaemic. We can move it around the world in moments, and request strings of ones and zeros, but we know almost nothing about what those strings contain.

People are so used to the status-quo that they don’t realise there’s a problem and how much better it could be. It’s like shifting sacks of cargo onto a ship. That used to be “just how you did things”.

The best phrase I’ve got for this idea, so far, is “Intermodal information”. I’m stealing the idea from the freight industry. While I’m stealing, I’ll steal the whole definition from Wikipedia.

Intermodal freight transport involves the transportation of freight in an intermodal container or vehicle, using multiple modes of transportation (e.g., rail, ship, and truck), without any handling of the freight itself when changing modes. The method reduces cargo handling, and so improves security, reduces damage and loss, and allows freight to be transported faster. Reduced costs over road trucking is the key benefit for inter-continental use. This may be offset by reduced timings for road transport over shorter distances.

The introduction of containers that worked between trains, ships and trucks changed the economy of the world for the better. We’ve already experienced something similar in data, thrice. Storing information digitally was the first. The advent of the packet-switched network (IP) was the second: we can now move data over networks from any computer to any computer. The IP network sends out packets of data, and those packets travel over wifi, wires, fibre optics… even via satellite. The Web (HTTP) was the third revolution in the interoperability of data. Now we can request computer files from all over the world and get them with some basic metadata (MIME types tell us a little about how they should be interpreted), and the URL system means we can link to computer files and talk about them.

It’s no secret that this has changed the world and our species’ relationship to data.

So what’s the problem?

Data is great, but it’s the start of the story, not the end. When you download a webpage, the response carries headers that are distinct from the file you are downloading. The header that tells your computer how to interpret the file is called “Content-Type”, and its value is a “MIME type”: generally something like “image/png”, “text/xml” or “application/vnd.google-earth.kml+xml”. Sometimes there’s a character encoding too, e.g. “text/html; charset=utf-8”. MIME types work much like the suffix on the end of a filename, e.g. badger.png or secrets.html. That’s a lot more useful than just guessing what the file is, but not much better than the filename on a hard drive.
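To make that header concrete, here is a minimal sketch (using Python’s standard library; the helper name is my own) of splitting a Content-Type value into the MIME type and its parameters:

```python
from email.message import Message

def parse_content_type(header_value):
    """Split a Content-Type header into (mime_type, parameters)."""
    msg = Message()
    msg["Content-Type"] = header_value
    params = dict(msg.get_params()[1:])  # first entry is the type itself
    return msg.get_content_type(), params

mime, params = parse_content_type("text/html; charset=utf-8")
# mime is "text/html", params is {"charset": "utf-8"}
```

Note how little survives the parse: a type, a subtype, and maybe a charset. That really is all the receiving computer is told.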

What I hope we can achieve is a way to better describe the contents of files. There are different ways to interpret the same file: a KML file is also a valid XML file, which is a valid text file, which is a sequence of bytes, which is a sequence of bits. None of that tells us that the KML file describes the locations of park benches in Southampton.

Datasets come in many forms… except they don’t, really. On computers, data files are usually structured either as trees of information, where each thingy has zero or more subthingies, or as tabular data, where information is organised into sets of homogeneous records, each record having information in more or less the same shape: CSV, spreadsheets and the like. There’s also “graph” data, but that’s less common.
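The two shapes can be sketched side by side; the park-bench records below are invented for illustration:

```python
# Tabular: homogeneous records, one row per item, every row the same shape.
bench_table = [
    {"id": "bench-1", "lat": 50.9350, "long": -1.3960, "material": "wood"},
    {"id": "bench-2", "lat": 50.9362, "long": -1.3971, "material": "metal"},
]

# Tree: each thingy owns zero or more subthingies.
bench_tree = {
    "city": "Southampton",
    "parks": [
        {
            "name": "Example Park",  # hypothetical park
            "benches": [
                {"id": "bench-1", "lat": 50.9350, "long": -1.3960},
                {"id": "bench-2", "lat": 50.9362, "long": -1.3971},
            ],
        }
    ],
}
```

The same benches, two shapes; nothing in either file says they are benches at all, which is exactly the gap this post is about.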

What’s that got to do with the price of fish?

Maine Avenue Fish Market (Bien Stephenson)

What interests me is for our tools to be able to record, transmit and understand the structure and meaning of a file. This is the distinction between data and information. All MIME types tell us is roughly what tools can read a file, and no more. Let’s take a very simple example: a spreadsheet containing a list of prices of fish. All we get from MIME is “application/vnd.ms-excel”, which just tells us we can read it in Excel. We know it’s going to have one or more worksheets, each with tabular data, but it would be helpful to know for sure that the first worksheet is the one of interest; that it is structured in rows, with one row per record and the first row as headings; and that the sheet represents a list of products and their prices. Going further, it would be helpful to know it’s about fish and relevant to a certain vendor; that we can validate the vendor really provided these prices; and the timescale and audience for which it’s valid. It would be helpful to link it to product categories, weights, specifications, species… and to have all those things done automatically and unambiguously, with no extra work for anyone.
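What that richer description might look like is an open question; here is one invented shape (none of these property names are a real standard, and the vendor is hypothetical) for the metadata the fish-price spreadsheet would ideally carry alongside its MIME type:

```python
import json

# A hypothetical machine-readable description of the fish-price spreadsheet.
# The property names are illustrative only; they show the kind of
# structure-and-meaning metadata that a MIME type cannot carry.
fish_prices_meta = {
    "mediaType": "application/vnd.ms-excel",  # all MIME currently tells us
    "primarySheet": {
        "name": "Prices",
        "orientation": "one row per record",
        "headerRow": 1,
        "columns": [
            {"name": "Species", "meaning": "fish species"},
            {"name": "Price", "unit": "GBP per kg"},
        ],
    },
    "publisher": "Example Fishmonger Ltd",  # hypothetical vendor
    "validFrom": "2017-11-01",
    "validUntil": "2017-11-30",
}

print(json.dumps(fish_prices_meta, indent=2))
```

Everything below "mediaType" is what we currently lose; a tool that understood it could open the right sheet, label the columns and check the provenance without a human in the loop.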

Hafenarbeiter bei der Verladung von Sackgut – MS Rothenstein NDL, Port Sudan 1960

This is not easy. But it will happen eventually, somehow, and when it does we’ll look back on this as the olden days and think of the computer files we use now the way we look back on sacks of cargo loaded by gangs of stevedores. We can’t get there in one big jump, but it’s where we should be aiming. Our data should just work, and get out of our way. Not just open data, but all our data.

This is a bit bigger than I usually aim, but the brief of celebrating and extending the work of Doug Engelbart is an unreasonable one, so maybe we need to start thinking beyond what is reasonable…

And for me that’s “Intermodal information”. Hopefully we can come up with a catchier name.

Posted in Doug Engelbart, Research Data.


Little bugs, hidden features, and lots of chatting – Week 11

This week has been full of lots of odd jobs for KnowledgeNow. I’ve fixed up the breadcrumbs, added a visual response for a failed feedback submission, cached the navigation data, split the search box out into a partial layout for re-usability, and submitted the project for a code review. The feedback I got from Andy was really useful, and it was reassuring to have somebody else more experienced check over my work.

I think what’s surprised me most about this week is the amount of talking I’ve done. I’ve discussed how we’ll integrate the website into the existing iSolutions site with Pat, I’ve discussed navigation and template issues with Graeme, and I’ve talked with the Service Management team about integration with ServiceNow and how we plan on logging search queries and view counts. I’ve even decided to hold a demo of KnowledgeNow for the Service Management team and some of the interns next week to get feedback. It will only be a couple of days before I leave, so there probably won’t be enough time to take everyone’s feedback into account, but it will be nice to know what people think and to leave a plan for the team to develop this further.

The information we plan on logging goes beyond my current knowledge of web development: recording IP addresses, user agents and search queries, along with the id of the article being viewed. It will involve using sessions, something Martin tells me is trivial, but I’ve learnt not to underestimate seemingly small tasks like this, and I look forward to tackling it first thing on Monday morning.
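As a sketch of the idea (the field names are mine, not the actual KnowledgeNow schema), a single article-view log record might look something like this, with a session id tying several requests together as one anonymous visitor:

```python
import time
import uuid

def log_article_view(article_id, ip, user_agent, query, session_id=None):
    """Build a view-log record like the one described above.

    A session id (in practice stored in a cookie) links the searches and
    page views of a single visitor without needing a login. All field
    names here are illustrative.
    """
    return {
        "session": session_id or uuid.uuid4().hex,
        "article": article_id,
        "ip": ip,
        "userAgent": user_agent,
        "query": query,
        "viewedAt": int(time.time()),
    }

record = log_article_view(
    "KB0011703", "203.0.113.9", "Mozilla/5.0", "password reset")
```

On the second and later requests the session id would come back from the cookie rather than being freshly generated, which is the part sessions make trivial.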

It seems the closer I get to completing this project, the slower progress becomes. Things like logging and custom error pages that I expected to be relatively straightforward aren’t, and there aren’t so many easy jobs left to fill my time while waiting on other people, or to tackle when my brain gets tired. The learning curve continues, and the list of little jobs to do before going live keeps growing, with a new item popping up every time I cross one off. My final week of working for iSolutions will most likely be a frantic race to get my project well and truly production-ready, and a battle to prove to the people in charge that they should put it out there.

Posted in Programming.



Open Data Internship – Week 12 – A retrospective

As I come to the end of my time here, I find myself looking back on what I have done. So here I have collected my actions in one place, to give next year’s intern an idea of what to do and where to start. So…

What have I done?

To start with, I read all the blogs written by last year’s intern.

After doing that I started looking into SPARQL, and eventually put together this query for finding all buildings without images.

PREFIX soton: <http://id.southampton.ac.uk/ns/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?building (SAMPLE(?label) AS ?Label) (SAMPLE(?lat) AS ?Lat) (SAMPLE(?long) AS ?Long) WHERE {
  ?building a soton:UoSBuilding .
  OPTIONAL {
    ?image a foaf:Image ;
           foaf:depicts ?building .
  }
  OPTIONAL {
    ?building rdfs:label ?label .
  }
  OPTIONAL {
    ?building geo:lat ?lat .
    ?building geo:long ?long .
  }
  FILTER (!BOUND(?image))
} GROUP BY ?building

It works by getting all buildings, then optionally finding an associated image, label and geodata, then filtering out any building which does have an image.

Then I went out into the world, armed with a clipboard, a list and a camera, to take photos of the buildings missing them. It is best to do this in good weather, so planning is vital. I would advise the next person to set this up very early in their internship; then, when the weather is good, go out and gather the data.

The next thing I did was make a map showing building internals. This was my main project for the course of my internship, the aim being an app people could use to navigate around building 37. The source code is available here, primarily in the map.js file. It uses Leaflet.js and leaflet-indoor to draw a map using the university’s curated map tiles, then draws polygons representing the rooms, with leaflet-indoor handling the rooms being on different levels.

I drew these rooms in QGIS, by tracing floor plans given to us for the purpose of this project. They were stored as GeoJSON polygons, with some information about each room: typically the type of room (office, way, stairs, lift), the level, and a label giving the room number.

The next part of this project was to allow a user to navigate the map. To do this I decided to take a graph-based approach: a user starts at one node and moves to adjacent nodes via edges. Nodes are placed in rooms as destinations, and in corridors and doorways, giving a model of travel that balances the complexity of the building against computational cost. I started with a breadth-first search (BFS), but this quickly became too unwieldy to use: in the worst case BFS visits every node in the graph (branching factor × depth amounts to the total number of nodes), so its complexity is O(n). So I decided to move to an A* search, using the following heuristic:

distance travelled along edges + straight-line distance to the end + (0.35 × levels changed). This means I sort all the nodes I have reached by this value and expand the node with the smallest value first. The complexity should still be branching factor × depth, but a perfect heuristic gives a branching factor of 1, as you assume you always take the best route. Depth is slightly harder to justify; however, our map is finite, currently covering one building (with hopes of eventually the whole campus), so the depth is bounded and not directly tied to the number of nodes in the map, making it effectively constant too. On those two (admittedly optimistic) assumptions, the new navigation method behaves like O(1).
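A minimal sketch of that A* loop in Python (the node and edge structures are invented for illustration, not the project’s actual code; the level penalty mirrors the 0.35 figure above):

```python
import heapq
import math

def a_star(nodes, edges, start, goal, level_penalty=0.35):
    """A* over a node/edge graph like the one described above.

    nodes maps id -> (x, y, level); edges maps id -> [(neighbour, length), ...].
    The heuristic mirrors the post: straight-line distance to the goal plus
    a fixed cost per floor still to change.
    """
    def h(n):
        x, y, lvl = nodes[n]
        gx, gy, glvl = nodes[goal]
        return math.hypot(gx - x, gy - y) + level_penalty * abs(glvl - lvl)

    # Each frontier entry: (cost so far + heuristic, cost so far, node, path).
    frontier = [(h(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for nxt, length in edges.get(node, []):
            new_cost = cost + length
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(
                    frontier, (new_cost + h(nxt), new_cost, nxt, path + [nxt]))
    return None, float("inf")

# Three nodes in a row on the same level, two edges between them.
nodes = {"A": (0, 0, 0), "B": (10, 0, 0), "C": (20, 0, 0)}
edges = {"A": [("B", 10.0)], "B": [("C", 10.0)]}
path, cost = a_star(nodes, edges, "A", "C")
```

Sorting the frontier by cost-so-far plus the heuristic is exactly the “expand the node with the smallest value first” step described above; the priority queue just does the sorting incrementally.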

The first time I drew the nodes, I drew them in QGIS and manually added the edges. However, I eventually built a webpage that let me add nodes and draw the edges automatically. This is available here, in the edit.html and edit.js files.

I spent a sizeable amount of August making a JSON editor in Python, available here. I was mildly proud of this, as the vast majority of its functionality is generated dynamically from a schema for the data. However, due to a lack of foresight and testing, it was decided that this was not the best way to add and edit map data, so I made the edit.js code instead.

The other thing I did during my time here was to look through the work done by the previous intern. This consisted of looking through an app they had written for data-gathering purposes, and some data they had gathered which had not made it onto the open data site. Once I found the data that had not been added, I worked out what it was for and got it added to the website.

Posted in Geo, Javascript, Open Data, Programming, python.


Recaptcha, routing, and more testing – Week 10

The slow and steady progress towards having a web application ready to go live continues. This week I’ve added reCAPTCHA to form submissions and cleaned up routing to allow URLs such as /KnowledgeBaseArticle/ArticleDetails/KB0011703 as opposed to /KnowledgeBaseArticle/ArticleDetails?query=”KB0011703″. The branding is also being redone, carefully designed to fit seamlessly with the existing iSolutions website. As far as users are concerned it will simply be a new section of the site, with all of the menus and navigation options remaining identical.

This has proved more difficult than I would have hoped, simply because the templates and standards I expected don’t exist. I was provided with a link to a JSON file that is supposed to represent the core university and iSolutions navigation bars, but on further inspection they don’t quite match what’s displayed on the university website, and one of the navigation bars isn’t represented in the JSON at all. When I asked about this, one of the responses I got from another iSolutions web developer was along the lines of “you can always add static links, just keep an eye on them, I’ve done it for projects before”. And when I explained that I don’t want my website to fall behind and stop working properly after I leave, I was told “that’s normal, all websites stop working at some point”.

As an inexperienced web developer, perhaps I came into this field too optimistically, expecting more from existing systems than is entirely reasonable. Perhaps a future project, for either a full-time employee here or another intern, would be to standardise navigation across the university website and provide a template to the whole of iSolutions that can be updated in one place, instead of people all across the organisation running around cleaning up the fallout whenever a minor change is made to the core website. We all work to avoid code duplication in our own projects, but if we could avoid duplication across all projects then our systems would work better, and we’d all spend a little less time reinventing the wheel and a little more time on the interesting, innovative work that really matters.

Moving on from KnowledgeNow, this week Pat and I took a trip up to Highfield Campus to meet the open data team and discuss student engagement in the run-up to the new academic year. We started by talking about the different kinds of users who might be interested in the open data service, then moved on to looking at the open data website and considering how it could be improved for each of them. Having found interaction with the open data service frustrating in the past, it was interesting to talk to the people behind making and maintaining the website. The first point we hit immediately was the state of the navigation menu: it’s thoroughly confusing, and the entire purpose of the service isn’t immediately clear. (Not to mention the lack of university branding; if only there were a shared university template they could have used.)


The Open Data Service Homepage

I was strongly in favour of splitting the services built on the open data from access to the data itself, so that non-technical students can use the services more easily, with less risk of being scared away by technical jargon and data they have no interest in using. At first this idea seemed to be seen as too much work for too little value, but in my mind it was simply a case of moving links around and bearing in mind whether each page should be targeted at technical users or not. By the end of the session it was agreed that the navigation menu should be split between technical and non-technical sections, and that links between the two should be minimal. I’m glad I managed to get my point of view heard, and I really think it will help with engaging technical and non-technical students alike. Even computer scientists like things to be organised every now and then.

That wasn’t our only field trip this week; we also had an intern-wide trip to the data centre. It was really interesting to see where all of the university systems are running, and the technology involved is incredible. The entire building has been kitted out specifically to deal with supercomputers and server racks, with underfloor and overhead cooling, and the most thorough anti-fire precautions I’ve ever seen in my life. As cool as it was to see this kit in action, I felt thoroughly out of my depth. I’m not a hardware nerd, and when people describe engineers as people who have been taking apart machines since they were children, I never feel that really describes me. I’m a software engineer through and through; I preferred decision maths to physics at school, and nothing’s changed. I’m grateful to have been able to see this kit, but the talk we were given about the specs of the various machines, and the mechanics of the systems supporting them, went over my head.

Data centre tour


Posted in Data, Open Data, Programming.



Off with the training wheels – Week 9

Last week, I spent the majority of my time refactoring and restructuring KnowledgeNow with Martin’s help: either pair programming, or checking in with him about what I should change and how, then going away and coming back when it was done. Martin left for two weeks’ holiday at the beginning of this week, but last Friday I was under the impression that there really wasn’t much left to be done, and that I’d be able to polish it off nice and quickly all by myself. It turns out I was wrong, and one week later it feels like I’ve hardly dented my “polishing off” checklist. The missing tests turned out to be trickier to write than expected, and the extra features that needed adding were a lot more involved than I gave them credit for. The team were really supportive, and happy to help me crack any particularly tough problems, but not having anybody else working full time on this project with me was a bigger shock than I expected. Suddenly I was captain of my own ship, and it was a lot more work than I realised. It’s been a really fun experience though, and the satisfaction of having my web application report feedback back into ServiceNow, and seeing those results in the ServiceNow web app itself, was fantastic after all those hours of work to make that happen.

Adding the “Was this helpful?” functionality involved adding methods to the ServiceNow API wrapper (originally written for Fast Track Tickets) that allow new items to be posted to the Knowledge Base Feedback table. Users can now report whether or not they found an article helpful, and are then presented with a comment box to add extra information if they wish.


The rest of my week was spent cleaning this code up and adding tests, something I always seem to underestimate in terms of the time and effort required. Crossing the 50% line for test coverage shouldn’t have felt like a huge achievement, but it did. My plan for the coming week is to keep the momentum going with writing tests, covering the project as thoroughly as possible, before refactoring again and requesting a code review from the team.

Posted in Programming.
