

A first look at the dev8d twitter network

A couple of days before setting off for dev8D, I set a script running to log the changes in friends/followers of twitter accounts related to dev8D.

In looking at the data, I’ll talk about three sets of twitter users:

  1. Wiki users – anyone who has registered on the dev8D wiki with a twitter account. Anyone in this category is assumed to have been an attendee at dev8D.
  2. Dev8D community users – anyone who has mentioned ‘dev8D’ in a twitter post, or is a wiki user.
  3. Total users – anyone who follows, or is followed by, someone in the two categories above.

So far I’ve just looked at the smallest possible category: those who registered on the dev8D wiki with their twitter accounts, numbering 113 in total.

I captured 9 days of useful data, from 22nd Feb 2010 to 3rd March 2010.

The Numbers

In summary:

Those 113 attendees followed:

  • 158 other attendees (i.e. wiki users)
  • 250 dev8D community users (including wiki users)
  • 565 total users (including wiki/community)

and were followed by:

  • 73 other attendees (i.e. wiki users)
  • 149 dev8D community users (including wiki users)
  • 644 total users (including wiki/community)

Putting those figures into bar charts:
Changes in twitter friends of dev8D wiki users

Changes in twitter followers of dev8D wiki users

Source Data

Raw data is available in a sqlite3 database at:
dev8d_twitter_network_2010-03-03.db (or view one of the images above and pull the data out of the Google Charts query string)
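If you want to poke around in the database, a couple of lines of Python will show you what it contains (this sketch makes no assumptions about the schema):

```python
import sqlite3

# Open the downloaded database and list its tables and columns,
# since the schema isn't documented in this post.
con = sqlite3.connect("dev8d_twitter_network_2010-03-03.db")
for (table,) in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"):
    cols = [row[1] for row in con.execute("PRAGMA table_info(%s)" % table)]
    print(table, cols)
```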

Observations and Comments

Data oddities:

  • There’s some discrepancy between the wiki ‘follows’ and wiki ‘followed by’ numbers; the higher of the two is the true value. This is because I’m only looking at diffs in a person’s network, i.e. connections between dev8D attendees made before the event (or before signing up to the wiki) are ignored.
  • The numbers for 2010-02-22 are higher than expected because data collection failed for the two days preceding it, lumping three days of figures together (dividing that day’s numbers by 3 gives a more realistic estimate of changes for that day). I’m not sure why data collection failed: network problems? PC crash? A temporary twitter rate-limit ban? A script bug?

On average, each dev8D attendee on twitter gained roughly 6 followers over dev8D and the couple of days surrounding it (644 new followers across 113 accounts ≈ 5.7 each). I’d be interested in seeing how this compares to the average over the rest of the year.

Also of interest is the fact that dev8D was attended by 200 or so people per day (450 on the first day, which included a linked data meetup), but was mentioned by around 500 different people on twitter. Hopefully this is some evidence that news, outcomes and interest from the event reach far beyond its immediate participants.

Next step is to look at using GraphViz to produce some diagrams of the changes in network over time. Suggestions on visualisation for this are more than welcome – my network diagrams so far look like squashed spiders…

(continued in my next post)

Posted in dev8d, twitter.


Dev8D

Dave and I are now back from Dev8D, which was awesome, inspiring and exhausting. We asked the delegates to capture some of the outputs from the event on a wiki page and, as you can see, the ideas had and projects started are pretty impressive. The wiki (plus some emotional blackmail on my part to make people update it) seemed to work pretty well.

The semantic nature of the wiki was not as successful as we hoped, but it did make it possible to pull out information on talks and the like. We had real trouble working out how to link the programme data URIs to the wiki nodes, and it got tricky when things moved around. In retrospect I would recommend making IDs for programme sessions by day/timeslot, and nodes on the wiki by content, e.g. “Python_Lab”.

The community used the programme data to produce a few fun tools before the event.

I also realised that, given the semantic data in the wiki, I could add an “on now” listing to the front page for each location, and a current talk/next talk for the lightning talks room. It doesn’t work now the event is over, but feel free to steal ideas from our rather scary Semantic MediaWiki templates.

@samscam asked me to add lat/long data to the programme per location (he had already worked them out), which was easy enough, and @agdturner asked me to add geohashes, which were easy once I found a PHP lat/long-to-geohash library. Geohashes are quite interesting: a single value which describes a location on Earth, where more characters means higher resolution. Two locations with the same first 8 characters are near each other to a certain tolerance.
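If you fancy playing with geohashes yourself, here’s a minimal encoder. (I used a PHP library on the wiki; this Python version is just an illustrative sketch of the same algorithm, and the coordinates below are rough guesses at the Highfield campus, purely for demonstration.)

```python
# Minimal geohash encoder: interleave longitude/latitude bits,
# emitting one base32 character per 5 bits.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=8):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    out, ch, nbits, even = "", 0, 0, True
    while len(out) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid          # value is in the upper half
        else:
            rng[1] = mid          # value is in the lower half
        even = not even
        nbits += 1
        if nbits == 5:            # every 5 bits gives one base32 character
            out += BASE32[ch]
            ch, nbits = 0, 0
    return out

# Nearby points share a prefix; resolution grows with each character.
print(geohash(50.9373, -1.3957))  # Highfield campus (approx.)
print(geohash(50.9352, -1.3964))  # a few hundred metres away
```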

Anyhow that gave me an idea… (see next post)

Posted in Conference Website, dev8d, Templates, twitter.


The dangers of an indelible life

As part of my attempt at a hack to find deleted webpages from the Wayback Machine, I wrote a query to dbpedia (wikipedia made semantic-webby) asking for all current UK MPs, their DOB and the URL of their alma mater.
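For the curious, the query looked something like the sketch below, wrapped here in Python’s SPARQLWrapper library. This is illustrative rather than my actual code: dbo:birthDate and dbo:almaMater are real dbpedia ontology terms, but the category URI for current UK MPs (and whether it is linked via dct:subject) is an assumption you should check against dbpedia.

```python
# Sketch of the kind of dbpedia query described above.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?mp ?dob ?uni WHERE {
        # Category URI below is an assumption; check dbpedia for the real one.
        ?mp dct:subject <http://dbpedia.org/resource/Category:UK_MPs_2005-2010> .
        ?mp dbo:birthDate ?dob .
        ?mp dbo:almaMater ?uni .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dob"]["value"], row["uni"]["value"])
```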

Turns out that wasn’t very interesting as they were almost all too old to have been at university at the time of the first webpages.

UK MPs DOB (chart)

I ran the data through a spreadsheet and got a nice curve, at least. Clearly the next election will be a different story: looking at the graph above, the number of MPs who were at university after 1995 or so will shift from 20 to 60, so multiply 60 by the average number of candidates who stand for each seat.

Except that the Internet Archive only allows “prefix” searches, and they *currently* store URLs as http://www.soton.ac.uk/badger/ so I can search for an exact website by prefix, but not for *.soton.ac.uk. However, they are looking into storing and indexing URLs as uk.ac.soton.www/badger/ which would allow me to construct a magic search for “Gordon Brown” AND “pissed” on deleted pages from his university at the time he was there.
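To see why the reversed form helps: every page under *.soton.ac.uk then shares a common prefix, so a domain-wide search becomes a plain prefix match. A quick sketch of the transformation:

```python
from urllib.parse import urlsplit

def reverse_host_key(url):
    """Rewrite http://www.soton.ac.uk/badger/ as uk.ac.soton.www/badger/
    so that every URL under a domain shares a common prefix."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path

print(reverse_host_key("http://www.soton.ac.uk/badger/"))
# -> uk.ac.soton.www/badger/
# Everything ever archived under soton.ac.uk is then a prefix
# match on "uk.ac.soton."
```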

All in all this means that we’ll have to start accepting that teenagers do dumb stuff, and that the odd youtube movie or teenage angst blog post doesn’t make you bad leader material. (being squeaky clean your entire life implies a scary personality type!)

However, in the short term, we’ll be seeing some political casualties as investigative hackers go see what they can find.

So I say vote for leaders based on what they do and say as adults, not based on the mistakes they made in their teens now indelibly etched on the permascroll.

Posted in Internet Archive.



Tracking twitter for dev8d

In the run-up to Dev8D, there’s been a lot of twitter activity, which will hopefully continue throughout and after the event. I’m hoping that something useful can come out of it, so I’ve started running a couple of scripts based around this.

Capturing twitter posts

Firstly, I’m simply capturing all tweets which mention ‘dev8d’ using twitter’s search API.

I’ve written a small python script to do this, available here:
archive-twitter-search

It’s designed to be a quick solution to archiving twitter posts related to events (or anything with a keyword), with zero configuration/setup (you do still need to specify a search term though…).

Every 10 minutes, any new tweets mentioning dev8d are stored in a SQLite database, from which they should be easy enough to query or export to another format.
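The real script is at the link above; the core loop is roughly the simplified sketch below. This is not the script itself, and it assumes twitter’s JSON search API at search.twitter.com and its from_user/created_at fields:

```python
# Simplified sketch of the polling loop: fetch new matching tweets
# every 10 minutes and store them in SQLite.
import json, sqlite3, time, urllib.parse, urllib.request

db = sqlite3.connect("dev8d_tweets.db")
db.execute("CREATE TABLE IF NOT EXISTS tweets "
           "(id INTEGER PRIMARY KEY, user TEXT, created TEXT, text TEXT)")

since_id = 0
while True:
    qs = urllib.parse.urlencode({"q": "dev8d", "since_id": since_id, "rpp": 100})
    results = json.load(urllib.request.urlopen(
        "http://search.twitter.com/search.json?" + qs))
    for t in results["results"]:
        db.execute("INSERT OR IGNORE INTO tweets VALUES (?,?,?,?)",
                   (t["id"], t["from_user"], t["created_at"], t["text"]))
        since_id = max(since_id, t["id"])
    db.commit()
    time.sleep(600)  # poll every 10 minutes
```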

Capturing the twitter network

The second (and potentially more interesting) thing I’m logging is the change in the network of dev8d attendees on twitter.

I’m capturing all users with a twitter account who’ve registered on the Dev8D wiki, and any user who posted/received a tweet mentioning dev8d.

I’m then periodically (as often as twitter’s rate limit allows) getting the list of friends and followers for each user, and logging when any changes to these lists occur.
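The change-logging boils down to a set difference between successive snapshots of each user’s lists, along these lines (an illustrative sketch, not the actual script; all names are made up):

```python
# Diff one user's follower set against the previous snapshot and
# record additions/removals with a timestamp.
import time

def diff_snapshot(user, current_ids, previous_ids, log):
    """current_ids/previous_ids are sets of follower IDs; log is a list."""
    now = time.strftime("%Y-%m-%d %H:%M:%S")
    for fid in current_ids - previous_ids:
        log.append((now, user, fid, "added"))
    for fid in previous_ids - current_ids:
        log.append((now, user, fid, "removed"))

log = []
diff_snapshot("some_user", {1, 2, 3, 5}, {1, 2, 3, 4}, log)
print(log)  # one 'added' entry (5) and one 'removed' entry (4)
```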

By joining users and lists of friends/followers together, it should be possible to build up a picture of the connections between developers attending dev8d, before, during and after the event.

I’m hoping that over time (well, over dev8d at least), I’ll be able to see how the network of attendees changes, expands and grows together.

The data will all be freely available (upon request for now – I’m running the scripts on a desktop PC), but I’m planning to add the data to data.dev8d.org after the event.

Posted in SQL, twitter.



Graphite: PHP Library for hacking linked data

I really wanted a library to let me do really quick and fun things with linked data, and I couldn’t find one, so I wrote one.

Of course, two thirds of the way through, some twitter buddies sent me links to similar work, one of which was by an ex-member of ECS. I’ve made mine use the same function names, because there’s no good reason not to.

Try it out and let me know what you think: http://lemur.ecs.soton.ac.uk/~cjg/Graphite/ (I’ll find a more long term home for it after #Dev8D)

Posted in Uncategorized.


New BBC website style

I’ve just read this extraordinary post about the BBC website.

http://www.bbc.co.uk/blogs/bbcinternet/2010/02/a_new_global_visual_language_f.html

It sounds like they are embarking on what should be an impossible task. It looks like they are going to achieve that task, with style and class. (Sorry, CSS puns seem funny before I’ve had coffee.)

Posted in Uncategorized.


Too much mail!

I get too much email. I’ve got to the point where I just accept vast numbers of irrelevant emails and delete them on sight, but it’s now at the stage where it seems worth putting in the effort to cut this down. Each one is just one more drop in an ocean; however, the levees are looking like they may break, so from midnight to midnight on the 16th I didn’t delete any emails, so that I could take some action…

I’m doing a combination of stricter filters, moving some alerts from daily to weekly, and unsubscribing from stuff and companies I don’t care about. The number in brackets is the approximate number of emails saved per day (including weekends):

  • Facebook – I’ve removed myself from all the groups which send me regular updates I just don’t care about. I’m thinking of de-friending the 3 or 4 people who invite me to more than one irrelevant thing a week. (-1)
  • “is now following you on twitter” emails – I have a whole heap of twitter accounts for work-related things, so these are utterly uninteresting to me; they are being filtered into the bin from today! (-3)
  • Google alerts – 2 for my name which are useful, 2 for the university which are useful, and one for “eprints” which has become noise due to the vast number of pages with the word in. Solution: combine the ones for my name into a single search for “Christopher Gutteridge” OR “Chris Gutteridge”, do the same for “University of Southampton” OR “Southampton University”, and lose the eprints web search (but keep the low-traffic news search for eprints). (-3)
  • Mailman – I admin about 8 lists, and some of the daily messages and “message discarded” messages are just irrelevant, so they’re going in the “trivia” folder. (-5)
  • Companies, organisations, techrepublic blogs and online services sent me around 16 messages yesterday: 5 I kept, 2 I shifted to weekly emails, and 9 I unsubscribed from (which was annoying, as I had to get password reminder emails etc. for some sites). (-10)
  • I got 7 Nagios alerts, of which 5 are not relevant to me. I built our monitoring system, so currently get all mails, but this is getting silly. I’m going to stop myself getting mails for systems I know nothing about. (-5)
  • Cron emails & “Logwatch” emails – these are potentially useful in a crisis, so I’m going to file them in a side folder rather than delete them. (-11)

So that’s an estimated 38 fewer emails a day, which is about a 25% reduction, saving me maybe 10-15 minutes every day (including Saturday and Sunday). This doesn’t sound like much, but a back-of-the-envelope calculation says it will save me over 50 hours a year, and that’s 50 really, really boring hours of hitting “delete”! And about 15,000 fewer keypresses means less RSI too!

Posted in Uncategorized.



Replacing MySQL with a triple-store

Some academics have urged me to consider using a triple-store as a back-end for some of our websites, as opposed to our normal MySQL approach. I’m not convinced, but it’s an interesting challenge. I started by looking at the common “patterns” in which we use SQL, which we would either need to replicate in RDF or avoid by changing how we approach the problems in the first place.

Our usual MySQL “patterns”

  1. Create a record from <object>
We create a record, which is in effect a serialisation of an object. Most often it represents a human, an account, an event, an organisation or an article (text + metadata). We use the database to generate a unique key for the item in the current context, generally an integer. In MySQL we use AUTO_INCREMENT for this, but every SQL DB has a variant.
  2. Delete record with <ID>
  3. Update record with <ID> to match <object>
  4. Retrieve record with <ID>
  5. Find/retrieve records matching <criteria>

Update can reasonably be abstracted to “delete then create”, so let’s ignore it.

“Find” and “retrieve” require some new techniques, but are not a big concern.

My current understanding is that when adding a set of triples you can say that they are all part of a “graph” with URI <x>, and later you can remove or replace all the triples in graph <x>.
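In other words, “update record” becomes “drop graph <x>, then re-insert”. A minimal sketch, assuming a SPARQL 1.1-style Update endpoint at a made-up URL (the example triple is also made up):

```python
import urllib.parse, urllib.request

UPDATE_URL = "http://dbserver:8890/update"   # hypothetical endpoint

def replace_record(graph_uri, ntriples):
    """Delete-then-create: drop the record's graph and re-insert it."""
    update = ("DROP SILENT GRAPH <{0}> ;\n"
              "INSERT DATA {{ GRAPH <{0}> {{ {1} }} }}").format(graph_uri, ntriples)
    data = urllib.parse.urlencode({"update": update}).encode()
    urllib.request.urlopen(UPDATE_URL, data)

# e.g. re-serialise person 6 after an edit (triple content is made up):
replace_record("http://webscience.org/person/6",
               '<http://webscience.org/person/6> '
               '<http://xmlns.com/foaf/0.1/name> "Fred Bloggs" .')
```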

The one thing entirely missing is the ability to generate new integers in a sequence.

I’ve been given two suggested solutions by the experts…

UUID

Suggestion one: use UUIDs (universally unique IDs) or hashes. The problem is that I want to use these in URLs and URIs, and I want http://webscience.org/person/6.html, not http://webscience.org/person/e530d212-0ff1-11df-8660-003048bd79d6.html

Flock a file

The second suggestion was to flock a local file containing the next value (lock file, read file, update file, unlock file). This would work, but I want the current position in each sequence to be stored with my other data, and accessed/modified using the same access rights as the rest of the triple store. That’s what I’m used to with MySQL.
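For reference, the flock approach is only a few lines in Python. A sketch of the suggestion as I understand it (the counter file must already exist, e.g. containing “0”):

```python
import fcntl

def next_in_sequence(path):
    """Lock file, read current value, write value+1, unlock (via close)."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the exclusive lock
        value = int(f.read().strip() or "0") + 1
        f.seek(0)
        f.truncate()
        f.write(str(value))
    return value                        # lock released when the file closes
```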

My Idea 1: Sequence Service

My first idea was to create a stand-alone service which could run on the same server as the triple store, and which you could query, via HTTP or the command line, for the next integer in a sequence. Sequences could be identified via a URI:

http://dbserver:12345/next_in_sequence?seq=http://webscience.org/people_sequence

This would return “23”, then “24”, and so on. The locking code could be handled in the sequence server, and the assumption would be that only trusted parties could connect (as with SPARQL). This service could work by the following steps (a rough sketch in code follows the list):

  • Locking (all requests processed sequentially)
  • Querying the triple store for <http://webscience.org/people_sequence> <currentValue> ?value
  • Replacing the triple with ?value+1
  • Unlocking
  • Returning ?value+1
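A rough sketch of those steps in Python, assuming a SPARQL endpoint that accepts queries at /sparql and SPARQL 1.1-style updates at /update (both URLs, and the currentValue predicate, are made up for illustration):

```python
import json, threading, urllib.parse, urllib.request

LOCK = threading.Lock()
QUERY_URL  = "http://dbserver/sparql"            # hypothetical endpoints
UPDATE_URL = "http://dbserver/update"
PRED = "<http://example.org/ns/currentValue>"    # made-up predicate

def post(url, key, text):
    req = urllib.request.Request(
        url, urllib.parse.urlencode({key: text}).encode(),
        {"Accept": "application/sparql-results+json"})
    return urllib.request.urlopen(req).read()

def next_in_sequence(seq_uri):
    with LOCK:  # all requests processed sequentially
        res = json.loads(post(QUERY_URL, "query",
              "SELECT ?v WHERE { <%s> %s ?v }" % (seq_uri, PRED)))
        rows = res["results"]["bindings"]
        value = int(rows[0]["v"]["value"]) if rows else 0
        # Replace the triple with value+1, then return value+1.
        post(UPDATE_URL, "update",
             "DELETE WHERE { <%s> %s ?v };\n"
             "INSERT DATA { <%s> %s %d }"
             % (seq_uri, PRED, seq_uri, PRED, value + 1))
        return value + 1
```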

While this is a bit icky, it does mean that my data remains stored in one place, including the state of each sequence.

What this doesn’t do is provide a single access point. All SQL implementations provide a solution for sequences, and I suspect that, long term, so will triple stores. But I can’t see the purists liking sequences going through the same access interface as queries, as it’s clearly a hack.

Non technical concerns with RDF back-ends

On a non-technical note, I’m also concerned that an RDF+PHP solution is not very maintainable. You can’t easily hire someone with these skills yet.

Posted in Best Practice, Database, RDF, SQL.


iPad

I figured I’d jump on the bandwagon and put up a post about the Apple iPad launch today.

I think it’s great: Apple can iron out some of the kinks, so that by the time a sensible, affordable and open bit of tablet hardware is available, they’ll know what mistakes to avoid.

Given that I own a laptop and a mobile (Android G1) I really don’t see quite where the iPad fits into my gadget ecology. I think I’d use it for reading stuff on the way into work and at lunch. Not worth the price tag to me, and in the current economic climate, there’s not much chance of convincing someone that I need one “to check our website renders OK on it”.

The Apple Store UK currently has a piccy of an iMac with the (reimagined) USS Enterprise on it, and the iPad does get a huge plus for the fact that it reminds me of the pads Kirk had to sign in Star Trek, but I think I’d prefer a touchscreen on my laptop. I went from a Psion 5 to a laptop, and it took some time to adjust to the non-touch screen.

Apple aren’t dumb, but I don’t think this first iPad is going to change the way people use mobile devices. But I think it’s the vanguard of more affordable, and open, cousins which will end up lying around on coffee tables 5 years from now.

I think the moral of this story is don’t stop designing websites to fit on small screens.

UPDATE:
Having slept on it, and rechecked the dimensions (24 × 19 × 1.3 cm, 0.7 kg), I realise it will have about the shape of a clipboard. I can see it being rapidly adopted by people who need to be very organised while looking classy: wedding planners, maîtres d’, ensigns on starship bridges…

They won’t fit in your pocket; they are about 2 cm wider and taller than a modern hardback book. As Nick Humfrey (@njh) tweeted yesterday, “Apple iPad = win for manbags?”

Posted in Uncategorized.



MediaWiki Customisations for Dev8D

I’ve been working on setting up a heavily customised MediaWiki for JISC’s Dev8D event at http://wiki.2010.dev8d.org/.

This led to setting up and developing a bunch of features I wanted to see in the wiki (and re-use on others), since I couldn’t find any suitable extensions/plugins to do the job.

I’ll put implementation details up in separate posts in the near future.

New features:

  1. Login integration with twitter – Users can log in to the wiki either by creating a wiki account as normal, or by automatically creating an account/login using their twitter credentials.

    Recent events (especially JISC ones) have been fairly twitter-heavy (twitter announcements by organisers, twitter walls, etc.), so this seemed like a better unified login solution than OpenID for this particular event (though supporting both would be great if I had more time).

  2. Twitter feeds on wiki pages – any wiki page can have a box on the right hand side which displays results from twitter searches, and refreshes periodically (example up on the wiki front page).

    Tags have been defined for each area of the event, and tweets made during the event should show up on the appropriate wiki pages.

    The searches are performed client-side, to avoid any possible rate-limiting problems.

  3. Scripted updates of wiki pages – used to regenerate parts of the wiki from the event programme (and more importantly, to let me edit wiki pages using vi…).

    I’ve set up a number of pages with comments used as delimiters. The script then updates only the text within those delimiters, allowing the rest of the page to be edited like any other on the wiki (see the sketch after this list).

    For example, the page on the Python Lab is built dynamically, and changes whenever the programme does.
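The delimiter mechanism itself is tiny: a regular-expression substitution over the page text. A sketch with made-up marker names (in practice the updated text is pushed back through MediaWiki’s api.php):

```python
import re

# Marker names are illustrative; on the real wiki they vary per page/section.
BEGIN, END = "<!-- BEGIN GENERATED -->", "<!-- END GENERATED -->"

def update_section(page_text, new_body):
    """Replace only the text between the marker comments."""
    pattern = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)
    return pattern.sub(lambda m: BEGIN + "\n" + new_body + "\n" + END, page_text)

page = ("Hand-written intro.\n"
        "<!-- BEGIN GENERATED -->\nold programme data\n<!-- END GENERATED -->\n"
        "Hand-written outro.")
print(update_section(page, "10:00 Python Lab (room 3)"))
```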

As well as the above, off-the-shelf functionality includes the Semantic MediaWiki extension (used to mark up sessions, talks and users), embedded vimeo videos, code highlighting, and reCAPTCHA for spam prevention.

Posted in Conference Website, PHP.
