{"id":1468,"date":"2016-07-26T14:31:38","date_gmt":"2016-07-26T14:31:38","guid":{"rendered":"http:\/\/blog.soton.ac.uk\/webteam\/?p=1468"},"modified":"2016-07-29T08:03:28","modified_gmt":"2016-07-29T08:03:28","slug":"open-data-internship-open-data-pipelines","status":"publish","type":"post","link":"https:\/\/blog.soton.ac.uk\/webteam\/2016\/07\/26\/open-data-internship-open-data-pipelines\/","title":{"rendered":"Open Data Internship: Open Data Pipelines"},"content":{"rendered":"<p>I&#8217;ve spent this week reorganising the folders in My Documents. &#8220;But Callum!&#8221; you might say, &#8220;Isn&#8217;t that a complete waste of a week?&#8221;. Perhaps for some. In reality, I&#8217;ve been working towards creating an Open Data Pipeline.<\/p>\n<p>Open Data Pipeline is a term I just created, and I think it refers to something like this:<\/p>\n<p><a href=\"http:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/OpenDataPipeline.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-1472\" src=\"http:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/OpenDataPipeline-300x204.png\" alt=\"OpenDataPipeline\" width=\"300\" height=\"204\" srcset=\"https:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/OpenDataPipeline-300x204.png 300w, https:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/OpenDataPipeline.png 470w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>In this post, I&#8217;m going to outline the pipeline I&#8217;ve created so far and the lessons I&#8217;ve learned in making it.<\/p>\n<p><strong>The pipeline so far<\/strong><\/p>\n<p>So far, the pipeline has four key stages:<\/p>\n<p><em>Gathering<\/em> &#8211; This is gathering the raw data using a system. This could just be pen and paper. In my case, I&#8217;m using a homegrown tool called OpenGather. It&#8217;s a web application designed for gathering categorised open data. You can input data, record GPS locations and send the data for remote storage. 
This database then exports a CSV file.<\/p>\n<p><em>Storage<\/em> &#8211; This is the long-term storage for data. Currently, I&#8217;m using an Excel (or Calc) Workbook for text and numeric data. It has a sheet each for data gathered by the tool and for data I&#8217;ve had to enter by hand. A folder of images is kept for each category, for example &#8220;Buildings&#8221; or &#8220;Portals&#8221;. Long term, the folder hierarchy and exact data format still need work.<\/p>\n<p><em>Processing<\/em> &#8211; This is taking the stored data and converting it into a format ready for publishing. To process the output from OpenGather, I use another homegrown tool. As yet unnamed, it reformats the data as a CSV file, marking any missing data for completion. Optionally, it attempts to use timestamps to match up data with the images taken using the camera.<\/p>\n<p><em>Publishing<\/em> &#8211; This is the act of making the data available to the public. To do this, I hand a USB stick with the data on to Ash. Occasionally I have to copy CSV data to a Google Document. 
The rest of the open data service takes over from here!<\/p>\n<div id=\"attachment_1475\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/EXCEL_2016-07-26_15-32-37.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1475\" class=\"size-medium wp-image-1475\" src=\"http:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/EXCEL_2016-07-26_15-32-37-300x192.png\" alt=\"One of the Excel sheets used for long term storage\" width=\"300\" height=\"192\" srcset=\"https:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/EXCEL_2016-07-26_15-32-37-300x192.png 300w, https:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/EXCEL_2016-07-26_15-32-37-768x492.png 768w, https:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/EXCEL_2016-07-26_15-32-37-1024x656.png 1024w, https:\/\/blog.soton.ac.uk\/webteam\/files\/2016\/07\/EXCEL_2016-07-26_15-32-37.png 1205w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-1475\" class=\"wp-caption-text\">One of the Excel sheets used for long term storage<\/p><\/div>\n<p>With that said, here are some of the things to do and avoid when building a pipeline:<\/p>\n<p><strong>Things to do<\/strong><\/p>\n<ul>\n<li><em>Use an appropriate set of input fields for each category &#8211;<\/em> I originally had the fields &#8220;Timestamp, Tag, Category, Latitude, Longitude, Accuracy&#8221; for everything. I found that for some objects, such as Images, I could throw away the geo data. For others, I needed extra data I later had to enter by hand.<\/li>\n<li><em>Be consistent in your data gathering process<\/em> &#8211; For example, knowing that an image is always taken after the data is entered is extremely useful. It can be used to later infer information you&#8217;ve forgotten, lost or never had.<\/li>\n<li><em>Keep a backup of your data<\/em> &#8211; You never know when your tools will delete or corrupt an important file. 
Excel did this to me more than once!<\/li>\n<li><em>Be thorough in gathering your data &#8211;<\/em> Gathering too much data and throwing it away is far easier than needing it and not having it.<\/li>\n<li><em>Challenge your assumptions and provide for corner cases &#8211;<\/em> I guarantee making assumptions about the properties of a data type will come back to bite you.<\/li>\n<li><em>Get some data up &#8211;<\/em> Even if it&#8217;s just one entry, it&#8217;s a great feeling getting it hosted for the world to see.<\/li>\n<li><em>Process the data as a single, large batch &#8211;<\/em> Removing the need to repeatedly process different chunks of data will save time in the long run.<\/li>\n<\/ul>\n<p><strong>Things to avoid<\/strong><\/p>\n<ul>\n<li><em>Taking photos in many formats &#8211;<\/em> Some cameras take both RAW and JPG. This makes storing the files and matching images to data entries that much harder. Use a single image format and convert it as you need it.<\/li>\n<li><em>Directly using Geolocation data &#8211;<\/em> For most types of object, I&#8217;ve found GPS accuracy to be too low (6m radius at best). I used a clickable map to get accurate data, using the GPS to roughly centre it. If nothing else, precise data adds a level of professionalism.<\/li>\n<li><em>Using Excel for CSV files &#8211;<\/em> If you do, format it all as &#8220;text&#8221;. Otherwise, Excel is fond of re-formatting your data to be less accurate when you save it back to CSV.<\/li>\n<\/ul>\n<p><strong>So where am I going from here?<\/strong><\/p>\n<p>My upcoming ideas for the OpenGather tool involve:<\/p>\n<ul>\n<li>Using different input fields for each data type. This should make processing the data more accurate.<\/li>\n<li>Providing the option to submit data to iSolutions via Serviceline. This should allow each contribution from students to be reviewed.<\/li>\n<\/ul>\n<p>In the longer term, I&#8217;m looking to:<\/p>\n<ul>\n<li>Start work on an Open Data link validator. 
This tool will detect broken URIs and URLs, flagging them for correction.<\/li>\n<li>Start building maps of Union facilities ready for use during bunfight.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve spent this week reorganising the folders in My Documents. &#8220;But Callum!&#8221; you might say, &#8220;Isn&#8217;t that a complete waste of a week?&#8221;. Perhaps for some. In reality, I&#8217;ve been working towards creating an Open Data Pipeline. Open Data Pipeline is a term I just created, and I think it refers to something like this: [&hellip;]<\/p>\n","protected":false},"author":98708,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[352,4224,30036],"tags":[438766,803381,803380,803385,6506,803378,803372,803379,20554,65462,74,803384],"class_list":["post-1468","post","type-post","status-publish","format-standard","hentry","category-data","category-geo-2","category-open-source","tag-advice","tag-data-storage","tag-gathering-data","tag-not-to-do","tag-open-data","tag-open-data-pipeline","tag-opengather","tag-pipeline","tag-processing","tag-publishing","tag-tips","tag-todo"],"_links":{"self":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1468","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/users\/98708"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/comments?post=1468"}],"version-history":[{"count":4,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1468\/revisions"}],"predecessor-version":[{"id":1476,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1468\/revisi
ons\/1476"}],"wp:attachment":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/media?parent=1468"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/categories?post=1468"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/tags?post=1468"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}