{"id":222,"date":"2011-07-11T11:06:09","date_gmt":"2011-07-11T10:06:09","guid":{"rendered":"http:\/\/blog.soton.ac.uk\/oneshare\/?p=222"},"modified":"2011-08-27T09:06:35","modified_gmt":"2011-08-27T08:06:35","slug":"campusroar-a-simple-python-crawler","status":"publish","type":"post","link":"https:\/\/blog.soton.ac.uk\/oneshare\/2011\/07\/11\/campusroar-a-simple-python-crawler\/","title":{"rendered":"CampusROAR: A Simple Python Crawler"},"content":{"rendered":"<p>Over the last week I&#8217;ve been building a simple Python web crawler for grabbing RSS feeds from a domain.\u00a0 The one I built in Bash using command line tools (namely wget) has several downsides which needed fixing so I started work on a customized Python one.\u00a0 Wget did not allow filtering of pages by HTTPHeader\/mimetype, and while it did allow filtering by file extension, many of the files on the domain I was scraping were extensionless so couldn&#8217;t be filtered.\u00a0 There was also no way to filter by filesize &#8211; so I couldn&#8217;t restrict it to smaller files instead which would have been an ok solution.\u00a0 Instead the crawler had to download huge files on occasion which either had no file extension or was one I had not thought to blacklist.<\/p>\n<p>The Python crawler I created had several advantages therefore.\u00a0 It first checks the HTTPHeader, and only requests the full page if it&#8217;s HTML or XML (to either get all the links off of, or to check for RSS in the case of XML).\u00a0 It then parses and processes the full page and gets the links from it, or if it&#8217;s an RSS or ATOM feed, adds it to a list of them.\u00a0 You have to give it a starting page and a domain restriction &#8211; it starts crawling from one and cannot leave the given domain.\u00a0 If you did not give it a domain restriction it would probably continue forever (or at least until you had spidered the whole internet).<\/p>\n<p><!--more--><\/p>\n<p>crawl.py<\/p>\n<pre>import config\r\nimport 
urllib2\r\nimport feedparser\r\nimport lxml.html\r\nimport urlparse\r\nimport datetime\r\n\r\n# Globals\r\nignore = []\r\nfeeds = []\r\nurllist = set()\r\noutput = \"feeds.txt\"\r\n\r\ndef main():\r\n\r\n    url = config.START_URL\r\n\r\n    FILE = open(output, \"a\")\r\n    FILE.write(\"CRAWL STARTED ON: \" + str(datetime.datetime.today()) + \"\\n\")\r\n    FILE.close()\r\n\r\n    process_url(url)\r\n\r\n    # Pop the next unvisited url and process it until none remain.\r\n    while len(urllist) &gt; 0:\r\n        url = urllist.pop()\r\n        process_url(url)\r\n\r\n    FILE = open(output, \"a\")\r\n    FILE.write(\"END OF CRAWL.\\n\")\r\n    FILE.close()\r\n\r\n    print \"URLs: \", urllist\r\n    print \"Feeds: \", feeds\r\n    print \"Ignore: \", ignore\r\n\r\nclass HeadRequest(urllib2.Request):\r\n    ## Custom Request subclass for urllib2 to fetch only the page headers.\r\n    def get_method(self):\r\n        return \"HEAD\"\r\n\r\ndef getPageMime(url):\r\n\r\n    #########################################\r\n    # Determines the mimetype of the given url.\r\n    #########################################\r\n\r\n    try:\r\n        response = urllib2.urlopen(HeadRequest(url), timeout=20)\r\n        content = response.info()[\"content-type\"]\r\n        contents = content.split(\";\")\r\n        return contents[0]\r\n    except Exception:\r\n        # Network error, timeout or missing Content-Type header.\r\n        return None\r\n\r\ndef getPageAndParse(url, contenttype):\r\n\r\n    #########################################\r\n    # Takes a url and mimetype and parses, returning a list of valid urls in the page,\r\n    # and a flag which is True if the page is an RSS or Atom feed and False otherwise.\r\n    #########################################\r\n\r\n    response = urllib2.urlopen(url)\r\n    page = response.read()\r\n    address = response.url\r\n\r\n    feed = False\r\n    if (contenttype == \"application\/rss+xml\" or contenttype == \"application\/atom+xml\" or contenttype == \"application\/xhtml+xml\"):\r\n        feed = parseforRSS(page, address)\r\n\r\n    urls = []\r\n    if (contenttype == \"text\/html\"):\r\n        urls = parseforURLs(page, address)\r\n\r\n    return (feed, urls)\r\n\r\n## Parse page for feed markup and return True if found.\r\ndef parseforRSS(page, 
address):\r\n\r\n    #########################################\r\n    # Takes a page and its url, returns True if the page is an RSS or Atom feed.\r\n    #########################################\r\n\r\n    if (feedparser.parse(page).version):\r\n        return True\r\n    return False\r\n\r\ndef parseforURLs(page, address):\r\n\r\n    #########################################\r\n    # Takes a page and its url, returns a list of absolute urls linked to on the page.\r\n    #########################################\r\n\r\n    try:\r\n        webpage = lxml.html.fromstring(page)\r\n        urls = webpage.xpath('\/\/a\/@href')\r\n        validurls = []\r\n        for item in urls:\r\n            absolute = urlparse.urljoin(address, item)\r\n            if absolute.startswith('http'):\r\n                validurls.append(absolute)\r\n        return validurls\r\n    except lxml.etree.XMLSyntaxError:\r\n        return []\r\n\r\ndef process_url(url):\r\n\r\n    #########################################\r\n    # Processes a url, recording it if it is a feed or queueing its links otherwise.\r\n    #########################################\r\n\r\n    print \"Processing url: \", url\r\n\r\n    ignore.append(url)\r\n    mime = getPageMime(url)\r\n    if (mime == \"application\/rss+xml\" or mime == \"application\/atom+xml\" or mime == \"application\/xhtml+xml\" or mime == \"text\/html\"):\r\n        result = getPageAndParse(url, mime)\r\n        if result[0]:\r\n            if (config.DOMAIN in urlparse.urlparse(url).netloc):\r\n                feeds.append(url)\r\n                FILE = open(output, \"a\")\r\n                FILE.write(url + \"\\n\")\r\n                FILE.close()\r\n        else:\r\n            for item in result[1]:\r\n                if item not in ignore:\r\n                    if (config.DOMAIN in urlparse.urlparse(item).netloc):\r\n                        urllist.add(item)\r\n\r\nif __name__ == \"__main__\":\r\n    main()<\/pre>\n<p>config.py<\/p>\n<pre>## Settings for RSSCrawl web crawler.\r\n\r\nCRAWLER_NAME = 'RSSCrawl'\r\n\r\nSTART_URL = 'http:\/\/www.ecs.soton.ac.uk'\r\nDOMAIN = 'www.ecs.soton.ac.uk'<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Over the last week I&#8217;ve been building a simple Python web crawler for grabbing RSS feeds from a domain.\u00a0 The one I built in Bash using command line 
tools (namely wget) has several downsides which needed fixing so I started &hellip;<\/p>\n<p class=\"read-more\"> <a class=\"more-link\" href=\"https:\/\/blog.soton.ac.uk\/oneshare\/2011\/07\/11\/campusroar-a-simple-python-crawler\/\"> <span class=\"screen-reader-text\">CampusROAR: A Simple Python Crawler<\/span> Read More &raquo;<\/a><\/p>\n","protected":false},"author":188,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[4013],"class_list":["post-222","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-campusroar"],"_links":{"self":[{"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/posts\/222","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/users\/188"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/comments?post=222"}],"version-history":[{"count":4,"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/posts\/222\/revisions"}],"predecessor-version":[{"id":313,"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/posts\/222\/revisions\/313"}],"wp:attachment":[{"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/media?parent=222"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/categories?post=222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/oneshare\/wp-json\/wp\/v2\/tags?post=222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}