Recently I’ve been looking at ways to gather all the RSS feeds from the University’s web presence and compile them into one big list for later use. This is harder than it sounds, since the University website is a large beast with hundreds of subdomains.
I’ve been writing a simple web spider using nothing but bash and standard command line tools, built around wget’s -r flag, which recursively downloads every file under a given domain. The spider works through each subdomain it discovers and fetches its files too. After finishing a subdomain it deletes all files over 1 MB to save space, since it’s unlikely that any web page will be that large, and pages are all we’re interested in. Finally, it searches the remaining files for RSS tags and prints the paths of the files containing them.
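As a rough illustration, the loop described above might look something like the sketch below. This is my own reconstruction, not the author’s actual script: the file names `domains.txt` and `feeds.txt` are assumptions, the queue has to be seeded by hand, and the step that appends newly discovered subdomains back onto the queue is left out for brevity.

```shell
#!/bin/sh
# Hypothetical sketch of the crawl-and-scan loop. Seed the queue first, e.g.:
#   echo www.example.ac.uk > domains.txt
touch domains.txt
mkdir -p crawl
cd crawl

while read -r domain; do
    # Mirror the subdomain; -r recurses, -np stays below the start
    # point, -q keeps wget quiet. Ignore per-domain failures.
    wget -r -np -q "http://$domain" || true

    # Delete anything over 1 MB -- the pages we care about are smaller.
    find . -type f -size +1M -delete
done < ../domains.txt

# Scan whatever survived for RSS <link> tags and record the file paths.
grep -rl 'application/rss+xml' . > ../feeds.txt 2>/dev/null || true
```

Grepping for the `application/rss+xml` MIME type (rather than the literal string "rss") is one way to catch the `<link rel="alternate">` tags that pages use to advertise their feeds, though it will miss feeds that are only linked as plain anchors.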
Currently it’s busy downloading everything, which I imagine will take a long while. Oh well.