Although RSS scraping (or feed scraping) sounds like some sort of especially unpleasant dental procedure, it is in fact an indispensable trick for accessing regularly-updated content. This post describes what it is, why you might want to do it, and gives an example of a service that makes feed scraping easy.
RSS (Really Simple Syndication) is, as I’ve argued before, the key technology for quickly keeping up with online content. (See also George on RSS and online services, Lincoln Mullen on RSS and academic journals, and Mark on RSS and your library’s catalog.) When sites publish RSS feeds, you can read their content more quickly–and usually can do more things (save, share, etc.) with that content–in a news aggregator. (Amy’s written on the timesaving powers of Google Reader, and the comments to this post round up a healthy number of alternatives. I personally use NetNewsWire on my desktops, and Reeder on my iPad.)
RSS is near-ubiquitous on blogs and news sites, but every once in a while, you’ll find a site that 1) provides regularly-updated information that you want to check frequently, but 2) doesn’t provide an RSS feed. When that’s the case, what you want to do is “scrape” the content of the page into a feed, which you can then subscribe to in your aggregator. There is a shadier use of the term RSS scraping, which refers to sites that exist to republish the content of others, larded up with ads and malware. You probably won’t find tips about that at ProfHacker.
Best of all, feed scraping is pretty easy to do. There are a variety of services out there that support it, but one of the easiest to use is Feed43. If all you want to do is to set up a single feed for a site, you don’t even need to register. Registration is free and makes it easy to edit feeds you’ve set up, and you can also pay for more frequent updates if you want your feeds refreshed more often.
In the rest of the post, I’m going to show how this works. My example is going to come from EPL Talk, a website and podcast for fans of Premier League soccer. Now, EPL Talk provides an RSS feed (free in snippet view, or full-text for paid subscribers), and also pushes news through their Twitter account. But it also maintains, as a separate resource, a Premier League TV Schedule for US Viewers. This page isn’t in the main RSS feed, but it’s exactly the sort of information I want to arrive regularly, so I can put must-see games into my calendar, or set them to record. There would, after all, be hell to pay if the 8-year-old missed a Man City-Liverpool game.
(If you need all of your examples to be about strictly academic topics, you can see William J. Turkel’s how-to on web scraping, which is where I learned about Feed43 a few years back. But the principles are the same.)
If you’re planning to scrape a feed, you need to find a pattern. Here’s what the relevant bit of the EPL Talk TV schedule looks like:
(There’s some other text on the page, but we’ll ignore it for now.)
As you can see, this page has many elements–ads, a comments feed, a Twitter feed, and more. What I need to do is to identify a repeatable structure that Feed43 can use.
To make the feed, go to the Feed43 website, and click “Create your feed.” It will ask you to specify the “source page address,” which is the page you want to scrape–in this instance, http://www.epltalk.com/premier-league-tv-schedule/. Click “Reload.” (A bit confusing, because you haven’t loaded the page yet, but it’s right.)
Feed43 then shows you the source code for the page you’re scraping. In this instance, it takes a bit to figure out what the most basic structure I can search for is, but here it is:
EPL Talk bolds the date, and then gives any games televised on that day in an unordered list. That’s what I need Feed43 to search for, and the search needs to include both the date and the list in order to be useful.
Feed43 uses a wildcard that lets you search for anything between various tags. This search, in the “Item (Repeatable) Search Pattern” box, will grab what we need:
This tells Feed43 to look for a paragraph set off in bold (the date) followed by an unordered list. Because it’s an “Item Search Pattern,” Feed43 knows to grab all the instances of this formatting on the page.
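If it helps to see the same idea outside of Feed43, here is a minimal Python sketch of what a repeatable search pattern does. It is not Feed43's own syntax or implementation. The sample markup, the `<strong>` tag for the bold date, the placeholder team names, and the names `SAMPLE_HTML`, `ITEM_PATTERN`, and `extract_items` are all assumptions made up for illustration.

```python
import re

# Hypothetical markup, standing in for the real page source: a bolded date
# paragraph followed by an unordered list of that day's games.
SAMPLE_HTML = """
<p><strong>Saturday, October 1</strong></p>
<ul>
<li>Team A vs. Team B, 10am ET, Channel 1</li>
<li>Team C vs. Team D, 12:30pm ET, Channel 2</li>
</ul>
<p>Some unrelated text that the pattern should skip.</p>
<p><strong>Sunday, October 2</strong></p>
<ul>
<li>Team E vs. Team F, 11am ET, Channel 1</li>
</ul>
"""

# Two capture groups play the role of the two wildcards:
# group 1 is the date, group 2 is everything inside the list.
ITEM_PATTERN = re.compile(
    r"<p><strong>(.*?)</strong></p>\s*<ul>(.*?)</ul>",
    re.DOTALL,
)


def extract_items(html):
    """Return one (date, games) tuple for every date-plus-list block found."""
    items = []
    for date, list_html in ITEM_PATTERN.findall(html):
        games = [g.strip() for g in re.findall(r"<li>(.*?)</li>", list_html, re.DOTALL)]
        items.append((date.strip(), games))
    return items


items = extract_items(SAMPLE_HTML)
for date, games in items:
    print(date, games)
```

The pattern is "repeatable" in the same sense as Feed43's: `findall` returns every non-overlapping match on the page, so each bolded date and its list becomes one feed item, and the unrelated paragraph in between is ignored.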
When you click “Extract,” Feed43 shows you a preview of your feed:
(%1) and (%2) are labels for Feed43 that let you format the results. Formatting is in fact the last step: You give the feed a title, link, and description (which Feed43 autopopulates from the source page), and then format your feed. In this instance, I like the date-games format, so I put them in the Item Content Template field like so:
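For the curious, the formatting step can be sketched in Python as filling a template with the two captured values and wrapping the results in a bare-bones RSS 2.0 document. This is only an illustration of the idea, not how Feed43 works internally. The captured rows, the feed description, and the names `ITEM_TITLE`, `ITEM_CONTENT`, and `build_feed` are hypothetical.

```python
from string import Template
from xml.etree import ElementTree as ET

# Hypothetical captured values: (date, list markup) pairs, standing in for
# the two labeled results from the extraction step.
captured = [
    ("Saturday, October 1", "<li>Team A vs. Team B</li>"),
    ("Sunday, October 2", "<li>Team C vs. Team D</li>"),
]

# Item templates: the date becomes the title, and the content shows the
# date followed by the games, mirroring the date-games format.
ITEM_TITLE = Template("$date")
ITEM_CONTENT = Template("<p><b>$date</b></p><ul>$games</ul>")


def build_feed(title, link, description, rows):
    """Build a minimal RSS 2.0 document with one item per captured row."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for date, games in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ITEM_TITLE.substitute(date=date)
        # ElementTree escapes the HTML so it travels safely inside the XML.
        ET.SubElement(item, "description").text = ITEM_CONTENT.substitute(
            date=date, games=games
        )
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed(
    "Premier League TV Schedule for US Viewers",
    "http://www.epltalk.com/premier-league-tv-schedule/",
    "Scraped schedule (illustrative description)",
    captured,
)
print(feed_xml)
```

An aggregator reading this output would show one entry per date, with that day's games as the body of the entry.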
Here’s what the result looks like in NetNewsWire:
Because this is a free feed, there’s a little ad that says “Delivered by Feed43 service,” which is fine by me.
Now, these particular results are displayed in a slightly annoying way: on every date that EPL Talk publishes an update, the feed orders the games by day of the week, like so:
If the world were ordered entirely according to my liking, games would be displayed by date, not weekday. But that’s not actually annoying enough for me to care about fixing, so I just leave it the way it is. (I can even rationalize leaving it this way, since weekday games almost have to go on the DVR, while weekend games can often be watched live. Laziness is a wonderful thing!)
At the bottom of the page, Feed43 gives you the URL for your feed, which you can then paste into your aggregator in order to subscribe, and it gives you a URL to edit the feed, should it ever become necessary to do so.
And that’s it! The schedule of US-broadcast Premier League games now shows up automagically in NetNewsWire, and I don’t have to fear the 8-year-old’s wrath.
Do you prefer another feed scraping service? Let us know what it is, and why, in comments!