When I find a new blog that I really like or think has some good information I usually want to read the whole thing, and in the past, I've been frustrated by how annoying it is to do it online. I just want all posts in one document, free of ads, that I can read from start to finish.
I've also been having the itch to learn the Perl programming language, and I thought this would be a good project to get my feet wet.
Since the majority of blogs I read run on WordPress, it's pretty easy to take advantage of the GET string that can be used to access each post. The GET string looks like http://www.devtrench.com/?p=2. Normally, post IDs start at 1 and auto-increment from there, so all we have to do is figure out the ID of the latest post and write some code that will loop over a scraper that many times. Sometimes the latest post ID will be hidden in the HTML of the post somewhere, and other times you just have to guess at it for a while until you figure it out.
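As a rough sketch of that idea, the Perl below builds the list of ?p=N addresses and tries to pull a post ID out of a page's HTML. The helper names post_urls and guess_post_id are made up for illustration, and the id="post-123" and postid-123 markers are an assumption about what the theme prints; many WordPress themes emit something like them, but not all do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the ?p=N URL for every ID from 1 to $max_id.
sub post_urls {
    my ($base, $max_id) = @_;
    $base =~ s{/+$}{};    # drop any trailing slash
    return map { "$base/?p=$_" } 1 .. $max_id;
}

# Hypothetical helper: return the largest post ID mentioned in the HTML,
# assuming the theme prints markers such as id="post-123" or postid-123.
sub guess_post_id {
    my ($html) = @_;
    my $max;
    while ($html =~ /\bpost(?:id)?-(\d+)\b/g) {
        $max = $1 if !defined $max || $1 > $max;
    }
    return $max;    # undef when no marker was found
}

my @urls = post_urls('http://www.devtrench.com/', 3);
print "$_\n" for @urls;

# A made-up fragment of page source, just to exercise the helper.
my $sample = '<body class="single postid-42"><div id="post-42" class="post">';
print "highest id seen: ", guess_post_id($sample), "\n";
```

If the page source has no such marker, guess_post_id returns undef and you are back to guessing by hand.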
Here's the Perl code that I'm currently using to do this. It's meant to be run from the command line, and I output the data to a file using '> output.txt' after the command. Also, I ripped off most of this code from PLEAC-Perl, which is an online Perl cookbook. Cookbooks can really accelerate learning a language when you want to do specific tasks like this.
Hopefully this code is commented enough for you to understand what is going on.
[perl]#!/usr/bin/perl -w
# http://www.devtrench.com - How to Scrape a Wordpress Blog
# These packages are required so load them
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;
# Usage: perl fetch_blog.pl http://www.domain.com highest_post_id start_pattern end_pattern
my $raw_url = shift; # should be the domain with http:// and no trailing slash
my $post_num = shift; # this is the highest id number that you can find in the blog
my $start_pattern = shift; # a start pattern to match. we don't want the entire page so find some unique text before the blog post. it's best to put the pattern in single quotes
my $end_pattern = shift; # some unique text that signifies the end of a blog post
my $url = URI::Heuristic::uf_urlstr($raw_url);
$| = 1; #to flush next line
# fire up a new user agent that can browse the web for us
my $ua = LWP::UserAgent->new();
$ua->agent("DEVTRENCH.COM WordPress Post Fetcher v0.1"); # give it time, it'll get there
# now loop through all posts
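for my $i (1 .. $post_num) {
    # here we request each post from the site using the get string common to wordpress blogs
    my $req = HTTP::Request->new(GET => $url.'/?p='.$i);
    $req->referer("http://www.devtrench.com"); # perplex the log analysers
    my $response = $ua->request($req);
    if ($response->is_error())
    {
        # uh oh there was an error, probably a 404 which is caused by deleted or non published posts
        printf "<p>ERROR POST #%d: %s</p>\n", $i, $response->status_line;
    }
    else
    {
        # NOTE: the listing is cut off from here on, so this branch is a reconstruction and an assumption
        # grab the page and keep only the text between the start and end patterns
        my $content = $response->content();
        if ($content =~ /$start_pattern(.*?)$end_pattern/s) {
            $content = $1;
        }
        # the HTML tags around these strings did not survive, so <p>, <h2> and <hr /> are guesses
        print "<h2>Post #$i</h2>\n".$content."\n<hr />\n";
    }
}
[/perl]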
This was a fun experiment for me and a great way to finally get around to learning Perl. I always find that if I create small tasks like this for myself, learning a language or framework is much more interesting and rewarding. Now if you want to do this in PHP, which is what I usually program in, you can follow the same logic using curl as your user agent.
Hint 1: If you use this code as is it will print out HTML. Surround the HTML that it generates with basic <html> and <body> tags and create your own styles in the <head>. Then you can view it in a web browser and print to PDF.
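One way to do the wrapping from Hint 1 is a second small Perl script. This is only a sketch: the wrap_html helper, the wrap.pl file name, the page title, and the styles are all placeholders to swap for your own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap scraped post HTML in a minimal page with our own styles.
sub wrap_html {
    my ($title, $body) = @_;
    return <<"END_HTML";
<html>
<head>
<title>$title</title>
<style>
body { max-width: 40em; margin: 2em auto; font-family: Georgia, serif; }
img { max-width: 100%; }
</style>
</head>
<body>
$body
</body>
</html>
END_HTML
}

my $body = @ARGV
    ? do { local $/; <> }                         # slurp the file(s) named on the command line
    : "<h2>Post #1</h2>\n<p>Sample text.</p>";    # stand-in so the sketch runs on its own
print wrap_html('Blog archive', $body);
```

Saved as wrap.pl, it could be run as 'perl wrap.pl output.txt > archive.html', and the resulting file opened in a browser and printed to PDF.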
Hint 2: The blog I was trying to scrape did not allow me to hotlink images from my HTML file, but saving the page as HTML from Firefox was a workaround for that, and then I saved that created page as a PDF.
Comments (7)
Jul 11, 2008 at 12:35 PM
Scott (spammer) Fish:
This is a really cool way to scrape. Thanks for the code!
Jul 19, 2008 at 10:08 AM
Link Building Bible:
Yah, this is a great idea..... i have that same problem... i find a new blog i wanna read about, and then I hate trying to navigate through it all...
Apr 19, 2009 at 05:48 PM
Mike Wanner:
I have a quick question that may be related - I am looking for a program that can pull all the COMMENTS from posts. For example, on a site that is similar to mine I want to grab all the people who are posting and put them into a format or list. Many people post and put their full information and URL/URI. This way I can build a list quickly. Any ideas?
Apr 19, 2009 at 08:15 PM
devtrench:
My Perl skills are still pretty green :) In PHP you'd use preg_match_all on a pattern that would find a single comment. That would find all the comments and store them in an array for further processing. You'll need to find a Perl guru or try a Google search to find something that might work with this script. If anyone else who knows Perl would like to make a suggestion for this, that would be great!
Aug 03, 2010 at 01:20 PM
Arunabh Das:
I would recommend doing this in PHP as a scheduled cron task so that it can dump the posts to your DB and allow you to mash up content from other blogs.
Aug 18, 2010 at 08:46 AM
Joe Kenedy:
Has anyone actually written this in PHP? I would be interested in the code if you have it to share... Thanks -j
Nov 06, 2010 at 11:13 AM
» What’s the best way to read through the archives of a blog? Drija:
[...] found this blog post about some guy’s Perl code to do this. I really don’t have the expertise to implement [...]