DEVTRENCH.COM

How to Scrape an Entire WordPress Blog

When I find a new blog that I really like or that has some good information, I usually want to read the whole thing, and in the past I've been frustrated by how annoying that is to do online. I just want all of the posts in one document, free of ads, that I can read from start to finish.

I've also been having the itch to learn the Perl programming language, and I thought this would be a good project to get my feet wet.

Strategy

Since the majority of the blogs I read are written in WordPress, it's pretty easy to take advantage of the GET query string that can be used to access each post. It looks like http://www.devtrench.com/?p=2. Normally, post IDs start at 1 and auto-increment from there, so all we have to do is figure out the ID of the latest post and write some code that loops a scraper that many times. Sometimes the latest post ID is hidden in the HTML of a post somewhere, and other times you just have to guess at it for a while until you figure it out; the sketch below shows one way to look it up.
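
If you don't want to guess, one trick is to look for the numeric post IDs that many WordPress themes print into their markup. The rough sketch below fetches the front page and scans it for post-<ID> CSS classes; example.com is a placeholder and the class pattern is an assumption about the theme, so check the page source before relying on it.

[perl]#!/usr/bin/perl -w
# Rough sketch: guess the highest post ID by scanning the front page
# for the "post-123" style CSS classes that many WordPress themes emit.
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new();
my $res = $ua->get('http://www.example.com/'); # placeholder domain
die $res->status_line unless $res->is_success;

my $html = $res->content;
my $max  = 0;
while ($html =~ /\bpost-(\d+)\b/g) {
    $max = $1 if $1 > $max; # keep the largest ID we see
}
print "Highest post ID found on the front page: $max\n";[/perl]

This only sees the posts listed on the front page, so treat the result as a lower bound and bump it up if it seems too small.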

Code

Here's the Perl code I'm currently using to do this. It's meant to be run from the command line, and I send the output to a file by adding '> output.txt' after the command (there's an example invocation after the code). Also, I ripped off most of this code from PLEAC-Perl, an online Perl cookbook; cookbooks can really accelerate learning a language when you want to accomplish specific tasks like this.

Hopefully this code is commented enough for you to understand what is going on.

[perl]#!/usr/bin/perl -w
# http://www.devtrench.com - How to Scrape a WordPress Blog

# These packages are required so load them
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;

# Usage: perl fetch_blog.pl http://www.domain.com highest_post_id start_pattern end_pattern
my $raw_url = shift; # should be the domain with http:// and no trailing slash
my $post_num = shift; # this is the highest id number that you can find in the blog
my $start_pattern = shift; # a start pattern to match. we don't want the entire page so find some unique text before the blog post. it's best to put the pattern in single quotes
my $end_pattern = shift; # some unique text that signifies the end of a blog post
my $url = URI::Heuristic::uf_urlstr($raw_url);

$| = 1; # turn on autoflush so output isn't buffered

# fire up a new user agent that can browse the web for us
my $ua = LWP::UserAgent->new();
$ua->agent("DEVTRENCH.COM WordPress Post Fetcher v0.1"); # give it time, it'll get there

# now loop through all posts
for (my $i = 1; $i <= $post_num; $i++) {
    # here we request each post from the site using the ?p= query string common to WordPress blogs
    my $req = HTTP::Request->new(GET => $url.'/?p='.$i);
    $req->referer("http://www.devtrench.com"); # perplex the log analysers
    my $response = $ua->request($req);

    if ($response->is_error()) {
        # uh oh, there was an error, probably a 404 caused by a deleted or unpublished post
        printf "\n\nERROR POST #$i: %s\n\n\n", $response->status_line;
    } else {
        my $content = $response->content();
        # this regular expression grabs everything between the start and end patterns
        if ($content =~ /\Q$start_pattern\E(.*)\Q$end_pattern\E/s) {
            $content = $1;
        }
        # print what the regex found as HTML; it could easily be modified to print plain text instead
        print "\n\nPost #$i\n\n$content\n\n\n";
    }
} [/perl]
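
An example invocation might look like this (the domain, highest post ID, and patterns are made up for illustration; pull the real start and end patterns from the blog's HTML source by viewing a single post and finding some unique markup just before and just after the post body):

perl fetch_blog.pl http://www.example.com 250 '<div class="entry">' '<div class="comments">' > output.txt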

This was a fun experiment for me and a great way to finally get around to learning Perl. I always find that if I create small tasks like this for myself, learning a language or framework is much more interesting and rewarding. Now if you want to do this in PHP, which is what I usually program in, you can follow the same logic using cURL as your user agent.

Hint 1: If you use this code as-is, it will print out HTML. Surround the HTML it generates with basic <html> and <body> tags and add your own styles in the <head>; then you can view it in a web browser and print it to PDF.
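
A minimal sketch of that wrapper, printed straight from Perl so it could also be dropped into the script before and after the loop (the style rules are placeholders, adjust to taste):

[perl]# print this once before the scraping loop
print "<html>\n<head>\n<style>body { font-family: Georgia, serif; max-width: 40em; margin: 0 auto; }</style>\n</head>\n<body>\n";

# ... the for loop from the script above runs here ...

# print this once after the loop
print "</body>\n</html>\n";[/perl]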

Hint 2: The blog I was trying to scrape did not allow me to hotlink its images from my HTML file, but saving the page as complete HTML from Firefox worked around that, and then I saved that copy as a PDF.