How to Scrape an Entire WordPress Blog

When I find a new blog that I really like, or one that has some good information, I usually want to read the whole thing. In the past I've been frustrated by how annoying that is to do online. I just want all the posts in one document, free of ads, that I can read from start to finish.

I've also been having the itch to learn the Perl programming language, and I thought this would be a good project to get my feet wet.


Since the majority of blogs I read run on WordPress, it's pretty easy to take advantage of the GET string that can be used to access each post. The GET string looks like /?p=N added to the end of the blog's URL, where N is the post id. Normally, post ids start at 1 and auto-increment from there, so all we have to do is figure out the id of the latest post and write some code that runs a scraper once for every id up to that number. Sometimes the latest post id is hidden somewhere in the HTML of the post, and other times you just have to guess at it for a while until you figure it out.
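
To make the idea concrete, here is a minimal sketch of building the per-post URLs and of fishing a post id out of a page's HTML. The helper names post_url and guess_post_id are made up for this example, example.com is a placeholder, and the two markers it looks for (a "?p=N" shortlink and a "postid-N" body class) are assumptions about the theme's markup that may not hold on every blog.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the URL for one post from the blog's base URL and a numeric post id.
sub post_url {
    my ($base, $id) = @_;
    $base =~ s{/+$}{};    # drop any trailing slash
    return "$base/?p=$id";
}

# Look for a post id in a page's HTML. Both markers are assumptions about
# the theme's markup; returns undef when neither is present.
sub guess_post_id {
    my ($html) = @_;
    return $1 if $html =~ /[?&]p=(\d+)/;         # e.g. a shortlink ending in ?p=42
    return $1 if $html =~ /\bpostid-(\d+)\b/;    # e.g. <body class="single postid-42">
    return;
}

# Hypothetical sample markup, standing in for a real page.
my $sample = q{<link rel='shortlink' href='http://example.com/?p=42' />};
my $latest = guess_post_id($sample);
print "latest id: $latest\n";

# Print the first few URLs the scraper would request.
for my $id (1 .. 3) {
    my $url = post_url('http://example.com/', $id);
    print "$url\n";
}
```

If neither marker turns up in the page source, guessing the highest id by hand still works. Ids that are too high will most likely come back as errors, the same way deleted posts do.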


Here's the Perl code that I'm currently using to do this. It's meant to be run from the command line, and I send the output to a file by adding '> output.txt' after the command. I ripped off most of this code from PLEAC-Perl, which is an online Perl cookbook. Cookbooks can really accelerate learning a language when you want to do specific tasks like this.

Hopefully this code is commented enough for you to understand what is going on.

[perl]#!/usr/bin/perl -w
# - How to Scrape a WordPress Blog

use strict;

# These packages are required so load them
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;

# Usage: perl <script> <url> <highest_post_id> <start_pattern> <end_pattern>
my $raw_url = shift; # should be the domain with http:// and no trailing slash
my $post_num = shift; # this is the highest id number that you can find in the blog
my $start_pattern = shift; # a start pattern to match. we don't want the entire page so find some unique text before the blog post. it's best to put the pattern in single quotes
my $end_pattern = shift; # some unique text that signifies the end of a blog post
my $url = URI::Heuristic::uf_urlstr($raw_url);

$| = 1; # turn on autoflush so output shows up as each post is fetched

# fire up a new user agent that can browse the web for us
my $ua = LWP::UserAgent->new();
$ua->agent("DEVTRENCH.COM WordPress Post Fetcher v0.1"); # give it time, it'll get there

# now loop through all posts
for (my $i = 1; $i <= $post_num; $i++) {
    # here we request each post from the site using the get string common to wordpress blogs
    my $req = HTTP::Request->new(GET => $url.'/?p='.$i);
    $req->referer(""); # perplex the log analysers
    my $response = $ua->request($req);

    if ($response->is_error()) {
        # uh oh there was an error, probably a 404 which is caused by deleted or non published posts
        printf "\n\nERROR POST #$i: %s\n\n", $response->status_line;
    } else {
        my $content = $response->content();
        # this is the regular expression that will match the start and end pattern
        if ($content =~ /\Q$start_pattern\E(.*)\Q$end_pattern\E/s) {
            $content = $1;
            # this line prints out what the regex found as html, but it could easily be modified to print as plain text as well
            print "\n\nPost #$i\n\n$content\n";
        }
    }
}
[/perl]
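
The start and end pattern matching is easier to try out without hitting a live site. Below is a minimal sketch under a few assumptions: extract_between is a made-up helper name, the sample markup and its class names are invented, and the regex uses a non-greedy (.*?) where the script above uses a greedy (.*). The greedy form runs to the last occurrence of the end pattern on the page, while the non-greedy form stops at the first one after the start pattern, which matters if your end pattern shows up more than once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text between two literal markers, or undef if no match.
# \Q...\E quotes the markers so characters like "<", "." or "(" match literally.
# The /s flag lets "." match newlines, since a post spans many lines.
sub extract_between {
    my ($html, $start, $end) = @_;
    return $1 if $html =~ /\Q$start\E(.*?)\Q$end\E/s;
    return;
}

# Hypothetical page markup; the class names are made up for illustration.
my $page = <<'HTML';
<div class="header">Site title</div>
<div class="entry">
<p>First paragraph.</p>
</div><!-- end entry -->
<div class="footer">Footer</div><!-- end entry -->
HTML

my $post = extract_between($page, '<div class="entry">', '</div><!-- end entry -->');
print defined $post ? $post : "no match\n";
```

With this sample page, the non-greedy match returns only the paragraph inside the entry div. A greedy match would also swallow the footer, because the end marker appears a second time after it.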

This was a fun experiment for me and a great way to finally get around to learning Perl. I always find that if I create small tasks like this for myself, learning a language or framework is much more interesting and rewarding. Now if you want to do this in PHP, which is what I usually program in, you can follow the same logic using cURL as your user agent.

Hint 1: If you use this code as is, it will print out HTML. Surround the HTML that it generates with basic <html> and <body> tags and create your own styles in the <head>. Then you can view it in a web browser and print to PDF.
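
The wrapping step can be scripted too. Here is a minimal sketch, assuming the scraper's output was saved to a file such as output.txt. The wrap_html name, the page title and the CSS rules are all placeholders invented for this example, so swap in your own styles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap an HTML fragment in a minimal page. The title and styles are
# placeholders, not part of the original scraper.
sub wrap_html {
    my ($fragment, $title) = @_;
    return <<"END_HTML";
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>$title</title>
<style>
body { max-width: 40em; margin: 2em auto; font-family: Georgia, serif; line-height: 1.5; }
img  { max-width: 100%; }
</style>
</head>
<body>
$fragment
</body>
</html>
END_HTML
}

# Stand-in fragment, used when no file name is given on the command line.
my $fragment = "<p>Sample post text.</p>";
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;               # slurp mode: read the whole file at once
    $fragment = <$in>;
    close $in;
}

my $page = wrap_html($fragment, 'Scraped blog');
print $page;
```

Saved under a name of your choosing, it would be run with the scraper's output file as its only argument and its own output redirected to an .html file, which you can then open in a browser.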

Hint 2: The blog I was trying to scrape did not allow me to hotlink images from my HTML file, but saving the page as HTML from Firefox was a workaround for that, and then I saved the resulting page as a PDF.