DEVTRENCH.COM

Screen Scraping: How to Screen Scrape a Website with PHP and cURL

Screen scraping has been around on the internet for as long as people have been able to code on it, and there are dozens of resources out there that show how to do it (Google "php screen scrape" to see what I mean). I want to touch on some things that I've figured out while scraping some screens. I assume you have PHP running and know your way around Windows.

Those are all of my tips. Here is some screen scrape code that I use.

To call curl, just write a function like this. Shelling out to the curl program is so much easier than using PHP's own cURL functions, but you probably don't want to use shell_exec on a web server where someone can put in their own input, because whatever they type ends up on the command line. I only use this code when I run it locally.

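// Runs the command-line curl program and returns the page source as a string.
// $pageSpec goes to the shell unescaped, so only pass values you trust.
function GetCurlPage ($pageSpec) {
    return shell_exec("curl $pageSpec");
}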
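If the script ever has to run somewhere less trusted, the same fetch can be done with PHP's cURL extension, so nothing is handed to a shell. This is only a minimal sketch: the function name fetchPage, the 15-second timeout, and the http/https-only check are my own assumptions, not part of the code above, and it needs the cURL extension enabled in PHP.

```php
<?php
// Hypothetical helper: fetch a page with PHP's cURL extension instead of shell_exec().
// Returns the response body as a string, or null if the URL is rejected or the request fails.
function fetchPage(string $url, int $timeoutSeconds = 15): ?string
{
    // Assumption: only plain http/https URLs are allowed. Nothing is passed to a shell.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!in_array(strtolower((string) $scheme), ['http', 'https'], true)) {
        return null;
    }

    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up on slow servers
    ]);
    $body = curl_exec($handle);

    // curl_exec() returns false on failure; normalize that to null.
    return $body === false ? null : $body;
}
```

It is a few more lines than the shell_exec version, but the URL is never interpreted by a command line, which is the part that makes the one-liner risky on a public server.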

This is the code that calls the curl function. We start by turning on the output buffer, which greatly speeds up our code. This particular code grabs the title of a page and prints it:

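ob_start();                       // hold output in a buffer until the end
$url = 'http://www.example.com';
$page = GetCurlPage($url);
// Capture whatever sits between <title> and </title>
preg_match("~<title>(.+)</title>~",$page,$m);
print $m[1];
ob_end_flush();                   // send the buffered output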

To run your script from the command line and generate output to a file you simply call it like this:

php my_script_name.php > output.txt

Anything the script prints, including what the output buffer flushes at the end, goes into the file you redirect the output to.

This is a very simple example that doesn't even check to see if the title exists on that page before it prints, but hopefully you can use your imagination to expand this into something that might grab all of the titles on an entire site. A common thing that I do is use a program like Xenu's Link Sleuth to build the list of links I want to scrape, and then use a loop to go through and scrape every link on the list (in other words, use Xenu for your spider and your code to process the results). This is how I built the Shoemoney Blog Archive list. The challenge and fun with screen scraping is figuring out how you can use the data that is out there to your advantage.
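Here is a rough sketch of that loop, with the missing title check added. The names extractTitle and scrapeTitles are hypothetical, and the fetch function is passed in as a parameter so you can plug in whatever page-fetching function you already have. Treat it as a starting point under those assumptions, not a finished scraper.

```php
<?php
// Hypothetical helper: return the page title, or null when no <title> tag is found.
function extractTitle(string $html): ?string
{
    // "i" ignores tag case, "s" lets "." span line breaks, ".*?" stops at the first </title>.
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m) !== 1) {
        return null;
    }
    $title = trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    return $title === '' ? null : $title;
}

// Hypothetical helper: scrape the title of every URL in a list.
// $fetch is any function that takes a URL and returns the page HTML (or null/false on failure).
function scrapeTitles(array $urls, callable $fetch): array
{
    $titles = [];
    foreach ($urls as $url) {
        $url = trim($url);
        if ($url === '') {
            continue; // skip blank lines in the list
        }
        $html = $fetch($url);
        // A failed fetch or a page without a title is recorded as null.
        $titles[$url] = is_string($html) ? extractTitle($html) : null;
    }
    return $titles;
}
```

To use it with the shell_exec approach, pass 'GetCurlPage' as $fetch and load the list with something like file('links.txt', FILE_IGNORE_NEW_LINES), where links.txt is an assumed file holding one URL per line, such as a list saved from Xenu. The result is an array keyed by URL, so you can print each pair and redirect the output to a file as before.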