Thursday, November 22, 2007

My Web Spider

I'm writing a web spider as a test to better my PHP skills. A web spider is a script that crawls pages looking for URLs and gathering page data for later use; GoogleBot is an example of a web spider. So far I've indexed 8,861 of the 142,967 URLs already spidered. I've only been running the spider a few hours (4 tabs) and my page_data table is already 736,217 MB. Only the text gets stored; I strip out the HTML tags.

How it works:
A page is set to reload every 2 seconds. Each reload kicks off the parsing of a page, stores all the new URLs into the database, and displays a running URL list to show me it's working.
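Roughly, that reload page boils down to something like this (a sketch only -- the function names, table layout, and connection details are stand-ins rather than a copy-paste of my actual code, and I'm using PDO here for the database bit):

<?php
// spider.php -- sketch of the reload-driven page (illustrative names only)
header('Refresh: 2'); // tell the browser to reload this script every 2 seconds

$db    = new PDO('mysql:host=localhost;dbname=spider', 'user', 'pass');
$url   = pick_next_url($db);       // grab one URL to work on (sketched below)
$found = process_url($url, $db);   // fetch it, store its text, collect the new links

// dump a quick list so I can watch it work
echo '<h3>Just spidered: ' . htmlspecialchars($url) . '</h3><ul>';
foreach ($found as $link) {
    echo '<li>' . htmlspecialchars($link) . '</li>';
}
echo '</ul>';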


Every time the page loads, a random URL is pulled from the database. That page is scanned, and its URLs are stored along with the page data. There is a 10% chance that the spider will go into 'source mode', which scans a preset list of URLs that provide a constant supply of new links, like Digg and del.icio.us.
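The pick itself looks roughly like this (again a sketch -- the $sources list and the urls table are simplified stand-ins):

<?php
// pick_next_url() sketch -- 10% chance of 'source mode', otherwise a random stale URL
function pick_next_url(PDO $db) {
    $sources = array('http://digg.com/', 'http://del.icio.us/'); // constant supply of fresh links

    if (mt_rand(1, 100) <= 10) {
        // source mode: skip the freshness rule and hit one of the seed sites
        return $sources[array_rand($sources)];
    }

    // normal mode: a random URL that hasn't been touched in over 3 hours
    $stmt = $db->query(
        "SELECT url FROM urls
         WHERE last_updated < NOW() - INTERVAL 3 HOUR
         ORDER BY RAND() LIMIT 1"
    );
    return $stmt->fetchColumn();
}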

The regular expression that parses out the urls:
/href="([^"]+)"/
The URLs loaded from the database are chosen at random and must have been last updated more than 3 hours ago (unless in source mode). I have had to keep pushing this window back as more URLs are added. I will most likely need a system to decide which URLs should be re-parsed sooner, so news and social sites get updated faster.
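Put together, the parse step is basically this (a sketch again; file_get_contents() and strip_tags() are roughly what happens, but the column names and table layout are stand-ins):

<?php
// process_url() sketch -- fetch one page, store its text, queue the links it contains
function process_url($url, PDO $db) {
    $html = @file_get_contents($url);
    if ($html === false) {
        return array();
    }

    // only the text gets stored -- strip_tags() throws away the html
    // (assumes page_data has url as a unique key)
    $store = $db->prepare("REPLACE INTO page_data (url, content, last_updated) VALUES (?, ?, NOW())");
    $store->execute(array($url, strip_tags($html)));

    // pull every href="..." out of the raw html
    preg_match_all('/href="([^"]+)"/', $html, $matches);
    $links = array_unique($matches[1]);

    // queue any links we haven't seen before (relative links would still need resolving against $url)
    $queue = $db->prepare("INSERT IGNORE INTO urls (url, last_updated) VALUES (?, '2000-01-01')");
    foreach ($links as $link) {
        if (!should_skip($link)) {   // the blacklist check, sketched further down
            $queue->execute(array($link));
        }
    }

    return $links;
}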

The engine automatically ignores javascript links, doubleclick.net, and a few other junk sources. I'll add more checks as I find the need.
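The check is nothing fancy, roughly this (the blacklist below only has the entry I mentioned; the rest get added as they annoy me):

<?php
// should_skip() sketch -- drop javascript: links, doubleclick.net, and other junk
function should_skip($url) {
    $blacklist = array('doubleclick.net'); // add more junk domains as they turn up

    if (stripos($url, 'javascript:') === 0) {
        return true; // not a real link
    }
    foreach ($blacklist as $bad) {
        if (stripos($url, $bad) !== false) {
            return true;
        }
    }
    return false;
}

Anything that fails the check just never makes it into the urls table.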

I need another soda. I'll write more later as I feel like adding to the engine. l8r

5 comments:

Pak Behl said...

Today, the interwebs, tomorrow THE WORLD! What will u do with all zee data?

da404lewzer said...

I just like watching the randomness of what it finds. Makes me feel complete lol

Unknown said...

Sounds cool! Any tips on how to create this in PHP? :)

Unknown said...

for titus:
$web_page=file_get_contents('starting URL here');
parse '$web_page' using a regular expression, such as preg_match_all("/href=[\'\"](.*)[\'\"]/iU",$web_page, $results);
look up 'preg_match_all()' for return information, and just add the data to a list to search in turn.

b0r1s said...

cool!
I'm developing a web spider too, in PHP/MySQL. The engine gets every URL, scans it, adds the new URLs along with new words, and reschedules itself for the next scan of the domain based on the revisit-after meta tag. Domains are divided by starting letter, so there is one spider per letter; that way everything gets scanned and refreshed in a short time. So far everything seems OK; the only thing needed is space, I really do need tons of terabytes :)