How it works:
A page is set to reload every 2 seconds. Each reload triggers the parsing of one page, stores any new URLs in the database, and displays a running URL list so I can see it's working.
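Roughly, that refresh page might look like the sketch below (the file and function names are mine, not the engine's). The meta refresh makes the browser re-run the script every 2 seconds, and each run does one crawl step:

<?php
// A minimal sketch of the reload-driven front page. crawl_one_step() is a
// placeholder for the real work: fetch a page, parse out its URLs, store
// the new ones, and return them for display.
function crawl_one_step() {
    return array('http://example.com/a', 'http://example.com/b');
}

$found = crawl_one_step();

echo '<html><head><meta http-equiv="refresh" content="2"></head><body><ul>';
foreach ($found as $url) {
    echo '<li>' . htmlspecialchars($url) . '</li>';
}
echo '</ul></body></html>';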
Every time the page reloads, a random URL is fetched. That page is scanned and its URLs are stored along with the page data. There is a 10% chance that the spider will go into 'source mode', which scans a preset list of URLs that offer a constant supply of new links, like digg and del.icio.us.
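The 10% roll could be as simple as this (digg and del.icio.us are from the engine's list; the pick_random_url() stub is just a placeholder, not the real database call):

<?php
// A sketch of the 'source mode' branch.
function pick_random_url() {
    // Placeholder: in the real engine this would query the database.
    return 'http://example.com/';
}

$sources = array('http://digg.com/', 'http://del.icio.us/');

if (mt_rand(1, 10) === 1) {
    // Source mode: crawl one of the preset high-churn sites.
    $target = $sources[array_rand($sources)];
} else {
    // Normal mode: crawl a random stale URL from the database.
    $target = pick_random_url();
}

echo "Crawling $target\n";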
The regular expression that parses out the URLs:

/href="([^"]+)"/

The URLs loaded from the database are picked at random and must have been updated more than 3 hours earlier (unless in source mode). I have had to keep pushing this time back as more URLs are added. I will most likely need to create a system to determine which URLs should be parsed more often, to allow faster updating of news/social sites.
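In MySQL terms the stale-URL pick could look something like this (the table and column names are my guesses, not the engine's actual schema), with the regex above doing the link extraction:

<?php
// Sketch: pick one URL not touched in the last 3 hours, then pull its links.
$sql = "SELECT url FROM urls
        WHERE last_updated < NOW() - INTERVAL 3 HOUR
        ORDER BY RAND()
        LIMIT 1";
// (Run $sql against the database to get $target; hard-coded here for the demo.)
$target = 'http://example.com/';

// Pull every href out of the fetched page with the regex quoted above.
$html = file_get_contents($target);
preg_match_all('/href="([^"]+)"/', $html, $matches);
$found_urls = $matches[1];

print_r($found_urls);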
The engine automatically ignores javascript: links, doubleclick.net and a few other junk sources. I'll add more checks as the need comes up.
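A filter along these lines would do it (only javascript and doubleclick.net come from the engine; the rest of the shape is assumed):

<?php
// A sketch of a URL blacklist check.
function is_wanted_url($url) {
    $blacklist = array('javascript:', 'doubleclick.net');
    foreach ($blacklist as $bad) {
        if (stripos($url, $bad) !== false) {
            return false;   // skip javascript: links and ad-network URLs
        }
    }
    return true;
}

var_dump(is_wanted_url('javascript:void(0)'));         // false
var_dump(is_wanted_url('http://ad.doubleclick.net/'));  // false
var_dump(is_wanted_url('http://digg.com/'));            // true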
I need another soda. I'll write more later as I feel like adding to the engine. l8r
5 comments:
Today, the interwebs, tomorrow THE WORLD! What will u do with all zee data?
I just like watching the randomness of what it finds. Makes me feel complete lol
Sounds cool! Any tips on how to create this in PHP? :)
for titus:
$web_page=file_get_contents('starting URL here');
parse $web_page using a regular expression, such as preg_match_all("/href=[\'\"](.*)[\'\"]/iU", $web_page, $results);
look up preg_match_all() for its return format, then add the captured URLs to a list and crawl them in turn.
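Put together, a minimal working version of that suggestion might look like this (the starting URL and the 50-page limit are just placeholders):

<?php
// Simple queue-based crawl using the pattern suggested above.
$queue   = array('http://example.com/');
$visited = array();

while (!empty($queue) && count($visited) < 50) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;
    }
    $visited[$url] = true;

    $web_page = @file_get_contents($url);
    if ($web_page === false) {
        continue;   // unreachable or relative URL, skip it
    }

    preg_match_all("/href=['\"](.*)['\"]/iU", $web_page, $results);
    foreach ($results[1] as $found) {
        $queue[] = $found;
    }
}

print_r(array_keys($visited));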
cool!
I'm developing a web spider too, in PHP/MySQL. The engine gets all the URLs, scans them, adds the new URLs with new words, and reschedules itself for the next domain scan based on the revisit-after meta tag. Domains are divided by starting letter, so there is a spider for each letter; this way everything is scanned and renewed in a short time. So far everything seems OK, the only thing needed is space, I really do need tons of terabytes :)