
Thursday, November 22, 2007

My Web Spider

I'm writing a web spider as a test to improve my PHP skills. A web spider is a script that follows URLs and gathers page data for later use; GoogleBot is an example of a web spider. So far I've indexed 8,861 of the 142,967 URLs already spidered. I've only been running the spider a few hours (4 tabs) and my page_data table is already 736,217 MB. Only text gets stored; I strip out the HTML tags.

How it works:
A page is set to reload every 2 seconds. Each reload initiates the parsing of a page, stores all new URLs in the database, and displays a cool URL list to show me it's working.
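The reload is just a meta refresh driving one crawl step per page load. A minimal sketch of what that control page might look like (spider_functions.php and spider_step() are placeholder names, not necessarily what I'm actually using):

<?php
// spider.php - runs one crawl step per load, then reloads itself in 2 seconds
require 'spider_functions.php'; // hypothetical include; provides $db and spider_step()
$urls = spider_step($db);       // one step: fetch, parse, store (sketched below)
?>
<html>
<head><meta http-equiv="refresh" content="2"></head>
<body>
<?php foreach ($urls as $url) echo htmlspecialchars($url), "<br>\n"; ?>
</body>
</html>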


Every time the page loads, a random URL is fetched. The page is scanned, and its URLs are stored along with the page data. There is a 10% chance that the spider will go into 'source mode', which scans a preset list of URLs that provide a constant supply of new URLs, like Digg and del.icio.us.
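Roughly, one crawl step could look like this (the helper functions here are assumptions for illustration, not my actual code):

<?php
// One crawl step: pick a URL, fetch it, store stripped text, queue new links.
function spider_step($db) {
    if (mt_rand(1, 10) == 1) {
        // 10% chance: 'source mode' - hit a preset list of link-rich sites
        $sources = array('http://digg.com/', 'http://del.icio.us/');
        $url = $sources[array_rand($sources)];
    } else {
        $url = pick_random_url($db); // hypothetical helper, sketched below
    }

    $html = @file_get_contents($url);
    if ($html === false) return array();

    save_page_data($db, $url, strip_tags($html)); // hypothetical: store text only, no tags
    $urls = extract_urls($html);  // uses the regex shown below
    store_new_urls($db, $urls);   // hypothetical: insert only unseen URLs
    return $urls;
}
?>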

The regular expression that parses out the URLs:
/href="([^"]+)"/
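In PHP that gets fed to preg_match_all, something like this (extract_urls is my placeholder name from the sketch above):

<?php
// extract_urls(): pull every href value out of the page with one regex
function extract_urls($html) {
    preg_match_all('/href="([^"]+)"/', $html, $matches);
    return $matches[1]; // first capture group: the URL strings
}
?>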
The URLs that are loaded from the database are chosen at random and must have been updated more than 3 hours earlier (unless in source mode). I have had to keep pushing this time back as more URLs are added. I will most likely need to create a system to determine which URLs should be parsed more often, to allow faster updating of news/social sites.
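The selection can be a single query. A sketch of the hypothetical pick_random_url() helper (the table and column names are guesses, not my real schema):

<?php
// pick_random_url(): one random URL that hasn't been touched in 3+ hours
function pick_random_url($db) {
    $sql = "SELECT url FROM urls
            WHERE last_updated < NOW() - INTERVAL 3 HOUR
            ORDER BY RAND() LIMIT 1";
    $result = mysql_query($sql, $db);
    $row = mysql_fetch_assoc($result);
    return $row ? $row['url'] : false;
}
?>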

The engine automatically ignores javascript: links, doubleclick.net, and a few other junk sources. I'll add more checks as I find the need.
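The check is just a blacklist scan over each candidate URL; a minimal sketch (the real list would be longer):

<?php
// Skip javascript: pseudo-links, ad servers, and other junk
function is_junk_url($url) {
    $blacklist = array('javascript:', 'doubleclick.net');
    foreach ($blacklist as $bad) {
        if (stripos($url, $bad) !== false) return true;
    }
    return false;
}
?>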

I need another soda. I'll write more later as I feel like adding to the engine. l8r