
Friday, November 23, 2007

MySQL speed improvements

MyISAM seems to be faster than InnoDB with the queries I've been using for my web spider. I switched my tables over and found that my crawls were running about 3 times faster, mostly because MyISAM supports INSERT DELAYED (which InnoDB doesn't).
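
Switching engines is a one-liner per table. Here's a minimal sketch using the old mysql_* extension (the connection details are placeholders; the table names are the ones used in the queries below):

<?php
// Connect (placeholder credentials).
$db = mysql_connect('localhost', 'spider', 'secret');
mysql_select_db('spider', $db);

// Convert both tables from InnoDB to MyISAM.
// MyISAM accepts INSERT DELAYED; InnoDB does not.
mysql_query('ALTER TABLE urls ENGINE = MyISAM', $db);
mysql_query('ALTER TABLE data ENGINE = MyISAM', $db);
?>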

The Queries
For the examples, we'll be using entry #383558 (http://en.wikipedia.org/wiki/Category:World_Wrestling_Entertainment_alumni).

SELECT uID, uURL FROM urls WHERE uError = 0 AND uUpdated < DATE_SUB(NOW() , interval 12 hour ) ORDER BY rand() LIMIT 1
Selects a random URL from the urls table that hasn't errored and hasn't been updated in the last 12 hours.

UPDATE urls SET uUpdated = NOW() WHERE uID = 383558 LIMIT 1
Sets the last-updated time to NOW() in the urls table.

INSERT DELAYED INTO data (dURLID, dData) VALUES ( 383558, 'Category:World Wrestling Entertainment alumni - Wikipedia, the free encyclopedia\n /**/...' ) ON DUPLICATE KEY UPDATE dData=VALUES(dData)
This inserts (or updates) the current page's text into the data table. The ON DUPLICATE KEY UPDATE part relies on dURLID being a primary or unique key.

INSERT DELAYED IGNORE INTO urls (uURL, uAdded, uSiteID) VALUES ( 'http://en.wikipedia.org/favicon.ico', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Kurt_Angle', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Bryan_Clark', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Peter_Gasperino', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Robert_Horne_%28wrestler%29', NOW(), 383558 )
This inserts the newly found URLs into the urls table and records the parent page's ID (383558) in uSiteID. The IGNORE means URLs that are already in the table get skipped silently instead of aborting the insert with a duplicate-key error.

UPDATE urls SET uError = 1 WHERE uID = 383558 LIMIT 1
This marks the page as an error (e.g. a 404). It will not be spidered again unless the mark is removed.
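
For reference, here's roughly what the two tables behind these queries look like. The column names come straight from the queries above, but the types, sizes, and indexes are my guesses, so treat it as a sketch rather than the real schema:

<?php
// Hypothetical schema reconstructed from the queries above.
// Assumes mysql_connect()/mysql_select_db() have already run.
mysql_query("
    CREATE TABLE urls (
        uID      INT UNSIGNED NOT NULL AUTO_INCREMENT,
        uURL     VARCHAR(255) NOT NULL,
        uSiteID  INT UNSIGNED NOT NULL DEFAULT 0,  -- uID of the page it was found on
        uError   TINYINT NOT NULL DEFAULT 0,
        uAdded   DATETIME NOT NULL,
        uUpdated DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
        PRIMARY KEY (uID),
        UNIQUE KEY (uURL)                          -- lets INSERT IGNORE skip known URLs
    ) ENGINE = MyISAM
");

mysql_query("
    CREATE TABLE data (
        dURLID INT UNSIGNED NOT NULL,              -- uID of the crawled page
        dData  MEDIUMTEXT NOT NULL,                -- stripped page text
        PRIMARY KEY (dURLID)                       -- needed for ON DUPLICATE KEY UPDATE
    ) ENGINE = MyISAM
");
?>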


The queries above are an extract from a typical set that would get executed after a crawl. Obviously dData wouldn't really be truncated, and there would be a lot more URLs. l8r

Thursday, November 22, 2007

My Web Spider

I'm writing a web spider as a test to improve my PHP skills. A web spider is a script that follows URLs and gathers page data for later use; GoogleBot is an example of a web spider. So far I've indexed 8,861 of the 142,967 URLs already spidered. I've only been running the spider for a few hours (4 tabs) and my page_data table is already 736,217mb. Only the text gets stored; I strip out the HTML tags.

How it works:
A page is set to reload every 2 seconds. Each reload kicks off the parsing of a page, stores all new URLs in the database, and displays a list of the found URLs so I can see it's working.
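
The loop itself is not much more than this (a stripped-down sketch; crawl_one_page() is a made-up stand-in for the real fetch/parse/store code):

<?php
// spider.php - stripped-down sketch of the reload loop.
echo '<meta http-equiv="refresh" content="2">';   // reload every 2 seconds

$crawled = crawl_one_page();   // placeholder: pick a URL, parse it, store results

// Show the URLs that were just found, so I can see it working.
echo '<ul>';
foreach ($crawled['new_urls'] as $url) {
    echo '<li>' . htmlspecialchars($url) . '</li>';
}
echo '</ul>';
?>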


Every time the page loads, a random URL is fetched. The page is scanned, and its URLs are stored along with the page data. There is a 10% chance that the spider will go into 'source mode', which scans a preset list of URLs that supply a constant stream of new URLs, like digg and del.icio.us.
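
The source-mode coin flip is about as simple as it sounds. A sketch (the source list and get_random_url_from_db() are placeholders for this example):

<?php
// ~10% of the time, crawl one of the preset 'source' pages instead of a
// random URL from the database.
$sources = array(
    'http://digg.com/',
    'http://del.icio.us/',
);

if (mt_rand(1, 10) == 1) {
    // source mode: a constantly-updating page from the preset list
    $url = $sources[array_rand($sources)];
} else {
    // normal mode: random URL from the urls table
    $url = get_random_url_from_db();   // placeholder helper
}
?>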

The regular expression that parses out the urls:
/href="([^"]+)"/
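
In PHP, that regex gets used with preg_match_all(), roughly like this (the example URL is just for illustration):

<?php
// Fetch a page and pull out every href="..." value.
$html = file_get_contents('http://en.wikipedia.org/wiki/Kurt_Angle');  // example page
preg_match_all('/href="([^"]+)"/', $html, $matches);
$found_urls = array_unique($matches[1]);   // captured URL strings, de-duped
?>
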
The URLs loaded from the database are picked at random and must have been last updated more than 3 hours ago (unless in source mode). I've had to keep pushing this window back as more URLs are added. I'll most likely need to build a system that decides which URLs should be re-crawled sooner, so news/social sites get updated faster.
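
That selection is basically the SELECT from the MySQL post above, just with a 3-hour window:

<?php
// Pick one random, non-errored URL that hasn't been touched in 3+ hours.
$result = mysql_query(
    "SELECT uID, uURL FROM urls
     WHERE uError = 0 AND uUpdated < DATE_SUB(NOW(), INTERVAL 3 HOUR)
     ORDER BY RAND() LIMIT 1"
);
$row = mysql_fetch_assoc($result);   // $row['uID'], $row['uURL']
?>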

The engine automatically ignores javascript: links, doubleclick.net, and a few other junk URLs. I'll add more checks as I find the need.

I need another soda. I'll write more later as I feel like adding to the engine. l8r