Friday, November 23, 2007

MySQL speed improvements

MyISAM seems to be faster than InnoDB with the queries that I've been using for my web spider. I switched over my tables and found that my crawls were running 3 times faster because MyISAM supports INSERT DELAYED.

The Queries
For the example, we will using entry #383558 (http://en.wikipedia.org/wiki/Category:World_Wrestling_Entertainment_alumni).

SELECT uID, uURL FROM urls WHERE uError = 0 AND uUpdated < DATE_SUB(NOW() , interval 12 hour ) ORDER BY rand() LIMIT 1
Select a random URL from the url table that hasn't errored and hasn't been updated in over 12 hours.

UPDATE urls SET uUpdated = NOW() WHERE uID = 383558 LIMIT 1
Set the last updated time to now() in the url table.

INSERT DELAYED INTO data (dURLID, dData) VALUES ( 383558, 'Category:World Wrestling Entertainment alumni - Wikipedia, the free encyclopedia\n /**/...' ) ON DUPLICATE KEY UPDATE dData=VALUES(dData)
This inserts (or updates) the current page's text into the data table.

INSERT DELAYED IGNORE INTO urls (uURL, uAdded, uSiteID) VALUES ( 'http://en.wikipedia.org/favicon.ico', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Kurt_Angle', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Bryan_Clark', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Peter_Gasperino', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Robert_Horne_%28wrestler%29', NOW(), 383558 )
This inserts more urls to the url table and records the parent url.

UPDATE urls SET uError = 1 WHERE uID = 383558 LIMIT 1
This marks the page as an error (e.g. 404). It will not be spidered in the future without this mark being removed.


This is an extract from a typical set of querys that would get executed after a crawl. Obviously the dData wouldn't be truncated and there would be a lot more urls lol. tacos. l8r

2 comments:

SakuyasLove said...

exactly would you use all that data for?

da404lewzer said...

@sakuyaslove - maybe a search engine or something of the like