Showing posts with label insert. Show all posts
Showing posts with label insert. Show all posts

Friday, November 23, 2007

MySQL speed improvements

MyISAM seems to be faster than InnoDB with the queries that I've been using for my web spider. I switched over my tables and found that my crawls were running 3 times faster because MyISAM supports INSERT DELAYED.

The Queries
For the example, we will using entry #383558 (http://en.wikipedia.org/wiki/Category:World_Wrestling_Entertainment_alumni).

SELECT uID, uURL FROM urls WHERE uError = 0 AND uUpdated < DATE_SUB(NOW() , interval 12 hour ) ORDER BY rand() LIMIT 1
Select a random URL from the url table that hasn't errored and hasn't been updated in over 12 hours.

UPDATE urls SET uUpdated = NOW() WHERE uID = 383558 LIMIT 1
Set the last updated time to now() in the url table.

INSERT DELAYED INTO data (dURLID, dData) VALUES ( 383558, 'Category:World Wrestling Entertainment alumni - Wikipedia, the free encyclopedia\n /**/...' ) ON DUPLICATE KEY UPDATE dData=VALUES(dData)
This inserts (or updates) the current page's text into the data table.

INSERT DELAYED IGNORE INTO urls (uURL, uAdded, uSiteID) VALUES ( 'http://en.wikipedia.org/favicon.ico', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Kurt_Angle', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Bryan_Clark', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Peter_Gasperino', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Robert_Horne_%28wrestler%29', NOW(), 383558 )
This inserts more urls to the url table and records the parent url.

UPDATE urls SET uError = 1 WHERE uID = 383558 LIMIT 1
This marks the page as an error (e.g. 404). It will not be spidered in the future without this mark being removed.


This is an extract from a typical set of querys that would get executed after a crawl. Obviously the dData wouldn't be truncated and there would be a lot more urls lol. tacos. l8r