The Queries
For the example, we will using entry #383558 (http://en.wikipedia.org/wiki/Category:World_Wrestling_Entertainment_alumni).
SELECT uID, uURL FROM urls WHERE uError = 0 AND uUpdated < DATE_SUB(NOW() , interval 12 hour ) ORDER BY rand() LIMIT 1Select a random URL from the url table that hasn't errored and hasn't been updated in over 12 hours.
UPDATE urls SET uUpdated = NOW() WHERE uID = 383558 LIMIT 1Set the last updated time to now() in the url table.
INSERT DELAYED INTO data (dURLID, dData) VALUES ( 383558, 'Category:World Wrestling Entertainment alumni - Wikipedia, the free encyclopedia\n /**/...' ) ON DUPLICATE KEY UPDATE dData=VALUES(dData)This inserts (or updates) the current page's text into the data table.
INSERT DELAYED IGNORE INTO urls (uURL, uAdded, uSiteID) VALUES ( 'http://en.wikipedia.org/favicon.ico', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Kurt_Angle', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Bryan_Clark', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Peter_Gasperino', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Robert_Horne_%28wrestler%29', NOW(), 383558 )This inserts more urls to the url table and records the parent url.
UPDATE urls SET uError = 1 WHERE uID = 383558 LIMIT 1This marks the page as an error (e.g. 404). It will not be spidered in the future without this mark being removed.
This is an extract from a typical set of querys that would get executed after a crawl. Obviously the dData wouldn't be truncated and there would be a lot more urls lol. tacos. l8r
2 comments:
exactly would you use all that data for?
@sakuyaslove - maybe a search engine or something of the like
Post a Comment