Wednesday, November 28, 2007

Learning, the Long Way

Doing something like
print 'This is a test';
is pretty simple, wouldn't you say? print is a simple command that outputs what it's given to the browser. Indeed simple. Now let's look at this:
function foo($data){
echo $data;
}
${'*'}='foo';
${'*'}('This is a test');
This example is a bit more abstract. It does the same thing, but in a different way. First, ${} is called a variable-variable. It allows you to use a string to specify a variable. $name and ${'name'} are the same variable. Valid PHP variable names start with an underscore or a letter, followed by any number of letters, numbers, or underscores. Here ${'*'} creates a variable $* if you will. This is not a valid variable name! Consider the following:
echo $*; //ERROR
echo ${'*'}; // outputs: foo
Ok, so $* is invalid, but how does it work? If a variable-variable defines it, anything goes. Moving on, the second line, ${'*'}();, is what is called a variable-function. Basically, it allows you to store the name of a function inside a variable. Then, by adding () to the variable, it runs the function whose name matches the variable's data. If no function is found, an error will occur. Consider the following lines, which are now technically the same.
${'*'}('Test 1-2-3'); // outputs: Test 1-2-3
foo('Test 1-2-3'); // outputs: Test 1-2-3
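Since calling a name that matches no function is an error, it can be worth checking first. Here is a minimal sketch, assuming a hypothetical helper called call_if_exists (my name, not part of PHP) built on PHP's is_callable():

```php
<?php
// The echo wrapper from the example above.
function foo($data) {
    echo $data;
}

// Hypothetical helper: run $name as a variable-function only if it can be called.
function call_if_exists($name, $arg) {
    if (is_callable($name)) {
        $name($arg);      // variable-function call
        return true;
    }
    return false;         // nothing by that name, so no error is raised
}

${'*'} = 'foo';
call_if_exists(${'*'}, "Test 1-2-3\n");              // runs foo()
var_dump(call_if_exists('no_such_function', 'x'));   // bool(false)
var_dump(is_callable('echo'));                       // bool(false): echo is a language construct
```

The helper returns a flag instead of raising an error, so the caller decides what a missing function means.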
The reason we used foo as an echo wrapper is that echo and print are language constructs, not functions, so they can't be called through a variable. Be sure to play with variable-variables and variable-functions. They are a lot of fun and are the next step towards making modular components and classes! A variable-function is technically an alternative to:
call_user_func('foo', 'This is a test');
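To see that the two really are interchangeable, here is a small sketch. The function join_words is made up for illustration; call_user_func() and call_user_func_array() are standard PHP.

```php
<?php
// Hypothetical two-argument function, used only for illustration.
function join_words($first, $second) {
    return $first . ' ' . $second;
}

$name = 'join_words';

$a = $name('This is', 'a test');                           // variable-function
$b = call_user_func($name, 'This is', 'a test');           // same call through the helper
$c = call_user_func_array($name, ['This is', 'a test']);   // arguments passed as an array

var_dump($a === $b && $b === $c);   // bool(true)
```

call_user_func_array() is handy when the arguments are already sitting in an array.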

More info here: Variable-variables and Variable-functions

wika wika out

Friday, November 23, 2007

MySQL speed improvements

MyISAM seems to be faster than InnoDB with the queries that I've been using for my web spider. I switched over my tables and found that my crawls were running 3 times faster because MyISAM supports INSERT DELAYED.

The Queries
For the example, we will be using entry #383558 (http://en.wikipedia.org/wiki/Category:World_Wrestling_Entertainment_alumni).

SELECT uID, uURL FROM urls WHERE uError = 0 AND uUpdated < DATE_SUB(NOW() , interval 12 hour ) ORDER BY rand() LIMIT 1
Select a random URL from the url table that hasn't errored and hasn't been updated in over 12 hours.

UPDATE urls SET uUpdated = NOW() WHERE uID = 383558 LIMIT 1
Set the last updated time to now() in the url table.

INSERT DELAYED INTO data (dURLID, dData) VALUES ( 383558, 'Category:World Wrestling Entertainment alumni - Wikipedia, the free encyclopedia\n /**/...' ) ON DUPLICATE KEY UPDATE dData=VALUES(dData)
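This inserts (or updates) the current page's text into the data table. Note that, according to the MySQL manual, the server ignores DELAYED when ON DUPLICATE KEY UPDATE is present, so this one runs as a regular insert.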

INSERT DELAYED IGNORE INTO urls (uURL, uAdded, uSiteID) VALUES ( 'http://en.wikipedia.org/favicon.ico', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Kurt_Angle', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Bryan_Clark', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Peter_Gasperino', NOW(), 383558 ),( 'http://en.wikipedia.org/wiki/Robert_Horne_%28wrestler%29', NOW(), 383558 )
This inserts more urls into the urls table and records the parent url.

UPDATE urls SET uError = 1 WHERE uID = 383558 LIMIT 1
This marks the page as an error (e.g. 404). It will not be spidered in the future without this mark being removed.


This is an extract from a typical set of queries that would get executed after a crawl. Obviously the dData wouldn't be truncated and there would be a lot more urls lol. tacos. l8r
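The multi-row url insert is the fiddly one to build, so here is a rough sketch of how it could be put together. The helper name build_url_insert and the use of ? placeholders (for a prepared-statement API such as PDO) are my assumptions, not how the spider actually does it. DELAYED is left out because this sketch doesn't assume it combines with prepared statements.

```php
<?php
// Hypothetical helper: build one multi-row INSERT IGNORE for the urls table.
// Returns [sql, params], or null when there are no urls to insert.
function build_url_insert(array $urls, $parentId) {
    if (count($urls) === 0) {
        return null;                  // an empty VALUES list would be invalid SQL
    }
    $rows = [];
    $params = [];
    foreach ($urls as $url) {
        $rows[] = '(?, NOW(), ?)';    // uURL, uAdded, uSiteID
        $params[] = $url;
        $params[] = $parentId;
    }
    $sql = 'INSERT IGNORE INTO urls (uURL, uAdded, uSiteID) VALUES '
         . implode(', ', $rows);
    return [$sql, $params];
}

list($sql, $params) = build_url_insert(
    ['http://en.wikipedia.org/wiki/Kurt_Angle', 'http://en.wikipedia.org/wiki/Bryan_Clark'],
    383558
);
echo $sql, "\n";
```

Keeping the values out of the SQL string means the urls never need manual escaping.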

Thursday, November 22, 2007

Using "IF Comments" in PHP

Here's a nifty hack using comments to quickly enable/disable a block of code. I haven't seen examples of this before so I feel special. Here we go.

Teh Test
Suppose you have the following code:
$data = get_crap();

echo "Debug:";
print_r($data);
I store some data in $data and, as a debug step, I want to view its contents. I also want to be able to disable the debug quickly. This is going to look weird at first but I will explain.

Example Code:
$data = get_crap();

///*
echo "Debug:";
print_r($data);
//*/
///* is a valid comment. It starts with //, so php ignores the /*
The echo and print_r work because only the line above them was commented out.
//*/ is a valid comment. It starts with //, so php ignores the */

To disable execution of code:
(Remove the very first set of //'s)
$data = get_crap();

/*
echo "Debug:";
print_r($data);
//*/
/* starts a multi-line block comment. php ignores everything until it finds the closing */
The echo and print_r are not seen by the compiler.
The // in //*/ sits inside the block comment, so it is ignored. The */ closes the comment and execution resumes after it.
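Here are both states side by side in one script you can run. It is only a sketch: the $log array is something I made up so there is something to look at afterwards.

```php
<?php
$log = [];   // hypothetical array that records which blocks actually ran

///*
$log[] = 'block A ran';   // enabled: the line above is just a line comment
//*/

/*
$log[] = 'block B ran';   // disabled: the line above opened a block comment
//*/

print_r($log);   // only 'block A ran' is in the array
```

Flipping a block on or off is a one-character-pair edit on its first line.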

Wait a second...
Why not just do this?
$data = get_crap();

/*
echo "Debug:";
print_r($data);
*/
Why not just use /* */? Why /* //*/?

Because! If you remove the first /* to 'uncomment' without also removing the */ (an extra step, by the way), you get this:
$data = get_crap();


echo "Debug:";
print_r($data);
*/
And this:

Parse error: parse error in c:\www\project\file.php on line n

meh, l8r

My Web Spider

I'm writing a web spider as a test to better my php skills. A web spider is a script that looks for urls and gathers page data for later use; GoogleBot is an example of a web spider. So far I've indexed 8,861 of the 142,967 urls already spidered. I've only been running the spider for a few hours (4 tabs) and my page_data table is already 736,217mb. Only text gets stored; I strip out html tags.

How it works:
A page is set to reload every 2 seconds. This initiates the parsing of a page, stores all new urls into the database, and displays a cool url list to show me it's working.


Every time the page loads, a random url is loaded. The page is scanned and its urls are stored along with the page data. There is a 10% chance that the spider will go into 'source mode', which scans a preset list of urls that contain a constant supply of new urls, like digg and del.icio.us.
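The 10% roll could look something like this. It is a sketch only: choose_mode and the mode names are made up for illustration, and the roll is passed in as a parameter so the decision is easy to check.

```php
<?php
// Hypothetical helper: pick the crawl mode from a roll between 1 and 100.
// Rolls of 1 to 10 (10% of the range) select 'source' mode.
function choose_mode($roll) {
    return $roll <= 10 ? 'source' : 'random';
}

$mode = choose_mode(mt_rand(1, 100));
echo "Mode: $mode\n";
```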

The regular expression that parses out the urls:
/href="([^"]+)"/
The urls that are loaded from the database are picked at random and must have been last updated over 3 hours earlier (unless in source mode). I have had to keep pushing this time back as more urls are added. I will most likely need to create a system to determine which urls should be parsed more often, to allow faster updating of news/social sites.

The engine automatically ignores javascript links, doubleclick.net and a few other junk urls. I'll add more checks as I find the need to.
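Put together, the extraction and filtering might be sketched like this. The function name extract_urls and the exact filter rules are my assumptions; only the regular expression comes from the spider itself.

```php
<?php
// Hypothetical helper: pull href values out of a page and drop unwanted ones.
function extract_urls($html) {
    preg_match_all('/href="([^"]+)"/', $html, $matches);
    $urls = [];
    foreach (array_unique($matches[1]) as $url) {
        if (stripos($url, 'javascript:') === 0) {
            continue;   // skip javascript: pseudo-links
        }
        if (stripos($url, 'doubleclick.net') !== false) {
            continue;   // skip ad-server links
        }
        $urls[] = $url;
    }
    return $urls;
}

$html = '<a href="http://example.com/a">A</a>'
      . '<a href="javascript:void(0)">B</a>'
      . '<a href="http://ad.doubleclick.net/x">C</a>'
      . '<a href="http://example.com/b">D</a>'
      . '<a href="http://example.com/a">A again</a>';

print_r(extract_urls($html));   // the two example.com urls, once each
```

Note that the pattern only matches double-quoted href attributes, so single-quoted or unquoted links are missed.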

I need another soda. I'll write more later as I feel like adding to the engine. l8r

Tuesday, November 20, 2007

Hello Everyone

My name is Charlie and I love writing applications in php. I've been using php for close to 5 years and have learned a lot of shortcuts along the way. I hope to be able to share my greatest ideas with people who love to code.

A lot of my 'fun' projects are what I call 'retarded.' I like to go out of the box to achieve similar results by writing my own functions. This allows me to better understand what php is doing internally. Variable-variables, variable-functions, and objects: I LOVE YOU GUYS. These are some (not all) of the best things in php.

As I start to think of interesting things to put on here I shall. Until then, peace my bitches!