ESS: The Real Desert Scraping
ESS stands for "Eli said so" ;)
as you can read in the latest article on bluehatseo.com, eli explained a technique for digging deep into the net for "fresh" content. let me boil those steps down:
1. buy a domain name and set up wildcard subdomains
2. find a site that offers a daily updated list of deleted domains and scrape this list every day
3. while you loop through the new domains, fetch the archive.org wayback machine page for each domain and check whether anything is archived
4. scrape this wayback machine content, clean it and save it in the database
now let's see if we can write a small script.
first of all we register a new domain. once that's done, we point it to our server and create a new vhost for it. this is how i do it with lighttpd:
    $HTTP["host"] =~ "(^|^www\.)domain\.com$" {
        server.document-root = "/www/pages/domain.com/htdocs/"
        accesslog.filename = "/www/logs/access_www_domain_com.log"
    }
    else $HTTP["host"] =~ "\.domain\.com$" {
        server.document-root = "/www/pages/domain.com/subs/"
        accesslog.filename = "/www/logs/access_sub_domain_com.log"
    }
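with that config every subdomain request ends up in subs/ (the wildcard also needs a *.domain.com dns record pointing at the server). the post doesn't say how a script in subs/ works out which subdomain was requested, so here is a minimal sketch. subdomain_from_host() is a made-up helper name, and feeding it from $_SERVER['HTTP_HOST'] is my assumption:

```php
<?php
// Hypothetical helper (not from the original post): returns the subdomain
// part of a request host, or null when the host is the bare domain, the
// www host, or belongs to a different domain.
function subdomain_from_host($host, $base)
{
    // normalise: trim, drop an optional :port, lowercase
    $host = strtolower(preg_replace('/:\d+$/', '', trim($host)));
    $base = strtolower($base);
    $suffix = '.' . $base;

    // the host must end in ".domain.com" to count as a subdomain
    if ($host === $base || substr($host, -strlen($suffix)) !== $suffix) {
        return null;
    }

    $sub = (string) substr($host, 0, -strlen($suffix));
    return ($sub === '' || $sub === 'www') ? null : $sub;
}

// In a real request this would be subdomain_from_host($_SERVER['HTTP_HOST'], 'domain.com').
var_dump(subdomain_from_host('oldsite.domain.com', 'domain.com'));
```

the returned label could then be used to look up the matching row of scraped content, but that mapping is up to you.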
now we have to scrape the domain lists. as an example i use a .org list from deactivatedon.com.
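    // fetch the day's list of dropped .org domains and pull out every name.org (matches land in $m[0])
    $d = file_get_contents( 'http://deactivatedon.com/new/org_070623_0001.html' );
    if( $d === false ) die( "could not fetch the domain list\n" );
    preg_match_all( '/[a-z0-9-]+\.org/', strtolower( $d ), $m );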
now we have all domains in the matches array. let's do something really cool:
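    $pspell = pspell_new( 'en' );
    $domains = array();
    foreach( $m[0] as $d ) {
        // skip long names, w3.org, punycode (xn--) names and names with more than one hyphen
        if( strlen( $d ) > 22 || $d == "w3.org" || strpos( $d, "xn--" ) !== false || substr_count( $d, '-' ) > 1 ) {
            continue;
        }
        // check only the name itself: drop the tld and any hyphen
        $t = str_replace( '-', '', substr( $d, 0, strrpos( $d, '.' ) ) );
        $words = array();
        // try every substring of 4 or more characters against the dictionary
        for( $j = 0; $j <= strlen( $t ) - 4; $j++ ) {
            for( $i = 4; $j + $i <= strlen( $t ); $i++ ) {
                if( pspell_check( $pspell, substr( $t, $j, $i ) ) ) {
                    $words[] = substr( $t, $j, $i );
                }
            }
        }
        // keep the domain (hyphen included, so we look up the right site later)
        // only if at least one real word turned up
        if( count( $words ) > 0 ) {
            $domains[$d] = $words;
        }
    }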
that's an extremely cool way to find out whether the domains we've found are crap or not. it goes over each domain and uses pspell to check whether there are real words in it. but that's not all: we save the words pspell found, so in a very lazy way we also generate a keyword list for each domain.
you need pspell for this. "apt-get install php5-pspell" did it for me. you probably have to ask your admin... etc.
if you want to know whether pspell is already installed, just run the code above: php will stop with an "undefined function" error on pspell_new() if it isn't.
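one way to check from php itself, as a small sketch. it uses only the standard extension_loaded() and function_exists() calls. pspell_available() is just a name i made up for the example:

```php
<?php
// Hypothetical helper: true when the pspell extension is loaded and
// pspell_new() can actually be called.
function pspell_available()
{
    return extension_loaded('pspell') && function_exists('pspell_new');
}

if (pspell_available()) {
    echo "pspell is available\n";
} else {
    echo "pspell is missing, install or enable the extension first\n";
}
```

from a shell, "php -m" lists the loaded extensions, so you can look for pspell there as well.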
i think i don't have to explain why we exclude w3.org.
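if you want to play with the word-finding idea without pspell, here is a minimal sketch of the same substring scan. the dictionary check is passed in as a callback, and a tiny hard-coded word list stands in for the real dictionary. find_words(), the callback and the word list are all my own assumptions for illustration, not part of the script above:

```php
<?php
// Collect every substring of $name (at least $minLen characters long) that
// the $isWord callback accepts. With pspell installed, the callback could
// wrap pspell_check(); here a small word list plays the dictionary.
function find_words($name, $isWord, $minLen = 4)
{
    $name = str_replace('-', '', strtolower($name));
    $len = strlen($name);
    $found = array();

    for ($start = 0; $start <= $len - $minLen; $start++) {
        for ($size = $minLen; $start + $size <= $len; $size++) {
            $candidate = substr($name, $start, $size);
            if (call_user_func($isWord, $candidate)) {
                $found[$candidate] = true; // array keys de-duplicate repeats
            }
        }
    }
    return array_keys($found);
}

// Assumption: an illustrative word list, not a real dictionary.
$dictionary = array('desert', 'rose', 'garden', 'scrape');
$isWord = function ($w) use ($dictionary) {
    return in_array($w, $dictionary, true);
};

print_r(find_words('desert-rose', $isWord));
```

a name with no hits returns an empty array, which is the signal to throw the domain away. the hits double as the lazy keyword list.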
now we need to fetch the data from the wayback machine:
    // note: the mysql_* functions used here were removed in php 7; on current php use mysqli or pdo
    $words = array_values( $domains );
    $domains = array_keys( $domains );
    $x = array();    // failed fetch attempts per domain
    for( $i = 0; $i < count( $domains ); $i++ ) {
        $url = explode( '.', $domains[$i] );
        // skip domains that are already in the database
        $q = 'SELECT * FROM subs WHERE
            domain = "'.mysql_real_escape_string( $url[count($url)-2] ).'" AND
            tld = "'.mysql_real_escape_string( $url[count($url)-1] ).'" LIMIT 1';
        $r = mysql_query( $q );
        if( mysql_num_rows( $r ) == 1 ) continue;
        // fetch the wayback machine's snapshot list; retry up to 3 times on failure
        $s = @file_get_contents( 'http://web.archive.org/web/*/http://www.'.$domains[$i] );
        if( empty( $s ) ) {
            if( !isset( $x[$i] ) ) $x[$i] = 0;
            if( $x[$i] < 3 ) {
                $x[$i]++;
                $i--;
            }
            continue;
        }
        if( strpos( $s, ' Sorry, no matches.' ) !== false ) continue;
        $s = str_replace( array("\n","\r","\t"), '', $s );
        // collect every snapshot link: $d[1] is the timestamp, $d[2] the archived url
        $m = array();
        preg_match_all( '/<a href="http:\/\/web\.archive\.org\/web\/([0-9]+?)\/http:\/\/([^"]+?)">[a-zA-Z0-9,\s]+?<\/a>[\s\*]+?<br>/', $s, $m, PREG_SET_ORDER );
        if( count( $m ) == 0 ) continue;
        // group the snapshots by year (first 4 digits of the timestamp)
        $pages = array();
        foreach( $m as $d ) {
            $pages[substr( $d[1], 0, 4 )][] = $d;
        }
        // pick the year with the most snapshots ($c has to track the best count so far)
        $c = 0;
        $k = null;
        foreach( $pages as $key => $p ) {
            $z = count( $p );
            if( $z > $c ) {
                $c = $z;
                $k = $key;
            }
        }
        // take the last snapshot of that year and try the download up to 5 times
        $link = $pages[$k][count($pages[$k])-1];
        $site = '';
        for( $j = 0; $j < 5; $j++ ) {
            $site = @file_get_contents( 'http://web.archive.org/web/'.$link[1].'/http://'.$link[2] );
            if( !empty( $site ) ) break;
        }
        // throw away empty pages, parked or for-sale pages and framesets
        $lsite = strtolower( $site );
        if( empty( $site )
            || strpos( $lsite, 'was registered' ) !== false
            || strpos( $lsite, 'not in archive.' ) !== false
            || strpos( $lsite, 'for sale' ) !== false
            || strpos( $lsite, '<frameset' ) !== false ) {
            continue;
        }
        $dom = $url[count($url)-2].'.'.$url[count($url)-1];
        // clean() strips the html and returns an array of sentences
        $data = clean( $site, $dom );
        if( !$data ) continue;
        // unsigned crc32 of the domain as a numeric hash
        $hash = sprintf( "%u", crc32( $dom ) );
        $q = 'INSERT INTO subs SET
            domain = "'.mysql_real_escape_string( $url[count($url)-2] ).'",
            tld = "'.mysql_real_escape_string( $url[count($url)-1] ).'",
            hash = '.$hash.',
            `keys` = "'.mysql_real_escape_string( implode( '|', $words[$i] ) ).'",
            content = "'.mysql_real_escape_string( implode( "\n", $data ) ).'",
            ts = NOW()';
        mysql_unbuffered_query( $q );
    }
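the snapshot selection inside that loop is easier to test when it's pulled out into its own function. a minimal sketch, assuming the intent is "take the last snapshot listed for the year with the most captures". pick_snapshot() is a hypothetical name and the timestamps in the usage line are made up:

```php
<?php
// $timestamps: wayback-style capture stamps (YYYYMMDDhhmmss), in listing order.
// Returns the last stamp listed for the year with the most captures, or null
// for an empty list. On a tie the earlier-listed year wins.
function pick_snapshot(array $timestamps)
{
    // group the stamps by their 4-digit year prefix
    $byYear = array();
    foreach ($timestamps as $ts) {
        $byYear[substr($ts, 0, 4)][] = $ts;
    }

    // find the year with the highest capture count
    $bestYear = null;
    $bestCount = 0;
    foreach ($byYear as $year => $stamps) {
        if (count($stamps) > $bestCount) {
            $bestCount = count($stamps);
            $bestYear = $year;
        }
    }

    return $bestYear === null ? null : end($byYear[$bestYear]);
}

// Made-up example stamps: 2005 has two captures, so its last one is picked.
echo pick_snapshot(array('20040312081522', '20050101120000', '20050615093000', '20060220000000')), "\n";
```

the idea behind picking the busiest year is that a site was probably most alive when the archive visited it most often, but that is my reading of the code, not something the post spells out.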
yeah, that's everything you need to get started. the "clean" function removes the html and returns an array of sentences, and that's it. now you've got all the really nasty parts you need to build your own small scraper based on what eli posted.
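the clean function itself isn't shown here, so this is only a guess at what such a function could look like, not the original code. the post doesn't explain what the second parameter is for either. this sketch assumes it is used to drop sentences that still mention the old domain, and the 4-word minimum is also my own choice:

```php
<?php
// Hypothetical stand-in for the post's clean(): strips the html and returns
// an array of sentences, or false when nothing usable is left.
function clean($html, $domain)
{
    // drop script/style blocks entirely, then the remaining tags
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    $text = str_replace('<', ' <', $text); // keep words apart once tags are gone
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // naive sentence split: break after . ! or ? when whitespace follows
    $parts = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $sentences = array();
    foreach ($parts as $s) {
        // assumption: skip short fragments (menus, buttons) and any
        // sentence that still names the old domain
        if (str_word_count($s) < 4 || stripos($s, $domain) !== false) {
            continue;
        }
        $sentences[] = $s;
    }
    return count($sentences) > 0 ? $sentences : false;
}
```

returning false for an empty result matches the "if( !$data ) continue;" check in the loop above.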
have fun!
before i forget: the code above is just quick & dirty, so most of it is crap... but it works :P
Jun 27th 2007
Nice to see something besides your love notes to Jon. :)
I will work this over tonight and see how it goes.
Jul 1st 2007
good stuff dude, can’t wait to play with this.
Jul 3rd 2007
This is excellent, thanks for letting me know about it!
Jul 9th 2007
I will try this tonight, let’s see how your code works…
thanks for your work so far.
Jul 16th 2007
if everybody uses this, won't that kind of defeat the point of using it for UNIQUE content?