Efficiently decode a short URL

I am building a reference of short URLs, and for each one I need to record its final destination URL, server software and page title. cURL almost makes this a very simple process, but it lacks the ability to download only a limited number of bytes from the response. file_get_contents does let you limit the bytes received, but lacks the ability to automatically follow the Location: response header; it may also be disabled for remote file requests by the allow_url_fopen restriction.
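To illustrate, here's a minimal sketch of the file_get_contents approach (the short URL is a hypothetical placeholder): the fifth argument caps the download, but the call still depends on allow_url_fopen being enabled.

// A minimal sketch, assuming allow_url_fopen is on; the URL is hypothetical.
// The fifth argument limits the read to 10000 bytes, but there is no
// built-in way to stop early once the <title> tag has been seen.
$html = file_get_contents("http://short.example/abc", false, null, 0, 10000);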

I therefore decided to write my own, efficient version, based on fsockopen.

class link {

    // Public entry point: reset the redirect counter, then resolve.
    function resolve ($url, &$server, &$title) {
        $this->_resolveredirects = 0;
        return $this->_resolve($url, $server, $title);
    }

    function _resolve ($url, &$server, &$title) {
        // Give up if a misconfigured server keeps redirecting.
        if ($this->_resolveredirects > 5) {
            print "link::resolve($url) too many redirects\n";
            return $url;
        }
        $ports = array("http" => 80, "https" => 443);
        // Split the URL into protocol, hostname and request URI.
        if (!preg_match("@^(https?)://([^:/\?]+)(.*)$@i", $url, $match)) {
            print "link::resolve($url) unknown format\n";
            return $url;
        }
        $proto = strtolower($match[1]);
        $hostname = $host = strtolower($match[2]);
        $uri = $match[3];
        if ($uri == "") $uri = "/";
        if ($proto == "https") $hostname = "ssl://$host";
        $port = $ports[$proto];
        $fp = fsockopen($hostname, $port, $errno, $errstr, 10);
        if (!$fp) {
            print "link::resolve($url) $errstr ($errno)\n";
            return $url;
        }
        // Send a minimal HTTP/1.1 request.
        $out  = "GET $uri HTTP/1.1\r\n";
        $out .= "Host: $host\r\n";
        $out .= "User-Agent: Mozilla/5.0 (compatible; Link Resolver)\r\n";
        $out .= "Connection: Close\r\n\r\n";
        fwrite($fp, $out);
        $rxlimit = 10000;
        $rx = 0;
        stream_set_timeout($fp, 5);
        // Read line by line, stopping once the byte limit is reached.
        while (!feof($fp)) {
            if ($rx > $rxlimit) break;
            $line = fgets($fp, 2048);
            $rx += strlen($line); // count bytes as received, before trimming
            $line = trim($line);
            // Follow a redirect; note a relative Location: isn't resolved here.
            if (preg_match("/^Location:\s*(.+)$/i", $line, $match)) {
                fclose($fp);
                $this->_resolveredirects++;
                return $this->_resolve($match[1], $server, $title);
            }
            if (preg_match("/^Server:\s*(.+)$/i", $line, $match)) $server = $match[1];
            // Stop as soon as the page title has been found.
            if (preg_match("@<title>\s*([^\r\n]+?)\s*</title>@i", $line, $match)) {
                $title = $match[1];
                break;
            }
        }
        fclose($fp);
        return $url;
    }
}

Create a link object and call its resolve method to kick off the recursive _resolve method. A counter inside the object tracks the number of redirects followed, so the recursion can't run forever if there's a redirect misconfiguration on the remote server.
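For example, a hypothetical call might look like this ($server and $title are filled in by reference):

// Hypothetical usage; the short URL is just a placeholder.
$link = new link();
$server = "";
$title = "";
$final = $link->resolve("http://short.example/abc", $server, $title);
print "resolved: $final\nserver: $server\ntitle: $title\n";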

When downloading the response, the function reads only what it needs, stopping once it has found a <title> tag or hit the 10000-byte limit, whichever comes first.

If you find this useful, spot a bug or possible improvement, or have a question, just post a comment using the link at the top of this post.

