Efficiently decode a short url

By 10 Jul 2010 | Comment
I am building a reference of short URLs, and for each one I need to record its final destination URL, server software and page title. Curl almost makes this a very simple process, but lacks the ability to download a limited number of bytes from the response. file_get_contents does allow you to limit the bytes received, but lacks the ability to automatically follow the Location: response header; it may also be disabled for remote file requests by the allow_url_fopen restriction.

I therefore decided to write my own, efficient version, based on fsockopen.

class link {
    // Number of redirects followed during the current resolve() call.
    var $_resolveredirects = 0;

    // Returns the final URL; $server and $title are filled in by reference.
    function resolve ($url, &$server, &$title) {
        $this->_resolveredirects = 0;
        return $this->_resolve($url, $server, $title);
    }

    function _resolve ($url, &$server, &$title) {
        // Give up if the remote server keeps redirecting.
        if ($this->_resolveredirects > 5) {
            print "link::resolve($url) too many redirects\n";
            return $url;
        }
        $ports = array("http" => 80, "https" => 443);
        if (!preg_match("@^(https?)://([^:/\?]+)(.*)$@i", $url, $match)) {
            print "link::resolve($url) unknown format\n";
            return $url;
        }
        $proto = strtolower($match[1]);
        $hostname = $host = strtolower($match[2]);
        $uri = $match[3];
        // A bare host (http://example.com) or a query with no path still needs a leading "/".
        if ($uri == "" || $uri[0] == "?") $uri = "/$uri";
        if ($proto == "https") $hostname = "ssl://$host";
        $port = $ports[$proto];

        $fp = fsockopen($hostname, $port, $errno, $errstr, 10);
        if (!$fp) {
            print "link::resolve($url) $errstr ($errno)\n";
            return $url;
        }
        $out = "GET $uri HTTP/1.1\r\n";
        $out .= "Host: $host\r\n";
        $out .= "User-Agent: Mozilla/5.0 (compatible; Link Resolver)\r\n";
        $out .= "Connection: Close\r\n\r\n";
        fwrite($fp, $out);

        // Read line by line, stopping at a redirect, a <title>, or the byte limit.
        $rxlimit = 10000;
        $rx = 0;
        stream_set_timeout($fp, 5);
        while (!feof($fp)) {
            if ($rx > $rxlimit) break;
            $line = fgets($fp, 2048);
            $line = trim($line);
            $rx += strlen($line);
            if (preg_match("/^Location:\s*(.+)$/mi", $line, $match)) {
                fclose($fp);
                $this->_resolveredirects++;
                return $this->_resolve($match[1], $server, $title);
            }
            if (preg_match("/^Server:\s*(.+)$/si", $line, $match)) $server = $match[1];
            if (preg_match("@<title>\s*([^\r\n]+)\s*</title>@si", $line, $match)) {
                $title = $match[1];
                break;
            }
        }
        fclose($fp);
        return $url;
    }
}

Create a link object and call its resolve method to kick off the recursive _resolve method. It returns the final URL and fills in $server and $title by reference. I use a variable within the object to count the number of redirects followed, so it doesn't run infinitely if there's a redirect misconfiguration on the remote server.
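The redirect guard can be tried out without touching the network. The following is only a minimal sketch of the same idea, not the class itself: follow_redirects, its $redirects lookup array and the short.example URLs are all hypothetical stand-ins for real HTTP responses, and it uses a loop where the class uses recursion.

```php
<?php
// Hypothetical stand-in for the network: maps a URL to the URL it redirects to.
// A URL with no entry is treated as a final destination.
function follow_redirects(array $redirects, $url, $maxRedirects = 5) {
    $hops = 0;
    while (isset($redirects[$url])) {
        if ($hops >= $maxRedirects) {
            // Same kind of guard as the class: stop rather than loop forever.
            return array('url' => $url, 'hops' => $hops, 'gave_up' => true);
        }
        $url = $redirects[$url];
        $hops++;
    }
    return array('url' => $url, 'hops' => $hops, 'gave_up' => false);
}

// A two-hop chain resolves normally.
$chain = array(
    'http://short.example/a' => 'http://short.example/b',
    'http://short.example/b' => 'http://example.com/page',
);
$ok = follow_redirects($chain, 'http://short.example/a');

// A misconfigured server that redirects in a circle is cut off at the limit.
$loop = array(
    'http://short.example/x' => 'http://short.example/y',
    'http://short.example/y' => 'http://short.example/x',
);
$stuck = follow_redirects($loop, 'http://short.example/x');
```

Without the counter, the second call would never return, which is exactly the situation the variable in the object protects against.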

When downloading the response, the function only gets what it needs, stopping after it's found a <title> tag or if it's hit the 10000 byte limit (whichever comes first).
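To see the early stop in isolation, here is a rough sketch that runs the same kind of loop over an in-memory stream instead of a socket. scan_for_title is a hypothetical helper, not part of the class, and it counts raw bytes where the class counts trimmed lines. As in the class, a title is only found if the opening and closing tags sit on the same line.

```php
<?php
// Reads a stream line by line and stops as soon as a <title> is found
// or more than $limit bytes have been read, whichever comes first.
function scan_for_title($fp, $limit = 10000) {
    $received = 0;
    $title = null;
    while (!feof($fp) && $received <= $limit) {
        $line = fgets($fp, 2048);
        if ($line === false) {
            break;
        }
        $received += strlen($line);
        if (preg_match('@<title>\s*(.*?)\s*</title>@si', $line, $match)) {
            $title = $match[1];
            break;
        }
    }
    return array('title' => $title, 'bytes' => $received);
}

// In-memory stream standing in for the socket: the title comes early,
// so the long body after it is never read.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "<html>\n<head>\n<title> Example page </title>\n</head>\n");
fwrite($fp, str_repeat("<p>body text that is never read</p>\n", 1000));
rewind($fp);
$found = scan_for_title($fp);
fclose($fp);

// A stream with no title at all: reading stops shortly after the limit.
$fp = fopen('php://memory', 'w+');
fwrite($fp, str_repeat("<p>filler</p>\n", 5000));
rewind($fp);
$none = scan_for_title($fp, 1000);
fclose($fp);
```

In the first case only the opening lines of the document are consumed; in the second, the loop overshoots the limit by at most one line before giving up.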

If you find this useful, spot a bug or possible improvement, or have a question, just post a comment using the link at the top of this post.

