I was fortunate enough to attend the PHP UK Conference 2012 on Friday 24 February. The theme of the conference was how well PHP scales; watch the two keynote speeches by Rasmus Lerdorf (the creator of PHP) and Hugh Williams (eBay's VP of Engineering) and decide for yourself! Information on the talks, with links to the slides for each one, is available at phpconference.co.uk/talks/2012.
Warning about HashDoS: upgrade to PHP 5.3.10! Hardly anyone in the audience seemed to know about this extremely simple denial-of-service attack vector.
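To see why it's so easy to exploit, here's a toy demonstration I put together (my own sketch, not from the talk). PHP hashes array keys with DJBX33A, under which the two-byte strings "Ez" and "FY" collide; concatenating those blocks gives exponentially many colliding keys, so inserting them (eg, as POST parameters) degrades the hash table into a single long bucket scan:

```php
<?php
// Build 2^13 = 8192 keys that all share one DJBX33A hash bucket.
$blocks = array('Ez', 'FY');
$keys   = array('');
for ($i = 0; $i < 13; $i++) {
    $next = array();
    foreach ($keys as $k) {
        foreach ($blocks as $b) {
            $next[] = $k . $b;
        }
    }
    $keys = $next;
}

// Every insert has to scan the same ever-growing bucket.
$start = microtime(true);
$table = array();
foreach ($keys as $k) {
    $table[$k] = 1;
}
printf("%d colliding inserts: %.2fs\n", count($keys), microtime(true) - $start);
```

As I understand it, PHP 5.3.9 added the max_input_vars ini setting to cap how many request variables get parsed, and 5.3.10 fixed a code-execution bug in that protection; hence the upgrade advice.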
Innovations integrating with PHP: Rasmus was very excited about libevent + ZeroMQ (php.zero.mq) for fast IPC, and SVM (support vector machines, the machine-learning technique behind Ian Barber's fraud-detection talk below) looks great too.
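For a flavour of the ZeroMQ side, here's a minimal request/reply sketch using the php-zmq extension; the endpoint and payload are my own invention, and it assumes a REP service is already listening:

```php
<?php
// Fast IPC without HTTP overhead: connect, send, wait for the reply.
$context = new ZMQContext();
$socket  = $context->getSocket(ZMQ::SOCKET_REQ);
$socket->connect('tcp://127.0.0.1:5555');

$socket->send('ping');
echo $socket->recv(), "\n"; // blocks until the REP side answers
```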
PHP 5.4 testing is ongoing, but needs help from the development community: qa.php.net for quality assurance; gcov.php.net for coverage reports; bugs.php.net/random for a random PHP bug to check out; edit.php.net for documenting with PHP DocBook; wiki.php.net for RFCs, to-dos, etc.
PHP 6: there is no PHP 6 planned at the moment; Unicode problems (code bloat and performance issues) have stalled development.
Apache 2.4 looks good so far with PHP.
This was Saturday's keynote, so I didn't see it live. Hugh started off with PHP in 1997, using it as a teaching language. He went on to write several O'Reilly books about PHP and MySQL.
eBay at scale: to put it in perspective, eBay handles a transaction on a pair of shoes every 7 seconds. Its search engine serves 250 million queries per day, and there are 2 billion web page views daily. At any one time there are over 300 million active listings. They hold 9 petabytes of data, the databases handle 75 billion queries per day, and there are 6 million item updates per hour.
Scaling secrets:
This is about the role of the data scientist, trying to extract meaning from random web data. James' website is LifestyleLinking.net.
He uses ZeroMQ to feed new data sources into the system. He combines new content plus a definition (eg, from Wikipedia) with a score matrix (keyword frequencies) to find new keywords. The process is about data conditioning (cleanup) and data product (creating something new). It's important not to assume or define fixed rules; let the data dictate what's important.
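To make the scoring idea concrete, here's my own illustrative sketch (not James's code): count word frequencies in new content, then boost words that are missing from the reference definition as candidate new keywords. The file names are hypothetical:

```php
<?php
// Word-frequency "score matrix" for a piece of text.
function keywordFrequencies($text)
{
    return array_count_values(str_word_count(strtolower($text), 1));
}

$definition = keywordFrequencies(file_get_contents('definition.txt'));  // eg, a Wikipedia extract
$content    = keywordFrequencies(file_get_contents('new_content.txt'));

$scores = array();
foreach ($content as $word => $count) {
    // No fixed rules: down-weight words already in the definition and
    // let the data surface the candidates that matter.
    $scores[$word] = isset($definition[$word]) ? $count * 0.5 : $count;
}
arsort($scores);
print_r(array_slice($scores, 0, 10, true));
```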
Data science influencers: Jeff Hammerbacher; Hilary Mason (Bitly); Monica Rogati (LinkedIn); John Rauser (Amazon); Andrew Starkey (blue-flow.com).
Brandon works on the Mozilla bug handling team.
Always keep an abstraction layer between the data source and the processing code.
Think about what we want to do with the input data.
Use a standard data format: don't pass around driver-specific handles (eg, a Mongo cursor); see the sketch after this list.
Use the correct storage medium: available, consistent, partition-tolerant; choose any two (the CAP theorem)!
The goal: be able to choose a different data source type very easily.
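Here's a hypothetical sketch of those points (the names are mine, not Brandon's): hide each data source behind an interface that returns plain arrays, so callers never touch a driver-specific handle and the backing store can be swapped easily. It assumes the (then-current) Mongo extension:

```php
<?php
// The standard data format here is "a list of associative arrays".
interface BugSource
{
    /** @return array bug records as plain associative arrays */
    public function fetchOpenBugs();
}

class MongoBugSource implements BugSource
{
    private $bugs;

    public function __construct(MongoCollection $bugs)
    {
        $this->bugs = $bugs;
    }

    public function fetchOpenBugs()
    {
        // Materialise the cursor at the boundary; no MongoCursor leaks
        // out, so switching to another store only touches this class.
        return iterator_to_array($this->bugs->find(array('status' => 'open')));
    }
}
```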
Brandon recommends Elasticsearch as a search solution.
An insightful talk, but not as engaging or inspiring as I had hoped.
Think about the creative process: prepare, incubate, insight, verify. Allow time to incubate a solution: having prepared and gathered information for the task at hand, don't try to force immediate insight into the solution.
Brain function: the brain appears to work in either big-picture or detailed mode, but not both at once on the same thing. Think of it as a dual-CPU model, but with shared memory.
The synchronous (step-by-step) CPU handles logic, analysis, and abstraction.
The asynchronous (all-at-once) CPU is based on vision and analogy, and looks at things in a 'rich mode'.
Useful creative skills: pattern matching (spatial awareness); holistics (seeing the big picture); analogy and metaphor (common language and humour to make things easier to understand); pair programming (one partner being creative, the other analytical); synthetics (prototyping and unit testing).
Quick win: use vmstat for identifying issues.
Use siege for load testing.
Set timing points to see how fast specific code blocks are.
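The crudest timing point is just a pair of microtime() calls around the suspect block; the function name here is hypothetical:

```php
<?php
$start = microtime(true);

$report = build_sales_report(); // the code block under measurement

error_log(sprintf('build_sales_report: %.4fs', microtime(true) - $start));
```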
Checking code use: inclued (include-hierarchy analysis); Xdebug. Use meld to compare Xdebug output from different machines and see why they behave differently. To study the cachegrind files Xdebug produces, use KCachegrind (Linux) or QCachegrind (Mac).
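For reference, these are the Xdebug 2.x settings that produce the cachegrind files those tools read (the output path is my example):

```ini
xdebug.profiler_enable = 1
xdebug.profiler_output_dir = /tmp/profiles
xdebug.profiler_output_name = cachegrind.out.%p
```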
For deeper analysis, Valgrind can dig into PHP itself.
Alternatively, you can use APD.
For high-performance profiling: XHProf + XHGUI, which are light enough to use on live environments too.
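A minimal XHProf run looks like this (a sketch only; the profiled function is hypothetical, and XHGUI or the bundled viewer renders the saved data):

```php
<?php
xhprof_enable(XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY);

handle_request(); // the work being profiled

$data = xhprof_disable();
// Persist the run for XHGUI (or xhprof_html) to display later.
file_put_contents('/tmp/xhprof.' . uniqid() . '.out', serialize($data));
```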
Tracking memory usage is tricky: PHP frees memory back to the operating system only when it chooses to.
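The engine's own counters are the most reliable view; note that the "real" figure doesn't necessarily drop after an unset():

```php
<?php
$big = str_repeat('x', 5 * 1024 * 1024);
echo memory_get_usage(true), " bytes allocated from the system\n";

unset($big); // freed to the Zend memory manager...
echo memory_get_usage(true), " bytes may still be reported\n";
echo memory_get_peak_usage(true), " bytes at the request's peak\n";
```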
This talk was very informative and easy to follow; we should be able to play with SVM quite easily.
SVM's purpose is to take input data (the vectors), learn from known fraudulent events, and accurately predict future fraudulent events. I can see this system working well in detecting all sorts of events, not just fraud; for example, detecting people who are trying to game a social network with multiple accounts. The key is in providing a sufficient range of input vectors; let the system figure out which ones are important.
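Getting started really is simple with the pecl svm extension (written by Ian himself). In this sketch the feature values and labels are made up; each training row is a label followed by feature-index => value pairs:

```php
<?php
$data = array(
    array(-1, 1 => 0.43, 2 => 0.12), // -1: a legitimate transaction
    array(-1, 1 => 0.40, 2 => 0.18),
    array( 1, 1 => 0.05, 2 => 0.91), //  1: a known fraudulent event
    array( 1, 1 => 0.01, 2 => 0.84),
);

$svm   = new SVM();
$model = $svm->train($data);

// Predict an unseen event; feature indices must match the training set.
echo $model->predict(array(1 => 0.03, 2 => 0.88)), "\n"; // expect 1
```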
Ian recommends Chris Manning (machine learning) for further reading, along with a Stanford online course.
In case you missed PHP UK 2012, or if you're seriously into PHP, head over to php|tek in Chicago: May 22-25, 2012. Check out tek12.phparch.com for more information.