UK PHP Conference 2012: a question of scale

I was fortunate enough to attend the PHP UK Conference 2012, on Friday 24 February. The theme of the conference was how well PHP scales; watch the two keynote speeches by Rasmus Lerdorf (PHP founder) and Hugh Williams (eBay's Engineering VP) and decide for yourself! Information on the talks, and links to the slides for each talk, are available at phpconference.co.uk/talks/2012.

Keynote by Rasmus Lerdorf

Warning about HashDoS: Upgrade to PHP 5.3.10! Hardly anyone in the audience appeared to know about this extremely simple denial-of-service attack vector.
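A quick way to see whether a server is on the recommended release is to compare the running version against 5.3.10. This is only a minimal sketch: the helper name isAtLeastRecommendedVersion is made up for illustration, and it checks the version number alone, so it cannot tell whether a distribution has back-ported the fix to an older release.

```php
<?php
// Hypothetical helper: true when the given PHP version is at least 5.3.10,
// the release recommended in the keynote.
function isAtLeastRecommendedVersion($version)
{
    return version_compare($version, '5.3.10', '>=');
}

// Check the interpreter running this script.
if (isAtLeastRecommendedVersion(PHP_VERSION)) {
    echo "PHP " . PHP_VERSION . " is 5.3.10 or newer\n";
} else {
    echo "PHP " . PHP_VERSION . " predates 5.3.10: upgrade\n";
}
```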

New integrations for PHP: Rasmus was very excited about libevent + ZeroMQ (php.zero.mq) for fast inter-process communication, and SVM (support vector machines, a machine-learning technique used here for fraud detection) looks great too.

PHP 5.4 testing is ongoing, but needs help from the development community: qa.php.net for quality assurance; gcov.php.net for coverage reports; bugs.php.net/random for a random PHP bug to check out; edit.php.net for documenting with PHP DocBook; wiki.php.net for RFCs, to-dos, etc.

PHP 6: there is no PHP 6 planned at the moment. Unicode problems (code bloat and performance issues) have stalled development.

Apache 2.4 looks good so far with PHP.

Keynote by Hugh Williams

This was Saturday's keynote, so I didn't see it live. Hugh started off with PHP in 1997, using it as a teaching language. He went on to write several O'Reilly books about PHP and MySQL.

eBay at scale: to put it in perspective, it handles a transaction on a pair of shoes every 7 seconds. Its search engine handles 250 million queries per day, and there are 2 billion daily web page views. There are, at any one time, over 300 million active listings. They have 9 petabytes of data, and the databases handle 75 billion queries per day. There are 6 million item updates per hour.

Scaling secrets:

  • Query rewrites (misspellings; plurals) to make search intuitive - run a much more complex query in the background. Understand the user's intent, by identifying when users do or don't interact with the results. This can be done because of the vast amount of data available.
  • Sort results by item quality attributes (eg, image aspect ratio; seller rating; watchers). eBay works out the likelihood of an item selling, and how much it should sell for.
  • Prioritise event analysis, so that only the most interesting events are analysed.
  • Heatmap tool: visualise queries to identify when a user isn't satisfied.
  • Project Cassini: the new search engine, based on commodity hardware.
  • Test often, using lots of little tests.
  • ql.io: open-source API platform, helping client programs make fewer calls. It uses a SQL-like query language to specify the data required. Data is transferred in JSON format.

Big Data by James Littlejohn

This is about the role of the data scientist, trying to extract meaning from random web data. James' website is LifestyleLinking.net.

He uses ZeroMQ to feed new data sources into the system. He combines new content plus a definition (eg, from Wikipedia) with a score matrix (keyword frequencies) to find new keywords. The process is about data conditioning (cleanup) and data product (creating something new). It's important that we don't assume or define fixed rules; let the data dictate what's important.

Data Science influencers: Jeff Hammerbacher; Hilary Mason (bitly); Monica Rogati (LinkedIn); John Rauser (Amazon); Andrew Starkey (blue-flow.com)
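To make the keyword-frequency idea concrete, here is a small sketch of how such a score matrix might be built. It is not James' actual code: the keywordFrequencies helper and the sample texts are invented for illustration, and a real system would also need stop-word handling and stemming.

```php
<?php
// Hypothetical helper: count how often each word appears in a piece of text.
function keywordFrequencies($text)
{
    // Lower-case, then split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    arsort($counts); // most frequent first
    return $counts;
}

// One row per source: a tiny "score matrix" of keyword frequencies.
// Both texts are made-up sample data.
$sources = array(
    'post'       => 'Surfing in Cornwall: surfing lessons and surf reports',
    'definition' => 'Surfing is a surface water sport',
);

$matrix = array();
foreach ($sources as $name => $text) {
    $matrix[$name] = keywordFrequencies($text);
}

print_r($matrix);
```

Comparing the rows then shows which words the new content shares with the definition, and which frequent words are new.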

Data abstraction by Brandon Savage

Brandon works on the Mozilla bug handling team.

Always keep a separate layer of code between the data source and the processing.

Think about what we want to do with the input data.

Use a standard data format - don't pass around custom data handlers (eg, a MongoDB cursor).

Use the correct storage medium: available / reliable / consistent - choose any two!

The goal: be able to choose a different data source type very easily.
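The points above can be sketched in a few lines of PHP. This is an illustration under my own assumptions, not code from the talk: the RecordSource interface, the ArrayRecordSource class and the sample rows are all hypothetical names and data.

```php
<?php
// Hypothetical interface: every data source returns plain arrays,
// never a driver-specific handle such as a database cursor.
interface RecordSource
{
    /** @return array list of array('id' => int, 'title' => string) */
    public function findOpen();
}

// In-memory implementation; a database-backed class could implement
// the same interface and convert its results to plain arrays.
class ArrayRecordSource implements RecordSource
{
    private $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    public function findOpen()
    {
        $open = array();
        foreach ($this->rows as $row) {
            if ($row['status'] === 'open') {
                $open[] = array('id' => $row['id'], 'title' => $row['title']);
            }
        }
        return $open;
    }
}

// Processing code depends only on the interface, so the source can be swapped.
function openTitles(RecordSource $source)
{
    $titles = array();
    foreach ($source->findOpen() as $record) {
        $titles[] = $record['title'];
    }
    return $titles;
}

$source = new ArrayRecordSource(array(
    array('id' => 1, 'title' => 'Crash on start', 'status' => 'open'),
    array('id' => 2, 'title' => 'Typo in footer', 'status' => 'closed'),
));

print_r(openTitles($source));
```

Because openTitles() only ever sees arrays, replacing the in-memory source with another storage engine should not touch the processing code.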

Brandon recommends Elastic Search as a search solution.

Creative Coding by June Henriksen

An insightful talk, but not as engaging or inspiring as I would have hoped for.

Think about the creative process: prepare - incubate - insight - verify. Allow time to incubate a solution. Having prepared & gathered information for the task at hand, don't try to immediately gain insight into the solution.

The brain function: it appears to work with either big picture or detailed thinking, but not at the same time on the same thing. Think of it as a dual-CPU model, but with shared memory.

Synchronous (step-by-step) CPU has logic, analysis, abstraction.

Asynchronous (all-at-once) CPU is based on vision and analogy, looks at things in a 'rich mode'.

Useful creative skills: pattern matching (spatial awareness); holistics (the big picture); analogy, metaphors (common language & humour to make things easier to understand); pair programming (one being creative, one being analytical); synthetics (prototyping & unit testing).

Profiling apps by Derick Rethans

Quick win: use vmstat for identifying issues.

Use siege for load testing.

Set timing points to see how fast specific code blocks are.
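A timing point can be as simple as a pair of microtime(true) calls around the block of interest. The timeBlock helper below is a hypothetical convenience wrapper, not something shown in the talk.

```php
<?php
// Hypothetical helper: run a callable and return array(result, elapsed seconds).
function timeBlock($callable)
{
    $start = microtime(true);             // timing point before the block
    $result = call_user_func($callable);
    $elapsed = microtime(true) - $start;  // timing point after the block
    return array($result, $elapsed);
}

list($sum, $seconds) = timeBlock(function () {
    $total = 0;
    for ($i = 1; $i <= 10000; $i++) {
        $total += $i;
    }
    return $total;
});

printf("sum=%d took %.6f seconds\n", $sum, $seconds);
```

Timings like this are noisy, so it is worth repeating a measurement several times before drawing conclusions.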

Checking code use: inclued (code analysis); Xdebug. Use Meld for comparing Xdebug output from different machines, to see why they behave differently. To study the cachegrind files produced by Xdebug's profiler, use KCachegrind (Linux) or QCachegrind (Mac).

For deeper analysis, Valgrind goes into the PHP engine itself.

Alternatively, you can use APD.

For high-performance profiling: XHProf + XHGUI - on live environments too.

Tracking memory usage is an issue: PHP frees up memory when it wants.
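PHP's built-in memory_get_usage() and memory_get_peak_usage() functions give a rough picture. The sketch below is my own illustration; the exact numbers depend on the PHP version and platform, and the reading after unset() will not necessarily return to the starting value.

```php
<?php
$before = memory_get_usage();

$data = range(1, 100000);   // allocate a large array
$during = memory_get_usage();

unset($data);               // release our only reference to it
$after = memory_get_usage();

printf("before: %d bytes\n", $before);
printf("during: %d bytes\n", $during);
printf("after:  %d bytes\n", $after);
printf("peak:   %d bytes\n", memory_get_peak_usage());
```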

Fraud detection by Ian Barber

This talk was very informative and easy to follow; we should be able to play with SVM quite easily.

SVM's purpose is to take input data (the vectors), learn from known fraudulent events, and accurately predict future fraudulent events. I can see this system working well in detecting all sorts of events, not just fraud; for example, detecting people who are trying to game a social network with multiple accounts. The key is in providing sufficient range of input vectors - let the system figure out which ones are important.
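To show the train-then-predict workflow without needing any extension installed, the sketch below swaps in a much simpler linear classifier, a perceptron. It is not an SVM and not Ian's code; the function names, the two features (orders per hour, share of orders with a mismatched address) and all the sample values are invented for illustration.

```php
<?php
// Perceptron prediction: 1 = fraud, -1 = legitimate.
function predictPerceptron(array $weights, $bias, array $x)
{
    $score = $bias;
    foreach ($x as $j => $value) {
        $score += $weights[$j] * $value;
    }
    return $score > 0 ? 1 : -1;
}

// Perceptron training (NOT an SVM): learns weights for a linear boundary
// from labelled vectors. Returns array($weights, $bias).
function trainPerceptron(array $samples, array $labels, $epochs = 100)
{
    $weights = array_fill(0, count($samples[0]), 0.0);
    $bias = 0.0;
    for ($epoch = 0; $epoch < $epochs; $epoch++) {
        $errors = 0;
        foreach ($samples as $i => $x) {
            if (predictPerceptron($weights, $bias, $x) !== $labels[$i]) {
                // Nudge the boundary towards the misclassified sample.
                foreach ($x as $j => $value) {
                    $weights[$j] += $labels[$i] * $value;
                }
                $bias += $labels[$i];
                $errors++;
            }
        }
        if ($errors === 0) {
            break; // every training sample is classified correctly
        }
    }
    return array($weights, $bias);
}

// Made-up training data: known legitimate (-1) and fraudulent (1) events.
$samples = array(array(1, 0.0), array(2, 0.1), array(8, 0.9), array(9, 0.8));
$labels  = array(-1, -1, 1, 1);

list($weights, $bias) = trainPerceptron($samples, $labels);

// Classify two unseen events.
$busy  = predictPerceptron($weights, $bias, array(10, 0.7));
$quiet = predictPerceptron($weights, $bias, array(1, 0.2));
echo "busy account: $busy, quiet account: $quiet\n";
```

An SVM follows the same shape (train on labelled vectors, then predict) but chooses its boundary far more carefully and can handle data that is not linearly separable.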

Ian recommends Chris Manning (Machine Learning) for further reading, as well as a Stanford online course.

Another conference...

In case you missed PHP UK 2012, or if you're seriously into PHP, head over to php|tek in Chicago: May 22-25, 2012. Check out tek12.phparch.com for more information.

