Wikipedia reveals how Facebook's hip hop can pep up PHP sites

How Wikipedia halved the time it takes to make edits to its pages by switching to Facebook's Hip Hop Virtual Machine.

Internal releases of the HipHop Virtual Machine are named after rap luminaries such as T-Pain, Ultramagnetic, Vanilla Ice and Will.i.am. Image: HHVM

Sluggish performance can undermine a website's chances of building a loyal user base.

Failure to keep websites snappy has been blamed for hastening the decline of once-busy online destinations such as Friendster.

Mindful that Wikipedia should avoid a similar fate, the charitable organisation running the world's most popular general reference site set out to streamline the software that handles encyclopaedia edits.

For help, the Wikimedia Foundation turned to Facebook, the social networking giant whose software handles posts by more than 1.35 billion users.

Both Wikipedia and Facebook rely on PHP as a scripting language to help generate web pages for their users.

While PHP is one of the most popular server-side scripting languages in the world, used by 240 million websites as of 2013, its performance has limitations.

"As a single-threaded language, PHP cannot be parallelized - that is, a task that takes up a whole CPU core can't be sped up by spreading it over multiple cores," explained Wikimedia's principal software engineer Ori Livneh in a blog post.

The speed at which PHP executes is also held back by its being an interpreted language, rather than a compiled one such as C++.

To improve PHP's performance, Facebook created the HipHop Virtual Machine (HHVM), an alternative runtime that executes PHP source code.

HHVM is able to execute PHP code more rapidly than the standard interpreter known as Zend. It achieves this in a number of ways, as explained by Wikimedia's Livneh.

"HHVM is able to extract high performance from PHP code by acting as a just-in-time (JIT) compiler, optimizing the compiled code while the program is already running," he said.

"While running, HHVM analyzes the code in order to find opportunities for optimization. The first few times a piece of code is executed, HHVM doesn't optimize at all; it executes it in the most naive way possible. But as it's doing that, it keeps a count of the number of times it has been asked to execute various bits of code, and gradually it builds up a profile of the "hot" (frequently invoked and expensive to execute) code paths in a program. This way, it can be strategic about which code paths to study and optimize."

Boosts to performance

After switching all of its app servers from Zend to HHVM, Wikimedia is reporting significant reductions in how long it takes to save edits to pages on Wikipedia and other Wikimedia sites.

The median time to save a page dropped from about 7.5 seconds to 2.5 seconds, and the mean time fell from about six seconds to three.

"Given that Wikimedia projects saw more than 100 million edits in 2014, this means saving a decade's worth of latency every year from now on," said Livneh.

Image: Brett Simmers / HHVM

The average page load time for logged-in users also fell from about 1.3 seconds to 0.9 seconds.

The CPU load on app servers has "dropped drastically", from about 50 percent to 10 percent, allowing the foundation to scale back plans to purchase new machines.

"These numbers look great but they don't tell the whole story. Total page save time includes time spent waiting on the database (or any other IO), so the effects on CPU usage were even more pronounced," wrote HHVM software engineer Brett Simmers in a blog post.

Simmers goes into more detail about how performance was improved, explaining the behaviour in the graph below.

Image: Brett Simmers / HHVM

"There are two clear steps downward: the initial switch to HHVM, and then a configuration tweak. The tweak was enabling the hhvm.server_stat_cache config option, which caches the results of stat calls and uses inotify to detect when files are changed," he said.

One caveat to bear in mind, however, according to Simmers, is that part of the reason the performance improvements were so pronounced is that the foundation was upgrading from PHP 5.3, an older version of the language.

"PHP 5.6 has some substantial performance improvements over 5.3, so the bump from PHP 5.6 to HHVM would likely be less dramatic," he wrote in a blog post.

"That said, switching all API/web servers over to HHVM is by no means the end of this process. There are a number of other tweaks that can be made to further improve HHVM's performance, most notably using Repo Authoritative mode."

Other developers have pointed out that PHP 7 is due out later this year, promising performance boosts from a significant refactoring of the Zend Engine.

Implementing the changes

While HHVM will run most existing PHP code, there are minor incompatibilities, and of course the transition was not entirely straightforward for the foundation. The move required about six months of work by Wikimedia engineers, with help from Facebook.

Work was necessary to ensure compatibility with the three PHP extensions written in C that Wikimedia uses:

  • A sandbox for the Lua scripting language, which enables editors to set up templates
  • Wikidiff2, which generates documents highlighting the differences between revisions of wiki pages
  • A "fast string search" extension, which allows rapid substring matching

Getting these extensions ready required significant effort, including a rewrite of the compatibility layer for HHVM, with Wikimedia contributing about 20,000 lines of code upstream.

Once work on these extensions was complete - along with fixing edge cases and updating server configuration tools such as Puppet, as well as server operating systems - Wikimedia began the transition.

It undertook the changeover in stages, offering access to the HHVM servers as an opt-in beta so that remaining problems could be ironed out. Servers were then switched over one at a time until the beginning of December, when all users were moved to HHVM.

Future plans for HHVM

HHVM offers tools to spot bottlenecks and unnecessary work, which Wikimedia plans to use to streamline its applications so they are better suited to running on large clusters of computers.

The flame graph shown below visualises which parts of the code consume the most CPU time, while keeping track of which other parts of the code they have been called from. This data was gathered by an HHVM extension called 'Xenon', which snapshots server activity at regular intervals.

Image: Ori Livneh / Wikimedia

HHVM's static program analysis, which infers what sort of values a variable can assume - strings, integers, or floating-point numbers, for example - should also improve code quality, said Livneh.
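To see why such inference matters, here is an invented example of the kind of PHP that resists analysis: a function whose return type depends on its input gives a static analyzer - and the JIT - nothing firm to specialize on.

```php
<?php
// Invented example: the return value is sometimes an int and sometimes a
// string, so the analyzer cannot pin down a type for callers, and the JIT
// cannot specialize the code paths that consume the result.
function get_edit_count($row) {
    return isset($row['edits']) ? (int)$row['edits'] : 'unknown';
}

// Returning one consistent type lets the analyzer infer int throughout.
function get_edit_count_fixed($row) {
    return isset($row['edits']) ? (int)$row['edits'] : 0;
}
```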

Wikimedia has also expressed interest in Hack, the language developed by Facebook to overcome some of PHP's shortcomings, such as the lack of asynchronous processing, and which can also be executed by HHVM.

In the long run the switch to HHVM might even enable changes to how content is created on Wikipedia.

"With HHVM, we have the capacity to serve dynamic content to a far greater subset of visitors than before," Livneh said.

"In the big picture, this may allow us to dissolve the invisible distinction between passive and active Wikipedia users further, enabling better implementation of customization features like a list of favorite articles or various "microcontribution" features that draw formerly passive readers into participating."