This is the fourth installment of a multipart series where we aim to share with you some of the technical aspects of what powers the Managed WordPress Hosting system we developed here at page.ly: how we started, the recent server improvements and a bit on the things to come. [Part 1] [Part 2] [Part 3]
Time to go big.
When we started this journey many years ago, we made system choices fitting at the time. We grew, and grew some more, and kept on growing, and the old system was straining. In the summer of 2011 we began work on page.ly 3.0, our most advanced Managed WordPress Hosting system to date: a system designed for reliability, security, scalability and most of all performance. It’s all the hawtness of page.ly, just faster and more reliable.
The build process was long and unfortunately not without its problems. The goals of page.ly3 were:
- It has to scale easily
- It has to be fast
- It has to be secure
- It has to be reliable
That’s it. Simple order, right?
Our plan was to develop page.ly3 next to pagely2. Build a mansion next to a condo, and then come moving day, everyone just packs up and moves next door. However, it turned out that we moved and improved things in pieces.
We started by consulting some of the best and brightest the web community has to offer. I spent some hours consulting with my good friend Joe Stump (ex-lead architect at Digg) among others with the ‘scaling’ knowledge. We developed a simple plan: use small nodes to do specific tasks, and scale out in layers.
We decided we needed dedicated machines for mysql, web, varnish, memcache, and the like. We needed load balancers out front and we wanted robust and fast storage behind it. We also had to get rid of Plesk. While it was fine for a smaller page.ly, it simply did not scale and frankly is a festering turd, like all control panels in our opinion.
The first thing we did was add more database servers: 2 new slaves coming off the master. We then added WordPress’s HyperDB system-wide to split reads from writes. We went from a single 8-core 17gb machine to three 4-core 10gb machines. The slaves happily process 700-800 queries a second at 0.20 load.
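To make the read/write split concrete, here is a minimal sketch of the idea in Python. It is not HyperDB's actual code (HyperDB is a PHP database class), and the hostnames and the list of read-only statement types are assumptions made up for illustration.

```python
import itertools

# Hypothetical hostnames; the post does not name its database servers.
MASTER = "db-master"
REPLICAS = ["db-replica-1", "db-replica-2"]  # the two slaves off the master

_replica_cycle = itertools.cycle(REPLICAS)

# Assumed set of statements treated as read-only.
READ_PREFIXES = ("select", "show", "describe", "explain")


def pick_server(sql: str) -> str:
    """Send read-only statements to a replica, everything else to the master."""
    words = sql.lstrip().split(None, 1)
    first_word = words[0].lower() if words else ""
    if first_word in READ_PREFIXES:
        return next(_replica_cycle)  # simple round-robin over the replicas
    return MASTER
```

A real setup also has to worry about replication lag (reading your own write from a slave that has not caught up yet), which this sketch ignores.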
We then set up the new storage. Instead of running storage local to each machine and working out an rsync strategy, we went with NFS. At first we went with a solid 700gb disk, but then decided it was better to serve 10x70gb disks up to the web nodes over NFS4. Firehost also upgraded all our storage to Fibre. The disk speeds are crazy fast, and are still in the 90mb/s range serving over NFS. We do all heavy file operations directly on the storage nodes, which speeds things up as well.
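Splitting one big volume into ten smaller shares means something has to decide which share a given site lives on. The post does not say how that mapping is done, so the following is only one plausible sketch: a stable hash of the domain name picks the mount. The mount paths and the function name are hypothetical.

```python
import hashlib

# Assumed mount points: ten shares, matching the 10x70gb layout above.
NFS_MOUNTS = [f"/mnt/nfs{n:02d}" for n in range(10)]


def mount_for_site(domain: str) -> str:
    """Map a site to one share using a stable hash of its domain name."""
    digest = hashlib.sha256(domain.lower().encode("utf-8")).hexdigest()
    return NFS_MOUNTS[int(digest, 16) % len(NFS_MOUNTS)]
```

Because the hash depends only on the domain, every web node resolves the same site to the same share without any shared lookup table. The trade-off is that changing the number of shares moves most sites, so a lookup table may be the better choice if shares are added often.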
Zeus load balancers were set up out front in a failover configuration, with dual Varnish nodes directly behind them. This gives us a lot of flexibility, allowing us to segment and direct traffic around or through specific Varnish nodes directly to some or all of the web nodes. It’s very fault tolerant: if both Varnish nodes fall over, everything goes gracefully to the web nodes. Varnish is of course a web caching system, serving pages and static assets in 20-50ms, a 4x improvement over pagely2.
We set up some other nodes to handle a few management tasks like puppet, memcache, cron jobs, backups, logging, etc.
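The fall-through behavior can be sketched in a few lines. This is not how Zeus is configured (that happens in the load balancer itself); it is just the decision logic written out in Python, with made-up node names in the usage example.

```python
from typing import Dict, List


def backends_for_request(varnish_up: Dict[str, bool], web_nodes: List[str]) -> List[str]:
    """Return the nodes the load balancer should send traffic to."""
    healthy_varnish = [name for name, up in varnish_up.items() if up]
    if healthy_varnish:
        return healthy_varnish
    # Both caches are down: skip the cache layer and go straight to the web pool.
    return list(web_nodes)
```

For example, `backends_for_request({"varnish1": False, "varnish2": False}, ["web1", "web2"])` returns the web nodes, so the site stays up (just slower) when the cache layer is gone.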
We also had to gut all that code that relied on Plesk, since it is dead to us. It is not terribly pretty right now but we bashed together a temporary set of scripts to install and manage our system. This is just a band-aid for now, and we’ll give you a preview of the new system in the final chapter of this series.
Finally, we direct certain types of traffic to specific nodes tuned for the task. We don’t blanket all web nodes with all traffic. As an example, wp-cron.php requests hammer one pool of servers all day, allowing the others to work without those nasty processes sapping CPU.
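As a rough illustration of routing by request type, the sketch below picks a pool from the request path. The pool names are hypothetical, and in practice a rule like this would live in the load balancer or Varnish config rather than in application code.

```python
from urllib.parse import urlsplit


def pool_for(url: str) -> str:
    """Pick a (hypothetical) server pool name based on the request path."""
    path = urlsplit(url).path
    if path.endswith("/wp-cron.php"):
        return "cron"  # pool that absorbs the wp-cron.php traffic
    return "web"  # everything else goes to the general web pool
```

So `pool_for("/wp-cron.php?doing_wp_cron=123")` lands on the cron pool, while a normal page view goes to the general web pool.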
Here is a snapshot of what the page.ly 3.0 server topology consists of. Below we will talk about each part of it.
- All traffic of course comes from the world famous interwebs, a series of tubes that connect you to LOLcats and Wikipedia.
- That traffic hits the Firehost security layer. This is like fort knox for data. Thousands of nefarious attacks are blocked here daily.
- Good traffic then travels over our 10gb internal wan and hits our twin Zeus load balancers. These bad boys are the latest and greatest in awesome. So much so Rackspace uses them like a mo-fo but calls them Cloud Load Balancers, which is total BS cause Zeus lived above the clouds like a Boss.
- Internal DB read requests are load balanced to the DB Slave pool.
- Web requests are seamlessly routed around Varnish if needed to land directly at the web pool for normal processing.
- Requests are then routed to our Varnish layer where cached content (images, js, css, and full page output) is served directly back to the user at crazy fast speeds.
- Backend requests from Varnish are handled by our pool of Litespeed web servers. These are well equipped but dummy machines that do nothing but process PHP and serve up web pages. The nice thing here is that we can scale out, adding as many as we need for as long as we need. Why not nginx? We are testing some options and benchmarking a variety of webservers for future use.
- The Database queries come from the web servers through HyperDB, a WordPress database class designed and used by the folks at WordPress.com. Reads & Writes go to separate database pools tuned for their particular task. Again here we can scale out as many slaves as we need with replicating masters.
- All Files are served from our latest generation SAN array which we can scale damn near infinitely in size. This large volume is mounted as many smaller NFS shares to provide optimum performance while maintaining a global shared file system. When we feel like it, an upgrade to SSD disks is a click away.
- The entire system is controlled by Puppet with manifests to maintain server configs and handle server cloning.
- Our Utility node does all manner of tasks like taking out the garbage, running Nagios, and keeping my vodka glass filled.
- The Page.ly node is frankly where the magic happens regarding our Managed WordPress system. Server architecture is cool and all, but without badass software it’s just 1’s and 0’s.
- Finally, every node is HA and the entire system is backed up every evening with snapshots. We can restore any node, any data, at any time.
Currently we vary between 16-24 nodes running depending on load.
El Pollo Diablo
El Pollo Diablo is essentially the same architecture stack. Typically we route the traffic through the load balancers to a dedicated Varnish node and web server pool specific to that client. Sometimes the client runs their own mysql pool as well, and sometimes they take advantage of our mysql cluster. The flexibility gives customers true ‘dedicated’ resources but also use of our redundant and failover-safe systems as needed. Customers have been very pleased with the hybrid approach, which gives them flexibility and a ‘managed’ WordPress solution.
We did not share every piece of the puzzle as there are a few blackops systems in place to add that little extra awesome that makes Page.ly special. Sure it cost us more to roll out than most people’s annual salaries but we spent it out of love, you guys deserve it.
This is not rocket science as robust and redundant server stacks are the norm for large scale websites. However we are happy that now we have implemented a system that can keep up with our insane growth AND serve not 1 but thousands of websites at speed and scale.
Phase 5: The transition
Moving clients to the new system was not easy. It spanned over 3 months, from October 2011 to just after Christmas. We moved 1 of the old Plesk servers at a time, rsyncing all that data over, popping IPs off those nodes and onto the load balancers. After every block, we would tune and test, uncover bad mojo and cure it. We did the last client move after Christmas and finally shut down those old nodes.
October-December 2011 were probably the hardest days our company has faced, and hopefully ever will. We had a hard time getting the balance just right, redoing our storage setup and Varnish configs at least 4x while serving live traffic. It’s hard to plan for what you don’t know. An example of which is a nasty bug in PHP itself that essentially lstats a file 4x-8x EVERY single time it requests it. On local storage you probably would not notice the hit. Over NFS, with the millions of files we access millions of times a day, it was literally shredding the SAN and sending IO on the web nodes through the roof. This is just 1 of the handful of things we found at scale, as in they only cropped up under load.
We had to work through each one of these, again under live conditions, because once we moved a block of customers, there was no going back.
We saw some client attrition as we worked through the issues. We win them with a solid offering and stellar support, but even the most understanding customer is only going to hang around so long if you can’t keep the system up for more than 10 hours at a time. We wish them well, and we know they will be back. They always come back, as the other solutions out there just don’t stack up (their words, not ours).
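To get a feel for why a 4x-8x multiplier hurts so much over NFS, here is some back-of-the-envelope arithmetic. The daily request count is a made-up round number for illustration, not a Page.ly measurement; only the 4x-8x range comes from the paragraph above.

```python
def lstat_calls(file_requests: int, lstats_per_request: int) -> int:
    """Total lstat calls generated by a given number of file requests."""
    return file_requests * lstats_per_request


requests_per_day = 10_000_000  # hypothetical figure, purely illustrative
low = lstat_calls(requests_per_day, 4)
high = lstat_calls(requests_per_day, 8)
print(f"{low:,} to {high:,} lstat calls per day")
```

On local disk those calls are mostly answered from the kernel's cache. Over NFS, each one can turn into a round trip to the storage node, which is where the pain comes from.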
Even with ALL that.. investing over $100k in labor and new hardware, losing some nice customers, and slogging through a painful 3 month transition, we still grew. Every month since we started has been our best month to date and it shows no signs of letting up.
While we were at it, we hired 4 employees in 2011, more than doubled our customer base, and Page.ly got a new site design as well.
Getting race ready
I enjoyed racing cars back when it seemed I had free time. You always put in a couple warm up laps to heat the tires before rolling on the throttle. The promise of page.ly3 is nearly realized but we still have some tuning to do.
We are investigating alternate storage systems and are looking at everything from gluster to netapp to GFS. As Joe said, “Disk is always the bottleneck.” We are constantly tuning and tweaking the Varnish nodes to maximize hitrate, and delving deep into WordPress itself to tune it for our system. At this point we have essentially forked WordPress. We maintain a set of patches we apply to new versions and test before rolling out to the customers.
We are also looking at alternate caching systems around the database and object cache to make full use of memcache. APC is also running on a very narrow fileset that could be expanded. So we still have a bit of work to do, but we are making progress in days now instead of months.
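For reference, hitrate here is just hits divided by total cache lookups. A minimal sketch, assuming you have pulled hit and miss counters from the cache (varnishstat reports counters along these lines):

```python
def hit_rate(cache_hits: int, cache_misses: int) -> float:
    """Fraction of cache lookups that were served from cache."""
    lookups = cache_hits + cache_misses
    if lookups == 0:
        return 0.0  # no traffic yet, so avoid dividing by zero
    return cache_hits / lookups
```

Every point of hitrate gained is traffic that never reaches PHP, MySQL or the disks, which is why the tuning effort goes here first.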
So the servers are nice and fast now. What about page.ly itself? You know the app. We’ll give you a preview in the final part of this series.