I find it funny that I'm 28 years old and doing the exact same thing I did when I was 11. I'm on a computer at 4:30am, eyes bloodshot, yet not the least bit tired. I'm still instant messaging, only now I'm doing it on Gchat instead of AOL. Steve Case has moved on to LivingSocial and Zipcar, and I'm now an adult responsible for re-architecting and redesigning a major newspaper's web infrastructure. Life can be really crazy like that sometimes.

My name is Brian Rossi, and I'm the Web Architect for the Pittsburgh Post-Gazette. Marco Cardamone once asked me if I had any hobbies and interests. I told him, "This is my hobby; creating things is what I love." He replied that he was afraid I was going to say that. I like to make people happy with products, because digital products make me happy. I believe they make people's lives easier, more fulfilling, and ultimately better. They give autistic children more ways to communicate, and everyone else benefits from a newfound ability to connect. If a product I've created is the one people used while swerving between lanes because they looked away from the road for the length of a football field, or even the one that recaptured time in the bathroom on the iPad, then it's all worth it.
Not a lot of people get to say they are actually doing what they wanted with their life. Like truly doing what they wanted. Not the fake "Oh, I really got used to this" version that people try to pass off as not being a compromise. I feel very blessed that my passion and my career are synonymous. I get to create things, watch people use them, and make even better things based on the data.
When I was around twelve, I decided I wanted to build robots. In fact, I wanted to go to Carnegie Mellon so I'd know how to build robots. I didn't end up going to CMU, but in a strange twist of fate, I did end up building robots, learning mostly from Google. Admittedly, not the kind of robots most people think of, the ones that do cool shit like vacuuming the floor and then plugging themselves in afterwards. I'm referring to intelligent multi-part web computing systems. Systems that have conversations with one another when I'm not watching them, and that can solve problems many times over, much faster than I could dream of doing in my own head. Products that enrich my life, and hopefully others' as well.
So guess what? We're launching the new Pittsburgh Post-Gazette web site at last. It's crazy to say/type those words. It's been my baby for over a year now, and this period of my life has been an amazing one. I've had the pleasure of working alongside a lot of fascinating and brilliant people (both inside and outside the PG, as well as in academia), while focusing on a project that touches millions of users. I've learned more about newspapers and corporate America than I ever could have predicted, and I take from this experience the infinite wisdom of people who have worked in the news industry and corporations for most of their lives. To sum it up: very cool experience.
There's the obvious question of "how?" I posted an initial case study in the comments of a Poynter.org article about the beta testing and user-centered Post-Gazette redesign, so this will be the long-form version. That Poynter comment only talked about our feedback loop and design, not the engineering behind the redesign. I think this is all valuable information for anyone trying to do enterprise-level work. Most of the sites I see that cover tech such as nginx, php-fpm, and Varnish aren't really talking from an enterprise standpoint, although they could be, so this article is all about framing it in that context.
There's a lot to this story, but I've intertwined the case study with deep tech. I hope I don't lose different audiences along the way… soooooo here goes…
I struggled for a while over what sort of look and feel to propose for the new design. I interviewed nearly everyone in the newsroom and collected their ideas for what they envisioned for their piece of the site. Our director of Digital Strategy, Pat Scanlon, told me it was extremely important that I make sure they were heard (which, lucky for me, I agreed with completely). I also visited almost every news site on the planet. I spent a bunch of hours chopping and mocking until I finally got an idea of where to go with it. There were multiple sites and several people (Ed Kost and David Shribman) that I drew inspiration from, which really helped when trying to explain the vision to Chris Szekley, our Senior Marketing Designer. Chris is a monster in Photoshop. I wanted something fun, non-depressing, and very intuitive to use. I also wanted a Developer API, but more on that later.

Newspapers as an industry aren't very good at user experience, user-centered design, or anything feedback related. Thankfully, I can say the people I work with at the Post-Gazette honestly get it. They embrace change rather than running from it. I get in trouble in life sometimes because I'm too user focused and I want to make the perfect product. It's easy to get sucked down the rabbit hole trying to service every request, so the trick is to prioritize whatever will give the biggest bang for the buck. I like wow-factor stuff. It can be small, but we're talking features that make people stop, go "oh, that's cool," and then play with that functionality.
An example of this is our drag-and-drop template admin, which I thought would be a great tool for Editorial. David Esaias (the ad-hoc member of the Projects team) did an amazing job bringing the vision to life; I call him the jQuery killer for it. He also spent countless hours building template pages. The tool makes rearranging the page easy, and makes the experience feel more expensive. People like iPhones because they feel nice, not necessarily because they do the most stuff; iPhones do certain tasks really, really well. They're also adored because they're a platform, but that's getting off topic. Either way, as Tim Dunham said to me, "be in the business of releasing software, not building software."
On a side note, David Esaias did a pretty amazing job on our video page. It was one where he just took the concept and ran with it, and came back with something really cool. Allan Walton, the head of Video/Multimedia, was a great guy to work with, and very patient. We did his section last because I wanted it to be special. Actually, the new photo page will be the last one, and we have something pretty cool in the works for it.
We went with the existing design because users continually told us it was really easy to use. In user testing they appreciated the colors, because they allowed them to quickly navigate to the content most interesting to them. It still has a newsy feel to it, so I think it was a good first step given that our previous design dated from 2006–2007. Editorial appreciated the fact that I genuinely cared about what they wanted, and I valued their honesty about what they liked and didn't like.
Big Bolts Need Big Nuts: Scalable Architecture
One of the coolest parts of this project is the technology. We do have a Developer API (not yet open), and everything is built as SaaS and PaaS. It is MVC architecture, even down to the server infrastructure. I decided to use one of the later iterations of my framework/template engine that I’ve been building for almost three years to power the site. It was really cool to get to see it run at scale. To accomplish our developer API, I was lucky enough that Brandon Sherman had done a project for us while he was attending CMU that was a RESTful API wrapper. It was a pretty cool piece of engineering. We used the base of that project to create the data controller on the new site.
The site had to be extremely scalable (millions of page views) and much faster than the existing site. More specifically, I wanted our initial page load to be under 200ms. Now, there are two ways to scale a site. Let me define “successful scaling” before we move forward:
- The ability to handle large amounts of concurrent users without drastic performance drops
- The ability to support the infrastructure within the business for period X and X * Y
- Ability to scale with hardware only. Add servers, not code (config file is okay)
The easy (expensive) way to scale is to just throw hardware at it. Simply keep adding more servers horizontally, and it will eventually handle the traffic. The smart way to scale is still horizontal (a little bit vertical doesn't hurt either), plus several cache layers along the way to reduce the number of servers needed, as well as the carbon footprint.
We use 4 Layers of Cache for this new architecture, expandable to 5 if needed.
- File Cache
We use the file cache on any INSERT, UPDATE, or DELETE. This completely saves our database. Someone told me they use MS SQL instead of MySQL because MySQL can't handle the load. My response was that we don't need to handle load, because I was smart about how we use the database. Whenever a CRUD operation takes place, we pre-cache "widgets" on our section fronts (anything that may have been affected by that change), and we cache stories for 30 days when they are accessed for the first time. A 30-day cache pretty much takes them past their heaviest viewing period, unless it was something super viral like money falling out of a truck. Even then, if it wasn't in the file cache, it will be for another 30 days after its next access. On DELETEs and UPDATEs we simply invalidate the story cache and update our section-front widgets. Vendor RSS feeds are made into plugins, which have individual cache TTLs and are cached via a cron task. Almost everything is pre-cached, so the database almost never gets touched. We've turned expensive database calls into cheap disk stores. File caching is the slowest form of caching, but it works perfectly here because we have APC on the next layer.
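To make the pattern concrete, here's a minimal sketch of that read-through-with-invalidation logic. Our production engine is PHP; this is a Python illustration, and the cache directory, key names, and helper functions are all hypothetical, not our actual code:

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path("/tmp/pg-cache")   # hypothetical cache root
TTL = 30 * 24 * 3600                # 30-day story cache

def cache_path(key: str) -> Path:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    return CACHE_DIR / f"{key}.json"

def fetch_story(key: str, build):
    """Return the cached story, rebuilding (and re-caching) on a miss or expiry."""
    path = cache_path(key)
    if path.exists() and time.time() - path.stat().st_mtime < TTL:
        return json.loads(path.read_text())     # cache hit: no database touch
    story = build()                             # miss: the one expensive DB call
    path.write_text(json.dumps(story))          # cached for the next 30 days
    return story

def invalidate(key: str):
    """On UPDATE/DELETE, drop the story cache so the next read rebuilds it."""
    cache_path(key).unlink(missing_ok=True)
```

The point is that the expensive call happens once per 30-day window per story, and a write simply deletes the file so the next reader pays the rebuild cost.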
- Op-Code Cache – APC
Ahhh, good ol' APC and its 99.7% hit rate. When web applications are built in a dynamic language like PHP, the code has to be interpreted each time someone accesses a page. APC keeps the compiled opcodes in shared memory, so the server doesn't have to keep reprocessing the same code over and over, and its shared-memory user cache lets us keep the results of thousands of calculations and disk accesses in RAM, stored with a hash for quick retrieval. Since we use file caching to pre-cache all of our pages (each section box is a cache file), APC on this layer saves us all of the expensive disk I/O needed to actually build a page. As you know, rotational disks are the slowest part of a computer because they are the only part that still relies on physical movement. Go use a MacBook Air and watch how many tabs and programs you can open simultaneously; it's night and day. I personally have an Intel 320 in my MacBook, and that bitch is fast. SSDs are great for caching servers too. We could go with one of those super-fast Fusion-io drives, but for our application op-code caching solves the problem for far less money. Below are the stats from APC, showing our hit rate.
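For reference, APC is configured through a handful of `php.ini` directives. These are the real directive names, but the values below are illustrative examples, not our production settings:

```ini
; Illustrative APC settings (directive names are real; values are examples)
apc.enabled  = 1
apc.shm_size = 256M   ; shared memory for opcodes plus the user cache
apc.ttl      = 7200   ; let stale slots be reclaimed under memory pressure
apc.stat     = 1      ; re-stat files on each request; 0 is faster on stable deploys
```

Sizing `apc.shm_size` matters most: if the segment is too small, APC evicts entries and the hit rate collapses.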
- Reverse Proxy Cache – Varnish
So we're almost there. We are heroes for our small, inexpensive database, and our servers love us because we're not hammering their wussy spinning disks. We're not Charlie Sheen winning yet, but boy are we close. We need that Varnish. Varnish with some goddess. Well, maybe just Varnish. Admittedly, this was the trickiest layer for me. At first glance I thought I had to figure out a way to completely avoid cookies for non-authenticated users for the cache to be effective. That actually wasn't true; I just had to be creative with my Varnish VCL script. If Varnish sees cookies being tossed back and forth, it is going to pass on the hit every time. Passing on a hit means we didn't deliver a cached page and had to rely on the server to possibly process it, or at best fetch it from a lower layer of cache. Our goal is to deliver a cached page as often as possible, so the request stays away from our other, more expensive layers. I realized that I simply (haha) needed to add some regex to my VCL so that only certain cookies, like a login cookie, bypass the cache. I also noticed that the cache was in a race condition over the first device type to get a hit: if a mobile phone hit the website first and a desktop machine made a request in the same cache period, both would be delivered the mobile site. Obviously that's not cool. To solve the problem I added a little more VCL code to Varnish, instructing it to detect the same user agents my PHP engine does and use them as part of the hash. Remember, the hash is what Varnish uses to look up a cache entry, kind of like an ID for that specific page. By adding the user agent to the hash along with the URL, Varnish serves different pages for different user agents.
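The two tricks described above can be sketched in VCL like this (Varnish 3.x syntax; the cookie name and device regex are illustrative, not our exact production rules):

```vcl
# Sketch of the cookie filter and device-aware hash discussed above.
sub vcl_recv {
    # Classify the device the same way the PHP engine does.
    if (req.http.User-Agent ~ "(?i)iphone|android|mobile") {
        set req.http.X-Device = "mobile";
    } else {
        set req.http.X-Device = "desktop";
    }
    # Only a login cookie should force a pass; strip everything else
    # so anonymous traffic stays cacheable.
    if (req.http.Cookie !~ "pg_login") {
        unset req.http.Cookie;
    }
}

sub vcl_hash {
    # Hash on URL + host as usual, plus the device class, so mobile
    # and desktop requests never share a cache entry.
    hash_data(req.url);
    hash_data(req.http.host);
    hash_data(req.http.X-Device);
    return (hash);
}
```

Stripping the cookie in `vcl_recv` is what keeps the hit rate high; adding `X-Device` to the hash is what kills the mobile/desktop race condition.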
Varnish sits on the server between the user and the web-serving daemon (in our case nginx) to quickly serve pages so that the web daemon doesn't get touched, similar to how we use APC so that PHP doesn't need to be reinterpreted. Below are our stats from the varnishstat script. Notice the roughly 85% hit rate. Keep in mind our hit rates are so high because we're serving generally static content. As things become more dynamic, I'll have to get more creative with my cache design. It's sort of like a video game, only not.
- Browser Cache
This is the simplest one. Everyone knows this layer; when someone says "clear your cookies and cache," this is what they're talking about. It stores local copies of files that aren't changing on the user's hard drive, so they can be accessed quickly. Network round-trip time is one of the biggest costs in scalability, sort of like rotational disk access. The browser cache respects the same headers that Varnish and/or nginx supply (PHP can set the headers as well). Any static image, CSS, or JS file is set to a far-future expiry (Expires max). Using a CDN greatly helps with speed too.
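In nginx, the far-future expiry for static assets is a one-block affair. This is a generic example, not our actual config; the extension list and extra headers are illustrative:

```nginx
# Illustrative far-future caching for static assets.
location ~* \.(css|js|png|jpg|jpeg|gif|ico)$ {
    expires max;                     # sets far-future Expires and Cache-Control
    add_header Cache-Control public; # allow shared caches/CDNs to keep a copy
    access_log off;                  # optional: don't log static hits
}
```

With this in place, a returning visitor's browser never re-requests unchanged assets, which is exactly the network turn-around cost this layer exists to remove.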
- Memory Cache – database (memcached) not implemented
I won't go too far into this one, although it's one of the most important layers if you're doing a lot of dynamic stuff. When building real-time infrastructure, memcached absorbs what would otherwise be heavy database activity by serving it from RAM. It does the same thing our file cache does, only a lot faster. Remember, that speed isn't needed in an article-serving infrastructure, because our articles aren't going to change very much once published (in terms of cache). APC cleans all of that up for us, so we use our cheap disk space rather than expensive RAM. It could, however, be useful for something like an OpenID or OAuth2 server, where the same key/token needs to be verified often.
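The token-verification use case is just a read-through cache in front of the auth database. Here's a sketch of the pattern in Python; to keep it self-contained, an in-process dict stands in for the memcached client, and the class and parameter names are made up for illustration:

```python
import time

class TokenCache:
    """Read-through cache for token verification: the role memcached would
    play, sketched with an in-process dict (a real deployment would swap
    the dict for a memcached client)."""

    def __init__(self, verify, ttl=300):
        self.verify = verify    # expensive check against the auth database
        self.ttl = ttl          # how long a verdict stays valid, in seconds
        self._store = {}        # token -> (result, expiry timestamp)

    def is_valid(self, token):
        hit = self._store.get(token)
        if hit and hit[1] > time.time():
            return hit[0]                       # served from RAM, no DB hit
        result = self.verify(token)             # cache miss: hit the database
        self._store[token] = (result, time.time() + self.ttl)
        return result
```

A token verified once is then answered from memory until its TTL lapses, which is why RAM-speed caching pays off for auth servers but not for 30-day-stable articles.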
Nginx vs Apache
The final piece of this I'm going to talk about is our tech stack on the servers. I chose to ditch Apache, because it just isn't good with concurrency. I could post a graph showing it dying, but there's already a ton of them on Google. The short of it is that Apache dedicates a thread or process to each connection, so after X concurrent users hit the server (X being roughly your RAM divided by the size of an Apache worker), the machine runs out of RAM and starts swapping to disk. Once the slow disk starts getting used like RAM, say goodbye to the web server. I chose nginx because it's event driven, so its memory usage is far more linear. Apache hits a hockey-stick curve on a graph at high concurrency, whereas nginx stays pretty straight and narrow. It allows for a lot more concurrent users than Apache, and leaves you with RAM for other things.
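Because nginx doesn't embed PHP the way Apache's mod_php does, it hands PHP requests to php-fpm over FastCGI. A minimal hand-off looks like this (the port, root, and socket path are illustrative, not our production values):

```nginx
# Minimal nginx -> php-fpm hand-off.
server {
    listen 8080;            # Varnish would sit in front on port 80
    root /var/www/site;
    index index.php;

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/php-fpm.sock;
    }
}
```

This split is part of why the memory curve stays linear: nginx's event loop handles the connections, while a fixed-size pool of php-fpm workers does the interpreting.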
Yay for us, we came in under 200ms. In fact, we came in about 18% under it: our response time was 164ms on this call. Success.
And that concludes our broadcast
One final thing I'd like to say: we got the servers up one week after getting access to them. It's a shame, because I don't think most people can truly appreciate the engineering behind all of it, especially on such a short timeline. I had the pleasure of working with Darren Leonard and Sean Engel, two absolutely amazing network guys. I came to them with an aggressive tech stack, which they took in stride and banged out in a super short period of time. I can always appreciate people who go out of their comfort zone in order to achieve greatness, and I value every moment of it.
I lied; two more final things to say. Laura Schneiderman (web content producer) deserves a paragraph of her own, and so does Mary Leonard (deputy managing editor). I told David Shribman, prior to this project ending, that it would not have been the same without Laura. She constantly fought to ensure the newsroom's interests were kept in mind, and she kept me on track at a lot of points in the project. If we had run an over/under during the holiday season on the launch date, she would have won: she said March 10, and we launched March 13. Saying that it was an immense pleasure to work with her would be an understatement.
I really want to thank everyone at the Post-Gazette; I know this hasn't been easy on anyone. I've never been so impressed in my life at how a large group of people could come together and accomplish a task this big. It's the biggest project I've completed in my life to date, and I'm very proud of it. I am thankful I work with such amazing individuals. We have wonderful leadership and great owners. I'm especially grateful to Liam Durbin, our CIO, because he pushed hard for date accountability and timelines.
Additional thank yous to Chris Chamberlain (our President), Pat Scanlon, Janeen Osselborn, David Shribman, Tom O’Boyle, Tim Lakey, Bill Gibbs, Joe Cronin, Floyd Traud, Rich Medeiros (for the patience), and all of the leadership at the PG for trusting me. Big kudos to the entire operations team as well because they kept our old site up and running (which was no light task, trust me) while we built the new one.
UPDATE: I’ve been benchmarking post-gazette.com against some major newspapers, and found since we launched we’ve become one of the fastest major newspaper websites in the US based on initial response time. Our average response time is 119ms.
<3 you mommy and daddy