Today I’d like to share with you Shopzilla’s redesign of our consumer site and content delivery infrastructure.
I’ll talk to you about our company and why we started this project in the first place. I’ll tell you about what we built, which is one part software and two parts obsession with performance and measurement. Then: a deep dive into the site architecture, how we’re evolving that architecture to maintain performance while scaling our data, and front-end performance techniques. Finally, we’ll look at our performance gains and the data, which show a direct and quantifiable correlation between speed and money.
Shopzilla is one of the largest and most comprehensive online shopping networks on the web, through our leading comparison shopping sites Bizrate.com and Shopzilla.com and the Shopzilla Publisher Program. We help shoppers find the best value for virtually anything from thousands of retailers. Across our network we serve more than 100M impressions per day to 20-30M unique visitors, searching as many as 8,000 times per second across more than 105M products.
Our original comparison shopping marketplace was launched on Bizrate.com in the summer of 2000. At that time, our business was relatively simple: Bizrate.com as comparison shopping in the US. Over the next several years our business evolved into something significantly more complex. First we added a number of co-brands and syndication partners through our API. Then we introduced a new product of our own into the market, Shopzilla.com. Finally, just when all that worked, we took a software ecosystem designed to run in a single location and deployed it active-active to several datacenters across the US and Europe.
It’s a testament to a great engineering team that we were able to incrementally evolve our architecture to support our growth. We were always in a rush to market and often took the “quickest” possible route through all that change. A site codebase designed for a single brand in the US became a super-complex brand-management and content delivery system, never through a remodel, always through addition. We grew our user base 20% year-over-year. We grew our product inventory, doubling or more every year. We had to develop proprietary systems to solve scalability problems, and we had to utilize exotic hardware to scale vertically.
The previous incarnation of our site was implemented in Perl running under mod_perl on Apache 1.3, in a largely two-tier architecture. The site communicated with the search engine, with metadata storage to enrich search results, and with a database for other content and metadata. All DB access went through stored procedures to simplify queries, and almost all reference data was cached during startup.
Many pages have a lot of content, requiring many calls to the database and metadata stores to pull in that content, so the latency of a single request was poor. There was no progressive rendering, so the time to first byte and initial page display was very long. The process model and large memory consumption severely limited the number of requests served by a single server instance. We didn’t have great runtime instrumentation of request flow, so it was very difficult to understand what was going on beyond looking at SQL queries and log statements. Front-end pages also had a lot of content, requiring many requests to external resources to render a page.
We were serving 9 or more web sites with different look and feel, localization, and business logic from a single codebase. Making changes to one site carried a great risk of affecting other sites. With a high rate of change on a single codebase, teams couldn’t easily take their sites in different directions.
In 2007 we decided to rebuild our site. The first decision: do we refactor our current software or start over? We decided to start over. We had a fundamentally different business, and we needed fundamentally different software. Our design principles were pretty basic: simple is the new “clever”; performance and quality are design decisions; and you get what you measure. Shopzilla is a “scrum shop,” and this thing was too big NOT to have continuous feedback, mainly because the site had grown over 7 years and nobody really knew everything it did. We decided our scrum sprints would be 2 weeks. Most importantly, we decided that we had to have continuous feedback from our users. This turned out to be a hugely important decision. First, it gave us a huge tool to manage risk: since we decided to maintain the compatibility of the URL structure, we used a proxy by A10 Networks to serve up our new site infrastructure one page at a time. Second, it allowed us to keep up a constant drumbeat of progress for the company. Momentum was key for the company, and actual, live, production launches were key for the team. As a result, we launched our first page for our first site in December of 2007. Of course, this wasn’t just a page; it was the first version of the site framework as well. Over the first two quarters of 2008, we gradually released more pages and increased the percentage of traffic exposed to the new site until the full launch of Shopzilla on July 1st. Since the sites were supposed to be functionally identical, we were able to monitor all our usual business metrics as key indicators of any issue and course-correct along the way, KNOWING we could bail back to the old site with a simple configuration change.
With the release of Shopzilla in July, we started development of Bizrate. With the Bizrate release you’ll notice we had far fewer public releases: we were confident in our site framework, and our risk strategy shifted from proving the approach to getting Bizrate live before our holiday shopping peak. Finally, in mid-November, we shifted 100% of our US site traffic to our s2 platform. We’ll refer to this timeline again when we look at our performance gains.
I’ll dive into detail on all of these topics: simplify the web application layer; decompose the site into functionally separate, individually testable, loosely coupled services; define performance SLAs; load test before every release, where failure to meet an SLA is a defect; instrument and measure production code; cache where appropriate; and apply best-practice UI performance techniques.
Loosely coupled web services: each layer is independently developed and tested, and independently scalable, with redundancy built in. Hardware load balancers are used for every cluster.
The web application is Java 1.6 on Tomcat 6, with Spring MVC and a custom TAL templating engine. Services are JAX-RS using the Apache CXF framework. Database access is via Hibernate with Ehcache L2 caching against an Oracle 10g database. We’re also incorporating the Oracle Coherence data grid for distributed caching.
We picked 1.5 seconds full page load as an aggressive target based on the size and weight of our pages. With streaming HTTP responses, we figured an approximately 650ms server-side response time would still allow a 1.5s full page load.
The web application tier is a mashup of data from numerous sources. All network communication is via HTTP; there is no direct database access.
We utilized the Java Concurrency API to implement an asynchronous, concurrent service invocation framework. Independent services are invoked in parallel, and dependent service invocations may be chained. Future results are only consumed during rendering of the template, so there is no blocking until a result is actually required for rendering.
Some pages may request data from up to 30 sources, so this parallelism helps reduce the latency of a single request. Streamed HTTP responses ensure HTML is returned to clients as it becomes available.
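The fan-out pattern described above can be sketched with nothing but the Java Concurrency API. This is a minimal illustration, not our production framework: the service names and rendering are hypothetical, but it shows the key property that calls start in parallel and `Future.get()` only blocks when the template actually needs a result.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Parallel service fan-out: independent calls are submitted up front,
// and Future.get() only blocks when rendering needs each result.
public class ParallelInvoker {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public Future<String> invoke(final String serviceName) {
        return pool.submit(new Callable<String>() {
            public String call() {
                // a real implementation would issue an HTTP request here
                return "result-from-" + serviceName;
            }
        });
    }

    public static String renderPage() {
        ParallelInvoker invoker = new ParallelInvoker();
        try {
            // both calls start immediately, in parallel
            Future<String> search = invoker.invoke("search");
            Future<String> reviews = invoker.invoke("reviews");
            // blocking is deferred until the template consumes each value
            return "<div>" + search.get() + "</div><div>" + reviews.get() + "</div>";
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            invoker.pool.shutdown();
        }
    }
}
```

Because the futures are handed to the template rather than resolved eagerly, the slowest service call bounds the page latency instead of the sum of all calls.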
Each service defines a coarse-grained API that returns enriched data to the site; the service may consult numerous data sources to produce its results. Each service is implemented generically to work for all sites and countries. We defined a versioning strategy to permit incremental updates even with backwards-incompatible schema changes, which greatly increases parallel development. Each service defines its own SLA, dictated by all the site clients that require access to the service, and each service is tested and measured against that SLA.
Each service defines an XSD and produces a Maven jar artifact that provides a Java client API. We use Apache HttpClient with multi-threaded connection pools and stale connection checking enabled. We set connection and socket timeouts in many cases to ensure that even when a service chokes, the site may degrade gracefully by omitting the failed content and rendering the rest of the page. Service endpoints must support multiple inputs and return a list of results, so the number of service invocations is a constant instead of proportional to the number of data elements on the page. We use JAXB to unmarshal XML data to Java objects.
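The degrade-gracefully behavior is worth spelling out. In production the timeouts live at the HttpClient connection/socket level; the stdlib sketch below shows the same pattern with a bounded `Future.get()`, where the timeout value and fallback markup are illustrative, not our real settings.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Graceful degradation: if a service call exceeds its budget, render
// the page without that fragment instead of failing the whole request.
public class DegradingClient {
    private static final ExecutorService pool =
        Executors.newCachedThreadPool(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r);
                t.setDaemon(true);   // don't block JVM shutdown
                return t;
            }
        });

    public static String fetchOrFallback(Callable<String> serviceCall,
                                         long timeoutMs, String fallback) {
        Future<String> f = pool.submit(serviceCall);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);   // give up on the slow call
            return fallback;  // omit the failed content, keep the page
        } catch (Exception e) {
            return fallback;
        }
    }
}
```

The page renders with an empty slot where the slow service's content would have been, which is almost always better than an error page.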
So we’ve built an architecture that we believe to be performant and scalable. How do you go about testing this? There are a lot of moving parts: highly concurrent requests, dozens of services, resource accesses. Our strategy: each service is performance tested in isolation against its SLA, then the full stack is performance tested.
We have a pre-production environment that mimics production on a smaller scale. We replay URLs from our logs to generate sample load, and we use JMeter to generate that load against the site. Scripts emit graphs showing 95th-percentile response times and throughput at varying levels of concurrency. We designed to meet our seasonal peak load plus 50% more traffic on top of that, a significant increase for a mature web site. Each data center is built to carry 100% of traffic, but in practice traffic is split evenly across all data centers.
So what do you do if you’re not meeting performance targets and you’ve isolated the problem to a particular product? Emit as much information as you can to identify timings when crossing layer boundaries. Assign each request a unique id and write it to the access log. Write timing information to a performance log and correlate it with the unique id. Pass the unique id through HTTP headers to all downstream services to allow correlation across layers.
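The correlation mechanics are simple enough to sketch. This is an illustration, not our production code: the header name and the `ThreadLocal` carrier are assumptions; the essential idea is that one id is minted at the edge and attached to every downstream call and log line.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Per-request correlation: assign a unique id when a request enters the
// web tier, log it, and forward it as an HTTP header to every downstream
// service so timings can be joined across layers.
public class RequestCorrelation {
    public static final String HEADER = "X-Request-Id"; // hypothetical name
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<String>();

    /** Called once when a request enters the web tier. */
    public static String begin() {
        String id = UUID.randomUUID().toString();
        CURRENT.set(id);
        return id;
    }

    /** The id for the request being handled on this thread. */
    public static String currentId() {
        return CURRENT.get();
    }

    /** Headers to attach to any downstream service call on this thread. */
    public static Map<String, String> downstreamHeaders() {
        Map<String, String> headers = new HashMap<String, String>();
        headers.put(HEADER, CURRENT.get());
        return headers;
    }
}
```

With every layer echoing the id into its own performance log, a single slow page can be reconstructed end to end by grepping for one UUID.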
We use the YourKit Java Profiler to identify poorly performing requests and concurrency bottlenecks. Some lessons learned: turn console logging off and log to files instead; reduce logging output (ERROR level for us); use StAX instead of Xerces for XML parsing, since the latter has numerous synchronization blocks that cause it to choke under high concurrency; retest service layers to ensure they continue to meet their SLAs; and ensure that sequential service dependencies do not cause high-latency requests, sometimes re-implementing endpoints to provide richer content.
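For reference, the StAX cursor API (`javax.xml.stream`, standard in Java 6) looks like the minimal sketch below: each thread pulls events from its own lightweight reader, with no shared parser state to synchronize on.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// StAX pull-parsing sketch: the caller drives the parse by pulling
// events, and each thread creates its own stream reader.
public class StaxExample {
    public static int countElements(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++; // one per opening tag
                }
            }
            reader.close();
            return count;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```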
We use JMX beans to emit performance information. Every service call is annotated to emit a moving average response time and a moving 95th-percentile response time. We graph service calls over time using Graphite. A small percentage of requests in production emit logging information describing service calls, so we can produce waterfall graphs of server-side service invocations.
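A rough sketch of the per-service stats object: an exponentially weighted moving average plus a 95th percentile over a sliding window. The decay factor, window size, and class name are illustrative assumptions; in production an object like this would be registered as a standard MBean (a public `ServiceStatsMBean` interface on the platform MBeanServer) so the values show up in JMX consoles and feed Graphite.

```java
import java.util.Arrays;

// Moving response-time stats for one service. In production this would
// be exposed via a JMX MBean interface; here it is a plain class.
public class ServiceStats {
    private static final double ALPHA = 0.1;     // EWMA decay factor (illustrative)
    private final long[] window = new long[100]; // sliding window for the p95
    private int count = 0;
    private double average = 0.0;

    public synchronized void record(long millis) {
        // exponentially weighted moving average of response times
        average = (count == 0) ? millis : ALPHA * millis + (1 - ALPHA) * average;
        window[count % window.length] = millis;
        count++;
    }

    public synchronized double getAverageResponseMillis() {
        return average;
    }

    public synchronized long get95thPercentileMillis() {
        int n = Math.min(count, window.length);
        if (n == 0) return 0;
        long[] sorted = Arrays.copyOf(window, n);
        Arrays.sort(sorted);
        return sorted[(int) Math.floor(0.95 * (n - 1))];
    }
}
```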
We began by leveraging Ehcache as an L2 cache with Hibernate. Advantage: easy to get started. Disadvantages: each service instance caches the exact same data, so it does not scale well to large data sets; and query cache keys are essentially based on query inputs, which is not useful when there is a sparse mapping from inputs to data. We have some extremely large data sets that we wanted to cache, so we evaluated and subsequently implemented Oracle Coherence on a number of projects.
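For context, the Hibernate-plus-Ehcache wiring from this era looks roughly like the fragment below. The provider class name is the Ehcache 1.x-era one, and the entity name and cache sizes are illustrative, not our production settings; check the versions you are running.

```xml
<!-- hibernate.cfg.xml: enable the second-level and query caches -->
<property name="hibernate.cache.use_second_level_cache">true</property>
<property name="hibernate.cache.use_query_cache">true</property>
<property name="hibernate.cache.provider_class">net.sf.ehcache.hibernate.EhCacheProvider</property>

<!-- ehcache.xml: a per-entity cache region; sizes are illustrative -->
<cache name="com.example.Product"
       maxElementsInMemory="10000"
       timeToLiveSeconds="3600"
       eternal="false"
       overflowToDisk="false"/>
```

Every application instance holds its own copy of this cache, which is exactly the duplication problem that pushed us toward a distributed grid.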
Oracle Coherence is an in-memory distributed data grid solution for clustered applications and application servers. It automatically and dynamically partitions data in memory across multiple servers, provides continuous data availability even in the event of a server failure, and can perform in-memory grid computations and parallel transaction and event processing.
We wanted to build an automated system that would let us create optimal URLs on our web pages to improve search engine optimization; any link rendered on our site might be subject to a rule. We needed a backoffice system to crunch large volumes of data to produce the rule set, and we needed a solution that would let us publish an evolving rule set to dozens of clients querying it at a high rate.
We worked with Oracle consultants to develop a data model and an architecture for both parts of the solution. We built the backoffice system on a small grid to compute the rule set, configured with write-behind to persist the data to a database. We publish new rules to the site grid as they become available. Dozens of stateless client processes access the distributed site grid simultaneously, hitting the cache thousands or perhaps tens of thousands of times per second. We thought near-caching might be necessary; it turns out remote access is just fine.
The ability to specifically and directly manage our URLs has created significant opportunity for our business; this is one of our most successful SEO projects to date.
We maintain a repository of keywords that we bid on with the major search engines. It’s a map whose keys are simply unique IDs and whose values are rich objects describing the experience we’d like to direct the user to. Every paid ad click requires us to consult the map, and the data has grown to more than 600 million entries. We started with a single in-memory cache on a single server. When we outgrew that, we devised our own partitioning scheme and fronted it with a (Java) service to provide seamless access to the distributed data set.
Our data was copied to our remote datacenters by scheduled jobs. It took a long time to publish new data sets and was prone to failure, so we ultimately reduced publishing to once per week. Making the data live required restarts of multiple cache processes, and the simple act of restarting something can often end in tears.
Coherence allowed us to scale our data beyond a single physical server using a distributed cache that automatically partitions our data. We implemented read-through to transparently cache new data, and we configured the eviction policy to keep enough data to satisfy all the unique requests over a 90-day period. We have no batch processes shipping gigabytes of data, no delay in publishing new data, and the system is always on.
Faster turnaround time for new paid placements. No restarts lowers the risk of errors resulting in bad user experiences, and there are fewer alarms. Performance is consistent even when data is changing, and we have much less software to maintain: no batch jobs, no partitioning logic.
We operate one site grid per data center; the site grid has multiple functionally separate caches. We’re currently upgrading it to double its capacity: 6 physical instances with twin-socket, dual-core low-power CPUs and 32 GB RAM; currently based on Coherence 3.4, upgrading to Coherence 3.5; 16 JVM nodes with a 1.5 GB heap each (up from 8 JVMs on our previous grid); and multiple distributed cache configurations. There are 40+ client nodes connecting to each site grid.
We develop against isolated instances of the grid, and we performance test against a pre-production grid configured identically to production but on a smaller scale. We’ve tested scenarios like killing an entire server and watching Coherence repartition backup data dynamically, ensuring no performance degradation.
Steve Souders indicated that only 10% of end-user response time is spent on the server side; Netflix has documented 20% at most. It takes a lot of effort to get to that ratio, but front-end performance enhancements can give great improvements. I’m going to talk about some of the YSlow techniques that we’ve implemented.
We’ve only recently begun to tackle this in earnest. We took all our static images and sprited them to minimize the number of requests; if you need help with that, visit http://spriteme.org/. The next step will be to reduce CSS and JavaScript resources; there are tradeoffs between deferred loading and progressive enhancement.
In his 1996 essay “It’s the Latency, Stupid,” Stuart Cheshire noted that since the speed of light in fibre is about 66% of the speed of light in a vacuum, the round-trip time from California to Boston can never be less than 43ms, and that doesn’t count equipment latency or packet loss. Move your content closer to your end users. Every resource except dynamic HTML pages is served from a CDN, and we offload hundreds of GB of data transfer to our CDN.
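Cheshire’s 43ms is easy to reproduce as back-of-the-envelope arithmetic. The one-way distance used below (roughly 4,300 km for California to Boston) is an approximation I’m supplying, not a figure from the talk:

```java
// Lower bound on round-trip time over fibre: light travels at ~66% of
// its vacuum speed in glass, so distance alone sets a hard floor.
public class FibreLatency {
    private static final double C_KM_PER_S = 299792.458; // speed of light in vacuum

    public static double minRoundTripMillis(double oneWayKm) {
        double kmPerSecInFibre = 0.66 * C_KM_PER_S;
        return (2 * oneWayKm / kmPerSecInFibre) * 1000.0;
    }
}
```

Plugging in 4,300 km gives about 43ms, matching Cheshire’s number; no amount of server tuning can beat physics, which is the argument for moving content closer to users.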
Our origin server sets expiry headers; we’ve recently been working to push those out to far-future expirations. Resource requests include a unique id that lets us effectively expire content during releases, avoiding browsers caching stale CSS or JS. Everything is compressed: our hardware load balancers gzip non-CDN content for us, and CSS and JavaScript are minified.
Yahoo recommends 2-4 hostname lookups per page. We use 3 different hostnames (for JS+CSS, static images, and dynamic images) plus 1 for the base page. Unfortunately, 3rd-party advertisements add in a bunch more.
We instituted a rule that every rendered link must result in zero redirects on navigation or resource downloads. 3rd-party advertisements do their own thing.
A simple HTTP request might be 500 bytes, but our site used a cookie to store around 1.5KB of session data. On a page with 60 external resources, we’d be sending 90KB of data upstream for no reason. Removing it improved our top-line revenue by 0.8%.
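The arithmetic behind that 90KB is worth making explicit: every request to a hostname the cookie is scoped to carries the full cookie upstream, so the waste scales with resource count.

```java
// Upstream cookie cost per page view: the cookie rides along on every
// request to a hostname it is scoped to.
public class CookieOverhead {
    public static int upstreamBytes(int cookieBytes, int resourceCount) {
        return cookieBytes * resourceCount;
    }
}
```

With a 1.5KB cookie (1,536 bytes) and 60 resources, that is 92,160 bytes, about 90KB of upload per page view, which is especially painful on asymmetric consumer connections with slow upstream links.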
Our CDN origin server allows server-side dynamic resizing of images, and we ensure our <img> elements always specify the exact width and height of the requested image.
We accidentally discovered we had a multi-layered favicon ICO. We removed the extra layers, and it dropped from over 2KB to 318 bytes.
Flushing the buffer early allows HTML to be shipped to the end user sooner We haven't experimented much with this, but the default Tomcat configuration maintains an 8Kb (uncompressed) buffer, so we do get auto-flushing at 8Kb boundaries We're looking into flushing more explicitly, such as post-header, post major section on the page.
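The effect of that 8KB buffer is easy to demonstrate in miniature. In the sketch below a `StringWriter` stands in for the client socket; the buffer size matches Tomcat’s default, and the markup is placeholder content:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

// Why explicit flushing matters: with an 8 KB output buffer, nothing
// reaches the client until the buffer fills or someone calls flush().
public class EarlyFlush {
    public static String[] beforeAndAfterFlush() {
        try {
            StringWriter socket = new StringWriter();           // stands in for the client
            BufferedWriter out = new BufferedWriter(socket, 8192);
            out.write("<html><head>...</head>");                // well under 8 KB
            String before = socket.toString();                  // still empty: buffered
            out.flush();                                        // ship the <head> now
            String after = socket.toString();
            return new String[] { before, after };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Flushing right after the `<head>` lets the browser start fetching CSS and JS while the server is still computing the rest of the page.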
We’re currently implementing Keynote to continuously monitor our performance from outside the firewall. On a periodic basis it visits a variety of URLs on our site and measures full page load time from different geographic areas, on both backbone (T1) connections and real-world connections such as DSL and cable. It provides detailed waterfall graphs, screenshots in the event of errors, and alarms.
So, what was our actual user-experience page load time before and after? (Can you spot it?) We use Webmetrics as the external monitor. These performance times do not measure banner load time; our page-load SLAs don’t include banners, since they load via iframes and our content is there for the user even while the banners are still loading.
Did we make any money?
Site conversion increased by 8-12% (conversion slide).
Actually getting people to our site is obviously another big aspect of our financial performance (SEM session slide). We can look at our Google SEM sessions as a way to visualize the relationship between performance and abandonment.
With our launch of s2 on biz.co.uk, we also learned how significant performance is to the SEM relevance algorithm (Bizrate UK slide). We launched on 5/18, and by 5/29 Google had figured out that our site was fast again. This isn’t organic improvement, obviously; we believe we had been in a penalty box.
In addition to conversion rates and sessions, we saw a number of other benefits (summary slide). Page views are up by about 25%. There were also a number of behind-the-scenes improvements: 50% less infrastructure, significantly better availability, and, while keeping the site up, the ability to change the product at more than twice our previous pace.
Is performance worth it? - YES
Simplicity, quality, and performance are design decisions. We get what we measure.