Lessons from Google’s internal SRE methods for cloud efficiencies
As a new generation of corporations pursues the efficiencies of cloud computing, it faces a new challenge: running a business in a brand-new environment without the benefit of tried-and-true methods.
“The industry has done a really fabulous job of telling people how to get to cloud, but we’re awful about telling them how to live there,” said Dave Rensin (pictured), director of customer reliability engineering and network capacity at Google Cloud.
Rensin spoke with John Furrier (@furrier) and Jeff Frick (@JeffFrick), co-hosts of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the recently concluded Google Cloud Next event in San Francisco. They discussed Google’s site reliability engineering practice and how the concept is being turned outwards to help businesses operate successfully in the cloud. (* Disclosure below.)
Parsing work for machines and human judgment
In 2004, Google LLC had just gone public, and internal calculations showed that within 10 years the company would need a million systems operators just for its popular search function. In its unorthodox way, Google reimagined its production systems by applying software engineering skills to operations problems and named the method Site Reliability Engineering, or SRE.
“The basic philosophy is simple, give to the machines all the things machines can do, and keep for the humans all the things that require human judgment. That’s how we get to a place where like, 2,500 SREs run all of Google,” Rensin said.
A primary principle of SRE is to forget about aiming for perfection. “Any system involving people is going to have errors. So any goal you have that assumes perfection, 100 percent uptime, 100 percent customer satisfaction, zero error, that kind of thing, is a lie,” Rensin said. He went on to explain that there is a “magic line,” known as the service level objective, or SLO, that marks the boundary between satisfied and unsatisfied customers. Operate below the SLO line and customers are angry; operate above it and resources are wasted on incremental improvements that customers don’t notice.
“The difference between perfection, 100 percent, and the line you need [the SLO], which is very business-specific, we say treat as a budget,” Rensin said. This “error budget” represents time and money that can be spent on innovation.
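The budget Rensin describes comes down to simple arithmetic. The allowed unreliability is 1 minus the SLO, applied over a measurement window. The Python sketch below is a minimal illustration that assumes a time-based availability SLO. The 99.9 percent target, the 30-day window and the function names are hypothetical examples chosen for this illustration. They are not Google's tooling or numbers.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime allowed in `window`: the gap between 100 percent and the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # timedelta * float is supported and rounds to the nearest microsecond.
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Unspent budget; a negative result means the service fell below its SLO."""
    return error_budget(slo, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)  # hypothetical 30-day measurement window
    slo = 0.999                  # hypothetical 99.9 percent availability target
    print("budget:", error_budget(slo, window))
    print("left after a 10-minute outage:",
          budget_remaining(slo, window, timedelta(minutes=10)))
```

With a 99.9 percent SLO over 30 days, the budget is 43 minutes 12 seconds of downtime (43,200 minutes × 0.001). After a 10-minute outage, 33 minutes 12 seconds would remain to spend on riskier launches before customers cross the “magic line.”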
As director of customer reliability engineering, Rensin takes Google’s internal SRE methodology and turns it outwards to work with businesses of all sizes. Google has published a book on SRE, with an accompanying workbook to help guide companies through implementing SRE in their own operations.
“Our goal is that every firm from five to 50,000 can follow these principles. And they can. We know they can do it, and it’s not as hard as they think,” Rensin concluded.
(* Disclosure: Google Cloud sponsored this segment of theCUBE. Neither Google nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)