Cloud Infrastructure Planning @ Google…

Today, three guys from the “Operation Decision Support” group at Google came to campus to recruit interns and full-time employees… surprise, surprise!

It seems they were very happy with one of their summer interns, who came from UC Berkeley, so they came back to introduce their department and describe their challenges and working methods.

They started by presenting the three business areas of Google:

  • The 100-billion-dollar business: Ads, Search, Access (G-Fiber), Public Cloud
  • The 10-billion-dollar business: YouTube, Nest, Play, Android, Chrome
  • The “bold bets”: Life Sciences, Self-driving cars, Energy, Robotics, SpaceX

Pas mal (“not bad”)… as they say in French!

The Operation Decision Support (ODS) Group is a kind of internal “consulting group” at Google, specialized in capacity planning and in analyzing the impact on costs and pricing of the solutions and the resources involved. This is particularly important because they have to invoice external customers and… charge back the internal ones… sounds familiar? 😉

Google has 12 data centers around the world, 6 of which are in the United States and 4 in Europe. Google spends 7 to 8 billion dollars on infrastructure per year… big, big money!

ODS has 50 people, 20 of whom are PhDs specialized in Statistics and Operations Research. They also have experts in modeling and supply chain. The group is becoming very important to the company; they plan to recruit between 15 and 50 new employees this year.

They presented a couple of interesting problems to illustrate the work of the group on capacity planning, resource utilization and costs.

The first one: “Increasing the utilization of the infrastructure (CPU, memory, disk space…) through oversubscription” (internal and external customers)

Google has Tier1 and Tier2 customers with different SLAs. They normally subscribe for a specific capacity that is not fully used all the time. The problem to be solved is how to “oversubscribe”, that is, how to sell that same capacity to more customers (internal or external).

There are three ways of approaching this problem:

  • Easier: Resell surplus in Tier1 and Tier2 (which on average use around 25% of the contracted capacity) with no SLA for the overcapacity sold.
  • Harder: Resell surplus in Tier1 as Tier2 with SLA
  • Hardest: Oversubscribe Tier1 with no change to its SLA.

In the first case, utilization changes with the time zone and has peaks and valleys, but since there is no SLA: no guarantee, no problem.

Perhaps some “guarantee” could still be provided by statistical extrapolation methods. For instance, for batch processing, it could be guaranteed that the batch is executed within the next 24 hours.

In the second case, it is necessary to collect detailed utilization data to estimate growth and safety margins (“safety stock”) in order to guarantee the SLA.

In the third case, a more sophisticated analysis is needed, based on the time series of every task run in the Tier1 environment. Workloads per task, in general, do not peak simultaneously, which allows for a predictable “surplus” to be sold as long as some safety stock is held back.
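
To make the “safety stock” idea a bit more concrete, here is a minimal Python sketch. It is not what ODS actually does (they did not show their method); it assumes that usage is roughly normally distributed and that the reserve is sized as mean plus z times the standard deviation, which is the textbook safety-stock rule. The function name `resellable_capacity`, the 99% service level and the usage numbers are all made up for illustration.

```python
from statistics import NormalDist, mean, stdev


def resellable_capacity(contracted, usage_samples, service_level=0.99):
    """Capacity left to resell after reserving mean usage plus a safety stock.

    Assumption (not from the talk): usage is approximately normal, so the
    reserve is mean + z * stdev for the chosen one-sided service level.
    """
    mu = mean(usage_samples)
    sigma = stdev(usage_samples)              # sample standard deviation
    z = NormalDist().inv_cdf(service_level)   # one-sided normal quantile
    safety_stock = z * sigma
    reserve = mu + safety_stock
    return max(0.0, contracted - reserve)


# Hypothetical usage samples, in the same unit as the contracted capacity.
usage = [22, 25, 27, 24, 30, 26, 23, 28]
surplus = resellable_capacity(100, usage)
print(f"Resellable capacity at 99%: {surplus:.1f} units out of 100")
```

A stricter service level means a larger z, a bigger reserve and less capacity to resell, which is exactly the trade-off between utilization and SLA risk described above.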
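
The “workloads do not peak simultaneously” argument can be illustrated with a toy calculation. The sketch below assumes (my assumption, not the speakers') that capacity is provisioned for the sum of the individual task peaks, and compares it with the peak of the combined load plus a safety margin. The task names, the numbers and the 10% margin are invented.

```python
def multiplexing_surplus(task_series, safety_margin=0.10):
    """Compare the sum of per-task peaks with the peak of the aggregate load.

    task_series maps a task name to its usage per time slot (equal lengths).
    Returns (sum_of_peaks, peak_of_sum, surplus), where surplus is what is
    left after reserving the aggregate peak plus the safety margin.
    """
    series = list(task_series.values())
    sum_of_peaks = sum(max(s) for s in series)
    aggregate = [sum(slot) for slot in zip(*series)]  # total load per slot
    peak_of_sum = max(aggregate)
    reserved = peak_of_sum * (1 + safety_margin)
    surplus = max(0.0, sum_of_peaks - reserved)
    return sum_of_peaks, peak_of_sum, surplus


# Hypothetical tasks whose peaks fall in different time slots.
tasks = {
    "task_a": [10, 40, 15, 10],
    "task_b": [35, 10, 10, 20],
    "task_c": [5, 10, 30, 10],
}
sum_of_peaks, peak_of_sum, surplus = multiplexing_surplus(tasks)
print(sum_of_peaks, peak_of_sum, round(surplus, 1))
```

With these invented numbers the individual peaks add up to 105, but the combined load never exceeds 60, so even after a 10% margin a sizeable slice could in principle be sold. Real workloads are noisier and correlated, which is presumably why the speakers called this the hardest case.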

It looks easy, but Thomas Olavson, the director of ODS, says that it is not that obvious. So how do you make this approach acceptable, given that the final decision is in the hands of the implementing department (engineering, production, etc.) or the executive team? Here is the method:

  1. Partner with the engineers: fully understand the issue, work together, pilot before rollout
  2. Build credibility and trust over time
  3. Overcome “taboos”
    1. Clear SLA
    2. Explore Tier2 with statistically based SLAs
    3. Demonstrate economic impact
    4. Pilot, pilot and pilot.

The second problem was related to the deployment of G-Fiber. Google Fiber is Google's fiber-to-the-premises service in the US, providing broadband Internet and television to a small and slowly increasing number of locations. The service was first introduced in Kansas City (Kansas and Missouri), followed by expansion to 20 other Kansas City area suburbs within 3 years. Initially proposed as an experimental project, Google Fiber was announced as a viable business model on December 12, 2012.

Google is, at the end of the day, a Content Service Provider and wants to deliver high-quality content at optimal speed to increase user satisfaction. One solution would be for the Connectivity Service Providers to plug their “pipes” directly into the Google data centers, which is unrealistic since the data centers are normally in remote places. Therefore the solution is to bring Google infrastructure close to the users, and this is exactly what Google is doing with the G-Fiber service.

Answering the question of where and when to build which infrastructure is a tough optimization problem… Even for a not very complicated deployment, the model would have some 30,000 variables and more than 30,000 constraints…

Brian Eck, now a senior consultant at ODS, is a former IBM employee who has been working at Google for the last two years (he jokes that they have been like “dog years”, since he feels he has been at Google for 14 years!!). He is a specialist in logistics who was confronted with the same problem in manufacturing at IBM, and he concluded that the optimization approach was not the way to go…

Instead, he and his colleagues have developed a “Scenario Analysis Tool” for a reduced number of locations, which translates the alternative deployment roadmaps into a five-year cost/cash model. The inputs for the model are the demand and the topology provided by the engineering team, the equipment footprint (calculated) and the unit cost of all the cost components, also provided by the engineers. The result is a cost model with the total cash flow over 5 years.

They call the model a “Big Special Purpose Calculator”. It is also very useful for studying “what if” scenarios and can be generalized to other kinds of problems at Google (some “super users” are doing it already).

One of the decisions to be made in a deployment of this type is whether, given the cost of the workforce including travel, it is better to install overcapacity at the locations now and come back in one or two years to upgrade, or to set up a local team that visits periodically and upgrades as necessary.

Applying the model to a specific deployment case, the latter option saved 10 million dollars…
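
To give a feeling for what such a “special purpose calculator” does, here is a tiny Python sketch of a scenario comparison. It is only a guess at the general shape: the `Scenario` class, the cost categories and every number are hypothetical, and the real tool obviously handles far more cost components. It simply adds up yearly equipment and workforce costs over five years, with no discounting, and picks the cheaper roadmap.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """One deployment roadmap with hypothetical yearly costs (in M$)."""
    name: str
    equipment: list  # equipment spend per year, 5 entries
    labor: list      # workforce + travel spend per year, 5 entries

    def yearly_cash(self):
        return [e + l for e, l in zip(self.equipment, self.labor)]

    def total_cash(self):
        # Plain undiscounted sum over the five years.
        return sum(self.yearly_cash())


def cheapest(scenarios):
    """Return the scenario with the lowest total five-year cash outflow."""
    return min(scenarios, key=lambda s: s.total_cash())


# Invented numbers: overbuild now and return once, versus a local team
# that upgrades a little every year.
overbuild = Scenario("overbuild now", [50, 0, 20, 0, 0], [8, 1, 6, 1, 1])
local_team = Scenario("local team", [30, 8, 8, 8, 8], [5, 3, 3, 3, 3])

best = cheapest([overbuild, local_team])
saving = abs(overbuild.total_cash() - local_team.total_cash())
print(f"{best.name} is cheaper by {saving} M$ over 5 years")
```

Which option wins depends entirely on the inputs; change the travel or equipment figures and the answer flips, which is the whole point of running “what if” scenarios instead of a giant optimization.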

The model was first implemented as a spreadsheet; it contained 60 worksheets with some 300 lines each and very complex formulas, but it allowed the fine-tuning of the model. Once that was done, it was reimplemented using the R statistical package.

The critical success factors are not very different from the ones mentioned in the previous case, but here there are some additional ones:

  • Strike the right level of detail: “what to include, what to omit”
  • Standardize data: power, colo contracts, workforce, etc.

Once more, a very interesting talk… it is amazing what is going on in the Bay Area…

Students were queuing to hand in their CVs or get a contact point… I wish I could… 😉

Stay tuned for more…..






