UPDATED 20:24 EDT / OCTOBER 21 2009

Exclusive: Mike Schroepfer on Cloud Collision and Scaling Facebook [Interview]

Mike Schroepfer spoke today at the Web 2.0 Summit and dropped the now-headlined factoid that the site's 300 million users spend a collective 8 billion minutes on the site each day.

Other key stats he highlighted in his keynote:

– 1.2 million photos are served on Facebook each second.

– 15,000 sites have integrated Facebook Connect.

– Facebook is accessed via its API 5 billion times daily.

– There are 1.2 million users per Facebook engineer.

I sat down last week with Mike and talked with him about his vision for the engineering department at Facebook, the Haystack photo delivery system, his thoughts on cloud computing in general, and many of the topics he touched on briefly during his speech today.  The full transcript of the conversation is after the jump.

Mike Schroepfer, Facebook’s Vice President of Engineering

Mike Schroepfer is the key executive working directly with Mark Zuckerberg to extend Facebook's leadership on the engineering side of its massive market growth. Mike was previously the VP of Engineering at Mozilla, where he headed up the engineering team responsible for Firefox. Before that he was the CTO of Sun Microsystems' data center automation division.

John Furrier:  Tell me about your vision for your engineering organization at Facebook.

Mike Schroepfer: Facebook's vision for engineering is to handle its rapid growth, huge user base, and scale.  Facebook is a consumer site and it is growing rapidly.  Our core values are about moving fast – the ability to innovate and iterate as fast as possible.  You never know whether what you've got is the right product, so it's really important to get it out in the market and test it with people to see what they think.  It's not only about what they say, but also how they use it.  The question is: do they use it, do they enjoy it, do they come back to the site, and do they engage with it?

On the product side the number one thing we focus on as an engineering organization is how to build the ability to sustainably innovate and iterate so we can continue to roll out the new features, technology, and new functions.

On the infrastructure side there are two major challenges.  The first challenge is simply keeping the site running while handling the current growth: just over a year ago the user base was greater than 100 million 30-day active users, and today it's well north of 250 million 30-day active users – nearly a tripling of the active user base.  This kind of growth puts a strain on every part of the system.  You never know how things are going to scale until you double or triple the load and see what breaks.  The first job is just to keep things running, and the team has done an amazing job of doing that.  I haven't seen anything scale this quickly without major issues.

The second big challenge is making the infrastructure cost effective. In the long run, quick innovation and iteration depend on building on the right infrastructure.  If we had built this on traditional languages and tools (like assembly language), things would have moved very slowly.  The layering of abstractions – scripting languages, prebuilt web toolkits, and so on – allows people to put things together quickly.  Facebook thinks about it this way: if I want to roll out a new feature to 250 million users tomorrow, there are a lot of scalability concerns to worry about, and if I'm building that entire infrastructure from scratch for my one feature, it might take me a year to get right.  Conversely, if I can rely on a proven, scaled-out architecture, I can implement the new feature, deploy it to the entire datacenter infrastructure, and have it available and reliable for 250 million users tomorrow.  Then you can move much faster.  We've had college interns, new college grads, and others pushing features to the live site within weeks of joining the company.  That is the focus on the infrastructure side.

John Furrier:  Operating at scale is a major differentiator, and others like Google are talking about it as a key benefit of having such a large infrastructure in terms of iterating and innovating.  The question is: how does a young and growing company like Facebook leverage its unique infrastructure and growing scale?

Mike Schroepfer:  Facebook looks at this as a big, deep systems problem. If you look at the backgrounds of many of the engineers at Facebook, they are deep computer scientists in a variety of disciplines related to systems – operating systems, networking, storage – ranging from people right out of college to 10-20 year industry veterans from companies that have seen large-scale systems, like NetApp, VMware, Google, and others.  We have seen big-scale systems before and have learned lessons in the past.  You don't want to be implementing your first file system on your first day at huge scale – there is just a lot to learn.  So we have the deep technologists that you need to do this, combined with the boldness to try new things.  Our goal is to move fast, and the other half of our mission is to break stuff.  We have a willingness to try things out and see what happens, even on the systems side where engineering is difficult: a real willingness to build a new piece of hardware, or a brand new piece of software like a new storage system, throw it into production, and see what happens.  Test it out with real load and see if your environment scales.  That willingness to build things, try them by fire, and not assume that everyone else has figured out the problem is critical, and it's part of the attitude of our engineering organization.  For example, look at the Haystack storage system, which is how we store all of our photos: it sees a tremendous amount of traffic and usage, with over 1 billion uploads a month.  People don't realize it, but photos are critical to Facebook.  It's not just about storing the photos, sticking them in a mountain of disk, and maybe retrieving them in a week.  We also serve a tremendous volume of photos to users.  Every time you load the home page you see lots of photos.  Page views on photos.php are very high, and people are clicking on photos all the time.
Haystack is an integrated storage and serving system – 2U boxes that store the photos and serve them back directly to users at a high rate.  It's high-density storage with high throughput in terms of serving rates.

John Furrier: Talk about the Haystack project and share with us what it is and where it is going.  Are you using commodity gear and open source code?

Mike Schroepfer:  Haystack started with three people, and it's in production now.  Every time you upload a new photo, you're storing that photo on one of these new Haystack boxes.  It's an interesting case study in how this got deployed.  It's commodity hardware – commodity drives to get really high density, Linux on everything – and Facebook has built a software layer on top of this hardware to manage the storage of the photos.  Facebook has, in essence, built a domain-specific storage system (one of several that we have built, and we are working on others).  It's optimized for Facebook's specific photos use case.  For example, we store most photos in four different sizes, because we are also serving from the box: you have the tiny version in the profile, big versions if you're clicking through, and two other sizes.  Rather than rescaling on the fly, we resize the photo into four different sizes as it is being uploaded and store it in one file, with a storage system that understands the photo is available in four sizes.  The system can then natively index the photos in all four sizes. That whole system is built together and put into production.
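The idea of storing all four renditions in one file with a native index can be sketched as a toy "needle file": each blob is appended to a single data file, and an in-memory index maps (photo id, size) to an offset and length, so serving a photo is one seek and one read. This is an illustration of the concept only, not the real Haystack on-disk format; all names here are hypothetical.

```python
import io

# The four pre-scaled renditions mentioned in the interview (names assumed).
SIZES = ("thumb", "small", "medium", "large")

class NeedleStore:
    """Toy single-file photo store with an in-memory offset index."""

    def __init__(self):
        self.data = io.BytesIO()   # stands in for the box's big data file
        self.index = {}            # (photo_id, size) -> (offset, length)

    def put(self, photo_id, scaled):
        """scaled: dict mapping each size name to that rendition's bytes."""
        for size in SIZES:
            blob = scaled[size]
            offset = self.data.seek(0, io.SEEK_END)  # append at end of file
            self.data.write(blob)
            self.index[(photo_id, size)] = (offset, len(blob))

    def get(self, photo_id, size):
        """One seek + one read: no filesystem metadata lookups per photo."""
        offset, length = self.index[(photo_id, size)]
        self.data.seek(offset)
        return self.data.read(length)
```

The point of the design is that the index lives in memory, so a read never pays for directory or inode traversal – which matters when you serve photos at the rates described above.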

Storing photos is a big deal.  There are many things that can go wrong, and users value their photos.  Privacy and preventing data loss are core to getting the product right.  Facebook ran this in production under a double write to catch possible errors, validating and iterating to make sure it worked.  Once we had that validation, we put it fully into production, turned off the old system, and ran the new one.
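The double-write rollout described above is a common migration pattern, and can be sketched as follows: every write goes to both the old and new stores, reads are still served from the proven old store, and a validation pass compares the two before cutover. The store interfaces here are hypothetical stand-ins.

```python
class DoubleWriteMigration:
    """Sketch of a double-write cutover: old store stays the source of truth
    while the new store is shadow-written and validated against it."""

    def __init__(self, old_store, new_store):
        self.old = old_store   # proven system, serves all reads
        self.new = new_store   # system under test, receives shadow writes

    def write(self, key, value):
        self.old[key] = value
        self.new[key] = value  # shadow write

    def read(self, key):
        return self.old[key]   # users only ever see the old system

    def validate(self):
        """Return keys where the new store disagrees with the old one."""
        return [k for k, v in self.old.items() if self.new.get(k) != v]
```

Once `validate()` comes back clean over real production traffic, reads can be flipped to the new store and the old one retired – which is the sequence Schroepfer describes.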

John Furrier:  The Haystack photo system is a great example of how new development processes can be achieved. How do you balance tuning performance with adding new features?

Mike Schroepfer:  The lower-level your base infrastructure and base abstractions are, the more flexibility you have in developing new features. We've done a good job of scaling out a very robust storage layer and a very robust caching layer through memcache – easy ways to get generic data in and out of these systems, nothing domain specific.  Domain-specific systems (like photos) are layered on top of the lower-level systems, so you can add new features really fast.  If you think about an operating system, many things have been built on a basic file system abstraction.  A lot of this is about finding the simplest but most useful abstraction and building a whole set of infrastructure around it, so you're not tied too directly to a specific domain.  That gives the people developing features a lot more freedom and speed.

John Furrier:  Scaling?  How much farther can you take these open systems and tools like Memcache and MySQL?

Mike Schroepfer:  We didn't think we could take it this far.  We are not only scaling these open source products, but taking them to another level via a systems approach.

With individual memcache servers we are investigating the limits of a single node.  We are currently doing about 800,000 ops per second on these boxes, and they are close to the limit.  However, we see innovation at a network scale rather than at a per-box limit, and we are looking at building systems on top of memcache, specifically for storing and accessing data.  There is room for more innovation.
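Scaling "at a network scale rather than a per-box limit" means spreading keys across many memcache nodes so aggregate throughput grows with the cluster instead of hitting a single box's ceiling. A minimal sketch of key-to-node sharding (simple hash-mod here; production systems typically use consistent hashing so that adding a node remaps only a fraction of keys):

```python
import hashlib

def pick_node(key, nodes):
    """Deterministically map a cache key to one node in the cluster.
    Hash-mod sharding: illustrative, not what Facebook actually ran."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]
```

Every client hashes the same key to the same node, so no coordination is needed at request time; each ~800k-ops/sec box simply serves its own slice of the keyspace.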

MySQL is separate from memcache, so to get a MySQL box running at scale you need tons of RAM.  Working on memcache helps drive scale for MySQL.

John Furrier:  Are you building an operating system?  Is Facebook turning into a big supercomputer?

Mike Schroepfer:  People often miss that Facebook's architecture is, in essence, like a big supercomputer.  The data is distributed, but most operations have to touch lots of different parts of the system.  To render a user's home page, Facebook has to query a couple hundred databases, touch a bunch of cache nodes, and talk to other sets of services, just to build that one web page – because the data set is so wide.  It's not like, say, webmail: I can build a webmail system that keeps a user on one server and loads the inbox from one big file, so it can look like a single-user system.  Unlike the webmail example, our system is a complex system that looks like a big distributed computer.
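The wide fan-out described here – one page assembled from hundreds of shard and service calls – is usually done by issuing the requests in parallel and merging the results. A sketch, with each shard represented by a hypothetical callable standing in for a real database or service call:

```python
from concurrent.futures import ThreadPoolExecutor

def render_home_page(user_id, shards):
    """Fan out to every shard in parallel and collect the slices of data
    needed for one page. `shards` is a list of callables, each a stand-in
    for a database query, cache lookup, or service call."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # map() preserves input order, so slices come back predictably.
        return list(pool.map(lambda fetch: fetch(user_id), shards))
        # in practice the results would then be merged, ranked, and rendered
```

The page's latency is bounded by the slowest shard rather than the sum of all of them, which is what makes touching a couple hundred backends per page view feasible.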

John Furrier:  What is your vision of Cloud Computing?

Mike Schroepfer:  The first thing is to really understand your application.  When you're building both the infrastructure and the application, you have a big advantage: you can figure out what the right building blocks are relative to the application you have or are building.  I don't think I could repurpose our infrastructure to build something entirely different.  Our environment is tuned at a very low level for the kind of data access, and the kind of data, that we store and move around in our system.  It's very efficient and easy for us to develop on.  The real challenge for the cloud is adaptability to different environments.  For example, high transaction throughput versus large data sets are different design centers.  You need to understand the application, understand what you're doing, and understand what's important – is security most important, is reliability most important, or is adapting to new features most important?  These will drive different design considerations.  You also need to understand that most organizations are only good at a couple of things, so it takes a hard look to figure out your core competency: do I have the right people to build big, scalable, reliable systems, and if not, is that something I want to invest in, or do I want to give it to someone who has that expertise?

John Furrier:  How do you look at the balance between application and infrastructure engineers – the interdisciplinary focus?

Mike Schroepfer:  It's a challenge in any organization.  There is no silver bullet, but there is a rare set of people who are able to cross those boundaries.  They can hold the application in their head and then go think like a network or hardware person.  Having those one or two people around is invaluable.  They can be the glue between teams, understanding the requirements on both sides.

John Furrier:  How do you view virtualization?  Do you see it playing a role within your infrastructure?  A classic hypervisor, or newer applications like network and configuration management?

Mike Schroepfer:  If you look at virtualization's basic concept – breaking the 1:1 dependency between some piece of software or some system and the actual physical hardware it runs on – then that concept is very solid.  Anyone who runs a big infrastructure does it.

We have a load-balanced web pool, which is effectively a virtualized resource: your web request can go to any number of web servers.  I can shoot a web server and take it down, and you would never know it, because we can send that traffic somewhere else.  The same goes for our memcache clusters and our database clusters, so we have effectively built virtualization into our infrastructure layer: there is no single piece of hardware or system that we can't take down without you noticing.  We are even thinking about this at the cluster and datacenter level – the notion of taking down entire clusters and datacenters.  It's virtualization as a system concept throughout the system.
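The "shoot a web server and nobody notices" property falls out of keeping a pool of interchangeable servers and routing each request to any healthy member. A minimal sketch of that idea (the routing policy here is a placeholder; real load balancers use round-robin, least-connections, or hashing):

```python
class WebPool:
    """A load-balanced pool: any member can serve any request, so removing
    one server just shifts traffic to the rest."""

    def __init__(self, servers):
        self.healthy = set(servers)

    def take_down(self, server):
        """'Shoot' a box: drop it from rotation. Requests keep flowing."""
        self.healthy.discard(server)

    def route(self, request):
        if not self.healthy:
            raise RuntimeError("no healthy servers in pool")
        members = sorted(self.healthy)
        # Placeholder pick; a real balancer tracks load or connection counts.
        return members[hash(request) % len(members)]
```

The same shape applies one level up: treat a whole cluster or datacenter as a pool member, and you can drain and take down entire facilities, which is the extension Schroepfer describes.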

John Furrier:  Facebook practices what it preaches – the company is very engaged and action-oriented.  This culture is driven mainly by founder Mark Zuckerberg.  How involved is Mark in engineering?

Mike Schroepfer:  Mark is actively involved in engineering at Facebook.  One interesting thing about Facebook is the founder's desire to stay and drive the business, both at the business level and at the engineering level.  Mark sits right in the center of the engineering group.  He's involved in a good way and is very tight with the engineers.  Facebook is a flat organization, and people are very interactive with each other.  Mark often sits with engineers and bounces ideas and direction off them.  One of the main reasons for moving to the new building that houses the entire company is to have that "one place" for everyone.

Mark brings a positive counterforce to the organization: a desire to do the right thing with the product.  He pushes the team to think differently, to understand what's going on with the product, and not to get set in certain ways.  Mark pushes the team to rethink its assumptions and change gears if necessary.

John Furrier:  Facebook's growth is undeniable, and the question many have is whether you will buy companies to help you grow.  In the past, companies like Cisco have acquired their way through growth.  Will Facebook do the same?  If Facebook goes on a buying spree, how do you integrate companies?

Mike Schroepfer:  Facebook is still early in this evolution.  FriendFeed is only our second acquisition, so it's a bit early to speculate about our long-term plan.  Cisco, for the most part, tended to integrate its acquisitions into its sales and marketing engine.

Facebook's core product is the platform and the Facebook web site, and we are building the whole experience around being your personal identity online and communicating with the people you care about.  We don't think about little add-ons or "can we buy this widget over there in the corner," but about things that are much more tightly coupled to the core.

We do think about growth, but we want to maintain high-quality hiring.  We still have a relatively small and tight-knit engineering community of a couple hundred engineers.

We are growing aggressively, but we want to grow carefully.  The quality of the people and the efficiency are critical to Facebook.  One metric we track is the ratio of users to engineers – it has been north of one million to one for some time.  That's 10-30x any other tech company.  If you look at our projects, where other companies might have 20-30 engineers, Facebook has two or three.  It's a very efficient model.  Teams are organized around the core distributed platform for rapid development.

John Furrier:  Facebook has had a great response from the developer community.  Now we see Twitter getting a ton of developer mindshare.  What are you doing in engineering to enable a positive developer experience on top of the Facebook platform?

Mike Schroepfer:  We launched Facebook Connect in December, and we are pleased with the adoption.  It's on tens of thousands of sites, from small blogs to big sites like Yahoo.  The next evolution of the platform is taking the social data with you – taking your friends, your identity, and your communications capability with you outside of Facebook, to the Yahoo homepage, Netflix, or Xbox.  That is the future of where we are taking the Facebook platform.

We are pioneering in a lot of ways, and there are lots of hard problems.  The balance we are trying to strike is making sure users have control of their data and an understanding of what's happening.  The goal of security is to make sure people understand what they're doing and what's happening in the system.  From a technology perspective, the goal is to continue to drive adoption of Facebook Connect and get more sites integrating it, and to continue to make sure the experience is great on the Facebook web site itself.

As an example, there are many developers and businesses on the social platform because they have figured out how to leverage the social experience – to invite friends to play a game, or to communicate in that way.  "People are figuring out how to navigate the ecosystem and how to build experiences that people really value."

Some developers, businesses, and apps on Facebook have more users than some of the most compelling startups in terms of user adoption and traffic.

John Furrier:  A personal question: what have you learned from your career?  How do you see the world in five years?  Any lessons you want to share from the open source movement?  From an innovation or developer perspective, what industry disruptions or key trends do you see?

Mike Schroepfer:  There are a couple of obvious things.  Things are more connected than ever before.  There is more power in my pocket, in terms of both computing and connectivity, than there was on a desktop only a few years ago.

On open source, there are a lot of lessons beyond the basic development process that any development organization can learn from. Every open source project and community is different.  Some may think of open source as a monolithic thing, but the communities differ – different people, different processes, different backgrounds.  The common thread among them is that distributing decisions, especially technical ones, as far into the leaves of the development process as possible is something most organizations can learn from.  Give the engineer working on the code the authority and power, because they will make good decisions.  Pushing authority, ideas, and development out to the leaves of the tree is really critical for these open source projects.

Distributed development is another key area due to the tools available.

The way we are thinking about the cloud is: what do all those applications and infrastructures need that they shouldn't have to rebuild?  For example, a Facebook Connect button eliminates the registration flow and reduces the time to get a functioning web site and service.  Across the industry – whether it's development models from open source, better abstractions and frameworks, or high-level application services ready to go – all of these things allow us to keep iterating more quickly and return to the days when, over a weekend, one or two developers could really build a great product (on top of the Facebook platform).

John Furrier:  Is there anything that you can point to in the market where innovation is needed?

Mike Schroepfer:  There has been a long struggle on the storage side. We are still using old database concepts, and everything else has evolved on top of them.  If you look at any framework out there, a lot of the secret sauce is in the object-relational mapping – how do I move from this object-based system to a storage model of tables in a database?  We haven't exactly figured out what the right data model is for storing things on the web.  A lot of people are shoehorning things into databases because it's a proven and reliable technology, and we are using them too.  I would hope that in ten years there are some different storage paradigms, not straight relational databases, that allow us to store and query data more flexibly.
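The object-relational mapping Schroepfer calls the "secret sauce" can be illustrated in miniature: flatten an object to a table row on write, and rebuild the object from the row on read. The schema and class below are entirely hypothetical, just to show the shape of the mapping.

```python
import sqlite3

class User:
    """An illustrative in-memory object to be mapped to a relational row."""
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name

def connect():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    return conn

def save(conn, user):
    """Object -> row: the 'shoehorning' step the interview describes."""
    conn.execute("INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
                 (user.user_id, user.name))

def load(conn, user_id):
    """Row -> object: rebuild the object graph from flat table data."""
    row = conn.execute("SELECT id, name FROM users WHERE id = ?",
                       (user_id,)).fetchone()
    return User(*row) if row else None
```

Even this tiny example shows the friction he points at: every object shape needs an explicit translation to and from tables, which is exactly the layer alternative storage paradigms aim to eliminate.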

Future applications are about data, not about compute. Compute is not the intensive part. It's the huge data sets that we're trying to make sense of, query in real time, and do interesting things with.  Access, storage, and computation across the data are where all the real challenges are.

John Furrier:  We've blogged at SiliconANGLE that data is the new development kit. What is your vision of data being part of the developer environment?

Mike Schroepfer:  Some of the things that are interesting to the public are trending information and information that is important to the individual.  The use case most people care about is what their friends think.  The model on Facebook is closeness and social connection – a trusted network.  It's about recommendations, and recommendations are about trust and people.  It's about making sense of the data as it maps to the individual, not just as a raw computational set.  What the data means to me as a person is the most interesting part, and how you hone that.

