By Steven Levy
Outside the Council Bluffs data center, radiator-like cooling towers chill water from the server floor down to room temperature.
Photo: Google/Connie Zhou
So far, though, there’s one area where Google hasn’t ventured: designing its own chips. But the company’s VP of platforms, Bart Sano, implies that even that could change. “I’d never say never,” he says. “In fact, I get that question every year. From Larry.”
Even if you reimagine the data center, the advantage won’t mean much if you can’t get all those bits out to customers speedily and reliably. And so Google has launched an attempt to wrap the world in fiber. In the early 2000s, taking advantage of the failure of some telecom operations, it began buying up abandoned fiber-optic networks, paying pennies on the dollar. Now, through acquisition, swaps, and actually laying down thousands of strands, the company has built a mighty empire of glass.
But when you’ve got a property like YouTube, you’ve got to do even more. It would be slow and burdensome to have millions of people grabbing videos from Google’s few data centers. So Google installs its own server racks in various outposts of its network—mini data centers, sometimes connected directly to ISPs like Comcast or AT&T—and stuffs them with popular videos. That means that if you stream, say, a Carly Rae Jepsen video, you probably aren’t getting it from Lenoir or The Dalles but from some colo just a few miles from where you are.
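For readers who want a concrete picture of that edge-caching idea, here is a minimal sketch in plain Python. It is purely illustrative, with invented names and functions rather than anything from Google's serving stack, but the basic trade is the same: answer from a nearby cache when you can, and go back to a core data center only when you must.

```python
# A simplified sketch of the edge-caching idea described above: serve a
# popular video from the nearby cache node if it is already there, and
# fall back to a distant origin data center only on a miss.
# All names here are hypothetical, for illustration only.
EDGE_CACHE = {}  # video_id -> bytes already staged at the local colo

def fetch_from_origin(video_id):
    # Placeholder for the slow, long-haul fetch from a core data center.
    return b"video bytes for " + video_id.encode()

def serve_video(video_id):
    if video_id in EDGE_CACHE:          # cache hit: a few miles away
        return EDGE_CACHE[video_id]
    data = fetch_from_origin(video_id)  # cache miss: Lenoir or The Dalles
    EDGE_CACHE[video_id] = data         # stage it for the next viewer
    return data

if __name__ == "__main__":
    serve_video("example-video")  # first request travels to the origin
    serve_video("example-video")  # second request is answered locally
```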
Over the years, Google has also built a software system that allows it to manage its countless servers as if they were one giant entity. Its in-house developers can act like puppet masters, dispatching thousands of computers to perform tasks as easily as running a single machine. In 2002 its scientists created Google File System, which smoothly distributes files across many machines. MapReduce, a Google system for writing cloud-based applications, was so successful that an open source version called Hadoop has become an industry standard. Google also created software to tackle a knotty issue facing all huge data operations: When tasks come pouring into the center, how do you determine instantly and most efficiently which machines can best afford to take on the work? Google has solved this “load-balancing” issue with an automated system called Borg.
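MapReduce is easiest to grasp through the canonical word-count example. The sketch below, in ordinary Python, compresses the map, shuffle, and reduce phases into a single process; on a real cluster each phase is spread across thousands of machines. This is not Google's code, just the shape of the programming model that Hadoop later reimplemented.

```python
# A minimal sketch of the MapReduce programming model: map emits
# (word, 1) pairs, shuffle groups the pairs by key, and reduce sums
# each group. In production, each phase would run on many machines.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    docs = ["the data center is a computer",
            "the computer is a warehouse of computers"]
    print(reduce_phase(shuffle_phase(map_phase(docs))))
```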
These innovations allow Google to fulfill an idea embodied in a 2009 paper written by Hölzle and one of his top lieutenants, computer scientist Luiz Barroso: “The computing platform of interest no longer resembles a pizza box or a refrigerator but a warehouse full of computers … We must treat the data center itself as one massive warehouse-scale computer.”
This is tremendously empowering for the people who write Google code. Just as your computer is a single device that runs different programs simultaneously—and you don’t have to worry about which part is running which application—Google engineers can treat seas of servers like a single unit. They just write their production code, and the system distributes it across a server floor they will likely never be authorized to visit. “If you’re an average engineer here, you can be completely oblivious,” Hölzle says. “You can order x petabytes of storage or whatever, and you have no idea what actually happens.”
But of course, none of this infrastructure is any good if it isn’t reliable. Google has innovated its own answer for that problem as well—one that involves a surprising ingredient for a company built on algorithms and automation: people.
At 3 am on a chilly winter morning, a small cadre of engineers begin to attack Google. First they take down the internal corporate network that serves the company’s Mountain View, California, campus. Later the team attempts to disrupt various Google data centers by causing leaks in the water pipes and staging protests outside the gates—in hopes of distracting attention from intruders who try to steal data-packed disks from the servers. They mess with various services, including the company’s ad network. They take a data center in the Netherlands offline. Then comes the coup de grâce—cutting most of Google’s fiber connection to Asia.
Turns out this is an inside job. The attackers, working from a conference room on the fringes of the campus, are actually Googlers, part of the company’s Site Reliability Engineering team, the people with ultimate responsibility for keeping Google and its services running. SREs are not merely troubleshooters but engineers who are also in charge of getting production code onto the “bare metal” of the servers; many are embedded in product groups for services like Gmail or search. Upon becoming an SRE, members of this geek SEAL team are presented with leather jackets bearing a military-style insignia patch. Every year, the SREs run this simulated war—called DiRT (disaster recovery testing)—on Google’s infrastructure. The attack may be fake, but it’s almost indistinguishable from reality: Incident managers must go through response procedures as if they were really happening. In some cases, actual functioning services are messed with. If the teams in charge can’t figure out fixes and patches to keep things running, the attacks must be aborted so real users won’t be affected. In classic Google fashion, the DiRT team always adds a goofy element to its dead-serious test—a loony narrative written by a member of the attack team. This year it involves a Twin Peaks-style supernatural phenomenon that supposedly caused the disturbances. Previous DiRTs were attributed to zombies or aliens.
As the first attack begins, Kripa Krishnan, an upbeat engineer who heads the annual exercise, explains the rules to about 20 SREs in a conference room already littered with junk food. “Do not attempt to fix anything,” she says. “As far as the people on the job are concerned, we do not exist. If we’re really lucky, we won’t break anything.” Then she pulls the plug—for real—on the campus network. The team monitors the phone lines and IRC channels to see when the Google incident managers on call around the world notice that something is wrong. It takes only five minutes for someone in Europe to discover the problem, and he immediately begins contacting others.
“My role is to come up with big tests that really expose weaknesses,” Krishnan says. “Over the years, we’ve also become braver in how much we’re willing to disrupt in order to make sure everything works.” How did Google do this time? Pretty well. Despite the outages in the corporate network, executive chair Eric Schmidt was able to run a scheduled global all-hands meeting. The imaginary demonstrators were placated by imaginary pizza. Even shutting down three-fourths of Google’s Asia traffic capacity didn’t shut out the continent, thanks to extensive caching. “This is the best DiRT ever!” Krishnan exclaimed at one point.
The SRE program began when Hölzle charged an engineer named Ben Treynor with making Google’s network fail-safe. This was especially tricky for a massive company like Google that is constantly tweaking its systems and services—after all, the easiest way to stabilize it would be to freeze all change. Treynor ended up rethinking the very concept of reliability. Instead of trying to build a system that never failed, he gave each service a budget—an amount of downtime it was permitted to have. Then he made sure that Google’s engineers used that time productively. “Let’s say we wanted Google+ to run 99.95 percent of the time,” Hölzle says. “We want to make sure we don’t get that downtime for stupid reasons, like we weren’t paying attention. We want that downtime because we push something new.”
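The arithmetic behind such a budget is simple. Taking Hölzle's 99.95 percent figure as the example, the sketch below works out how much downtime that target leaves a team to spend over a 30-day window: roughly 21.6 minutes.

```python
# Back-of-the-envelope arithmetic for the "error budget" idea Hölzle
# describes: an availability target leaves a fixed amount of allowable
# downtime, which teams can then spend on risky pushes.
def downtime_budget_minutes(availability_target, days=30):
    """Allowed downtime, in minutes, over a window of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_target)

if __name__ == "__main__":
    # Hölzle's example target of 99.95 percent
    print(round(downtime_budget_minutes(0.9995), 1))  # about 21.6 minutes per 30 days
```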
Nevertheless, accidents do happen—as Sabrina Farmer learned on the morning of April 17, 2012. Farmer, who had been the lead SRE on the Gmail team for a little over a year, was attending a routine design review session. Suddenly an engineer burst into the room, blurting out, “Something big is happening!” Indeed: For 1.4 percent of users (a large number of people), Gmail was down. Soon reports of the outage were all over Twitter and tech sites. They were even bleeding into mainstream news.
The conference room transformed into a war room. Collaborating with a peer group in Zurich, Farmer launched a forensic investigation. A breakthrough came when one of her Gmail SREs sheepishly admitted, “I pushed a change on Friday that might have affected this.” Those responsible for vetting the change hadn’t been meticulous, and when some Gmail users tried to access their mail, various replicas of their data across the system were no longer in sync. To keep the data safe, the system froze them out.
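The principle at work is worth spelling out: when copies of the same data disagree, it is safer to lock users out than to guess. The toy sketch below illustrates that fail-safe instinct; it is not Gmail's actual logic, and every name in it is invented for illustration.

```python
# A conceptual illustration of the fail-safe behavior described above:
# if the replicas of a user's data disagree, refuse to serve the account
# rather than risk showing the wrong version. Not Gmail's real code.
class ReplicasOutOfSync(Exception):
    pass

def load_mailbox(replicas):
    """`replicas` is a list of copies of the same user's mailbox state."""
    if any(copy != replicas[0] for copy in replicas[1:]):
        raise ReplicasOutOfSync("freezing account until replicas converge")
    return replicas[0]

if __name__ == "__main__":
    print(load_mailbox(["inbox-v42", "inbox-v42", "inbox-v42"]))  # serves normally
    try:
        load_mailbox(["inbox-v42", "inbox-v41", "inbox-v42"])
    except ReplicasOutOfSync as err:
        print("locked out:", err)
```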
The diagnosis had taken 20 minutes, designing the fix 25 minutes more—pretty good. But the event went down as a Google blunder. “It’s pretty painful when SREs trigger a response,” Farmer says. “But I’m happy no one lost data.” Nonetheless, she’ll be happier if her future crises are limited to DiRT-borne zombie attacks.
One scenario that DiRT never envisioned was the presence of a reporter on a server floor. But here I am in Lenoir, earplugs in place, with Joe Kava motioning me inside.
We have passed through the heavy gate outside the facility, with remote-control barriers evoking the Korean DMZ. We have walked through the business offices, decked out in Nascar regalia. (Every Google data center has a decorative theme.) We have toured the control room, where LCD dashboards monitor every conceivable metric. Later we will climb up to catwalks to examine the giant cooling towers and backup electric generators, which look like Beatle-esque submarines, only green. We will don hard hats and tour the construction site of a second data center just up the hill. And we will stare at a rugged chunk of land that one day will hold a third mammoth computational facility.
But now we enter the floor. Big doesn’t begin to describe it. Row after row of server racks seem to stretch to eternity. Joe Montana in his prime could not throw a football the length of it.
Some halls in Google’s Hamina, Finland, data center remain vacant—for now.
Photo: Google/Connie Zhou
During my interviews with Googlers, the idea of hot aisles and cold aisles has been an abstraction, but on the floor everything becomes clear. The cold aisle refers to the general room temperature—which Kava confirms is 77 degrees. The hot aisle is the narrow space between the backsides of two rows of servers, tightly enclosed by sheet metal on the ends. A nest of copper coils absorbs the heat. Above are huge fans, which sound like jet engines jacked through Marshall amps.
We walk between the server rows. All the cables and plugs are in front, so no one has to crack open the sheet metal and venture into the hot aisle, thereby becoming barbecue meat. (When someone does have to head back there, the servers are shut down.) Every server has a sticker with a code that identifies its exact address, useful if something goes wrong. The servers have thick black batteries alongside. Everything is uniform and in place—nothing like the spaghetti tangles of Google’s long-ago Exodus era.
Blue lights twinkle, indicating … what? A web search? Someone’s Gmail message? A Glass calendar event floating in front of Sergey’s eyeball? It could be anything.
Every so often a worker appears—a long-haired dude in shorts propelling himself by scooter, or a woman in a T-shirt who’s pushing a cart with a laptop on top and dispensing repair parts to servers like a psychiatric nurse handing out meds. (In fact, the area on the floor that holds the replacement gear is called the pharmacy.)
How many servers does Google employ? It’s a question that has dogged observers since the company built its first data center. Google has long stuck to “hundreds of thousands.” (There are 49,923 operating in the Lenoir facility on the day of my visit.) I will later come across a clue when I get a peek inside Google’s data center R&D facility in Mountain View. In a secure area, there’s a row of motherboards fixed to the wall, an honor roll of generations of Google’s homebrewed servers. One sits atop a tiny embossed plaque that reads “July 9, 2008. Google’s millionth server.” But executives explain that this is a cumulative number, not necessarily an indication that Google has a million servers in operation at once.
Wandering the cold aisles of Lenoir, I realize that the magic number, if it is even obtainable, is basically meaningless. Today’s machines, with multicore processors and other advances, have many times the power and utility of earlier versions. A single Google server circa 2012 may be the equivalent of 20 servers from a previous generation. In any case, Google thinks in terms of clusters—huge numbers of machines that act together to provide a service or run an application. “An individual server means nothing,” Hölzle says. “We track computer power as an abstract metric.” It’s the realization of a concept Hölzle and Barroso spelled out three years ago: the data center as a computer.
As we leave the floor, I feel almost levitated by my peek inside Google’s inner sanctum. But a few weeks later, back at the Googleplex in Mountain View, I realize that my epiphanies have limited shelf life. Google’s intention is to render the data center I visited obsolete. “Once our people get used to our 2013 buildings and clusters,” Hölzle says, “they’re going to complain about the current ones.”
Asked in what areas one might expect change, Hölzle mentions data center and cluster design, speed of deployment, and flexibility. Then he stops short. “This is one thing I can’t talk about,” he says, a smile cracking his bearded visage, “because we’ve spent our own blood, sweat, and tears. I want others to spend their own blood, sweat, and tears making the same discoveries.” Google may be dedicated to providing access to all the world’s data, but some information it’s still keeping to itself.