Facebook Releases Improved Hadoop Data Processing Scheduler
November 20, 2012
By Mae Kowalke, TMCnet Contributor
Few companies have the raw data that Facebook gains from its dominance in the social networking space. Every 24 hours, more than half a petabyte of new data shows up on the Facebook servers, according to a recent Facebook blog post.
“Ad-hoc queries, data pipelines and custom MapReduce jobs process this raw data around the clock to generate more meaningful features and aggregations,” noted the post. The data is a top strategic asset for the firm, but only if the firm analyzes and uses it meaningfully.
Which is why the stock MapReduce scheduler had to go. Facebook uses the open-source Apache Hadoop data processing platform for its analysis, but the company found Hadoop’s built-in MapReduce job scheduler inefficient. So Facebook developed its own scheduler, named Corona.
Mmmmm, Corona. The new Facebook scheduler is able to utilize up to 95 percent of cluster resources, according to Facebook tests, whereas MapReduce could only put about 70 percent of the Facebook cluster resources to use at any given time.
Corona also addresses other MapReduce limitations. MapReduce typically delayed queries before executing them, the Facebook team noted, and the framework offered no easy way to schedule non-MapReduce processing on the same cluster. Further, software upgrades required system downtime that forced the halting of existing jobs.
All of these were addressed by the Corona solution.
“In performance tests, Corona took around 55 seconds to fill an empty workspace, whereas MapReduce took 66 seconds -- which constitutes a 17 percent improvement,” reported InfoWorld. “Jobs are started more quickly now, as well, within 25 seconds, down from 50 seconds with MapReduce.”
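The percentages behind those timings can be checked with a short calculation. The sketch below is illustrative only: it uses just the four figures quoted above, and the helper name percent_reduction is a hypothetical label chosen for this example, not anything taken from Corona or from the InfoWorld report.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Timings quoted in the article, in seconds.
fill_mapreduce, fill_corona = 66.0, 55.0    # time to fill an empty workspace
start_mapreduce, start_corona = 50.0, 25.0  # time until a job starts

fill_gain = percent_reduction(fill_mapreduce, fill_corona)
start_gain = percent_reduction(start_mapreduce, start_corona)

print(f"Fill time reduced by {fill_gain:.1f}%")
print(f"Job start time reduced by {start_gain:.1f}%")
```

Measured against the MapReduce baseline, the fill time drops by about 16.7 percent, which rounds to the 17 percent quoted, and the job start time is cut in half.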
Initially, Facebook tested the new framework on 500 of its nodes, reported the Facebook blog post. When Corona proved effective, it was rolled out to all non-mission-critical jobs, including those utilizing more than 1,000 servers. Now all Hadoop workloads are scheduled by Corona.
Facebook is not the only organization to notice that MapReduce needs improvement. The Apache Hadoop project is also aware of the problem and is working on MapReduce 2.0, an overhaul of MapReduce called YARN.
Facebook examined YARN as an alternative to growing its own solution, but the company’s engineers were unsure whether YARN could accommodate data processing jobs as large as those Facebook would throw at the scheduler. Most firms do not have Facebook’s volume of data to crunch.
“Corona has become an integral part of Facebook’s data infrastructure and helps power big data analytics for many teams across the company,” said Facebook in the blog post. “We are continuing to improve it and are very excited about launching the upcoming features that will enable it to meet the ever-growing needs of our teams for years to come.”
In keeping with open-source norms, Facebook has made Corona freely available online for others to use. This is the same version of the software currently in use at Facebook, the company noted in the blog.
Time to hit the Like button.
Edited by Rachel Ramsey