What is Hadoop?
November 19, 2012
By Rachel Ramsey
, TMCnet Web Editor
Smartphones, tablets, MP3 players, SMS messaging, YouTube, Facebook, online banking and Wi-Fi are all examples of everyday technologies that people use frequently and that generate data. Mobile devices, social networks, Internet searches, e-commerce, video archives and other advances in technology gave rise to the big data phenomenon: collections of data sets so large and complex that it is impractical to process them with conventional databases and tools.
Companies pursue big data because it can be revelatory in spotting business trends, improving research quality, and gaining insights in a variety of fields, from IT to medicine to law enforcement and everything in between and beyond.
Hadoop was born out of a need to process big data as the amount of generated data continued to increase rapidly. As the Web produced more and more information, indexing its content became challenging, so Google developed MapReduce, a programming model for processing large data sets, typically by distributing the computation across clusters of computers, and published a paper describing it in 2004. Doug Cutting then created Hadoop as an open-source implementation of the MapReduce model, and Yahoo became an early, major backer of the project. Google MapReduce and Hadoop are two different implementations of the same framework/concept. Hadoop is now an open-source Apache project for storing and processing data.
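To make the MapReduce model concrete, here is a toy sketch in plain Python (this is not Hadoop's actual API, just an illustration of the idea): a map step emits key-value pairs, a shuffle step groups values by key, and a reduce step aggregates each group. The classic example is counting words across documents.

```python
# Toy illustration of the MapReduce programming model (not Hadoop's API).
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group emitted values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values; here, a simple sum.
    return (key, sum(values))

documents = ["big data needs big tools", "hadoop processes big data"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["big"] == 3, counts["data"] == 2
```

In a real Hadoop cluster the map and reduce tasks run in parallel on many machines, each reading its own slice of the data, which is what makes the model scale.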
Overall, Hadoop enables applications to work with huge amounts of data stored on many servers. Because data today is not centralized but distributed across machines, Hadoop uses MapReduce to push the query code out to where the data lives, run the analysis there, and return the desired results.
As for the more specific functions, Hadoop has a large-scale distributed file system, the Hadoop Distributed File System (HDFS), which splits data into blocks and stores them, with replication, across the cluster's nodes; the MapReduce engine then distributes programs to those nodes, accepts their results, and assembles the final data result set.
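The storage side can be sketched in a few lines of Python. This is purely illustrative (real HDFS of this era used 64 MB blocks, a NameNode for metadata, and rack-aware placement; the node names and tiny block size below are invented for the demo): a file is split into fixed-size blocks, and each block is copied to several nodes so the data survives machine failures.

```python
# Toy sketch of HDFS-style block storage (illustrative only).
BLOCK_SIZE = 8    # real HDFS used 64 MB blocks; tiny here for the demo
REPLICATION = 3   # each block is stored on 3 different nodes

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Split the file's bytes into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Assign each block to `replication` distinct nodes, round-robin.
    # (Real HDFS placement is rack-aware, not round-robin.)
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks("a" * 20)          # 20 bytes -> 3 blocks
placement = place_blocks(blocks, nodes)
# each of the 3 blocks now lives on 3 of the 4 nodes
```

Because every block exists on multiple nodes, the MapReduce engine can also schedule a computation on whichever node already holds a copy of the data, which is the "push the code to the data" idea described above.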
Developed by Doug Cutting, Cloudera's chief architect and chairman of the Apache Software Foundation, Apache Hadoop was initially inspired by papers Google published outlining its approach to handling an avalanche of data. It has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data. Hadoop was named after Cutting’s son’s toy elephant.
Image via SAP
Instead of relying on expensive, proprietary hardware and separate systems to store and process data, Hadoop enables distributed, parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it can scale out simply by adding more servers. With Hadoop, no data set is too big. And in today’s hyper-connected world, where more and more data is created every day, Hadoop’s breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.
Edited by Rachel Ramsey