R Programming Language Can Now Manipulate Hadoop with RHadoop
January 17, 2013
By Mae Kowalke, TMCnet Contributor
Statisticians who are familiar with the R programming language are now better able to use Hadoop to run MapReduce jobs or access HBase tables. Revolution Analytics has created RHadoop, a collection of three R packages that let users run MapReduce jobs entirely from within R and give them access to their Hadoop files and HBase tables, according to a recent MapR Technologies blog post.
“You get all the statistical analysis capabilities of your R environment with the enterprise grade, massively scalable, distributed compute provided by MapR’s Hadoop distribution,” enthused the blog post.
The packages have been implemented and tested in Cloudera's distribution of Hadoop (CDH3 and CDH4) and R 2.15.0, according to the project's GitHub page.
RHadoop consists of rmr, whose functions provide Hadoop MapReduce functionality in R; rhdfs, whose functions provide file management of the HDFS from within R; and rhbase, whose functions provide database management for the HBase distributed database from within R.
The rmr2 package uses Hadoop streaming to invoke R on individual TaskTracker nodes, so R and the rmr2 package need to be installed on the client machine from which the user runs R as well as on all the TaskTracker nodes in their MapR cluster, according to the MapR posting. “Once installed, just set up environment variables to point to the Hadoop command and the Hadoop streaming jar, and you can run R MapReduce jobs on your MapR cluster.”
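In practice, that setup and a first job might look like the following R sketch. The file paths and the squared-numbers example are illustrative assumptions, not taken from the post; the actual locations depend on your Hadoop distribution, and the code requires a running cluster.

```r
# Point rmr2 at the Hadoop command and the streaming jar
# (illustrative paths -- use the locations in your own distribution)
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar")

library(rmr2)

# Write a small vector to the distributed file system,
# square each value in a map-only MapReduce job, and read the result back
ints <- to.dfs(1:10)
result <- mapreduce(
  input = ints,
  map   = function(k, v) keyval(v, v^2)
)
from.dfs(result)
```

The map function runs as an R process on the TaskTracker nodes, which is why rmr2 must be installed cluster-wide and not just on the client.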
The rhdfs package provides a client interface to files on the user's MapR cluster through the Hadoop command.
“This machine does need to have the MapR client software installed and be configured to access your cluster,” noted the MapR post. “As long as you can run ‘hadoop fs’ commands from the shell, you can use rhdfs.”
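A minimal rhdfs session might look like this sketch; the directory and file names are hypothetical, and the calls assume `HADOOP_CMD` is already set and a cluster is reachable.

```r
library(rhdfs)

# Initialize the connection to the distributed file system
# (reads the HADOOP_CMD environment variable)
hdfs.init()

# List a directory and copy a local file into the cluster
hdfs.ls("/user")
hdfs.put("local-data.csv", "/user/data/")
```

Because everything goes through the Hadoop command, any file visible to `hadoop fs` is reachable from R.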
The rhbase package accesses HBase via the HBase Thrift server, which is included in the MapR HBase distribution.
“The rhbase package is a Thrift client that sends requests and receives responses from the Thrift server,” explained MapR. “The Thrift server listens for rhbase’s Thrift requests and in turn uses the HBase HTable java class to access HBase. For an R developer, this is all transparent!”
Adds the MapR blog, “For simplicity, rhbase defaults to using a local Thrift server on the machine where R and rhbase are installed. This is a client machine where you would run the HBase shell. Since rhbase is a client-side technology, it only needs to be installed on the client system that will access the MapR HBase cluster. Nothing additional needs to be installed on your HBase cluster nodes.”
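The client-side flow described above might be sketched as follows. This assumes a Thrift server running on the default local host and port, and the table, column family, and row names are hypothetical.

```r
library(rhbase)

# Connect to the local Thrift server on its default host and port;
# rhbase translates each call into a Thrift request
hb.init()

# Create a table with one column family, insert a cell, and read it back
hb.new.table("stats", "d")
hb.insert("stats", list(list("row1", "d:count", "42")))
hb.get("stats", "row1")
```

All of the Thrift traffic stays between this client machine and the server, which is why nothing extra needs to be installed on the HBase cluster nodes.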
And with that, statisticians now have yet another way to slice and dice big data.
Edited by Rachel Ramsey