These are instructions for running a single-node Hadoop cluster on a Mac. This is my preferred setup for local map-reduce development; I used to use a virtual machine, but I’ve found that running it on my native system works just as well and is much easier.
Step 1: Download a Hadoop distribution
Any Hadoop distribution should do; either download the latest stable version from the releases page or get a tarball of the Cloudera Distribution. The Cloudera Hadoop distributions tend to have useful patches that open up compatibility with more of the ecosystem.
I like to unpack Hadoop to ~/Library/Hadoop/. You can put it wherever you like, but I’ll assume in the remainder of these instructions that you’re following my convention.
$ wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u1.tar.gz
$ tar xzvf hadoop-0.20.2-cdh3u1.tar.gz
$ mv hadoop-0.20.2-cdh3u1 ~/Library/Hadoop
This distribution contains the code for both the map-reduce system and the Hadoop Distributed File System (HDFS).
Step 2: Configure Hadoop
All of Hadoop’s configuration files live in ~/Library/Hadoop/conf/. Start with hadoop-env.sh.
Hadoop requires a Java 6 JRE. Snow Leopard ships with one; on Lion, OS X offers to install it the first time something needs Java. Add the following line somewhere in your hadoop-env.sh:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
You’ll need to run four Hadoop JVMs for a complete single-node cluster. By default, each Hadoop daemon JVM gets a 1 GB heap, which is overkill for a dev-only, single-machine node running on a workstation. I typically cut it down to 200 MB by adding the following line to my hadoop-env.sh:
export HADOOP_HEAPSIZE=200
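If you'd rather not edit hadoop-env.sh by hand, the two export lines above can be appended from the shell. This is a minimal sketch: append_once is a hypothetical helper (not part of Hadoop), and the default path assumes the ~/Library/Hadoop convention from Step 1.

```shell
#!/bin/sh
# append_once FILE LINE: add LINE to FILE unless an identical line is already there.
# (Hypothetical helper for illustration; not part of Hadoop.)
append_once() {
    file=$1
    line=$2
    # -q quiet, -x match the whole line, -F treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Assumes Hadoop was unpacked to ~/Library/Hadoop; override HADOOP_ENV otherwise.
HADOOP_ENV="${HADOOP_ENV:-$HOME/Library/Hadoop/conf/hadoop-env.sh}"

if [ -f "$HADOOP_ENV" ]; then
    append_once "$HADOOP_ENV" 'export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home'
    append_once "$HADOOP_ENV" 'export HADOOP_HEAPSIZE=200'
fi
```

Running it twice leaves the file unchanged the second time, so it is safe to re-run after upgrading or re-unpacking Hadoop.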
Next, you’ll need to add some properties to your conf/core-site.xml, which sets configuration properties applicable for all of the daemons.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/path/to/Hadoop/tmp</value>
</property>
</configuration>
fs.default.name sets the default file system as an HDFS instance whose master runs on localhost. hadoop.tmp.dir defines the path on the local filesystem that the Hadoop daemons will use for persistence. I use ~/Library/Hadoop/tmp.
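One thing to watch: as far as I know, Hadoop does not expand a leading ~ in configuration values, so the value in core-site.xml should be an absolute path. The sketch below lets the shell expand $HOME, creates the directory, and prints the line to paste into the file. The tmp_value_line helper and the HADOOP_TMP variable are made up for this example.

```shell
#!/bin/sh
# tmp_value_line DIR: print the <value> element for hadoop.tmp.dir.
# (Hypothetical helper for illustration.)
tmp_value_line() {
    printf '<value>%s</value>\n' "$1"
}

# "$HOME" is expanded by the shell here; a literal "~" written into
# core-site.xml is assumed NOT to be expanded by Hadoop.
HADOOP_TMP="${HADOOP_TMP:-$HOME/Library/Hadoop/tmp}"

# Create the directory up front so a mistyped path shows up immediately.
mkdir -p "$HADOOP_TMP"

tmp_value_line "$HADOOP_TMP"
```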
conf/hdfs-site.xml is used for HDFS-specific configuration. Here, you’ll want to set dfs.replication to 1, so that HDFS will only try to store one copy of each file; any more would be impossible in a single-node cluster.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Almost done! The last thing you need to do is configure your map-reduce system in conf/mapred-site.xml.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>2</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
</property>
</configuration>
mapred.job.tracker defines the location of the map-reduce master. The two *.tasks.maximum properties define the number of map and reduce slots that will be available on your box. I set both to two on my quad-core iMac, and I can run test jobs without feeling any kind of performance hit on the rest of the system.
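If you want a starting point for the slot counts on a different machine, here is a rough sketch that generalizes "two slots on a quad-core" to "half the cores, at least one". That rule of thumb is an assumption drawn from the single example above, not a Hadoop recommendation, and slots_for_cores is a hypothetical helper.

```shell
#!/bin/sh
# slots_for_cores N: suggest half of N cores as the slot count, never fewer than one.
# (Hypothetical helper; the "half the cores" rule is only a rule of thumb.)
slots_for_cores() {
    slots=$(( $1 / 2 ))
    if [ "$slots" -lt 1 ]; then
        slots=1
    fi
    echo "$slots"
}

# On OS X, sysctl -n hw.ncpu reports the logical core count; fall back to 2 elsewhere.
cores=$(sysctl -n hw.ncpu 2>/dev/null || echo 2)
echo "suggested map/reduce slots: $(slots_for_cores "$cores")"
```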
Step 3: Set environment variables
You’ll want to set the HADOOP_HOME environment variable and add the Hadoop executable to your PATH. Add these lines to your .profile:
export HADOOP_HOME=$HOME/Library/Hadoop
export PATH=$PATH:$HADOOP_HOME/bin
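After opening a new Terminal window (so .profile is re-read), a quick sanity check like the following can confirm that both variables took effect. check_hadoop_env is a hypothetical helper written for this example.

```shell
#!/bin/sh
# check_hadoop_env: report whether HADOOP_HOME is set and the hadoop launcher is on PATH.
# (Hypothetical helper for illustration.)
check_hadoop_env() {
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set"
        return 1
    fi
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop not found on PATH"
        return 1
    fi
    # command -v prints the full path of the executable that would be run.
    echo "ok: $(command -v hadoop)"
}

check_hadoop_env || true
```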
Step 4: Format HDFS
Open up Terminal and run the command:
$ hadoop namenode -format
This will prepare the hadoop.tmp.dir location you set up for HDFS storage.
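To confirm that the format step actually wrote something, you can look for the namenode's VERSION file. This sketch assumes dfs.name.dir was left at its default of ${hadoop.tmp.dir}/dfs/name and that the tmp directory is ~/Library/Hadoop/tmp; adjust HADOOP_TMP if yours differs. namenode_formatted is a hypothetical helper.

```shell
#!/bin/sh
# namenode_formatted TMPDIR: succeed if TMPDIR holds a formatted name directory.
# Assumes dfs.name.dir is at its default, ${hadoop.tmp.dir}/dfs/name.
namenode_formatted() {
    [ -f "$1/dfs/name/current/VERSION" ]
}

HADOOP_TMP="${HADOOP_TMP:-$HOME/Library/Hadoop/tmp}"

if namenode_formatted "$HADOOP_TMP"; then
    echo "HDFS is formatted"
else
    echo "no formatted namenode under $HADOOP_TMP"
fi
```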
Step 5: Launch Hadoop
Hadoop comes with shell scripts to bring a cluster up and down, but I don’t use them locally. Instead, I prefer to just run each of the Hadoop daemons in the foreground in a Terminal window so I can check their logs easily. Since it can get tedious to manually run each of the four[1] Hadoop daemons (the namenode, datanode, jobtracker, and tasktracker), I’ve whipped up a script using the completely rad terminal multiplexer tmux to start up all the daemons in a tmux session[2]. I can reattach to the tmux session whenever I want to check the logs, and I just Control-C my way through each screen when I’m done with my cluster.
You can launch a cluster either by invoking my script:
$ hadoop-tmux
Or else you can launch the daemons individually in separate terminal windows:
$ hadoop namenode
$ hadoop datanode
$ hadoop jobtracker
$ hadoop tasktracker
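The hadoop-tmux script itself isn't reproduced here, so the following is only a guess at what such a launcher could look like, not the original. It starts each daemon in its own window of a detached tmux session and then attaches to it. The session name, the RUN switch, and the tmux_commands helper are all assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical stand-in for a "hadoop-tmux" launcher; NOT the original script.
# Each daemon runs in the foreground in its own tmux window.
SESSION="${SESSION:-hadoop}"
DAEMONS="namenode datanode jobtracker tasktracker"

# tmux_commands: print the tmux invocations, one per line, so they can be reviewed.
tmux_commands() {
    first=1
    for daemon in $DAEMONS; do
        if [ "$first" -eq 1 ]; then
            # The first daemon creates the detached session.
            echo "tmux new-session -d -s $SESSION -n $daemon 'hadoop $daemon'"
            first=0
        else
            # Later daemons get new windows in that session.
            echo "tmux new-window -t $SESSION: -n $daemon 'hadoop $daemon'"
        fi
    done
    echo "tmux attach-session -t $SESSION"
}

# Dry run by default: only print the commands. Set RUN=1 to execute them.
if [ "${RUN:-0}" = "1" ]; then
    eval "$(tmux_commands)"
else
    tmux_commands
fi
```

Saved as an executable file on your PATH, running it with RUN=1 brings up the whole cluster. Because each window runs its daemon directly, the window closes when you Control-C that daemon.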
Step 6: Estimate Pi
Once your cluster is up and running, it’s time to test it out. Hadoop ships with a jar of example map-reduce jobs, and newer versions include a job that estimates pi using the Monte Carlo method. I like pi estimation for testing a new cluster because it doesn’t rely on any input data.
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 100
You should now have a functional Hadoop cluster and a woefully inaccurate estimate of pi in your hands.
[1] Any experts following along will note that I’m not mentioning the secondary namenode. When running locally for development, I don’t care about data integrity on HDFS or its long-term performance. Therefore, I see no compelling reason to run the secondary namenode.
[2] I’m completely glossing over what tmux is, how it works, etc. If you’re not familiar with tmux (or GNU Screen, which it aims to replace), I would recommend just starting each of the daemons manually. But seriously, learn about tmux. It’s awesome.