Month September 2011

Running a Hadoop pseudo cluster on a Mac

These are instructions for running a single-node Hadoop cluster on a Mac. This is my preferred setup for local map-reduce development; I used to use a virtual machine, but I’ve found that running it on my native system works just as well and is much easier.

Step 1: Download a Hadoop distribution

Any Hadoop distribution should do; either download the latest stable version from the releases page or get a tarball of the Cloudera Distribution. The Cloudera Hadoop distributions tend to have useful patches that open up compatibility with more of the ecosystem.

I like to unpack Hadoop to ~/Library/Hadoop/. You can put it wherever you like, but I’ll assume in the remainder of these instructions that you’re following my convention.

$ wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u1.tar.gz
$ tar xzvf hadoop-0.20.2-cdh3u1.tar.gz
$ mv hadoop-0.20.2-cdh3u1 ~/Library/Hadoop

This distribution contains the code for both the map-reduce system and the Hadoop Distributed File System (HDFS).

Step 2: Configure Hadoop

All of Hadoop’s configuration files live in ~/Library/Hadoop/conf/. Start with hadoop-env.sh.

Hadoop requires a Java 6 JRE, which is the default on all Macs running Snow Leopard or Lion. Add the following line somewhere in your hadoop-env.sh:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home

You’ll need to run four Hadoop JVMs for a complete single-node cluster. By default, each Hadoop daemon JVM gets a 1 GB heap, which is overkill for a development-only, single-machine cluster running on a workstation. I typically cut it down to 200 MB by adding the following line to my hadoop-env.sh:

export HADOOP_HEAPSIZE=200

Next, you’ll need to add some properties to your conf/core-site.xml, which holds configuration properties that apply to all of the daemons.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- core-site.xml -->
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/path/to/Hadoop/tmp</value>
    </property>
</configuration>

fs.default.name sets the default file system as an HDFS instance whose master runs on localhost. hadoop.tmp.dir defines the path on the local filesystem that the Hadoop daemons will use for persistence. I use ~/Library/Hadoop/tmp.
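One thing to watch: as far as I know, Hadoop does not expand a leading ~ in configuration values, so the value in core-site.xml should be an absolute path. Here is a minimal sketch that creates the directory and prints the path to paste into the file. The helper name hadoop_tmp_dir is made up for illustration, and the default location is just the convention used in this post.

```shell
# Create the tmp directory under a Hadoop install and print its absolute path.
# hadoop_tmp_dir is a hypothetical helper, not something Hadoop ships.
hadoop_tmp_dir() {
    base="${1:-$HOME/Library/Hadoop}"   # install location assumed in this post
    mkdir -p "$base/tmp" || return 1
    # cd + pwd turns the path into an absolute one
    (cd "$base/tmp" && pwd)
}

# Example: hadoop_tmp_dir "$HOME/Library/Hadoop"
```

Paste whatever it prints into the hadoop.tmp.dir value.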

conf/hdfs-site.xml is used for HDFS-specific configuration. Here, you’ll want to set dfs.replication to 1 so that HDFS only tries to store one copy of each block; with a single datanode, there is nowhere to put additional replicas.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Almost done! The last thing you need to do is configure your map-reduce system in conf/mapred-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- mapred-site.xml -->
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
    </property>
    <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
    </property>
    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
    </property>
</configuration>

mapred.job.tracker defines the location of the map-reduce master. The two *.tasks.maximum properties define the number of map and reduce slots that will be available on your box. I set both to two on my quad-core iMac, and I can run test jobs without feeling any kind of performance hit on the rest of the system.
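To get a feel for what these settings add up to, here is a rough worst-case heap budget: four daemons at the HADOOP_HEAPSIZE set earlier, plus one child JVM per task slot. The 200 MB child heap is an assumption based on the stock mapred.child.java.opts of -Xmx200m; check your distribution’s mapred-default.xml before trusting it. The variable names are only for illustration.

```shell
DAEMONS=4            # namenode, datanode, jobtracker, tasktracker
DAEMON_HEAP_MB=200   # HADOOP_HEAPSIZE from hadoop-env.sh
MAP_SLOTS=2          # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS=2       # mapred.tasktracker.reduce.tasks.maximum
CHILD_HEAP_MB=200    # assumed: stock mapred.child.java.opts of -Xmx200m

# Heap ceilings only; actual JVM memory use will differ.
daemon_mb=$((DAEMONS * DAEMON_HEAP_MB))
task_mb=$(((MAP_SLOTS + REDUCE_SLOTS) * CHILD_HEAP_MB))
total_mb=$((daemon_mb + task_mb))
echo "daemons: ${daemon_mb} MB, tasks: ${task_mb} MB, total: ${total_mb} MB"
```

With the values above, that comes to 1600 MB of maximum heap if every slot is busy at once.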

Step 3: Set environment variables

You’ll want to set the HADOOP_HOME environment variable and add the Hadoop executable to your PATH. Add these lines to your .profile:

export HADOOP_HOME=$HOME/Library/Hadoop
export PATH=$PATH:$HADOOP_HOME/bin
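After opening a new Terminal window, a quick sanity check along these lines can confirm that both variables took effect. This is only a sketch; check_hadoop_env is a made-up helper name.

```shell
# Report whether HADOOP_HOME is set and its bin directory is on PATH.
# check_hadoop_env is a hypothetical helper, not part of Hadoop.
check_hadoop_env() {
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set"
        return 1
    fi
    # Wrap PATH in colons so the match works at either end of the list
    case ":$PATH:" in
        *":$HADOOP_HOME/bin:"*) echo "ok: $HADOOP_HOME/bin is on PATH" ;;
        *) echo "$HADOOP_HOME/bin is missing from PATH"; return 1 ;;
    esac
}

# Example: check_hadoop_env
```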

Step 4: Format HDFS

Open up Terminal and run the command:

$ hadoop namenode -format

This will prepare the hadoop.tmp.dir location you set up for HDFS storage.
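Formatting again later wipes the namenode’s metadata, so it can be worth checking first whether the directory already holds a formatted filesystem. The sketch below assumes dfs.name.dir is left at its default of ${hadoop.tmp.dir}/dfs/name; needs_format and HADOOP_TMP are names invented for this example.

```shell
# Succeeds when no namenode metadata exists yet under the given tmp dir.
# Assumes dfs.name.dir is left at its default of ${hadoop.tmp.dir}/dfs/name.
needs_format() {
    [ ! -d "$1/dfs/name/current" ]
}

HADOOP_TMP="${HADOOP_TMP:-$HOME/Library/Hadoop/tmp}"   # hypothetical variable
if needs_format "$HADOOP_TMP"; then
    echo "no namenode metadata under $HADOOP_TMP; run: hadoop namenode -format"
else
    echo "HDFS already formatted under $HADOOP_TMP; skipping"
fi
```

The script only prints advice; it never runs the format command itself.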

Step 5: Launch Hadoop

Hadoop comes with shell scripts to bring a cluster up and down, but I don’t use them locally. Instead, I prefer to just run each of the Hadoop daemons in the foreground in a Terminal window so I can check their logs easily. Since it can get tedious to manually run each of the four[1] Hadoop daemons (the namenode, datanode, jobtracker, and tasktracker), I’ve whipped up a script using the completely rad terminal multiplexer tmux to start up all the daemons in a tmux session[2]. I can reattach to the tmux session whenever I want to check the logs, and I just Control-C my way through each screen when I’m done with my cluster.

You can launch a cluster either by invoking my script:

$ hadoop-tmux

Or else you can launch the daemons individually in separate terminal windows:

$ hadoop namenode
$ hadoop datanode
$ hadoop jobtracker
$ hadoop tasktracker
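The hadoop-tmux script itself isn’t reproduced here. For readers who want something similar, the following is a hypothetical stand-in, not the original script: it starts a detached tmux session with one window per daemon and then attaches to it. The session name and the DRY_RUN switch are assumptions of this sketch. By default it only prints the tmux commands it would run.

```shell
#!/bin/sh
# Hypothetical stand-in for a hadoop-tmux style launcher; NOT the original script.
SESSION="${SESSION:-hadoop}"   # session name is an assumption

# Echo the command in dry-run mode (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

launch_hadoop_tmux() {
    # The first daemon creates the detached session; the rest get their own windows.
    run tmux new-session -d -s "$SESSION" -n namenode "hadoop namenode"
    for daemon in datanode jobtracker tasktracker; do
        # The trailing colon targets the session and lets tmux pick the window index
        run tmux new-window -t "$SESSION:" -n "$daemon" "hadoop $daemon"
    done
    run tmux attach-session -t "$SESSION"
}

launch_hadoop_tmux
```

Run it with DRY_RUN=0 to actually start the session. Each window closes when its daemon exits, so Control-C in every window tears the whole cluster down.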

Step 6: Estimate Pi

Once your cluster is up and running, it’s time to test it out. Hadoop ships with a jar of example map-reduce jobs, and newer versions include a job that estimates pi using the Monte Carlo method. I like pi estimation for testing a new cluster because it doesn’t rely on any input data.

$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 100
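The two arguments are the number of map tasks and the number of samples per map, so this run uses only 10 x 100 = 1000 points. To see how rough that is without a cluster, here is a local sketch of the same idea in awk. It uses plain pseudo-random sampling, whereas the bundled job, as far as I know, uses a quasi-Monte Carlo sequence, so treat it as an illustration of the technique rather than a copy of Hadoop’s implementation. The function name and default seed are made up.

```shell
# Plain Monte Carlo estimate of pi: sample points in the unit square and
# count those that land inside the quarter circle of radius 1.
# estimate_pi is a hypothetical helper; $1 = sample count, $2 = optional seed.
estimate_pi() {
    awk -v n="$1" -v seed="${2:-42}" 'BEGIN {
        srand(seed)
        inside = 0
        for (i = 0; i < n; i++) {
            x = rand(); y = rand()
            if (x * x + y * y <= 1) inside++
        }
        # The quarter circle covers pi/4 of the square
        printf "%.4f\n", 4 * inside / n
    }'
}

# 10 maps x 100 samples each = 1000 points in total
estimate_pi $((10 * 100))
```

The exact figure depends on your awk’s random number generator, but with 1000 points you should expect only the first digit or two to be right.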

You should now have a functional Hadoop cluster and a woefully inaccurate estimate of pi in your hands.


  1. Any experts following along will note that I’m not mentioning the secondary namenode. When running locally for development, I don’t care about data integrity on HDFS or its long-term performance. Therefore, I see no compelling reason to run the secondary namenode.

  2. I’m completely glossing over what tmux is, how it works, etc. If you’re not familiar with tmux (or GNU Screen, which it aims to replace), I would recommend just starting each of the daemons manually. But seriously, learn about tmux. It’s awesome. 

VT Code Camp Slides

These are my slides from Vermont Code Camp. I had a great time; I’ve only lived in Burlington for a year, and this was my first real exposure to the larger geek community. Seems like it’s thriving! Thanks to anyone who came out and I hope to see you there next year.

Also, my map-reduce cryptanalysis demo is on GitHub. It requires Maven to build and a Hadoop cluster to run. Don’t worry - it doesn’t have to be a big Hadoop cluster. Most of the time, I just run a single node on my Mac.