Monday, 9 February 2015

An update on Hadoop Versions

Our popular Hadoop for Java Developers course was recorded using version 2.4.0 of Hadoop. Since the course was released there have been some further releases of Hadoop, with the current version being 2.6.0.

There are no differences in the content that we cover on the course between the two versions of Hadoop, so the course is completely valid if you wish to use 2.6.0 or 2.4.0. In this blog post, however, I want to point out a reason to stick with version 2.4.0, and a couple of pointers that you should be aware of if you are going to use 2.6.0. I'll also mention the process to upgrade from 2.4.0 to 2.6.0.

Which Version of Hadoop should I use?

If you're starting to develop with Hadoop today then you might just want to download the latest version from the Hadoop website (2.6.0) and there is only really one reason that I can think of not to do this... and that is that Amazon's Elastic Map Reduce (EMR) service, which can be used to run Hadoop jobs "in the cloud" is not yet compliant with versions of Hadoop newer than 2.4.0.

Although the code that you'll write on the course is identical in both versions of Hadoop, if you compile your code with the 2.6.0 jar files you'll not be able to run it on EMR. For this reason we suggest you consider sticking with 2.4.0, at least while learning Hadoop, so that you can experience EMR (we cover how to set up and run an EMR job on the course). If you plan to use Hadoop on EMR in a production scenario then you must stick to 2.4.0 until Amazon update the EMR service to work with newer versions.

You can download a copy of version 2.4.0 from this link.


If I am going to use 2.6.0, what do I need to know?

The only things to be aware of if you wish to study the course with version 2.6.0 of Hadoop are:

(1)  Your standard installation path will be opt/hadoop-2.6.0/ instead of /opt/hadoop-2.4.0/ so you'll want to change the references to that in the following two script files that are provided with the course:
startHadoopPseudo
startHadoopStandalone

(2) When you install hadoop, you'll edit either .bashrc or .profile - make sure you also put the reference to the correct folder name in here also. Also, you'll be creating symbolic links to the Hadoop configurations - again make sure you use the correct folder names when you set these up.


What happens if I want to upgrade from 2.4.0 to 2.6.0?

If you have been running with 2.4.0 and wish to upgrade to 2.6.0, you just need to do the following:

(1) Download and unpack the 2.6.0 files from the Hadoop website - place these in /opt/hadoop-2.6.0/
(2) Create the configuration folders under /opt/hadoop-2.6.0/etc as you did for Hadoop 2.4.0 (you can actually copy the configuration folders from your 2.4.0 installation as they'll be valid for 2.6.0)
(3) edit your .bashrc (linux) or .bash_profile (Mac) to change the location of the Hadoop files in the HADOOP_PREFIX and PATH variables from 2.4.0 to 2.6.0
(4) Close your terminal window and open a new one to ensure that the updated environment variables and path varaible are loaded.
(5) run the script resetHDFS - you must be in the Scripts directory to run this script - this will reformat the HDFS file system and will create the symbolic links needed to use the Pseudo configuration. After running this script, enter the JPS command and check that you have the various daemons running (namenode, datanode etc)
(6) Your code, compiled with 2.4.0 will work in 2.6.0 - if you wish to recompile with 2.6.0, remove all the Hadoop jar files from the build path, and then re-add them from the folders under /opt/hadoop/2.6.0/share/hadoop