Monday 13 October 2014

Hadoop Course - Understanding the Scripts

Over the last week or so we've had a few support calls asking questions about the scripts provided in chapter 5 of the course, which are used to switch Hadoop between standalone and pseudo-distributed modes.

This post will explain in a bit more detail what each script is and how it works. These are custom scripts that I've developed while working with Hadoop, so you probably won't find them elsewhere on the internet, but I think they make the process of managing Hadoop configurations on a development machine really easy.

Not everything in this post will make sense until you've got through chapter 8 of the course, but feel free to contact me if you have any questions after reading it - or raise a support call through https://www.virtualpairprogrammers.com/technical-support.html


There are four scripts provided with the course:

(1) resetHDFS

This script is designed to clear down your HDFS workspace - that is, to empty out all the files and folders in the Hadoop file system. It's like formatting a drive. What the script actually does is (there's a sketch of it after the list):

  • stop any running Hadoop processes
  • delete the HDFS folder structure from your computer
  • recreate the top level HDFS folder, and set its permissions so that the logged on user can write to it
  • run the hdfs format command - this will create the sub-folder structure needed
  • restart the hadoop processes
  • create the default folder structure within HDFS that's required for your pseudo-distributed jobs (/user/yourusername)
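
To give you a feel for it, here's a minimal sketch of the sort of thing the script does. The /var/hadoop data directory is an assumed location, and I'm assuming Hadoop's sbin helper scripts are on your PATH - treat this as an illustration, not the script itself:

#!/bin/bash
# A sketch of resetHDFS - paths and helper script names are assumptions
stop-dfs.sh                          # stop any running Hadoop processes
sudo rm -rf /var/hadoop              # delete the HDFS folder structure (assumed location)
sudo mkdir /var/hadoop               # recreate the top-level HDFS folder...
sudo chown $USER /var/hadoop         # ...and let the logged-on user write to it
hdfs namenode -format                # recreate the sub-folder structure HDFS needs
start-dfs.sh                         # restart the Hadoop processes
hdfs dfs -mkdir -p /user/$USER       # default folder for pseudo-distributed jobs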

NOTES:

(1) You must be in the folder where the script is located in order to run it. You should run it by entering the following command:

./resetHDFS

(2) The script contains a number of lines that must be run with admin privileges - these are the lines containing the word sudo. As a result, running this script will require you to enter your admin password one or more times. Although this might seem frustrating, you won't be running this script regularly - only when you wish to delete all your data, and then it's a quick and easy way to do it.

(3) Because this script creates the file and folder structures that HDFS requires, we also use it to create them for the first time. When the course was first released there was a typing error - on line 2, sudo was misspelt sduo. This has been corrected, but if you have downloaded a copy with the typo, you might wish to correct it!

(2) startHadoopPseudo

This script will switch Hadoop into pseudo-distributed mode - if you're currently in standalone mode then this is the only script you need to run.

What the script actually does is:

  • remove the existing symbolic link to the configuration directory
  • create a new symbolic link to the configuration directory containing the pseudo-distributed configuration files
  • start the Hadoop processes
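
In shell terms that's roughly the following - the hadoop-pseudo directory name (and the location of the symbolic link) is an assumption for illustration, so check the actual script for the names used on your machine:

#!/bin/bash
# A sketch of startHadoopPseudo - directory names are assumptions
rm $HADOOP_PREFIX/etc/hadoop                                      # remove the existing symbolic link
ln -s $HADOOP_PREFIX/etc/hadoop-pseudo $HADOOP_PREFIX/etc/hadoop  # link to the pseudo-distributed configuration
start-dfs.sh                                                      # start the Hadoop processes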

(3) stopHadoop

This script simply stops the Hadoop processes - it should be run if you're in pseudo-distributed mode and are going to switch back to standalone mode. It doesn't change any configuration settings, it just stops the running processes.
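
As a sketch, it probably amounts to little more than this (assuming the standard Hadoop helper scripts are on your PATH):

#!/bin/bash
# A sketch of stopHadoop - assumes Hadoop's sbin scripts are on the PATH
stop-dfs.sh        # stop the HDFS daemons
# stop-yarn.sh     # uncomment if you're also running YARN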

(4) startHadoopStandalone

This script removes the existing symbolic link to the configuration directory, and creates a new symbolic link to the configuration directory containing the standalone configuration files. Although I've called this script "startHadoopStandalone", it doesn't actually start anything, as no processes run in standalone mode.
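
A rough sketch (again, the hadoop-standalone directory name is an assumption for illustration):

#!/bin/bash
# A sketch of startHadoopStandalone - directory names are assumptions
rm $HADOOP_PREFIX/etc/hadoop                                          # remove the existing symbolic link
ln -s $HADOOP_PREFIX/etc/hadoop-standalone $HADOOP_PREFIX/etc/hadoop  # link to the standalone configuration
# nothing to start - standalone mode runs no daemons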

So... which scripts do you need to run, and when?

If you're in standalone mode and you want to be in pseudo-distributed mode, just run startHadoopPseudo

If you're in pseudo-distributed mode and you want to be in standalone mode, first run stopHadoop and then run startHadoopStandalone

If you have just switched on your machine and want to run in either mode, just run the relevant start script. In this instance you don't need to run the stop script, because you have no running processes if you have just booted up.
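
So, for example, a full switch from pseudo-distributed mode back to standalone mode looks like this:

./stopHadoop
./startHadoopStandalone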


Friday 3 October 2014

Hadoop Course - Setting Environment Variables - correction to chapter 5

We have just been made aware of an error in the Hadoop course, chapter 5, at approximately 19 minutes into the video. This error has now been fixed and the video was replaced on 22nd October - so it will only affect you if you downloaded chapter 5 before 22nd October 2014. All customers can download the replacement video if preferred.

The issue relates to the part of the video that deals with setting your environment variables, and instructs you to edit either your .bashrc file (Linux) or .bash_profile file (Mac).

There is a mistake in the last two lines that I ask you to add to these files - the lines reference the HADOOP_INSTALL variable, but they should in fact reference HADOOP_PREFIX, as we haven't set HADOOP_INSTALL.

The last two lines to be added should therefore be:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_PREFIX/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
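
Once you've edited the file, reload it and check that the new value has been picked up - for example:

source ~/.bashrc      # or: source ~/.bash_profile on a Mac
echo $HADOOP_OPTS     # should now show the -Djava.library.path setting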

Please accept my apologies for this error. When I display my own .bashrc file (at about 20 minutes into the video) you'll see the correct information shown.