Monday, 13 October 2014

Hadoop Course - Understanding the Scripts

Over the last week or so we've had a few support calls asking questions about the scripts provided in chapter 5 of the course, which are used to switch Hadoop between standalone and pseudo-distributed modes.

This post will explain in a bit more detail what each script is and how it works. These are custom scripts that I've developed while working with Hadoop, so you probably won't find them elsewhere on the internet, but I think they make the process of managing Hadoop configurations on a development machine really easy.

Until you've worked through chapter 8 of the course, not everything in this post will make sense, but feel free to contact me if you have any questions after reading this - or raise a support call through https://www.virtualpairprogrammers.com/technical-support.html


There are 4 scripts provided with the course:

(1) resetHDFS

This script is designed to clear down your HDFS workspace - that is, to empty out all the files and folders in the Hadoop file system. It's like formatting a drive. What the script actually does is the following (there's a sketch of the script after the list):

  • stop any running Hadoop processes
  • delete the HDFS folder structure from your computer
  • recreate the top level HDFS folder, and set its permissions so that the logged on user can write to it
  • run the hdfs format command - this will create the sub-folder structure needed
  • restart the hadoop processes
  • create the default folder structure within HDFS that's required for your pseudo-distributed jobs (/user/yourusername)
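To give you an idea of what's inside, here's a sketch of the script. The HDFS location and the exact commands are illustrative (I'm assuming a Hadoop 2.x install on Linux), so check your downloaded copy for the real details:

#!/bin/bash
stop-dfs.sh                        # stop any running Hadoop processes
sudo rm -rf /var/hadoop            # delete the HDFS folder structure (example path)
sudo mkdir /var/hadoop             # recreate the top-level HDFS folder...
sudo chown $USER /var/hadoop       # ...and make it writable by the logged-on user
hdfs namenode -format              # recreate the sub-folder structure HDFS needs
start-dfs.sh                       # restart the Hadoop processes
hdfs dfs -mkdir -p /user/$USER     # default folder for pseudo-distributed jobs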

NOTES:

(1) You must be in the folder where the script is located to run this script. You should run it by entering the following command:

./resetHDFS

(2) The script contains a number of lines that must be run with admin privileges - these contain the word sudo. As a result, running this script will require you to enter your admin password one or more times. Although this might seem frustrating, you won't be running this script regularly - only when you wish to delete all your data - and then it's a quick and easy way to do it.

(3) Because this script creates the file and folder structures that HDFS requires, we also use it to create them for the first time. When the course was first released there was a typing error - on line 2, sudo was misspelt sduo. This has been corrected, but if you have downloaded a copy with the typo, you might wish to correct it!

(2) startHadoopPseudo

This script will switch Hadoop into pseudo-distributed mode - if you're currently in standalone mode, this is the only script you need to run.

What the script actually does is the following (again, a sketch follows the list):

  • remove the existing symbolic link to the configuration directory
  • create a new symbolic link to the configuration directory containing the pseudo-distributed configuration files
  • start the Hadoop processes
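In outline, the script looks something like this - I'm assuming here that the two sets of configuration files live in directories called conf.pseudo and conf.standalone under $HADOOP_PREFIX/etc; the names in your copy may differ:

rm $HADOOP_PREFIX/etc/hadoop                                     # remove the existing symlink
ln -s $HADOOP_PREFIX/etc/conf.pseudo $HADOOP_PREFIX/etc/hadoop   # link to the pseudo-distributed config
start-dfs.sh                                                     # start the Hadoop processes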

(3) stopHadoop

This script simply stops the Hadoop processes - it should be run if you're in pseudo-distributed mode and are about to switch back to standalone mode. It doesn't change any configuration settings; it just stops the running processes.
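In essence it just calls Hadoop's own stop script - something like the line below (if your setup also runs YARN, there would be a stop-yarn.sh call as well):

stop-dfs.sh    # stops the namenode, datanode and secondary namenode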

(4) startHadoopStandalone

This script removes the existing symbolic link to the configuration directory, and creates a new symbolic link to the configuration directory containing the standalone files. Although I've called this script "startHadoopStandalone" it doesn't actually start anything, as no processes run in standalone mode.
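So a sketch of it, using the same illustrative directory names as before, is just:

rm $HADOOP_PREFIX/etc/hadoop                                         # remove the existing symlink
ln -s $HADOOP_PREFIX/etc/conf.standalone $HADOOP_PREFIX/etc/hadoop   # link to the standalone config
# nothing to start - standalone mode runs no background processes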

So... which scripts do you need to run, and when?

If you're in standalone mode and you want to be in pseudo-distributed mode, just run startHadoopPseudo

If you're in pseudo-distributed mode and you want to be in standalone mode, first run stopHadoop and then run startHadoopStandalone

If you have just switched on your machine and want to run in either mode, just run the relevant start script. In this instance you don't need to run the stop script, because you have no running processes if you have just booted up.
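For example, to go from pseudo-distributed back to standalone, you'd run the following from the folder containing the scripts:

./stopHadoop
./startHadoopStandalone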


Friday, 3 October 2014

Hadoop Course - Setting Environment Variables - correction to chapter 5

We have just been made aware of an error in the Hadoop course, chapter 5, at approximately 19 minutes into the video. This error has been fixed and the video was replaced on 22nd October - so this will only affect you if you downloaded chapter 5 before 22nd October 2014. All customers can download the replacement video if preferred.

The issue relates to the part of the video that deals with setting your environment variables, which instructs you to edit either your .bashrc file (Linux) or your .bash_profile file (Mac).

There is a mistake in the last 2 lines that I ask you to add to these files - the lines reference the HADOOP_INSTALL variable, when they should in fact reference HADOOP_PREFIX, as we haven't set HADOOP_INSTALL.

The last 2 lines to be added should therefore be:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_PREFIX/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
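For context, here's a sketch of how the whole block might end up looking once corrected - the first two lines are just an illustration of the kind of thing that should already be in your file, and the install path and version are examples rather than real locations:

export HADOOP_PREFIX=/home/yourusername/hadoop-2.4.1   # example path only
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_PREFIX/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"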

Please accept my apologies for this error. When I display my own .bashrc file (at about 20 minutes into the video) you'll see the correct information shown.


Wednesday, 10 September 2014

Groovy Course Correction - Chapter 21 (Files and Templates)

I've been made aware today (thanks to a customer who asked us to help solve a problem with his code relating to chapter 21 of the Groovy course) of a mistake in the video and in the file supplied for this chapter.

In the exercise I set to give you practice with templates, I show on screen the file called DailyCheckInTemplate.txt from the practicals and code folder. This is at approximately 15:22 in the video.

The video tells you to copy the file from the templates folder in chapter 17, and shows you the file on screen. Unfortunately the file provided and shown is not right - it includes fields like $it.date - these should be $date.

The problem with using $it.date is that Groovy will look for a key of it.date in the map of properties we supply to the template engine, but the keys won't be preceded with the "it" - that is, we'll be creating a map with a key of "date", not "it.date".
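To illustrate with a made-up line (the real template's wording differs), the corrected file should contain placeholders like the first line below, not the second:

Check-in date: $date      <- correct: matches the "date" key in the map
Check-in date: $it.date   <- wrong: there is no "it.date" key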

If you use the version of the template supplied in the starting workspace, you'll get an error message similar to this:

Caught: groovy.lang.MissingPropertyException: No such property: it for class: SimpleTemplateScript1
groovy.lang.MissingPropertyException: No such property: it for class: SimpleTemplateScript1
at SimpleTemplateScript1.run(SimpleTemplateScript1.groovy:1)

The version of this file in the final workspace for chapter 21 is correct, so please pick up the DailyCheckInTemplate.txt file from the following location instead, and you'll not have any problems with your code:

PracticalsAndCode / End of Chapter Workspaces / Chapter 21 - Files and Templates / Hotel Manager / templates / DailyCheckInTemplate.txt

Thursday, 14 August 2014

Hadoop - new course coming soon!

I'm pleased to announce that recording of my next Virtual Pair Programmers course is now almost complete. The course covers Hadoop for Java developers. If you haven't heard of Hadoop (is there anyone who hasn't?), it's a framework for distributing the processing of large amounts of data across a network of computers.

The course assumes some basic Java knowledge, but no prior knowledge of Hadoop or its map-reduce programming model.

Once recording is complete, there will be an edit phase and of course some post-production work to complete, but the likely running order is as follows:


  • An overview of what Hadoop is, and introducing the concept of the map-reduce programming model
  • Getting to grips with map-reduce, including creating some map-reduce code in standard Java
  • Hadoop operating modes, and how to set up and install Hadoop
  • Creating our first Hadoop Map-Reduce job
  • The Hadoop Distributed File System (HDFS)
  • Understanding the map-reduce process flow, including combine and shuffle
  • Looking at map-reduce job configuration options, such as file formats and runtime options
  • Creating custom data types 
  • Chaining multiple jobs, and adding extra Map steps to jobs 
  • Optimising jobs
  • Working with JDBC databases
  • Unit testing (with MRUnit)
  • Secondary Sorting (sorting the values as well as the keys)
  • Joining Data from multiple files
  • Using the Amazon EMR service

The course has a number of real-world examples throughout, and two large case studies to work through, so there are lots of practical exercises. As well as model answers and sample code throughout, I'm also including some templates that I use for my own map-reduce jobs, which you'll be able to re-use in your own projects.

If you are a Microsoft Windows user, then you need to know that installing Hadoop on Windows is hard, so in the course I ask you to use a virtual machine running Linux... and I'll talk you through how to install and configure that - no prior knowledge of Linux is required. Mac and Linux users can either install Hadoop directly or use a virtual machine too - all the options are covered.

The course should be going live some time in September, so keep an eye on this blog or the Virtual Pair Programmers' Facebook page for more information.

Tuesday, 24 June 2014

Why you can't use Derby with Hadoop

I'm currently in the middle of writing my next course for Virtual Pair Programmers, which will be on using Hadoop. Typically in Virtual Pair Programmers courses we use Apache's Derby database. We choose it because it's lightweight, and so easy to distribute - it needs pretty much no installation or configuration. We can provide students with a full copy of the application and sample databases, and they avoid having to spend time setting up and configuring a database server, such as MySQL, and then importing a database.

One of the topics we'll be covering on the Hadoop course is the use of DBInputFormat and DBOutputFormat to read from and write to a relational database (we'll be learning about Sqoop too, but the same issue will affect Sqoop... it's just that I've not got to that part of the script yet!).

In preparing some data and test code to use on the course, I've today discovered that Hadoop just won't work with Derby. I find this somewhat surprising, given that both projects come from the Apache camp, but having spent several hours digging to find out why, I've finally found the issue. There's really not much available online about this, so I thought I'd write a blog post about it in the hope that it helps someone avoid the pain I've been through today!

On trying to get database reads working, I kept coming up against a horrible-looking error message. I won't bore you with the full stack trace; the important part of it is:

java.sql.SQLSyntaxErrorException: Syntax error: Encountered "LIMIT" at line 1, column 75.

The issue here is that Hadoop generates SQL statements in the background to read from the database. Rather than reading the whole table in one go, each map method call reads the next record. The SQL that Hadoop generates (which we can't see) includes the LIMIT keyword... and as per the Derby FAQ, this keyword is not supported.
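I can't show the exact statement Hadoop builds, but it's along these lines - the table and column names below are invented for illustration:

SELECT guestName, checkInDate FROM checkins ORDER BY guestName LIMIT 100 OFFSET 0

Derby fails as soon as it reaches the LIMIT keyword, which is exactly what the error above is complaining about.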

So it seems that there's just no easy way to read from or write to a Derby database from Hadoop. On the course, then, we'll be using MySQL to learn how to work with relational databases directly from Hadoop, but for anyone using Derby and wanting to work with Hadoop, I think the only option is going to be to create a dump of the data in text format for Hadoop to import.

If you have found a way to get Derby working with Hadoop please do let me know!

Thursday, 1 May 2014

Groovy Programming is now available!

I'm excited to announce that Groovy Programming, the latest training course from Virtual Pair Programmers (my second course for them) is now available to purchase!

As with all Virtual Pair Programmers' courses, Groovy Programming is written from scratch for delivery by video, but is based on many years of experience in working with and teaching the language. We believe you'll learn far more quickly than from reading books - in fact you'll cover everything you need to be a competent Groovy programmer, but at a fraction of the cost of a face-to-face course, and naturally with the convenience that our unique training methods give: the ability to download and keep all the video files, so you can study at a time and place that suits you!

Groovy Programming contains 12 hours of video, but with lots of practical exercises it will take most students around a week to complete. Also included with the download are the complete code for all worked exercises and guidance notes for the tasks, as well as all the software you need (except Groovy and Eclipse - but we cover on the videos how to install and configure these!)

There's a full breakdown of the content of Groovy Programming on our website, but if you have any queries about this course, you're welcome to contact me through https://www.virtualpairprogrammers.com/contact.html. In the meantime, thank you for your continued support, and we hope you continue to enjoy our courses.


Monday, 3 March 2014

Running db-derby with recent releases of Java

This blog post is an errata item for my Java Fundamentals course, and will also apply for the other Virtual Pair Programmers courses where we use the db-derby database.

I discovered while recording the upcoming Groovy course that there has been a security change in the most recent release of Java (1.7.0_51) which means that the default configuration for db-derby no longer works. Running db-derby with the startNetworkServer command will result in an error message which includes, somewhere near the start:

access denied ("java.net.SocketPermission" "localhost:1527" "listen,resolve")

The easiest and quickest way to overcome this seems to be to run the database on a higher port number, such as 50000. To do this, instead of running the startNetworkServer command, run the following to start the db-derby database:

NetworkServerControl -p 50000 start
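Note that Derby's other NetworkServerControl commands will also need the same port option from now on - to shut this server down, for example, you'd run:

NetworkServerControl -p 50000 shutdown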

In your code, you'll need to change the connection strings to incorporate the new port number, so that the line of code which creates the connection looks like this:

conn = DriverManager.getConnection("jdbc:derby://localhost:50000/library");

This should overcome the error - if you find you have any further unexpected errors that you can't resolve, however, do get in contact via the Virtual Pair Programmers contact us page!