Thursday 14 August 2014

Hadoop - new course coming soon!

I'm pleased to announce that recording of my next Virtual Pair Programmers course is now almost complete. The course covers Hadoop for Java Developers. If you haven't heard of Hadoop (is there anyone who hasn't?), it's a framework for distributing the processing of large amounts of data across a network of computers.

The course assumes some basic Java knowledge, but no prior knowledge of Hadoop or its MapReduce programming model.

Once recording is complete, there will be an edit phase and of course some post-production work, but the likely running order is as follows:


  • An overview of what Hadoop is, and an introduction to the map-reduce programming model
  • Getting to grips with map-reduce, including creating some map-reduce code in standard Java
  • Hadoop operating modes, and how to set up and install Hadoop
  • Creating our first Hadoop Map-Reduce job (there's a short sketch of what one looks like just after this list)
  • The Hadoop Distributed File System (HDFS)
  • Understanding the map-reduce process flow, including combine and shuffle
  • Looking at map-reduce job configuration options, such as file formats and runtime options
  • Creating custom data types 
  • Chaining multiple jobs, and adding extra Map steps to jobs 
  • Optimising jobs
  • Working with JDBC databases
  • Unit testing (with MRUnit)
  • Secondary Sorting (sorting the values as well as the keys)
  • Joining Data from multiple files
  • Using the Amazon EMR service
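For a flavour of what a first Hadoop Map-Reduce job looks like, here's a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API. The class names (WordCountSketch, TokenMapper, SumReducer) are placeholders I've chosen for this post rather than code from the course, so treat it as a rough illustration, not a model answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count job - class names are placeholders, not course code.
public class WordCountSketch {

    // Mapper: emits (word, 1) for every word in each input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: after the shuffle, sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires the mapper and reducer into a job and submits it.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You'd package this into a jar and run it with the hadoop jar command, passing the input and output paths as arguments. We'll cover all of this properly, step by step, in the course.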
The course has a number of real-world examples throughout, plus two large case studies to work through, so there are plenty of practical exercises. As well as model answers and sample code, I'm also including some templates that I use for my own map-reduce jobs, which you'll be able to re-use in your own projects.

If you're a Microsoft Windows user, you need to know that installing Hadoop on Windows is hard, so in the course I ask you to use a virtual machine running Linux... I'll talk you through how to install and configure that, and no prior knowledge of Linux is required. Mac and Linux users can either install Hadoop directly or use a virtual machine as well - all the options are covered.

The course should be going live sometime in September, so keep an eye on this blog or the Virtual Pair Programmers' Facebook page for more information.