Tuesday 24 June 2014

Why you can't use Derby with Hadoop

I'm currently in the middle of writing my next course for Virtual Pair Programmers, which will be on using Hadoop. Typically in Virtual Pair Programmers courses we use Apache's Derby database. We choose it because it's light-weight and therefore easy to distribute - it needs pretty much no installation or configuration. We can provide students with a full copy of the application and sample databases, and they avoid having to spend time setting up and configuring a database server such as MySQL, and then importing a database.

One of the topics we'll be covering on the Hadoop course is the use of DBInputFormat and DBOutputFormat to read from and write to a relational database (we'll be learning about Sqoop too, and the same issue affects it... it's just that I've not got to that part of the script yet!).
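To give a flavour of what that looks like, here is a minimal sketch of the database side of a driver class - not the finished course code. The "employees" table, the EmployeeRecord class and the connection details are all made up for illustration; the Derby URL shown is exactly the kind of thing that turns out not to work, whereas swapping in a MySQL driver and URL does.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DatabaseReadDriver {

    // Minimal record class mapping one row of a hypothetical "employees" table
    public static class EmployeeRecord implements Writable, DBWritable {
        int id;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getInt("id");
            name = rs.getString("name");
        }

        public void write(PreparedStatement ps) throws SQLException {
            ps.setInt(1, id);
            ps.setString(2, name);
        }

        public void readFields(DataInput in) throws IOException {
            id = in.readInt();
            name = in.readUTF();
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeUTF(name);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // JDBC connection details - this Derby URL is the bit that won't work;
        // a MySQL driver class and URL here is fine
        DBConfiguration.configureDB(conf,
                "org.apache.derby.jdbc.ClientDriver",
                "jdbc:derby://localhost:1527/sampledb",
                "user", "password");

        Job job = Job.getInstance(conf, "db-read");
        job.setJarByClass(DatabaseReadDriver.class);

        // Hadoop generates the SELECT statement for us from these details
        DBInputFormat.setInput(job, EmployeeRecord.class,
                "employees",        // table name
                null,               // WHERE conditions
                "id",               // ORDER BY column
                "id", "name");      // columns to read

        // mapper, reducer and output settings omitted - this sketch only
        // shows how the database side of the job is configured
    }
}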

In preparing some data and test code to use on the course, I've today discovered that Hadoop just won't work with Derby. I find this somewhat surprising, given that both projects come from the Apache camp, but having spent several hours digging to find out why, I've finally found the issue. There's really not much available online about this, so I thought I'd write a blog post in the hope that it saves someone else the pain I've been through today!

On trying to get database reads working, I've been getting a horrible looking error message. I won't bore you with the full stack trace; the important part of it is:

java.sql.SQLSyntaxErrorException: Syntax error: Encountered "LIMIT" at line 1, column 75.

The issue here is that Hadoop generates SQL statements in the background to read from the database. Rather than reading the whole table in one go, each map method call reads the next record from the result set. The SQL that Hadoop generates (which we never get to see) includes the LIMIT keyword... and as per the Derby FAQ, this keyword is not supported.
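You can see the problem for yourself with a bit of plain JDBC against a Derby database (the table and column names below are just made up for illustration). The LIMIT style of paging query that Hadoop builds is rejected, while Derby's own paging syntax, OFFSET ... FETCH, works fine - but Hadoop doesn't know to generate that.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DerbyPagingCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical embedded database and table - adjust to your own schema
        try (Connection conn = DriverManager.getConnection("jdbc:derby:sampledb");
             Statement stmt = conn.createStatement()) {

            // This is the style of paging query Hadoop builds behind the scenes;
            // Derby throws a SQLSyntaxErrorException at the LIMIT keyword:
            // stmt.executeQuery("SELECT id, name FROM employees ORDER BY id LIMIT 10 OFFSET 0");

            // Derby's own paging syntax works, but Hadoop doesn't generate it:
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, name FROM employees ORDER BY id "
                    + "OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + ": " + rs.getString("name"));
                }
            }
        }
    }
}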

So it seems that there's just no easy way to read from or write to a Derby database from Hadoop. On the course we'll therefore be using MySQL to learn how to work with relational databases directly from Hadoop, but for anyone using Derby and wanting to work with Hadoop, I think the only option is to create a dump of the data in text format for Hadoop to import.
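For the dump itself, Derby does at least ship with a built-in procedure, SYSCS_UTIL.SYSCS_EXPORT_TABLE, which writes a table out to a delimited text file. Something along these lines should do it - the schema, table and file names here are just examples:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class DerbyTextDump {
    public static void main(String[] args) throws Exception {
        // Hypothetical embedded database - adjust the URL to your own
        try (Connection conn = DriverManager.getConnection("jdbc:derby:sampledb");
             CallableStatement cs = conn.prepareCall(
                     "CALL SYSCS_UTIL.SYSCS_EXPORT_TABLE(?, ?, ?, ?, ?, ?)")) {

            cs.setString(1, "APP");            // schema name
            cs.setString(2, "EMPLOYEES");      // table name
            cs.setString(3, "employees.csv");  // output file
            cs.setString(4, ",");              // column delimiter
            cs.setString(5, null);             // character delimiter (use default)
            cs.setString(6, null);             // codeset (use default)
            cs.execute();
        }
    }
}

The resulting file can then be copied into HDFS with hadoop fs -put and processed as an ordinary text input.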

If you have found a way to get Derby working with Hadoop please do let me know!