PDI & MongoDB (ETL)

Having not yet fully got to grips with network configuration, and while waiting for delivery of the final SD card, I'm turning my attention to ETL – specifically, learning Pentaho Data Integration (Kettle).


I'm reading Pulvirenti and Roldan's Pentaho Data Integration Cookbook, which is pretty recent (mid-2011). The notes below are taken from there.

All of the ETL jobs/packages featured in Pentaho Data Integration Cookbook can be downloaded here and opened up in Kettle.

There are also some further tutorials here, and some useful Pentaho setup instructions here.

Connecting to a database

It looks like there are three ways to connect to a database from Pentaho:

  1. Create a connection to a supported DBMS
  2. Create a driver-based connection
  3. Use the Big Data utilities within Spoon.

In order to create a DB connection, you'll need:

  1. Host name – or IP address of the DB server (*I still haven't fixed the IPs on the 5 Raspberry Pis*)
  2. Port number (*as above – some settings on the R Pis still to be made*)
  3. User Name
  4. Password

Items 3 and 4 are generally empty for MongoDB, which doesn't use authentication out of the box.
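Those four settings map directly onto MongoDB's standard connection URI. A minimal sketch – the `mongo_uri` helper below is my own invention, not part of PDI or the MongoDB driver:

```python
# Hypothetical helper: assemble the four connection settings into a
# standard MongoDB connection URI. User and password are optional,
# as they are for an out-of-the-box MongoDB instance.
def mongo_uri(host, port=27017, user=None, password=None, db=""):
    auth = f"{user}:{password}@" if user else ""
    return f"mongodb://{auth}{host}:{port}/{db}"

print(mongo_uri("localhost"))
# mongodb://localhost:27017/
print(mongo_uri("192.168.0.10", 27017, "ed", "s3cret", "test"))
# mongodb://ed:s3cret@192.168.0.10:27017/test
```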


  1. Open Spoon and create a new transformation
  2. In the View palette, right-click 'Connections' and select 'New'. The DB connection dialog appears
  3. Select the DB engine

However, there is no DBMS connection (yet) for MongoDB in Spoon. Instead, switch to the Design view and choose Big Data.

Again, I won't fully understand which IP and ports to enter until I have finished the system configuration on each R Pi. The pic below shows me connecting to a MongoDB instance on my laptop, using the default (unchanged) port. I would need to specify the collection if several existed.


Alternatively, if you want to create a connection to a DBMS not in the list, you first have to get the JDBC driver for MongoDB.

Copy the .jar file containing the driver to the libext/JDBC directory inside the Kettle installation directory, then create the connection.

In this case, choose Generic Database. In the settings frame, specify the connection string(?), the driver class name, the username and password. To find these settings you will have to refer to the driver documentation; I'm trying to locate this information from the 10gen community/driver developers.
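For illustration only – I don't yet have the MongoDB values, so here is what those Generic Database settings look like for a driver I do know (MySQL's Connector/J), just to show the shape of what the dialog expects:

```
Connection URL:    jdbc:mysql://localhost:3306/sampledata
Driver class name: com.mysql.jdbc.Driver
```

The MongoDB equivalents should follow the same pattern once the driver documentation turns up.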

Re-use the connection

Avoid creating the DB connection again and again by sharing it: right-click the connection under the DB connections tree and click Share. The connection will then be available in all transformations and jobs. Shared connections are shown in bold in the tree.

Check the connection is open at runtime

Insert a preceding 'Check DB connection' job entry. The entry will return true or false.

Creating parameterised inputs

If you need to create an input dataset with data coming from an existing database, you do it using a Table input step. If the SELECT statement is simple and doesn't change, it can be entered directly into the Table input settings window. However, most of the time the dataset being selected is dynamic and the query needs to be flexible.

  1. Create a transformation
  2. Before the step that gets the data, you have to create a stream that will provide the parameters for the statement
  3. Create a stream that builds a precursor dataset with a single row and two columns; in the case of the sampledata database being used, these are the <productline> parameter and the <scale> parameter – but they could be any relevant fields from a MongoDB collection. Select a single value for each, e.g. 'Classic Cars' for <productline_par> and '1:10' for <productscale_par>
  4. Add a Data grid step or a Generate rows step.
  5. Now drag a Table input step to the canvas and create a hop from the last step of the stream created above towards this step.
  6. Now you can configure the Table input step. Double-click it, select the appropriate DB connection and specify the filter conditions in the WHERE clause, i.e. WHERE productline = ? AND productscale = ?
  7. In the 'Insert data from step' list, select the name of the step that is linked to the Table input step. Close the window.
  8. Finally, select the Table input step and do a preview of the transformation. You should see a limited set of results based on the parameters.

How it works…

When you need to execute a SELECT statement with parameters, the first thing you have to do is build a stream that provides the parameter values needed by the statement. The stream can be made of just one step, e.g. a Data grid with the values to match. The important thing is that the last step delivers the proper values to the Table input step.

The last step in the stream is then linked to the Table input step. The query is prepared, and the values coming into the Table input step are bound to the placeholders – that is, where you used the ? symbol(s).
– The number of fields coming into a Table input step must be exactly the same as the number of question marks found in the query
– '?' can only be used to parameterise value expressions. Keywords or identifiers, e.g. table names, cannot be parameterised using this method, but via another method (e.g. Kettle variables).
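The binding mechanics above can be sketched outside Kettle entirely. This uses Python's built-in sqlite3 module rather than the sampledata database (the rows are made up), but the `?` placeholders behave the same way:

```python
import sqlite3

# In-memory stand-in for the sampledata products table (rows invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (productline TEXT, productscale TEXT, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    ("Classic Cars", "1:10", "Car A"),
    ("Classic Cars", "1:18", "Car B"),
    ("Motorcycles",  "1:10", "Bike A"),
])

# The parameter stream's single row: one value per '?', in order.
params = ("Classic Cars", "1:10")
rows = conn.execute(
    "SELECT name FROM products WHERE productline = ? AND productscale = ?",
    params,
).fetchall()
print(rows)  # [('Car A',)]

# Supplying fewer values than placeholders fails, mirroring the
# rule that the field count must equal the question-mark count.
try:
    conn.execute(
        "SELECT name FROM products WHERE productline = ? AND productscale = ?",
        ("Classic Cars",),
    )
except sqlite3.ProgrammingError as err:
    print("binding error:", err)
```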

Executing the SELECT statement several times, each for a different set of parameters

…more coming soon


PDI – Writing to MongoDB

Got this to work today (kind of). I downloaded a file from http://data.gov.uk/dataset/gp-practice-prescribing-data

It shows GP prescription data over time and by anonymised surgery area: 4.1m rows, 5 columns.

Imported the CSV file from the desktop into PDI, then wrote out to MongoDB. Note that there is *no need to define a schema up-front here* – a nice feature of MongoDB/NoSQL DBs. Although, admittedly, it is a simple and consistent (non-nested) dataset.

The resultant new ‘collection’ in MongoDB.
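The 'no schema up-front' point is easy to see in miniature: each CSV row simply becomes a document (a Python dict here), and that is all the MongoDB output step is inserting. A stdlib-only sketch – the column names below are invented, not the real data.gov.uk headers:

```python
import csv
import io

# Tiny stand-in for the prescribing CSV (invented headers and rows).
raw = io.StringIO(
    "practice,period,bnf_code,items,cost\n"
    "A81001,201108,0101010G0,12,4.50\n"
    "A81002,201108,0101010G0,3,1.10\n"
)

# Each row becomes a document; no schema is declared anywhere.
docs = [dict(row) for row in csv.DictReader(raw)]
print(docs[0])

# In a real run these dicts would go to the MongoDB output step
# (or a driver insert) exactly as they are.
```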

Useful guidance here http://wiki.pentaho.com/display/EAI/MongoDb+output

The full set of data has been written to a new MongoDB collection, but the results look a bit funny. Need to re-check the setup of the ETL package.

Made a few tweaks and now the data looks good.

I still haven't got my Raspberry Pis set up. If I had, it would be interesting to try to get this job to run in parallel using the 51MB of RAM in each of the four Pis. I'd hope it'd be a fair bit faster!
– It's currently writing about 12.5k rows/second on the laptop alone; the job runs in 5m 26s.
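A quick sanity check on those figures, using the 4.1m row count mentioned above:

```python
rows = 4_100_000              # rows in the prescribing CSV
seconds = 5 * 60 + 26         # job duration: 5m 26s
print(round(rows / seconds))  # 12577 – i.e. roughly 12.5k rows/second
```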

I'm going to try to achieve the same now using mongoimport.

Example syntax:

mongoimport --db users --collection contacts --type csv --file /opt/backups/contacts.csv

(For a CSV whose first line holds the column names – like the prescribing file – adding --headerline tells mongoimport to take the field names from that line.)

There's some good discussion on the 'mongodb-user' Google group about sharding and 'chunking' data prior to load. Now, if only I could get these four Raspberry Pis working!