PDI & MongoDB (ETL)

Having not yet fully got to grips with network configuration, and while waiting for delivery of a final SD card, I’m turning my attention to ETL, specifically learning Pentaho Data Integration (Kettle).


I’m reading Pulvirenti and Roldan’s Pentaho Data Integration Cookbook which is pretty recent (mid 2011). Notes below are taken from there.

All of the ETL jobs/packages featured in Pentaho Data Integration Cookbook can be downloaded here and opened up in Kettle.

There are also some further tutorials here and some useful Pentaho setup instructions here

Connecting to a database

It looks like there are 3 ways to connect to a database from Pentaho

  1. Create a connection to a supported DBMS
  2. Create a driver-based connection
  3. Use the bigdata utilities within Spoon.

In order to create a DB connection, you’ll need

  1. Host name – or IP address of the DB server (*I still haven’t quite yet fixed IPs on the 5 Raspberry Pis*)
  2. Port number (*as above – some settings on R Pi still to be made*)
  3. User Name
  4. Password

Items 3 & 4 (user name and password) are generally left blank for a default MongoDB install.


  1. Open Spoon and create a new transformation
  2. View > right-click ‘connections’, select ‘new’. The DB connection dialog appears
  3. Select the DB engine

However, there is no DBMS connection (yet) for MongoDB in Spoon. Instead, switch to the design view and choose Bigdata.

Again, I do not fully understand which IPs and ports to enter until I have finished the sysconfig on each R Pi. The pic below shows me connecting to a MongoDB instance on my laptop, using the default (unchanged) port. I would need to specify the collection if several existed.
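For reference, MongoDB listens on port 27017 unless told otherwise, so a local, unmodified instance is reachable at localhost:27017. A minimal sketch of a helper that fills in those defaults; the function name mongo_target is made up for illustration, and the hostname and alternative port in the examples are placeholders:

```shell
# Hypothetical helper: print host:port, falling back to the MongoDB defaults.
mongo_target() {
    host=${1:-localhost}
    port=${2:-27017}      # 27017 is MongoDB's default port
    printf '%s:%s\n' "$host" "$port"
}

mongo_target                 # local instance on the default port
mongo_target yellow          # a Pi by hostname (assumes the name resolves)
mongo_target yellow 27018    # a non-default port, purely as an example
```

The same host and port are what the mongo shell accepts through its --host and --port options.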


Alternatively, if you want to create a connection to a DBMS that is not in the list, you first have to get the JDBC driver for it (here, one for MongoDB).

Copy the .jar file containing the driver to the libext/JDBC directory inside the Kettle installation directory, then create the connection.

In this case, choose Generic Database. In the settings frame specify the connection string(?), the driver class name, the user name and the password. To find these settings you will have to refer to the driver documentation. I’m trying to locate this information from the 10gen community/driver developers.

Re-use the connection

Avoid creating the DB connection again and again by sharing the connection. Right-click the DB connection under the DB connections tree and click on share. The connection will now be available to be used in all transformations & jobs. Shared connections are shown in bold in the console.

Check the connection is open at runtime

Insert a preceding Check DB connection step. The entry will return True or False.

Creating parameterised inputs

If you need to create an input dataset with data coming from an existing database you do it using a table input step. If the SELECT statement is simple and doesn’t change, it can be entered into the table input settings window. However, most of the time the dataset being selected is dynamic and the query needs to be flexible.

  1. Create a transformation
  2. Before getting the data segment, you have to create a stream that will provide the parameters for the statement
  3. Create a stream that builds a precursor dataset with a single row and 2 columns. In the case of the sampledata dataset being used, we use the product line parameter and the scale parameter, but these could be any relevant fields from a MongoDB collection. Give each a single value, e.g. ‘Classic Cars’ for productline_par and ‘1:10’ for productscale_par
  4. Add a data grid step or generate rows step.
  5. Now drag to the canvas a table input step and create a hop from the last step of the stream created above, towards this step.
  6. Now you can configure the table input step. Double-click it, select the appropriate DB connection and specify the appropriate filter conditions in the WHERE clause, i.e. WHERE productline = ? AND productscale = ?
  7. In the insert data from step list, select the name of the step that is linked to the table input step. Close the window.
  8. Finally, select the table input step and do a preview of the transformation. You should see a limited set of results based on the parameters.

How it works…

When you need to execute a SELECT statement with parameters, the 1st thing you have to do is to build a stream that provides the parameter values needed by the statement. The stream can be made of just one step eg a data grid with to-match values. The important thing is that the last step delivers the proper values to the table input step.

The last step in the stream is then linked to the table input step. The query is prepared and the values coming to the table input step are assigned/bound to the placeholders – that is where you used the ? symbol(s).
– The number of fields coming to a table input step must be exactly the same as the number of question marks found in the query
– ‘?’ can only be used to parameterise value expressions. Keywords or identifiers, e.g. table names, cannot be parameterised this way; they need a different method.
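The first rule above (one incoming field per question mark) is easy to get wrong when a query is edited. Below is a rough sketch of a sanity check done outside Kettle, in plain shell. The helper names, the products table name and the ‘|’-separated parameter row are all assumptions made for illustration; the simple count would also be fooled by a literal ‘?’ inside a quoted string.

```shell
# Hypothetical helper: count the '?' placeholders in a query string.
count_placeholders() {
    printf '%s' "$1" | tr -cd '?' | wc -c | tr -d ' '
}

# Hypothetical helper: count the fields in one '|'-separated parameter row.
count_fields() {
    printf '%s\n' "$1" | awk -F'|' '{ print NF }'
}

# Table name is assumed; the WHERE clause is the one from step 6.
query="SELECT * FROM products WHERE productline = ? AND productscale = ?"
params="Classic Cars|1:10"   # the single row built by the data grid step

n_placeholders=$(count_placeholders "$query")
n_fields=$(count_fields "$params")

if [ "$n_placeholders" -eq "$n_fields" ]; then
    echo "ok: $n_fields fields for $n_placeholders placeholders"
else
    echo "mismatch: $n_fields fields for $n_placeholders placeholders" >&2
fi
```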

Executing the SELECT statement several times, each for a different set of parameters

…more coming soon


Successful Debian (6.0) installation

It’s nice working on a 42″ plasma TV in the living room(!) where the internet router is situated. I managed to get a DVI <> HDMI cable for the old Dell Dimension (which  means no more bulky old Dell monitor)

Installed a Debian image onto the previously WinXP Dell, using Unetbootin and a USB drive.

Fun with .tar files

I now have 5 x R Pi with 32GB SD cards and MongoDB installed on Raspbian O/S, plus MongoDB installed on the old Dell.

I have also installed Oracle Virtual Box which lets you create virtual machines relatively easily.

My new 8-port network switch should hopefully come in the post today.
Should be in business!

Don’t forget, you’ll need to configure WiFi access too! Although for the RPi, this was pretty helpful, as was this & this

The Dell is so old it doesn’t have an internal wireless card. Instead I use a wifi dongle
$ lsusb
Bus 001 Device 004: ID 0846:4260 NetGear, Inc. WG111v3 54 Mbps Wireless [realtek RTL8187B]

$ sudo iwlist wlan0 scan | grep 'BT'

$ man 5 interfaces
$ man 8 wpa_supplicant
$ man 8 iwconfig
$ man 8 iwlist

Take a look/install these Unix GUIs to help with setting up a network connection

  • kmanager
  • wicd

How to install MongoDB on a Raspberry Pi…Really!

/Update 25 Sept – some handy links/

My 4th Pi arrived today, with an SD card pre-loaded with the O/S. Gave me the opportunity to try a ‘fresh’ config, after making several botched attempts not fully knowing what to do previously.

Write the Raspbian Operating System to the SD card

Copy the latest O/S from the R Pi website.
If you haven’t already, get the image writer software for Windows (download the binaries, not the source).
Extract the zipped O/S archive. Find the image file from within the image writing software. Write to the card (takes 5 mins or so).

Modify initial config settings

On 1st boot, I changed the following default settings, following great advice from Chris Elsmore and his blog. I also repeated some of these manually, below.

  • Expand rootfs – expand the root partition to fill the SD card
  • Overscan – enabled
  • memory_split – how much memory should the GPU have? I set mine to 32 of the 512MB as I won’t be running graphically intensive apps. I hope this frees up most of the resource to run MongoDB
  • ssh – enabled
  • boot_behaviour – start desktop on boot
  • update – could not resolve mirror sites etc (I assume config for eth0 connection needs to be done?)

Change Hostname:

I also changed the hostname of each Pi. Each of the 5 is housed in a coloured plastic case.

Change hostname to match case - keep it simple!

$ sudo nano /etc/hostname

- this opens up a text editor containing 1 line. I changed 'raspberrypi' to 'yellow' (CTRL+X, then Y and Enter, to save and quit)

$ sudo nano /etc/hosts

– again, this opens up a text editor, this time containing 8 lines. On the last line (mine had ‘IP raspberrypi’), replace the old hostname with the one you chose above, i.e. ‘yellow’

$ sudo /etc/init.d/hostname.sh start (to enable the changes).

The prompt should now say pi@’newhostname’ in my case pi@yellow. A further check is to run the command
$ hostname
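The two manual edits above can also be scripted. This is a minimal sketch that works on scratch copies rather than the real /etc/hostname and /etc/hosts; the function name rename_host is made up, and the sample hosts content (including the 127.0.1.1 line) is an assumption about what a stock file looks like.

```shell
# Hypothetical helper: write the new name to the hostname file and swap
# the old name for the new one in the hosts file.
rename_host() {
    old=$1; new=$2; hostname_file=$3; hosts_file=$4
    printf '%s\n' "$new" > "$hostname_file"
    sed "s/$old/$new/g" "$hosts_file" > "$hosts_file.tmp" &&
        mv "$hosts_file.tmp" "$hosts_file"
}

# Try it on scratch copies first (assumed sample content).
workdir=$(mktemp -d)
printf 'raspberrypi\n' > "$workdir/hostname"
printf '127.0.0.1\tlocalhost\n127.0.1.1\traspberrypi\n' > "$workdir/hosts"

rename_host raspberrypi yellow "$workdir/hostname" "$workdir/hosts"
cat "$workdir/hostname" "$workdir/hosts"
```

Pointing the function at the real files would need root, and the hostname.sh step above is still required for the change to take effect.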

Enable SSH:

SSH lets you log in to a Pi remotely. I haven’t yet needed to do this as I have more than 2 HDMI ports on my monitor, so I just flip between video inputs.

$ ssh-keygen (I hit enter for all three options to accept defaults and no passphrase)

>>Generating public/private rsa key pair
>>Enter file in which to save the key (/home/pi/.ssh/id_rsa)
>>Enter passphrase (empty for no passphrase):
>>Enter same passphrase again:
>>Your identification has been saved in /home/pi/.ssh/id_rsa
>>Your public key has been saved in /home/pi/.ssh/id_rsa.pub.
>>The key fingerprint is: [a long string!]

$ sudo service ssh start (to start sshd)
>>[ ok ] starting OpenBSD Secure Shell server: sshd

$ sudo update-rc.d ssh defaults (to run the ssh server on startup by default)
>>update-rc.d: using dependency based boot sequencing
>>update-rc.d: warning: default stop runlevel arguments (0 1 6) do not match ssh Default-Stop values (none)

Now to config an internet connection so I can try to install MongoDB on this ‘fresh’ card.

Config an internet connection:

You’ll need to get out on to the internet so that the Pi can see the various mirrors and repositories it uses to pull down updates.

$ dmesg | grep ^usb
$ dmesg | grep ^wlan
$ dmesg | grep ^wireless
$ dmesg | grep ^firmware

One of these commands should indicate the manufacturer of your USB wireless adapter. Mine (from Maplin) was “Ralink”.
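If the kernel log lines on your system start with a timestamp in square brackets, the ^-anchored patterns above may match nothing. A looser, case-insensitive filter is sketched below. The function name wifi_lines and the sample log lines are made up for illustration; real dmesg output will differ.

```shell
# Hypothetical helper: keep log lines that mention any of the keywords,
# wherever they appear on the line and regardless of case.
wifi_lines() {
    grep -i -E 'usb|wlan|wireless|firmware'
}

# Made-up sample lines, standing in for real dmesg output.
sample='[    3.210000] usb 1-1.2: Manufacturer: Ralink
[    3.300000] mmc0: new SD card'

matches=$(printf '%s\n' "$sample" | wifi_lines)
printf '%s\n' "$matches"

# On a real Pi: dmesg | wifi_lines
```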

Search the library for the appropriate package
$ apt-cache search ralink

Install the package
$ sudo apt-get install firmware-ralink

Now create a config file to specify what type of encryption your home router has, necessary IDs, passwords etc
$ sudo nano /etc/network/interfaces

Add the following three lines to the bottom of the file
auto wlan0
iface wlan0 inet dhcp
wpa-conf /etc/wpa.conf

CTRL+X, then Y and Enter, to save

The last line refers to a config file, wpa.conf, which needs to be created. The file will be used by wpasupplicant, which gives Linux an easy way to connect to networks secured by WPA (most home broadband networks, I guess).

$ sudo nano /etc/wpa.conf

* This is specifically my home SSID. My password is a combination of 10 alphanumerics *
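As a rough guide, a wpa.conf for a typical WPA/WPA2 personal network could look like the sketch below. YOUR_SSID and YOUR_PASSPHRASE are placeholders rather than real values, the key_mgmt setting assumes a pre-shared-key network, and the file is written to a scratch location first so nothing under /etc is touched.

```shell
# Write an example config to a scratch file (placeholders, not real values).
conf_file=$(mktemp)

cat > "$conf_file" <<'EOF'
network={
    ssid="YOUR_SSID"
    psk="YOUR_PASSPHRASE"
    key_mgmt=WPA-PSK
}
EOF

# The file holds a password, so keep it readable by its owner only.
chmod 600 "$conf_file"
cat "$conf_file"

# Once edited, it could be put in place with:
#   sudo cp "$conf_file" /etc/wpa.conf
```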

The Pi’s wireless network is now (supposedly!) configured, and will come up the next time the Pi is restarted. To start the wireless network without rebooting

$ sudo ifup wlan0
To make sure it’s working

$ ping -c 1 www.raspberrypi.org
ping: unknown host www.raspberrypi.org

In my case, things weren’t working, so I flipped over to the desktop to use the wifi config utility. On doing so, I could see the home broadband network and it was then easy to connect.

$ startx
to start the graphical desktop (the X Window System).
Run the WiFi config programme, click the scan button, connect to the relevant network and provide any required info.

Install MongoDB via GitHub

Because of the limitations/power of the Pi (it is £30 after all!), you need to get the version that has been specially created for the R Pi.

Source control/repositories

Install the requisite packages on the Pi

sudo apt-get install git-core build-essential scons libpcre++-dev xulrunner-dev libboost-dev libboost-program-options-dev libboost-thread-dev libboost-filesystem-dev

I could not get this to execute successfully as a single command, so instead I installed each component one by one

$ sudo apt-get install git-core
$ sudo apt-get install libboost-filesystem-dev

This seemed to work better.

Refresh the package lists to pick up any updates to these
$ sudo apt-get update

Pull the files from this fork on Github:

git clone git://github.com/RickP/mongopi.git

Build it (this took about four hours!):

cd mongopi
scons

You’ll see lots of these lines as the build progresses for several hours…
{standard input}: nnnnn: Warning: swp{b} use is deprecated for this architecture

Install it (this took about 3 hours)

$ sudo scons --prefix=/opt/mongo install

This will install mongo in /opt/mongo. To get other programs to see it, add its bin directory to your $PATH.

Check the filepath where mongo is executed from
pwd = print working directory

$ PATH=$PATH:/opt/mongo/bin/
$ export PATH
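Running those two lines more than once appends the same directory again each time. A small sketch of an idempotent version follows; the function name add_to_path is made up for illustration.

```shell
# Hypothetical helper: append a directory to PATH only if it is not
# already one of the entries.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
    export PATH
}

add_to_path /opt/mongo/bin
add_to_path /opt/mongo/bin           # second call leaves PATH unchanged

# Show how many times the directory now appears in PATH.
printf '%s' "$PATH" | tr ':' '\n' | grep -c '^/opt/mongo/bin$'
```

The change only lasts for the current shell session. To keep it across logins, the same lines would typically go in a startup file such as ~/.profile.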

While you’re waiting for everything to install, take a look at the primer/tutorial on MongoDB on CPAN

All looks good….

Now. I can at last fire up MongoDB on the Pi! Or, so I thought….

pi@clear ~/mongopi $ mongod
 ERROR: dbpath (/data/db) does not exist
 Create this directory or give existing directory in --dbpath
 , terminating

So, I assume I need to create the directories
sudo mkdir -p /data/db
mkdir: cannot create directory ‘data’ : No space left on device

Check disk usage
$ df -h
Reveals that the SD card is completely full.

I run scons -c to clean up any temp files hanging around after the installation. This frees up just 573M. That doesn’t feel like sufficient room to get some data on, and I’m also suspicious about whether the install has really worked, given that it seems to have taken all the space. So, time to experiment with some larger cards!

Got a 32GB SD card from Maplin for 19.99. Re-ran the install. Looks promising. Cheapies can be found here
– Now both have 27G (94%) of 30G spare. That should give me about 135GB across the array of five R Pis (5 x 27G). One GB equates to about 1000 thick books.

Scan for the 20 largest files, perhaps there are some temp install files that can be cleared up.

# du -a /var | sort -n -r | head -n 20

Create the necessary /data/db
$ sudo mkdir -p /data/db
Try mongod again

Returns error: [initandlisten] exception in initAndListen: 10309 Unable to create/open lock file: /data/db/mongod.lock errno:13 Permission denied Is a mongod instance already running?, terminating

Check to see if any mongod instances are running
$ ps -ef | grep mongod
– Suggests none are.

Therefore I suspect it is some kind of permissions conflict as the /data/db directory was created as ‘root’.

Seems like I’m not the only one to experience this common problem
$ sudo chown $USER /data/db
Seems to do the trick!
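The permission problem can be spotted before starting mongod by checking whether the data directory is writable by the current user. A minimal sketch, where DBPATH is a made-up variable that defaults to a scratch directory so the check can be tried without touching /data/db:

```shell
# DBPATH is hypothetical; set it to /data/db to check the real directory.
dbpath=${DBPATH:-$(mktemp -d)/data/db}

mkdir -p "$dbpath"

if [ -w "$dbpath" ]; then
    echo "$dbpath is writable by $(id -un)"
else
    echo "$dbpath is not writable by $(id -un); try: sudo chown $(id -un) $dbpath" >&2
fi
```

mongod can also be pointed at a directory you already own with its --dbpath option, as the first error message suggests.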

Configuring network access between devices / creating a lan

Now that Mongo is installed, the packages are updated and the Pis are configured, it’s time to connect them all into a network hub/LAN. My primary device is my laptop, running Windows 7. This is connected over WiFi to a BT router. It is this information that needs to be propagated to the R Pis. Thanks to Simon The PiMan’s blog for the instructions.

> ipconfig /all | more
This pipes the ipconfig information out one page at a time; press the spacebar to move to the next page.

Page 2 has the information required. The two important bits of info we need for setting up a Raspberry Pi with a fixed IP address are (1) Default Gateway which is the router to access the internet, in my case it has an IP address of and (2) IP address, which is

As you can see, Default Gateway has the same IP address as for the DHCP server  – I believe this means local addresses will be allocated by the same router (although my knowledge of networking is low!)

As my I.P. address is and also uses DHCP then the use of addresses greater than is unlikely to clash with the DHCP server, so I will start the five Raspberry Pis from this point.

My network is as the home local network is usually the 1st 3 parts of the IP address plus a 0 (but this is dependent on the Subnet Mask, usually irrelevant for most home users, and I’m out of my depth now)

So to conclude – My 5 Pis need the following items to be set up within my LAN.

1 has yet to arrive!
Hooking in the 5 R PIs into a home LAN

address    # The first 3 parts of the gateway plus a unique number per machine up to 254,
           # starting at 200 as it gives me 54 potential Pi addresses (max is 254)
netmask    # Highlighted in step 2: Subnet Mask
network    # The first 3 parts of the gateway plus a 0 ending
broadcast  # The first 3 parts of the gateway plus a 255 ending
gateway    # Highlighted in step 2: Default Gateway

[ ~memo: not sure where this info gets written to (yet) ]
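On Debian-based systems such as Raspbian, settings like these normally go in /etc/network/interfaces, the same file edited for wlan0 earlier. The sketch below derives a static stanza from a gateway address using the rules listed above. The function name static_stanza, the 192.168.1.254 gateway and the 255.255.255.0 netmask are all assumptions for illustration; substitute the values reported by ipconfig.

```shell
# Hypothetical helper: print a static stanza for one interface.
#   $1 = interface, $2 = default gateway, $3 = last part of this Pi's address
static_stanza() {
    iface=$1; gateway=$2; host_part=$3
    prefix=${gateway%.*}          # first three parts of the gateway
    cat <<EOF
auto $iface
iface $iface inet static
    address $prefix.$host_part
    netmask 255.255.255.0
    network $prefix.0
    broadcast $prefix.255
    gateway $gateway
EOF
}

# Example with an assumed gateway; the first Pi starts at .200.
static_stanza eth0 192.168.1.254 200
```

Each Pi would get its own last number (200, 201 and so on), and the output would replace the dhcp line for that interface.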

NOTE:- Details of your ADSL router are specific to the router so you will need to look at the documentation for the router if you want more details, and will also need this *to enable a DMZ for your Pi to run from*.

Creating a LAN. Use colour coded ethernet cables. It all gets rather messy after a while!

Hopefully all the network stuff is more or less done now, which should enable me to hook everything up and get things ‘talking’. The fun hopefully starts when I can start using MongoDB to shard data across each Pi node!

Sharding with MongoDB

Big thanks to Simon & Chris for their generous help!

Update 02 April 13
Just noticed a new link/site

Pi config/setup

Thanks to some great suggestions & help from Simon via his blog
I managed to ‘see’ a Pi alive. Looks like the O/S is good and now I just need to run some updates/config.

For some reason, I could not see the RPi on my laptop. I can on an HDMI TV. Turns out both the laptop and Pi are HDMI *out* only – this is why it works on a TV with an HDMI input.

[Note to self (after a very expensive ethernet and ‘gold plated’ 3m HDMI lead from Asda) Farnell is *far* cheaper for everything needed than Maplin, Asda etc]

So, tomorrow, I’ll be buying a powered USB hub so I can connect a mouse and keyboard to the PIs, and the Pi to the bedroom TV(!) and begin the O/S stuff.

Painful progress today, but progress nonetheless.

Simon also recommends XMing for emulation/control.
– I guess I’ll be battling with that soon!

Connecting/seeing devices


So, I think I need to connect the R Pi to the laptop via HDMI-to-HDMI. However, there’s only 1 HDMI port on the laptop, so I am not sure how I can see all 3 (turns out you can log in remotely using SSH, assuming you get the setup right, and/or buy an HDMI splitter). Anyway, being able to see each device one at a time and run commands (and make some progress) will be a good result today.

R Pi

How to see the PIs?

So, I have successfully (I think) flashed the Raspbian O/S onto each of the 2 SD cards.
Note to self: Buy an SD card with it pre-installed next time!

Initially, just one of the PIs changed from a red to a green light. I assume the O/S hasn’t been successfully written onto the one with the red light (but how to tell?).

Even though all devices (excl. the Dell, which isn’t turned on) are connected into the network switch (TP-Link), the light for port 2 on the switch is out. This probably echoes the above.

I swapped the SD cards between the 2 PIs and now have 4 lights on the switch. And all the PIs are green. Odd.

So, looks like we have a cluster and the O/S flashing onto each card has been successful. An afternoon of my life I’d rather not have to repeat!

Successfully flashed the O/S onto the PIs

I do like green lights!

What next?!

So, I run Puppet and hope(!) it can see the ‘members’ or ‘nodes’. TBH, I don’t know how to use Puppet (yet).

But, it only sees the laptop :(

Doh! Was hoping Puppet could see everything connected into the network switch (laptop & 3 RPi)

I know I have to do some O/S stuff now

But cannot ‘see’  each PI. Maybe I need to buy a video cable and connect them to a monitor, but had hoped I could switch to them from the laptop? One for tomorrow!

Update: Until you have successfully got SSH enabled and figured out the IP address [ip addr show] for each Raspberry Pi, I’d recommend an HDMI switch, which lets you easily toggle between devices without having to plug/unplug HDMI cables from your monitor

Lets you flip between 3 devices. I have the laptop in HDMI port 1, and the 3 R Pis go into the switch and then into HDMI port 2
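For the IP lookup mentioned in the update, the one-line output mode of the ip command is easier to pick apart than the default. A small sketch, where the function name list_ipv4 is made up and the sample line is invented to show the format:

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# "interface address" pairs with the /prefix length stripped.
list_ipv4() {
    awk '$3 == "inet" { sub(/\/.*/, "", $4); print $2, $4 }'
}

# Made-up sample line in the one-line format, for illustration only.
sample='2: eth0    inet 192.168.1.200/24 brd 192.168.1.255 scope global eth0'
printf '%s\n' "$sample" | list_ipv4

# On a real Pi: ip -o -4 addr show | list_ipv4
```

Once the address is known, logging in from the laptop is a matter of ssh pi@ followed by that address.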