PDI – Writing to MongoDB

Got this to work today (kind of). I downloaded a file from http://data.gov.uk/dataset/gp-practice-prescribing-data

It shows GP prescription data over time and by anonymised surgery area: 4.1m rows, 5 columns.

Imported the CSV file from the desktop into PDI, then wrote it out to MongoDB. Note that there is *no need to define a schema up-front here*, a nice feature of MongoDB/NoSQL databases. Although, admittedly, it is a simple and consistent (non-nested) dataset.
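That schema-less behaviour is easy to see from the mongo shell: documents with different shapes can live in the same collection, no DDL required. A quick sketch (the database and collection names here are placeholders, not what the PDI job used):

```shell
# Schema-less: two differently-shaped documents in one collection.
# "scratch" and "demo" are placeholder names for illustration only.
mongo scratch --eval '
  db.demo.insert({ practice: "A1", items: 10 });
  db.demo.insert({ practice: "B2", items: 5, note: "an extra field is fine" });
  printjson(db.demo.find().toArray());
'
```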

The resultant new ‘collection’ in MongoDB.

Useful guidance here: http://wiki.pentaho.com/display/EAI/MongoDb+output

The full set of data has been written to a new MongoDB collection, but the results look a bit funny. Need to re-check the setup of the ETL package.

Made a few tweaks and now the data looks good.

I still haven’t got my Raspberry Pis set up. If I had, it would be interesting to try to get this job running in parallel across the 512MB of RAM in each of the four Pis. I’d hope it’d be a fair bit faster!
– It’s currently writing about 12.5k rows/second on the laptop alone; the job runs in 5m 26s.
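Those two figures are consistent with each other, as a one-line sanity check shows:

```shell
# 4.1m rows over 5m 26s (326 seconds) ≈ 12.5k rows/second
echo $(( 4100000 / 326 ))   # → 12576
```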

I’m going to try to achieve the same now using mongoimport.

Example Syntax:

mongoimport --db users --collection contacts --type csv --file /opt/backups/contacts.csv
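Adapted to this dataset, something like the following should work. The database, collection, and file names below are my guesses rather than what I actually typed; the `--headerline` flag tells mongoimport to take field names from the first row of the CSV:

```shell
# Import the prescribing CSV; field names come from the header row.
# Database, collection, and path are placeholders.
mongoimport --db prescribing --collection gp_data \
    --type csv --headerline \
    --file ~/Desktop/gp_prescribing.csv
```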

There’s some good discussion on the Google group ‘mongodb-user’ about sharding and ‘chunking’ data prior to load. Now, if only I could get these four Raspberry Pis working!
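The usual advice in those threads is to enable sharding on the target collection before the bulk load, so writes are spread across shards from the start rather than all landing on one. A rough sketch in the mongo shell, run against a mongos (the database, collection, and shard key are all assumptions on my part):

```shell
# Sketch: shard the target collection ahead of a bulk load.
# Names and the shard key below are placeholders.
mongo --eval '
  sh.enableSharding("prescribing");
  sh.shardCollection("prescribing.gp_data", { practice: 1 });
'
```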