Got this to work today (kind of). I downloaded a file from http://data.gov.uk/dataset/gp-practice-prescribing-data showing GP prescription data over time, by anonymised surgery area: 4.1m rows, 5 columns.
Imported the csv file from the desktop into PDI, then wrote it out to MongoDB. Note *no need to define a schema up-front here*, a nice feature of MongoDB/NoSQL databases. Admittedly, though, this is a simple and consistent (non-nested) dataset.
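To illustrate the "no schema up-front" point, here's a minimal sketch: each CSV row simply becomes a dict (i.e. a MongoDB document) carrying whatever columns the file happens to have. The column names below are made up for illustration, not the real dataset's headers.

```python
import csv
import io

# Each CSV row becomes a plain dict -- MongoDB accepts these as documents
# with no schema declared in advance. Column names here are illustrative.
sample = io.StringIO(
    "practice,bnf_code,bnf_name,items,period\n"
    "A81001,0101010G0,Gaviscon,12,201109\n"
)
docs = [dict(row) for row in csv.DictReader(sample)]
print(docs[0]["practice"])  # A81001

# With pymongo the insert itself needs no schema either (not run here;
# database and collection names are hypothetical):
#   from pymongo import MongoClient
#   MongoClient().prescriptions.gp_data.insert_many(docs)
```

If the data later grows nested structure (say, a list of prescriptions per practice), the same insert call still works, which is where this flexibility really pays off.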
The full set of data has been written to a new MongoDB collection, but the results look a bit funny. Need to re-check the setup of the ETL package.
I still haven’t got my Raspberry Pis set up. If I had, it would be interesting to try to get this job running in parallel across the 512MB of RAM in each of the four Pis. I’d hope it’d be a fair bit faster!
It’s currently writing about 12.5k rows/second on the laptop alone, and the job runs in 5m 26s.
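Those two figures are consistent with each other, as a quick back-of-the-envelope check shows (using the 4.1m row count from above):

```python
# Sanity-check the reported throughput: 4.1m rows in 5m 26s
# should work out to roughly the quoted ~12.5k rows/second.
rows = 4_100_000
elapsed_s = 5 * 60 + 26       # 5m 26s = 326 seconds
rate = rows / elapsed_s       # implied rows per second
print(f"{rate:,.0f} rows/second")  # about 12,577
```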
I’m going to try to achieve the same now using mongoimport. A generic invocation looks like this (generic database, collection, and file names, not my actual data):

mongoimport --db users --collection contacts --type csv --file /opt/backups/contacts.csv

For a CSV with a header row, adding --headerline tells mongoimport to take the field names from the first line of the file.
There’s some good discussion on the ‘mongodb-user’ Google Group about sharding and ‘chunking’ data prior to load. Now, if only I could get these four Raspberry Pis working!