MongoImport

When running mongoimport, the options to use are:
  • --collection – the collection to create or import into
  • --jsonArray – tells mongoimport to expect multiple documents, all contained within a single array
  • --file – the file to be imported

An example of the mongoimport run:
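The command below is only a sketch, assuming a database called test, a collection called restaurants and a file d:\sample\data.json whose documents are wrapped in a single JSON array (all placeholder names):

C:\mongodb\bin>mongoimport --db test --collection restaurants --jsonArray --file d:\sample\data.json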


Sample dataset (a la Northwind) for MongoDB

It would be nice to get hold of a sample, pre-built dataset of reasonable scale and complexity – something like the Northwind database that ships with MS SQL Server – rather than having to fiddle around converting data back and forth between two databases. One can be found here.

  • Open a command prompt, start the mongo server by going into the bin directory and typing mongod
  • Now open another command prompt, go to the bin directory again and run the following command:
    C:\mongodb\bin>mongoimport --db test --collection zips --file d:\sample\zips.json
  • The import should report something like “imported 29470 objects”; you can double-check the count from the mongo shell, as shown below.
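To sanity-check the import, open the mongo shell from the same bin directory and count the documents in the new collection (database and collection names as above):

C:\mongodb\bin>mongo
> use test
> db.zips.count()
29470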

The data looks like this:

{"city": "ACMAR", "loc": [-86.51557, 33.584132], "pop": 6055, "state": "AL", "_id": "35004"}
{"city": "ADAMSVILLE", "loc": [-86.959727, 33.588437], "pop": 10616, "state": "AL", "_id": "35005"}
{"city": "ADGER", "loc": [-87.167455, 33.434277], "pop": 3205, "state": "AL", "_id": "35006"}
{"city": "KEYSTONE", "loc": [-86.812861, 33.236868], "pop": 14218, "state": "AL", "_id": "35007"}
{"city": "NEW SITE", "loc": [-85.951086, 32.941445], "pop": 19942, "state": "AL", "_id": "35010"}
{"city": "ALPINE", "loc": [-86.208934, 33.331165], "pop": 3062, "state": "AL", "_id": "35014"}
{"city": "ARAB", "loc": [-86.489638, 34.328339], "pop": 13650, "state": "AL", "_id": "35016"}
{"city": "BAILEYTON", "loc": [-86.621299, 34.268298], "pop": 1781, "state": "AL", "_id": "35019"}
{"city": "BESSEMER", "loc": [-86.947547, 33.409002], "pop": 40549, "state": "AL", "_id": "35020"}
{"city": "HUEYTOWN", "loc": [-86.999607, 33.414625], "pop": 39677, "state": "AL", "_id": "35023"}

PDI – Writing to MongoDB

Got this to work today (kind of). I downloaded a file from http://data.gov.uk/dataset/gp-practice-prescribing-data

It shows GP prescription data over time and by anonymised surgery area: 4.1m rows, 5 columns.

Imported the CSV file from the desktop into PDI, then wrote out to MongoDB. Note there was *no need to define a schema up-front here*, a nice feature of MongoDB/noSQL DBs. Although, admittedly, it is a simple and consistent (non-nested) dataset.

[Image: the resultant new ‘collection’ in MongoDB]
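As a rough illustration of what “no schema up-front” means, in the mongo shell a collection simply springs into existence the first time a document is inserted (the collection and field names here are made up):

> db.prescriptions.insert({practice: "A81001", bnf_name: "Paracetamol", items: 7})
> show collections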

Useful guidance here: http://wiki.pentaho.com/display/EAI/MongoDb+output

The full set of data has been written to a new MongoDB collection, but the results look a bit funny. Need to re-check the setup of the ETL package.

Made a few tweaks and now the data looks good.

I still haven’t got my Raspberry Pis set up. If I had, it would be interesting to try to get this job to run in parallel using the 512MB of RAM in each of the four Pis. I’d hope it’d be a fair bit faster!
– It’s currently writing about 12.5k rows/second on the laptop alone; the job runs in 5m 26s.
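(Those figures are consistent with each other: 4.1m rows in 5m 26s works out at 4,100,000 ÷ 326 ≈ 12.6k rows/second.)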

I’m now going to try to achieve the same using mongoimport:
http://docs.mongodb.org/manual/reference/mongoimport/

Example Syntax:

mongoimport --db users --collection contacts --type csv --file /opt/backups/contacts.csv
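For a CSV with a header row, like the prescribing file, adding --headerline tells mongoimport to take the field names from the first line. A rough sketch only; the database, collection and file path below are placeholders:

mongoimport --db test --collection prescriptions --type csv --headerline --file d:\sample\gp_prescribing.csv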

There’s some good discussion on the Google group ‘Mongodb-user’ about sharding and ‘chunking’ data prior to load. Now, if only I could get these 4 Raspberry Pis working!
https://groups.google.com/forum/?fromgroups=#!searchin/mongodb-user/import/mongodb-user/9sZoq5iN1KY/XqT-swxWqk0J