MongoDB Applied Design Patterns – Ch.4

Operational Intelligence (p.37) – converting transactional data into actionable information.

  • Of course, the starting point for any of these techniques is getting the raw transactional data into your datastore.
  • Once you have the data, the first priority is to generate actionable reports – ideally in real time.
  • Finally, we explore more traditional batch “hierarchical” aggregation.

Consider a typical Apache weblog record:

    - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

One option might be to bung the whole data row into its own document, i.e. the entire string in one field. However, this is neither particularly useful (it is hard to query) nor efficient. For example, if you wanted to find events on the same page you would have to write a nasty regex, which would require a full scan of the collection. The preferred approach is to extract the relevant information into individual fields of a MongoDB document. The BSON UTC datetime type stores the verbose timestamp in a meagre 8 bytes, rather than the 28 bytes of the raw string. Consider the following document, which captures all the data from the log entry:


    {
        _id: ObjectId(...),
        host: "",
        logname: null,
        user: "frank",
        time: ISODate("2000-10-10T20:55:36Z"),
        request: "GET /apache_pb.gif HTTP/1.0",
        status: 200,
        response_size: 2326,
        referrer: "http://...",
        user_agent: "..."  // whatever browser, O/S etc.
    }


MongoDB has a configurable write concern, which trades off write safety against write speed. w=0 means you do not require MongoDB to acknowledge receipt of the insert; w=1 means you do. The former is obviously faster but may silently lose data. Setting j=True tells MongoDB to persist the data in an on-disk journal file before writing it back to the ‘regular’ data files; this is the safest, but slowest, option. You can also require that MongoDB replicate the data to multiple members of a replica set before returning, and these strategies can be combined, e.g.:

    >>> db.events.insert(event, j=True, w=N)  # where N > 1

However, the chapter does not go on to suggest how one might parse such raw weblog data into something more structured. Here’s a worked example using Pentaho; a Kettle package is available here.

Now that the data has been processed (parsed), one can begin querying:

    >>> q_events = db.events.find({'path': '/apache_pb.gif'})

would return all documents with '/apache_pb.gif' in the path field.


INDEXING: Be mindful about performance and indexing.

    >>> db.events.ensure_index('path')

Be wary of the RAM that indexes take up. It makes sense to index here because the number of distinct ‘path’ values is small relative to the number of documents, which curtails the space the index needs.

    >>> db.command('collstats', 'events')['indexSizes']

will show the size of each index.


    >>> q_events = db.events.find({'time': {'$gte': datetime(2000,10,10), '$lt': datetime(2000,10,11)}})

will return documents from the events collection that occurred on 10 October 2000 – on or after midnight on the 10th, and before midnight on the 11th.

    >>> q_events = db.events.find({'host': '', 'time': {'$gte': datetime(2000,10,10)}})

returns events on the given host on or after 10 October 2000. *Note: performance may be improved by a compound index on host and time. A great blog on Mongo indexing here.
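For what it’s worth, the compound index hinted at above might look like this in pymongo – a sketch only. Key order matters: host first (the equality match), then time (the range scan); the `ensure_index` call is commented out because it needs a live mongod.

```python
# pymongo expresses a compound index as a list of (field, direction)
# pairs; 1 means ascending (pymongo.ASCENDING). Equality field first,
# range field second, so the index can serve the host+time query above.
compound_key = [('host', 1), ('time', 1)]
# db.events.ensure_index(compound_key)  # needs a running mongod
```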

Counting requests by date and page

Finding individual requests is useful, but often a query needs summarisation. This is best done using MongoDB’s aggregation framework.

Here’s a link translating SQL queries to Aggregation Queries

Roughly, $match corresponds to SQL’s WHERE; $project to SELECT; and $group to GROUP BY.
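The chapter’s actual pipeline didn’t make it into these notes, so here’s a hedged sketch of what “counting requests by date and page” could look like in pymongo, against the document shape above. $year, $month and $dayOfMonth are standard aggregation operators; the field names and date range are my assumptions, and the aggregate call is commented out since it needs a live mongod.

```python
# Sketch: count hits per (page, date). $match ~ WHERE, $project ~ SELECT,
# $group ~ GROUP BY, as noted above.
from datetime import datetime

pipeline = [
    # WHERE: only events from 10 October 2000
    {'$match': {'time': {'$gte': datetime(2000, 10, 10),
                         '$lt': datetime(2000, 10, 11)}}},
    # SELECT: keep the page, and break the timestamp into date parts
    {'$project': {'path': 1,
                  'year': {'$year': '$time'},
                  'month': {'$month': '$time'},
                  'day': {'$dayOfMonth': '$time'}}},
    # GROUP BY: one counter per (page, date) pair
    {'$group': {'_id': {'path': '$path', 'year': '$year',
                        'month': '$month', 'day': '$day'},
                'hits': {'$sum': 1}}},
]
# for doc in db.events.aggregate(pipeline):
#     print(doc['_id'], doc['hits'])
```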


MongoDB job trends

MongoDB Job Trends graph

Demand for Puppet (cluster administration) is similarly strong; the combination of the two would be quite sought after, I imagine.

Puppet Job Trends graph

And ‘traditional’ BI remains strong – and is a much bigger market.

BI Job Trends graph

MongoDB iReport, @kchodorow books

MongoDB iReport

Yippee! I can now write data (manually at the moment!) into MongoDB and read it out into iReport!

Big thanks to @mattdahlman

Tomorrow I’m going to explore unetbootin (thanks to Pete Griffiths) and maybe reformat the old dell to be a Linux machine. I can’t help but think having 3 different O/S in the eventual cluster is going to be troublesome.

I’ve also ordered 2 more @kchodorow books: ‘50 Tips and Tricks for MongoDB Developers’

and ‘Scaling MongoDB’ – which will hopefully help me get my head around clustering/sharding within MongoDB.