MongoDB Applied Design Patterns – Ch.4

Operational Intelligence (p. 37) – converting transaction data into actionable information.

  • Of course, the starting point for any of these techniques is getting the raw transaction data into your datastore.
  • Once you have the data, the first priority is to generate actionable reports – ideally in real time.
  • Finally, we explore more traditional batch "hierarchical" aggregation.

Consider a typical Apache weblog record:

 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

One option might be to store the whole row in its own document, i.e. the entire string in a single field. However, this is neither easy to query nor efficient. For example, to find events for the same page you would have to write a nasty regex, which would require a full scan of the collection. The preferred approach is to extract the relevant information into individual fields in a MongoDB document. Storing the timestamp as a BSON UTC datetime takes a meagre 8 bytes, rather than the 28 bytes of the original string. Consider the following document, which captures all the data from the log entry:


{
  _id: ObjectId(...),
  host: "",
  logname: null,
  user: "frank",
  time: ISODate("2000-10-10T20:55:36Z"),
  request: "GET /apache_pb.gif HTTP/1.0",
  status: 200,
  request_size: 2326,
  referrer: "http://…",
  user_agent: "…"   // browser, OS, etc.
}


MongoDB has a configurable write concern, which trades off write durability against write speed. w=0 means you do not require MongoDB to acknowledge receipt of the insert; w=1 means you do. The former is obviously faster but may lose some data. Setting j=True tells MongoDB to commit the write to an on-disk journal file before acknowledging, prior to writing the data back to the 'regular' data files. This is the safest, but slowest, option. You can also require that MongoDB replicate the data to multiple members of a replica set before returning, and combine these strategies, e.g.

>>> db.events.insert(event, j=True, w=N)  # N > 1

However, the chapter does not go on to suggest how one might parse such raw weblog data into something more structured, so here's a worked example using Pentaho. A Kettle package is available here.

Now that the data has been processed (parsed), one can begin querying:

>>> q_events = db.events.find({'path': '/apache_pb.gif'})

would return all documents with the value /apache_pb.gif in the path field.
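The durability trade-offs above can be captured as the write-concern options document MongoDB expects; a minimal sketch (the helper name is mine, and the pymongo usage in the comments assumes pymongo's WriteConcern API):

```python
# Sketch: build the write-concern options discussed above.
# Helper name is hypothetical, not a pymongo API.
def write_concern_options(ack=True, journal=False, replicas=1):
    opts = {'w': replicas if ack else 0}  # w=0: fire-and-forget; w=N: wait for N members
    if journal:
        opts['j'] = True                  # commit to the on-disk journal first
    return opts

fastest = write_concern_options(ack=False)                # may lose data
safest = write_concern_options(journal=True, replicas=2)  # j=True, w=2

# With pymongo this would be applied roughly as:
#   from pymongo.write_concern import WriteConcern
#   coll = db.events.with_options(write_concern=WriteConcern(**safest))
#   coll.insert_one(event)
```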


INDEXING

Be mindful about performance and indexing.

>>> db.events.ensure_index('path')

Be wary of the RAM that indexes take up. It makes sense to index here because the number of distinct 'path' values is small in relation to the number of documents, which curtails the space the index needs.

>>> db.command('collstats', 'events')['indexSizes']

will show the size of each index.
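That "few distinct values relative to document count" rule of thumb can be sketched in plain Python (function name and sample data are mine, for illustration only):

```python
# Rough selectivity check: a low ratio of distinct values to documents
# suggests the index will stay compact. Function name is hypothetical.
def distinct_ratio(docs, field):
    values = [d[field] for d in docs]
    return len(set(values)) / len(values)

docs = [{'path': '/apache_pb.gif'}, {'path': '/index.html'},
        {'path': '/apache_pb.gif'}, {'path': '/index.html'}]
ratio = distinct_ratio(docs, 'path')  # 2 distinct paths across 4 docs
```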


>>> q_events = db.events.find({'time': {'$gte': datetime(2000, 10, 10), '$lt': datetime(2000, 10, 11)}})

will return documents from the events collection whose timestamps fall on 10 October 2000 (on or after midnight on the 10th and before midnight on the 11th).

>>> q_events = db.events.find({'host': '', 'time': {'$gte': datetime(2000, 10, 10)}})

returns events for the given host on or after 10 October 2000. *Note: performance may be improved by a compound index. A great blog on Mongo indexing is here.

Counting requests by date and page: finding requests is useful, but often a query needs summarisation. This is best done using MongoDB's aggregation framework.

Here’s a link translating SQL queries to Aggregation Queries

In the example below, you can think of $match as WHERE, $project as SELECT, and $group as GROUP BY.
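A sketch of such a pipeline, counting requests per page per day (field names follow the event document earlier; the exact stages are my own, not the book's):

```python
from datetime import datetime

# $match ≈ WHERE, $project ≈ SELECT, $group ≈ GROUP BY
pipeline = [
    {'$match': {'time': {'$gte': datetime(2000, 10, 1),
                         '$lt': datetime(2000, 11, 1)}}},   # October only
    {'$project': {'path': 1,
                  'day': {'$dayOfMonth': '$time'}}},        # keep path, extract day
    {'$group': {'_id': {'path': '$path', 'day': '$day'},
                'hits': {'$sum': 1}}},                      # count per page per day
]
# Run server-side with: db.events.aggregate(pipeline)
```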

MongoDB Semantics

Now I need to begin losing my relational vocabulary and start talking Mongo!

Some useful additional background here:
The MongoDB query language is not SQL, but 10gen describes it as a simple, expressive language with a straightforward syntax for efficient querying. Examples of simple query statements include “sum,” “min,” “max,” and “average.” These sorts of operators would be familiar to any database veteran or analyst, and they’re applied in a real-time data-processing pipeline that delivers sub-second performance, according to 10gen.

Other available query statements include "project," which is used to select desired attributes and ignore everything else. "Group" lets you combine results with desired attributes. "Match" is a filter that can be used to eliminate documents from a query. "Limit," "skip" and "sort" are used in much the same way they're used in SQL: to limit a query to a desired number of results, to skip over a given number of results, and to sort results alphabetically, numerically or by some other value.

SQL Terms/Concepts                                MongoDB Terms/Concepts
database                                          database
table                                             collection
row                                               document (or BSON document)
column                                            field
index                                             index
table joins                                       embedded documents and linking
primary key – specify any unique column or        primary key – in MongoDB, the primary key is
column combination as primary key                 automatically set to the _id field
aggregation (e.g. GROUP BY)                       aggregation framework – see the SQL to
                                                  Aggregation Framework Mapping Chart

And query operators

WHERE $match
GROUP BY $group
HAVING $match
SELECT $project
ORDER BY $sort
LIMIT $limit
SUM() $sum
COUNT() $sum
join No direct corresponding operator; however, the $unwind operator allows for somewhat similar functionality, but with fields embedded within the document.
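One subtlety in the table above: HAVING also maps to $match, but placed after the $group stage rather than before it. A hypothetical sketch:

```python
# HAVING count > 100 becomes a $match *after* $group:
pipeline = [
    {'$group': {'_id': '$status', 'count': {'$sum': 1}}},  # GROUP BY status
    {'$match': {'count': {'$gt': 100}}},                   # HAVING count > 100
    {'$sort': {'count': -1}},                              # ORDER BY count DESC
]
```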

Some example query statements

SQL                                                           MongoDB


SELECT *
FROM users
WHERE status = "A"
                                                              db.users.find(
                                                                  { status: "A" } )
SELECT *
FROM users
WHERE status = "A"
OR age = 50
                                                              db.users.find(
                                                                  { $or: [ { status: "A" } ,
                                                                           { age: 50 } ] } )
SELECT *
FROM users
WHERE age > 25
AND   age <= 50
                                                              db.users.find(
                                                                  { age: { $gt: 25, $lte: 50 } } )

And some really helpful analytical scenarios here



Day one!

Today I picked up a new laptop and installed the various bits needed for my MSc project. The project aim (at this point in time, subject to change!) is to build a scalable BI solution using a NoSQL DB, commodity hardware and open-source tools, for around £100.

I haven't yet thought about what data I may end up processing, but am drawn to unstructured documents.

I'm planning on using a document database, MongoDB, and the 'topology' of the 'cluster' is as follows:

  • 1 x Samsung 3-series laptop, Windows 7, 6GB RAM, Pentium 1-5
  • 1 x Dell 5150, WinXP (showing its age, probably around 7 years old), 4GB RAM, some kinda processor!
  • 4 x Raspberry Pi, 512MB RAM each, with 4GB SD cards, acting as 4 nodes

So: commodity hardware, apart from the new laptop – Raspberry Pis, an old knackered Dell desktop and a new laptop.

So far, I’ve installed

  • MongoDB mongodb-win32-x86_64-2008plus-2.2.1
  • Java
  • Tableau for data viz
  • Pentaho for ETL (& more possibly)
  • Jaspersoft iReport for report authoring (although may use Tableau if a suitable db connection can be found, or Pentaho)