MongoDB Aggregation Framework

Image

Lynn, John (2013-06-17). MongoDB Aggregation Framework Principles and Examples

What is the aggregation framework?

The aggregation framework was created to provide a “means to calculate aggregated values without having to use map-reduce” according to the introduction in the MongoDB documentation at: http:// docs.mongodb.org/ manual/ core/aggregation/# overview . Not that there’s anything wrong with using map-reduce, but it’s often viewed as a complicated way to do a simple thing – calculate aggregate values. The aggregation framework simply offers a easier way to do this simple thing.”

A simple example: US states with population over 10 million

db.zipcodes.aggregate(
{$ group:{ _id:” $ state”, totalPop:{ $ sum:” $ pop”}}},
{$ match:{ totalPop:{ $ gte: 10* 1000* 1000}}}
)

The idea here is that the pipeline operators flow results from one operator to the next . So in this example, the documents from the zipcodes collection in its entirety are presented to the $group operator to work on; And then the documents coming out of $group are presented as input to $match.

$skip
// skip 1 document, starting from the beginning of the set of documents, and then output the rest. This will output all but the first document, which may be a lot of output
printjson( db.zipcodes.aggregate( { $ skip : 1} ))

$unwind
In our northwind sample collection of customer orders, notice that the line items in each order are contained in an array called orderItems. Each item in this array is a sub-document with the product category, supplier, product name, and pricing information contained in unitPrice, as well as quantity of the item ordered. A common aggregation to be performed is: calculating order total – the total for each order, based on the line items.
db.northwind.aggregate(
{$ match: {” orderId”: 10253}}
, {$ unwind: “$ orderItems”}
)

Now that the document for order 10253 has been exploded into 3 documents by $ unwind, a simple $ group operator can be utilized to get an order total.

db.northwind.aggregate( {$ match: {” orderId”: 10253}} , {$ unwind: “$ orderItems”} , {$ group: {_id: “$ orderId”, “OrderTotal”: {$ sum: “$ orderItems.unitPrice”
}}
)

$ project
The $ project operator is used to “re-shape” the documents, inserting and deleting fields or whole sub-documents; creating computed values using existing fields or constants as input; conditionally including parts of documents, and more. Using $ project, you can do some very creative manipulation of the documents in the pipeline.

 

Advertisements

MongoDB Applied Design Patterns – Ch.4

Operational Intelligence pp.37 Converting tx data to actionable information.

  • Of course, the starting point for any of these techniques is getting the raw tx data into your datastore.
  • Once you have the data, the 1st priority is to generate actionable reports – ideally in real time
  • Finally, we explore more traditional batch “Hierarchical” aggregation

Consider a typical Apache weblog record: 127.0.0.1 – frank [10/Oct/2000:13:55:36 -0700] “GET /apache_pb.gif HTTP/1.0” 200 2326 One option might be to bung the whole data row into its own document ie the whole string in one slot. However, this is not particularly useful (hard to query) or efficient. For example, if you wanted to find events on the same page you’d have to write a nasty regex which would require a full scan of the collection. The preferred approach is to extract the relevant information into individual fields in a MongoDB document. The UTC timestamp data format stores the verbose timestamp as a meager 8 bytes, rather than the natural 28 bytes. Consider the following document that captures all the data from the log entry:

{

_id: Objectid(…), host: “127.0.0.1”, logname: null, user: ‘frank’, time: ISOGetData(“2000-10-10T20:55:36Z”), request “GET /apache_pb.gif HTTP/1.0”, status: 200, request_size: 2326, referrer: “Http://……”, user_agent: “Whatever browser, O/S etc@

}

MongoDB has a configurable write concern which trades off write consistency with write speed. w=0 means you do not require Mongo to acknowledge receipt of the insert. w=1 the opposite. The former is obviously faster but may lose some data. using j=TRUE tells Mongo to use an on-disk journal file to persist data before writing data back to the ‘regular’ data files. This is the safest, but slowest option. You can also require that Mongo replicate the data to replica set(s) before returning, And combine these strategies e.g. >>> db.events.insert(event, j=TRUE, w=N)  [n>1] However, the chapter does not go on to suggest how one might parse such raw weblog data into something more structured. However, here’s a worked example using Pentaho. A Kettle package is available here ************************************************************************************************* Now the data has been processed (parsed) one can begin querying. >>> q_events = db.events.find({‘path’:’/apache_pb.gif’}) Would return all documents wit the apache_pb.gif value in the path field

__________________________________________

INDEXING? Be mindful about performance and indexing. >>> db_events.ensure_index(‘path’) **Be wary of the size they take up in RAM. It makes sense to index here as the entire number of ‘path’ values is small in relation to the number of documents, which curtails the space the index needs. >>>db.command(‘collstats’, ‘events’) [‘indexSizes’] will show the size of the index

__________________________________________

>>> q_events = db.events.find(‘time’:{‘$gte’:datetime(2000,10,10), ‘$lt’:datetime(2000,10,11)}) Will return documents from the events collection that occurred between October and November

>>>q_events = db_events.find({‘host’: ‘127.0.0.1’, ‘time’:{‘$gte:datetime(2000,10,10)}}) Returns events on host 127.0.0.1 on or after Oct 2000 *Note performance may be improved by a compound index. A great blog on Mongo Indexing here

Counting requests by date and page Finding requests is useful, but often the query will need summarisation. This is best done using MongoDB’s aggregation framework 

Here’s a link translating SQL queries to Aggregation Queries

In the below example, you can consider $match = WHERE; $project = SELECT and $group = GROUP BY

MongoDB’s “Aggregation Framework”

In my earlier post I built up a very simple report using the MongoDB connector in Jaspersoft iReport.

Today I noticed you can natively query MongoDB from within Jaspersoft. See more here

Some examples;

In SQL:

SELECT *
  FROM users
  WHERE age > 25
   AND   age <= 50
  ORDER BY age

When using Mongo:

{
   collectionName : 'users',
   findQuery : { 
        age: { $gt: 25, $lte: 50 } 
   },
   sort : {
        'age': 1
   }
}

See: http://docs.mongodb.org/manual/reference/sql-comparison/

An examples of using MongoDB’s ‘Aggregation Framework‘ which I think as pre-computed views/aggregations/filtered results sets

In SQL:
SELECT cust_id, ord_date, SUM(price) AS total
FROM orders
GROUP BY cust_id, ord_date
HAVING total > 250
ORDER BY total DESC

When using Mongo:

{
   runCommand: {
        aggregate : 'orders',
        pipeline : [
                {
                        $group : {
                                _id : { cust_id: '$cust_id', ord_date: '$ord_date' }
                                total: { $sum : '$price' }
                        }
                },
                { 
                        $match: { 
                                total: { $gt: 250 } } 
                        }
                },
                {
                        $sort : {
                                total : -1
                        }
                }
        ]
}

This got me curious about the aggregation frameworkHere’s another couple of examples of how it might be applied.

Each document in this collection has the following form:

{
  "_id": "10280",
  "city": "NEW YORK",
  "state": "NY",
  "pop": 5574,
  "loc": [
    -74.016323,
    40.710537
  ]
}

In these documents:

  • The _id field holds the zipcode as a string.
  • The city field holds the city.
  • The state field holds the two letter state abbreviation.
  • The pop field holds the population.
  • The loc field holds the location as a latitude longitude pair.

To Calculate States with Populations Over 10 Million

db.zipcodes.aggregate( { $group :
                         { _id : "$state",
                           totalPop : { $sum : "$pop" } } },
                       { $match : {totalPop : { $gte : 10*1000*1000 } } } )