MapReduce links

MongoDB Aggregation Framework


Lynn, John (2013-06-17). MongoDB Aggregation Framework Principles and Examples

What is the aggregation framework?

The aggregation framework was created to provide a “means to calculate aggregated values without having to use map-reduce” according to the introduction in the MongoDB documentation at: http:// manual/ core/aggregation/# overview . Not that there’s anything wrong with using map-reduce, but it’s often viewed as a complicated way to do a simple thing – calculate aggregate values. The aggregation framework simply offers a easier way to do this simple thing.”

A simple example: US states with population over 10 million

{$ group:{ _id:” $ state”, totalPop:{ $ sum:” $ pop”}}},
{$ match:{ totalPop:{ $ gte: 10* 1000* 1000}}}

The idea here is that the pipeline operators flow results from one operator to the next . So in this example, the documents from the zipcodes collection in its entirety are presented to the $group operator to work on; And then the documents coming out of $group are presented as input to $match.

// skip 1 document, starting from the beginning of the set of documents, and then output the rest. This will output all but the first document, which may be a lot of output
printjson( db.zipcodes.aggregate( { $ skip : 1} ))

In our northwind sample collection of customer orders, notice that the line items in each order are contained in an array called orderItems. Each item in this array is a sub-document with the product category, supplier, product name, and pricing information contained in unitPrice, as well as quantity of the item ordered. A common aggregation to be performed is: calculating order total – the total for each order, based on the line items.
{$ match: {” orderId”: 10253}}
, {$ unwind: “$ orderItems”}

Now that the document for order 10253 has been exploded into 3 documents by $ unwind, a simple $ group operator can be utilized to get an order total.

db.northwind.aggregate( {$ match: {” orderId”: 10253}} , {$ unwind: “$ orderItems”} , {$ group: {_id: “$ orderId”, “OrderTotal”: {$ sum: “$ orderItems.unitPrice”

$ project
The $ project operator is used to “re-shape” the documents, inserting and deleting fields or whole sub-documents; creating computed values using existing fields or constants as input; conditionally including parts of documents, and more. Using $ project, you can do some very creative manipulation of the documents in the pipeline.



Model Data to Support Keyword Search

  • Data Models > 
  • Model Data to Support Keyword Search


MongoDB 2.4 text searc




New in version 2.4.

Searches text content stored in the text index. The text command is case-insensitive.

The text command returns all documents that contain any of the terms; i.e. it performs a logical OR search. By default, the command limits the matches to the top 100 scoring documents, in descending score order, but you can specify a different limit.

The text command has the following syntax:

db.collection.runCommand( "text", { search: <string>,
                                    filter: <document>,
                                    project: <document>,
                                    limit: <number>,
                                    language: <string> } )

The text command has the following parameters:

Field Type Description
search string A string of terms that MongoDB parses and uses to query the text index. Enclose the string of terms in escaped double quotes to match on the phrase. For further information on the search field syntax, see The search Field.
filter document Optional. A query document to further limit the results of the query using another database field. Use any valid MongoDB query in the filter document, except if the index includes an ascending or descending index field as a prefix. If the index includes an ascending or descending index field as a prefix, the filter is required and the filter query must be an equality match.
project document Optional. Limits the fields returned by the query to only those specified. By default, the _id field returns as part of the result set, unless you explicitly exclude the field in the project document.
limit number Optional. The maximum number of documents to include in the response. The textcommand sorts the results before applying the limit. The default limit is 100.
language string Optional. The language that determines the list of stop words for the search and the rules for the stemmer and tokenizer. If not specified, the search uses the default language of the index. For supported languages, see Text Search Languages. Specify the language in lowercase.
Returns: The text command returns a document that contains a field results that contains an array of the highest scoring documents, in descending order by score. See Output for details.



The complete results of the text command must fit within the BSON Document Size. Otherwise, the command will limit the results to fit within the BSON Document Size. Use the limit and the project parameters with the text command to limit the size of the result set.


  • If the search string includes phrases, the search performs an AND with any other terms in the search string; e.g. search for"\"twinkle twinkle\" little star" searches for "twinkle twinkle" and ("little" or "star").
  • text adds all negations to the query with the logical AND operator.
  • The text command ignores stop words for the search language, such as the and and in English.
  • The text command matches on the complete stemmed word. So if a document field contains the word blueberry, a search on the term blue will not match. However, blueberry or blueberries will match.



You cannot combine the text command, which requires a special text index, with a query operator that requires a different type of special index. For example you cannot combine text with the $near operator.

The search Field

The search field takes a string of terms that MongoDB parses and uses to query the text index. Enclose the string of terms in escaped double quotes to match on the phrase. Additionally, the text command treats most punctuation as delimiters, except when a hyphen - negates terms.

Prefixing a word with a hyphen sign (-) negates a word:

  • The negated word excludes documents that contain the negated word from the result set.
  • A search string that only contains negated words returns no match.
  • A hyphenated word, such as pre-market, is not a negation. The text command treats the hyphen as a delimiter.


The following examples assume a collection articles that has a text index on the field subject:

db.articles.ensureIndex( { subject: "text" } )

Search for a Single Word

db.articles.runCommand( "text", { search: "coffee" } )

This query returns documents that contain the word coffee, case-insensitive, in the indexed subject field.

Search for Multiple Words

The following command searches for bake or coffee or cake:

db.articles.runCommand( "text", { search: "bake coffee cake" } )

This query returns documents that contain either bake or coffee or cake in the indexed subject field.

Search for a Phrase

db.articles.runCommand( "text", { search: "\"bake coffee cake\"" } )

This query returns documents that contain the phrase bake coffee cake.

Exclude a Term from the Result Set

Use the hyphen (-) as a prefix to exclude documents that contain a term. Search for documents that contain the words bake or coffee but do not contain cake:

db.articles.runCommand( "text", { search: "bake coffee -cake" } )

Search with Additional Query Conditions

Use the filter option to include additional query conditions.

Search for a single word coffee with an additional filter on the about field, but limit the results to 2 documents with the highest score and return only the subject field in the matching documents:

db.articles.runCommand( "text", {
                                  search: "coffee",
                                  filter: { about: /desserts/ },
                                  limit: 2,
                                  project: { subject: 1, _id: 0 }
  • The filter query document may use any of the available query operators.
  • Because the _id field is implicitly included, in order to return only the subject field, you must explicitly exclude (0) the _id field. Within the project document, you cannot mix inclusions (i.e. <fieldA>: 1) and exclusions (i.e. <fieldB>: 0), except for the_id field.

Search a Different Language

Use the language option to specify Spanish as the language that determines the list of stop words and the rules for the stemmer and tokenizer:

db.articles.runCommand( "text", {
                                    search: "leche",
                                    language: "spanish"

See Text Search Languages for the supported languages.



Specify the language in lowercase.


The following is an example document returned by the text command:

   "queryDebugString" : "tomorrow||||||",
   "language" : "english",
   "results" : [
         "score" : 1.3125,
         "obj": {
                  "_id" : ObjectId("50ecef5f8abea0fda30ceab3"),
                  "quote" : "tomorrow, and tomorrow, and tomorrow, creeps in this petty pace",
                  "related_quotes" : [
                                       "is this a dagger which I see before me",
                                       "the handle toward my hand?"
                  "src" : {
                             "title" : "Macbeth",
                             "from" : "Act V, Scene V"
                  "speaker" : "macbeth"
   "stats" : {
               "nscanned" : 1,
               "nscannedObjects" : 0,
               "n" : 1,
               "nfound" : 1,
               "timeMicros" : 163
   "ok" : 1

The text command returns the following data:

For internal use only.

The language field returns the language used for the text search. This language determines the list of stop words and the rules for the stemmer and tokenizer.

The results field returns an array of result documents that contain the information on the matching documents. The result documents are ordered by the score. Each result document contains:

The obj field returns the actual document from the collection that contained the stemmed term or terms.

The score field for the document that contained the stemmed term or terms. The score field signifies how well the document matched the stemmed term or terms. See Control Results of Text Search with Weights for how you can adjust the scores for the matching words.

The stats field returns a document that contains the query execution statistics. The stats field contains:

The nscanned field returns the total number of index entries scanned.

The nscannedObjects field returns the total number of documents scanned.

The n field returns the number of elements in the results array. This number may be less than the total number of matching documents, i.e. nfound, if the full result exceeds the BSON Document Size.

The nfound field returns the total number of documents that match. This number may be greater than the size of the results array, i.e. n, if the result set exceeds the BSON Document Size.

The timeMicros field returns the time in microseconds for the search.

The ok returns the status of the text command.

Text Search Languages

The text index and the text command support the following languages:

  • danish
  • dutch
  • english
  • finnish
  • french
  • german
  • hungarian
  • italian
  • norwegian
  • portuguese
  • romanian
  • russian
  • spanish
  • swedish
  • turkish



If you specify a language value of "none", then the text search has no list of stop words, and the text search does not stem or tokenize the search terms.

MongoDB Applied Design Patterns – Ch.4

Operational Intelligence pp.37 Converting tx data to actionable information.

  • Of course, the starting point for any of these techniques is getting the raw tx data into your datastore.
  • Once you have the data, the 1st priority is to generate actionable reports – ideally in real time
  • Finally, we explore more traditional batch “Hierarchical” aggregation

Consider a typical Apache weblog record: – frank [10/Oct/2000:13:55:36 -0700] “GET /apache_pb.gif HTTP/1.0” 200 2326 One option might be to bung the whole data row into its own document ie the whole string in one slot. However, this is not particularly useful (hard to query) or efficient. For example, if you wanted to find events on the same page you’d have to write a nasty regex which would require a full scan of the collection. The preferred approach is to extract the relevant information into individual fields in a MongoDB document. The UTC timestamp data format stores the verbose timestamp as a meager 8 bytes, rather than the natural 28 bytes. Consider the following document that captures all the data from the log entry:


_id: Objectid(…), host: “”, logname: null, user: ‘frank’, time: ISOGetData(“2000-10-10T20:55:36Z”), request “GET /apache_pb.gif HTTP/1.0”, status: 200, request_size: 2326, referrer: “Http://……”, user_agent: “Whatever browser, O/S etc@


MongoDB has a configurable write concern which trades off write consistency with write speed. w=0 means you do not require Mongo to acknowledge receipt of the insert. w=1 the opposite. The former is obviously faster but may lose some data. using j=TRUE tells Mongo to use an on-disk journal file to persist data before writing data back to the ‘regular’ data files. This is the safest, but slowest option. You can also require that Mongo replicate the data to replica set(s) before returning, And combine these strategies e.g. >>>, j=TRUE, w=N)  [n>1] However, the chapter does not go on to suggest how one might parse such raw weblog data into something more structured. However, here’s a worked example using Pentaho. A Kettle package is available here ************************************************************************************************* Now the data has been processed (parsed) one can begin querying. >>> q_events ={‘path’:’/apache_pb.gif’}) Would return all documents wit the apache_pb.gif value in the path field


INDEXING? Be mindful about performance and indexing. >>> db_events.ensure_index(‘path’) **Be wary of the size they take up in RAM. It makes sense to index here as the entire number of ‘path’ values is small in relation to the number of documents, which curtails the space the index needs. >>>db.command(‘collstats’, ‘events’) [‘indexSizes’] will show the size of the index


>>> q_events =‘time’:{‘$gte’:datetime(2000,10,10), ‘$lt’:datetime(2000,10,11)}) Will return documents from the events collection that occurred between October and November

>>>q_events = db_events.find({‘host’: ‘’, ‘time’:{‘$gte:datetime(2000,10,10)}}) Returns events on host on or after Oct 2000 *Note performance may be improved by a compound index. A great blog on Mongo Indexing here

Counting requests by date and page Finding requests is useful, but often the query will need summarisation. This is best done using MongoDB’s aggregation framework 

Here’s a link translating SQL queries to Aggregation Queries

In the below example, you can consider $match = WHERE; $project = SELECT and $group = GROUP BY

counting the occurrence of words in a field (column)

How to do it?

I have a series of documents in a MongoDB collection, each doc has 7 columns.

Col7 is a large text field which contains the Ts and Cs, scraped from various websites.

I want to count the freqy of each word in that field, for each object.

So, for example, col 7 in document 1 may contain ‘delete’ 100 times.

Col 7 in doc 90 may contain ‘delete’ 2 times.

More to follow…


Thanks to Terradata’s Chris Hillman for the following suggestion:

…That seems like a nice MapReduce task, I’d do it like this
input Key, Value
    <documentID, DocumentText>
    tokenise the document
Output Key, Value
    <documentID concatenated with token, 1> For DocID 50 and token “plum” This would look like 50|plum, 1
input Key, Value
    <documentID concatenated with token, 1>
    sum the counts by docid|token
Output Key, Value
    <documentID concatenated with token, count>
Up to you if you use the pipe d”|” delimiter for this or maybe use a tab “\t” to make the output easier to parse I’ve not used     Mongo so I don’t know what format it accepts
For the gold star solution use a “combiner” in the Map step to do a pre-reduce on the docid|token keys and reduce I/O
Basically the same as standard word count but with the docid concatenated on the token



What is mongoSQL?

mongoSQL is a free Mac, Windows, Linux/Unix and Web client UI application that talks directly to MongoDB by MongoDB, Inc.  It does not require any other server software other than MongoDB.  SQL scripts can be created, executed, saved and opened from the UI.  When executed, the SQL is translated behind the scenes into the appropriate MongoDB API calls.  Data can be inserted, updated, deleted and selected.  Selected documents are returned and displayed.  The user only has to work with familiar SQL syntax and not with the JSON and/or Javascript that the MongoDB shell and APIs require.  (Although for more expert MongoDB users, the SQL can be dissected and the corresponding JSON and/or Javascript will be displayed.)


Why mongoSQL?

Using SQL to query and filter unstructured data, such as that in Big Data databases like MongoDB, is sometimes debated.  Many people believe that SQL is not really the best tool.  While that is true sometimes, for many jobs it is definitely up to the task.  It carries with it some of the same limitations as MongoDB, such as no joins.  But it supports simple and complex SQL along with aggregation that can optionally use either MongoDB’s aggregation framework or map reduce.  There are no architectural changes necessary because mongoSQL is a client UI.

The most likely users of mongoSQL are people who:

– Are more comfortable with SQL than with writing JSON / Javascript when specifying conditions, aggregations, sorts, etc.

– Do not want to commit architecturally yet to additional Big Data tools.

– Are not proficient with the available Big Data tools or do not require much of the functionality of these tools.

– Want to use a friendly SQL UI.

– Need to vet and possibly modify MongoDB data directly.

– Want to translate SQL into the necessary JSON / Javascript required by the MongoDB shell or API.


Referencing data with DBRef

pp77. & here

When Googling DBRef, i stumbled across quite a few recommendations not to use it, but think of an alternative e.g. here and here. From a quick skim, these seem to be objections based around application performance.

With DBRef, the DB ref is stored as a standard embedded object. 
{ $ref : <collectionName>, $id : <id value>[. $db : <database name>] }

Suppose I had added some data and in one collection and wanted to reference it in another? For example…

<more coming soon…>