Ch3: Data In, Data Out (inverted indexes)

In the real world, not all entities of the same type look the same. One person might have just a home telephone, another a cell number, another both, and another neither. In the RDBMS world, each of these attributes demands its own column, or has to be modelled as a KV pair, which leads to waste and redundancy. A Brazilian may have 10+ family names, a Brit just one.

The problem comes when we need to store these entities. Traditionally, this has been accomplished through an RDBMS with columns and rows.

Of course, we don’t only need to store the data; we also need to query and use it. While NoSQL solutions (e.g. MongoDB) exist that allow us to store objects as documents, they still require us to think about how we want to query our data, and which fields require an index to speed up retrieval.

In ES, all data in every field is indexed by default. Every field has a dedicated inverted index and, unlike most other DBs, ES can use all of those inverted indices in the same query.

What is an inverted index?
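
A minimal sketch of the idea, in Python: an inverted index maps each term to the set of documents that contain it, so a lookup goes from term to documents instead of scanning every document. Keeping one such index per field is what lets a single query combine several fields. The sample documents, the whitespace tokenizer and the function names are assumptions made for illustration; the real analysis and index structures in ES are far more involved.

```python
from collections import defaultdict

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: {"name": "john smith", "bio": "likes football and cricket"},
    2: {"name": "jane smith", "bio": "likes rock climbing"},
    3: {"name": "john doe", "bio": "plays cricket"},
}


def build_inverted_indexes(documents):
    """Build one inverted index per field: field -> term -> set of doc IDs."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in documents.items():
        for field, text in doc.items():
            # Naive analysis: lowercase and split on whitespace.
            for term in text.lower().split():
                indexes[field][term].add(doc_id)
    return indexes


def search(indexes, **field_terms):
    """Return IDs of docs matching every field=term pair (an AND query)."""
    matches = None
    for field, term in field_terms.items():
        # Use .get so that lookups do not create empty entries.
        ids = indexes.get(field, {}).get(term, set())
        matches = set(ids) if matches is None else matches & ids
    return matches or set()


indexes = build_inverted_indexes(docs)
print(sorted(indexes["name"]["smith"]))                     # [1, 2]
print(sorted(search(indexes, name="john", bio="cricket")))  # [1, 3]
```

The second call intersects the posting sets from two different field indexes, which is the "all of those inverted indices in the same query" point in miniature.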


Document orientation

Objects are seldom a simple list of KV pairs!

More often than not they are complex data structures that contain dates, geo locations, other objects, arrays of values, etc.

Sooner or later you’re going to want to store these objects in a database. Trying to do this with the rows and columns of a relational DB means flattening the object to fit the schema (usually one field per column) and then having to reconstruct it every time it is retrieved.
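
To make the cost concrete, here is a rough Python sketch of that round trip. The helper names `flatten` and `unflatten`, the dotted column names and the sample object are all made up for illustration; a real relational mapping would typically also need a separate table for arrays, which this sketch does not attempt.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single row of dotted column names."""
    row = {}
    for key, value in obj.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=column + "."))
        else:
            row[column] = value
    return row


def unflatten(row):
    """Rebuild the nested object from a flattened row.

    Assumes the original keys contain no dots themselves.
    """
    obj = {}
    for column, value in row.items():
        *parents, leaf = column.split(".")
        node = obj
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return obj


person = {"name": "john", "info": {"bio": "Some blurb", "age": 25}}
row = flatten(person)
print(row)  # {'name': 'john', 'info.bio': 'Some blurb', 'info.age': 25}
print(unflatten(row) == person)  # True
```

Every read pays for `unflatten` and every write for `flatten`; a document store skips both steps by keeping the object in its nested form.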

Consider this JSON document:

{
  "first name": "john",
  "last name": "smith",
  "info": {
    "bio": "Some blurb……",
    "interests": ["football", "cricket"]
  },
  "join_date": "2015/02/15"
}
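
As a sketch, the same document can be parsed into a nested structure and its inner fields addressed by a dotted path such as info.interests, which is how ES refers to fields of inner objects. The `get_path` helper is hypothetical, and the JSON text below is a cleaned-up copy of the document above, with the join date assumed to mean 15 February 2015 and placed at the top level.

```python
import json

raw = """
{
  "first name": "john",
  "last name": "smith",
  "info": {
    "bio": "Some blurb...",
    "interests": ["football", "cricket"]
  },
  "join_date": "2015/02/15"
}
"""


def get_path(doc, path):
    """Follow a dotted path such as 'info.interests' into nested dicts."""
    node = doc
    for part in path.split("."):
        node = node[part]
    return node


doc = json.loads(raw)
print(get_path(doc, "info.interests"))  # ['football', 'cricket']
print(get_path(doc, "first name"))      # john
# Serialising and parsing again gives back an equal object: nothing is flattened.
print(json.loads(json.dumps(doc)) == doc)  # True
```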

MongoDB Data Types

The last five datatypes (date, object ID, binary data, regex, and JavaScript code) are non-JSON datatypes; specifically, they are special datatypes that BSON allows you to use (marked with * below). These are a bit unusual and I doubt I’ll ever need them.
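
A quick way to see why these count as non-JSON types: Python's standard `json` module, used here only as a stand-in for any plain JSON encoder, refuses a date or raw bytes, whereas strings, numbers, booleans, arrays, objects and null pass straight through. The `is_json_native` helper and the sample values are made up for this sketch.

```python
import datetime
import json


def is_json_native(value):
    """Return True if the standard json encoder accepts the value as-is."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


samples = {
    "string": "some text",
    "integer": 1,
    "double": 3.14,
    "boolean": True,
    "array": ["MongoDB", "CouchDB", "Cassandra"],
    "object": {"Rank": 1},
    "null": None,
    "date": datetime.datetime(2015, 2, 15),  # no JSON equivalent
    "binary data": b"\x00\x01",              # no JSON equivalent
}

for name, value in samples.items():
    print(f"{name:12} JSON-native: {is_json_native(value)}")
```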

String: This commonly used datatype contains a string of text (or any other kind of
characters). This datatype is used mostly for storing text values (e.g., “Country” :

Integer (32-bit and 64-bit): This type is used to store a numerical value (e.g., { "Rank" :
1 } ). Note that the integer isn’t enclosed in quotes.

Boolean: This datatype can be set to either TRUE or FALSE.

Double: Used to store floating point values.

Min / Max keys: Used to compare a value against the lowest and highest BSON elements, respectively.

Arrays: This datatype is used to store arrays (e.g., ["MongoDB", "CouchDB", "Cassandra"]).

Timestamp: Used to store a timestamp. This can be handy for recording when a document has been modified or added.

Object: This datatype is used for embedded documents.

Null: Used to store a Null value.

Symbol: This datatype is used identically to a string (see above); however, it’s
generally reserved for languages that use a specific symbol type.

Date *: This datatype is used to store the current date or time in UNIX time format (in BSON, milliseconds since the Unix epoch).

Object ID *: This datatype is used to store the document’s ID.

Binary data *: Used to store binary data.
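
To tie the list together, the sketch below guesses which of the types above a given Python value would most naturally correspond to. It is an illustrative mapping written for these notes, not the code of any MongoDB driver; `bson_type_name` is a hypothetical helper, and the 32-bit versus 64-bit split is decided purely by the value's range.

```python
import datetime

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def bson_type_name(value):
    """Return the name of the listed type that best fits a Python value."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        if INT32_MIN <= value <= INT32_MAX:
            return "Integer (32b)"
        if INT64_MIN <= value <= INT64_MAX:
            return "Integer (64b)"
        raise OverflowError("integer does not fit in 64 bits")
    if isinstance(value, float):
        return "Double"
    if isinstance(value, str):
        return "String"
    if isinstance(value, (bytes, bytearray)):
        return "Binary data"
    if isinstance(value, datetime.datetime):
        return "Date"
    if isinstance(value, (list, tuple)):
        return "Arrays"
    if isinstance(value, dict):
        return "Object"
    if value is None:
        return "Null"
    raise TypeError(f"no matching type for {type(value).__name__}")


print(bson_type_name(1))            # Integer (32b)
print(bson_type_name(2**40))        # Integer (64b)
print(bson_type_name(True))         # Boolean
print(bson_type_name({"Rank": 1}))  # Object
```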