noSQL is not a panacea…

NoSQL is not a panacea
The past decade has seen a huge amount of innovation in scalable data systems.
These include large-scale computation systems like Hadoop and databases such as
Cassandra and Riak. These systems can handle very large amounts of data, but with
serious trade-offs.
Hadoop, for example, can parallelize large-scale batch computations on very large
amounts of data, but the computations have high latency. You don’t use Hadoop for
anything where you need low-latency results.

NoSQL databases like Cassandra achieve their scalability by offering you a much
more limited data model than you’re used to with something like SQL. Squeezing
your application into these limited data models can be very complex. And because the
databases are mutable, they’re not human-fault tolerant.

These tools on their own are not a panacea. But when intelligently used in conjunction
with one another, you can produce scalable systems for arbitrary data problems
with human-fault tolerance and a minimum of complexity. This is the Lambda
Architecture…

ES: Filter types

and filter
bool filter
exists filter
geo bounding box filter
geo distance filter
geo distance range filter
geo polygon filter
geoshape filter
geohash cell filter
has child filter
has parent filter
ids filter
indices filter
limit filter
match all filter
missing filter
nested filter
not filter
or filter
prefix filter
query filter
range filter
regexp filter
script filter
term filter
terms filter
type filter
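
Most of these share the same shape. As a quick sketch of my own (reusing the my_store index from the Chapter 12 notes below, so the index and field names are assumptions), a range filter wrapped in a filtered query looks like this:

GET /my_store/products/_search
{
   "query" : {
      "filtered" : {
         "filter" : {
            "range" : {
               "price" : {
                  "gte" : 20,
                  "lt"  : 40
               }
            }
         }
      }
   }
}

The other filters in the list slot into that same "filter" position, each with its own body.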

Chapter 12: Structured search

pp181.
Combining Filters

SELECT product
FROM   products
WHERE  (price = 20 OR productID = 'XXYYZZ')
  AND  (price != 30)

Bool Filter

The bool filter is composed of three sections:

{
   "bool" : {
      "must" :     [ ],
      "should" :   [ ],
      "must_not" : [ ]
   }
}

-MUST: All of the clauses must match. The equivalent of AND
-SHOULD: At least ONE of the clauses must match. The equivalent of OR
-MUST_NOT: All of the clauses must NOT match. The equivalent of NOT

_________________________________________________________________________

To replicate the preceding SQL example, we will take the two term filters that we used previously and place them inside the should clause of a bool filter, and add another clause to deal with the NOT condition:

GET /my_store/products/_search
{
   "query" : {
      "filtered" : { 
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}}, 
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} 
              ],
              "must_not" : {
                 "term" : {"price" : 30} 
              }
           }
         }
      }
   }
}
Note that we still need to use a filtered query to wrap everything.
These two term filters are children of the bool filter, and since they are placed inside the should clause, at least one of them needs to match.
If a product has a price of 30, it is automatically excluded because it matches a must_not clause.

Our search results return two hits, each document satisfying a different clause in the bool filter:

"hits" : [
    {
        "_id" :     "1",
        "_score" :  1.0,
        "_source" : {
          "price" :     10,
          "productID" : "XHDK-A-1293-#fJ3" 
        }
    },
    {
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
          "price" :     20, 
          "productID" : "KDKE-B-9947-#kL5"
        }
    }
]
The first hit matches the term filter for productID = "XHDK-A-1293-#fJ3"; the second matches the term filter for price = 20.
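
Bool filters can also nest inside one another. A sketch of AND-inside-OR on the same my_store data (the second productID is just an illustrative value), i.e. productID = "KDKE-B-9947-#kL5" OR (productID = "JODL-X-1937-#pV7" AND price = 30):

GET /my_store/products/_search
{
   "query" : {
      "filtered" : {
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"productID" : "KDKE-B-9947-#kL5"}},
                 { "bool" : {
                    "must" : [
                       { "term" : {"productID" : "JODL-X-1937-#pV7"}},
                       { "term" : {"price" : 30}}
                    ]
                 }}
              ]
            }
         }
      }
   }
}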

Chapter 8: (Sorting and) relevance

(pp117) What is Relevance?

The relevance score of each document is represented by a positive floating-point number called _score. The higher the _score, the more relevant the document.

A query clause generates a _score for each document. How that score is calculated depends on the type of query used. 

Different queries are used for different purposes.
– A fuzzy query might determine the _score by calculating how similar the spelling of the query term is to that within documents.
– A terms query would incorporate the percentage of terms that were found.

The standard similarity algorithm used in ES is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

-Term frequency: How often does the term appear in the field? The more often, the more relevant.
-Inverse document frequency: How often does the term appear across all documents in the index? The more often it appears, the less relevant it is.
-Field-length norm: How long is the field? The longer it is, the less likely it is that the words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.

Example:

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
Source: http://www.tfidf.com/
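
To see how these factors combine for a real query, ES can return the scoring breakdown alongside each hit by setting explain. A minimal sketch (my_index and the content field are made-up names here, tying back to the cat example):

GET /my_index/_search
{
   "explain" : true,
   "query" : {
      "match" : { "content" : "cat" }
   }
}

Each hit then carries an _explanation tree showing the term frequency, idf and fieldNorm values that produced its _score.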

Ch3: Data In, Data Out (inverted indexes)

In the real world, not all entities of the same type look the same. One person might have just a home telephone, another a cell #, another both, and another neither. In the RDBMS world, each of these fields demands its own column, or has to be modelled as a KV pair. This leads to waste and redundancy. Brazilian people may have 10+ family names; a Brit typically has one.

The problem comes when we need to store these entities. Traditionally, this has been accomplished through an RDBMS with columns and rows.

Of course, we don't only need to store the data, we need to query and use it. While NoSQL solutions (e.g. MongoDB) exist that allow us to store objects as documents, they still require us to think about how we want to query our data, and which fields need an index to speed up retrieval.

In ES, all data in every field is indexed by default. Every field has a dedicated inverted index and, unlike most other DBs, ES can use all of those inverted indices in the same query.

What is an inverted index?
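
In short: for each field, ES tokenises the text of every document and builds a map from each term to the list of documents containing it. A minimal sketch for two made-up docs, doc 1 = "brown fox" and doc 2 = "quick brown dog":

{
   "brown" : [1, 2],
   "dog"   : [2],
   "fox"   : [1],
   "quick" : [2]
}

A search for "brown fox" just looks up both terms and combines the doc lists, which is why every indexed field can be queried cheaply.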

Document orientation

Objects are seldom a simple list of KV pairs!

More often than not they are complex data structures that contain dates, geo, objects, arrays of values etc.

Sooner or later you're going to want to store these objects in a database. Trying to do this with the rows and columns of a relational DB means flattening the object to fit the schema (usually one field per column) and then having to reconstruct it every time it is retrieved.

Consider this JSON document:

{
   "email":      "x@y.com",
   "first_name": "john",
   "last_name":  "smith",
   "info": {
      "bio":       "Some blurb…",
      "age":       41,
      "interests": [ "football", "cricket" ]
   },
   "join_date":  "2015/02/15"
}
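
In ES you simply index the object as-is: the whole JSON document is stored and indexed, no flattening required. A sketch (the /website/user/1 index, type and ID are made-up names for illustration):

PUT /website/user/1
{
   "email":      "x@y.com",
   "first_name": "john",
   "last_name":  "smith",
   "info": {
      "bio":       "Some blurb…",
      "age":       41,
      "interests": [ "football", "cricket" ]
   },
   "join_date":  "2015/02/15"
}

A later GET /website/user/1 returns the original _source untouched, so there is no reconstruction step on the way out.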

SQL > Mongo?

I’ve previously talked about translating/moving data from Mongo to SQL, but how about the reverse of that?! 

Mongify
A data translator for moving your SQL data to MongoDB.

Mongify helps you move your data without worrying about the IDs or foreign IDs. It even allows you to embed your data into other documents.
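
As a rough illustration of what embedding means here (a hypothetical posts table with a child comments table; this is the shape of the result, not Mongify's literal output), the child rows end up nested inside the parent document:

{
   "title" : "First post",
   "body"  : "Hello world",
   "comments" : [
      { "author" : "anne", "text" : "Nice post" },
      { "author" : "bob",  "text" : "+1" }
   ]
}

The SQL foreign key from comments to posts disappears; the relationship is expressed by nesting instead.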

 

MapReduce links

http://wiki.summercode.com/mongodb_aggregation_functions_and_ruby_map_reduce_basics

https://github.com/shimondoodkin/nodejs-mongodb-app/wiki/explanation-of-map-reduce

http://www.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/#comment-117

http://cookbook.mongodb.org/patterns/pivot/

http://stackoverflow.com/questions/9337343/mongodb-map-reduce-tutorial

(Example) Lambda Architecture

[Figure: Lambda Architecture diagram]

Batch layer
   Hadoop
   Cassandra (as storage engine)
   MongoDB (as storage engine)
Serving layer
   ElephantDB
Speed layer
   Storm
   Fast retrieval K/V DB?


MongoSQL

What is mongoSQL?

mongoSQL is a free Mac, Windows, Linux/Unix and Web client UI application that talks directly to MongoDB (the database from MongoDB, Inc.). It does not require any other server software other than MongoDB. SQL scripts can be created, executed, saved and opened from the UI. When executed, the SQL is translated behind the scenes into the appropriate MongoDB API calls. Data can be inserted, updated, deleted and selected. Selected documents are returned and displayed. The user only has to work with familiar SQL syntax and not with the JSON and/or Javascript that the MongoDB shell and APIs require. (Although for more expert MongoDB users, the SQL can be dissected and the corresponding JSON and/or Javascript will be displayed.)
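
For example (my own sketch of the kind of translation involved, not mongoSQL's literal output): SELECT * FROM products WHERE price > 20 ORDER BY price boils down to a find() against the products collection with the query document below, plus { "price" : 1 } as the sort document:

{
   "price" : { "$gt" : 20 }
}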

 

Why mongoSQL?

Using SQL to query and filter unstructured data, such as that in Big Data databases like MongoDB, is sometimes debated.  Many people believe that SQL is not really the best tool.  While that is true sometimes, for many jobs it is definitely up to the task.  It carries with it some of the same limitations as MongoDB, such as no joins.  But it supports simple and complex SQL along with aggregation that can optionally use either MongoDB’s aggregation framework or map reduce.  There are no architectural changes necessary because mongoSQL is a client UI.
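
On the aggregation side, a GROUP BY such as SELECT productID, SUM(price) FROM products GROUP BY productID maps (again, my own illustration rather than mongoSQL's exact output) onto an aggregation-framework pipeline like:

[
   { "$group" : {
      "_id"   : "$productID",
      "total" : { "$sum" : "$price" }
   } }
]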

The most likely users of mongoSQL are people who:

– Are more comfortable with SQL than with writing JSON / Javascript when specifying conditions, aggregations, sorts, etc.

– Do not want to commit architecturally yet to additional Big Data tools.

– Are not proficient with the available Big Data tools or do not require much of the functionality of these tools.

– Want to use a friendly SQL UI.

– Need to vet and possibly modify MongoDB data directly.

– Want to translate SQL into the necessary JSON / Javascript required by the MongoDB shell or API.