ES: Filter types

and filter
bool filter
exists filter
geo bounding box filter
geo distance filter
geo distance range filter
geo polygon filter
geoshape filter
geohash cell filter
has child filter
has parent filter
ids filter
indices filter
limit filter
match all filter
missing filter
nested filter
not filter
or filter
prefix filter
query filter
range filter
regexp filter
script filter
term filter
terms filter
type filter

Chapter 12: Structured search

Combining Filters

SELECT product
FROM products
WHERE (price 20 OR productID = ‘XXYYZZ’)
AND (price !=30)

Bool Filter

The bool filter is comprised 3 sections

“bool” : {
“must”:          [ ],
“should”:       [ ],
“must_not”:   [ ],

-MUST: All of the clauses must match. The equivalent of AND
-SHOULD: At least ONE of the clauses must match. The equivalent of OR
-MUST_NOT: All of the clauses must NOT match. The equivalent of NOT


To replicate the preceding SQL example, we will take the two term filters that we used previously and place them inside the should clause of a bool filter, and add another clause to deal with the NOT condition:

GET /my_store/products/_search
   "query" : {
      "filtered" : { 
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}}, 
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} 
              "must_not" : {
                 "term" : {"price" : 30} 
Note that we still need to use a filtered query to wrap everything.
These two term filters are children of the bool filter, and since they are placed inside the should clause, at least one of them needs to match.
If a product has a price of 30, it is automatically excluded because it matches a must_notclause.

Our search results return two hits, each document satisfying a different clause in the bool filter:

"hits" : [
        "_id" :     "1",
        "_score" :  1.0,
        "_source" : {
          "price" :     10,
          "productID" : "XHDK-A-1293-#fJ3" 
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
          "price" :     20, 
          "productID" : "KDKE-B-9947-#kL5"
Matches the term filter for productID = "XHDK-A-1293-#fJ3"
Matches the term filter for price = 20

Chapter 8: (Sorting and) relevance

(pp117) What is Relevance?

The relevance score of each document is represented by a +ve float called _score. The higher the _score, the more relevant the document.

A query clause generates a _score for each document. How that score is calculated depends on the type of query used. 

Different queries are used for different purposes.
– A fuzzy query might determine the _score by calculating how similar the spelling of the query term is to that within documents.
– A terms query would incorporate the percentage of terms that were found.

The standard similarity algorithm used in ES is known as a term freqy/inverse document freqy or TD/IDFwhich takes the following factors into account:

-Term Freqy
-Inverse Document Freqy
-Field-length norm : How long is the field? The longer it is, the less likely it is that the words in the field will be relevant. A term appearing in a <short title> field carries more weight than the same term appearing in a long <content> field.


Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.

Ch3: Data In, Data Out (inverted indexes)

In the real world, not all entities of the same type looks the same. One person might just have a home telephone, another a cell # and another both, another none of these. In the RDBMS world, each entities demands its own column, or to be modelled as a KV pair. This leads to waste and redundancy. Brazillian people may have 10+ family names, a Brit one.

The problem comes when we need to store these entities. Traditionally, this as been accomplished through a RDBMS with columns and rows.

Of course, we don’t only need to store the data, we need to query, use the data. While noSQL solutons (eg MongoDB) exist that allow us to store objects as documents, they still require us to think about how we want to query our data, and which fields require an index to speed up retrieval.

In ES, all data in every field is indexed by default. Every field has a dedicated inverted index and unlike most other DBs, it can use all of those inverted indices in the same query

What is an inverted index?

Document orientation

Objects are seldom is simple list of KV pairs!

More often than not they are complex data structures that contain dates, geo, objects, arrays of values etc.

sooner or later you’re going to want to store these objects (in a database), trying to do this with the rows and columns of a relational DB…means flattening the object to fit the schema – usually one field per column – and then have to reconstruct it every time it is retrieved 

Consider this JSON document:

“first name”: “john”,
“last name”: “smith”,
“info”:  {
“bio”: “Some blurb……”,
“interests”: [football, cricket]
“join_date”: “2015/15/2”

Installing ES on Windows

install/update latest version of (JDK)
See my post about Tika for how to do this

Extract then start .bin file

Set up a Windows service

Windows users can configure Elasticsearch to run as a service to run in the background or start automatically at startup without any user interaction. This can be achieved through service.bat script under bin/ folder which allows one to install, remove, manage or configure the service and potentially start and stop the service, all from the command-line.


run service.bat manager
Change General to automatic


Installing MARVEL, which comes with an interactive console ‘sense’, which makes it easy to talk to ES directly from a browser.

Marvel is available as a plugin

Turn off monitoring of your local cluster
echo ‘marvel.agent.enabled: false’ >> .config/elasticsearch.yml

Running ES
./bin/elasticsearch -d to run in background as a daemon

Test it out
curl ‘http://localhost:9200/?pretty&#8217;

Book: ElasticSearch – The Definitive Guide

“Unfortunately, most DBs are astonishingly inept at extracting actionable knowledge from your data. Sure, they can filter by timestamp or exact values, but can they perform full-text search, handle synonyms, and score documents by relevance? Can they generate analytics and aggregations from the same data? Most important, can they do this in real time without big batch-processing jobs?

ES encourages you to explore and utilize your data, rather than letting it rot in a warehouse because it is too difficult to query.

(pp3) ES is more than just Lucene, it can also be described as follows:

  • A distributed real-time document srore where every field is indexed and searchable
  • A distbuted search engine with real time analytics
  • Capable of scaling to hundreds of servers and petabytes of structured and unstructured data