Chapter 8: (Sorting and) relevance

(pp117) What is Relevance?

The relevance score of each document is represented by a positive floating-point number called _score. The higher the _score, the more relevant the document.

A query clause generates a _score for each document. How that score is calculated depends on the type of query used. 

Different queries are used for different purposes.
– A fuzzy query might determine the _score by calculating how similar the spelling of the query term is to that within documents.
– A terms query would incorporate the percentage of terms that were found.
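
As a rough illustration of how different query types can arrive at different scores, here is a toy Python sketch. Neither function is what ES actually computes: fuzzy_score and terms_score are hypothetical names, and difflib's similarity ratio is only a stand-in for whatever spelling-distance measure a real fuzzy query uses.

```python
from difflib import SequenceMatcher


def fuzzy_score(query_term: str, doc_terms: list[str]) -> float:
    """Toy fuzzy score: best spelling similarity (0.0 to 1.0) between
    the query term and any single term in the document."""
    if not doc_terms:
        return 0.0
    return max(SequenceMatcher(None, query_term, t).ratio() for t in doc_terms)


def terms_score(query_terms: list[str], doc_terms: list[str]) -> float:
    """Toy terms score: fraction of the query terms found in the document."""
    if not query_terms:
        return 0.0
    doc_set = set(doc_terms)
    found = sum(1 for t in query_terms if t in doc_set)
    return found / len(query_terms)


doc = ["quick", "brown", "fox"]
print(fuzzy_score("quikc", doc))                   # misspelled term still scores above zero
print(terms_score(["quick", "red", "fox"], doc))   # two of the three query terms match
```

The point is only that the same document gets a different _score depending on which kind of question the query asks.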

The standard similarity algorithm used in ES is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

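- Term frequency: How often does the term appear in the field? The more often, the more relevant.
- Inverse document frequency: How often does the term appear in the index? The more often, the less relevant.
- Field-length norm: How long is the field? The longer it is, the less likely it is that the words in the field will be relevant. A term appearing in a short <title> field carries more weight than the same term appearing in a long <content> field.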
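
The three factors can be sketched in Python as below. The formulas are the ones usually given for Lucene's classic TF/IDF similarity (square root of the term count, 1 + ln(numDocs / (docFreq + 1)), and 1 / sqrt(number of terms in the field)). These notes do not spell them out, so treat them as an assumption. The function names are hypothetical, and real scoring combines further factors that are left out here.

```python
import math


def term_frequency(freq: int) -> float:
    # More occurrences of the term in the field -> higher weight.
    return math.sqrt(freq)


def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    # Terms found in many documents of the index -> lower weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms: int) -> float:
    # Longer fields -> lower weight for each term they contain.
    return 1.0 / math.sqrt(num_terms)


def term_weight(freq: int, num_docs: int, doc_freq: int, num_terms: int) -> float:
    # Simplified weight of one term in one field: the product of the three factors.
    return (
        term_frequency(freq)
        * inverse_document_frequency(num_docs, doc_freq)
        * field_length_norm(num_terms)
    )


# Same term, same index statistics, but a 5-term title vs. a 500-term content field.
in_title = term_weight(freq=1, num_docs=1000, doc_freq=10, num_terms=5)
in_content = term_weight(freq=1, num_docs=1000, doc_freq=10, num_terms=500)
print(in_title > in_content)  # the short field gives the term more weight
```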

Example:

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in one thousand of them. The inverse document frequency (idf), using a base-10 logarithm, is log10(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these two quantities: 0.03 * 4 = 0.12.
Source: http://www.tfidf.com/
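
The arithmetic in this example can be checked with a few lines of Python. This sketch uses the textbook definitions from the example itself (a raw proportion for tf and a base-10 logarithm for idf), which is not necessarily what ES computes internally. The function names are just for illustration.

```python
import math


def tf(term_count: int, total_words: int) -> float:
    # Proportion of the document's words that are the term.
    return term_count / total_words


def idf(total_docs: int, docs_with_term: int) -> float:
    # Base-10 log of how rare the term is across all documents.
    return math.log10(total_docs / docs_with_term)


tf_cat = tf(3, 100)                  # 0.03
idf_cat = idf(10_000_000, 1_000)     # log10(10,000) = 4
print(round(tf_cat * idf_cat, 2))    # 0.12
```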

Book: ElasticSearch – The Definitive Guide

http://www.amazon.com/Elasticsearch-Definitive-Guide-Clinton-Gormley/dp/1449358543/ref=sr_1_fkmr0_1?ie=UTF8&qid=1424028696&sr=8-1-fkmr0&keywords=elastic+search+book+the+definitive+guide

“Unfortunately, most DBs are astonishingly inept at extracting actionable knowledge from your data. Sure, they can filter by timestamp or exact values, but can they perform full-text search, handle synonyms, and score documents by relevance? Can they generate analytics and aggregations from the same data? Most important, can they do this in real time without big batch-processing jobs?

ES encourages you to explore and utilize your data, rather than letting it rot in a warehouse because it is too difficult to query.”

(pp3) ES is more than just Lucene. It can also be described as follows:

  • A distributed real-time document store where every field is indexed and searchable
  • A distributed search engine with real-time analytics
  • Capable of scaling to hundreds of servers and petabytes of structured and unstructured data