(pp117) What is Relevance?
The relevance score of each document is represented by a positive floating-point number called _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document. How that score is calculated depends on the type of query used.
Different queries are used for different purposes.
– A fuzzy query might determine the _score by calculating how similar the spelling of the query term is to the spelling of the terms found in each document.
– A terms query would incorporate the percentage of terms that were found.
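The standard similarity algorithm used in ES is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:
-Term frequency : How often does the term appear in the field? The more often it appears, the more relevant the document.
-Inverse document frequency : How often does the term appear in the index? The more often it appears, the less weight it carries, because common terms say little about any one document.
-Field-length norm : How long is the field? The longer it is, the less likely it is that the words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.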
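To make the field-length norm concrete, here is a minimal Python sketch. It assumes the norm is the inverse square root of the number of terms in the field, which is how Lucene's classic TF/IDF similarity is usually described. The function name and the field sizes (4 and 400 terms) are hypothetical, and the values ES actually stores are approximated, so treat this as an illustration only.

```python
import math

def field_length_norm(num_terms: int) -> float:
    """Assumed form of the norm: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)

# Hypothetical field sizes: a 4-term title and a 400-term content field.
title_norm = field_length_norm(4)
content_norm = field_length_norm(400)

# The shorter field gets the larger norm, so a match there weighs more.
print(f"title norm:   {title_norm:.3f}")
print(f"content norm: {content_norm:.3f}")
```

Under this assumption, a match in the short title field weighs ten times as much as the same match in the long content field, before term frequency and inverse document frequency are applied.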
Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in 1,000 of them. Using a base-10 logarithm, the inverse document frequency (idf) is log10(10,000,000 / 1,000) = log10(10,000) = 4. The TF/IDF weight is the product of these quantities: 0.03 * 4 = 0.12.
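The arithmetic in the cat example can be checked with a short Python sketch. It follows the textbook definitions used in the example (raw count divided by document length for tf, and a base-10 logarithm for idf, which is what makes the result come out to 4). The function names are hypothetical, and this is not the exact formula ES applies internally.

```python
import math

def term_frequency(term_count: int, total_words: int) -> float:
    """Share of the document's words that are the term."""
    return term_count / total_words

def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """Base-10 log of (all documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)

tf = term_frequency(3, 100)                          # cat: 3 of 100 words
idf = inverse_document_frequency(10_000_000, 1_000)  # cat is in 1,000 of 10M docs
weight = tf * idf

print(f"tf={tf:.2f} idf={idf:.2f} tf-idf={weight:.2f}")
```

Changing docs_with_term shows the effect of idf: the more documents contain the term, the smaller the weight becomes.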