Reading ‘Algorithms of the Intelligent Web’

Link

Just beginning ‘Taming Text – How to find, organize and manipulate it’

Getting started and taming text
Foundations of taming text
Searching
Fuzzy string matching
Identifying people, places and things
Clustering text
Classification, categorization and tagging
Building an example question answering system
Untamed text: exploring the next frontier

Link

I am broadening my research to look at Elasticsearch on an R-Pi (Raspberry Pi) cluster. An excellent installation how-to can be found here.

Link

Model Data to Support Keyword Search


Note

Keyword search is not the same as text search or full text search, and does not provide stemming or other text-processing features. See the Limitations of Keyword Indexes section for more information.

In 2.4, MongoDB provides a text search feature. See Text Search for more information.

If your application needs to query the content of a field that holds text, you can perform exact matches on the text or use $regex for regular expression pattern matches. However, for many operations on text, these methods do not satisfy application requirements.

This pattern describes one method for supporting keyword search in an application with MongoDB: store the keywords in an array in the same document as the text field. Combined with a multi-key index, this pattern can support an application’s keyword search operations.

Pattern

To add structures to your document to support keyword-based queries, create an array field in your documents and add the keywords as strings in the array. You can then create a multi-key index on the array and create queries that select values from the array.

Example

Suppose you have a collection of library volumes for which you want to provide topic-based search. For each volume, you add an array field named topics that holds as many keywords as that volume needs.

For the Moby-Dick volume you might have the following document:

{ title : "Moby-Dick" ,
  author : "Herman Melville" ,
  published : 1851 ,
  ISBN : "0451526996" ,
  topics : [ "whaling" , "allegory" , "revenge" , "American" ,
    "novel" , "nautical" , "voyage" , "Cape Cod" ]
}

You then create a multi-key index on the topics array:

db.volumes.ensureIndex( { topics: 1 } )

The multi-key index creates separate index entries for each keyword in the topics array. For example, the index contains one entry for whaling and another for allegory.

You then query based on the keywords. For example:

db.volumes.findOne( { topics : "voyage" }, { title: 1 } )
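To make the effect of the multi-key index more concrete, here is a minimal in-memory sketch in plain JavaScript. It is only an analogy, not how MongoDB implements its indexes. The function names buildKeywordIndex and findByTopic are hypothetical, and the second volume is made-up sample data.

```javascript
// Illustrative only: a toy stand-in for a multi-key index, not MongoDB internals.
const volumes = [
  { title: "Moby-Dick", topics: ["whaling", "allegory", "revenge", "voyage"] },
  // Hypothetical second volume, added only so a lookup can return two hits.
  { title: "Sample Atlas", topics: ["voyage", "maps"] }
];

// Create one index entry per keyword per document, as a multi-key index would.
function buildKeywordIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const topic of doc.topics) {
      if (!index.has(topic)) index.set(topic, []);
      index.get(topic).push(doc);
    }
  }
  return index;
}

// Exact-match lookup: no stemming, no case folding, no ranking.
function findByTopic(index, topic) {
  return index.get(topic) || [];
}

const topicIndex = buildKeywordIndex(volumes);
console.log(findByTopic(topicIndex, "voyage").map(d => d.title));
```

Because the lookup is an exact string match, a search for "Voyage" or "voyages" finds nothing in this sketch, which mirrors the limitations listed for keyword indexes.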

Note

An array with a large number of elements, such as one with several hundred or several thousand keywords, will incur greater indexing costs on insertion.

Limitations of Keyword Indexes

MongoDB can support keyword searches using specific data models and multi-key indexes; however, these keyword indexes are not sufficient or comparable to full-text products in the following respects:

Stemming. Keyword queries in MongoDB cannot parse keywords for root or related words.
Synonyms. Keyword-based search features must provide support for synonym or related queries in the application layer.
Ranking. The keyword look ups described in this document do not provide a way to weight results.
Asynchronous Indexing. MongoDB builds indexes synchronously, which means that the indexes used for keyword indexes are always current and can operate in real-time. However, asynchronous bulk indexes may be more efficient for some kinds of content and workloads.
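As a rough illustration of handling synonyms in the application layer, the sketch below expands a keyword into a query document that uses the $in operator. The synonym table, its entries, and the helper name keywordQuery are assumptions made for this example; nothing here is provided by MongoDB itself.

```javascript
// Hypothetical synonym table maintained by the application, not by MongoDB.
const synonyms = new Map([
  ["voyage", ["journey", "expedition"]],
  ["whaling", ["whale hunting"]]
]);

// Build a query document that matches the keyword or any listed synonym.
// Matching stays exact: the keyword is not stemmed or case-folded.
function keywordQuery(field, keyword, table) {
  const terms = [keyword, ...(table.get(keyword) || [])];
  return { [field]: { $in: terms } };
}

console.log(JSON.stringify(keywordQuery("topics", "voyage", synonyms)));
```

In the shell, the resulting document could be passed straight to a query such as db.volumes.find( keywordQuery( "topics", "voyage", synonyms ) ).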

Link

MongoDB 2.4 text search

text

Definition

text

New in version 2.4.

Searches text content stored in the text index. The text command is case-insensitive.

The text command returns all documents that contain any of the terms; i.e. it performs a logical OR search. By default, the command limits the matches to the top 100 scoring documents, in descending score order, but you can specify a different limit.

The text command has the following syntax:

db.collection.runCommand( "text", { search: <string>,
                                    filter: <document>,
                                    project: <document>,
                                    limit: <number>,
                                    language: <string> } )

The text command has the following parameters:

Field	Type	Description
`search`	string	A string of terms that MongoDB parses and uses to query the `text` index. Enclose the string of terms in escaped double quotes to match on the phrase. For further information on the `search` field syntax, see The search Field.
`filter`	document	Optional. A query document to further limit the results of the query using another database field. Use any valid MongoDB query in the filter document, except if the index includes an ascending or descending index field as a prefix. If the index includes an ascending or descending index field as a prefix, the `filter` is required and the `filter` query must be an equality match.
`project`	document	Optional. Limits the fields returned by the query to only those specified. By default, the `_id` field returns as part of the result set, unless you explicitly exclude the field in the project document.
`limit`	number	Optional. The maximum number of documents to include in the response. The `text` command sorts the results before applying the `limit`. The default limit is 100.
`language`	string	Optional. The language that determines the list of stop words for the search and the rules for the stemmer and tokenizer. If not specified, the search uses the default language of the index. For supported languages, see Text Search Languages. Specify the language in lowercase.

Returns:	The `text` command returns a document that contains a field `results` that contains an array of the highest scoring documents, in descending order by score. See Output for details.

Warning

The complete results of the text command must fit within the BSON Document Size. Otherwise, the command will limit the results to fit within the BSON Document Size. Use the limit and the project parameters with the text command to limit the size of the result set.

Note

If the search string includes phrases, the search performs an AND with any other terms in the search string; e.g. a search for "\"twinkle twinkle\" little star" searches for "twinkle twinkle" and ("little" or "star").
The text command adds all negations to the query with the logical AND operator.
The text command ignores stop words for the search language, such as "the" and "and" in English.
The text command matches on the complete stemmed word. So if a document field contains the word blueberry, a search on the term blue will not match. However, blueberry or blueberries will match.
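The rules in this note can be approximated with a small matcher in plain JavaScript. This is a sketch under stated assumptions, not MongoDB's implementation: it does no stemming, its stop-word list contains only the two words mentioned above, and the query shape ({ phrases, terms, negated }) and the name matchesQuery are hypothetical.

```javascript
// Toy matcher that approximates the documented AND / OR / negation rules.
// No stemming is performed; the stop-word list is a tiny illustrative subset.
const STOP_WORDS = new Set(["the", "and"]);

// Lowercase the text and split it on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// query: { phrases: [...], terms: [...], negated: [...] } (hypothetical shape)
function matchesQuery(text, query) {
  const words = tokenize(text);
  const wordSet = new Set(words);
  const normalized = " " + words.join(" ") + " ";
  const terms = query.terms
    .map(t => t.toLowerCase())
    .filter(t => !STOP_WORDS.has(t));

  // Negations are ANDed: any negated word present rejects the document.
  if (query.negated.some(t => wordSet.has(t.toLowerCase()))) return false;

  // Every phrase must appear as a run of whole words.
  const phrasesOk = query.phrases.every(p =>
    normalized.includes(" " + tokenize(p).join(" ") + " "));
  if (!phrasesOk) return false;

  // Remaining terms are ORed: at least one must match a complete word.
  if (terms.length > 0) return terms.some(t => wordSet.has(t));

  // With no terms left, only a phrase can produce a match.
  return query.phrases.length > 0;
}

const query = { phrases: ["twinkle twinkle"], terms: ["little", "star"], negated: [] };
console.log(matchesQuery("Twinkle, twinkle, little bat", query));
```

Since there is no stemmer here, a search for blueberry would not match blueberries in this sketch, whereas the text command would match it.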

Note

You cannot combine the text command, which requires a special text index, with a query operator that requires a different type of special index. For example you cannot combine text with the $near operator.

The `search` Field

The search field takes a string of terms that MongoDB parses and uses to query the text index. Enclose the string of terms in escaped double quotes to match on the phrase. Additionally, the text command treats most punctuation as delimiters, except when a hyphen - negates terms.

Prefixing a word with a hyphen sign (-) negates a word:

The negated word excludes documents that contain the negated word from the result set.
A search string that only contains negated words returns no match.
A hyphenated word, such as pre-market, is not a negation. The text command treats the hyphen as a delimiter.
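One way to picture how a search string breaks down into phrases, terms, and negated terms is the simplified parser below. It is an approximation written for this note, not MongoDB's actual tokenizer: the name parseSearch is hypothetical, the word splitting is ASCII-only for brevity, and treating every word of a negated token as negated is an assumption.

```javascript
// Simplified parser for a search string; not MongoDB's actual tokenizer.
function parseSearch(search) {
  const phrases = [];

  // Pull out double-quoted phrases first and blank them from the string.
  const rest = search.replace(/"([^"]*)"/g, (_, phrase) => {
    if (phrase.trim()) phrases.push(phrase.trim());
    return " ";
  });

  const terms = [];
  const negated = [];
  for (const token of rest.split(/\s+/).filter(Boolean)) {
    // Only a leading hyphen negates; inner hyphens are plain delimiters.
    const isNegated = token.startsWith("-");
    // ASCII-only split for brevity: other punctuation acts as a delimiter.
    const words = token.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
    if (isNegated) {
      negated.push(...words); // assumption: all words of the token are negated
    } else {
      terms.push(...words);
    }
  }
  return { phrases, terms, negated };
}

console.log(parseSearch("bake coffee -cake"));
console.log(parseSearch("pre-market"));
```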

Examples

The following examples assume a collection articles that has a text index on the field subject:

db.articles.ensureIndex( { subject: "text" } )

Search for a Single Word

db.articles.runCommand( "text", { search: "coffee" } )

This query returns documents that contain the word coffee, case-insensitive, in the indexed subject field.

Search for Multiple Words

The following command searches for bake or coffee or cake:

db.articles.runCommand( "text", { search: "bake coffee cake" } )

This query returns documents that contain either bake or coffee or cake in the indexed subject field.

Search for a Phrase

db.articles.runCommand( "text", { search: "\"bake coffee cake\"" } )

This query returns documents that contain the phrase bake coffee cake.

Exclude a Term from the Result Set

Use the hyphen (-) as a prefix to exclude documents that contain a term. Search for documents that contain the words bake or coffee but do not contain cake:

db.articles.runCommand( "text", { search: "bake coffee -cake" } )

Search with Additional Query Conditions

Use the filter option to include additional query conditions.

Search for a single word coffee with an additional filter on the about field, but limit the results to 2 documents with the highest score and return only the subject field in the matching documents:

db.articles.runCommand( "text", {
                                  search: "coffee",
                                  filter: { about: /desserts/ },
                                  limit: 2,
                                  project: { subject: 1, _id: 0 }
                                }
                      )

The filter query document may use any of the available query operators.
Because the _id field is implicitly included, in order to return only the subject field, you must explicitly exclude (0) the _id field. Within the project document, you cannot mix inclusions (i.e. <fieldA>: 1) and exclusions (i.e. <fieldB>: 0), except for the _id field.

Search a Different Language

Use the language option to specify Spanish as the language that determines the list of stop words and the rules for the stemmer and tokenizer:

db.articles.runCommand( "text", {
                                    search: "leche",
                                    language: "spanish"
                                }
                      )

See Text Search Languages for the supported languages.

Important

Specify the language in lowercase.

Output

The following is an example document returned by the text command:

{
   "queryDebugString" : "tomorrow||||||",
   "language" : "english",
   "results" : [
      {
         "score" : 1.3125,
         "obj": {
                  "_id" : ObjectId("50ecef5f8abea0fda30ceab3"),
                  "quote" : "tomorrow, and tomorrow, and tomorrow, creeps in this petty pace",
                  "related_quotes" : [
                                       "is this a dagger which I see before me",
                                       "the handle toward my hand?"
                                     ],
                  "src" : {
                             "title" : "Macbeth",
                             "from" : "Act V, Scene V"
                          },
                  "speaker" : "macbeth"
                }
      }
   ],
   "stats" : {
               "nscanned" : 1,
               "nscannedObjects" : 0,
               "n" : 1,
               "nfound" : 1,
               "timeMicros" : 163
            },
   "ok" : 1
}

The text command returns the following data:

text.queryDebugString: For internal use only.

text.language: The language field returns the language used for the text search. This language determines the list of stop words and the rules for the stemmer and tokenizer.

text.results

The results field returns an array of result documents that contain the information on the matching documents. The result documents are ordered by the score. Each result document contains:

text.results.obj: The obj field returns the actual document from the collection that contained the stemmed term or terms.

text.results.score: The score field for the document that contained the stemmed term or terms. The score field signifies how well the document matched the stemmed term or terms. See Control Results of Text Search with Weights for how you can adjust the scores for the matching words.

text.stats

The stats field returns a document that contains the query execution statistics. The stats field contains:

text.stats.nscanned: The nscanned field returns the total number of index entries scanned.

text.stats.nscannedObjects: The nscannedObjects field returns the total number of documents scanned.

text.stats.n: The n field returns the number of elements in the results array. This number may be less than the total number of matching documents, i.e. nfound, if the full result exceeds the BSON Document Size.

text.stats.nfound: The nfound field returns the total number of documents that match. This number may be greater than the size of the results array, i.e. n, if the result set exceeds the BSON Document Size.

text.stats.timeMicros: The timeMicros field returns the time in microseconds for the search.

text.ok: The ok field returns the status of the text command.
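As a small sketch of how an application might consume this output, the helper below collects the matched documents with their scores and flags the case where n is smaller than nfound. The helper name summarizeTextResult and its return shape are hypothetical, and the sample object is a trimmed copy of the example output above.

```javascript
// Summarize a result document returned by the text command.
function summarizeTextResult(result) {
  if (result.ok !== 1) {
    throw new Error("text command did not report ok: 1");
  }
  return {
    // Results arrive ordered by score, highest first.
    documents: result.results.map(r => ({ score: r.score, doc: r.obj })),
    // n < nfound may mean the result set was cut to fit the BSON Document Size.
    fewerThanFound: result.stats.n < result.stats.nfound
  };
}

// Trimmed copy of the example output, kept small for illustration.
const sample = {
  language: "english",
  results: [{ score: 1.3125, obj: { speaker: "macbeth" } }],
  stats: { nscanned: 1, nscannedObjects: 0, n: 1, nfound: 1, timeMicros: 163 },
  ok: 1
};

console.log(summarizeTextResult(sample));
```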

Text Search Languages

The text index and the text command support the following languages:

danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
portuguese
romanian
russian
spanish
swedish
turkish

Note

If you specify a language value of "none", then the text search has no list of stop words, and the text search does not stem or tokenize the search terms.
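Because the language must be given in lowercase and must come from the list above, an application could normalize and check it before building the command options. The sketch below is one possible way to do that; the helper name textOptions is hypothetical, and it only builds the options document rather than talking to a server.

```javascript
// Languages listed as supported by the text index and the text command in 2.4.
const SUPPORTED_LANGUAGES = [
  "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
  "italian", "norwegian", "portuguese", "romanian", "russian", "spanish",
  "swedish", "turkish"
];

// Build the options document for the text command, lowercasing the language.
function textOptions(search, language) {
  const options = { search };
  if (language !== undefined) {
    const lang = language.toLowerCase();
    // "none" is allowed: it disables stop words, stemming and tokenizing.
    if (lang !== "none" && !SUPPORTED_LANGUAGES.includes(lang)) {
      throw new Error("unsupported text search language: " + language);
    }
    options.language = lang;
  }
  return options;
}

console.log(textOptions("leche", "Spanish"));
```

In the shell, the result could then be used as db.articles.runCommand( "text", textOptions( "leche", "Spanish" ) ).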

Link

C:\Users\shayes\Documents\Dundee\Project\Write Up\8122013.docx

Link

MongoSQL

What is mongoSQL?

mongoSQL is a free Mac, Windows, Linux/Unix and Web client UI application that talks directly to MongoDB by MongoDB, Inc. It does not require any other server software other than MongoDB. SQL scripts can be created, executed, saved and opened from the UI. When executed, the SQL is translated behind the scenes into the appropriate MongoDB API calls. Data can be inserted, updated, deleted and selected. Selected documents are returned and displayed. The user only has to work with familiar SQL syntax and not with the JSON and/or Javascript that the MongoDB shell and APIs require. (Although for more expert MongoDB users, the SQL can be dissected and the corresponding JSON and/or Javascript will be displayed.)

Why mongoSQL?

Using SQL to query and filter unstructured data, such as that in Big Data databases like MongoDB, is sometimes debated. Many people believe that SQL is not really the best tool. While that is true sometimes, for many jobs it is definitely up to the task. It carries with it some of the same limitations as MongoDB, such as no joins. But it supports simple and complex SQL along with aggregation that can optionally use either MongoDB’s aggregation framework or map reduce. There are no architectural changes necessary because mongoSQL is a client UI.

The most likely users of mongoSQL are people who:

– Are more comfortable with SQL than with writing JSON / Javascript when specifying conditions, aggregations, sorts, etc.

– Do not want to commit architecturally yet to additional Big Data tools.

– Are not proficient with the available Big Data tools or do not require much of the functionality of these tools.

– Want to use a friendly SQL UI.

– Need to vet and possibly modify MongoDB data directly.

– Want to translate SQL into the necessary JSON / Javascript required by the MongoDB shell or API.

Link

Robomongo

A nice-looking wrapper for MongoDB. Also comes in a Linux flavour.

Link

BigData vs RDF & SPARQL

I’d be interested to hear your thoughts about RDF and whether it is a viable approach to semi-structured data and the need for schema-free datastores.

What Do RDF and SPARQL bring to Big Data Projects?

Bob DuCharme

Big Data. March 2013: 38-41.


Stuart Hayes (<3 data)

All things (big)data & analytics

Blog Archives
