counting the occurrence of words in a field (column)

How to do it?

I have a series of documents in a MongoDB collection, each doc has 7 columns.

Col7 is a large text field which contains the Ts and Cs, scraped from various websites.

I want to count the freqy of each word in that field, for each object.

So, for example, col 7 in document 1 may contain ‘delete’ 100 times.

Col 7 in doc 90 may contain ‘delete’ 2 times.

More to follow…

================================================================

Thanks to Terradata’s Chris Hillman for the following suggestion:

…That seems like a nice MapReduce task, I’d do it like this
Map
input Key, Value
    <documentID, DocumentText>
Process
    tokenise the document
Output Key, Value
    <documentID concatenated with token, 1> For DocID 50 and token “plum” This would look like 50|plum, 1
Reduce
input Key, Value
    <documentID concatenated with token, 1>
Process
    sum the counts by docid|token
Output Key, Value
    <documentID concatenated with token, count>
Up to you if you use the pipe d”|” delimiter for this or maybe use a tab “\t” to make the output easier to parse I’ve not used     Mongo so I don’t know what format it accepts
For the gold star solution use a “combiner” in the Map step to do a pre-reduce on the docid|token keys and reduce I/O
Basically the same as standard word count but with the docid concatenated on the token
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s