(Re)installing Apache Tika


Install a Windows subversion client (to make life easier)

Download Tika
This will install all the required Tika packages
SVN > Checkout
URL of repository: http://svn.apache.org/repos/asf/tika/trunk

Download Java SDK
Before downloading Maven, you need the java SDK


Install it to a pathname without spaces, such as c:\j2se1.6.

Once Java is installed, you must ensure that the commands from the Java SDK are in your PATH environment variable.

Set environmental variable JAVA_HOME to installation dir
Control Panel\System and Security\System
> Advanced system settings > SYSTEM VARIABLES > NEW

Append the full path of the Java compiler to the system path
Append the string C:\jdk1.7.0_79 to the end of the system variable ‘PATH’

Download Maven


  1. navigate in explorer to the Maven directory
  2. go to a dir IN the bin
  3. copy the address in the address bar(must end with bin)
  4. go to Start and type in “env”
  5. Select “edit the system evironment variables”
  6. find the PATH variable which must also have an existing value for Java as Maven needs Java.
  7. append a ; + paste the path.
  8. restart to update system
  9. run “mvn install” in the client ???

Download the ‘binaries’? *BIN* & *SRC*

Extract to
& C:\apache-maven-3.2.5\src

Maven in PATH

You run Maven by invoking a command-line tool: mvn.bat from the bin directory of the Maven. To do this conveniently, c:\mvn3.0.4\bin must be in your PATH, just like the J2SE SDK commands. You can add directories to your PATH in the control panel; the details vary by Windows version.

Add M2_HOME to sys vars

Update PATH variable, append Maven bin folder – %M2_HOME%\bin, so that you can run the Maven’s command everywhere.

Check install using
mvn –version
Should echo back windows etc.

Go to the src dir
Run mvn install


Now to connect Tika with Pentaho!


Scala Actions

first – returns the first element of the RDD (typically a file header)

take – returns a given number of records (like *nix HEAD)

collect – returns the whole dataset

Actions, by default, return results to the local process. The saveAsTextFile actions saves the contents of an RDD to persistent storage, such as HDFS
-This action creates a directory & writes out each partition as a file within it.

foreach println – prints out each value in the array on its own line

foreach printlin is an example of functional programming, where one function (foreach) is passed as an argument to another function (println)

The Spark Programming Model

Spark programming usually starts with a dataset or few, usually residing in some form of distributed, persistent storage like HDFS. Writing a spark program usually consists of a few related steps

  • Defining a set of transformations on input data sets
  • Invoking actions that output the transformed datasets to persistent storage or return results (to local memory)
  • Running local computations that operate on the results computed in a distributed fashion – these can help you decide what the next transformations and actions could be

Spark functions revolve around storage and execution

SCALA uses type inference

Whenever we create a new variable in Scala, we must preface it with either val or var
– val are immutable and cannot be changed to refer to another value once assigned
-var can be changed to refer to different objects of the same type.

Advanced Analytics with Apache Spark: What is Spark?

The challenges of Data Science

The hard truths:

First, the majority of work that goes into conducting successful analyses lie in preprocessing data. Data is messy, and cleansing, munging, fusing, mushing and many other verbs are prerequisites for doing anything useful with it…A typical data (processing) pipeline requires spending far more time in feature engineering and selection than in choosing and writing algorithms. The Data Scientist must choose from a wide variety of potential features (data). Each comes with its own challenges in converting to vectors fit for machine learning algorithms. A system needs to support more flexible transformations than turning a 2-d array (tuple) of doubles into a mathematical model.

Second, iteration. Models typically require multiple passes/scans over the same data to reach convergence. Iteration is also crucial within the Data Scientist’s own workflow, usually the results of a query inform the next query that should run. Feature and algorithm selection, significance testing and finding the right hyper-parameters all require multiple experiments.

Third, implementation. Models need to be operationalised and become part of a production service and may need to be rebuilt periodically or even in real time – analytics in the ‘lab’ vs analytics in the ‘factory’. Exploratory analytics usually occurs in e.g. R, whereas production applications’ data pipelines are written in C++ or Java.

Enter Apache Spark

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast toHadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications.[1] By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms.[2]

Spark is an open-source framework that combines an engine for distributed programs across clusters of machines with an elegant model for writing programs atop it.

Word Count

In this SCALA example, we build a dataset of (String, Int) pairs called counts and then save it to a file:

val file = spark.textFile(“hdfs://…”)
val counts = file.flatMap(line => line.split(” “))
.map(word => (word, 1))
.reduceByKey(_ + _)

One way to understand Spark is to compare it to its predecessor, MapReduce. Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in 3 important ways:

First, rather than relying on a rigid map-then-reduce format, Spark’s engine can execture a more general directed acyclic graph (DAG) of operators. 

This means that (rather than MapReduce writing out intermediate results to the distributed filesystem), Spark can pass them directly to the next step in the pipeline.

Secondly, Spark has a rich set of transformations that enable users to express computation more naturally. Its streamlined API can represent complex pipelines in a few lines of code.

Thirdly, Spark leverages in-memory processing (rather than distributed disk). Developers can materialise any point in a processing pipeline into memory across the cluster, meaning that future steps that want to deal with the same dataset need not recompute it from disk.

Perhaps most importantly, Spark’s collapsing of the full pipeline, from preprocessing to model evaluation, into a single programming environment can dramatically speed up productivity. By packaging an expressive programming model with a set of analytic libraries under REPL it avoids round trips and moving data back and forth. Spark spans the gap between lab and factory analytics, Spark is better at being an operational system than most exploratory systems and better for data exploration that the technologies commonly used in operational systems. Spark also has tremendous versatility offering functionality like SQL, being able to ingest data from streaming services like Flume & Kafka, it has machine learning tools and graph processing capabilities.

Building Microservices

Reading this

Ch1. Microservices
Ch2. The Evolutionary Architect
Ch3. How to Model Services
Ch4. Integration
Ch5. Splitting the Monolith
Ch6. Deployment
Ch7. Testing
Ch8. Monitoring
Ch9. Security
Ch10. Conway’s Law & System Design
Ch11. Microservices at scale
Ch12. Bringing It All Together

I’ve learned something today as I’d never heard of “Conway’s Law”. Here’s a list of Google results

Conway states: Conway’s law was not intended as a joke or a Zen koan, but as a valid sociological observation. It is a consequence of the fact that two software modules A and B cannot interface correctly with each other unless the designer and implementer of A communicates with the designer and implementer of B. Thus the interface structure of a software system necessarily will show a congruence with the social structure of the organization that produced it.

Wiki: Conway’s law is an adage named after computer programmer Melvin Conway, who introduced the idea in 1968; it was first dubbed Conway’s law by participants at the 1968 National Symposium on Modular Programming.[1] It states that

organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations

—M. Conway[2]

A nice link, which took me to another nice link, which provides a nice, succint definition (same author as the book)

In short, the microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare mininum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.

Which arrived me at another link

To start explaining the microservice style it’s useful to compare it to the monolithic style: a monolithic application built as a single unit. Enterprise Applications are often built in three main parts: a client-side user interface (consisting of HTML pages and javascript running in a browser on the user’s machine) a database (consisting of many tables inserted into a common, and usually relational, database management system), and a server-side application. The server-side application will handle HTTP requests, execute domain logic, retrieve and update data from the database, and select and populate HTML views to be sent to the browser. This server-side application is a monolith – a single logical executable[2]. Any changes to the system involve building and deploying a new version of the server-side application.

{Here endeth today’s lesson. More tomorrow}

ES: Filter types

and filter
bool filter
exists filter
geo bounding box filter
geo distance filter
geo distance range filter
geo polygon filter
geoshape filter
geohash cell filter
has child filter
has parent filter
ids filter
indices filter
limit filter
match all filter
missing filter
nested filter
not filter
or filter
prefix filter
query filter
range filter
regexp filter
script filter
term filter
terms filter
type filter

Chapter 12: Structured search

Combining Filters

SELECT product
FROM products
WHERE (price 20 OR productID = ‘XXYYZZ’)
AND (price !=30)

Bool Filter

The bool filter is comprised 3 sections

“bool” : {
“must”:          [ ],
“should”:       [ ],
“must_not”:   [ ],

-MUST: All of the clauses must match. The equivalent of AND
-SHOULD: At least ONE of the clauses must match. The equivalent of OR
-MUST_NOT: All of the clauses must NOT match. The equivalent of NOT


To replicate the preceding SQL example, we will take the two term filters that we used previously and place them inside the should clause of a bool filter, and add another clause to deal with the NOT condition:

GET /my_store/products/_search
   "query" : {
      "filtered" : { 
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}}, 
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} 
              "must_not" : {
                 "term" : {"price" : 30} 
Note that we still need to use a filtered query to wrap everything.
These two term filters are children of the bool filter, and since they are placed inside the should clause, at least one of them needs to match.
If a product has a price of 30, it is automatically excluded because it matches a must_notclause.

Our search results return two hits, each document satisfying a different clause in the bool filter:

"hits" : [
        "_id" :     "1",
        "_score" :  1.0,
        "_source" : {
          "price" :     10,
          "productID" : "XHDK-A-1293-#fJ3" 
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
          "price" :     20, 
          "productID" : "KDKE-B-9947-#kL5"
Matches the term filter for productID = "XHDK-A-1293-#fJ3"
Matches the term filter for price = 20

Chapter 8: (Sorting and) relevance

(pp117) What is Relevance?

The relevance score of each document is represented by a +ve float called _score. The higher the _score, the more relevant the document.

A query clause generates a _score for each document. How that score is calculated depends on the type of query used. 

Different queries are used for different purposes.
– A fuzzy query might determine the _score by calculating how similar the spelling of the query term is to that within documents.
– A terms query would incorporate the percentage of terms that were found.

The standard similarity algorithm used in ES is known as a term freqy/inverse document freqy or TD/IDFwhich takes the following factors into account:

-Term Freqy
-Inverse Document Freqy
-Field-length norm : How long is the field? The longer it is, the less likely it is that the words in the field will be relevant. A term appearing in a <short title> field carries more weight than the same term appearing in a long <content> field.


Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
Source: http://www.tfidf.com/

Ch3: Data In, Data Out (inverted indexes)

In the real world, not all entities of the same type looks the same. One person might just have a home telephone, another a cell # and another both, another none of these. In the RDBMS world, each entities demands its own column, or to be modelled as a KV pair. This leads to waste and redundancy. Brazillian people may have 10+ family names, a Brit one.

The problem comes when we need to store these entities. Traditionally, this as been accomplished through a RDBMS with columns and rows.

Of course, we don’t only need to store the data, we need to query, use the data. While noSQL solutons (eg MongoDB) exist that allow us to store objects as documents, they still require us to think about how we want to query our data, and which fields require an index to speed up retrieval.

In ES, all data in every field is indexed by default. Every field has a dedicated inverted index and unlike most other DBs, it can use all of those inverted indices in the same query

What is an inverted index?