Apache Tika in Action – detecting types

Tika – the filetype inspector:

p. 71: Simple type detector example

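import java.io.File;

import org.apache.tika.Tika;

public class SimpleTypeDetector {

    public static void main(String[] args) throws Exception {
        // The Tika facade bundles the default type-detection setup.
        Tika tika = new Tika();

        for (String file : args) {
            // Detect the media (MIME) type of each file named on the command line.
            String type = tika.detect(new File(file));
            System.out.println(file + ": " + type);
        }
    }
}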




Hadoop – Hardware 101

The term commodity hardware is cited. However, commodity is referred to as:
(This doesn’t sound much like the romantic notion of dirt cheap infrastructure to me!)

SATA         Data Transfer Rate
Version      Gbits/sec  MBytes/sec  Year
1.0 (I)         1.5        150      2001
2.0 (II, 3G)    3.0        300      2004
3.0 (III, 6G)   6.0        600      2009
3.2 (Express)  16.0       1969      2013
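
The MBytes/sec column is not simply Gbits/sec divided by 8. As far as I understand it, SATA 1.0 to 3.0 use 8b/10b line encoding (10 bits on the wire for every 8 bits of data), while SATA Express rides on two PCIe 3.0 lanes with 128b/130b encoding. A minimal sketch of that arithmetic follows; the class and method names are my own, not from the book.

```java
import java.util.Locale;

// Sketch: convert a SATA line rate (Gbit/s) into usable payload bandwidth (MB/s).
final class SataThroughput {

    /**
     * @param gbitPerSec  raw line rate in Gbit/s
     * @param payloadBits data bits per encoded group (8 for 8b/10b, 128 for 128b/130b)
     * @param encodedBits bits on the wire per group (10 for 8b/10b, 130 for 128b/130b)
     */
    static double megabytesPerSecond(double gbitPerSec, int payloadBits, int encodedBits) {
        double payloadBitsPerSec = gbitPerSec * 1e9 * payloadBits / encodedBits;
        return payloadBitsPerSec / 8 / 1e6; // bits -> bytes -> megabytes
    }

    public static void main(String[] args) {
        // Assumed encodings: 8b/10b for SATA 1.0-3.0, 128b/130b for SATA Express.
        System.out.printf(Locale.ROOT, "1.0: %.0f MB/s%n", megabytesPerSecond(1.5, 8, 10));     // 150
        System.out.printf(Locale.ROOT, "2.0: %.0f MB/s%n", megabytesPerSecond(3.0, 8, 10));     // 300
        System.out.printf(Locale.ROOT, "3.0: %.0f MB/s%n", megabytesPerSecond(6.0, 8, 10));     // 600
        System.out.printf(Locale.ROOT, "3.2: %.0f MB/s%n", megabytesPerSecond(16.0, 128, 130)); // 1969
    }
}
```

The lighter 128b/130b overhead is why the last row gets close to 2000 MBytes/sec rather than 1600.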

More on RAID here

Using RAID on the DataNode filesystems that store HDFS content is a bad idea, because HDFS already has replication and error-checking built in. RAID is strongly recommended on the NameNode for additional security (HDFS uses disks to durably store metadata about the FS).

Topology: All of the master and slave nodes must be able to open connections to each other. Client nodes need to be able to talk to all of the master and slave nodes.

Hadoop in Practice: Chapter 1 – Hadoop in a heartbeat

Big data brings with it two fundamental challenges: how to store and work with voluminous data sizes, and more important, how to understand data and turn it into a competitive advantage.

Hadoop is a distributed filesystem, and it offers a way to parallelize and execute programs on a cluster of machines.

Figure 1.3 – Topography
The HDFS NameNode keeps the filesystem metadata in memory, such as which DataNodes manage the blocks for each file.

HDFS clients talk to the NameNode for metadata-related activities and to the DataNodes for reading and writing files.

DataNodes communicate with each other for pipelining file reads and writes.

Files are made up of blocks, and each file can be replicated multiple times, meaning there are several identical copies of each block of the file (default replication factor = 3).
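
To get a feel for what blocks and replication mean for capacity planning, here is a rough sketch of the arithmetic. The 128 MB block size is my assumption (the default differs between Hadoop versions and is configurable), and the class and method names are made up for illustration. It also assumes the last block of a file is stored at its actual size rather than padded to a full block.

```java
// Sketch: how many blocks and replicas a file occupies in HDFS, and the raw disk used.
final class HdfsBlockMath {

    /** Number of blocks needed for a file, rounding the last partial block up. */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Total block replicas stored across the cluster. */
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    /** Raw bytes consumed on DataNode disks (assumes the last block is not padded). */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // assumed block size, not taken from the book
        int replication = 3;       // the default replication factor noted above
        long fileSize = 1024 * mb; // a 1 GB file

        System.out.println("blocks:   " + blockCount(fileSize, blockSize));                 // 8
        System.out.println("replicas: " + replicaCount(fileSize, blockSize, replication)); // 24
        System.out.println("raw MB:   " + rawBytes(fileSize, replication) / mb);            // 3072
    }
}
```

So every gigabyte written to HDFS at the default replication costs about three gigabytes of raw disk, which is one more reason RAID on the DataNodes adds little.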

Hadoop in Practice

Reading Hadoop in Practice

“It was a revelation to observe our MapReduce jobs crunching through our data in minutes. Of course, what we weren’t expecting was the amount of time that we would spend debugging and performance-tuning our MR jobs.

Not to mention the new roles we took on as production administrators-the biggest surprise in this role was the number of disk failures we encountered during those first few months supporting production [1998].

The greatest challenge we faced when working with Hadoop, and specifically MR, was learning how to (think) solve problems with it.

After one is used to thinking in MR, the next challenge is typically related to the logistics of working with Hadoop, such as how to move data in & out of HDFS.”


SQL > Mongo?

I’ve previously talked about translating/moving data from Mongo to SQL, but how about the reverse of that?! 

Mongify is a data translator for moving your SQL data to MongoDB.

Mongify helps you move your data without worrying about the IDs or foreign IDs. It even allows you to embed your data into other documents.
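
To picture what embedding means, here is a small sketch of the idea in plain Java. This is not Mongify's code or API: the `posts` and `comments` tables, the `post_id` column and the `embed` helper are all hypothetical, and rows are modelled as simple maps. Child rows are grouped by their foreign key, the foreign key is dropped, and each group is nested inside its parent document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: fold child rows into their parent rows, as an embedded list of sub-documents.
final class EmbedSketch {

    static List<Map<String, Object>> embed(List<Map<String, Object>> parents,
                                           List<Map<String, Object>> children,
                                           String parentKey, String foreignKey, String embedAs) {
        // Group child rows by the parent id they point at, dropping the foreign key.
        Map<Object, List<Map<String, Object>>> byParent = new LinkedHashMap<>();
        for (Map<String, Object> child : children) {
            Map<String, Object> copy = new LinkedHashMap<>(child);
            Object parentId = copy.remove(foreignKey);
            byParent.computeIfAbsent(parentId, k -> new ArrayList<>()).add(copy);
        }
        // Build one document per parent row with its children nested inside.
        List<Map<String, Object>> docs = new ArrayList<>();
        for (Map<String, Object> parent : parents) {
            Map<String, Object> doc = new LinkedHashMap<>(parent);
            List<Map<String, Object>> nested = byParent.get(parent.get(parentKey));
            doc.put(embedAs, nested != null ? nested : new ArrayList<Map<String, Object>>());
            docs.add(doc);
        }
        return docs;
    }

    /** Helper to build a row from alternating column names and values. */
    static Map<String, Object> row(Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put((String) kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        // Hypothetical relational data: posts, and comments with a post_id foreign key.
        List<Map<String, Object>> posts = Arrays.asList(
                row("id", 1, "title", "First"), row("id", 2, "title", "Second"));
        List<Map<String, Object>> comments = Arrays.asList(
                row("id", 10, "post_id", 1, "body", "Nice"),
                row("id", 11, "post_id", 1, "body", "Agreed"));
        for (Map<String, Object> doc : embed(posts, comments, "id", "post_id", "comments")) {
            System.out.println(doc);
        }
    }
}
```

The point of the exercise is that the foreign key disappears: once the comments live inside their post, there is nothing left to join on.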



Taming Text


Just beginning ‘Taming Text – How to find, organize and manipulate it’

  1. Getting started and taming text
  2. Foundations of taming text
  3. Searching
  4. Fuzzy string matching
  5. Identifying people, places and things
  6. Clustering text
  7. Classification, categorization and tagging
  8. Building an example question answering system
  9. Untamed text: exploring the next frontier