Hadoop – Hardware 101

The term “commodity hardware” is widely cited. However, the spec that actually gets referred to as “commodity” is rather more substantial than the name suggests. (This doesn’t sound much like the romantic notion of dirt-cheap infrastructure to me!)

SATA            Data Transfer Rate
Version         Gbits/sec   MBytes/sec   Year
1.0 (I)            1.5          150      2001
2.0 (II, 3G)       3.0          300      2004
3.0 (III, 6G)      6.0          600      2009
3.2 (Express)     16.0         1969      2013

More on RAID here

Using RAID on the DataNode filesystems used to store HDFS content is a bad idea, because HDFS already has replication and error-checking built in. RAID is, however, strongly recommended on the NameNode for additional security (the NameNode uses its disks to durably store metadata about the filesystem).

Topology: All of the master and slave nodes must be able to open connections to each other. Client nodes need to be able to talk to all of the master and slave nodes.
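As a quick sanity check of that topology, something like the one-liner below (run from a client node) will confirm each node is reachable. The hostnames are hypothetical and 8020 is just a commonly used NameNode RPC port; substitute your own cluster's addresses and ports.

$ for host in master slave1 slave2; do nc -zv "$host" 8020; done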

MongoDB Utils

So, I managed to

a) Clone the SD cards with a successful MongoDB installation
b) Reformat the WinXP Dell to Debian 6.0 and install MongoDB
c) Plug everything into a new 8-port ethernet switch
d) Install PuTTY

…Which should mean I have at least 5 nodes in an available cluster (+ VMs on the Dell). I was just about to install MongoVue on the Dell (which is now becoming the master/control machine) to connect to each node and browse the collections made on each device – only to find out it runs only on Windows!

While browsing to see if there was a Linux version of MongoVue, I stumbled across this excellent blog summary of what’s out there:

http://timgourley.com/2010/03/16/tuesday-night-tech-mongodb-ui-edition.html

This tool looks pretty cool for administering clusters. I’ll give it a shot.

DavidHows @10Gen also recommended that I install MMS. This is an agent (a Python daemon) which polls the internal performance metrics out of your Mongo instances. It’s free and a great help if you’re trying to diagnose performance issues. This should be useful for me, as I’m interested in how ETL performance (throughput) varies with the number of nodes, how the data is sharded, the size and type of SD cards, etc.

http://www.10gen.com/products/mms
By default, the MMS dashboard displays 9 metrics:

  • Op Counters – Count of operations executed per second
  • Memory – Amount of data MongoDB is using
  • Lock Percent – Percent of time spent in write lock
  • Background Flush – Average time to flush data to disk
  • Connections – Number of current open connections to MongoDB
  • Queues – Number of operations waiting to run
  • Page Faults – Number of page faults to disk
  • Replication – Oplog length (for primary) and replication delay to primary (on secondary)
  • Journal – Amount of data written to journal
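If you just want a quick look at many of these same counters without the agent, mongostat (which ships with MongoDB) prints a rolling, per-second view at the terminal. A sketch, using the master example from elsewhere in this post:

# prints ops/sec, flushes, faults, lock %, connections, queues, etc.
$ mongostat --host localhost:10000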

 

Successful Debian (6.0) installation

It’s nice working on a 42″ plasma TV in the living room(!), where the internet router is situated. I managed to get a DVI-to-HDMI cable for the old Dell Dimension (which means no more bulky old Dell monitor).

Installed a Debian image onto the previously WinXP Dell, using Unetbootin and a USB drive.

Fun with .tar files

I now have 5 x R Pi with 32GB SD cards and MongoDB installed on Raspbian O/S, plus MongoDB installed on the old Dell.

I have also installed Oracle VirtualBox, which lets you create virtual machines relatively easily.
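For what it’s worth, VMs can also be created straight from the terminal with VBoxManage. A minimal sketch – the VM name, memory and disk size here are just examples:

$ VBoxManage createvm --name mongo-vm1 --ostype Debian --register
$ VBoxManage createhd --filename ~/mongo-vm1.vdi --size 8192     # 8GB disk
$ VBoxManage storagectl mongo-vm1 --name SATA --add sata
$ VBoxManage storageattach mongo-vm1 --storagectl SATA --port 0 --device 0 --type hdd --medium ~/mongo-vm1.vdi
$ VBoxManage modifyvm mongo-vm1 --memory 512 --nic1 bridged --bridgeadapter1 eth0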

My new 8-port network switch should hopefully come in the post today.
Should be in business!

Don’t forget, you’ll need to configure WiFi access too! For the RPi, this was pretty helpful, as was this & this.

The Dell is so old it doesn’t have an internal wireless card. Instead, I use a WiFi dongle:
$ lsusb
Bus 001 Device 004: ID 0846:4260 NetGear, Inc. WG111v3 54 Mbps Wireless [realtek RTL8187B]

$ sudo iwlist wlan0 scan | grep 'BT'
ESSID:"BTHub3-WFF6"
ESSID:"BTHub3-F9Q7"
ESSID:"BTWiFi"
ESSID:"BTWiFi-with-FON"
ESSID:"BTOpenzone-H"
ESSID:"BTFON"
ESSID:"BTHomeHub2-GMGT"

$ man 5 interfaces
$ man 8 wpa_supplicant
$ man 8 iwconfig
$ man 8 iwlist
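For reference, a minimal /etc/network/interfaces stanza for a WPA-protected network looks something like this (the SSID and passphrase are placeholders):

# /etc/network/interfaces (see interfaces(5) above)
auto wlan0
iface wlan0 inet dhcp
    wpa-ssid "BTHub3-XXXX"
    wpa-psk  "your-passphrase-here"

$ sudo ifup wlan0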

Alternatively, take a look at (or install) these Linux GUIs to help with setting up a network connection:

  • kmanager
  • wicd

Distribution, Replica sets, Master-Slave

I’m just getting to grips with distributed database terminology (p. 130 in “MongoDB: The Definitive Guide”).

Master-Slave Replication

Master-slave replication  can be used for

  • Backup
  • Failover
  • Read scaling, and more

Read scaling could be really interesting from a BI consumption point of view, e.g. a peak of user traffic hitting the DW portal at the beginning of the day.
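One wrinkle: in this vintage of MongoDB, a slave refuses reads unless the connection opts in with setSlaveOk(). A quick sketch (the port matches the slave example below):

$ ./mongo localhost:10001
> db.getMongo().setSlaveOk()   // allow reads on this slave connection
> db.test.find()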

In MongoDB the most basic setup is to start a master node (in my case the Dell 5100) and add one or more slave nodes (in my case, the Raspberry Pis). Each slave must know the address of the master.

To start the master, run: mongod --master

To start a slave, run: mongod --slave --source master_address
(where master_address is the address of the master node just started). This is where my previous post on fixing a static IP comes in handy.

First, create a directory for the master to store data in and choose a port (10000):

$ mkdir -p ~/dbs/master
$ ./mongod --dbpath ~/dbs/master --port 10000 --master

Now, set up the slave(s), choosing a different data directory and port (if on the same/virtual machine). For any slave, you also need to specify who the master is:

$ mkdir -p ~/dbs/slave
$ ./mongod --dbpath ~/dbs/slave --port 10001 --slave --source localhost:10000

All slaves must be replicated from a master node. It is not possible to replicate from slave to slave.
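Once both are up, a quick way to confirm replication is flowing (ports as above; replcheck is just a throwaway collection name):

$ ./mongo localhost:10000
> db.replcheck.insert({ok: 1})     // write on the master
$ ./mongo localhost:10001
> db.getMongo().setSlaveOk()
> db.replcheck.find()              // read it back on the slave
> db.printSlaveReplicationInfo()   // how far behind is the slave?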

…As soon as my 32GB SD cards arrive in the post and I install MongoDB on the remaining 4 R Pis, I will give this a go!

Replica Sets

A replica set is the same as the above but with automatic failover. The biggest difference between a master-slave cluster and a replica set is that a replica set does not have a single, fixed master.

Instead, one is ‘elected’ by the cluster, and the role may move to another node if the current master becomes uncontactable.

 

There are 3 different types of node which can co-exist in a MongoDB cluster:

  1. Standard – Stores a complete, full copy of the data being replicated, takes part in the voting when the primary node is being elected and is capable of being the primary node in the cluster
  2. Passive – As above, but will never become the primary node for the set
  3. Arbiter – Participates only in voting. Does not receive any of the data being replicated and cannot become the primary node

Standard and passive nodes are configured using the priority key. A node with priority 0 is passive and will never be selected as primary.

I suppose in my case, if I had SD cards of decreasing sizes, or a mix of Model A (256 MB) and Model B (512 MB) Pis, then I could set priorities in decreasing order so the weakest node was never the master and was always selected last.
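As a sketch of how those three node types might be wired together with rs.initiate() – the set name, data paths and ports are all examples:

$ mkdir -p ~/dbs/rs1 ~/dbs/rs2 ~/dbs/arb
$ ./mongod --replSet pidemo --dbpath ~/dbs/rs1 --port 10001
$ ./mongod --replSet pidemo --dbpath ~/dbs/rs2 --port 10002
$ ./mongod --replSet pidemo --dbpath ~/dbs/arb --port 10003

$ ./mongo localhost:10001
> rs.initiate({
    _id: "pidemo",
    members: [
      {_id: 0, host: "localhost:10001", priority: 1},       // standard
      {_id: 1, host: "localhost:10002", priority: 0},       // passive: never primary
      {_id: 2, host: "localhost:10003", arbiterOnly: true}  // arbiter: votes only
    ]
  })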

Initializing a set (p. 132)

(Networking) Success!

With some help from my former colleague and current classmate Pete Griffiths, I managed to remote logon to MongoDB installed on the ‘clear’ Pi (with the 32GB SD Card & successful GitHub installation) via MongoVue installed on the Win7 laptop.

This was a breakthrough as I’m now confident my cluster will be able to see each device – which then allows me to begin spreading data across nodes aka sharding.

Node(s) visible

Here you can see a connection to ‘clear’ has been created. Clear has been given a fixed IP address of 10.0.0.2, yellow a static IP of 10.0.0.3, red .4, etc. MongoVue is running on my laptop, but is able to connect to the remote devices through the network switch.

You can see the dummy document I entered into a new collection!


Clear is plugged in to the TP-Link switch on port 3 (lit up; laptop in port 1).

I’m just waiting for my other four 32GB cards to come in the post, and then I should be able to install Mongo on the remaining R Pis and hook everything together into the network switch.

We also Linuxed the old Dell using Unetbootin, removing WinXP, which is incompatible with MongoDB. Raspbian is crafted for the puny ARM processor, so on the Dell I had to install Debian instead, which I believe(?) is from the same Linux family, so I hope it has more or less the same functionality and syntax.

Also created virtual machines on the laptop using VirtualBox. I think I’ll need these to create virtual ‘config servers’, which MongoDB requires for coordination. Config servers store all cluster metadata, most importantly the mapping from chunks to shards:
Config servers maintain the shard metadata in a config database. The config database stores the relationship between chunks and where they reside within a sharded cluster. Without a config database, the mongos instances would be unable to route queries or write operations within the cluster.

– see more here
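Pulling the pieces together, a minimal sharding sketch for this vintage of MongoDB might look like the following: one config server (in a VirtualBox VM, say), one mongos router, and the Pis as shards. All hosts, ports and the test.data collection are examples; 10.0.0.2/.3 are the static IPs given to ‘clear’ and ‘yellow’ above.

$ ./mongod --configsvr --dbpath ~/dbs/config --port 20000   # config server
$ ./mongos --configdb localhost:20000 --port 30000          # query router

$ ./mongo localhost:30000/admin
> db.runCommand({addshard: "10.0.0.2:27017"})               // 'clear'
> db.runCommand({addshard: "10.0.0.3:27017"})               // 'yellow'
> db.runCommand({enablesharding: "test"})
> db.runCommand({shardcollection: "test.data", key: {_id: 1}})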

Divide and conquer