DATA IS RAW
When designing your (big data) system, you want to be able to answer as many questions as possible. If you can, store the rawest information you can get your hands on – the rawer your data, the more questions you can ask of it.
Storing ‘super-atomic’ raw data is hugely valuable because you rarely know in advance all the questions you will want answered.
By keeping the rawest data possible, you maximize the ability to obtain new insights, whereas summarizing (aggregating), overwriting or deleting information limits what the data can tell you.
If the algorithm generating data is likely to change over time, store the unstructured (unprocessed) data – it can be recomputed from source as the algorithm improves.
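A minimal sketch of the point above, using hypothetical pageview events: an aggregate answers only the question it was built for, while the raw events can still answer questions nobody anticipated. The field names and data are illustrative assumptions, not from any real system.

```python
from collections import Counter

# Hypothetical raw pageview events: the rawest form we can capture.
raw_events = [
    {"user": "alice", "url": "/home", "ts": 1000},
    {"user": "bob",   "url": "/home", "ts": 1005},
    {"user": "alice", "url": "/pricing", "ts": 1010},
]

# An aggregated view answers one pre-decided question (views per URL)...
views_per_url = Counter(e["url"] for e in raw_events)

# ...but the raw events can still answer questions we never anticipated,
# e.g. unique visitors per URL – impossible to recover from the counts alone.
unique_visitors = {
    url: len({e["user"] for e in raw_events if e["url"] == url})
    for url in views_per_url
}
```

Had we stored only `views_per_url`, the unique-visitor question would be unanswerable; keeping the raw events keeps every future question open.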
DATA IS IMMUTABLE
Unlike the RDBMS/OLTP world of updates, you never update or delete data; you only add (append) more. This provides two advantages:
– human-fault tolerance: a buggy application or operator error can append bad records, but it cannot destroy or corrupt the good data already stored.
– simplicity: indexes are not required because no data objects ever need to be retrieved and updated in place. Storing a master dataset can be as simple as flat files (on S3, HDFS).
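A sketch of the append-only idea under stated assumptions: a newline-delimited JSON flat file standing in for the master dataset (the file layout, field names, and `append_event` helper are illustrative, not a real API). A mistaken record is corrected by appending a newer one, not by editing history, so the error stays auditable.

```python
import json
import os
import tempfile

def append_event(path, event):
    # Append-only: open in "a" mode; existing records are never touched.
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

def read_all(path):
    # No index needed – reading the master dataset is a sequential scan.
    with open(path) as f:
        return [json.loads(line) for line in f]

path = os.path.join(tempfile.mkdtemp(), "master.jsonl")

# A wrong email was recorded (human fault)...
append_event(path, {"user": "alice", "email": "old@example.com", "ts": 1})
# ...so we correct it by appending, never by overwriting.
append_event(path, {"user": "alice", "email": "new@example.com", "ts": 2})

# The current value is simply the latest record per key.
latest = max(
    (e for e in read_all(path) if e["user"] == "alice"),
    key=lambda e: e["ts"],
)
```

Both records survive in the file, so the mistake can always be traced and re-derived views can be rebuilt from scratch.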