About sdh1974

MSc BI @ University of Dundee. After trying (somewhat unsuccessfully) to build a MongoDB cluster in the spare bedroom using extreme commodity hardware (Raspberry Pis), I am now looking at alternatives: Elasticsearch & Tika.

Data Stewardship

As an addendum to my previous article on the resources required to establish a data governance program, see Building out a Data Quality/Governance team

I came across the following article, Understanding the different types of a data steward,

written by George Firican

1.     Data object data steward

Description: This role manages reference data and attributes of one business data entity.

Synonym: Domain data steward

Example: Customer data steward. Reference data and attributes managed by this steward: company hierarchy, address, industry code, contact information, finance data.

Key characteristics and considerations:

  • The role needs to reach across functional lines and to establish a cross-department team of subject matter experts
  • One of the most common types of stewardship, yet often the most difficult to implement, especially in decentralized organizations
  • Requires strong executive sponsorship

2.     Business data steward

Description: Manages the critical data, both reference and transactional, created or used by one business function.

Synonym: Functional data steward

Example: Sales or marketing data steward, business or data analyst

Key characteristics and considerations:

  • They are key representatives in a specific business area who are responsible for the quality, use, and meaning of that data in the organization
  • One of the easiest functions to implement in a highly autonomous company
  • Effectiveness can be more easily measured by a direct business unit process metric
  • Gets challenging where the data is shared between several business units. This is where a centralized data governance organization is needed to intervene. Read more on the data governance operating models.

3.     Process data steward

Description: Manages all data across one business process.

Synonym: N/A

Example: Lean specialist

Key characteristics and considerations:

  • Works across multiple data domains
  • Needs strong cross-process governance in order to be successful
  • Often this person is part of the process improvement team
  • Interacts regularly with business unit data stewards

4.     System data steward

Description: Manages the data for one or more IT systems.

Synonym: Technical data steward

Example: Enterprise data warehouse architect, MDM practitioner

Key characteristics and considerations:

  • The best person to ask how the data is created, transformed, stored, and moved in technical systems
  • A good place to start if no formal stewardship program is in place

Building out a Data Quality/Governance team

I have been reading the excellent CDO playbook (Amazon) by Caroline Carruthers and Peter Jackson, and also thinking about Nicola Askham’s approach, as I am attending one of her DG clinics tomorrow.

Data Governance methodology

Chapter 8 – Building the CDO Team

‘You can’t do it alone…it can get lonely, you need some people to share ideas and observations!’

A trusted, data-savvy lieutenant is essential, and quickly backing this up with a team that all works together will help keep your sanity!

The Different roles

Information architects (data architects)

What would it be like to use a library without any kind of indexing system in place? The IA builds the structures & frameworks to help you get the best from your data. They understand how your company uses information and turn that into conceptual models, which you can use to understand who is responsible for your information domains & therefore who is really accountable for making decisions about sections of your information. This is the area that eventually gives you your MDM and data models.

Logical Model Design

Conceptual data model

A conceptual data model for marketing might be something like:

CUSTOMERS
PRODUCTS
CHANNELS
CAMPAIGNS

For each ‘entity’, list out the data fields,
e.g. customer would have:
- Name
- Address
- DOB
- Gender

…& so on;

Now, for each ‘entity’, what are the Critical Data Elements (CDEs)?
For example, in a regulated industry, it’s likely that age (DOB) is a vital data element
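A minimal sketch of this idea in Python (entity and attribute names are taken from the marketing example above; the CDE flags are illustrative assumptions, not a prescribed list):

```python
# Minimal sketch: representing a conceptual data model and flagging
# Critical Data Elements (CDEs). Names follow the marketing example above.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str
    is_cde: bool = False   # is this a Critical Data Element?


@dataclass
class Entity:
    name: str
    attributes: List[Attribute] = field(default_factory=list)

    def critical_data_elements(self) -> List[str]:
        return [a.name for a in self.attributes if a.is_cde]


customer = Entity("CUSTOMERS", [
    Attribute("Name"),
    Attribute("Address"),
    Attribute("DOB", is_cde=True),   # e.g. vital in a regulated industry
    Attribute("Gender"),
])

conceptual_model = [customer, Entity("PRODUCTS"), Entity("CHANNELS"), Entity("CAMPAIGNS")]

print(customer.critical_data_elements())  # ['DOB']
```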

To quote Nicola from her article ‘8 top tips for gaining stakeholder buy-in for DG’

“Creating conceptual data models is the best way to get
stakeholders engaged in your Data Governance initiative”

A sample job description for an Information Architect

Main purpose of job

  • To deliver the technical Information Architecture able to accommodate the current and future needs of the business, taking input from other business enterprise architects.
  • Define enterprise Information Architecture vision, strategy, principles and standards. Initiate and participate in projects to align the Information Architecture to meet the strategic business goals.
Key responsibilities
  • Own the definition and development of the Information Architecture framework that supports the business needs and guides the development of appropriate Information Architecture
  • Actively promote Information Architecture, facilitate connections, coordination and communications to ensure that Information Architecture reaches all appropriate parts of the business
  • Identify appropriate roadmaps for the necessary changes to process, ways of working and associated technology to manage the effective use of information in adherence with the info governance framework
  • Evaluate proposed changes to the Information Architecture such that risks are highlighted and mitigated
  • Research, evaluate, define and maintain corporate data and information policies, processes, standards, tools and repositories necessary for the successful use and / or reuse of data and information models
  • Support the maintenance of Information Architectures to reflect changes to the Information Governance framework, new requirements, emergence of technology innovations and changes to the business.
  • Provide expert consultancy advice and guidance to other architecture areas
  • Build relationships with business SMEs and stakeholders to understand their data and information landscape and business priorities
  • Define and develop Information standards, procedures and life cycles that meet our strategic needs
  • Maintain strong processes and procedures and cultivate high quality standards and best practice across information governance
Required experience

Essential

  • Extensive experience of managing data and Information Architectural development at an enterprise level
  • Substantial experience of systems analysis and requirements specification
  • Substantial experience of working with the business and at a strategic level
  • Knowledge of enterprise integration concepts and technologies
  • Ability to identify and control both technical and project / commercial risks and facilitate issue resolution, liaising with the business as appropriate
  • Understand TOGAF / DAMA principles
Desirable
  • Substantial experience of corporate process architecture development and deployment
  • Broad experience of the use of modelling, data management, simulation, work flow and end user analysis tools
  • Data management concepts and technologies (e.g. data warehouse)
  • Experience of the use of enterprise integration architecture methodology and tools
  • Information security concepts, standards, infrastructure and technology
Business and Personal Leadership/skills and attitudes
  • A strong element of creativity to shape innovative solutions combined with architectural design experience in order to develop workable solutions
  • The ability to speak confidently on technical subjects with limited preparation, commanding respect and authority
  • Strong analysis and design skills
  • Excellent communication skills and the ability to work in a fast-paced, commercially orientated environment

Information asset owners 

Tying into the work of the IA, Information Asset Owners are the people who are responsible for looking after the different information domains; they are accountable for making decisions about their particular defined areas and are experts in their field. While they don’t sit as part of a core team (they’re going to be on the business side), that doesn’t make them any less critical to have working with you.

Data Stewards

These are key data advocates, specialists, who know the most about what the organisation should be doing with its data and want to help the business get to data nirvana!

Example Job Description for (Lead) Data/Information Steward

Main purpose of job
  • Provide support to the CDO to embed the Information Management (IM) Framework across all business activities, to develop governance documents, and to oversee assurance work across all departments to establish effective adoption of the IM key principles throughout Lowell.
Key responsibilities
  • Ensure all governance documents are up to date and relevant to sustain the key Information Management principles
  • Support the CDO to embed the IM framework across the business
  • Coordinate the IM team’s efforts to provide assistance and support to the company’s lines of business to promote and embed the IM framework and Key Principles throughout the organisation.
  • Coordinate the work of, and support for, the Information Governance Champions
  • Maintain the register of information-related risks
  • Investigate the identified data problems and follow through to make sure a resolution is identified and implemented
  • Monitor the IM reporting mechanisms to ensure they are robust and operating effectively
  • Improve data quality measures and the controls to monitor data quality
Required experience
Essential
  • Experience in leading a team
  • Proven track record in defining governance framework and governance documents
  • Excellent interpersonal skills
  • Excellent communication skills
  • Knowledge and understanding of the relevant information management and data protection legislation
  • Ability to multi-task, work within tight/conflicting deadlines and prioritise work appropriately
  • Proven ability to engage successfully with cross-functional stakeholders outside of direct line management responsibilities
  • Knowledge and understanding of the importance of confidentiality and Information management and security principles
Desirable
  • Experience of developing technology transformation strategies and managing the delivery of associated technical services
Business and Personal Leadership/skills and attitudes
  • Great team worker
  • The ability to speak confidently on information and data management subjects with limited preparation, commanding respect and authority
  • Strong analysis and design skills
  • Strong problem solving and analytical skills
  • Excellent communication skills and the ability to work in a fast-paced, commercially orientated environment

Data Analysts

Both the Stewards and the Data Analysts form the team that helps improve DQ throughout the org. Data Analysts, or BAs with a data head, work closely with the Data Stewards, supporting them and learning from them. This is one of the entry-level roles within the team, requiring the right level of curiosity and analytics/querying skills, learning about the business and helping it understand the detail behind the data & how to improve the information being used. These are detail-oriented people who often go very deep to surface the root causes of data quality issues.

Information (data) champions

Technically these people don’t sit within the CDO team, but work closely with them. They normally have full-time jobs in different business areas and are experts and influential in those areas (and hence usually super busy with the ‘day job’). Work with them to understand their area, find out why it’s important for them, what data could do for them.

Project Managers

Typical PM activities!

Governance Specialists

Typically when you come in there will be a complex set of policies, standards, procedures & guidelines that govern what people need to do. Typically these policies are siloed and aren’t joined up, frequently contradicting each other and not considering the whole E2E data flow. Governance Specialists help the company build the structure that explains clearly, simply and concisely what obligations everyone in the company has regarding the treatment of data & what you want them to do about it, as well as helping to define the structure that keeps them all up to date.

Other roles

The CDO will need data engineers, people to manipulate data, build and manage the ETL/ELT; these engineers may also be/include data modellers. A BI team will also be required – these individuals may be out, embedded in the business, but their source(s) of data, methodology, governance & tooling should be derived from the core CDO team. Data Scientists will also be required to extract the high value from the data soon to be brought together and, of course, business analysts, those who can help elicit the right questions from data-hungry people out in the business.

Streaming Data

I have begun reading Streaming Data (Manning):
https://www.manning.com/books/streaming-data


What is a real-time system?

Real-time systems are classified as hard, soft & near real-time, defined by their latency and tolerance for delay. These definitions are somewhat fuzzy.

Anti-lock brakes would be hard – immediate application and critical consequences of delay. Skype would be a near-RT system.

Differences between RT & streaming systems

These systems have two parts: a computation/processing system and a consumption system(s).

DEFINITION: STREAMING DATA SYSTEM – in many scenarios the computation/processing part of the system operates in a non-hard RT fashion, but the clients may not be consuming the data in RT due to network delays, application design, or perhaps because a client application isn’t running. The clients consume data when they need it. This is a streaming data system.

Typical architectural components

Collection Tier >
Message Queuing Tier >
Analysis Tier (> Optional persistent storage) >
In-Memory Data Store >
Data Access Tier (consumers)

[See also An Architecture for Fast and General Data Processing on Large Clusters by Matei Zaharia]
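To make the tiers concrete, here is a toy, single-process sketch in plain Python (the queue and dict stand in for a real message-queuing tier such as Kafka and a real in-memory store such as Redis; the function names are my own, not from the book):

```python
# Toy sketch of the tiers above, wired together with an in-process queue and a
# dict standing in for the in-memory data store.
import queue

message_queue = queue.Queue()      # Message Queuing Tier
in_memory_store = {}               # In-Memory Data Store

def collect(events):
    """Collection Tier: accept raw events and hand them to the queue."""
    for event in events:
        message_queue.put(event)

def analyse():
    """Analysis Tier: drain the queue, aggregate, publish results to the store."""
    while not message_queue.empty():
        event = message_queue.get()
        page = event["page"]
        in_memory_store[page] = in_memory_store.get(page, 0) + 1

def read(page):
    """Data Access Tier: consumers read results whenever they need them."""
    return in_memory_store.get(page, 0)

collect([{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}])
analyse()
print(read("/home"))   # 2 -- consumed after the fact, not in hard real time
```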

Put another way:
[Ref: Databricks]

At a high level, modern distributed stream processing pipelines execute as follows:

  1. Receive streaming data from data sources (e.g. live logs, system telemetry data, IoT device data, etc.) into some data ingestion system like Apache Kafka, Amazon Kinesis, etc.
  2. Process the data in parallel on a cluster. This is what stream processing engines are designed to do, as we will discuss in detail next.
  3. Output the results out to downstream systems like HBase, Cassandra, Kafka, etc.

To process the data, most traditional stream processing systems are designed with a continuous operator model, which works as follows:

  • There is a set of worker nodes, each of which run one or more continuous operators.
  • Each continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline.
  • There are “source” operators for receiving data from ingestion systems, and “sink” operators that output to downstream systems.
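A minimal generator-based sketch of this continuous operator model (operator names and log records are illustrative; a real engine would distribute these operators across worker nodes):

```python
# Hedged sketch of the continuous operator model: each operator processes one
# record at a time and forwards it to the next operator in the pipeline.
def source(records):
    """'Source' operator: receives data from the ingestion system."""
    for record in records:
        yield record

def parse(stream):
    """Continuous operator: transform one record at a time."""
    for record in stream:
        yield record.strip().lower()

def filter_errors(stream):
    """Continuous operator: keep only error records."""
    for record in stream:
        if "error" in record:
            yield record

def sink(stream):
    """'Sink' operator: output results to a downstream system (here, stdout)."""
    for record in stream:
        print(record)

raw = ["INFO boot ok\n", "ERROR disk full\n", "error timeout\n"]
sink(filter_errors(parse(source(raw))))
```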

In-Stream Big Data Processing

Highly Scalable Blog

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that real-time query processing and in-stream processing is the immediate need in many practical applications. In recent years, this idea got a lot of traction and a whole bunch of solutions like Twitter’s Storm, Yahoo’s S4, Cloudera’s Impala, Apache Spark, and Apache Tez appeared and joined the army of Big Data and NoSQL systems. This article is an effort to explore techniques used by developers of in-stream data processing systems, trace the connections of these techniques to massive batch processing and OLTP/OLAP databases, and discuss how one unified query engine can support in-stream, batch, and OLAP processing at the same time.

At Grid Dynamics, we recently faced a necessity to build an in-stream data processing system that aimed to crunch about 8 billion events daily providing…


The Batch Layer

The goal of a BI system is to answer any question (within reason) asked of it. In the Lambda architecture, any question can be implemented as a function that takes all the data as input – unfortunately, something that consumes the whole dataset is unlikely to perform well.

In the Lambda architecture, the batch layer precomputes the master dataset into batch views so that queries can be run with low latency. This requires balancing what needs to be precomputed & what needs to be computed on the fly at execution time (rather like aggregates in a star schema); the key is to precompute just enough information to enable the query to return in an acceptable time.

The batch layer runs functions over the master dataset to precompute intermediate data called batch views. The batch views are loaded by the serving layer, which indexes them to allow rapid access to that data.

The speed layer compensates for the high latency of the batch layer by providing low-latency updates using data that has yet to be precomputed into a batch view.

(Rather like aggregates/caching in memory, with more esoteric queries going to the relational engine).

Queries are then satisfied by processing data from the serving layer views and the speed layer views, and merging the results

*You should take the opportunity to thoroughly explore the data & connect diverse pieces of data together – assuming you have a priori knowledge of the necessary ‘joined’ datasets!

A naive strategy for computing on the batch layer would be to precompute all possible queries and cache the results in the serving layer. Unfortunately you can’t always precompute everything. Consider the pageviews-over-time query as an example. If you wanted to precompute every potential query, you’d need to determine the answer for every possible range of hours for every URL. But the number of ranges of hours within a given time frame can be huge. In a one-year period, there are approximately 380 million distinct hour ranges. To precompute the query, you’d need to precompute and index 380 million values for every URL. This is obviously infeasible and an unworkable solution

…Yet this is very much achievable using OLAP tools
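A middle ground is sketched below in plain Python, using illustrative data: precompute one count per URL per hour as the batch view, and answer any hour-range query by summing the hours it covers – far fewer precomputed values than one per possible range.

```python
# Hedged sketch of a batch view: instead of precomputing every possible hour
# range per URL (infeasible, as noted above), precompute pageviews per URL per
# hour and answer a range query by summing the hours it covers.
from collections import defaultdict

# Master dataset: raw pageview events (url, hour) -- illustrative records.
master_dataset = [
    ("example.com/home", 0), ("example.com/home", 0),
    ("example.com/home", 1), ("example.com/pricing", 1),
]

# Batch layer: precompute the batch view (url, hour) -> count.
batch_view = defaultdict(int)
for url, hour in master_dataset:
    batch_view[(url, hour)] += 1

# Serving-layer query: pageviews for a URL over an arbitrary hour range.
def pageviews(url, start_hour, end_hour):
    return sum(batch_view[(url, h)] for h in range(start_hour, end_hour + 1))

print(pageviews("example.com/home", 0, 1))  # 3
```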

RDDs are the new bytecode of Apache Spark

O. Girardot

With the Apache Spark 1.3 release the Dataframe API for Spark SQL was introduced; for those of you who missed the big announcements, I’d recommend reading the article Introducing Dataframes in Spark for Large Scale Data Science from the Databricks blog. Dataframes are very popular among data scientists; personally I’ve mainly been using them with the great Python library Pandas, but there are many examples in R (originally) and Julia.

Of course, if you’re using only Spark’s core features, nothing seems to have changed with Spark 1.3: Spark’s main abstraction remains the RDD (Resilient Distributed Dataset), its API is very stable now, and everyone has used it to handle any kind of data up to now.

But the introduction of Dataframe is actually a big deal, because when RDDs were the only option to load data, it was obvious that you needed to parse your “maybe” un-structured data using RDDs, transform…

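For a concrete sense of the contrast the excerpt describes, here is a hedged PySpark sketch (it assumes PySpark is installed, uses the modern SparkSession entry point rather than the SQLContext of the Spark 1.3 era, and works on made-up data):

```python
# Hedged sketch: the same data handled as an RDD and as a DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

rows = [("alice", 34), ("bob", 19), ("carol", 27)]

# RDD: you work with opaque Python tuples and write the filtering yourself.
rdd = spark.sparkContext.parallelize(rows)
adults_rdd = rdd.filter(lambda r: r[1] >= 21).map(lambda r: r[0])
print(adults_rdd.collect())                 # ['alice', 'carol']

# DataFrame: the schema is explicit, so Spark SQL can optimise the same query.
df = spark.createDataFrame(rows, ["name", "age"])
df.filter(df.age >= 21).select("name").show()

spark.stop()
```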

the properties of (big)data

- Rawness

- Immutability

- Perpetuity

DATA IS RAW

When designing your (big data) system, you want to be able to answer as many questions as possible. If you can, you want to store the rawest information you can get your hands on – the rawer your data, the more questions you can ask of it.

Storing ‘super-atomic’ raw data is hugely valuable because you rarely know in advance all the questions you want answered.

By keeping the rawest data possible, you maximize the ability to obtain new insights, whereas summarizing (aggregating), overwriting or deleting information limits what the data can tell you.

If the algorithm generating data is likely to change over time, then store the unstructured (unprocessed) data – the data can be recomputed from source as the algorithm improves.
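A tiny illustration of that point (the event fields and the two scoring functions are made-up assumptions): because the raw events are kept, switching to an improved algorithm is just a recomputation over the same source data.

```python
# Minimal sketch: keep the raw (unprocessed) events and rebuild derived data
# whenever the algorithm changes. Fields and scoring functions are illustrative.
raw_events = [
    {"user": "a", "query": "Raspberry Pi cluster"},
    {"user": "b", "query": "elasticsearch tika"},
]

def score_v1(event):
    return len(event["query"])                 # first, naive algorithm

def score_v2(event):
    return len(event["query"].split())         # improved algorithm later on

# Switching algorithms is just a recomputation over the raw source data.
scores = [score_v2(e) for e in raw_events]
print(scores)   # [3, 2]
```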

DATA IS IMMUTABLE

Unlike the RDBMS/OLTP world of updates, you don’t update or delete data, you only add (append) more. This provides two advantages:

– human-fault tolerance

– simplicity: indexes are not required as no data objects need to be retrieved or updated. Storing a master dataset can be as simple as flat (S3, HDFS) files.
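A minimal sketch of an append-only master dataset (the file name, fields and JSON-lines format are illustrative assumptions):

```python
# Hedged sketch of an immutable, append-only master dataset: facts are only
# ever appended (here as JSON lines to a local file); corrections are new
# facts rather than updates in place.
import json
import time

MASTER_FILE = "master_dataset.jsonl"

def record_fact(fact: dict) -> None:
    """Append a timestamped fact; nothing is ever updated or deleted."""
    fact = {**fact, "recorded_at": time.time()}
    with open(MASTER_FILE, "a") as f:
        f.write(json.dumps(fact) + "\n")

# A mistaken address is fixed by appending a newer fact, not by overwriting:
record_fact({"customer_id": 42, "address": "1 Old Street"})
record_fact({"customer_id": 42, "address": "2 New Road"})

# Readers take the latest fact per customer; the earlier fact is retained,
# which is what gives the human-fault tolerance mentioned above.
```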

noSQL is not a panacea…

NoSQL is not a panacea

The past decade has seen a huge amount of innovation in scalable data systems. These include large-scale computation systems like Hadoop and databases such as Cassandra and Riak. These systems can handle very large amounts of data, but with serious trade-offs.

Hadoop, for example, can parallelize large-scale batch computations on very large amounts of data, but the computations have high latency. You don’t use Hadoop for anything where you need low-latency results.

NoSQL databases like Cassandra achieve their scalability by offering you a much more limited data model than you’re used to with something like SQL. Squeezing your application into these limited data models can be very complex. And because the databases are mutable, they’re not human-fault tolerant.

These tools on their own are not a panacea. But when intelligently used in conjunction with one another, you can produce scalable systems for arbitrary data problems with human-fault tolerance and a minimum of complexity. This is the Lambda Architecture…