Whilst there is a lot to love about big data, big data gives CIOs and business folks reason to moan: using it means developing expertise in new approaches to analytics, new technologies and new business processes. However, for companies that have not, to date, successfully implemented a data-warehouse, big data offers one huge reason to smile: it makes it possible for companies to develop their data warehousing platform in an incremental, step-by-step fashion, with much lower initial costs than the traditional, “big bang†approach that represented the orthodox approach in the pre-big data world.

In this blog post, we explain why that is the case, and outline what we believe is the best approach for companies looking to build out an internal analytics capability that takes advantage of big data technologies like Hadoop.
1. The case for a data warehouse (or centralised, complete data repository)
It is well recognised that:
- Business decision-making should be data driven
- Staff members should have good visibility of their customer base, including how that customer base is evolving over time and how their actions impact the customer base
- Staff members should understand the value of individual customers, segments of customers and the importance of different products, services and interactions in terms of growing that customer value
- Staff members should have a consistent view
- Long term business value is created when a company aims to maximize the lifetime value of all of its customers (rather than maximize short-term measures like the return on individual marketing campaigns)
Today, sixteen years after Ralph Kimball published The Datawarehousing Toolkit, the technology stack and business processes for delivering the above capability are well understood. The stack broadly comprises:
- A data warehouse which holds the complete set of relevant data for a company, including all customer and financial records, at an atomic level (i.e. down to the individual transaction or customer action)
- A set of ETL (extract, transform and load) processes for migrating data from operational systems, where it is created, into the data warehouse, where it is used for analytics, developing business intelligence and driving decision-making
- A business intelligence tool, or suite of tools, to enable business users to drill into the data and use it in a commercial context
However, this technology stack no longer necessarily represents best practice in a big data world, a fact which has profound implications for companies in the media and retail sectors who seek to develop their internal analytics capability. Critically, it means that these companies can now develop an alternative “big data” infrastructure incrementally over time, and at much lower initial cost, than would have been required were they implementing the “classic” technology stack above. To explain how this is so, we must start by explaining the traditional approach to data warehousing.
2. Traditional approaches to building a data warehouse: significant upfront cost
One of the premises behind classic data warehousing approaches is that the structure of data in the data warehouse should be such that it makes it easy for an analyst to query the data. Most data warehouses are modelled on a star or snowflake schema. We do not need to go into the intricacies of either of those architectures, but it is worth understanding why they make sense in a pre-big data world, and how those structures differ from the way data is typically structured within an operational system:
Denormalizing data so it is easier to query
In operational systems, it makes sense to minimise the volume of data stored, by minimising redundancy. Normalization in databases aims to achieve this: a single item of data (let’s say, a product being manufactured by “Apple Computer, Inc”) is stored once, and every record with that attribute (i.e. all Apple products) has an ID that points to the single representation of the attribute in the database, rather than having a copy of the attribute stored against it. This normalized approach also helps with data quality: if Apple shorten their name to “Apple, Inc” (which they did), then you only need update the manufacturer name in one place, and all the references to that one place will update automatically.
The trouble with this approach is that it increases the number of tables in a database, which makes querying the data harder, because more joins between tables are required. In an operational system, that is not necessarily a problem, because it is a program that’s reading and writing to the data (so creating the queries), rather than a human being trying to make sense of the data. But for a human being, there is an advantage in simpler, less normalized structures which demand less complex queries.
In a data warehouse, then, data is typically de-normalized, so that the analysts need perform fewer joins to pull out meaningful slices or extracts of the data.
Star and snowflake schemas
In a traditional data warehouse, data is structured in a star or snowflake schema. At the “centre†of the schema is a fact table: in it, each line of data represents a single event that an analyst can easily understand: this might be a transaction, or a particular user engagement:

The tables that surround the fact table (the points of the star) contain additional information about the event. These are “dimensionsâ€: if for example the fact table represents a series of “content plays†for a media company, then the dimensions provide additional information about those plays: one table might include information about the “player†or “platform†on which the engagement happened: was it a web or mobile app? What version was it? Another table might include information on the customer who initiated the play. Is the customer a man or a woman? Which country is he/she based in? How old is he/she? Is he/she a subscriber to the service? In which country is he/she based? Another might be about the media that has been played: what type of media was it (song, video, radio show etc)?
This type of data structure has two major benefits:
- For an analyst with only a passing knowledge of the company, it is relatively easy to look at the database, understand how the data is structured and start querying the data
- Data in this format is easy for business intelligence platforms to roll up to create OLAP cubes (also referred to as “cross tabs†or “pivot tablesâ€). OLAP cubes are summaries of the data, where individual records are aggregated (e.g. summed over time or by type of events). OLAP views typically give the analyst the opportunity to slice the data by different dimensions and drill down into specific segments (for example, investigating people who watched the new series of Boardwalk Empire the month it was released in the UK on mobile devices.)
This approach makes implementing a data warehouse costly
However, implementing a data warehouse to deliver on the above is expensive, because before the data warehouse can be built the data schema first needs to be designed. This is no trivial task: the schema needs to capture every “event†or “fact†at the heart of what makes the business tick – not just for today but potentially for tomorrow too. The designer has to identify every dimension that needs to be captured for each “fact†or “event†– this will be driven by what analysts at the company want to slice the data by, but also by what data is actually available from the operational systems.
As well as designing the schema, the actual ETL (“extract, transform and load”) scripts for transferring data from operational systems into the data warehouse will need to be designed and built. The ETL process is especially time consuming because fully describing a single event may involve grabbing data from multiple different operational systems, cleaning it and combining it. Some of that data will additionally need to be “transformed†so that it is consistent between systems, before being uploaded into the warehouse.
Worst, all of this has to be done before any analysis is delivered. With classic data warehousing projects, companies have to spend big bucks and cross their fingers in the hope that they achieve a positive return-on-investment in the long-run.
Evolving a data warehouse over time becomes difficult and costly
In a traditional data warehouse, architecting is a one-time process that precedes the building of the warehouse.
For a company today, though, especially in the media sector, this approach is somewhat problematic. Every few months there is a new product launched that gives consumers a new way to engage with media – just look at all the innovation in film: people can watch films through multiple streaming services (e.g. NetFlix, Lovefilm), download services (e.g. iTunes), rent them (Blockbuster / Netflix), buy them online (Amazon), buy them offline (HMV / Walmart). Watching and buying are not the only interesting actions: users can review films on multiple sites, buy tickets for cinema showings (e.g. Flixster), discuss films with their friends (Facebook, Twitter); they can discover films on TV (e.g. using Shazam), using recommendation services (e.g. StumbleUpon). The list of possible interactions is only likely to grow as more companies innovate in the space.
Integrating new data sources related to new services into an existing data warehouse is immensely problematic. Before it can happen, the data needs to be “linkedâ€. Ideally, customers on existing services need to be mapped onto customers on the new service. Media titles on existing services need to be mapped onto titles on the new service – we want to know, for example, that Gladiator bought on Amazon is the same film as Gladiator streamed via Netflix, if we are going to perform an analysis of how many people have watched Gladiator over time, and what has been the impact of media spend on those consumption patterns.
Again, the mapping needs to be completed before any analysis can occur. That makes it costly – it is hopefully already clear that it is difficult, too.
Evolving a data warehouse to accommodate unstructured and semi-structured data is effectively impossible
Many of the sources that we would like to incorporate in a data warehouse represent unstructured or semi-structured data. To give one example of each: social media interactions (e.g. on Facebook and Twitter) are fundamentally unstructured (at least from a computer’s perspective), while in an e-commerce product catalogues (e.g. Hybris or Magento), the data structure of each product category will look different from the others. Unfortunately, because data warehouses are built using relational databases, they make very imperfect places to store such forms of semi- and unstructured data.
3. Approaches to data warehousing in a big data world: Hadoop-based solutions
An alternative to the traditional data warehouse stack is illustrated below:

Features of this alternative approach are as follows:
- All available data, from all available sources, is dumped into the Hadoop grid. Note: data is dumped in a raw form, and no ETL-like transformation processes are run, unlike in a traditional data warehouse
- Analysts or data scientists within a company can query data directly from the Hadoop grid, by running their own custom analysis jobs (“MapReduce”) against the grid
- For most business users, running a custom analysis is too big an ask. For them, aggregates and subsets of the data will be pushed into data marts that they can interrogate using the same business intelligence tools that are used in a traditional datawarehouse stack. Some of these cuts may look exactly like the traditional data warehouses from the pre-Hadoop era
This approach means much smaller upfront investment in the solution
Unlike traditional relational databases, data is stored in Hadoop without a schema (for more information on this, see our earlier blog post on big data and Hadoop for retailers and media companies). This means that all relevant data can just be dumped into the Hadoop grid without being transformed, and no all-encompassing schema has to be developed to house every single data source in a coherent way. As a result, the upfront cost of the solution is significantly smaller than for a traditional datawarehouse.
Of course, there is an obvious consequence of this approach: it is much harder for end users to query the data, given that it will likely be very disorganised. Ultimately, any data that we want to analyse needs to be “clean†and “coherent†before we can say anything meaningful about it. By not performing this upfront, we have only deferred it until the time of performing our analysis – we have not got rid of it altogether. However, the advantage of this approach is it means that the business can clean its data in an iterative way, as more and more value is generated from the data, rather than being forced to try and clean all of it, before any analysis is possible. We will describe this iterative approach in the next section.
4. An incremental approach to developing your analytics capability
Using the above architecture, companies can take an incremental, iterative approach to developing their analytics capability:

The above approach works because as soon as the data is in the Hadoop data warehouse, it is possible for a data scientist to start mining that data to answer business questions. In the process of using the data to answer business questions, the data analyst will gain a solid understanding of:
- What views on the data are useful for business folk
- The underlying structure of the data
He or she is then perfectly placed to specify the process by which data marts are populated automatically, to deliver those data views to the business users on an ongoing basis. Those requirements  can then be implemented by the IT department.
As a company works through multiple iterations of the above cycle, progressively more and more data marts are constructed that help business users make decisions in a wide variety of areas. More and more of the data available to the company is then being used to drive business value:
Advantages of an iterative approach
- The datawarehouse starts delivering business value immediately. There is no need to integrate every data source before some of the data can be used to drive decision-making
- Upfront costs are lower, and companies can increase their investment in the infrastructure once the value of the generated insight has already been seen. This is much more palatable for senior management
- The infrastructure evolves with the needs of the business. Companies can prioritise developing infrastructure and understanding around the data that matters most to them now, safe in the knowledge that if the business strategy changes, a new set of data marts can be developed as required to reflect the new strategy and decisions that the business needs to take
- The infrastructure evolves with new data sources. Because adding in new data sources is at the heart of this iterative approach, as new sources become available they can be accommodated without having to rework the existing infrastructure, as might be the case in a traditional data warehouse. That is especially attractive to companies in retail and media, where new sources of data are continually being developed as new digital platforms or products are launched

