Different approaches to measuring customer lifetime value with SnowPlow

June 9th, 2012 by Yali

The cohort analysis blog post series

As part of our cohort analysis series, we have emphasized that there are a wide variety of different cohort analyses that are possaible, depending on the business question to be answered. To recap, just quickly, we can vary the cohort analysis by what metric we use to compare between cohorts, and by how we define our cohorts. We have written a post about comparing user engagement between different cohorts, and how this is valuable to especially to social networks, community sites and publishers. In this post, we look at comparing customer value, including customer lifetime value (CLV) between cohorts. We explain why this is important to all companies whose business models depend, at the end of the day, on monetizing users – including retailers, media companies and financial services companies. Lastly, we look at how to measure these values in SnowPlow, so that an appropriate cohort analysis can then be performed, as described in our previous blog post.

Weighing your customers' value


Defining customer value and customer lifetime value

The idea behind customer value and customer lifetime value in particular is very simple, in spite of all of the nuances associated with the mathematics of calculating them. The idea is that each customer provides our business with a certain amount of value (which we can measure as profit or revenue) over a period of time. That value might come in a number of ways: to take the example of a retailer, a customer might make a number of purchases from us over an extended time period, and we need to include the value of all those purchases when assessing that customer’s value. But the value that customer provides might not lie only in the purchases she makes. She may recommend products to her friends, thus providing additional value in terms of marketing / customer acquisition. She may write product reviews on the site: thereby making the site more engaging for other customers and prospective customers. These additional activities are valuable, but hard to quantify precisely.

Customer lifetime value or CLV is an extension of the concept of customer value. When we talk about a customer’s value, we talk in terms of their value over a period of time. If we want to be more rigorous, we might want to quantify the total value of the customer over their complete “lifetime” – i.e. the time period over which that person is a customer of ours. For some types of business (e.g. a cable subscription businesses), defining the lifetime is very easy: it’s just the length of time that person was a subscriber. For others (such as retailers), it is harder to quantify, because it is not clear when a person stops being a customer: there is always the possibility they will revisit the site and make another purchase.

Why is customer value and customer lifetime value so important?

Historically, different types of businesses have optimised themselves against different metics, depending on their particular commercial dynamics. For example: hotels and airlines have optimised on occupancy; newspapers, car manufacturers and record labels on units sold; financial services firms on the number and profitability of products sold.

There are two major weaknesses with metrics such as these:

  1. They obscure the customer-company relationship at the heart of every company’s business, and instead focus attention on supply-side issues like product production, capacity and distribution, which are generally much easier to control than demand-side issues like the way a customer feels about your product, service and brand.
  2. By not looking at the customer journey as a whole, these metrics lead companies into making decisions that often have unintended negative consequences at a different part of the customer journey.

To start with the first weakness: record labels have historically made all of their money selling CDs, and consequently have spent a lot of analytical effort understanding which are the best selling artists by geography, and then making sure that they distribute the right volume of CDs to those geographies to maximize profits. (Not too many to depress CD prices, not too few to artificially depress sales.) Today, the music industry has changed so that music is distributed digitally (so no fretting about matching CD supply with demand), and labels monetize their artists through record sales, streaming deals, live concerts and merchandising. Record companies therefore need to focus on understanding the complete value of each artist but also the complete value per fan. It is only when a record label understands which are its most valuable fans (those that go to the concerts, buy the merchandise, influence the prospects of upcoming artists etc.) that they will know where to target their marketing to maximize their profits going forwards. That means understanding the customer value of each of their users, and how that varies by age, affluence, taste in music and taste in music services (Spotify, Songkick and the like).

To illustrate the second weakness, consider the widespread example of the retailer advertising his products using AdWords, and measuring the return on that marketing spend based on the fraction of those users who clicked on the ads who went on to make a purchase, and the value of the purchases they made. This is often given as a textbook example for data-driven marketing. Unfortunately, it is massively flawed:

  • Many customers visit a site more than once before making a purchase. By looking at users’ behaviour only on their initial visit after clicking on an ad, the retailer will likely overlook a large number of return-visit sales which stemmed from that initial AdWords click. This means that the retailer is liable to underestimate the return on their search engine marketing spend.
  • Many customers who purchase after some ad exposure will go on to make additional purchases in the future. Some of these sales would never have been made were it not for them having seen the ad originally, and then making a successful initial purchase. However, none of the value of the repeat purchases is taken into account by the person calculating the return on this marketing spend, and as a result, the retailer is again liable to underestimate his return on marketing spend.

It is only when a company looks at the complete value of each customer acquired from a given marketing activity that they can calculate the true return on that marketing spend. If they do not do this, they risk cutting marketing spend from potentially their best long-term marketing channels, and spending too much on channels that only drive short term sales increases.

For another example of how a company can make a terrible strategic decision because it does not monitor the lifetime value of its customers, consider the (probably apocryphal) story of the bank that axed its children’s accounts after realizing that these were unprofitable. That bank’s profits will have risen in the short-term, and then nose dived 15 years after the decision was made, as the cohort of users who would have formed the bedrock of their future profitable customer base took up children’s accounts with competing banks.

Measuring customer lifetime value (general approach)

Measuring customer lifetime value is difficult for a number of reasons:

  1. As described above, there are particular customer actions (such as recommending a brand to a friend) that certainly create value, but are hard to quantify
  2. It is not always easy to define the “lifetime” of a customer, in other words to determine over what period we should measure that customer’s value
  3. Many of our most interesting customers will be in an ongoing relationship with our business: without a crystal ball we cannot be very certain what value we will be able to realize from those customers in the future

A full discussion of how to address the above challenges is beyond the scope of this blog post, although we do plan to explore some approaches in future posts. At this stage we make the following general observations, before discussing how to perform some initial customer lifetime value (CLV) calculations using SnowPlow:

  • While it is difficult to quantify those more “intangible” user actions (such as reviewing a product or “liking” a fan page), it is possible to come up with models to do so. Developing these valuation models means analysing what effect these actions have over a long time period, and comparing those effects with “control groups” where no equivalent action was performed. This is quite an involved analysis, and it requires good granularity of different customer journeys to perform – fortunately, SnowPlow provides the data granularity required. Once the analysis has been performed, a provisional value for that user action can be assigned.
  • Whilst defining the lifetime of a customer can be very difficult, we can instead make do by looking at the customer value over a defined period of time (e.g. a quarter), and see how that varies over time periods: are we increasing the value of our individual customers over time, or decreasing it? And how does that vary by customer segment? If we adopt this approach, we need to perform a separate churn analysis, to capture the undesirable but important commercial impact of users who leave us.
  • An alternative if we want to stick with customer lifetime value is to build a model that predicts the lifetime of each current customer, based on their behaviour to date. Again, because SnowPlow provides a very granular data set for each customer, it can support this type of analysis well. The output is a function that predicts the lifetime value of a customer based on their to-date behaviour, where the salient aspects of that behaviour have been carefully selected, through a combination of customer knowledge and data analysis.
  • Whichever of the above routes is taken, remember that a given user’s CLV may evolve over time, based both on their own changing behaviour but also on our evolving model and understanding of valuable customer behaviour. This is fine and to be expected – the important thing is that at any given moment in time we can attribute a meaningful value for each customer.

Measuring customer lifetime value (using SnowPlow)

SnowPlow makes it easy to take calculate customer value based on web analytics data, unlike traditional web analytics programs. There are a number of key areas where it excels:

1. It makes it possible to measure the total revenue of each customer directly, over their entire history

Remember the example given earlier of the retailer that measured the return on AdWords spend based on the fraction of users who clicked on the ad and went on to purchase on the same visit? Well, SnowPlow makes it trivial to look at the amount of revenue booked each customer over all their visits, using the query below:

/* HiveQL */
SELECT
	`user_id` AS customer,
	YEAR(`dt`) AS time_period_year,
	MONTH(`dt`) AS time_period_month,
	SUM(`ev_value`) AS `customer_value_per_time_period`
FROM `events`
WHERE `ev_action` LIKE 'order-confirmation'
GROUP BY `user_id`, YEAR(`time_period`), MONTH(`dt`);

The query above sums all the order values for each `user_id` by time period (in this case, month). It is trivial modify the query to add in different types of events, by simply adding them to the `WHERE` clause. It is also trivial to take a different time period, e.g. looking by week, by aggregating the event dates using the Hive WEEKOFYEAR() function rather than the MONTH() function.

If we wanted to perform the above analysis just for a cohort or users who had come to the site from AdWords, we would perform a separate query to identify just those users who’d clicked on an AdWords ad (shown below), and then JOIN the results (a list of user_ids) with the output of the query above. This query would get you that list of user_ids:

/* HiveQL */
SELECT `user_id`
FROM `events`
WHERE `mkt_source` = 'google' AND `mkt_medium` = 'cpc' ;

2. It is possible to include the value of “intangible” customer actions (e.g. “liking a product”) in the SnowPlow customer value calculation

If your company has developed a model / view of the value of “intangible” or otherwise hard-to-value actions (e.g. “liking” a product), those values can be incorporated by SnowPlow when you perform the analysis. That’s because, unlike other web analytics packages including Google Analytics, SnowPlow lets you upload your own first- and third-party data into SnowPlow to use in the calculation.

In this case, you would create a table in SnowPlow with a list of the different actions, categorised by events, where each event has an associated value. The table would look like this:

/* HiveQL */
CREATE TABLE `intangible_event_values` (
	ev_category STRING,
	ev_action STRING,
	value INT
) ;

You would then upload your event values to the `intangible_event_values` table and JOIN it with the `events` table when calculating the value per customer per time period:

/* HiveQL */
SELECT
	`user_id` AS customer,
	YEAR(`dt`) AS time_period_year,
	MONTH(`dt`) AS time_period_month,
	SUM(`e`.`ev_value`) + SUM(`v`.`value`) AS `customer_value_per_time_period`
FROM `events` e JOIN `intangible_event_values` v ON (e.ev_category = v.ev_category AND e.ev_value = v.ev_value)
WHERE (`ev_action` LIKE 'order-confirmation' OR v.ev_value IS NOT NULL)
GROUP BY `user_id`, YEAR(`dt`), MONTH(`dt`) ;

This allows us to track customer value over time, and monitor how effectively we are growing customer value over time – for the user base as a whole, by customer segment, or for each customer (by looking at particular `user_id`s).

3. It is possible to include model-based estimates of future customer value in customer lifetime value calculations

Assuming that we have developed a model that predicts future customer value based on their behaviour to-date, we can incorporate that into SnowPlow to calculate the full-lifetime value for the current customer base.

Our function to predict customer lifetime value would take certain events in a customer’s past and uses those to predict future value. Because all of the events in a customer’s online history are stored in SnowPlow, it is possible to run the function in SnowPlow and calculate a full-lifetime value for each of our `user_id`s. From this point we could sum all of those individual CLVs to calculate the value of our current user base, and compare that with the value for our userbase last month or year, for example.

How we do this depends on the nature of our predictive function. If it can be written directly in Hive using HiveQL, we simply need to include it in our query. If it is not possible to compose in HiveQL, we can create a user-defined function in Java to do the calculation and return the result to Hive. Either way, our analysis proceeds as follows:

First we define our “live” userbase:

/* HiveQL with pseudo-code */
CREATE TABLE `this_months_customers` (
	user_id STRING
) ;

INSERT OVERWRITE TABLE `this_months_customers`
SELECT `user_id`
FROM `events`
WHERE dt>'2012-05-31' /* In this case we define our live customers as those who have engaged with us in June */
GROUP BY `user_id` ;

Next we perform a comparable query to identify customers who were “live” last month:

/* HiveQL with pseudo-code */
CREATE TABLE `last_months_customers` (
	user_id STRING
) ;

INSERT OVERWRITE TABLE `last_months_customers`
SELECT `user_id`
FROM `events`
WHERE `dt`>'2012-04-30' AND `dt`<'2012-06-01'
GROUP BY `user_id` ;

We then calculate the value of this month’s customers, using our customer lifetime value predictor function:

/* HiveQL with pseudo-code */
CREATE TABLE `lifetime_value_by_customer_this_month` (
	user_id STRING,
	predicted_lifetime_value FLOAT
) ;

INSERT OVERWRITE `lifetime_value_by_customer_this_month`
SELECT
	`user_id` AS customer,
	customer_lifetime_value_predictor_function(input parameters) AS `customer_lifetime_value`
FROM `events`
GROUP BY `user_id`  ;

Note that we look at the complete history of each of this month’s customers to predict their future value.

Next we perform the comparable calculation for last month’s customers:

/* HiveQL with pseudo-code */
CREATE TABLE `lifetime_value_by_customer_last_month` (
	user_id STRING,
	predicted_lifetime_value FLOAT
) ;

INSERT OVERWRITE `lifetime_value_by_customer_last_month`
SELECT
	`user_id` AS customer,
	customer_lifetime_value_predictor_function(input parameters) AS `customer_lifetime_value`
FROM `events`
WHERE dt<'2012-06-01'
GROUP BY `user_id`  ;

We can now compare the two tables `lifetime_value_by_customer_last_month` with `lifetime_value_by_customer_this_month` to see how the average lifetime has varied, how the distribution of lifetime value has varied and how the number of customers by lifetime value has varied.

We will explore different approaches to developing the customer_lifetime_value_preditor_function in future blog posts.

What to learn more about SnowPlow? Then visit the SnowPlow website, and in particular, the guide to calculating customer lifetime value using SnowPlow.

Leave a Reply