Performing the cohort analysis described in Eric Ries’s Lean Startup using SnowPlow and Hive

May 15th, 2012 by Yali

The cohort analysis blog post series

This blog post is the third in our series on cohort analyses using SnowPlow. In the first post, we provided an overview of cohort analyses: why they are so powerful and what are the different analytic steps necessary to perform a cohort analysis. In the second post, we looked at why SnowPlow is such a good platform for performing cohort analyses using web analytics data, and worked through the specific example of the Twitter cohort analysis that gets so much attention in startup circles.

In this post, we will follow up with a look at another famous example of a cohort analysis: this time from Eric Ries’s excellent book The Lean Startup. We will show how a company running SnowPlow can easily perform the type of analysis Eric performed when he was CTO at IMVU, to assess the progress they were making towards achieving a product-market fit.

A version of the data Eric Ries presents in his book is shown below:

Interpreting Eric Ries’s cohort analysis

Before I describe how to perform Eric Ries’s cohort analysis, let’s remind ourselves what this analysis demonstrates.

The analysis shows the percentage of new users added each month, split by what stage they reached in their customer journey. Remember that Eric’s startup had an instant messaging product where customers talked to one another using 3D avatars; ultimately, the success of the company depending on persuading users to pay for the product, but that meant each user working through a number of “stage gates”, including:

  1. Downloading the product
  2. Registering
  3. Creating a 3D avatar (potentially more than one)
  4. Connecting with another user (potentially multiple users)
  5. Signing up to the service and paying the subscription fee

Understanding how many potential customers progressed through each stage-gate was critical to IMVU. The graph above looks at the number of people who registered in each of the first 7 months after the first version of their product was launched, and looks at what percentage of users made it through to each stage gate. The aim was to get as many customers as possible to the final stage gate: “Paid”. As you’ll see from the graph however, the percentage of users who make it to this stage each month was very small (less than 1.4% in each month.) Worse, this percentage did not increase month-by-month, as the team worked to improve the product. It was this lack of improvement, in spite of enormous product development effort, that persuaded the IMVU team to pivot their product, with dramatic effect.

There are some interesting things to note when comparing this analysis to the one we performed in the last post. As in the previous analysis, each cohort is defined by the month that the user started using the service (although in this case, that “starting point” is determined by the point at which the user “registers”, which is after they have already downloaded the product). For each cohort, we only look at one month of data. This is very different from the previous example, where we looked at the average engagement level of each cohort over successive months. This may be because the IMVU guys found that if a user hadn’t signed up to their service after 1 month, there was a negligable chance that they would go on to do so. (Meaning that the timeframe for converting a prospect to a paid user was reasonably short, and hence there was no point including subsequent months in the analysis for each cohort. This makes the IMVU product very different to Twitter, where a user may only graduate to a highly engaged user after two or even three months.) When you consider performing a comparable analysis for your company, decide carefully what time frame to look at, so you don’t miss relevant data points for each cohort by picking an inappropriate time period.

Performing the cohort analysis using SnowPlow

We will perform the analysis by following the steps outlined in the first post i.e.:

  1. Start by defining the business question
  2. Work out what is the most appropriate metric to measure, given the business question
  3. Define the cohorts
  4. Perform the analysis

1. Define the business question

This is simply: are we getting better over time at getting new people introduced to the product to become paid users?

2. What is the most appropriate metric to measure, given the business question?

In this case it is the percentage of users who register who go on to reach the different stage gates identified in the list above.

It’s important to note that SnowPlow’s data table in Hive contains five fields that describe a particular “event” in a user journey:

Name Req? Description
Category Yes The type of event being tracked e.g. “ecomm” for buying related events or “media” for media consumption events
Action Yes The actual user action performed
Object No The specific object being acted on e.g. the product brought or video played
Property No An option sprint describing the object or the action performed on it e.g. “HD” if its a video being played
Value No A value associated with the action e.g. revenue associated with the action, or number of minutes into the video that it is started / stopped

In our case, we’ll assume that the different actions that indicate what funnel stage a user is in can be read from the Action field (called `event_action` to be precise). In the IMVU case, that might mean that we need to check for the following values in the `event_action` column for a particular user:

  1. `Subscribe` to indicate he/she has reached the final stage in the funnel
  2. `Made-friend` to indicate he/she has connected with another user
  3. `Created-avatar` to indicate he/she has created an avatar
  4. `Login` to indicate he/she has logged in
  5. `Register` to indicate he/she has logged in

To work out what stage each visitor has reached in his/her customer journey, we process the `snowplow_events_table` for each `user_id` and combine all the different `event_actions` which a user has taken into a single set (stored as a HiveQL array):

/* HiveQL */
CREATE TABLE `actions_by_user` (
	`user_id` string,
	`acts` array<string>
);

INSERT OVERWRITE TABLE `actions_by_user`
SELECT
	`user_id`,
	collect_set(`event_action`) AS `acts`
FROM `snowplow_events_table`
GROUP BY
	`user_id`;

For each user, we then look through the array of different actions that he / she has performed to work out at what stage each visitor is in the funnel:

/* HiveQL */
CREATE TABLE `stage_in_funnel_by_user` (
	`user_id` string,
	`stage` string
);

INSERT OVERWRITE TABLE `stage_in_funnel_by_user`
SELECT
	`user_id`,
	CASE
		WHEN array_contains(`acts`,"Subscribe") THEN "Subscribe"
		WHEN array_contains(`acts`,"Made-friend") THEN "Made-friend"
		WHEN array_contains(`acts`,"Created-avatar") THEN "Created-avatar"
		WHEN array_contains(`acts`,"Login") THEN "Login"
		WHEN array_contains(`acts`,"Register") THEN "Registered"
		ELSE "Not-registered" END AS stage
FROM `actions_by_user`;

Note that we start from the final stage in the funnel. For each visitor, we check if they have reached the final stage, and if not, work backwards stage-by-stage to work out if they’ve reached the previous stage.

3. Define your cohorts

This is straightforward: we want to divide up our users into cohorts based on the date in which they registered for the product. This is done using the following query:

/* HiveQL */
CREATE TABLE `users_by_cohort` (
	user_id string,
	cohort string
);

INSERT OVERWRITE TABLE `users_by_cohort`
SELECT
	`user_id`,
	concat(year(min(`dt`)),"-",month(min(`dt`))) as cohort
FROM `snowplow_events_table`
WHERE `event_action` LIKE "Register"
GROUP BY `user_id`;

4. Perform the analysis

We now need to join our `users_by_cohort` table which defines which cohort each `user_id` belongs with our `stage_in_funnel_by_user` which tells us which stage in the customer journey each user belongs to:

/* HiveQL */
CREATE TABLE `cohort_analysis` (
	`cohort` string,
	`stage` string,
	`number_of_users` int
);

INSERT OVERWRITE TABLE `cohort_analysis`
SELECT
	`cohort`,
	`stage`,
	count(`stage_by_user`.`user_id`) as `number_of_users`
FROM
	`users_by_cohort`
JOIN
	`stage_by_user`
ON
	`users_by_cohort`.`user_id` = `stage_by_user`.`user_id`
GROUP BY
	`cohort`, `stage`

And that’s all there is to this retrospective cohort analysis! Our resulting table looks as follows:

It is now straightforward to convert the absolute numbers of users in each cohort it a percentage value and plot a graph similar to the one displayed at the top of this blog post.

Interested in learning more? Then visit the SnowPlow website, including the section on performing cohort analyses with SnowPlow.

Update: you can now also discuss this blog post on Hacker News.

Leave a Reply