Performing cohort analysis on web analytics data using SnowPlow

May 8th, 2012 by Yali

The cohort analysis blog post series

In the previous blog post in this series, Cohort analysis for digital businesses: an overview, we described what cohort analysis is and why it is so powerful. In this post, we will look at how to perform cohort analysis on web analytics data in SnowPlow. We will start with an overview of the general methodology and approach for cohort analysis using SnowPlow, and then launch into a specific example: the Twitter engagement analysis that made cohort analysis famous in startup circles.

Methodology for performing cohort analyses in SnowPlow

SnowPlow has been designed to:

  1. Make it easy to perform specific cohort analyses
  2. Give users maximum flexibility to perform a wide range of cohort analyses, by making it easy to define cohorts in multiple ways and leverage multiple different metrics to measure and compare between the different cohorts

To understand what makes SnowPlow so suitable for cohort analyses, we need to consider the way data is structured in SnowPlow. This is represented in the diagram below:

SnowPlow records all data in a single events table in Hive. Whenever one of your customers does anything on your website, be it click on a link, fill in a web form, play a video, add a product to basket, perform a search or roll-over an ad (to give just some examples), a line of data is generated in the events table.

The events table contains a number of different fields. Some of these describe the particular event: distinguishing the type of event (e.g. a pageview or an add-to-basket), noting the time and date of the event, and recording the value of the event if it is, for example, a transaction. Other fields relate to the customer performing the action: distinguishing customers by the type of device they are using, where they are located, or what language their browser is set to, for example. (For a complete description of the table and fields, see the analyst documentation on the SnowPlow website.)

Recall from our last blog post that performing a cohort analysis is a four step process:

  1. Define your business question
  2. Work out what is the most appropriate metric (or set of metrics) to measure, given your business question
  3. Define your cohorts, given the business question
  4. Perform the analysis

From a data crunching perspective, the key challenge when performing a cohort analysis is to access the data necessary to calculate the metric identified in step 2, sliced by the cohorts defined in step 3. This is not always straightforward with conventional web analytics tools – we will explore the limitations of doing this in Google Analytics in a later post.

We can now see what makes SnowPlow such a good platform for cohort analyses. Defining our cohorts and performing the data separation is simple: we can define our cohorts by any of the data fields we associate with each visitor, by user behaviour, or by some combination of the two (e.g. a cohort of “male customers who shop regularly”). In any of these cases, a straightforward HiveQL SELECT statement of the following form will return a table mapping customers (i.e. user_ids) to cohorts, however they are defined – in terms of customer data, event data, or a combination of the two:

/* HiveQL with pseudo-code */
CREATE TABLE `user_cohort_map` AS
SELECT
	`user_id`,
	[[ Generate cohort_ids using visitor_attributes and visitor_events ]] AS `cohort_id`
FROM
	`events`
GROUP BY
	`user_id`

Because we have all our raw data about customers and events in the events table, we have maximum flexibility to define our cohorts however we want, based on any combination of them. (In simple cases the cohort function can be written directly in HiveQL, in more difficult cases by writing a user-defined function.)
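
For instance, a cohort function combining a customer attribute with a behavioural measure might look like the following sketch. This is plain Python standing in for a Hive UDF, and all field and event names here are hypothetical, not SnowPlow's actual schema:

```python
def assign_cohort(visitor_attributes, visitor_events):
    """Toy cohort function: combines a customer attribute (gender)
    with a behavioural measure (purchase frequency). All names here
    are invented for illustration, not SnowPlow's real fields."""
    gender = visitor_attributes.get("gender", "unknown")
    # Count purchase events to classify shopping frequency
    purchases = sum(1 for e in visitor_events if e == "transaction")
    frequency = "regular" if purchases >= 3 else "occasional"
    return f"{gender}-{frequency}-shoppers"

print(assign_cohort({"gender": "male"},
                    ["pageview", "transaction", "transaction", "transaction"]))
# male-regular-shoppers
```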

As a next step, we need to calculate, for each visitor, the value of the metric we will use to compare the cohorts. Again, this is straightforward to generate from the same events table:

/* HiveQL with pseudo-code */
CREATE TABLE `metric_by_user` AS
SELECT
	`user_id`,
	[[ Generate metric_per_customer using visitor_attributes and visitor_events ]] AS `metric_per_customer`
FROM
	`events`
GROUP BY
	`user_id`

Once again, we have maximum flexibility to calculate any metric we want, based on having all the available data in the events table.

In the final step, we aggregate the metrics generated per user to a cohort level, so that we can compare the results for each cohort against one another:

/* HiveQL with pseudo-code */
SELECT
	`cohort_id`,
	[[ Aggregate the metric over all the customers in each cohort ]] AS `metric_per_cohort`
FROM
	`user_cohort_map`
INNER JOIN
	`metric_by_user`
ON `user_cohort_map`.`user_id` = `metric_by_user`.`user_id`
GROUP BY
	`cohort_id`
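
To make the three steps above concrete, here is a toy end-to-end sketch in Python, using SQLite in place of Hive. The schema and data are invented for illustration and are far simpler than SnowPlow's real events table:

```python
import sqlite3

# Toy stand-in for the SnowPlow events table: one row per user action.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, dt TEXT, event TEXT);
INSERT INTO events VALUES
  ('u1', '2012-01-03', 'pageview'),
  ('u1', '2012-01-15', 'add-to-basket'),
  ('u2', '2012-01-20', 'pageview'),
  ('u2', '2012-02-02', 'pageview'),
  ('u3', '2012-02-10', 'pageview');
""")

# Step 1: map each user to a cohort (here, the month of their first event).
conn.execute("""
CREATE TABLE user_cohort_map AS
SELECT user_id, substr(MIN(dt), 1, 7) AS cohort_id
FROM events GROUP BY user_id
""")

# Step 2: compute the metric per user (here, total number of events).
conn.execute("""
CREATE TABLE metric_by_user AS
SELECT user_id, COUNT(*) AS metric_per_customer
FROM events GROUP BY user_id
""")

# Step 3: join the two tables and aggregate the metric per cohort.
rows = conn.execute("""
SELECT c.cohort_id, AVG(m.metric_per_customer) AS metric_per_cohort
FROM user_cohort_map c
JOIN metric_by_user m ON c.user_id = m.user_id
GROUP BY c.cohort_id
ORDER BY c.cohort_id
""").fetchall()

print(rows)  # [('2012-01', 2.0), ('2012-02', 1.0)]
```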

Let’s apply this approach to a specific example:

Example 1: Looking for improvements in user engagement over time (the Twitter case study)

As a first example we’ll take the famous Twitter cohort analysis exploring whether user engagement levels were rising over time. A version of the data is shown below. Remember: the folks at Twitter wanted to examine whether they were getting better at getting users to engage over time. So they looked at the users they’d acquired in January and compared how many of them remained engaged after using the service for 1 month, 2 months, 3 months, 4 months etc., with the users they added in February. To see the results more clearly, click on the graph below for an enlarged version.

As you should be able to see from the graph, the fraction of users who remained engaged after 1, 2, 3, 4 months etc. rises over time, so that the results for the February cohort look better than those for January. Similarly, the March cohort’s results are better than February’s, the April cohort’s better than March’s, and so on.

We’ll follow the four step approach to cohort analysis outlined in the previous blog post, namely:

  1. Define your business question
  2. Work out the most appropriate metric to measure
  3. Define your cohort
  4. Perform the analysis

1. The business question: Are we getting better at getting users to engage over time? We are clear on the question to be answered.

2. The metric we want to examine: For each cohort, we want to examine “engagement”. For this specific example, we’ll take the number of actions per user per month. In the Twitter example, they looked at the percentage of users who remained active after X months. That is a good measure of how good Twitter are at facilitating a “minimum” level of engagement. Our measure, by contrast, will better reflect the difference between users who engage only occasionally and those who use the service heavily, but will do a less good job of indicating what percentage of users do not engage at all. (Which is more appropriate depends on the type of service we are looking at; in reality, most companies should be looking at both types of measure.)
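
The difference between the two metrics is easy to see on toy data. In the sketch below (the per-user action counts are invented for illustration), the Twitter-style measure asks only whether a user did anything at all, while our measure is pulled up by heavy users:

```python
# Toy per-user action counts for one month (invented data)
actions = {"u1": 30, "u2": 0, "u3": 1, "u4": 0}

# Twitter-style metric: percentage of users who remained active
# (i.e. performed at least one action this month)
pct_active = 100 * sum(1 for n in actions.values() if n > 0) / len(actions)

# Our metric: average number of actions per user per month
avg_actions = sum(actions.values()) / len(actions)

print(pct_active)   # 50.0 -> half the users did anything at all
print(avg_actions)  # 7.75 -> dominated by the heavy user u1
```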

Remember: for each cohort, we will want to plot the engagement level for the first month of use versus engagement level for the second month of use, versus engagement level for the third month of use etc.

Our HiveQL query to measure engagement then looks like this:

/* HiveQL */
CREATE TABLE `engagement_by_user` (
	`user_id` string,
	`mn` string,
	`engagement_per_month` int
);

INSERT OVERWRITE TABLE `engagement_by_user`
SELECT
	`user_id`,
	substr(`dt`,1,7) AS `mn`,
	count(`tm`)
FROM spconsolidated
GROUP BY `user_id`, substr(`dt`,1,7);

Here we use HiveQL’s string processing functions to group the results by year and month. (Note: I originally believed that, unlike SQL, HiveQL lacked date processing functionality – I have since realised that Hive DOES have date functions, which can be used to rewrite the above query more elegantly. Nonetheless, the above query will still work :-) )
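
Following up on that note: assuming Hive’s built-in year(), month() and lpad() functions, the query could be rewritten along these lines – a sketch, not tested against this dataset:

```sql
/* HiveQL -- sketch using Hive's date functions; lpad() restores the
   zero-padded month so the output matches substr(`dt`,1,7) exactly */
INSERT OVERWRITE TABLE `engagement_by_user`
SELECT
	`user_id`,
	concat(year(`dt`), '-', lpad(month(`dt`), 2, '0')) AS `mn`,
	count(`tm`)
FROM spconsolidated
GROUP BY `user_id`, concat(year(`dt`), '-', lpad(month(`dt`), 2, '0'));
```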

The resulting table looks like this:

3. Cohort definition: We want to know if we are getting better over time at getting users to engage. So we will define each cohort by the month in which the user first starts using the service. The query for categorising each customer by cohort in HiveQL is then very simple:

/* HiveQL */
CREATE TABLE `visitors_by_cohort` (
	`cohort` string,
	`user_id` string
);

INSERT OVERWRITE TABLE `visitors_by_cohort`
SELECT
	substr(MIN(`dt`),1,7) AS `cohort`,
	`user_id`
FROM spconsolidated
GROUP BY `user_id`;

The resulting table looks like this:

4. Perform the analysis: We simply combine the two queries above to aggregate our results by cohort, comparing the average engagement level between cohorts. This is illustrated schematically below:

The required query is straightforward to write in HiveQL:

/* HiveQL */
CREATE EXTERNAL TABLE `cohort_analysis` (
	`cohort` string,
	`month` int,
	`average_engagement` double
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3n://{{your-snowplow-bucket-name}}/analysistables/cohortanalysis' ;

INSERT OVERWRITE TABLE `cohort_analysis`
SELECT
	`cohort`,
	(substr(`mn`,1,4)*12 + substr(`mn`,6,2)) - (substr(`cohort`,1,4)*12 + substr(`cohort`,6,2)) + 1 AS `month`,
	AVG(`engagement_per_month`)
FROM `visitors_by_cohort`
JOIN `engagement_by_user`
ON `engagement_by_user`.`user_id` = `visitors_by_cohort`.`user_id`
GROUP BY `cohort`, (substr(`mn`,1,4)*12 + substr(`mn`,6,2)) - (substr(`cohort`,1,4)*12 + substr(`cohort`,6,2)) + 1;

The one small complexity in Hive is calculating the difference in months between the cohort date (i.e. the month in which the user started using the service) and the month the metric is being calculated for. To do this, we convert each year-month into a single month count (the year multiplied by 12, plus the month) and subtract the cohort’s count from the current month’s. We then add one, so that the value for the first month of use is 1 rather than 0.
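
That arithmetic is easy to sanity-check outside Hive. Here is a quick sketch in Python (the month_index helper is ours, written just for this check, not part of SnowPlow):

```python
def month_index(mn, cohort):
    """Number of months the user has been active, counting the cohort
    month itself as month 1. `mn` and `cohort` are 'YYYY-MM' strings,
    as produced by substr(`dt`,1,7) in the Hive queries."""
    y1, m1 = int(mn[0:4]), int(mn[5:7])
    y0, m0 = int(cohort[0:4]), int(cohort[5:7])
    # Convert each year-month to a month count, subtract, then add 1
    return (y1 * 12 + m1) - (y0 * 12 + m0) + 1

print(month_index('2012-01', '2012-01'))  # 1  (first month of use)
print(month_index('2012-04', '2012-01'))  # 4
print(month_index('2013-02', '2012-11'))  # 4  (handles the year boundary)
```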

Note also that we make this final table an external table, and save the results back to S3. The results data set is small and easy to plot even in Excel.

And here is what our final cohort analysis looks like:

Et voilà! Cohort analysis using web analytics data in 3 easy steps :-)

Update!

Since writing this post, the SnowPlow website has gone live, including a detailed page describing how to perform cohort analyses with SnowPlow.

If you would like help performing cohort analyses like these, or would like to hear more about the benefits of SnowPlow for your web business, please visit the SnowPlow website, or contact us.
