Different approaches to measuring user engagement with SnowPlow

May 16th, 2012 by Yali

The cohort analysis blog post series

User engagement is one of the most interesting, most important, and yet challenging areas of data analysis. In this post, we will look at different metrics which can be employed to understand user behaviour as part of our overall series on cohort analyses. In addition, we will show how each suggested user engagement metric can be calculated in a straightforward way using SnowPlow.

Not an engaged user

Historically, it was online publishers who were most concerned with the level of their users’ engagement. But the advent of social networks like Facebook and Twitter, community sites (like Mixcloud and Mumsnet) and socially-aware applications (like Spotify and Steam) means that there is now a much larger number of businesses whose success depends very directly on how good they are at getting users to engage frequently (many times per day or month) and deeply (long sessions, many page views). That may be because engagement directly drives revenue (e.g. for an ad-funded business), or because building up a critical mass of users is key to building a viable online marketplace or social network.

There are countless potential measures of metrics to capture user engagement – to give just some examples:

  • Number of visits / logins per month
  • Number of page views per month
  • Amount of time spent on the site / in the product
  • Number of articles opened / read / downloaded / shared / saved / favourited
  • Number of sessions per month where the user spends a minimum amount of time on the site / connects with another users / completes a level on a game

Which ones are most relevant to you will depend on your business, and the ways in which your users engage with your product. Let’s take some (famous) examples:

Example 1: User engagement at Facebook

In the case of Facebook, one well established user behaviour is to log into Facebook (e.g. at work, fairly early on in the day) and leave the page open during the day, rechecking it at intervals when there is a lull to see if anything interesting has appeared in the user’s stream. (Similar behaviour is evident when people access Facebok via mobile.) If this represents a significant fraction of user behaviour, then it becomes meaningless to measure the depth of engagement by the length of time a session lasts (because a session may last 8 hours). Similarly, measuring the breadth of engagement by the number of sessions would be misleading, because a user who logs in and out 5 times in a day (because they are on a public computer) is no more engaged than one who logs in once on their own machine, and keeps the page open all day.

It is likely for this reason (although we haven’t confirmed with analysts at Facebook), that Facebook tracks the number of days per week (or days per month) that a user logs in as a measure of the breadth of their engagement. Most likely Facebook would use a measure like this to get a handle on a user’s overall engagement with Facebook, and at the same time also look at how deeply or actively a user engages with Facebook by looking at how many items in their stream they actively engage with (e.g. by commenting on them, liking or actively contributing content to the stream by posting status updates and photos). Our guess would be that, for Facebook, user engagement is tracked across these two dimensions: one for breadth (number of days per month) and one for depth (level of activity when engaged, from active to passive).

Example 2: User engagement at eBay

For eBay on the other hand, measuring engagement is more difficult. For starters, there are different types of users on eBay, and we might want to apply different methods for measuring engagement based on the user type. Buyers on eBay may make up a distinct group as opposed to sellers. In the case of buyers, we might superficially be interested in the number of products they search for on eBay – so perhaps number of searches per week is a relevant metric? Or we might be interested in the breadth of products they look for on eBay – so a user who searches for electronic goods, clothing and stationery is more “engaged” than one who only looks at second-hand books. We might also want to analyse the “depth” of engagement by looking at how actively a user bids on products on eBay: what is the number of products bid on per week? How actively does the user bid on each item?

Conversely, if we look at sellers on eBay, things become more straightforward. For sellers we are interested in the number of items they sell (per week or per month). We might also want to quantify what more advanced sellers are making available on eBay relative to what they are retailing on their own website or Amazon Marketplace, or look at how frequently they are upgrading their listings with additional photos and similar. In all of this, we may want to differentiate “professional” sellers (including those who sell products on multiple channels) from part-time or hobbyists, who use eBay to sell their old stuff.

Measuring user engagement using SnowPlow

SnowPlow offers a lot of flexibility to use multiple different measures of engagement. To start with, let’s take the simple case of measuring each user’s engagement by the number of ‘actions’ they have performed (where an action might be anything from loading a web page, to playing a piece of media, to adding an item to a shopping basket). Because a line of data is recorded in SnowPlow’s event table every time a user peforms an action (and each line has a unique `txn_id`, all we need to do is sum the number of actions take per user per month:

/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_user`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(`txn_id`) AS `engagement`
FROM `events`
GROUP BY `user_id`, YEAR(`time_period`), MONTH(`dt`);

We can make our measure more sophisticated by weighting different actions based on the level of engagement they demonstrate. For example, the folks at Facebook might decide that posting a status update or uploading a photo constitute more active levels of engagement than simply liking a post in your stream, or clicking on a friend’s photo album. In this case, in SnowPlow, we would create a new table populated with the different user action types (e.g. from the `event_action` field) and assigning each a score:

/* HiveQL */
CREATE TABLE `engagement_by_event_action` (
	`event_action` string,
	`engagement_level` int
);

We would then populate the table with a list of different types of action and a representative value for the engagement level, like so:

`event_action` `engagement_level`
‘like_post’ 2
‘comment_on_post’ 3
‘post_status_update’ 5

Note: the above table would have to include a value for every type of action recorded in the SnowPlow events table. Our query for calculating the weighted engagement level by user per period would then join the `events` table with the `engagement_by_event_action` table:

/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_user`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`)
	sum(`engagement_level`) AS `engagement`
FROM
	`events`
JOIN
	`engagement_by_event_action`
ON
	`events`.`event_action` LIKE `engagement_by_event_action`.`event_action`
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`);

For the eBay example, our first suggested measure of breadth of engagement for buyers was to examine the number of product searches performed. Assuming that every product search by a user is logged in Snowplow as an event with `event_action` = ‘search’ and `event_property` = “search string”, we could as a first measure simply count the number of searches per time period per buyer:

INSERT OVERWRITE TABLE `engagement_by_buyer`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(`txn_id`) AS `engagement`
FROM
	`events`
WHERE
	`event_action` like "search"
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`) ;

If we want to be more sophisticated and look at the breadth of items searched for (e.g. the number of different product categories sought by the user), we would need:

  1. A way to categorise the search query into a likely product category
  2. To count the number of categories searched

Assuming our search facility had some kind of categorisation engine (i.e. could approximate the specific category of product the user is currently searching for), we could create a table in Hive that maps search terms to categories:

/* HiveQL */
CREATE TABLE `category_by_search_term` (
	`search_term` string,
	`product_category` string
);

We could then examine the breadth of products searched for by buyers using the following query:

/* HiveQL */
INSERT OVERWRITE TABLE buyer
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(DISTINCT `product_category`) AS `engagement`
FROM
	`events`
JOIN
	`category_by_search_term`
ON
	`events`.`event_property` LIKE `category_by_search_term`.`search_terms`
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`);

Meanwhile, to examine the engagement levels of sellers on eBay, we could look at the number of products listed by each seller per month. Assuming that each time a seller lists an item, SnowPlow records an event with `event_type` = “list” and `event_property` = item sku, it would be relatively simple to count the number of items listed per month:

/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_seller`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(`txn_id`) AS `engagement`
FROM
	`events`
WHERE
	`event_action` LIKE "list"
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`) ;

If we wanted to compare the number of different SKUs listed as well, that is also straightforward:

/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_seller`
SELECT
	`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(DISTINCT `event_property`) AS `engagement`
FROM
	`events`
WHERE
	`event_action` LIKE "list"
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`) ;

Comparing this number against a hypothetical number for the total amount each seller could list would require some data gathering / intelligence outside of SnowPlow to look at e.g. the total number of items the seller makes available on competitor platforms such as Amazon. This data would be put into a Hive table as below:

/* HiveQL */
CREATE TABLE `max_skus_by_user_id` (
	`user_id` string,
	`yr` int,
	`month` int,
	`max_skus` int
);

And finally, performing the comparison:

/* HiveQL */
INSERT OVERWRITE TABLE `engagement_by_seller`
SELECT
	`events`.`user_id`,
	YEAR(`dt`),
	MONTH(`dt`),
	COUNT(DISTINCT `event_property`) / `max_skus` AS `engagement` /* Note `engagement` needs to be type float or double */
FROM
	`events`
JOIN
	`max_skus_by_user_id`
ON
	`events`.`user_id` LIKE `max_skus_by_visitor`.`user_id`
WHERE
	`events`.`event_type` LIKE "list"
GROUP BY
	`user_id`, YEAR(`dt`), MONTH(`dt`) ;

Deciding which metric to use for user engagement

As the above examples hopefully illustrate, there are many possible metrics which you can employ to describe and measure user engagement. Which metrics are most applicable will vary by business: for each business, there will be multiple possible measures that might potentially work. Whilst a full discussion of all the different possibilities is beyond the scope of this post, a quick tip is that if you do have a number of different possible metrics, and if they rate different users similarly, relative to one another, for equivalent levels of engagement, then chances are that all of them are reasonably robust, and you are reasonably safe picking any one of them (because all of them give similar results).

Interested in using SnowPlow to measure engagement on your website or application?

Visit the SnowPlow website to learn more, especially the page on using SnowPlow to measure user engagement.

Leave a Reply