Approaches to developing your company’s analytics capability in a big data world

February 10th, 2012 by Yali

Whilst there is a lot to love about big data, it also gives CIOs and business folks reason to moan: using it means developing expertise in new approaches to analytics, new technologies and new business processes. However, for companies that have not yet successfully implemented a data warehouse, big data offers one huge reason to smile: it makes it possible to develop a data warehousing platform in an incremental, step-by-step fashion, with much lower initial costs than the traditional “big bang” approach that was the orthodoxy in the pre-big data world.

In this blog post, we explain why that is the case, and outline what we believe is the best approach for companies looking to build out an internal analytics capability that takes advantage of big data technologies like Hadoop.

Read the rest of this entry »

Approaches to accuracy for Mechanical Turk

September 30th, 2011 by Yali

This is the third blog post in our series on using Amazon’s Mechanical Turk to build scalable business processes. Please see also our introductory post and our second post, getting started with Mechanical Turk.

Amazon’s Mechanical Turk provides a very convenient platform for getting large numbers of workers to perform manual steps as part of large scale business processes, such as cleaning data sets for use in machine-learning algorithms, or moderating content.

However, it is not enough for Mechanical Turk to provide results fast. The results themselves need to be reliable and hence it is critical that companies using Mechanical Turk invest in a suitable strategy for accuracy.

The mirror in the Hubble Space Telescope, the most precise ever made, was initially ground to the wrong curvature: its edge was too flat by roughly 2,200 nanometers. The inaccuracy was catastrophic and cost several million dollars to fix.

Amazon provides two primary tools for helping users validate the accuracy of results. We’ll look at both briefly, before outlining a third technique which, in combination with the first two, delivers a very rigorous approach to accuracy. These three strategies for accuracy are as follows:

Read the rest of this entry »

Smarter catalogue management through automation: a primer for online retailers

September 27th, 2011 by Alex

This post is the first in a Keplar series for online retailers, showing you how to automate your way to a more profitable and responsive e-commerce business. Get in touch to discuss how to apply these techniques to your company.

At Keplar we have just completed a “soup-to-nuts” project launching a new image-heavy e-commerce site in the lifestyle space; the retailer has launched with 100 SKUs (each with 17 product images) with a plan to grow its catalogue aggressively to 1,000+ SKUs over the next 6 months. With these sorts of numbers, catalogue management – especially around product imagery – starts to be a real headache and also potentially a significant time/cost sink for the business: even something as simple as updating the watermark on each image becomes a major undertaking.

The headaches of manual processes

Off-the-shelf technology solutions to streamline these processes already exist, typically referred to as Master Data Management systems; the leader in the field is probably Hybris, with its Hybris PCM (product content management) system. But the Hybris technology stack is designed for major retailers with very large catalogues and/or complex product lifecycles – and it is priced accordingly; there’s no real equivalent for smaller retailers who want a better (i.e. less manual) approach to catalogue management than that provided by their e-commerce package.

Read the rest of this entry »

Getting started with Mechanical Turk

September 20th, 2011 by Yali

Amazon has done an excellent job of making Mechanical Turk very easy to use. It also provides great documentation to help users get started. The purpose of this post, then, is to provide a high-level overview of how to:

  1. Think conceptually about using Mechanical Turk
  2. Use the web UI Amazon provides to do the actual implementation

Define your HIT(s)

At the heart of every Mechanical Turk engagement is what Amazon calls a “Human Intelligence Task” or “HIT”. Each HIT is an independent unit of work.

As we mentioned in our last blog post, we have been using Mechanical Turk to check the language of a short content item. We already have an inkling of what language each content item is in; however, we are only 70-80% sure that we are correct, so we use Mechanical Turk to get real people to check whether each guess is correct.

In our case, then, each “HIT” consists simply of a worker checking the content and either confirming that the content is in the language we thought it was, or not. Notice this HIT has several important characteristics:

  1. It is very simple to instruct. “Look at the below sentence. Is it in French? If so, click ‘yes’. Otherwise, click ‘no’.”
  2. It is independent. We have millions of sentences to check. However, the checking of each individual sentence is a completely independent task: there is no requirement that the person checking sentence A needs also to check sentence B. Hence it is possible that many hundreds or thousands of workers can work on the tasks in parallel, to ensure they are done quickly.
  3. It is repeatable. We can ask a number of different workers to perform the same task, and they should all give the same answer.  (This becomes important for ensuring the accuracy of results, because it means that we can verify the accuracy of individual tasks and individual workers.)
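
To make the "repeatable" property concrete, here is a minimal Python sketch of how answers from several workers on the same HIT might be combined. It is an illustration under stated assumptions, not the approach described in the rest of this post: the sample answers, the `majority_answers` and `worker_agreement` helpers, and the idea of scoring each worker against the majority are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical results: (hit_id, worker_id, answer). All values are invented.
answers = [
    ("sentence-1", "worker-a", "yes"),
    ("sentence-1", "worker-b", "yes"),
    ("sentence-1", "worker-c", "no"),
    ("sentence-2", "worker-a", "no"),
    ("sentence-2", "worker-b", "no"),
    ("sentence-2", "worker-c", "no"),
]


def majority_answers(rows):
    """Return {hit_id: most common answer} across all workers for each HIT."""
    by_hit = defaultdict(list)
    for hit_id, _worker, answer in rows:
        by_hit[hit_id].append(answer)
    return {hit: Counter(vals).most_common(1)[0][0] for hit, vals in by_hit.items()}


def worker_agreement(rows, consensus):
    """Return {worker_id: fraction of that worker's answers matching the majority}."""
    agreed = Counter()
    total = Counter()
    for hit_id, worker, answer in rows:
        total[worker] += 1
        if answer == consensus[hit_id]:
            agreed[worker] += 1
    return {worker: agreed[worker] / total[worker] for worker in total}


consensus = majority_answers(answers)
agreement = worker_agreement(answers, consensus)
print(consensus)
print(agreement)
```

Asking an odd number of workers per HIT avoids ties on a yes/no question; with an even number, this sketch simply keeps whichever answer it saw first.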

Read the rest of this entry »

Introducing Amazon’s mighty Mechanical Turk

September 20th, 2011 by Yali

Many companies, both inside and outside the tech industry, invest significant resources in using tech to automate business processes. Some processes, however, are much better done by people than machines. This makes automating them and then scaling them difficult. For some of these business processes, however, Amazon’s Mechanical Turk provides a way to effectively “automate” the manual step, providing an incredibly powerful tool to build scalable systems and processes that rely on human input.

Amazon’s Mechanical Turk has been around for some time (it was first launched in 2005). However, we’re surprised by the number of businesses we encounter who are still not aware of Mechanical Turk and the potential ways using it could save them time and money, whilst opening up new product development opportunities.  In this blog post series, we look at what Mechanical Turk is, how it works, where it works best, how to start using it and how to build scalable systems around it that effectively yield accurate results.

Read the rest of this entry »

Better competitive intelligence through scraping with Groovy

January 20th, 2010 by Alex

For the first of our series of technical posts I’m going to look at the poorly understood topic of web scraping. To start with a definition: web scraping is the process of automatically collecting Web information and turning it from unstructured, human-readable data into structured data that can be stored and analysed in a database or spreadsheet. The most famous scraper of all is Google, who regularly scrape and index a huge proportion of the Web to feed into their search engine.
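
To make that definition concrete, here is a minimal sketch of the "unstructured to structured" step. It is written in Python using only the standard library, purely as an illustration; it is not the Groovy scraper this post goes on to present. The HTML fragment, the `product`, `name` and `price` class names, and the `ProductParser` and `scrape` helpers are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the markup and class names are invented.
PAGE = """
<ul>
  <li class="product"><span class="name">Blue mug</span><span class="price">4.50</span></li>
  <li class="product"><span class="name">Red mug</span><span class="price">5.25</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # name of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a name or price span.
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(html):
    """Turn the HTML into (name, price) rows ready for a spreadsheet or database."""
    parser = ProductParser()
    parser.feed(html)
    return [(p["name"], float(p["price"])) for p in parser.products]


rows = scrape(PAGE)
print(rows)
```

A real scraper would fetch the page over HTTP and cope with missing or malformed fields; this sketch skips both to keep the parsing step visible.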

How is web scraping useful for a business that isn’t Google-sized? Web scraping can be used to collect and structure competitor data, making it an incredibly powerful marketing intelligence tool. Consider online retail: using a web scraper it is possible for a retailer to automatically survey the range of products offered by competitor sites and the price each product is offered at. Because web scrapers can be automated, they can be programmed to run regularly, so companies can analyse how sensitive their own sales volumes are not just to their own prices, but also to the prices competitors charge for the same items. It is even possible to use the data from web scrapers to dynamically price items in an online shop so that they are always competitive. If you’re an online retailer, you are quite possibly already being regularly scraped by a competitor.

In this post, we provide an example of a simple web scraper built using Groovy. We chose Groovy because it’s a powerful, agile scripting language which is great at navigating and analysing HTML. The target of our scraper is a simple test “shop” which we have set up in Keplar Labs. You are welcome to run this scraper against our test shop, but please note that scraping other sites may be against their terms and conditions or even, in some cases, an offence. Please seek legal advice before running any scraper on someone else’s website.

Without further ado, here is the Groovy code:

Read the rest of this entry »