2011 in retrospect: agile data analytics with Scala

January 4th, 2012 by Alex

Looking back, 2011 was the year in which the team here at Keplar ‘got our hands dirty’ and started writing code to answer some of our clients’ thornier business questions. In the sectors we focus on (online/offline retail, online advertising, digital products), clients often have access to very large data sets, and need help manipulating and analysing this ‘big data’ to understand their business performance, make strategic decisions and build better products and services. This new-found appetite for agile and ‘bottom-up’ analysis and decision-making contrasts strongly with the more ‘top-down’ approach (of business models, focus groups and desk research) traditionally favoured by management consultants.

One of the Keplar bookshelves (the Scala books are out on loan)

The toolkit for answering key business questions through big data is evolving fast – but it is still a programmer’s toolkit, not an ideal one for a data scientist, let alone a business analyst. In 2011 we used Apache Hadoop, Facebook’s Hive and Amazon Mechanical Turk on various data projects – and we plan to use these and other technologies through 2012. These technologies all require some programming effort to prepare meaningful input files and stitch workflows together, and some (such as Hadoop’s MapReduce) actually require coding chops to write the analyses themselves. Although in 2011 we wrote some of that ‘glue code’ in Python, our main language for working with these big data technologies has been – and will continue to be – Scala.

Scala is a relatively young programming language which runs on the JVM (the Java virtual machine) and attempts to fuse Java’s object-oriented approach with a more Haskell-ish functional programming style. A few things made Scala stand out as the language for our agile data analytics work:

  1. Scala runs on the JVM, which gives us native access to some key big data technologies such as Hadoop, HBase and Hive as well as plenty of well-supported client libraries for third-party web services (e.g. ecommerce and advertising platforms)
  2. Scala is statically typed and supports straightforward data encapsulation through OOP – which is our strong preference when working with so many third-party data sources. In particular, static typing catches a whole set of data issues upfront, which saves a lot of data analysis pain later
  3. Scala has functional programming features – which allow us to code in a style which has strong mathematical foundations. This proves really helpful when it comes to performing data analyses
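
To make points 2 and 3 a little more concrete, here is a minimal sketch rather than code from any actual Keplar project. It assumes a hypothetical comma-separated order export with four fields; the names `Order`, `OrderAnalysis`, `parse` and `revenueBySku` are all invented for illustration. Each raw line is parsed once into a typed record, so malformed rows surface at load time, and the analysis itself is a short chain of collection transformations.

```scala
// Hypothetical record type for one row of an order export (an assumption,
// not a real client schema). The fields are typed, so a quantity can never
// silently be a string further down the pipeline.
final case class Order(orderId: String, sku: String, quantity: Int, unitPrice: BigDecimal) {
  def revenue: BigDecimal = unitPrice * BigDecimal(quantity)
}

object OrderAnalysis {
  // Parse one comma-separated line. Malformed rows become a Left carrying
  // a reason, instead of slipping through and corrupting later results.
  def parse(line: String): Either[String, Order] =
    line.split(",").map(_.trim) match {
      case Array(id, sku, qty, price) =>
        try Right(Order(id, sku, qty.toInt, BigDecimal(price)))
        catch { case _: NumberFormatException => Left(s"bad number in: $line") }
      case _ =>
        Left(s"wrong field count in: $line")
    }

  // Functional-style aggregation: group the orders by SKU, then sum the
  // revenue within each group.
  def revenueBySku(orders: Seq[Order]): Map[String, BigDecimal] =
    orders.groupBy(_.sku).map { case (sku, os) => sku -> os.map(_.revenue).sum }
}
```

In use, one would map `OrderAnalysis.parse` over the input lines, collect the `Right` values into a `Seq[Order]` for `revenueBySku`, and log or inspect the `Left` values separately. The design choice is simply to pay the validation cost once, at the boundary where third-party data enters the program.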

Scala is by no means perfect (and the community can sometimes verge on the cranky side of academic), but it offers a good general-purpose, ‘batteries included’ foundation for Keplar’s agile analytics projects. In future blog series we will talk about these projects some more (we have already written a series on Mechanical Turk) – but in my next blog post I’m going to provide a brief introduction to Scala, including a look at some of its neatest features and nastiest gotchas.
