For the first of our series of technical posts, I’m going to look at the poorly understood topic of web scraping. To start with a definition: web scraping is the process of automatically collecting information from web pages and turning it from unstructured, human-readable content into structured data that can be stored and analysed in a database or spreadsheet. The most famous scraper of all is Google, who regularly scrape and index a huge proportion of the Web to feed their search engine.
How is web scraping useful for a business that isn’t Google-sized? Web scraping can be used to collect and structure competitor data, making it an incredibly powerful marketing intelligence tool. Consider online retail: using a web scraper, a retailer can automatically survey the range of products offered by competitor sites and the price at which each product is offered. Because web scrapers can be automated, they can be programmed to run regularly, so companies can analyse how sensitive their own sales volumes are not just to their own prices, but also to the prices competitors charge for the same items. It is even possible to use the data from web scrapers to dynamically price items in an online shop so that they are always competitive. If you’re an online retailer, you are quite possibly already being regularly scraped by a competitor.
In this post, we provide an example of a simple web scraper built using Groovy. We chose Groovy because it’s a powerful, agile scripting language which is great at navigating and analysing HTML. The target of our scraper is a simple test “shop” which we have set up in Keplar Labs. You are welcome to run this scraper against our test shop. Please note, however, that scraping other sites may be against their terms and conditions or, in some cases, even an offence. Please seek legal advice before running any scraper on someone else’s website.
Without further ado, here is the Groovy code:
