For the first in our series of technical posts, I’m going to look at the poorly understood topic of web scraping. To start with a definition: web scraping is the process of automatically collecting information from the Web and turning it from unstructured, human-readable data into structured data that can be stored and analysed in a database or spreadsheet. The most famous scraper of all is Google, which regularly scrapes and indexes a huge proportion of the Web to feed its search engine.
How is web scraping useful for a business that isn’t Google-sized? Web scraping can be used to collect and structure competitor data, making it an incredibly powerful marketing intelligence tool. Consider online retail: using a web scraper, a retailer can automatically survey the range of products offered by competitor sites and the price at which each product is offered. Because web scrapers can be automated, they can be programmed to run regularly, so companies can analyse how sensitive their own sales volumes are not just to their own prices, but also to the prices competitors charge for the same items. It is even possible to use the data from web scrapers to dynamically price items in an online shop so that they are always competitive. If you’re an online retailer, you are quite possibly already being regularly scraped by a competitor.
In this post, we provide an example of a simple web scraper built using Groovy. We chose Groovy because it’s a powerful, agile scripting language which is great at navigating and analysing HTML. The target of our scraper is a simple test “shop” which we have set up in Keplar Labs. You are welcome to run this scraper against our test shop. Please note, however, that scraping other sites may be against their terms and conditions, or even in some cases an offence, so seek legal advice before running any scraper on someone else’s website.
Without further ado, here is the Groovy code:
/* Intelbot written by Alex Dean on 15 Jan 2010 with dependencies:
 * - NekoHTML parser (latest stable), http://nekohtml.sourceforge.net/index.html
 * - Xerces (2.0.0 or higher), http://www.apache.org/dist/xerces/j/
 */

// Define the pages which contain links to products - our "seeds" in crawl parlance.
def seeds = ["http://labs.keplarllp.com/shop"]

// Load the NekoHTML parser with Xerces - this lets us parse the HTML.
slurper = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser())

// Now let's loop through each seed URL in turn.
seeds.each() {
    println "Accessing seed URL ${it}"
    def seedURL = new URL(it)
    seedURL.withReader { seedReader ->
        def seedHTML = slurper.parse(seedReader)

        // Show the title of the seed page we're parsing.
        println "Seed page title is ${seedHTML.depthFirst().grep{ it.name() == 'TITLE'}}"

        // Now loop through and find all the product links on this page.
        // For our purposes, a product link is any A tag inside a box div on the page.
        def productLinks = seedHTML.depthFirst().grep{ it.name() == 'DIV' && it.@class == 'box' }.collect { it.A.'@href'.toString() }
        productLinks.each {
            println " Accessing product URL ${it}"
            def productURL = new URI(seedURL.toString()).resolve(it).toURL()
            productURL.withReader { productReader ->
                def productHTML = slurper.parse(productReader)

                // Now display the product name.
                println " Name is ${productHTML.depthFirst().grep{ it.name() == 'H1'}}"

                // Now display the product description.
                println " Description is ${productHTML.depthFirst().grep{ it.name() == 'P' && it.@class == 'ProductDescription'}}"

                // Now need to grab the product price.
                println " Price is ${productHTML.depthFirst().grep{ it.name() == 'P' && it.@class == 'ProductPrice'}}"
            }
        }
    }
}
The code should be fairly self-documenting: essentially there is a loop to process each “seed page” (we have only one such page), and then an inner loop to process each product page found on a given seed page.
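One step in the inner loop is worth isolating: each product link is resolved against the seed URL before it is fetched, because links on a page are often relative. The snippet below is a minimal sketch of that step on its own, using the standard java.net.URI.resolve method. The helper name resolveLink and the example.com address are just for illustration and are not part of the scraper.

```groovy
// Resolve an href found on a page against that page's own URL.
// resolveLink is a hypothetical helper name used only in this sketch.
def resolveLink = { String pageUrl, String href ->
    new URI(pageUrl).resolve(href).toURL()
}

def seed = "http://labs.keplarllp.com/shop"

// A root-relative link is joined onto the seed's scheme and host.
assert resolveLink(seed, "/shop/product1.html").toString() ==
        "http://labs.keplarllp.com/shop/product1.html"

// An absolute link is returned unchanged.
assert resolveLink(seed, "http://example.com/item.html").toString() ==
        "http://example.com/item.html"

// A path-relative link replaces the last path segment of the base,
// so with no trailing slash on the seed it lands at the site root.
assert resolveLink(seed, "product1.html").toString() ==
        "http://labs.keplarllp.com/product1.html"
```

The last case is a common surprise: whether the seed URL ends in a slash changes where path-relative links end up, so it is worth checking the links a target page really serves.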
When you run the script in the Groovy Console, you should see output like this:
Accessing seed URL http://labs.keplarllp.com/shop
Seed page title is [Keplar Labs]
Accessing product URL http://labs.keplarllp.com/shop/product1.html
Name is [Flight to Dubai]
Description is [Economy class flight to Dubai with Emirates]
Price is [£699.00]
Accessing product URL http://labs.keplarllp.com/shop/product2.html
Name is [Train to Paris]
Description is [First class train ticket to Paris on Eurostar]
Price is [£225.50]
Accessing product URL http://labs.keplarllp.com/shop/product3.html
Name is [Flight to New York]
Description is [Economy class flight to New York on United]
Price is [£375.00]
As you can see, the scraper has successfully identified three products linked from the seed page, then accessed each of these product pages and retrieved the key information about each product (its name, description and current price). A more sophisticated scraper would simply build on this functionality, for example adding error handling and potentially adding some caching.
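To give a rough idea of what error handling and caching might look like, here is a minimal sketch rather than a finished design. It wraps any "fetch" closure (URL string in, page text out) with an in-memory cache and a try/catch, so one bad page does not abort the whole crawl. The names cachingFetcher, fakeFetch and fetchPage are hypothetical, and the fake fetcher stands in for a real download so the sketch runs without network access.

```groovy
// Hypothetical helper: wraps a fetch closure (URL string -> page text)
// with a simple in-memory cache and basic error handling.
def cachingFetcher = { Closure fetch ->
    def cache = [:]
    return { String url ->
        if (cache.containsKey(url)) {
            return cache[url]
        }
        try {
            def body = fetch(url)
            cache[url] = body
            return body
        } catch (IOException e) {
            // Report and skip this page rather than aborting the whole crawl.
            println "Skipping ${url}: ${e.message}"
            return null
        }
    }
}

// Stand-in for a real download such as { url -> new URL(url).text },
// so that the sketch can run offline. It counts how often it is called.
def calls = 0
def fakeFetch = { String url ->
    calls++
    if (url.endsWith("missing.html")) {
        throw new FileNotFoundException(url)
    }
    "<html><h1>Page at ${url}</h1></html>".toString()
}

def fetchPage = cachingFetcher(fakeFetch)

assert fetchPage("http://labs.keplarllp.com/shop/product1.html") != null
assert fetchPage("http://labs.keplarllp.com/shop/product1.html") != null
assert calls == 1   // the second request was served from the cache
assert fetchPage("http://labs.keplarllp.com/shop/missing.html") == null
```

Note that failed pages are not cached in this sketch, so they would be retried on the next request. A real scraper would probably also want cache expiry, since prices change over time.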
Web scraping is an incredibly powerful tool that has already transformed the Web (by enabling services like Google search), but we believe it has the power to transform regular businesses too, once they learn to employ it. It’s also cheap to perform, making it a cost-effective tool. So do try out the above code on your own computer, and let me know how you get on in the comments.
Interested in having a web scraper designed and built for your company? Send us an email to find out how Keplar can help.

Thanks, great post!
First run I got:
Accessing product URL /shop/product1.html
Caught: java.net.MalformedURLException: no protocol: /shop/product1.html
at x$_run_closure1_closure2_closure6.doCall(x.groovy:32)
at x$_run_closure1_closure2.doCall(x.groovy:29)
at x$_run_closure1.doCall(x.groovy:20)
at x.run(x.groovy:15)
so I added ‘http://labs.keplarllp.com/’ as the root of the URL and it worked
I guess support for relative paths is missing.
Also, I had to download CyberNeko manually. Does anyone know why
@Grapes(
    @Grab(group='nekohtml', module='nekohtml', version='0.7.6')
)
doesn’t work?
Thanks for your comment! You’re right, the code wasn’t coping with relative URLs. I’ve updated the offending line to be:
def productURL = new URI(seedURL.toString()).resolve(it).toURL()
Hopefully that fixes it. I hadn’t heard of Grape – it looks cool, will check it out!