<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Keplar LLP blog &#187; Coding</title>
	<atom:link href="http://www.keplarllp.com/blog/category/coding/feed" rel="self" type="application/rss+xml" />
	<link>http://www.keplarllp.com/blog</link>
	<description>Blogging from the team at Keplar LLP</description>
	<lastBuildDate>Wed, 01 Feb 2012 12:54:20 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>2011 in retrospect: agile data analytics with Scala</title>
		<link>http://www.keplarllp.com/blog/2012/01/2011-in-retrospect-agile-data-analytics-with-scala</link>
		<comments>http://www.keplarllp.com/blog/2012/01/2011-in-retrospect-agile-data-analytics-with-scala#comments</comments>
		<pubDate>Wed, 04 Jan 2012 14:05:49 +0000</pubDate>
		<dc:creator>Alex</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Coding]]></category>
		<category><![CDATA[agile analytics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[scala]]></category>

		<guid isPermaLink="false">http://www.keplarllp.com/blog/?p=1774</guid>
		<description><![CDATA[Looking back, 2011 was the year in which the team here at Keplar &#8216;got our hands dirty&#8217; and started writing code to answer some of our clients&#8217; thornier business questions. In the sectors we focus on (online/offline retail, online advertising, digital products), clients often have access to very large data sets, and need help manipulating [...]]]></description>
			<content:encoded><![CDATA[<p>Looking back, 2011 was the year in which the team here at Keplar &#8216;got our hands dirty&#8217; and started writing code to answer some of our clients&#8217; thornier business questions. In the sectors we focus on (online/offline retail, online advertising, digital products), clients often have access to very large data sets, and need help manipulating and analysing this &#8216;big data&#8217; to understand their business performance, make strategic decisions and build better products and services. This new-found appetite for agile and &#8216;bottom-up&#8217; analysis and decision-making contrasts strongly with the more &#8216;top-down&#8217; approach (of business models, focus groups and desk research) traditionally favoured by management consultants.</p>
<p><div class="wp-caption aligncenter" style="width: 470px"><img title="One of the Keplar bookshelves (the Scala books are out on loan)" src="http://www.keplarllp.com/blog/wp-content/uploads/2012/01/keplar-bookshelf.jpg" alt="One of the Keplar bookshelves (the Scala books are out on loan)" width="460" height="387" /><p class="wp-caption-text">One of the Keplar bookshelves (the Scala books are out on loan)</p></div><br />
<span id="more-1774"></span>The toolkit for answering key business questions through big data is evolving fast &#8211; but it is still a programmer&#8217;s toolkit, not an ideal one for a data scientist, let alone a business analyst. In 2011 we used <a href="http://hadoop.apache.org/" title="Apache Hadoop" target="_blank">Apache Hadoop</a>, <a href="http://hive.apache.org/" title="Apache Hive" target="_blank">Facebook&#8217;s Hive</a> and <a href="https://www.mturk.com/mturk/welcome" title="Amazon Mechanical Turk" target="_blank">Amazon Mechanical Turk</a> on various data projects &#8211; and we plan to use these and other technologies through 2012. These technologies all require some programming effort to prepare meaningful input files and stitch workflows together, and some (such as Google&#8217;s <a href="http://en.wikipedia.org/wiki/MapReduce" title="Google's MapReduce" target="_blank">MapReduce</a>) actually require coding chops to write the analyses themselves. Although in 2011 we wrote some of that &#8216;glue code&#8217; in Python, our main language for working with these big data technologies has been &#8211; and will continue to be &#8211; <a href="http://www.scala-lang.org/" title="Scala" target="_blank">Scala</a>.</p>
<p>Scala is a relatively young programming language which runs on the JVM (the Java virtual machine) and attempts to fuse Java&#8217;s object-oriented approach with a more <a href="http://learnyouahaskell.com/" title="Learn You A Haskell" target="_blank">Haskell-ish</a> functional programming style. A few things made Scala stand out as the language for our agile data analytics work:</p>
<ol>
<li>Scala runs on the JVM, which gives us native access to some key big data technologies such as Hadoop, HBase and Hive as well as plenty of well-supported client libraries for third-party web services (e.g. ecommerce and advertising platforms)</li>
<li>Scala is statically typed and supports straightforward data encapsulation through OOP &#8211; which is our strong preference when working with so many third-party data sources. In particular, static typing catches a whole set of data issues upfront, which saves a lot of data analysis pain later</li>
<li>Scala has functional programming features &#8211; which allow us to code in a style which has strong mathematical foundations. This proves really helpful when it comes to performing data analyses</li>
</ol>
<p>Scala is by no means perfect (and the community can sometimes verge on the cranky side of academic), but it offers a good general-purpose, &#8216;batteries included&#8217; foundation for Keplar&#8217;s agile analytics projects. In future blog posts we will talk about these projects some more (we have already written a series on <a href="http://www.keplarllp.com/blog/2011/09/amazons-mighty-mechanical-turk" title="Introducing Amazon’s mighty Mechanical Turk" target="_blank">Mechanical Turk</a>) &#8211; but in my next blog post I&#8217;m going to provide a brief introduction to Scala, including a look at some of its neatest features and nastiest gotchas.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.keplarllp.com/blog/2012/01/2011-in-retrospect-agile-data-analytics-with-scala/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Our first open-source release: an e-commerce library for using PayPal with CodeIgniter</title>
		<link>http://www.keplarllp.com/blog/2011/03/our-first-open-source-release-an-e-commerce-library-for-using-paypal-with-codeigniter</link>
		<comments>http://www.keplarllp.com/blog/2011/03/our-first-open-source-release-an-e-commerce-library-for-using-paypal-with-codeigniter#comments</comments>
		<pubDate>Wed, 02 Mar 2011 18:18:00 +0000</pubDate>
		<dc:creator>Alex</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[E-commerce]]></category>
		<category><![CDATA[codeigniter]]></category>
		<category><![CDATA[ecommerce]]></category>
		<category><![CDATA[ipn]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[paypal]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://www.keplarllp.com/blog/?p=1040</guid>
		<description><![CDATA[A significant proportion of the work we do at Keplar involves helping companies to build out their ecommerce propositions. For speed and flexibility much of this work is done in PHP &#8211; or rather on well-established open-source technologies, such as CodeIgniter, MODx and Magento, which are built atop PHP. To keep costs down for our [...]]]></description>
			<content:encoded><![CDATA[<p>A significant proportion of the work we do at Keplar involves helping companies to build out their ecommerce propositions. For speed and flexibility much of this work is done in PHP &#8211; or rather on well-established open-source technologies, such as <a href="http://codeigniter.com/">CodeIgniter</a>, <a href="http://modx.com/">MODx</a> and <a href="http://www.magentocommerce.com/">Magento</a>, which are built atop PHP.</p>
<p>To keep costs down for our clients and avoid &#8220;reinventing the wheel&#8221;, where possible we make use of existing libraries, plugins and extensions for these platforms. However this isn&#8217;t always possible, and sometimes we see an opportunity to improve (hopefully!) on the existing options.</p>
<p>At Keplar we&#8217;re keen to open-source any such code which we develop and own ourselves (i.e. isn&#8217;t part of a client deliverable) which we think could be useful for a wider audience. Open sourcing in-house technology has two clear benefits in our eyes: firstly it helps to support the open source ecosystems on which we depend, and secondly, the more eyes we can get on the code we use in the wild the better.</p>
<p>As a first tentative step down this route, we are open sourcing a <a href="https://github.com/orderly/codeigniter-paypal-ipn" target="_blank">PayPal e-commerce library for CodeIgniter</a> which we are already using in production for a couple of clients. We are releasing this library as part of an initiative which we are calling &#8220;Orderly&#8221; &#8211; we hope to release other e-commerce software and libraries under this banner in the future.</p>
<p><span id="more-1040"></span></p>
<p>The library is called codeigniter-paypal-ipn, and is already <a href="https://github.com/orderly/codeigniter-paypal-ipn" target="_blank">available for download on GitHub</a>. The library is designed to make it easier for developers using CodeIgniter to receive, validate and store instant payment notifications (IPNs) sent by PayPal when an order has been paid for by a customer.</p>
<p>The library was inspired by an earlier CodeIgniter PayPal library &#8211; Ran Aroussi&#8217;s <a href="http://codeigniter.com/wiki/PayPal_Lib/" target="_blank">PayPal_Lib</a>. Compared to Ran&#8217;s library, ours adds extra validation of your website&#8217;s interaction with PayPal, and can log your orders to the database. (The launch version of codeigniter-paypal-ipn uses Doctrine 1.2 to save the orders, but we&#8217;re planning a future version which should work without Doctrine.)</p>
<p>For instructions on installing the library, please see the <a href="https://github.com/orderly/codeigniter-paypal-ipn/blob/master/README.textile" target="_blank">codeigniter-paypal-ipn readme file</a>. Once the library is installed, using it in a CodeIgniter controller is quite straightforward &#8211; just write a controller like this:</p>
<pre class="brush: php; title: ; notranslate">
// To handle the IPN post made by PayPal (uses the PayPal_IPN library).
    function ipn()
    {
        $this-&gt;load-&gt;library('PayPal_IPN'); // Load the library

        // Try to get the IPN data.
        if ($this-&gt;paypal_ipn-&gt;validateIPN())
        {
            // Succeeded, now let's extract the order
            $this-&gt;paypal_ipn-&gt;extractOrder();

            // And we save the order now.
            $this-&gt;paypal_ipn-&gt;saveOrder();

            // Now let's check what the payment status is and act accordingly
            if ($this-&gt;paypal_ipn-&gt;orderStatus == PayPal_IPN::PAID)
            {
                // Configure to send HTML emails.
                $this-&gt;load-&gt;library('email');
                $mail_config['mailtype'] = 'html';
                $this-&gt;email-&gt;initialize($mail_config);

                // Prepare the variables to populate the email template:
                $data = $this-&gt;paypal_ipn-&gt;order;
                $data['items'] = $this-&gt;paypal_ipn-&gt;orderItems;

                // Now construct the email (create a template using Smarty or similar)
                $emailBody = $this-&gt;smarty-&gt;view('ecommerce/conf_email.tpl', $data, TRUE);

                // Finish configuring email contents and send.
                $this-&gt;email-&gt;to($data['payer_email'], ($data['first_name'] . ' ' . $data['last_name']));
                $this-&gt;email-&gt;bcc('sales@CHANGEME.com');
                $this-&gt;email-&gt;from('support@CHANGEME.com', 'CHANGEME');
                $this-&gt;email-&gt;subject('Order confirmation');
                $this-&gt;email-&gt;message($emailBody);
                $this-&gt;email-&gt;send();
            }
        }
        else // Just redirect to the root URL
        {
            $this-&gt;load-&gt;helper('url');
            redirect('/', 'refresh');
        }
    }
</pre>
<p>That should be everything! Let us know how you get on with the library in the blog comments below &#8211; or over on GitHub if you prefer. We hope you find it useful, and we look forward to making more &#8220;Orderly&#8221; releases soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.keplarllp.com/blog/2011/03/our-first-open-source-release-an-e-commerce-library-for-using-paypal-with-codeigniter/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Internationalising your e-commerce site with MaxMind customer geo-location</title>
		<link>http://www.keplarllp.com/blog/2010/07/internationalising-your-e-commerce-site-with-maxmind-customer-geo-location</link>
		<comments>http://www.keplarllp.com/blog/2010/07/internationalising-your-e-commerce-site-with-maxmind-customer-geo-location#comments</comments>
		<pubDate>Thu, 15 Jul 2010 17:28:47 +0000</pubDate>
		<dc:creator>Alex</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[E-commerce]]></category>
		<category><![CDATA[currency]]></category>
		<category><![CDATA[geo-IP]]></category>
		<category><![CDATA[geo-location]]></category>
		<category><![CDATA[i18n]]></category>
		<category><![CDATA[internationalisation]]></category>
		<category><![CDATA[localisation]]></category>
		<category><![CDATA[MaxMind]]></category>

		<guid isPermaLink="false">http://www.keplarllp.com/blog/?p=992</guid>
		<description><![CDATA[One of the major attractions of selling online is the ability to address global markets as well as your local market. Doing this effectively means localising key content (e.g. prices) for people visiting from countries around the world. The impact on sales if this is done correctly can be extremely positive: one client saw a [...]]]></description>
			<content:encoded><![CDATA[<p>One of the major attractions of selling online is the ability to address global markets as well as your local market. Doing this effectively means localising key content (e.g. prices) for people visiting from countries around the world. The impact on sales if this is done correctly can be extremely positive: one client saw a 300% increase in international sales since implementing geo-located pricing and delivery. Nor is this overly complex to do, thanks to widely available global payment platforms such as PayPal and geo-location tools such as MaxMind.</p>
<p><img class="aligncenter" src="/blog/wp-content/uploads/2010/07/many_countries.jpg" alt="Many countries" /></p>
<p>And so for the second in our series of <a href="/blog/category/coding">technical blog posts</a>, we are going to look at the opportunities to enhance your e-commerce site using geo-IP location. Geo-IP location sounds complicated but it is simply the process of determining where your individual website visitors are geographically located in the world; this is achieved by looking up each visitor&#8217;s IP address in a database which maps known IP addresses to individual countries or even cities.</p>
<p>As an online retailer, knowing where your website visitors are located allows you to provide them with a much more personalised shopping experience &#8211; for example, you could:</p>
<ul>
<li>Show specific contact details for your visitor&#8217;s country</li>
<li>Price your product catalogue in your customer&#8217;s local currency</li>
<li>Automatically calculate delivery times and costs for their order</li>
</ul>
<p>These sorts of personalisations work in two ways to improve your bottom-line: firstly, they increase the level of confidence and trust which a visitor feels in your site by showing that you can treat them as a &#8216;local&#8217;. And secondly, they reduce friction in the check-out process, removing difficult steps for the user such as converting the given currency into their own money. Using these techniques can significantly increase conversions among overseas visitors, as we have seen above.</p>
<p>On to the technology: although there are various providers of geo-IP address databases, we use <a href="http://www.maxmind.com/" target="_blank">MaxMind</a> because it is free, simple to use and regularly updated. Also note that many e-commerce packages such as Magento or Prestashop have MaxMind integrations available already for free or low cost &#8211; check online to see if your e-commerce package has one too.</p>
<p>For this example we will proceed as if we are integrating MaxMind directly with a simple PHP-based online shop; we will use MaxMind to display some simple internationalised information to your site visitor. In future blog posts we will explore some more sophisticated localisation approaches, to really drive more sales.</p>
<p>Now on to the technical steps&#8230;</p>
<p><span id="more-992"></span></p>
<p>The first step is to install the MaxMind API and database. The commands below all assume that you are working in a web root directory, in a Debian/Ubuntu-like environment:</p>
<pre class="brush: bash; title: ; notranslate">
mkdir MaxMind
cd MaxMind
wget -r --no-parent --reject &quot;index.html*&quot; -nH --cut-dirs=4 http://geolite.maxmind.com/download/geoip/api/php/
mkdir data
cd data
wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
gunzip GeoIP.dat.gz
cd ../..
</pre>
<p>We&#8217;re also going to install the country flags from the famfamfam icon set, so that we can show the user their national flag:</p>
<pre class="brush: bash; title: ; notranslate">
mkdir flags
cd flags
wget http://www.famfamfam.com/lab/icons/flags/famfamfam_flag_icons.zip
unzip famfamfam_flag_icons.zip
cd ..
</pre>
<p>Next, we write a simple helper PHP file which will be included into our shop and will make it easy to run MaxMind and look up the user&#8217;s country:</p>
<pre class="brush: bash; title: ; notranslate">
mkdir includes
vi includes/geoip.php
</pre>
<p>And enter the text:</p>
<pre class="brush: php; title: ; notranslate">
&lt;?php
/**
 * This is a MaxMind helper library
 * Author: Alex Dean (@alexatkeplar http://www.keplarllp.com)
 */

// Include the required PHP file
require_once dirname(__FILE__) . '/../MaxMind/geoip.inc';

// Get the country code using MaxMind geo-IP lookup
function getCountryCode() {
    $mm = geoip_open(dirname(__FILE__) . &quot;/../MaxMind/data/GeoIP.dat&quot;, GEOIP_STANDARD);
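    // X-Forwarded-For may hold a comma-separated chain (client, proxy, ...), so use only its first entry.
    // The header is supplied by the client, so only rely on it behind a proxy you control.
    $forwarded = isset($_SERVER['HTTP_X_FORWARDED_FOR']) ? explode(',', $_SERVER['HTTP_X_FORWARDED_FOR']) : array();
    $ip = $forwarded ? trim($forwarded[0]) : $_SERVER['REMOTE_ADDR'];
    $countryCode = geoip_country_code_by_addr($mm, $ip);
    geoip_close($mm);

    return $countryCode; // 'GB', 'US' etc, or an empty value (false or '') if not found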
}
</pre>
<p>With this done, let&#8217;s next display the user&#8217;s current location, so that they know that this online shop is localised for their specific country. Start by creating an index file:</p>
<pre class="brush: bash; title: ; notranslate">
vi index.php
</pre>
<p>And add in the following code:</p>
<pre class="brush: php; title: ; notranslate">
&lt;?php
require_once(&quot;includes/geoip.php&quot;);

$countryCode = getCountryCode();
if (!empty($countryCode)) {
    echo &quot;Hello, you are from &lt;img src='flags/png/&quot; . strtolower($countryCode) . &quot;.png'&gt;&quot;;
} else {
    echo &quot;Sorry, we don't know what country you are from&quot;;
}
</pre>
<p>A few things to note about this code:</p>
<ol>
<li> We need to check that MaxMind successfully found the IP address, because some IP addresses don&#8217;t exist in the MaxMind database.</li>
<li> Rather than just displaying the visitor&#8217;s country&#8217;s flag, we could use the Zend Framework to map the country code onto the country&#8217;s name. (Installing the Zend Framework is out of scope for this blog post).</li>
<li> The MaxMind database is regularly updated (typically once a month) with new and changing IP addresses, so it is worth setting up a cron job to update the database automatically.</li>
</ol>
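<p>To sketch the cron job mentioned in the last point above: the job could download the latest database and then hand it to a small shell helper like the one below, which checks the download before swapping it in. The function name install_geoip_db and the /var/www and /tmp paths in the comments are our own placeholders rather than anything provided by MaxMind, so treat this as a starting point and not a finished script:</p>
<pre class="brush: bash; title: ; notranslate">
# Sketch only: install_geoip_db is a hypothetical helper, not part of the MaxMind API.
# Usage: install_geoip_db /path/to/GeoIP.dat.gz /path/to/GeoIP.dat
install_geoip_db() {
    gz_file="$1"
    target="$2"

    # Refuse to install a truncated or corrupt download.
    if ! gzip -t "$gz_file"; then
        echo "Corrupt download, keeping the existing database"
        return 1
    fi

    # Decompress a copy next to the target, then rename it into place,
    # so the shop never reads a half-written file.
    cp "$gz_file" "$target.new.gz"
    gunzip -f "$target.new.gz"
    mv -f "$target.new" "$target"
}

# Example use from a monthly cron script (the paths are assumptions):
#   wget -q -O /tmp/GeoIP.dat.gz http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
#   install_geoip_db /tmp/GeoIP.dat.gz /var/www/MaxMind/data/GeoIP.dat
</pre>
<p>Renaming the new file into place, rather than decompressing directly over the old one, means a visitor&#8217;s lookup should never hit a partially written database.</p>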
<p>With the basic country-detection code functioning, the next steps would be to layer in more country-specific features, such as pricing in local currency and country-specific contact details. Also it&#8217;s a good idea to allow the user to change their country manually, in case MaxMind got it wrong or their country could not be determined. Let us know in the comments what aspects of this internationalisation you would like us to tackle next!</p>
<p>I hope the above is useful and leaves you with a better understanding of what geo-IP location is, and why it is such a powerful tool for e-commerce. And do let me know how you get on with the code samples &#8211; I will reply to any questions in the comments.</p>
<p><b>Are you interested in internationalising your e-commerce site? Keplar can provide you with strategic and technical support &#8211; please drop us an <a href="mailto:alex@keplarllp.com?Subject=Ecommerce%20internationalisation">email</a> to find out more.</b></p>
]]></content:encoded>
			<wfw:commentRss>http://www.keplarllp.com/blog/2010/07/internationalising-your-e-commerce-site-with-maxmind-customer-geo-location/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Better competitive intelligence through scraping with Groovy</title>
		<link>http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy</link>
		<comments>http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy#comments</comments>
		<pubDate>Wed, 20 Jan 2010 08:37:20 +0000</pubDate>
		<dc:creator>Alex</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[Scaling business processes]]></category>
		<category><![CDATA[Search & aggregation]]></category>
		<category><![CDATA[automate]]></category>
		<category><![CDATA[crawl]]></category>
		<category><![CDATA[groovy]]></category>
		<category><![CDATA[monitor]]></category>
		<category><![CDATA[scrape]]></category>
		<category><![CDATA[spider]]></category>

		<guid isPermaLink="false">http://www.keplarllp.com/blog/?p=198</guid>
		<description><![CDATA[For the first of our series of technical posts I&#8217;m going to look at the poorly understood topic of web scraping. To start with a definition: web scraping is the process of automatically collecting Web information and turning it from unstructured, human-readable data into structured data that can be stored and analysed in a database [...]]]></description>
			<content:encoded><![CDATA[<p>For the first of our series of technical posts I&#8217;m going to look at the poorly understood topic of <a href="http://en.wikipedia.org/wiki/Web_scraping" target="_blank">web scraping</a>. To start with a definition: web scraping is the process of automatically collecting Web information and turning it from unstructured, human-readable data into structured data that can be stored and analysed in a database or spreadsheet. The most famous scraper of all is Google, who regularly scrape and index a huge proportion of the Web to feed into their search engine.</p>
<p>How is web scraping useful for a business that isn’t Google-sized? Web scraping can be used to collect and structure competitor data, making it an incredibly powerful marketing intelligence tool. Consider online retail: using a web scraper it is possible for a retailer to automatically survey the range of products offered by competitor sites and the price each product is offered at. Because web scrapers can be automated, they can be programmed to run regularly – so companies can analyse how sensitive their own sales volumes are not just to their own prices, but also to the prices their competitors charge for the same items. It is even possible to use the data from web scrapers to dynamically price items in an online shop so that they are always competitive. If you&#8217;re an online retailer, you are quite possibly already being regularly scraped by a competitor.</p>
<p>In this post, we provide an example of a simple web scraper built using <a href="http://groovy.codehaus.org/" target="_blank">Groovy</a>. We chose Groovy because it&#8217;s a powerful, agile scripting language which is great at navigating/analysing HTML. The target of our scraper is a <a href="http://labs.keplarllp.com/shop/" target="_blank">simple test &#8220;shop&#8221;</a> which we have set up in Keplar Labs. You are welcome to run this scraper against our test shop, but please note that scraping other sites may be against their terms and conditions or even in some cases an offence. <strong>Please seek legal advice before running any scraper on someone else&#8217;s website.</strong></p>
<p>Without further ado, here is the Groovy code:</p>
<p><span id="more-198"></span></p>
<pre class="brush: groovy; title: ; notranslate">
/* Intelbot written by Alex Dean on 15 Jan 2010 with dependencies:
 *  - NekoHTML parser (latest stable), http://nekohtml.sourceforge.net/index.html
 *  - Xerces (2.0.0 or higher), http://www.apache.org/dist/xerces/j/
 */

// Define the pages which contain links to products - our &quot;seeds&quot; in crawl parlance.
def seeds = [&quot;http://labs.keplarllp.com/shop&quot;]

// Load the NekoHTML parser with Xerces - this lets us parse the HTML.
slurper = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser())

// Now let's loop through each seed URL in turn.
seeds.each() {

	println &quot;Accessing seed URL ${it}&quot;
	def seedURL = new URL(it)

	seedURL.withReader { seedReader -&gt;
		def seedHTML = slurper.parse(seedReader)

		// Show the title of the seed page we're parsing.
		println &quot;Seed page title is ${seedHTML.depthFirst().grep{ it.name() == 'TITLE'}}&quot;

		// Now loop through and find all the product links on this page.
		// For our purposes, a product link is any A tag inside a box div on the page.
		def productLinks = seedHTML.depthFirst().grep{ it.name() == 'DIV' &amp;&amp; it.@class == 'box' }.collect { it.A.'@href'.toString() }
		productLinks.each {

			println &quot;  Accessing product URL ${it}&quot;
			def productURL = new URI(seedURL.toString()).resolve(it).toURL()

			productURL.withReader { productReader -&gt;
				def productHTML = slurper.parse(productReader)

				// Now display the product name.
				println &quot;    Name is ${productHTML.depthFirst().grep{ it.name() == 'H1'}}&quot;

				// Now display the product description.
				println &quot;    Description is ${productHTML.depthFirst().grep{ it.name() == 'P' &amp;&amp; it.@class == 'ProductDescription'}}&quot;

				// Now need to grab the product price.
				println &quot;    Price is ${productHTML.depthFirst().grep{ it.name() == 'P' &amp;&amp; it.@class == 'ProductPrice'}}&quot;
			}
		}
	}
}
</pre>
<p>The code should be fairly self-documenting: essentially there is a loop to process each &#8220;seed page&#8221; (we have only one such page), and then an inner loop to process each product page found on a given seed page.</p>
<p>When you run the script in the Groovy Console, you should see output like this:</p>
<pre class="brush: plain; light: true; title: ; notranslate">
Accessing seed URL http://labs.keplarllp.com/shop
Seed page title is [Keplar Labs]
  Accessing product URL http://labs.keplarllp.com/shop/product1.html
    Name is [Flight to Dubai]
    Description is [Economy class flight to Dubai with Emirates]
    Price is [£699.00]
  Accessing product URL http://labs.keplarllp.com/shop/product2.html
    Name is [Train to Paris]
    Description is [First class train ticket to Paris on Eurostar]
    Price is [£225.50]
  Accessing product URL http://labs.keplarllp.com/shop/product3.html
    Name is [Flight to New York]
    Description is [Economy class flight to New York on United]
    Price is [£375.00]
</pre>
<p>As you can see, the scraper has successfully identified three products linked from the seed page, then accessed each of these product pages and retrieved the key information about each product (its name, description and current price). A more sophisticated scraper would simply build on this functionality, for example adding error handling and potentially adding some caching.</p>
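<p>If you would rather have that output in a spreadsheet, one option is to pipe it through a short shell function. The sketch below assumes the exact output format shown above (one bracketed value per line, with the price printed last for each product), and the name to_csv is just a placeholder we have chosen:</p>
<pre class="brush: bash; title: ; notranslate">
# Sketch only: reads the scraper console output on stdin, writes CSV on stdout.
to_csv() {
    awk '
        # Return the text between the first [ and the final ] as a quoted CSV field.
        function field(line,    start) {
            start = index(line, "[")
            line = substr(line, start + 1, length(line) - start - 1)
            gsub(/"/, "\"\"", line)    # CSV escapes a quote by doubling it
            return "\"" line "\""
        }
        BEGIN                    { print "url,name,description,price" }
        /Accessing product URL/  { url = $NF }
        /^ *Name is \[/          { name = field($0) }
        /^ *Description is \[/   { description = field($0) }
        # The price is the last line printed for each product, so emit the row here.
        /^ *Price is \[/         { print "\"" url "\"," name "," description "," field($0) }
    '
}

# Example use (the script file name is an assumption):
#   groovy Intelbot.groovy | to_csv
</pre>
<p>Each product becomes one quoted CSV row, which can then be opened in a spreadsheet or loaded into a database for analysis.</p>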
<p>Web scraping is an incredibly powerful tool that has already transformed the web (by enabling services like Google search) but we believe it has the power to transform regular businesses too &#8211; once they learn to employ it. It&#8217;s also cheap to perform, making it a cost effective tool. So do try out the above code on your own computer &#8211; and let me know how you get on in the comments.</p>
<p><strong><i>Interested in having a web scraper designed and built for your company? <a href="http://www.keplarllp.com/contact">Send us an email</a> to find out how Keplar can help.</i></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

