As an example, I am going to extract electronic components from a vendor, where I want specific information about each component: serial number, name and price are all relevant. I want to place each component in its category, since I am going to fetch several different component types, such as resistors, transistors, diodes, ICs, and crystals and resonators.
Scraping
If I were to extract all this information by hand it would be quite tedious work, so what could be more fun than writing a script that simulates a human browsing through the wanted pages and extracts all this information for us, which we can then store in files and work with later? The way we search, or scrape, for relevant information is by looking for certain elements in the source code of each page using so-called CSS selectors. As you might know, CSS (Cascading Style Sheets) is used to keep page styles, fonts and the like separate from the markup and easily available. In CSS we rely on selectors to mark which parts of the document each style should apply to. By using those same selectors while parsing, we can extract the information we want based on which element we point the selector at.
As a side note I want to mention that whenever we build scraping tools and web crawlers, we should always respect the site's "robots.txt", found in the root of the web domain (http://www.example.com/robots.txt). This file tells robots how they should read a site and where they do and do not have access; read more under the Robots exclusion standard. We have pretty good control over our bot and where it goes, so it is not really an issue here, but I decided to check the robots.txt file anyway: it only rejects access to the "cgi-bin" folder.
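Checking it takes barely a line; a quick sketch, assuming only that open-uri from the standard library is available:

```ruby
require 'open-uri'

# Quick peek at the vendor's robots.txt before we start crawling.
# (open-uri lets Kernel#open take a URL; newer Rubies prefer URI.open.)
puts open("http://www.ehobby.no/robots.txt").read
```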
In Ruby we have several gems that make the whole web-scraping process pretty easy, from simple scraping to more advanced HTML handling like filling out forms and processing cookies. In this example I do not need more than simple CSS selectors, so I will use Nokogiri to assist me. Nokogiri's authors describe it like this:
Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri's many features is the ability to search documents via XPath or CSS3 selectors.

Okay, this is all well and good; a minimal taste of the API is sketched below, and then we can take a look at the page and see what information we can start with.
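Here is a rough sketch of what working with Nokogiri looks like, assuming only that the front page is reachable at the URL used throughout this post:

```ruby
require 'nokogiri'
require 'open-uri'

# Parse the front page into a document we can query.
doc = Nokogiri::HTML(open("http://www.ehobby.no"))

# Searching works with CSS selectors as well as XPath.
puts doc.css("title").text
puts doc.at_xpath("//title").text
```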
Scrape the page
Let's move to the vendor's main site and see what it looks like. When we go to http://www.ehobby.no, the main page shows some kind of greeting; on the left there is a category submenu containing the text "Komponenter" (Components). This is a good starting point; now we need to find the tag used to display this text.
Finding elements is pretty easy with the Firebug extension, which allows us to "inspect" elements on a site and display the requested element in Firebug's own analysis window, as shown below.
Since this is all the information we need from the main page, it is a lot easier to simply give that URL to our bot instead of writing a small procedure to extract this portion of the code. Let's move on to the category-listings page, where we will find the wanted category for each component type. Now we can start looking for the relevant elements, feed them to Nokogiri as CSS selectors and extract each category. To hold only the wanted categories, I created a hash keyed on the names of the wanted categories, where each key holds an array used to store the components, as sketched below.
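Something like this, where the key names are illustrative stand-ins for the vendor's own category names:

```ruby
# Only these categories are of interest; each key maps to an array
# that will collect the components scraped from that category.
categories = {
  "Resistors"               => [],
  "Transistors"             => [],
  "Diodes"                  => [],
  "IC"                      => [],
  "Crystals and resonators" => []
}
```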
Let us try to use Nokogiri to show us the URL of each component type we want. The elements we can use are shown in the picture below, where we see that each category is wrapped inside a div tag with the class name "categoryListBoxContents".
By using IRB for testing purposes we avoid hammering the website with requests: we create a Nokogiri object of the page once and then search for the elements using the CSS selector shown above. A picture of what the test script looks like is shown below.
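The test script boils down to something like the following; the exact category-listing URL is an assumption on my part, as only the selector comes from the page:

```ruby
require 'nokogiri'
require 'open-uri'

# One fetch, then any number of selector experiments in IRB.
doc = Nokogiri::HTML(open("http://www.ehobby.no/index.php?cPath=1"))

# Each category sits in a div with class "categoryListBoxContents";
# the link inside carries both the category name and its URL.
doc.css("div.categoryListBoxContents a").each do |link|
  puts "#{link.text.strip} => #{link['href']}"
end
```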
This will output the category name and the URL to the components in each category. It is quite easy to extract the information this way, as we just have to identify the elements and hand them to Nokogiri. But when we look at, for example, the resistors component list, we see that it shows only 10 results per page ("Showing N of N (out of N products)"), so we need a method that figures out how many results we have and how many pages there actually are. Firebug helps us find the selector that identifies these numbers so we can extract them. Below is a picture of how this info is extracted and used in a Ruby method.
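A sketch of what such a method could look like; the "td.smallText" selector is an assumption here, while the 10-results-per-page figure comes from the listing itself:

```ruby
# Work out how many result pages a category has, given 10 products
# per page.
def page_count(doc)
  counter = doc.css("td.smallText").text    # assumed selector for the counter text
  total   = counter[/\(\D*(\d+)/, 1].to_i   # the "(out of N products)" number
  (total / 10.0).ceil                       # 10 results per page
end
```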
All we need now is to extract each component and put it into the right hash key. To scrape the components I could use the CSS selector "table .tabTable tr", but this gave me 11 results per page: it included the column-header row as well as the 10 components. To drop the header I had to look at the <td> tags inside each row matched by the selector above; if a row contains 4 of them, we assume it is a product row and place the cells into an array. The only problem is that, because I use a very nasty split to extract the prices, we don't get the exact price: I split at "," and keep only the part before the decimal. But for this example I guess that is good enough.
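Roughly, the extraction looks like this; which cell holds the price is an assumption (the last one here), while the selector, the 4-cell check and the comma split all follow the steps above:

```ruby
# Collect the product rows from one category page. The header row that
# the selector also matches has a different number of <td> cells, so
# rows with exactly four cells are assumed to be products.
components = []
doc.css("table .tabTable tr").each do |row|
  cells = row.css("td")
  next unless cells.size == 4

  component = cells.map { |td| td.text.strip }
  # The nasty split mentioned above: keep only the part of the
  # price before the decimal comma.
  component[-1] = component[-1].split(",").first
  components << component
end
```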
At this point we can handle the extracted data however we want; we already have it in a container. As the complete script will reveal, I wrote it to a simple text file. Now all we need is a complete scrape_index method, which loops through all the categories and fills them with the wanted information.
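A sketch of how scrape_index could tie the pieces together; category_url and extract_components are hypothetical helpers standing in for the URL lookup and the row extraction shown earlier, and the &page= parameter is a guess at the site's pagination scheme:

```ruby
# Visit every category, walk all of its result pages, collect the
# components and finally dump everything to a text file.
def scrape_index(categories)
  categories.each_key do |name|
    url = category_url(name)                 # hypothetical lookup helper
    doc = Nokogiri::HTML(open(url))
    1.upto(page_count(doc)) do |page|
      page_doc = Nokogiri::HTML(open("#{url}&page=#{page}"))
      categories[name].concat(extract_components(page_doc))
    end
  end

  File.open("components.txt", "w") do |file|
    categories.each do |name, components|
      file.puts name
      components.each { |c| file.puts c.join(" | ") }
    end
  end
end
```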
With that, we have completed the scraper: it extracts the wanted information and stores it in a simple way.
This covers the basics of web scraping; the script can be found on my drive.