A New Way to Scrape

I’m a big fan of Ruby Mechanize and Ruby’s built-in regular expressions: together they make screen scraping much easier than it could be. Even so, I’m pleasantly surprised to see an innovative approach taken by a project called ARIEL – A Ruby Information Extraction Language.

According to the project home page:

Ariel is a library that allows you to extract information from semi-structured documents (such as websites). It is different to existing tools because rather than expecting the developer to write rules to extract the desired information, Ariel will use a small number of labeled examples to generate and learn effective extraction rules.

That sure beats developing, testing, and re-testing complex regular expressions!
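
For contrast, here is a minimal sketch of the hand-written approach that Ariel aims to replace. Everything in it is an assumption made for illustration: the HTML fragment, its class names, and the `scrape` helper are hypothetical and have nothing to do with Ariel’s own API. It pulls a title, a body, and a list of comments out of a page using one regular expression per field.

```ruby
# Hypothetical page fragment; the markup and class names are invented for this sketch.
HTML = <<~PAGE
  <h1 class="title">Scraping with Ruby</h1>
  <div class="post">Regular expressions get brittle fast.</div>
  <div class="comment"><span class="author">alice</span><p>Nice post.</p></div>
  <div class="comment"><span class="author">bob</span><p>Agreed!</p></div>
PAGE

# One hand-written rule per field. Each rule is tied to the exact markup above.
TITLE_RULE   = %r{<h1 class="title">(.*?)</h1>}m
BODY_RULE    = %r{<div class="post">(.*?)</div>}m
COMMENT_RULE = %r{<div class="comment"><span class="author">(.*?)</span><p>(.*?)</p></div>}m

# Hypothetical helper: applies the rules and returns a nested hash.
def scrape(html)
  {
    title: html[TITLE_RULE, 1],   # first capture group, or nil if no match
    body: html[BODY_RULE, 1],
    comments: html.scan(COMMENT_RULE).map do |author, body|
      { author: author, body: body }
    end
  }
end

result = scrape(HTML)
```

Every rule here has to be written, tested, and then revisited whenever the site’s markup shifts, which is exactly the chore that learning rules from labeled examples is meant to remove.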

The process involves… 1) Define the structure of the data you’d like Ariel to extract (example below):

structure = Ariel::Node::Structure.new do |r|
  r.item :title
  r.item :body
  r.list :comments do |c|
    c.list_item :comment do |d|
      d.item :author
      d.item :body
    end
  end
end

2) Find some example data and label each example with tags that mark each field (the title, the body, and so on) in the relevant places.

3) Feed Ariel the examples:

Ariel.learn structure, labeled_file1, labeled_file2, labeled_file3

4) Begin extracting your data!

extractions = Ariel.extract structure, unlabeled_file1, unlabeled_file2

The full article is here

