A New Way to Scrape
I’m a big fan of Ruby Mechanize and Ruby’s built-in regular expressions; together they make screen scraping much easier than it could be. Even so, I’m pleasantly surprised to see an innovative approach taken by a project called ARIEL (A Ruby Information Extraction Language).
According to the project home page:
Ariel is a library that allows you to extract information from semi-structured documents (such as websites). It is different to existing tools because rather than expecting the developer to write rules to extract the desired information, Ariel will use a small number of labeled examples to generate and learn effective extraction rules.
That sure beats developing, testing, and re-testing complex regular expressions!
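For contrast, here is a minimal sketch of the hand-written approach that Ariel aims to replace. The sample markup, the field names, and the `extract_fields` helper are all made up for illustration; nothing here comes from Ariel itself.

```ruby
# Hand-written extraction rules: one regular expression per field.
# The markup these patterns expect is hypothetical.
def extract_fields(html)
  title_rule = %r{<h1 class="title">(.*?)</h1>}m
  body_rule  = %r{<div class="body">(.*?)</div>}m

  # String#[] with a regexp and a capture index returns the captured
  # text, or nil when the pattern does not match.
  { title: html[title_rule, 1], body: html[body_rule, 1] }
end

page = '<h1 class="title">A New Way to Scrape</h1>' \
       '<div class="body">Ariel learns rules from examples.</div>'

fields = extract_fields(page)
puts fields[:title]
puts fields[:body]
```

Every pattern like these has to be revisited whenever the site changes its markup, which is the maintenance burden the learned-rules approach is trying to avoid.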
The process involves… 1) Defining the structure of the data you’d like it to extract (example below):
    structure = Ariel::Node::Structure.new do |r|
      r.item :title
      r.item :body
      r.list :comments do |c|
        c.list_item :comment do |d|
          d.item :author
          d.item :body
        end
      end
    end
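To make the nested-block style above a little less magical, here is a rough sketch of how a block-based DSL like this could build a tree of fields. `SketchNode` and its `outline` method are hypothetical names of my own; this is not Ariel’s implementation, only an illustration of the pattern.

```ruby
# Hypothetical stand-in for a structure node; NOT Ariel's own class.
class SketchNode
  attr_reader :name, :kind, :children

  def initialize(name = :root, kind = :root, &block)
    @name = name
    @kind = kind
    @children = []
    block.call(self) if block
  end

  # Each DSL method adds a child node and, when a block is given,
  # yields that child so nested fields can be declared on it.
  def item(name, &block)
    add(name, :item, &block)
  end

  def list(name, &block)
    add(name, :list, &block)
  end

  def list_item(name, &block)
    add(name, :list_item, &block)
  end

  # Flatten the tree into indented lines to show its shape.
  def outline(depth = 0)
    lines = ["#{'  ' * depth}#{name} (#{kind})"]
    children.each { |child| lines.concat(child.outline(depth + 1)) }
    lines
  end

  private

  def add(name, kind, &block)
    child = SketchNode.new(name, kind, &block)
    @children << child
    child
  end
end

sketch = SketchNode.new do |r|
  r.item :title
  r.item :body
  r.list :comments do |c|
    c.list_item :comment do |d|
      d.item :author
      d.item :body
    end
  end
end

puts sketch.outline
```

The point is simply that each nested block describes one level of the document: a post has a title, a body, and a list of comments, and each comment in turn has an author and a body.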
2) Find some example data and label each example with tags.
3) Have Ariel learn extraction rules from the labeled examples:
    Ariel.learn structure, labeled_file1, labeled_file2, labeled_file3
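To give a feel for what “learning a rule from labeled examples” can mean, here is a toy sketch. It assumes a made-up `[[ ]]` label syntax and learns nothing more than the text that consistently appears before and after the labeled value. All method names are hypothetical, and Ariel’s real label format and learning algorithm may work quite differently.

```ruby
# Toy rule learner. The [[ ]] markers are a hypothetical label syntax,
# and each example is assumed to contain exactly one labeled value.

# Split a labeled example into the text before, inside, and after the label.
def split_labeled(example)
  before, rest = example.split("[[", 2)
  value, after = rest.split("]]", 2)
  [before, value, after]
end

# Longest string that every input starts with (inputs must be non-empty).
def common_prefix(strings)
  first = strings.first
  length = 0
  length += 1 while length < first.length &&
                    strings.all? { |s| s[length] == first[length] }
  first[0, length]
end

# Longest string that every input ends with.
def common_suffix(strings)
  common_prefix(strings.map(&:reverse)).reverse
end

# "Learn" a rule: the landmarks that surround the value in every example.
def learn_rule(labeled_examples)
  parts = labeled_examples.map { |example| split_labeled(example) }
  {
    start_landmark: common_suffix(parts.map(&:first)),
    end_landmark:   common_prefix(parts.map(&:last))
  }
end

# Apply a learned rule to an unlabeled document; nil when it does not fit.
def apply_rule(rule, document)
  start_at = document.index(rule[:start_landmark])
  return nil unless start_at

  value_start = start_at + rule[:start_landmark].length
  value_end = document.index(rule[:end_landmark], value_start)
  return nil unless value_end

  document[value_start...value_end]
end

labeled = [
  "<html><b>Title:</b> [[First post]]<br>Posted by Ann",
  "<body><b>Title:</b> [[Second post]]<br>Posted by Bob"
]
rule = learn_rule(labeled)
puts apply_rule(rule, "<div><b>Title:</b> Third post<br>Posted by Cy")
```

Two examples are enough here only because the sample pages are so regular; a real learner has to cope with far messier variation, which is presumably where a library earns its keep.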
4) Begin extracting your data!
    extractions = Ariel.extract structure, unlabeled_file1, unlabeled_file2
The full article is here
About this entry
You’re currently reading “A New Way to Scrape,” an entry on VotanWeb
- Published: September 11th 09:01 PM
- Updated: September 13th 12:23 PM
- Sections: Ruby

