What is the best Ruby class design / pattern for this scenario?

Bala

I currently have these classes for scraping products from a single retailer website using Nokogiri. The XPath and CSS selectors for the site are stored in MySQL.

require 'active_record'
require 'nokogiri'
require 'open-uri'

ActiveRecord::Base.establish_connection(
  :adapter => "mysql2",
  ...
)

class Site < ActiveRecord::Base
  has_many :site_details

  # Collects product links from the listing page and stores one SiteDetail per link
  def create_product_links
    # url is the site's base URL, e.g. http://www.example.com
    page = Nokogiri::HTML(URI.open(url))
    page.xpath(total_products_path).each do |link|
      SiteDetail.find_or_create_by(url: url + "/" + link['href'], site_id: id)
    end
  end
end

class SiteDetail < ActiveRecord::Base
  belongs_to :site

  # Scrapes this product's page using the selectors stored on the parent Site
  def get_product_data
    # url is the product page URL stored by Site#create_product_links
    page = Nokogiri::HTML(URI.open(url))
    title = page.css(site.title_path).text
    price = page.css(site.price_path).text
    description = page.css(site.description_path).text
    update!(title: title, price: price, description: description)
  end
end

# Execution
@s = Site.first
# get_product_data is an instance method, so call it on each record
@s.site_details.each(&:get_product_data)

I will be adding more sites (around 700) in the future. Each site has a different page structure, so the get_product_data method cannot be used as is. I may have to use a case or if statement to jump to and execute the relevant code. With 700 retailers, this class will soon become chunky and ugly.

What is the best design approach suitable in this scenario?

Kevin

Like @James Woodward said, you're going to want to create a class for each retailer. The pattern I'm going to post has three parts:

  1. A couple of ActiveRecord classes that implement a common interface for storing the data you want to record from each site.
  2. 700 different classes, one for each site you want to scrape. These classes implement the algorithms for scraping the sites, but don't know how to store the information in the database. To do that, they rely on the common interface from step 1.
  3. One final class that ties it all together by running each of the scraping algorithms you wrote in step 2.

Step 1: ActiveRecord Interface

This step is pretty easy. You already have a Site and SiteDetail class. You can keep them for storing the data you scrape from websites in your database.

You told the Site and SiteDetail classes how to scrape data from websites. I would argue this is inappropriate. Now you've given the classes two responsibilities:

  1. Persist data in the database
  2. Scrape data from the websites

We'll create new classes to handle the scraping responsibility in the second step. For now, you can strip down the Site and SiteDetail classes so that they only act as database records:

class Site < ActiveRecord::Base
  has_many :site_details
end

class SiteDetail < ActiveRecord::Base
  belongs_to :site
end

Step 2: Implement Scrapers

Now, we'll create new classes that handle the scraping responsibility. If this were a language that supported abstract classes or interfaces like Java or C#, we would proceed like so:

  1. Create an IScraper or AbstractScraper interface that handles the tasks common to scraping a website.
  2. Implement a different FooScraper class for each of the sites you want to scrape, each one inheriting from AbstractScraper or implementing IScraper.

Ruby doesn't have abstract classes, though. What it does have is duck typing and module mix-ins. This means we'll use this very similar pattern:

  1. Create a SiteScraper module that handles the tasks common to scraping a website. This module will assume that the classes that include it have certain methods it can call.
  2. Implement a different FooScraper class for each of the sites you want to scrape, each one mixing in the SiteScraper module and implementing the methods the module expects.

It looks like this:

module SiteScraper
  # Assumes that classes including the module
  # have get_product_urls and get_product_details methods
  #
  # The get_product_urls method should return a list
  # of the URLs to visit to get scraped data
  #
  # The get_product_details method should take the URL of the
  # product to scrape as a string and return a SiteDetail with
  # data scraped from the given URL
  def get_data
    site = Site.new
    product_urls = get_product_urls

    product_urls.each do |product_url|
      site_detail = get_product_details(product_url)
      site_detail.site = site
      site_detail.save
    end
  end
end

class ExampleScraper
  include SiteScraper

  def get_product_urls
    page = Nokogiri::HTML(URI.open('http://www.example.com/products'))
    # Return the link targets as strings, since get_product_details expects a URL
    page.xpath('//products').map { |link| link['href'] }
  end

  def get_product_details(product_url)
    page = Nokogiri::HTML(URI.open(product_url))
    # These selectors are XPath expressions, so use xpath rather than css
    title = page.xpath('//title').text
    price = page.xpath('//price').text
    description = page.xpath('//description').text

    site_detail = SiteDetail.new
    site_detail.title = title
    site_detail.price = price
    site_detail.description = description
    site_detail
  end
end

class FooBarScraper
  include SiteScraper

  def get_product_urls
    page = Nokogiri::HTML(URI.open('http://www.foobar.com/foobars'))
    # Return the link targets as strings, since get_product_details expects a URL
    page.xpath('//foo/bar').map { |link| link['href'] }
  end

  def get_product_details(product_url)
    page = Nokogiri::HTML(URI.open(product_url))
    # These selectors are XPath expressions, so use xpath rather than css
    title = page.xpath('//foo').text
    price = page.xpath('//bar').text
    description = page.xpath('//foo/bar/iption').text

    site_detail = SiteDetail.new
    site_detail.title = title
    site_detail.price = price
    site_detail.description = description
    site_detail
  end
end

... and so on, creating a class that mixes in SiteScraper and implements get_product_urls and get_product_details for each one of the 700 websites you need to scrape. Unfortunately, this is the tedious part of the pattern: there's no real way to get around writing a different scraping algorithm for all 700 sites.
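
Since the module only assumes that including classes define certain methods, a forgotten method would otherwise surface as a generic NoMethodError. One option is to give the module default implementations that raise NotImplementedError with a clear message. The following is a minimal sketch, not part of the original answer: it uses no database or network access, and IncompleteScraper is a hypothetical class that deliberately leaves out get_product_details.

```ruby
module SiteScraper
  # Defaults for the methods every scraper is expected to provide.
  # A method defined in the including class takes precedence over
  # these, so a complete scraper never reaches them.
  def get_product_urls
    raise NotImplementedError, "#{self.class} must implement get_product_urls"
  end

  def get_product_details(_product_url)
    raise NotImplementedError, "#{self.class} must implement get_product_details"
  end
end

# Hypothetical scraper that forgot to define get_product_details.
class IncompleteScraper
  include SiteScraper

  def get_product_urls
    ['http://www.example.com/products/1']
  end
end

scraper = IncompleteScraper.new
scraper.get_product_urls # uses the class's own method

begin
  scraper.get_product_details('http://www.example.com/products/1')
rescue NotImplementedError => e
  puts e.message # prints "IncompleteScraper must implement get_product_details"
end
```

Note that NotImplementedError is not a StandardError, so a bare rescue will not catch it; it has to be named explicitly as above.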

Step 3: Run Each Scraper

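The final step is to create the cron job that scrapes the sites. The syntax below matches the whenever gem; assuming that gem and a Rails app, each call goes inside runner so that it executes when the job fires rather than when the schedule file is read.

every :day, at: '12:00am' do
  runner 'ExampleScraper.new.get_data'
  runner 'FooBarScraper.new.get_data'
  # + 698 more lines
end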
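
Listing 700 scrapers by hand is easy to get out of sync. One possible alternative is to let the module keep track of every class that includes it, using Ruby's included hook, and then run them all in a loop. This is a sketch under assumptions, not part of the original answer: scrapers, run_all, RUN_LOG and the three stand-in classes are hypothetical names, and get_data is stubbed here so the example runs without a database or network.

```ruby
module SiteScraper
  # Classes that have included the module, in the order they were loaded.
  def self.scrapers
    @scrapers ||= []
  end

  # Ruby calls this hook each time a class includes the module.
  def self.included(base)
    scrapers << base
  end

  # Runs every registered scraper; a failure in one is reported on
  # stderr and does not stop the others.
  def self.run_all
    scrapers.each do |scraper_class|
      begin
        scraper_class.new.get_data
      rescue StandardError => e
        warn "#{scraper_class} failed: #{e.message}"
      end
    end
  end
end

# Records which stand-in scrapers ran (for demonstration only).
RUN_LOG = []

class AlphaScraper
  include SiteScraper

  def get_data
    RUN_LOG << :alpha
  end
end

class BrokenScraper
  include SiteScraper

  def get_data
    raise 'page layout changed'
  end
end

class BetaScraper
  include SiteScraper

  def get_data
    RUN_LOG << :beta
  end
end

SiteScraper.run_all
p RUN_LOG # prints [:alpha, :beta]
```

The scheduled job can then call SiteScraper.run_all instead of naming each class. The hook only fires when a class is loaded, so all scraper files must be required before run_all is called.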
