I am using Nokogiri to scrape a site that looks like this:
<div class="BOX">
<div class="apple">This is an apple.</div>
<p>Apple a day, doctor away</p>
</div>
<div class="BOX">
<div class="iphone">This is an iPhone.</div>
<div class="android">This is an Android.</div>
<a href="www.apple.com">Apple home page</a>
<p>Snoop Lion has both. He's rich.</p>
</div>
I would like to scrape everything within the "BOX" div. Each "BOX" has its own unique divs and HTML tags, with no apparent patterns. How would I do this?
My first attempt looked like this:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://www.examplesite.com'))
doc.css('BOX').each do |box|
  puts box.content
end
But it returns nothing. May I please have an explanation of what's going on?
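Your CSS selector is wrong: 'BOX' is a type selector, so it looks for <BOX> elements, which the page does not have. That is why nothing is returned. To match on the class attribute, write '.BOX'. Also, if you want the markup inside each div rather than just its text, use #inner_html instead of #content. The code should look like this: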
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eot
<div class="BOX">
<div class="apple">This is an apple.</div>
<p>Apple a day, doctor away</p>
</div>
<div class="BOX">
<div class="iphone">This is an iPhone.</div>
<div class="android">This is an Android.</div>
<a href="www.apple.com">Apple home page</a>
<p>Snoop Lion has both. He's rich.</p>
</div>
eot
doc.css('.BOX').each do |n|
  puts n.inner_html
end
output:
<div class="apple">This is an apple.</div>
<p>Apple a day, doctor away</p>
<div class="iphone">This is an iPhone.</div>
<div class="android">This is an Android.</div>
<a href="www.apple.com">Apple home page</a>
<p>Snoop Lion has both. He's rich.</p>
#content gives you all the text inside each div node, with the HTML tags stripped. See below:
doc.css('.BOX').each do |n|
  puts n.content
end
output:
This is an apple.
Apple a day, doctor away
This is an iPhone.
This is an Android.
Apple home page
Snoop Lion has both. He's rich.