How to get all product information and images from a site

mikezang

I want to get all the product information along with all the images. Is there a simple way to do this with bash on a Mac? I want the data in the following form:

"product": "8020"
"simage": "/uploadfile/201281616171259157_.GIF"
"image": "/uploadfile/201281616171259157.GIF"
"name": "Taipei 101"

"product": "8019"
"simage": "/uploadfile/201432010288118198_.jpg"
"image": "/uploadfile/201432010288118198.jpg"
"name": "TianTan"

The script below doesn't work. I also need the product number, the name, and so on, which are not in the src attribute:

baseurl=$(echo $url | egrep -o "https?://[a-z.]+")

curl --silent $url | egrep -o "src=[^>]*(\.jpg|\.gif|\.png)" | sed 's/src=\"\(.*\)/\1/g' > /tmp/$$.list
sed -i "s|^/|$baseurl/|" /tmp/$$.list

while read filename;
do
    curl -s -O "$baseurl/$filename"
done < /tmp/$$.list

The contents of product.asp?cxsort=10001 on the site:

....
<ul id="small" >
    <li><a href="product.asp?cxsort=10001">Military1</a></li>
    <li><a href="product.asp?cxsort=10021">Military2</a></li>
    <li><a href="product.asp?cxsort=10101">Military3</a></li>
....
</ul>

....
<table cellpadding="0" cellspacing="0">
    <tr>
        <td>Product:8020</td>
        <td><div class="set"><img  src="/uploadfile/201281616171259157_.GIF" width="94" height="69"  style="display:block" class="/uploadfile/201281616171259157.GIF" alt="TianTan" /></div></td>
    </tr>
</table>
....
<table cellpadding="0" cellspacing="0">
    <tr>
        <td>Product:8019</td>
        <td><div class="Set"><img  src="/uploadfile/201432010288118198_.jpg" width="94" height="69"  style="display:block" class="/uploadfile/201432010288118198.jpg" alt="Taipei 101" /></div></td>
    </tr>
</table>
....
SLePort

You can try this:

sed -n '
/Product/ {
        s/[[:space:]]*<[^>]*>//g
        s/Product:\([0-9]*\)/"product": "\1"/p
        n
        s/.*img  *src="\([^"]*\)".*class="\([^"]*\).*alt="\([^"]*\).*/"simage": "\1"\
"image": "\2"\
"name": "\3"\
/p
}
' file.html

It works with your example and should work on your HTML as long as the markup for each product and its image always has the same structure.

That said, a web-scraping library such as BeautifulSoup in Python would be a better choice.

BeautifulSoup code in Python looks like this:

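from bs4 import BeautifulSoup

# Parse the saved page; naming the parser explicitly avoids a bs4 warning
with open('file.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Print the alt text and thumbnail path of every <img>
for img in soup.find_all('img'):
    print('%s : %s' % (img.get('alt'), img.get('src')))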
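
If you need all four fields per product rather than just alt and src, the following is a minimal sketch of one way to do it. It deliberately uses Python's standard-library html.parser instead of BeautifulSoup so that it runs without installing anything. It assumes every product cell contains the text "Product:<number>" and is followed by exactly one img tag whose src, class and alt attributes hold the small image, the full image and the name, as in the question's markup. ProductParser and extract_products are illustrative names, not part of any library, and SAMPLE_HTML is just the two tables copied from the question. The names printed come from the alt attributes in that markup.

```python
from html.parser import HTMLParser

# Sample markup copied from the question (only the two product tables).
SAMPLE_HTML = """
<table cellpadding="0" cellspacing="0">
    <tr>
        <td>Product:8020</td>
        <td><div class="set"><img  src="/uploadfile/201281616171259157_.GIF" width="94" height="69"  style="display:block" class="/uploadfile/201281616171259157.GIF" alt="TianTan" /></div></td>
    </tr>
</table>
<table cellpadding="0" cellspacing="0">
    <tr>
        <td>Product:8019</td>
        <td><div class="Set"><img  src="/uploadfile/201432010288118198_.jpg" width="94" height="69"  style="display:block" class="/uploadfile/201432010288118198.jpg" alt="Taipei 101" /></div></td>
    </tr>
</table>
"""


class ProductParser(HTMLParser):
    """Hypothetical helper: pairs each 'Product:<id>' text with the next <img>."""

    def __init__(self):
        super().__init__()
        self.records = []     # finished records, in page order
        self._product = None  # most recent product id still waiting for its <img>

    def handle_data(self, data):
        text = data.strip()
        if text.startswith("Product:"):
            self._product = text[len("Product:"):].strip()

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags also arrive here by default.
        if tag != "img" or self._product is None:
            return
        attributes = dict(attrs)
        self.records.append({
            "product": self._product,
            "simage": attributes.get("src"),   # thumbnail path
            "image": attributes.get("class"),  # the page stores the full image path in class
            "name": attributes.get("alt"),
        })
        self._product = None


def extract_products(html):
    """Return a list of dicts with product, simage, image and name keys."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    # Print each record in the layout asked for in the question.
    for record in extract_products(SAMPLE_HTML):
        for key, value in record.items():
            print('"%s": "%s"' % (key, value))
        print()
```

To run it against the real page, read the saved file (or the output of curl) into a string and pass that to extract_products instead of SAMPLE_HTML. Images outside a product cell are ignored because no product id is pending when they are seen.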
