Retrieve useful info from a webpage using Jsoup

prashantitis

How can I retrieve the "Contact us" link from the footer of any webpage on the World Wide Web, in Java?

E.g. find the footer element, or an element with id="footer" or with a footer class?

I tried retrieving all the links from the webpage using Jsoup and then running the regex .*contact.* against them. But I cannot be 100% sure that the link fetched this way is the website's contact us page.

Q2

Is there a more robust approach, or could I combine the footer link with the approach I already have to conclude that a page is certainly a contact us page?

Stephan

But I cannot be 100% sure that the link fetched...

SHORT ANSWER

You will NEVER be sure.


LONG ANSWER

For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human. It represents a big challenge for a computer.

I can see some options in your case:

Option 1: Crowdsourcing

  • Collect the URLs of all the websites whose "Contact Us" information you want
  • Send them to a crowdsourcing platform and ask real people to find the information for you (Rapidworkers.com, Crowdsource.com, Clickworker.com, Amazon Mechanical Turk, microworkers.com)

Check whether the platform offers an API.

+ Work done by humans
+ Adapts dynamically to unknown patterns
- Costs money
- We humans suck at repetitive tasks

Option 2: AI (pattern searching)

  • Train an AI to extract the information
  • Then throw your websites at it

Have a look at Weka or Java-ML, for instance.

+ Automated task
+ Can perform a repetitive task for a long time
- May take time to build a robust solution
- Risk of false positives or complete misses

Option 3: Use Jsoup

  • Carefully study the patterns of the websites you target
  • Tell Jsoup to find the patterns you have detected

This option is a never-ending task. You will always have to feed Jsoup new patterns. I suggest setting up a monitoring system that tells you when a website escapes every known pattern.

+ Automated task
+ Can perform a repetitive task for a long time
- Takes time to study, discover, and add new patterns
- Risk of false positives or complete misses
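
As a rough illustration of Option 3, here is a minimal sketch of one such pattern: look for a link mentioning "contact" inside a footer-like container first, and fall back to the whole page otherwise. The class name ContactLinkFinder, the method findContactLink, and the two selector strings are assumptions made up for this example. They are one possible pattern, not a rule that holds for every website.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Optional;

public class ContactLinkFinder {

    // Assumed footer patterns: a <footer> tag, id="footer", or class "footer".
    private static final String FOOTER_SELECTOR = "footer, #footer, .footer";

    // Links whose href matches "contact" (case-insensitive regex)
    // or whose text contains "contact" (:contains is case-insensitive).
    private static final String CONTACT_SELECTOR =
            "a[href~=(?i)contact], a:contains(contact)";

    /** Returns the absolute URL of the best candidate, or empty if no pattern matched. */
    public static Optional<String> findContactLink(Document doc) {
        // Prefer candidates located inside a footer-like container.
        Element link = doc.select(FOOTER_SELECTOR).select(CONTACT_SELECTOR).first();
        if (link == null) {
            // Fall back to any candidate on the page.
            link = doc.select(CONTACT_SELECTOR).first();
        }
        return Optional.ofNullable(link)
                .map(a -> a.attr("abs:href")) // resolved against the document's base URI
                .filter(url -> !url.isEmpty());
    }

    public static void main(String[] args) {
        String html = "<a href='/about'>About</a>"
                + "<div id='footer'><a href='/contact-us'>Contact Us</a></div>";
        Document doc = Jsoup.parse(html, "https://example.com/");
        System.out.println(findContactLink(doc).orElse("no known pattern matched"));
    }
}
```

For a live page, the document would come from Jsoup.connect(url).get() instead of Jsoup.parse. An empty result is the case worth logging: it is the signal that a website has escaped the patterns you know so far.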

Option 4: A mix of the three options above

You can have all three options working on the websites you target.

+ Reduces the chance of false positives or complete misses
+ More confident final result
- Takes time to study, discover, and add new patterns
- Costs money
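
One way to picture the mix is to turn each signal into points and send low-scoring links to a human. The sketch below is only an assumption about how that could look: the class ContactLinkScorer, the weights, and the threshold are invented for illustration and would need tuning against websites you have checked by hand.

```java
import java.util.Locale;

public class ContactLinkScorer {

    // Hypothetical cut-off: anything below it goes to manual (crowd) review.
    public static final int REVIEW_THRESHOLD = 4;

    /** Adds up simple signals; the weights are illustrative assumptions. */
    public static int score(String href, String linkText, boolean inFooter) {
        String h = href == null ? "" : href.toLowerCase(Locale.ROOT);
        String t = linkText == null ? "" : linkText.toLowerCase(Locale.ROOT);
        int score = 0;
        if (h.contains("contact")) score += 2; // URL mentions "contact"
        if (t.contains("contact")) score += 2; // visible text mentions "contact"
        if (inFooter) score += 1;              // link sits in a footer-like element
        return score;
    }

    /** True when the automated signals are too weak to trust on their own. */
    public static boolean needsHumanReview(int score) {
        return score < REVIEW_THRESHOLD;
    }

    public static void main(String[] args) {
        int s = score("/contact-us", "Contact Us", true);
        System.out.println(s + " -> review needed: " + needsHumanReview(s));
    }
}
```

A high score still does not prove that the page is a contact page. It only ranks the candidates so that the paid human checks are spent on the doubtful ones.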
