I need to store URLs in a database. I don't want to store the same page twice, so I need to strip all the fluff off each URL.
# if I have
url_1 = "http://scientificamerican.com/royal-baby/?utm_campaign=promo"
# and
url_2 = "http://scientificamerican.com/royal-baby/?utm_source=email"
# then they should map to:
url_canonical = "http://scientificamerican.com/royal-baby/"
To get a single canonical URL regardless of what was appended to it, I tried stripping the query string. The problem is that some CMSs still use the query string to identify the page.
For example:
url_1 = "https://www.scientificamerican.com/article.cfm?id=obama-budget"
# strip the query string and it becomes
url_1 = "https://www.scientificamerican.com/article.cfm"
# which is obviously the same for all articles :(
This is obviously a problem that a number of people have had to solve, not least the search engines. How do you reduce a URL so that all that remains is the data that identifies the page?
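You can't. There is no way to know which query parameters are needed to tell one URL from another. There are clearly plenty of parameters you can remove deliberately (utm_campaign and the like), but not all of them.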
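As a rough sketch of the "remove the parameters you know are fluff" part, the snippet below drops query parameters that match a blocklist and keeps everything else, so a URL like article.cfm?id=obama-budget survives intact. The function name strip_tracking and the blocklist contents (the utm_ prefix plus gclid and fbclid) are assumptions for illustration; extend the list with whatever trackers actually show up in your data.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed blocklist, not an exhaustive or authoritative one.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"gclid", "fbclid"}


def strip_tracking(url):
    """Return url with known tracking parameters removed from the query string."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
        and name.lower() not in TRACKING_NAMES
    ]
    # Rebuild the URL with only the parameters that were not on the blocklist.
    return urlunsplit(parts._replace(query=urlencode(kept)))


url_1 = "http://scientificamerican.com/royal-baby/?utm_campaign=promo"
url_2 = "http://scientificamerican.com/royal-baby/?utm_source=email"
print(strip_tracking(url_1) == strip_tracking(url_2))
```

This only handles parameters you already know about; it cannot tell you whether an unfamiliar parameter matters to the page.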
Your best bet is to load the page's HTML and look for the canonical link element. If it is present, you have your canonical URL.
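A minimal sketch of the canonical-link lookup, using only the standard library. It assumes you have already downloaded the page's HTML as a string (for instance with urllib.request); the names CanonicalLinkParser and find_canonical are made up for this example. It returns None when the page declares no canonical link, in which case you would have to fall back to some other rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be missing entirely.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html, page_url):
    """Return the page's declared canonical URL, or None if it has none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    # The href may be relative, so resolve it against the URL that was fetched.
    return urljoin(page_url, parser.canonical)


html = '<html><head><link rel="canonical" href="/royal-baby/"></head></html>'
print(find_canonical(html, "http://scientificamerican.com/royal-baby/?utm_source=email"))
```

The trade-off is one HTTP request per URL before you can deduplicate, and you are trusting the site to declare its canonical URL correctly.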