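I have been developing a Python script to download a CSV file from a web server. My usual approach is to right-click the web page, choose "Inspect Element" (Chrome), switch to the Network view, and then click the link to watch the traffic. I expected to see something like "https://domain.com/file_i_need.csv", but what I got instead was the location of a Perl script. Since I don't fully understand how that works, I simply copied the curl command (right-click the relevant network request, then click "Copy as cURL"). So at first I just passed the curl command to os.system(). Once I had that working, I tried modifying the script to use pycurl. Now I would like to change it to use the requests library (mostly for the elegance/simplicity). An answer for that would be welcome, but I would also like to know whether there is another way to do this, since the back end is slightly different from what I expected. I have seen urllib.urlretrieve() recommended as an alternative, but I suspect it won't work here.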
Question: How do I download a file from a web server when the URL that generates the file is a Perl script?
i.e. https://domain.com/file_maker.pl?param1=12345
The curl command: curl "https://release.domain.com/release_cr_new.pl?releaseid=26851&v=2&m=a&dump_csv=1" -H "Accept-Encoding: gzip,deflate,sdch" -H "Host: release.domain.com" -H "Accept-Language: en-US,en;q=0.8" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" -H "Referer: https://release.domain.com/release_cr_new.html?releaseid=26851&v=2&m=a" -H "Cookie: releasegroup=Development; XR77=3q3pzeMQc1gf-jDlpNtkgr4WvZYqxVZSYzeQHfGAwMTAeZQ6D3g2e6w; __utma=147924903.423899313.1373397746.1378841205.1380290587.15; __utmc=147924903; __utmz=147924903.1380290587.15.14.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); pubcookie_s_release.domain.com=Hm17WT1VJbPpBLOQ+NhtyBbZlfO9qntsoGP0P8BEVeh4d0ay+THE3EkNLc6PV5rJ40Ui7uj/+c6f2tzZYWOJ/j+dyoP5l+J//rL875K9ERxio1FZeiUVRQgeabetZ+V1AWlrkjURmAw2SU1hEz/f2pCt0sHe06C14vWA95PFu1Smp6viWOL8QnaPHFWhGU3uQQH5Wxex0CziHbrYXHuKwnxwWejvVtTM8e8aIHkM2WuB3IIDhGMVtd0r292owvcv6Rvcl7tYSoQaQYfSpPZreXo4tNO9gh9ZIGqao8LaCfG5Fw8+Ow5wQKf2ryVuPc8Ah4MTIzC1UeZxBtxSTyZk5E1in7LCV9E+d/5G84U+ECcdn166gJg1iMG68II81YJO9fYs91gGtA5iUa6h3RpFo+ysBkqbHjCpetOUxfHh47sdr4nUoIWEb0LfKVTYfvmW6BNGx4m90PqE8aQlknv7zxqAQrujqe7h5zSpmaD5UjrfRwp7lYD+6e88vgQzLgWlcAA=; _session_id=eb0095f849a509c3cf65b43680b3002a; default_column_2=bugid%2Cloginname%2Ccomponent%2Cversionvalue%2Cbugdate%2Cshortdescription%2Cpriority%2Cstatus%2Cqacontact%2Csqa_status%2Cis_dep" -H "Connection: keep-alive"
Sorry for the large amount of text.
If you want to stream the data from the server:
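# UNTESTED
import csv
import requests
# Connect to the web server.
response = requests.get("https://domain.com/file_maker.pl?param1=12345", stream=True)
response.raise_for_status()
# csv.reader needs text, so decode each line of the byte stream (UTF-8 is assumed).
lines = (line.decode("utf-8") for line in response.iter_lines())
# Read the data as CSV
data = csv.reader(lines)
# Use the data
for row in data:
    print(row)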
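Or, if you want to download the file from the server and store it locally:
# UNTESTED
import requests
# Connect to the web server.
response = requests.get("https://domain.com/file_maker.pl?param1=12345")
response.raise_for_status()
# Store the data; response.content is bytes, so open the file in binary mode.
with open('outfile', 'wb') as outfile:
    outfile.write(response.content)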
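If the generated CSV is large, response.content holds the whole body in memory before anything is written. A minimal sketch of writing the body in chunks instead, assuming the same placeholder URL as above; the helper names write_chunks and download_csv and the 8192-byte chunk size are illustrative choices, not part of the original answer:

```python
# UNTESTED against the real server; the URL below is the placeholder used above.
def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    total = 0
    with open(path, "wb") as outfile:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                outfile.write(chunk)
                total += len(chunk)
    return total


def download_csv(url, path, chunk_size=8192):
    """Stream url to path without loading the whole body into memory."""
    import requests  # imported here so write_chunks works without requests

    with requests.get(url, stream=True) as response:
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        return write_chunks(response.iter_content(chunk_size=chunk_size), path)


# Example call (hypothetical URL, not executed here):
# download_csv("https://domain.com/file_maker.pl?param1=12345", "outfile.csv")
```

Keeping the file-writing step separate from the HTTP call makes it easy to check the writer with fake data before pointing it at the real server.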
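In your particular case, the CGI script needs some specific headers or cookies before it returns the correct data. I don't know which header or cookie it requires, so just send all of them: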
import requests
url = "https://release.domain.com/release_cr_new.pl?releaseid=26851&v=2&m=a&dump_csv=1"
headers = {
    "Accept-Encoding" : "gzip,deflate,sdch",
    "Accept-Language" : "en-US,en;q=0.8",
    "User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36",
    "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer" : "https://release.domain.com/release_cr_new.html?releaseid=26851&v=2&m=a",
    "Cookie" : "releasegroup=Development; XR77=3q3pzeMQc1gf-jDlpNtkgr4WvZYqxVZSYzeQHfGAwMTAeZQ6D3g2e6w; __utma=147924903.423899313.1373397746.1378841205.1380290587.15; __utmc=147924903; __utmz=147924903.1380290587.15.14.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); pubcookie_s_release.domain.com=Hm17WT1VJbPpBLOQ+NhtyBbZlfO9qntsoGP0P8BEVeh4d0ay+THE3EkNLc6PV5rJ40Ui7uj/+c6f2tzZYWOJ/j+dyoP5l+J//rL875K9ERxio1FZeiUVRQgeabetZ+V1AWlrkjURmAw2SU1hEz/f2pCt0sHe06C14vWA95PFu1Smp6viWOL8QnaPHFWhGU3uQQH5Wxex0CziHbrYXHuKwnxwWejvVtTM8e8aIHkM2WuB3IIDhGMVtd0r292owvcv6Rvcl7tYSoQaQYfSpPZreXo4tNO9gh9ZIGqao8LaCfG5Fw8+Ow5wQKf2ryVuPc8Ah4MTIzC1UeZxBtxSTyZk5E1in7LCV9E+d/5G84U+ECcdn166gJg1iMG68II81YJO9fYs91gGtA5iUa6h3RpFo+ysBkqbHjCpetOUxfHh47sdr4nUoIWEb0LfKVTYfvmW6BNGx4m90PqE8aQlknv7zxqAQrujqe7h5zSpmaD5UjrfRwp7lYD+6e88vgQzLgWlcAA=; _session_id=eb0095f849a509c3cf65b43680b3002a; default_column_2=bugid%2Cloginname%2Ccomponent%2Cversionvalue%2Cbugdate%2Cshortdescription%2Cpriority%2Cstatus%2Cqacontact%2Csqa_status%2Cis_dep"
}
response = requests.get(url, headers=headers)
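Since it is unclear which cookie the script actually checks, it may help to split the copied Cookie string into individual name/value pairs, pass them through the cookies argument of requests.get, and then remove entries one at a time to see which ones are required. A minimal sketch; parse_cookie_header is a hypothetical helper, and the shortened cookie string is a placeholder standing in for the full value above:

```python
def parse_cookie_header(header):
    """Split a raw Cookie header value into a {name: value} dict."""
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first "=" only: values may contain "=" themselves.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Shortened stand-in for the copied Cookie string (values are placeholders).
raw = "releasegroup=Development; _session_id=abc123; token=QUJD="
cookies = parse_cookie_header(raw)
print(cookies)

# The dict can then take the place of the "Cookie" entry in headers, e.g.:
# response = requests.get(url, headers=headers, cookies=cookies)
```

With the cookies in a dict, deleting one key per attempt is a simple way to narrow down which values the server really depends on.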