我正在尝试使用 Python 从维基百科抓取数据。代码的目的是访问 S&P 500 成分公司的表格,并把每家公司的数据各提取到一个 CSV 文件中。其中一部分数据已经成功获取,但我遇到了一个套接字相关的异常,感觉有点难以理解。下面是我的完整代码:
import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests
def save_sp500_tickers():
    """Scrape the S&P 500 constituent table from Wikipedia and pickle the tickers.

    Returns:
        list of ticker symbol strings, one per company. The list is also
        written to ``sp500tickers.pickle`` as a side effect.
    """
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        # .text keeps the table cell's trailing newline ("MMM\n"); strip it so
        # the symbol is usable both as a filename and as a Yahoo query symbol.
        ticker = row.findAll('td')[0].text.strip()
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers
#save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    """Download daily Yahoo time series for each S&P 500 ticker into stock_dfs/.

    Args:
        reload_sp500: when True, re-scrape the ticker list from Wikipedia;
            otherwise reuse the pickled list (scraping it first if the pickle
            does not exist yet — the original crashed with FileNotFoundError
            in that case).
    """
    if reload_sp500 or not os.path.exists("sp500tickers.pickle"):
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')
    start = dt.datetime(2000, 1, 1)
    end = dt.datetime(2016, 12, 31)
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            try:
                df = web.DataReader(ticker, "yahoo", start, end)
                df.to_csv('stock_dfs/{}.csv'.format(ticker))
            except Exception as e:
                # Yahoo data is not guaranteed to exist for every symbol in this
                # date range; skip the ticker instead of aborting the whole run
                # (a single RemoteDataError previously killed the loop).
                print('Could not fetch {}: {}'.format(ticker, e))
        else:
            print('Already have {}'.format(ticker))
get_data_from_yahoo()
我得到如下异常
Traceback (most recent call last):
File "C:\Users\Jeet Chatterjee\Data Analysis With Python for finance\op6.py", line 49, in <module>
get_data_from_yahoo()
File "C:\Users\Jeet Chatterjee\Data Analysis With Python for finance\op6.py", line 44, in get_data_from_yahoo
df = web.DataReader(ticker, "yahoo", start, end)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\data.py", line 121, in DataReader
session=session).read()
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\yahoo\daily.py", line 115, in read
df = super(YahooDailyReader, self).read()
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\base.py", line 181, in read
params=self._get_params(self.symbols))
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\base.py", line 79, in _read_one_data
out = self._read_url_as_StringIO(url, params=params)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\base.py", line 90, in _read_url_as_StringIO
response = self._get_response(url, params=params)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\base.py", line 139, in _get_response
raise RemoteDataError('Unable to read URL: {0}'.format(url))
pandas_datareader._utils.RemoteDataError: Unable to read URL: https://query1.finance.yahoo.com/v7/finance/download/AGN?period1=946665000&period2=1483208999&interval=1d&events=history&crumb=6JtBOAj%5Cu002F6EP
请帮我解决这个问题,提前致谢
您的代码本身没有太大问题。一个原因是 Yahoo 的时间序列数据并不保证 100% 可用,它确实会时有时无。我刚查看了雅虎网站:虽然对您报错的 Allergan (AGN) 现在看起来没有问题,但我尝试时 Brown Forman (BF.B) 和 Berkshire Hathaway B (BRK.B) 是不可用的。
另一个问题是,您不能假设标准普尔 500 指数上的每个交易品种都有您硬编码的范围内的时间序列数据;有些只存在于 2017 年。
以下是代码的略微修改版本,它尽最大努力获取所有符号,请求从 2000 年 1 月 1 日到当天的数据,如果雅虎没有可用数据,则放弃。
在撰写本文时,它能够获取标准普尔 500 指数当前 505 个品种中的 503 个的时间序列。注意我使用了代理服务器,您可以删除或注释掉这部分代码。
import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests
# proxy servers for internet connection
# NOTE(review): placeholder host — replace with your real proxy, or remove the
# proxy usages entirely if you are on a direct connection.
proxies = {
    'http': 'http://my.proxy.server:8080',
    'https': 'https://my.proxy.server:8080',
}
# pickle file caching the scraped S&P 500 ticker list between runs
symbol_filename = "sp500tickers.pickle"
def save_sp500_tickers():
    """Scrape the S&P 500 constituent table from Wikipedia and pickle the tickers.

    Returns:
        list of ticker symbol strings in Yahoo Finance notation. The list is
        also written to ``symbol_filename`` as a side effect.
    """
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies', proxies=proxies)
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        # .text includes the cell's trailing newline — strip it. Yahoo also uses
        # '-' where Wikipedia uses '.' (e.g. BRK.B -> BRK-B, BF.B -> BF-B); that
        # mismatch is exactly why those two symbols fail to download.
        ticker = row.findAll('td')[0].text.strip().replace('.', '-')
        tickers.append(ticker)
    with open(symbol_filename, "wb") as f:
        pickle.dump(tickers, f)
    return tickers
def get_data_from_yahoo(reload_sp500=False):
    """Best-effort download of daily Yahoo data for every S&P 500 ticker.

    Writes one CSV per ticker into stock_dfs/, skipping tickers already on
    disk and tickers for which Yahoo has no time series.

    Args:
        reload_sp500: when True, re-scrape the ticker list from Wikipedia;
            otherwise reuse the pickled list if it exists.
    """
    if reload_sp500 or not os.path.exists(symbol_filename):
        tickers = save_sp500_tickers()
    else:
        with open(symbol_filename, "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')
    start = dt.datetime(2000, 1, 1)
    # Request through "today"; symbols listed after 2000 simply return a
    # shorter series rather than failing.
    end = dt.datetime(dt.date.today().year, dt.date.today().month, dt.date.today().day)
    for ticker in tickers:
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            try:
                # print is a function in Python 3 — the original `print ticker`
                # is a SyntaxError on the asker's Python 3.6.
                print(ticker)
                df = web.DataReader(ticker, "yahoo", start, end)
                df.to_csv('stock_dfs/{}.csv'.format(ticker))
            except Exception:
                # Narrowed from a bare `except:` (which would also swallow
                # KeyboardInterrupt); Yahoo data comes and goes, so skip and
                # keep going.
                print("No timeseries available for " + ticker)
        else:
            pass  # print('Already have {}'.format(ticker))

# Route pandas_datareader's HTTP calls through the proxy as well.
os.environ["HTTP_PROXY"] = proxies['http']
os.environ["HTTPS_PROXY"] = proxies['https']
get_data_from_yahoo()
希望这是有帮助的。
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句