I've made a few attempts to get my code to navigate to a web page, pull the data from a table into a dataframe, then move to the next page and do the same thing again. Below is some sample code I've tested. Now I'm stuck and not sure how to proceed.
# first attempt
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from time import sleep

lst = []
url = "https://www.nasdaq.com/market-activity/stocks/screener"
for numb in (1, 10):
    url = "https://www.nasdaq.com/market-activity/stocks/screener"
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all('table')
    df = pd.DataFrame(table)
    lst.append(df)

def get_cpf():
    driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
    driver.get(url)
    driver.find_element_by_class('pagination__page" data-page="'' + numb + ''').click()
    sleep(10)
    text = driver.find_element_by_id('texto_cpf').text
    print(text)

get_cpf()
get_cpf.click
### second attempt
#import BeautifulSoup
from bs4 import BeautifulSoup
import pandas as pd
import requests
from selenium import webdriver
from time import sleep

lst = []
for numb in (1, 10):
    r = requests.get('https://www.nasdaq.com/market-activity/stocks/screener')
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find("table", {"class": "nasdaq-screener__table"})
    for row in table.findAll("tr"):
        for cell in row("td"):
            data = cell.get_text().strip()
    df = pd.DataFrame(data)
    lst.append(df)

def get_cpf():
    driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
    driver.get(url)
    driver.find_element_by_class('pagination__page" data-page="'' + numb + ''').click()
    sleep(10)
    text = driver.find_element_by_id('texto_cpf').text
    print(text)

get_cpf()
get_cpf.click
### third attempt
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time
import requests
import pandas as pd

lst = []
url = "https://www.nasdaq.com/market-activity/stocks/screener"
driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
wait = WebDriverWait(driver, 10)
driver.get(url)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#_evh-ric-c"))).click()
for pages in range(1, 9):
    try:
        print(pages)
        r = requests.get(url)
        html = r.text
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find_all('table')
        df = pd.DataFrame(table)
        lst.append(df)
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.pagination__next"))).click()
        time.sleep(1)
    except:
        break
Here's a screenshot of the HTML behind the table I'm trying to scrape.
So, on the first page, I'd like to scrape everything from:
AAPL Apple Inc. Common Stock $127.79 6.53 5.385% 2,215,538,678,600
down to:
ASML ASML Holding N.V. New York Registry Shares $583.55 16.46 2.903% 243,056,764,541
Then move to page 2 and do the same thing, then page 3, and so on. I'm not sure whether this is even possible with BeautifulSoup alone, or whether I need Selenium for the button-click events. I'm happy to go with whatever is simplest here. Thanks!
Note that you don't need selenium for a task like this; it will only slow down your processing. In the real world, we use selenium only to bypass browser detection, then pass its cookies to any HTTP module to carry on with the work. As for your task: I noticed there's an API that actually feeds the HTML source. Here's a quick call to it.
import pandas as pd
import requests

def main(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"
    }
    params = {
        'tableonly': 'true',
        'limit': 1000
    }
    r = requests.get(url, params=params, headers=headers)
    goal = pd.DataFrame(r.json()['data']['table']['rows'])
    print(goal)
    goal.to_csv('data.csv', index=False)

if __name__ == "__main__":
    main('https://api.nasdaq.com/api/screener/stocks')
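As an aside, the selenium-to-HTTP-module handoff mentioned above can be sketched as follows. This is a minimal sketch, not code from the answer: the `selenium_cookies` list here is a hard-coded stand-in with a made-up cookie name, mimicking the list of dicts that `driver.get_cookies()` returns, so the selenium part is only shown in comments.

```python
import requests

def session_from_selenium_cookies(selenium_cookies):
    """Copy selenium-style cookie dicts into a requests.Session."""
    s = requests.Session()
    for c in selenium_cookies:
        # get_cookies() returns dicts with at least "name" and "value"
        s.cookies.set(c["name"], c["value"], domain=c.get("domain", ""))
    return s

# In a real run you would capture the cookies with selenium first:
# driver = webdriver.Chrome("C:/Utility/chromedriver.exe")
# driver.get("https://www.nasdaq.com")
# selenium_cookies = driver.get_cookies()
selenium_cookies = [  # stand-in for driver.get_cookies(); values are made up
    {"name": "ak_bmsc", "value": "abc123", "domain": ".nasdaq.com"},
]
s = session_from_selenium_cookies(selenium_cookies)
print(s.cookies.get("ak_bmsc"))  # → abc123
```

From there, every `s.get(...)` request carries the browser-issued cookies without keeping the browser open.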
Note that each page contains 25 stocks. In my code, limit=1000 fetches 1000 / 25 = 40 pages' worth in a single call. You don't need a loop over pages here at all, because you can simply increase the limit. But if you do want a for loop, you would have to request the following URL repeatedly, incrementing the offset each time:

https://api.nasdaq.com/api/screener/stocks?tableonly=true&limit=25&offset=0
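That offset loop can be sketched like this. It's a minimal sketch that only builds the query parameters for each page; the actual `requests.get` calls are left as comments, and the total row count (here hard-coded to 100 for illustration) would come from the API response in a real run.

```python
def offset_params(total_rows, per_page=25):
    """Yield one params dict per page for the screener endpoint above."""
    for offset in range(0, total_rows, per_page):
        yield {"tableonly": "true", "limit": per_page, "offset": offset}

pages = list(offset_params(100))  # 100 rows / 25 per page = 4 pages
print(pages[0])   # {'tableonly': 'true', 'limit': 25, 'offset': 0}
print(pages[-1])  # {'tableonly': 'true', 'limit': 25, 'offset': 75}

# In a real run, pass each dict to
#     requests.get('https://api.nasdaq.com/api/screener/stocks',
#                  params=params, headers=headers)
# and concatenate the resulting DataFrames with pd.concat.
```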