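I have a list of strings containing car specifications. However, the different trims are mashed together, and I would like the code to split them apart automatically, using the year as the indicator. The match has to be exactly 4 digits, or fall within a range of values, because the data also contains 3-digit and 5-digit values, while the year is always 4 digits. What do I need to tell the code so that it looks for the 4-digit number, starts a new row, and then continues the loop?
Here is the code: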
import re
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# headers = {
# 'User-Agent': 'Mewspoon',
# 'From': '[email protected]'
#}
URL = requests.get('https://www.caranddriver.com/reviews/a24847025/2018-ford-mustang-automatic-transmission-performance/')
soup = BeautifulSoup(URL.text, 'html.parser')
for tag in soup.find_all(class_="specs-content"):
    DataList = pd.DataFrame(tag.get_text(strip=True, separator="\n").split())
#File Format
df = pd.DataFrame(DataList).transpose()
#create file
df.to_excel('CarScrapeTest.xlsx', sheet_name='Car&Driver')
To answer your question: you can use re.match(r'.*([1-3][0-9]{3})', text) to check for a valid year. If it matches, you start writing to another dataframe.
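One caveat with the pattern above: the leading .* can absorb extra digits, so a 5-digit value such as 12345 also matches. Below is a minimal sketch of a stricter check. It assumes model years fall between 1900 and 2099, which is my assumption and not something the page guarantees, and is_year_token is a hypothetical helper name.

```python
import re

# Assumption: model years fall in 1900-2099; adjust the prefix if needed.
# The lookarounds reject 4 digits that are part of a longer number.
YEAR_RE = re.compile(r'(?<!\d)(?:19|20)\d{2}(?!\d)')

def is_year_token(text):
    """Return True if text contains a standalone 4-digit year."""
    return YEAR_RE.search(text) is not None

print(is_year_token("2018 Ford Mustang GT"))  # True
print(is_year_token("460"))                   # False
print(is_year_token("12018"))                 # False
```

A value such as 2000 rpm would still pass this check, so it is probably safest to apply it only to the lines you expect to be car titles.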
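I also noticed that you are trying to extract the car specifications, so I wrote a little loop that adds the information to a dataframe and then writes it to a CSV file. I use the : character to separate each attribute from its value, and then concatenate the result onto df.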
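The colon handling described above could also be sketched with str.partition, which splits at the first colon only, so a value that itself contains a colon stays in one piece. The helper name split_spec is hypothetical, and the sample strings are made up for illustration rather than copied from the page.

```python
def split_spec(line):
    """Split 'Attribute: value' at the first colon; return None if there is no colon."""
    key, sep, value = line.partition(':')
    if not sep:
        return None
    return key.strip(), value.strip()

# Made-up sample lines, for illustration only.
print(split_spec("Power: 460 hp @ 7000 rpm"))  # ('Power', '460 hp @ 7000 rpm')
print(split_spec("no colon here"))             # None
```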
Cheers.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

URL = requests.get('https://www.caranddriver.com/reviews/a24847025/2018-ford-mustang-automatic-transmission-performance/')
soup = BeautifulSoup(URL.text, 'html.parser')
specifications = soup.find(class_="specs-content")
cars_specs = dict()
df = pd.DataFrame()
for paragraph in specifications.find_all('p'):
    paragraph_text = paragraph.get_text(strip=True, separator="\n").strip()
    # Skip the section heading
    if paragraph_text == "Specifications":
        continue
    year = re.match(r'.*([1-3][0-9]{3})', paragraph_text)
    if year:
        # A paragraph with a year starts a new car: flush the previous one first
        if len(cars_specs) > 1:
            new_df = pd.DataFrame.from_dict(cars_specs, orient='index')
            df = pd.concat([df, new_df], axis=1, sort=False)
        cars_specs = {'Car': paragraph_text}
    else:
        specs = paragraph_text.split('\n')
        for index in range(len(specs) - 1):
            if specs[index].find(':') == len(specs[index]) - 1:
                # "Attribute:" on one line, value on the next line
                cars_specs[specs[index].replace(':','')] = specs[index + 1]
            elif specs[index].find(':') > 1:
                # "Attribute: value" on the same line
                inline_specs = specs[index].split(':')
                cars_specs[inline_specs[0]] = inline_specs[1]
else:
    # for/else: runs once the loop has finished, flushing the last car
    new_df = pd.DataFrame.from_dict(cars_specs, orient='index')
    df = pd.concat([df, new_df], axis=1, sort=False)
print(df)
df.to_csv('CarScrapeTest.csv', encoding='utf-8', header=False, sep=';')
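
For the flat list of strings in the question, the same idea can be sketched without any scraping: start a new row each time a token is exactly a 4-digit year. This is only a sketch under stated assumptions. The year range 1900-2099, the helper name split_rows_by_year, and the sample tokens are all mine and are not taken from the page.

```python
import re

# Assumption: a year is a token made of exactly 4 digits starting with 19 or 20.
YEAR_TOKEN = re.compile(r'(?:19|20)\d{2}')

def split_rows_by_year(tokens):
    """Group tokens into rows, opening a new row at every year token."""
    rows = []
    for token in tokens:
        # fullmatch rejects 3-digit and 5-digit values; "not rows" covers
        # input that does not begin with a year.
        if YEAR_TOKEN.fullmatch(token) or not rows:
            rows.append([])
        rows[-1].append(token)
    return rows

# Made-up sample tokens, for illustration only.
tokens = ["2018", "Ford", "Mustang", "GT", "460", "hp",
          "2018", "Ford", "Mustang", "EcoBoost", "310", "hp"]
for row in split_rows_by_year(tokens):
    print(row)
```

The resulting list of lists can then be passed to pd.DataFrame(rows) to get one row per trim.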