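I am using Python's re.sub function. It throws a TypeError: "expected string or buffer." After debugging and adding plenty of assert statements to check that I am passing a string to re.sub, I am still not sure why I get the exception. Below are my code, the error stack trace, and the related questions I have read carefully.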
import json
import re
import string
def readFile(filename):
    p = re.compile('[1-9]*[1-9]')
    def n2w(_string):
        isInt = True
        stringToReturn = ""
        try:
            stringToReturn = num2words(int(_string))
        except:
            stringToReturn = _string
        assert isinstance(stringToReturn,str)
        return stringToReturn
    def convertNumbersToWords(_string):
        #Error: expected string?
        assert isinstance(_string,str)
        _string_copy = p.sub(_string,n2w)
        return _string_copy
    questions = []
    articleTitles = []
    articleTexts = []
    answers = [] # Stores questions and article titles and article contents and their associated answers, which are stored as strings.
    # I can access the questions by using [:,0]
    #TODO: Find a way to store questions and article content as keys.
    # TODO: Convert unicode to string.
    #NOTE: I use questions_answers rather than articleTitles_answers because articles can have multiple answers.
    with open(filename) as file:
        data = json.load(file)
    articles = data["data"]
    # Iterate through articles, looking for question/answer pairs.
    for article in articles:
        article_title = str(article["title"].encode('utf-8','replace')) # Converts Unicode object to string.
        article_paragraphs = article["paragraphs"]
        article_text = "".join([str(paragraph["context"].encode('ascii','replace')) for paragraph in article_paragraphs])
        if (len(article_paragraphs) == 0):
            print("O")
        for paragraph in article_paragraphs:
            qas_pairs = paragraph["qas"]
            # Check if this paragraph has questions.
            if (len(qas_pairs) == 0):
                print("O")
            for qas_pair in qas_pairs:
                # Note: There's another attribute called "context", which may come in handy.
                answer = qas_pair["answers"][0]
                answer_text = str(answer["text"].encode('ascii','replace')) # Converts Unicode object to string.
                # Get where to find the answers.
                #answer_start = answer["answer_start"]
                #answer_end = answer_start + len(answer_text) - 1
                question = str(qas_pair["question"].encode('ascii','replace'))
                # Replace numeric characters with English words.
                question = convertNumbersToWords(question)
                answer_text = convertNumbersToWords(answer_text)
                article_title = convertNumbersToWords(article_title)
                article_text = convertNumbersToWords(article_text)
                # Remove special characters.
                from string import punctuation
                question = question.strip(punctuation)
                answer_text = answer_text.strip(punctuation)
                article_title = article_title.strip(punctuation)
                article_text = article_text.strip(punctuation)
                questions.append(question)
                articleTitles.append(article_title)
                articleTexts.append(article_text)
                answers.append(answer_text)
    print("All done")
    extractedData = np.array(questions,articleTitles,articleTexts,answers)
    return extractedData
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in ()
----> 1 trainingData = readFile("train-v1.1.json")
      2 from sys import getsizeof
      3 print("Finished loading training data.")
      4 print("Size of training data:",getsizeof(trainingData))

in readFile(filename)
     51                 question = str(qas_pair["question"].encode('ascii','replace'))
     52                 # Replace numeric characters with English words.
---> 53                 question = convertNumbersToWords(question)
     54                 answer_text = convertNumbersToWords(answer_text)
     55                 article_title = convertNumbersToWords(article_title)

in convertNumbersToWords(_string)
     16         #Error: expected string?
     17         assert isinstance(_string,str)
---> 18         _string_copy = p.sub(_string,n2w)
     19         return _string_copy
     20     questions = []

TypeError: expected string or buffer
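Other questions

TypeError: expected string or buffer
TypeError: expected string or buffer error when using a regular expression in python re.search
TypeError: expected string or buffer

These questions deal specifically with whether the regular-expression function receives a string. Since I have already done a lot of work to make sure that it does, I don't think they apply to my case.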
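For starters, you probably want to change

    _string_copy = p.sub(_string,n2w)

to

    _string_copy = p.sub(n2w,_string)

The sub method of a compiled pattern takes the replacement first and the string to search second, so the original call passes the function n2w where a string is expected, which is what raises the TypeError even though _string itself is a str.

It would also help if you could provide a sample of the JSON file. Then, although I am not sure what you are after, you could consider changing

    extractedData = np.array(questions,articleTitles,articleTexts,answers)

to

    extractedData = np.array([questions,articleTitles,articleTexts,answers])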
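
To illustrate the argument order, here is a minimal sketch rather than a drop-in replacement for the code in the question. It also reflects one further detail: when the replacement is a function, re calls it with a match object rather than a string, so n2w likely needs to read the matched text with match.group(0) before converting it. The to_words helper and its small lookup table are hypothetical stand-ins for num2words, used only so the sketch runs without third-party packages.

```python
import re

# Same pattern as in the question.
p = re.compile('[1-9]*[1-9]')

# Hypothetical stand-in for num2words (assumption: only a few values are covered).
_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}


def to_words(number):
    """Return an English word for the number, or its digits if unknown."""
    return _WORDS.get(number, str(number))


def n2w(match):
    # A callable replacement receives a match object, not a string,
    # so the matched text has to be extracted with group(0).
    return to_words(int(match.group(0)))


def convertNumbersToWords(_string):
    # Replacement first, string to search second.
    return p.sub(n2w, _string)


print(convertNumbersToWords("I have 2 cats and 3 dogs"))
# prints: I have two cats and three dogs
```

With the arguments in this order, a string without digits passes through unchanged, because the replacement function is only called for actual matches.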