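I'm trying to match the whitespace after punctuation so that I can split up a large body of text, but I'm running into some common edge cases with places, titles and abbreviations: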
I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith
I'm using this with the re.split function in Python 3, and I want to get this:
["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]
This is my regex:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)
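I decided to try fixing the No. case first, using the last two conditions. But this relies on matching the N and the o independently, which I think will cause false positives elsewhere. I can't figure out how to make it match only the string No immediately before the period. I would then use a similar approach for Sgt. and any other "problem" strings I come across.
I'm trying to use something like: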
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)
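But then it doesn't capture anything. How can I get it to exclude certain strings that I expect to contain a period, so that they are not matched?
Here is a regexr of my situation: https://regexr.com/4sgcb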
Doing this with only one regex will be tricky; as stated in the comments, there are lots of edge cases.
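To illustrate the difficulty, here is a sketch of what a single-pattern attempt might look like. It states each exclusion as a negative lookbehind such as (?<!No\.), rather than a positive lookbehind anchored with ^ and $, which cannot match in the middle of the text. The list of exclusions is an assumption tuned to the sample sentence only, and the names SPLIT_RE and sample are hypothetical.

```python
import re

# Each (?<!...) rejects a split point that is preceded by that fixed-width text.
# The exclusions are assumptions chosen to fit the sample sentence only.
SPLIT_RE = re.compile(
    r'(?<=[.?!])'    # must follow sentence-ending punctuation
    r'(?<![A-Z]\.)'  # but not a capital initial, such as "Y." in "N.Y."
    r'(?<!\d\.)'     # nor a digit and a period, as in "$6. 00"
    r'(?<!No\.)'     # nor the literal abbreviation "No."
    r'(?<!Sgt\.)'    # nor "Sgt."
    r'\s+'           # the whitespace that is actually consumed by the split
)

sample = ('I am from New York, N.Y. and I would like to say hello! '
          'How are you today? I am well. I owe you $6. 00 because you '
          'bought me a No. 3 burger. -Sgt. Smith')

sentences = SPLIT_RE.split(sample)
for sentence in sentences:
    print(sentence)
```

This yields the five desired sentences for the sample, but every rule has a cost: the digit rule, for instance, also blocks a legitimate split in text such as "I owe you $6. Thanks." and each new abbreviation needs another lookbehind.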
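Myself, I would do this in three steps:
1. Replace the spaces that should be kept with a special character (re.sub)
2. Split the text (re.split)
3. Replace the special character with a space again
For example: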
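import re
from pprint import pprint

# Placeholder for spaces that must survive the split; \s does not match it.
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
# Step 1: protect spaces after a period followed by a digit or lowercase letter, after a comma, or after "Sgt."
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
# Step 2: split on whitespace that follows sentence-ending punctuation.
s = re.split(r'(?<=[.?!])\s+', s)
# Step 3: turn the placeholders back into ordinary spaces.
pprint([line.replace(zero_width_space, ' ') for line in s])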
Prints:
['I am from New York, N.Y. and I would like to say hello!',
'How are you today?',
'I am well.',
'I owe you $6. 00 because you bought me a No. 3 burger.',
'-Sgt. Smith']
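The three steps could also be wrapped in a reusable helper so that new "problem" strings only need to be added to a list. The following is a sketch under assumptions, not part of the original answer: the function name split_sentences, the abbreviations parameter and its default values are hypothetical, and the abbreviations are protected with a capture group instead of one lookbehind per abbreviation.

```python
import re

# Zero-width space used as a placeholder; assumed not to occur in the input.
PLACEHOLDER = '\u200b'


def split_sentences(text, abbreviations=('No.', 'Sgt.')):
    """Split text into sentences, keeping spaces that follow known abbreviations."""
    # Step 1a: protect whitespace after each listed abbreviation. (?<!\w)
    # ensures the abbreviation is not the tail of a longer word.
    if abbreviations:
        abbr = '|'.join(re.escape(a) for a in abbreviations)
        text = re.sub(r'(?<!\w)(' + abbr + r')\s+',
                      lambda m: m.group(1) + PLACEHOLDER, text)
    # Step 1b: protect whitespace after a comma, or after a period that is
    # followed by a digit or a lowercase letter (same rules as the answer).
    text = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+', PLACEHOLDER, text)
    # Step 2: split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r'(?<=[.?!])\s+', text)
    # Step 3: restore the protected spaces.
    return [part.replace(PLACEHOLDER, ' ') for part in parts]


sample = ('I am from New York, N.Y. and I would like to say hello! '
          'How are you today? I am well. I owe you $6. 00 because you '
          'bought me a No. 3 burger. -Sgt. Smith')
result = split_sentences(sample)
```

Calling split_sentences('I met Dr. Who. He left.', abbreviations=('Dr.',)) would keep "Dr. Who" together in the same way. As in the answer, runs of protected whitespace are collapsed to a single space.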