I have a bunch of tweets in plain-text form, as shown below. I want to extract only the text part of each one.
Sample data from the file:
Fri Nov 13 20:27:16 +0000 2015 4181010297 rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
Fri Nov 13 20:27:19 +0000 2015 2347993701 international break is garbage smh. it's boring and your players get injured
Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19
Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible
Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!
Fri Nov 13 20:27:17 +0000 2015 309233999 new post: henderson memorial public library
Fri Nov 13 20:27:21 +0000 2015 291806707 who's going to next week?
Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue? @ golden bee
Here is what I tried in the preprocessing stage:
    import glob

    for filename in glob.glob('*.txt'):
        with open("plain text - preprocesshurricane.txt", 'a') as outfile, open(filename, 'r') as infile:
            for tweet in infile.readlines():
                temp = tweet.split(' ')
                text = ""
                for i in temp:
                    x = str(i)
                    # keep only tokens made up entirely of letters
                    if x.isalpha():
                        text += x + ' '
                print(text)
Output:
Fri Nov rt treating one of you lads to this denim simply follow rt to
Fri Nov this album is so proud of i loved this it really is the
Fri Nov international break is garbage boring and your players get
Fri Nov get weather updates from the weather
Fri Nov woah what happened to twitter this update is
Fri Nov completed the daily quest in paradise island
Fri Nov new henderson memorial public
Fri Nov going to next
Fri Nov why so golden
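This is not the output I want, because:
1. It drops any numbers/digits that appear in the text part of the tweet.
2. Every line starts with "Fri Nov".
Can you suggest a better way to achieve this? I am not very familiar with regular expressions, but I think we could use something like re.search(r'2015(magic to remove tweetID)/w*',tweet)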
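You can avoid regular expressions in this case. The lines you have shown are consistent in the number of spaces that come before the tweet text, so you can just split() each line with a maximum number of splits and take the last piece: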
>>> data = """
lines with tweets here
"""
>>> for line in data.splitlines():
... print(line.split(" ", 7)[-1])
...
rt we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best. -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to next week?
why so blue? @ golden bee
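To apply the same idea to files on disk rather than an in-memory string, something like the following sketch might work. It assumes every line has exactly seven space-separated metadata fields (weekday, month, day, time, UTC offset, year, tweet ID) before the text, as in the sample data. The names extract_text and preprocess and the output path in the usage note are made up for illustration. The output file is skipped if the glob pattern happens to match it, which '*.txt' would do for an output file ending in .txt.

```python
import glob
import os

# Assumption: 7 space-separated metadata fields come before the tweet text
# (weekday, month, day, time, UTC offset, year, tweet ID).
METADATA_FIELDS = 7


def extract_text(line):
    """Return the tweet text, or None if the line has too few fields."""
    parts = line.rstrip("\n").split(" ", METADATA_FIELDS)
    if len(parts) <= METADATA_FIELDS:
        return None
    return parts[METADATA_FIELDS]


def preprocess(pattern, out_path):
    """Write the text of every tweet in files matching pattern to out_path."""
    out_abs = os.path.abspath(out_path)
    # Collect the inputs first and leave out the output file itself,
    # so the results are never read back in as input.
    inputs = [f for f in glob.glob(pattern) if os.path.abspath(f) != out_abs]
    with open(out_path, "w", encoding="utf-8") as outfile:
        for filename in inputs:
            with open(filename, encoding="utf-8") as infile:
                for line in infile:
                    text = extract_text(line)
                    if text is not None:
                        outfile.write(text + "\n")
```

Calling preprocess("*.txt", "preprocessed.txt") would then write one tweet text per line. Lines that do not have all seven metadata fields (blank lines, for example) are skipped rather than raising an error.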