NLTK tokens: creating a single list of words from a pandas Series


Louloumonkey

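I'm looking for help with NLTK, or any other library that could solve the problem I'm facing.

I'm not a Python expert (I only started learning Python 4 months ago), but I did a lot of research before asking for help.

Here is what I have: a DataFrame containing a large amount of information about what students search for on our website (it is a campus website).

It looks like this: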

session             | student_query
2020-05-15 09:34:21 | exams session june 2020
2020-05-15 09:41:12 | when are the exams?
2020-05-15 09:59:51 | exams.
2020-05-15 10:02:18 | what's my teacher's email address

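What I want is one big list that looks like this: ['query', 'exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address'] ===> a single list containing all the words (no sentences), with no punctuation.

I tried: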

tokens = df['query'].apply(word_tokenize)
text = nltk.Text(tokens)

===> gives me a separate string for each row

sentences = pd.Series(df.Name)
sentences = sentences.str.replace('[^A-z ]','').str.replace(' +',' ').str.strip()
splitwords = [ nltk.word_tokenize( str(sentence) ) for sentence in sentences ]
print(splitwords)

===> a bit better, but still not what I want

NYC Coder

You can do it like this:

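# regex=True is required on pandas >= 2.0, where str.replace treats the pattern literally by default
df['student_query'] = df['student_query'].str.replace(r'\?|\.|\'', ' ', regex=True)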
list_of_words = ' '.join(df['student_query']).split()
print(list_of_words)

['exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address']
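
If the queries may contain punctuation beyond `?`, `.` and `'`, it can be easier to extract the words than to list every character to strip. The following is only a minimal sketch using the standard library; `flatten_words`, `WORD_RE` and `sample_queries` are hypothetical names introduced for illustration, and lowercasing the text is an assumption about what you want.

```python
import re
from itertools import chain

# Assumption: a "word" is any run of ASCII letters or digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")

def flatten_words(queries):
    """Return one flat list of lowercase words from an iterable of strings."""
    # findall gives a list of words per query; chain flattens them into one list.
    return list(chain.from_iterable(WORD_RE.findall(str(q).lower()) for q in queries))

# Sample rows copied from the question's table.
sample_queries = [
    "exams session june 2020",
    "when are the exams?",
    "exams.",
    "what's my teacher's email address",
]

words = flatten_words(sample_queries)
print(words)
```

Since a pandas Series is iterable, the same function should also accept `df['student_query']` directly; missing values would be converted by `str()` and show up as the word `nan`, so you may want to call `.dropna()` first.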
