Removing items from one list based on a separate list of words in Python 3.x

Mohammed

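I have a list containing many tagged bigrams. Some of the bigrams are not tagged correctly, so I want to remove them from the master list. One of the words in these bigrams is usually a frequently repeated word, so a bigram can be dropped whenever it contains one of those words. A pseudo-example: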

master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text', 'this book', 'a car', 'literary text', 'new book', 'them about', 'on the' , 'in that', 'tagged corpus', 'on top', 'a car', 'an orange', 'the book', 'them what', 'then how']

unwanted_words = ['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them']

new_list = [item for item in master_list if not [x for x in unwanted_words] in item]

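I can remove the items one word at a time, i.e. build a new list that drops every item containing the word "on", then repeat for the next word. That is tedious: it would take hours of filtering and creating a new list for each unwanted word. I thought a loop would help, but I get the following TypeError: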

Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    new_list = [item for item in master_list if not [x for x in unwanted_words] in item]
  File "<pyshell#21>", line 1, in <listcomp>
    new_list = [item for item in master_list if not [x for x in unwanted_words] in item]
TypeError: 'in <string>' requires string as left operand, not list

Any help is much appreciated!

tobias_k

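Your condition, if not [x for x in unwanted_words] in item, is the same as if not unwanted_words in item; that is, you are checking whether a list is contained in a string. When the right-hand side of in is a string, the left-hand side must also be a string, which is why Python raises the TypeError.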
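To make the difference concrete, here is a small sketch with made-up sample values (a shortened unwanted_words and two sample strings chosen for illustration). It shows the failing list-in-string test, and also why a plain substring test on single words would not be a good replacement either:

```python
unwanted_words = ['this', 'is', 'a']   # shortened list, for illustration only
item = 'a car'

# A list on the left of `in` with a string on the right raises TypeError.
try:
    unwanted_words in item
    raised = False
except TypeError:
    raised = True

# A string on the left performs a substring test, which is too loose here:
# 'a' occurs inside 'sample', although it is not one of the words.
substring_hit = 'a' in 'sample word'       # substring test on the whole string
word_hit = 'a' in 'sample word'.split()    # membership test on the word list

print(raised, substring_hit, word_hit)
```

Splitting the bigram into words first avoids the false positive from the substring test.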

Instead, you can use any to check whether any word of the bigram is in unwanted_words. You can also make unwanted_words a set to speed up the lookups.

>>> master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text', 'this book', 'a car', 'literary text', 'new book', 'them about', 'on the' , 'in that', 'tagged corpus', 'on top', 'a car', 'an orange', 'the book', 'them what', 'then how']
>>> unwanted_words = set(['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them'])
>>> [item for item in master_list if not any(x in unwanted_words for x in item.split())]
['sample word', 'sample text', 'literary text', 'new book', 'tagged corpus', 'then how']
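If the filter is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: the function name filter_bigrams is hypothetical, and it swaps any for set.isdisjoint, which expresses the same test (no word of the bigram is in the unwanted set).

```python
def filter_bigrams(bigrams, unwanted):
    """Return the bigrams that contain none of the unwanted words.

    Hypothetical helper, not part of the original answer.
    """
    unwanted = set(unwanted)  # set membership tests are fast on average
    # isdisjoint is True when the bigram shares no word with `unwanted`
    return [b for b in bigrams if unwanted.isdisjoint(b.split())]


master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text',
               'this book', 'a car', 'literary text', 'new book', 'them about',
               'on the', 'in that', 'tagged corpus', 'on top', 'a car',
               'an orange', 'the book', 'them what', 'then how']
unwanted_words = ['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them']

new_list = filter_bigrams(master_list, unwanted_words)
print(new_list)
```

The comparison is case-sensitive and splits on whitespace only, so a bigram such as 'The book' or 'on, top' would need lowercasing or punctuation stripping first; whether that matters depends on how the corpus was tagged.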
