Double list comprehension for occurrences of a string in a list of strings

Posted by debugcn in Dev

kat678

I have two lists of lists:

text = [['hello this is me'], ['oh you know u']]
phrases = [['this is', 'u'], ['oh you', 'me']]

I need to split the text so that any word combination that appears in phrases becomes a single string:

result = [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]

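I tried using zip(), but it iterates over the lists in lockstep, whereas I need to check every list. I also tried the find() method, but in this example it would also find every letter 'u' and turn it into a string (as with the word 'you', which it would turn into 'yo', 'u'). I wish replace() also worked when replacing a string with a list, because it would let me do something like: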

for line in text:
        line = line.replace('this is', ['this is'])

But I have tried everything and still cannot find anything that works for me in this case. Can you help me?

Nick

Could you clarify:

Given the text 'pack my box with five dozen liquor jugs' and the phrase 'five dozen',

should the result be:

(1)

['pack', 'my', 'box', 'with', 'five dozen', 'liquor', 'jugs']

or

(2)

['pack my box with', 'five dozen', 'liquor jugs']

Thanks!

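In the code below (which currently implements option 1), each text and phrase is converted to a Python list of words, such as ['this', 'is', 'an', 'example'], which prevents 'u' from being matched inside a word.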
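To illustrate why whole-word comparison matters, here is a minimal sketch using the second text from the question. The variable names substring_hit and word_hits are only for illustration and do not appear in the script.

```python
text = 'oh you know u'
phrase = 'u'

# Substring search stops at the 'u' inside 'you' (character index 5).
substring_hit = text.find(phrase)

# Comparing whole words only matches the standalone 'u' (word index 3).
words = text.split()
word_hits = [i for i, word in enumerate(words) if word == phrase]

print(substring_hit)  # 5
print(word_hits)      # [3]
```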

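All possible subphrases of a text are generated by compile_subphrases(). Longer phrases (more words) are generated first, so they are matched before shorter ones. 'five dozen jugs' is always matched in preference to 'five dozen' or 'five'.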
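The longest-first ordering can be seen in a stripped-down sketch. The helper subphrases_longest_first is a hypothetical simplification written for this illustration, not the compile_subphrases() function itself.

```python
def subphrases_longest_first(text):
    """Yield every contiguous run of words in text, longest runs first."""
    words = text.split()
    n = len(words)
    for length in range(n, 0, -1):           # n words, then n - 1, ... down to 1
        for start in range(n - length + 1):  # slide a window of that length
            yield ' '.join(words[start:start + length])

print(list(subphrases_longest_first('five dozen jugs')))
# ['five dozen jugs', 'five dozen', 'dozen jugs', 'five', 'dozen', 'jugs']
```

Because the three-word phrase comes out before any two-word phrase, a matcher that takes the first hit will prefer the longest phrase available.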

Phrases and subphrases are compared using list slicing, roughly like this:

    text = ['five', 'dozen', 'liquor', 'jugs']
    phrase = ['liquor', 'jugs']
    if text[2:4] == phrase:
        print('matched')

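Using this method of comparing phrases, the script walks through the original text and rewrites it with the matched phrases picked out.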

from itertools import chain

texts = [['hello this is me'], ['oh you know u']]
phrases_to_match = [['this is', 'u'], ['oh you', 'me']]

def flatten(list_of_lists):
    return list(chain(*list_of_lists))

def compile_subphrases(text, minwords=1, include_self=True):
    words = text.split()
    text_length = len(words)
    max_phrase_length = text_length if include_self else text_length - 1
    # NOTE: longest phrases first
    for phrase_length in range(max_phrase_length, minwords - 1, -1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(text_length - phrase_length + 1))
        yield from n_length_phrases
        
def match_sublist(mainlist, sublist, i):
    if i + len(sublist) > len(mainlist):
        return False
    return sublist == mainlist[i:i + len(sublist)]

phrases_to_match = list(flatten(phrases_to_match))
texts = list(flatten(texts))
results = []
for raw_text in texts:
    print(f"Raw text: '{raw_text}'")
    matched_phrases = [
        subphrase.split()
        for subphrase
        in compile_subphrases(raw_text)
        if subphrase in phrases_to_match
    ]
    phrasal_text = []
    index = 0
    text_words = raw_text.split()
    while index < len(text_words):
        for matched_phrase in matched_phrases:
            if match_sublist(text_words, matched_phrase, index):
                phrasal_text.append(' '.join(matched_phrase))
                index += len(matched_phrase)
                break
        else:
            phrasal_text.append(text_words[index])
            index += 1
    results.append(phrasal_text)
print(f'Phrases to match: {phrases_to_match}')
print(f"Results: {results}")

Output:

$ python3 main.py
Raw text: 'hello this is me'
Raw text: 'oh you know u'
Phrases to match: ['this is', 'u', 'oh you', 'me']
Results: [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]
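For comparison, option (2) from the clarifying question could be sketched along the same lines. This is only an illustration under the assumption that unmatched words between phrases should be joined back together; the function name split_around_phrases is hypothetical and is not part of the script above.

```python
def split_around_phrases(text, phrases):
    """Split text into matched phrases and the runs of words between them."""
    words = text.split()
    # Compare word lists, longest phrase first; skip empty phrases.
    phrase_words = sorted(
        (p.split() for p in phrases if p.split()), key=len, reverse=True
    )
    result, pending, i = [], [], 0
    while i < len(words):
        for candidate in phrase_words:
            if words[i:i + len(candidate)] == candidate:
                if pending:  # flush the unmatched words collected so far
                    result.append(' '.join(pending))
                    pending = []
                result.append(' '.join(candidate))
                i += len(candidate)
                break
        else:
            pending.append(words[i])  # no phrase starts at this word
            i += 1
    if pending:
        result.append(' '.join(pending))
    return result

print(split_around_phrases('pack my box with five dozen liquor jugs', ['five dozen']))
# ['pack my box with', 'five dozen', 'liquor jugs']
```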

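To test this and other answers against a larger dataset, try the following at the start of the code. It generates more than 100 variations of one long sentence to simulate more than 100 texts.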

from itertools import chain, combinations
import random

#texts = [['hello this is me'], ['oh you know u']]
theme = ' '.join([
    'pack my box with five dozen liquor jugs said',
    'the quick brown fox as he jumped over the lazy dog'
    ])
variations = list([
    ' '.join(combination)
    for combination
    in combinations(theme.split(), 5)
])
texts = random.choices(variations, k=500)
#phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
phrases_to_match = [
    ['pack my box', 'quick brown', 'the quick', 'brown fox'],
    ['jumped over', 'lazy dog'],
    ['five dozen', 'liquor', 'jugs']
]


Edited on 2021-04-08
