spaCy pattern matching

Posted by debugcn in Dev

Shoes

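I want to see how to use spaCy pattern matching to find product categories referenced in a text. I am clearly not constructing the pattern correctly.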

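I want to identify CAT-POS-2299 as a product. I have tried several different variations. How would you look for an even more generic pattern such as CAT-???-???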

Maybe I should be using something else?

Code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

matcher.add("Product", None,
            [{"LOWER": "CAT"},{"LOWER":"-"},{"LOWER":"POS"},{"LOWER":"-"},{"IS_DIGIT":True}]
           )

doc = nlp(" We have a new product CAT-POS-2299 that will be available to users soon.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

Wiktor Stribiżew

If you check how the input string is tokenized, you will see that POS-2299 comes out as a single token:

print([t.text for t in doc])
[' ', 'We', 'have', 'a', 'new', 'product', 'CAT', '-', 'POS-2299', 'that', 'will', 'be', 'available', 'to', 'users', 'soon', '.']

So, if you intend to match the word CAT case-insensitively, then a - token, and then a token made of ASCII letters followed by - and one or more digits, you can use

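# spaCy v2 signature: matcher.add(key, on_match, *patterns).
# In spaCy v3 the equivalent call is matcher.add("Product", [pattern]).
matcher.add("Product", None, [{"TEXT": {"REGEX": "(?i)CAT"}},{"TEXT":"-"},{"TEXT": {"REGEX": r"(?i)[A-Z]+-\d+"}}])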
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

# => 16898055450696666743 Product 6 9 CAT-POS-2299
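
To make the token-level logic easier to see, here is a minimal sketch that emulates it in plain Python over the token list printed above, without spaCy. The helper `find_token_spans` is a hypothetical name used only for illustration, and joining the tokens with no separator assumes there was no whitespace between them. The last two lines illustrate a point worth checking in your own setup: to my knowledge the REGEX operator searches inside the token text rather than matching the whole token, so an unanchored `(?i)CAT` may also accept a token such as `category` unless you add `^` and `$`.

```python
import re

# Token texts copied from the tokenizer output shown in the answer.
tokens = [' ', 'We', 'have', 'a', 'new', 'product', 'CAT', '-', 'POS-2299',
          'that', 'will', 'be', 'available', 'to', 'users', 'soon', '.']

# One regex per token position; fullmatch() anchors each one at both ends.
token_patterns = [
    re.compile(r"CAT", re.IGNORECASE),
    re.compile(r"-"),
    re.compile(r"[A-Z]+-\d+", re.IGNORECASE),
]

def find_token_spans(tokens, patterns):
    """Return (start, end, text) for each window of tokens matching all patterns."""
    spans = []
    width = len(patterns)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(p.fullmatch(t) for p, t in zip(patterns, window)):
            # Assumption: the matched tokens were adjacent with no whitespace.
            spans.append((start, start + width, "".join(window)))
    return spans

spans = find_token_spans(tokens, token_patterns)
print(spans)  # [(6, 9, 'CAT-POS-2299')]

# Unanchored search accepts longer tokens; a full match does not.
print(bool(re.search(r"(?i)CAT", "category")))     # True
print(bool(re.fullmatch(r"(?i)CAT", "category")))  # False
```

The span indices agree with the `6 9` reported by the matcher output above.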

Since you are looking to make the pattern more generic, I think using the REGEX token attribute makes sense.
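
As for the "maybe I should use something else" part of the question: if token boundaries do not matter for your use case, a plain regular expression over the raw text may be enough. The sketch below is one possible reading of CAT-???-??? (a letter segment followed by a digit segment, as in the pattern above); `PRODUCT_RE` and `find_product_codes` are hypothetical names, and the exact character classes are an assumption you would adjust to your real product codes.

```python
import re

# Assumed shape of a product code: CAT-<letters>-<digits>, case-insensitive.
PRODUCT_RE = re.compile(r"\bCAT-[A-Z]+-\d+\b", re.IGNORECASE)

def find_product_codes(text):
    """Return every substring of text that looks like a product code."""
    return [m.group() for m in PRODUCT_RE.finditer(text)]

text = " We have a new product CAT-POS-2299 that will be available to users soon."
print(find_product_codes(text))  # ['CAT-POS-2299']

# Lowercase codes are accepted; a code without the digit segment is not.
print(find_product_codes("cat-inv-17 and CAT-POS"))  # ['cat-inv-17']
```

The trade-off is that this gives you strings rather than spaCy spans, so it fits best when you do not need the match tied to tokens or other linguistic annotations.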

Note: