Extracting information from text

Posted by debugcn in Dev

类二烯类

I have the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.              

Name                                 Group                       12345678        
ALEX A ALEX                                                                   
ID#                                  PUBLIC NETWORK                  
XYZ123456789                                                                  


Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

I want to extract the ID value that sits below the ID# keyword in the text.

The problem is that the ID can be in a different position in different text files, for example in the middle of other text, like this:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's          
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.

In addition, there can be extra lines between ID# and the value:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's      
printing and typesetting industry. Lorem Ipsum has been the                                  printing and typesetting industry. Lorem Ipsum has been the 
standard dummy text ever since the 1500s, when an unknown printer took a     XYZ123456789    galley of type and scrambled it to make a type specimen book.

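Could you show a way to extract the value under ID# in the cases above? Is there a standard technique for extracting this kind of information, for example RegEx or some approach built on top of RegEx? Can NLP be applied here?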

雷沃

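There doesn't seem to be a well-defined format for the ID value, so a single regular expression won't help: there is hardly anything regular here.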

You have to use two regular expressions to get the expected output. The first one is:

(?m)^(.*)ID#.*([\s\S]*)

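It tries to find ID# line by line and captures two chunks of the string. The first chunk is everything from the beginning of the line up to ID#. The second chunk is everything that appears after the line containing ID#.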

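Then we compute the length of the first capturing group. It gives us the column number at which we should start looking for the ID on the following lines: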

m.group(1).length();
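As a small illustration of this first pass, here is a minimal sketch that runs the first regex on a shortened two-line sample and prints the resulting column. The class name `FirstPassDemo`, the helper `idColumn`, and the sample text are made up for this example and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstPassDemo {
    // Returns the 0-based column at which "ID#" starts, or -1 if the keyword is absent.
    static int idColumn(String text) {
        Matcher m = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)").matcher(text);
        return m.find() ? m.group(1).length() : -1;
    }

    public static void main(String[] args) {
        // Hypothetical sample: "some text" (9 chars) plus 3 spaces puts ID# at column 12.
        String sample = "some text   ID#   more text\n"
                      + "other text  ABC1  trailing\n";
        System.out.println(idColumn(sample)); // prints 12
    }
}
```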

Then we build the second regex using this length:

(?m)^.{X}(?<!\S)\h{0,3}(\S+)

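Breakdown:

(?m) enables multiline mode
^ matches the start of a line
.{X} matches the first X characters (X is m.group(1).length())
(?<!\S) asserts that the current position is not preceded by a non-whitespace character, i.e. it follows whitespace or the start of the input
\h{0,3} matches up to 3 horizontal whitespace characters (in case the value is shifted to the right)
(\S+) captures the non-whitespace characters that follow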
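To make the breakdown concrete, the sketch below builds the second pattern for a given X and tries it on three single-line inputs: the value exactly at the column, the value shifted right by two spaces, and a line where the column falls inside a word. `ColumnPatternDemo`, `valueAtColumn`, and the inputs are hypothetical names and data chosen for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColumnPatternDemo {
    // Builds the column-anchored pattern for column x and returns the first token found, or null.
    static String valueAtColumn(String text, int x) {
        Pattern p = Pattern.compile("(?m)^.{" + x + "}(?<!\\S)\\h{0,3}(\\S+)");
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(valueAtColumn("aaaa  XYZ1 bbb", 6));   // value starts at column 6: XYZ1
        System.out.println(valueAtColumn("aaaa    XYZ1 bbb", 6)); // shifted right by 2: XYZ1
        System.out.println(valueAtColumn("aaaaaaaaXYZ1 bbb", 6)); // column 6 is inside a word: null
    }
}
```

The third call returns null because the lookbehind rejects a position that is preceded by a non-whitespace character, which keeps the pattern from cutting a token in half.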

Then we run this regex against the second capturing group of the previous regex:

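import java.util.regex.Matcher;
import java.util.regex.Pattern;

// `string` holds the full input text.
// Pass 1: group(1) is the text before ID# on its line, group(2) is everything after that line.
Matcher m = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)").matcher(string);
if (m.find()) {
    // Pass 2: jump to the ID# column on a following line and capture the token found there.
    Matcher m1 = Pattern.compile("(?m)^.{" + m.group(1).length() + "}(?<!\\S)\\h{0,3}(\\S+)").matcher(m.group(2));
    if (m1.find())
        System.out.println(m1.group(1));
}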
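For readers who want to try the two passes end to end, here is a self-contained sketch that wraps them in one method and runs it on shortened versions of the three layouts from the question. The class `IdExtractor`, the helpers `extractId` and `row`, the column widths, and the sample strings are assumptions made for this example, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdExtractor {
    private static final Pattern KEYWORD_LINE = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    // Returns the token found under "ID#", or null when the keyword or a value is missing.
    static String extractId(String text) {
        Matcher m = KEYWORD_LINE.matcher(text);
        if (!m.find()) {
            return null;
        }
        int column = m.group(1).length(); // column where ID# starts
        Matcher m1 = Pattern.compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)")
                            .matcher(m.group(2));
        return m1.find() ? m1.group(1) : null;
    }

    // Lays out three cells in fixed-width columns (widths 19 and 16 are arbitrary choices).
    static String row(String left, String middle, String right) {
        return String.format("%-19s%-16s%s\n", left, middle, right);
    }

    public static void main(String[] args) {
        String keywordAtLineStart = "ID#       PUBLIC NETWORK\nXYZ123456789\n";
        String keywordMidLine = row("dummy text of", "ID#", "the printing")
                              + row("printer took a", "XYZ123456789", "galley");
        String extraLineBetween = row("dummy text of", "ID#", "the printing")
                                + row("has been the", "", "industry")
                                + row("printer took a", "XYZ123456789", "galley");

        System.out.println(extractId(keywordAtLineStart)); // XYZ123456789
        System.out.println(extractId(keywordMidLine));     // XYZ123456789
        System.out.println(extractId(extraLineBetween));   // XYZ123456789
    }
}
```

Note that the second pass returns the first token it finds at that column, so an intermediate line that happens to have a word starting there would be picked up instead of the ID.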

