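I am trying to extract possible author names from articles. My working assumption is that the author name appears in a byline of the form
"By FirstName LastName"
or
"By FirstName MiddleName LastName"
where the first, middle, and last names each begin with a capital letter.
How can I use a regular expression to extract all the 2-3 word strings that follow "By" and also satisfy the conditions above?
For example, if the article contains the text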
"By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"
it would extract
"Barack Obama"
and
"January"
as possible author names, and I would then determine which one is correct.
Currently my regex is:
/By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
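However, when I use it on the string
"By Alex Jackson Olerud"
it seems to return both
"Alex Jackson Olerud"
and
" Olerud"
I am using Ruby as my language of choice, but any language-agnostic solution would suffice.

Here is my suggestion: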
str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president.
By A. B. Cecil"

def find_authors(str)
  str.scan(/
    (?<name>               # a named capture group for a single name component
      \p{Lu}               # starts with an uppercase letter (Unicode-aware, so it also works for e.g. Åsa)
      (?: \. | \p{Ll}+ )   # followed by a period or by one or more lowercase letters
    ){0}                   # zero repetitions: this only defines a subroutine to be reused below
    (?<=[Bb]y\s)           # lookbehind to make sure the name comes right after "by" or "By"
    (?<wholename>          # capture group to extract the whole name
      \g<name> (\s \g<name>){1,2}  # a name must have at least two and at most three components
    )
  /x).map(&:last)          # keep only <wholename>, dropping the <name> group from each result
end

# The same pattern written on one line, without the x modifier
def find_authors_oneline(str)
  str.scan(/(?<name>\p{Lu}(?:\.|\p{Ll}+)){0}(?<=[Bb]y\s)(?<wholename>\g<name>(\s\g<name>){1,2})/).map(&:last)
end

p find_authors str
# => ["Barack Obama", "A. B. Cecil"]
p find_authors_oneline str
# => ["Barack Obama", "A. B. Cecil"]
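As for why the regex in the question seemed to return both "Alex Jackson Olerud" and " Olerud": String#scan returns one array per match with one element per capture group, and a repeated group only keeps its last repetition. The pattern has two capture groups, so the inner one shows up as " Olerud". The following is a minimal sketch of that behavior and of one possible fix, which makes the inner group non-capturing with (?:...). The method names are made up for illustration and are not part of the code above.

```ruby
# The pattern from the question, unchanged: the inner (...) is a second capture group.
def names_with_original(text)
  text.scan(/By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/)
end

# Hypothetical variant: the inner group is non-capturing, so each match
# yields a single capture, and flatten turns the result into a flat list.
def names_with_non_capturing(text)
  text.scan(/By ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)/).flatten
end

p names_with_original("By Alex Jackson Olerud")
# => [["Alex Jackson Olerud", " Olerud"]]
p names_with_non_capturing("By Alex Jackson Olerud")
# => ["Alex Jackson Olerud"]
```

Note that this variant keeps the `+` from the question, so it still requires at least two words and does not cap the name at three.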
You can read more about regex subroutines and the regex /x modifier.
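To see the two features in isolation, here is a small sketch, assuming a simplified notion of a "word" (one uppercase letter followed by lowercase letters). The `{0}` quantifier defines the group without matching anything at that position, and `\g<word>` then calls the definition. The method name first_capitalized_pair is hypothetical and exists only for this illustration.

```ruby
def first_capitalized_pair(text)
  pattern = /
    (?<word> \p{Lu} \p{Ll}+ ){0}  # definition only: {0} means nothing is matched at this position
    \g<word> \s \g<word>          # two calls to the subroutine, separated by one whitespace character
  /x
  text[pattern]  # String#[] with a Regexp returns the matched substring, or nil if there is no match
end

p first_capitalized_pair("written by Barack Obama")   # => "Barack Obama"
p first_capitalized_pair("no capitalized pair here")  # => nil
```

With the x modifier, literal whitespace and `#` comments inside the pattern are ignored, which is why the whitespace between the two words has to be matched explicitly with `\s`.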