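I keep running into the problem of parsing a date out of a relatively unstructured text document in which the date is embedded, and where its position and format vary from case to case. An example text: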
"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
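From this text I want to extract the date string "July 1st, 2015" (step 1) and convert it to a format such as 2015-07-01 UTC (step 2). Step 2 can be done with, for example, parse_date_time from the lubridate package (which is useful because it handles many different date formats):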
Case 1:
library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale = "C")
[1] "2015-07-01 UTC"
In some cases, parse_date_time also works on a longer string that contains the date. For example:
Case 2:
parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale = "C")
[1] "2015-07-01 UTC"
However, as far as I can tell, step 2 cannot be run directly on the full example text:
Case 3:
parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale = "C")
[1] NA
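Apparently some of the other information in the text gets in the way of parsing the date directly from the full text. One approach I can think of is to do step 1 with a regular expression that extracts a trimmed string which contains the date and which parse_date_time can handle (similar to Case 1 or Case 2). However, combining regular expressions with dates always feels a bit dirty, because a regular expression cannot tell whether what it extracts is a valid date.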
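One way to soften the "regex cannot tell whether the date is valid" concern is to treat the regex output only as a list of candidates and let a date parser reject the impossible ones. The following is a minimal sketch, not a definitive solution. The helper name extract_valid_dates and its pattern are hypothetical, it uses base R's as.Date instead of parse_date_time to stay dependency-free, and it assumes English month names (an English or C LC_TIME locale) and the "Month Dth, YYYY" layout only.

```r
# Hypothetical helper: find "Month Dth, YYYY" candidates and keep only real dates.
extract_valid_dates <- function(text) {
  # Candidate pattern: capitalised word, day with ordinal suffix, 4-digit year
  pattern <- "[A-Z][a-z]+ [0-9]{1,2}(st|nd|rd|th), [0-9]{4}"
  candidates <- regmatches(text, gregexpr(pattern, text))[[1]]
  # Drop the ordinal suffix so that "%d" can read the day number
  cleaned <- sub("([0-9]{1,2})(st|nd|rd|th)", "\\1", candidates)
  # as.Date() yields NA for strings that are not real dates
  # (assumption: %B matches English month names in the current locale)
  parsed <- as.Date(cleaned, format = "%B %d, %Y")
  parsed[!is.na(parsed)]
}

# "February 30th, 2015" matches the regex but is dropped by the parser
extract_valid_dates("City, Country, July 1st, 2015 - see also February 30th, 2015.")
```

The regex stays deliberately loose here, because the validity check is delegated to the parser rather than encoded in the pattern.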
Is there a way to run step 2 directly on unstructured text such as the example in Case 3 (i.e. without a regex-based workaround)?
Thanks a lot for any input!
Using this website, we can build a regular expression:
( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+
but it does not work in R as written :(
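Once corrected, it does work. A character class such as [1-31] or [th, st] matches a single character, not a number range or a suffix, so the corrected version uses alternation instead: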
> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"
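Note that `[JFMASOND]\\w+` accepts any word starting with one of those letters, not only month names. As a possible refinement (a sketch only, with the variable names month_alt, pattern and date_string chosen for illustration), the month part can be built from R's built-in month.name constant, and regmatches can replace the manual substr arithmetic. Dropping the leading space from the pattern also avoids the leading space in the result.

```r
# Build the month alternation from R's built-in month.name constant
month_alt <- paste(month.name, collapse = "|")
pattern <- paste0("(", month_alt, ") ",
                  "([1-9]|[12][0-9]|3[01])(st|nd|rd|th), [12][0-9]{3}")

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011."

# regexpr() locates the first match; regmatches() returns the matched
# substring, or character(0) when there is no match
m <- regexpr(pattern, x)
date_string <- regmatches(x, m)
date_string
```

The extracted string can then be passed to parse_date_time as in Case 1.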