I have a data frame, read from a .csv file, that contains 4 variables:
str(statementGS)
$ X : int ...
$ statement_type_cd: Factor ...
$ statement_text : Factor ...
$ serial_no : int ...
I need to work with the statement_text vector (9629704 rows):
statement_text
1 pistols
2 CORDS, LINES, [ TWINES, ] AND ROPES
3 POCKET AND TABLE CUTLERY *silver color*
4 (Based on intent) Nail brushes; Lip brushes; and Make-up brushes
5 ICE CREAM FREEZERS.
...
9629702 Contract workflows, and data analytics. The SAAS feature technology for contracts
9629703 ADVANCED COMBAT SURVEILLANCE DROW (LOW ENDURANCE)
9629704 Health spa; namely, cosmetic body care services; ((beauty salon))
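I have been trying to extract each product name between the commas into a new vector using regular expressions, without success (working on a subset of the data frame).
I think the regex steps should go in this order:
1. Delete a . at the end of the cell.
2. Replace [ ] (( )) ; and . with a comma ,
3. Delete everything between * and *, together with the * themselves.
4. Delete namely or -namely, and an "and" that comes after a comma.
5. If a ( starts with Based on, delete everything inside the () and the () themselves.
6. If the cell contains , copy what lies between them into a new vector, skipping anything that is only whitespace (I don't know how to program this for the first and last elements); if it contains none, just copy the cell into the new vector.
(A product such as t-shirt can appear 1000 times. It would be better not to copy it again, but it may be easier to build the new vector first and then delete the cells that have the same characters as an earlier one.) I have been reading the documentation and, if I am not mistaken, the first 5 steps would be done with the gsub function, and then an if/else loop would be needed to build the new vector.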
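The final splitting step described above does not need an if/else loop, because strsplit() works on the whole vector at once and returns a cell without commas unchanged. A minimal sketch, assuming the separators have already been turned into commas; the vector `cleaned` is made-up sample input, not the real data:

```r
# Hypothetical sample: cells after the separators were replaced by commas.
cleaned <- c("pistols",
             "CORDS, LINES, , TWINES, , ROPES",
             "t-shirt, hats",
             "t-shirt")

# Split every cell on commas; a cell without commas comes back as is.
pieces <- unlist(strsplit(cleaned, ",", fixed = TRUE))

# Trim surrounding whitespace and skip pieces that were only whitespace.
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]

# Keep only the first occurrence of each product name.
products <- unique(pieces)
print(products)
```

unique() covers the "t-shirt appears 1000 times" case after the fact; it compares exact strings, so "T-SHIRT" and "t-shirt" would still count as different products.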
Desired result:
Products
1 pistols
2 CORDS
3 LINES
4 TWINES
5 ROPES
6 POCKET AND TABLE CUTLERY
7 Nail brushes
8 Lip brushes
9 Make-up brushes
10 ICE CREAM FREEZERS
...
20000000 ADVANCED COMBAT SURVEILLANCE DROW (LOW ENDURANCE)
20000001 Health spa
20000002 cosmetic body care services
20000003 beauty salon
20000004 Contract workflows
20000005 data analytics
20000006 The SAAS feature technology for contracts
PS: I am new to R (and to programming), but I noticed that typeof returns integer when it is used on this vector. Isn't that strange? :thinking:
typeof(statementGS$statement_text)
[1] "integer"
Thanks for your help :)
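On the typeof question: that result is expected for a factor. The str() output shows statement_text as a Factor, and a factor stores integer codes plus a levels attribute holding the distinct strings, so typeof() reports the storage type of the codes. A small sketch with made-up values:

```r
# Made-up stand-in for the statement_text column.
f <- factor(c("pistols", "ICE CREAM FREEZERS.", "pistols"))

typeof(f)   # "integer": the internal codes
class(f)    # "factor"
levels(f)   # the distinct strings the codes point to

# Convert to plain text before using gsub() or strsplit().
txt <- as.character(f)
typeof(txt) # "character"
```

Passing stringsAsFactors = FALSE to read.csv() avoids the conversion when the file is read.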
I solved this a while ago but forgot to post an answer.
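# gsub() returns a new vector instead of modifying its input, so assign each result back.
# The column was read as a factor, so convert it to character first.
statement_text <- as.character(statementGS$statement_text)
# Drop a period at the end of the cell.
statement_text <- gsub("\\.\\s*$", "", statement_text)
# Turn the other separators into commas (fixed = TRUE matches them literally).
statement_text <- gsub(";", ",", statement_text, fixed = TRUE)
statement_text <- gsub("((", ",", statement_text, fixed = TRUE)
statement_text <- gsub("))", ",", statement_text, fixed = TRUE)
statement_text <- gsub("[", ",", statement_text, fixed = TRUE)
statement_text <- gsub("]", ",", statement_text, fixed = TRUE)
# Remove "namely" together with a leading "-" or a trailing ":" or ",".
statement_text <- gsub("-?namely[:,]?", "", statement_text, ignore.case = TRUE)
# Remove an "and" that directly follows a comma, keeping the comma.
statement_text <- gsub(",\\s*and\\b", ",", statement_text, ignore.case = TRUE)
# Remove "(Based on ...)" groups, then leading spaces.
statement_text <- gsub("\\(Based on[^)]*\\)", "", statement_text, ignore.case = TRUE)
statement_text <- gsub("^ +", "", statement_text)
# Remove *...* and {...} spans.
statement_text <- gsub("\\*[^*]*\\*", "", statement_text)
statement_text <- gsub("\\{[^}]*\\}", "", statement_text)
# Replace commas with new lines. If the data frame has X rows, doing this
# won't add new rows (a lot of info would be lost), so I did it with the
# Notepad++ find-and-replace function.
# If you know how to do this in R, please say so in the comments.
statement_text <- gsub(",", "\n", statement_text, fixed = TRUE)
# Remove double quotes.
statement_text <- gsub("\"", "", statement_text, fixed = TRUE)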
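The Notepad++ step can probably be done in R as well. gsub() cannot add rows to a data frame, but strsplit() returns one vector of pieces per row, and rep() can repeat serial_no so that each piece keeps its source row. A minimal sketch, assuming the separators are already commas; the two-row data frame `df` and the result name `products` are made up for illustration:

```r
# Made-up two-row stand-in for the cleaned data.
df <- data.frame(serial_no = c(101L, 102L),
                 statement_text = c("CORDS, LINES, , TWINES", "pistols"),
                 stringsAsFactors = FALSE)

# One character vector of pieces per original row.
parts <- strsplit(df$statement_text, ",", fixed = TRUE)

# Repeat each serial_no once per piece so the two columns line up.
products <- data.frame(serial_no = rep(df$serial_no, lengths(parts)),
                       product = trimws(unlist(parts)),
                       stringsAsFactors = FALSE)

# Drop pieces that were only whitespace and renumber the rows.
products <- products[nzchar(products$product), ]
rownames(products) <- NULL
print(products)
```

With roughly 9.6 million input rows the intermediate list can be large, so it may be worth running this on a subset first.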