Extracting the content between commas (and more) from a vector in R

宝珠

I have a data frame read from a .csv file, with 4 variables:

str(statementGS)
$ X                : int ...
$ statement_type_cd: Factor ...
$ statement_text   : Factor ...
$ serial_no        : int ...

I need to work with the statement_text vector (9629704 rows):

                                                                            statement_text
1                                                                                  pistols
2                                                      CORDS, LINES, [ TWINES, ] AND ROPES
3                                                  POCKET AND TABLE CUTLERY *silver color*
4                         (Based on intent) Nail brushes; Lip brushes; and Make-up brushes
5                                                                      ICE CREAM FREEZERS.
...        
9629702  Contract workflows, and data analytics. The SAAS feature technology for contracts  
9629703                                  ADVANCED COMBAT SURVEILLANCE DROW (LOW ENDURANCE)
9629704                  Health spa; namely, cosmetic body care services; ((beauty salon))

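I have been trying to extract each product name between commas into a new vector using regular expressions, without success (working on a subset of the data frame).

I think the sequence of regex steps should be something like this:

Delete every . at the end of a cell
Change every [ ] (( )) ; . to a comma ,
Delete everything between * * along with the * themselves
Delete every namely or -namely
Delete every and that follows a comma
If a ( starts with Based on, delete everything inside the () along with the () themselves
Now look through the vector: if a cell contains , then copy the content between the commas to a new vector, but skip it if there is only whitespace between them (I don't know how to program this for the first and last elements); if the cell has no commas, just copy the cell to the new vector.
(If an element is already in the new vector, it would be best not to copy it again, i.e. not copy t-shirt 1000 times, but it may be easier to build the new vector first and then delete the cells that duplicate an earlier one.)

I have been reading the documentation and, if I am not mistaken, the first 5 steps can be done with the gsub function, and then an if/else loop would be needed to build the new vector.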
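
As a small illustration of the cleaning steps listed above, here is a minimal sketch on three of the sample rows. gsub() is vectorised, so it processes every element of a character vector in one call and no loop is needed for this part. The names sample_text and cleaned are made up for this example, and the patterns are one possible reading of the steps, not a definitive implementation.

```r
# Three sample rows taken from the question (sample_text is a hypothetical name).
sample_text <- c(
  "ICE CREAM FREEZERS.",
  "POCKET AND TABLE CUTLERY *silver color*",
  "(Based on intent) Nail brushes; Lip brushes; and Make-up brushes"
)

# Remove a period at the end of the cell.
cleaned <- gsub("\\.\\s*$", "", sample_text)

# Remove *...* annotations, including the asterisks and the space before them.
# "*" is a regex metacharacter, so it is escaped with a double backslash.
cleaned <- gsub("\\s*\\*[^*]*\\*", "", cleaned)

# Remove "(Based on ...)" together with its parentheses and trailing space.
cleaned <- gsub("\\(Based on[^)]*\\)\\s*", "", cleaned)

print(cleaned)
```

Each call returns a new vector, which is why the result is assigned back to cleaned every time.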

Desired result:

         Products
1        pistols
2        CORDS 
3        LINES
4        TWINES
5        ROPES
6        POCKET AND TABLE CUTLERY
7        Nail brushes
8        Lip brushes 
9        Make-up brushes
10       ICE CREAM FREEZERS
...
20000000 ADVANCED COMBAT SURVEILLANCE DROW (LOW ENDURANCE)
20000001 Health spa 
20000002 cosmetic body care services 
20000003 beauty salon
20000004 Contract workflows 
20000005 data analytics 
20000006 The SAAS feature technology for contracts

PS: I am new to R (and to programming), but I noticed that typeof returns "integer" when used on this vector. Isn't that strange?

typeof(statementGS$statement_text)
[1] "integer"
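
The "integer" result is expected for a factor: R stores a factor as integer codes plus a table of levels, and typeof() reports the underlying storage. A minimal sketch, using a small made-up factor in place of the real column:

```r
# A small stand-in for statementGS$statement_text (values are illustrative).
statement_text <- factor(c("pistols", "ICE CREAM FREEZERS.", "pistols"))

typeof(statement_text)   # storage type of a factor is integer
class(statement_text)    # the class is what marks it as a factor

# Convert to plain text before applying regular expressions.
as_text <- as.character(statement_text)
typeof(as_text)
```

Reading the file with read.csv(..., stringsAsFactors = FALSE) avoids the conversion altogether; in R 4.0.0 and later that is already the default.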

Thanks for your help :)

宝珠

I solved this a while ago but forgot to post the answer.

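# gsub() returns a modified copy, so every result is assigned back.
# statement_text is a factor in the data frame; convert it to character first.
statement_text <- as.character(statementGS$statement_text)
statement_text <- gsub("\\.\\s*$", "", statement_text)            # trailing period
statement_text <- gsub(";", ",", statement_text, fixed = TRUE)
statement_text <- gsub("((", ",", statement_text, fixed = TRUE)   # fixed = TRUE treats "(" literally
statement_text <- gsub("))", ",", statement_text, fixed = TRUE)
statement_text <- gsub("[", ",", statement_text, fixed = TRUE)
statement_text <- gsub("]", ",", statement_text, fixed = TRUE)
statement_text <- gsub("-?namely:?", "", statement_text, ignore.case = TRUE)
# Replace ", and" with a plain comma instead of deleting it, so items stay separated.
statement_text <- gsub(",\\s*and\\b", ",", statement_text, ignore.case = TRUE)
statement_text <- gsub("\\(Based on[^)]*\\)", "", statement_text, ignore.case = TRUE)
statement_text <- gsub("^ ", "", statement_text)
statement_text <- gsub("\\*[^*]*\\*", "", statement_text)         # *...* including the asterisks
statement_text <- gsub("\\{[^}]*\\}", "", statement_text)         # {...} including the braces
# Replace commas with new lines. Doing this on a data frame with X rows
# won't add new rows (a lot of info would be lost), so I did it with the
# Notepad++ find and replace function.
# If you know how to do this in R, please say so in the comments.
statement_text <- gsub(",", "\n", statement_text, fixed = TRUE)
statement_text <- gsub("\"", "", statement_text, fixed = TRUE)    # stray double quotes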
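
On the open question in the comments above (getting more rows out than went in), one possible approach in R is to split each cell on commas with strsplit() and flatten the result with unlist(), which gives a new vector of any length. The sketch below assumes the text has already been cleaned so that commas are the only separators; the names cleaned, pieces and products are hypothetical.

```r
# Stand-in for the cleaned text, with commas as the only separators.
cleaned <- c(
  "pistols",
  "CORDS, LINES, , TWINES, , ROPES",
  "Nail brushes, Lip brushes, Make-up brushes",
  "pistols"
)

# strsplit() returns one character vector per cell; unlist() flattens them.
pieces <- unlist(strsplit(cleaned, ",", fixed = TRUE))

# Trim surrounding whitespace and drop pieces that were only whitespace.
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]

# unique() keeps the first occurrence of each product, so repeats are dropped.
products <- data.frame(Products = unique(pieces), stringsAsFactors = FALSE)
print(products)
```

Because the split result is built as a separate vector and only then wrapped in a new data frame, it is not constrained by the row count of the original one. This also covers the whitespace-only and duplicate cases from the question without an if/else loop.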
