合并行+合并大型数据集的值(将SQL导出转换为多值)

阿莱吉斯人

我在CSV中有5+百万条记录的SQL导出。我想将具有相同PDP_ID字段的行合并,并将其值从一列连接到新列中

我正在使用以下函数,但是它们只是花费太长时间而无法执行,而且似乎没有进展:

PDP_ID <- unique(data$PDP_ID)

getDetailNumbers <- function(i)(paste(data$DETAIL_NUMBER[data$PDP_ID==i],collapse="@"))

DETAIL_NUMBERS <- aaply(PDP_ID,1,getDetailNumbers,.expand=FALSE,.progress="text")

获得(PDP_ID,DETAIL_NUMBERS)data.frame之后,我的计划是将其与原始数据帧合并。

PDP_ID包含约410万条记录。处理这种情况的最快方法是什么?分割文件?“数据”数据帧按PDP_ID排序。我也尝试过使用降雪包同时使用两个cpu内核,但无济于事。

Sample data:

"PDP_ID","STREETNAME_DUTCH","ACTUAL_BOX_NUMBER","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12
231313,"Street two",15
231313,"Street two",17
467626,"a third entry",1
467626,"a third entry",2
638676,"another which wont be combined",

Desired result:

"PDP_ID","STREETNAME_DUTCH","ACTUAL_BOX_NUMBER","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12@15@17
467626,"a third entry",1@2
638676,"another which wont be combined",
大卫·阿伦堡

您的数据有点奇怪,因为您只有4个列名,而只有3个列,所以我删除了一个列名。

无论如何,使用data.table它应该很快

首先,你数据

df <- read.csv(text = '"PDP_ID","STREETNAME_DUTCH","DETAIL_NUMBER"
111115,"An entry which wont be combined",
231313,"Street two",12
231313,"Street two",15
231313,"Street two",17
467626,"a third entry",1
467626,"a third entry",2
638676,"another which wont be combined",')

解决方案

library(data.table)
setDT(df)[ , list(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
                  DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@")), by = PDP_ID]

结果

#    PDP_ID                STREETNAME_DUTCH DETAIL_NUMBER
# 1: 111115 An entry which wont be combined            NA
# 2: 231313                      Street two      12@15@17
# 3: 467626                   a third entry           1@2
# 4: 638676  another which wont be combined            NA

或者,您可以尝试dplyr(也非常快)

Imprtant注dtachplyr包首先,使用detach("package:plyr", unload=TRUE)

解决方案

library(dplyr)
df %>%
  group_by(PDP_ID) %>%
  summarise(STREETNAME_DUTCH = STREETNAME_DUTCH[1],
            DETAIL_NUMBER = paste(DETAIL_NUMBER, collapse = "@"))

结果

# Source: local data frame [4 x 3]
# 
#   PDP_ID                STREETNAME_DUTCH DETAIL_NUMBER
# 1 111115 An entry which wont be combined            NA
# 2 231313                      Street two      12@15@17
# 3 467626                   a third entry           1@2
# 4 638676  another which wont be combined            NA

本文收集自互联网,转载请注明来源。

如有侵权,请联系[email protected] 删除。

编辑于
0

我来说两句

0条评论
登录后参与评论

相关文章