Delete duplicates in large dataset based on condition

debugcn Published at Dev

user3032689

I would like to delete duplicates in a very large dataset (millions of rows) based on a condition. I thought about the following simplifying example to illustrate my prob:

test <- read.table(
text = "
A   1900    1   10  45tz    tztime1 01.06.1900
A   1900    2   9   45tz    tztime1 01.06.1900
A   1900    3   8   45tz    tztime1 01.06.1900
A   1900    4   7   45tz    tztime1 01.06.1900
A   1900    5   6   45tz    tztime1 01.06.1900
A   1900    6   5   45tz    tztime1 01.06.1900
A   1900    7   4   45tz    tztime1 01.06.1900
A   1900    7   10  45tz    tztime1 01.06.1900
A   1900    7   9   45tz    tztime1 01.06.1900
A   1900    8   3   45tz    tztime1 01.06.1900
A   1900    8   10  45tz    tztime1 01.06.1900
A   1900    8   9   45tz    tztime1 01.06.1900
A   2000    1   10  45tz    tztime2 01.06.2000
A   2000    2   9   45tz    tztime2 01.06.2000
A   2000    3   8   45tz    tztime2 01.06.2000
A   2000    3   10  45tz    tztime2 01.06.2000
A   2000    3   9   45tz    tztime2 01.06.2000
B   1800    1   10  52fd    tztime0 01.06.1800
B   1800    2   9   52fd    tztime0 01.06.1800
B   1800    3   8   52fd    tztime0 01.06.1800
B   1800    3   10  52fd    tztime0 01.06.1800
B   1800    3   9   52fd    tztime0 01.06.1800
B   1800    4   7   52fd    tztime0 01.06.1800
B   1900    1   10  52fd    tztime1 01.06.1900
B   1900    2   9   52fd    tztime1 01.06.1900
B   1900    2   10  52fd    tztime1 01.06.1900
B   1900    2   9   52fd    tztime1 01.06.1900
",header=TRUE)
library(data.table)
setDT(test)
names(test) <-  c("ID", "Year", "Count", "value", "A","B","C")

In this simplified dataset, I have two individuals (A and B), for different but possibly overlapping years. A Count is given, as well as a value.

I would like to delete the observations for each ID within each YEAR and Count group, that are duplicates and fullfill a certain condition (see below). For example for the group:

A   1900    7   4
A   1900    7   10
A   1900    7   9

I would like to delete all observations, whose value is larger than the minimum value within each group. In this case I would like to have only

A   1900    7   4

as a remainder.

Note that my real dataset is very large and has many more columns. Therefore if possible, I am looking for a solution which is memory-efficient.

I hope that was clear enough. If not, feel free to ask for any information that is missing.

Edit: my real dataset has a lot more columns than displayed here, so in the end I am looking for a solution which displays the information of all the columns (for example, assume in this case there are also column A, B and C as part of the dataset, which I have added in the latest edit. They are not really needed for the grouping/filtering, but still should be part of the final result). The currently proposed solution does not account for this.

Austin

In R, you can answer this with the following: test[,.(Value=min(Value)), by=.(ID, Year, Count)]

Here we are going through the data and finding the minimum value for each combination of ID, Year, and Count. This uses the data.table syntax from the package data.table

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-07-21

Comments

0 comments

From Dev

Subsetting a dataset based on a boolean condition assessed on another dataset, keeping the rows that meets the condition (R)?

From Java

Related Related

Article