Delete duplicates in large dataset based on condition

user3032689

I would like to delete duplicates in a very large dataset (millions of rows) based on a condition. I put together the following simplified example to illustrate my problem:

test <- read.table(
text = "
A   1900    1   10  45tz    tztime1 01.06.1900
A   1900    2   9   45tz    tztime1 01.06.1900
A   1900    3   8   45tz    tztime1 01.06.1900
A   1900    4   7   45tz    tztime1 01.06.1900
A   1900    5   6   45tz    tztime1 01.06.1900
A   1900    6   5   45tz    tztime1 01.06.1900
A   1900    7   4   45tz    tztime1 01.06.1900
A   1900    7   10  45tz    tztime1 01.06.1900
A   1900    7   9   45tz    tztime1 01.06.1900
A   1900    8   3   45tz    tztime1 01.06.1900
A   1900    8   10  45tz    tztime1 01.06.1900
A   1900    8   9   45tz    tztime1 01.06.1900
A   2000    1   10  45tz    tztime2 01.06.2000
A   2000    2   9   45tz    tztime2 01.06.2000
A   2000    3   8   45tz    tztime2 01.06.2000
A   2000    3   10  45tz    tztime2 01.06.2000
A   2000    3   9   45tz    tztime2 01.06.2000
B   1800    1   10  52fd    tztime0 01.06.1800
B   1800    2   9   52fd    tztime0 01.06.1800
B   1800    3   8   52fd    tztime0 01.06.1800
B   1800    3   10  52fd    tztime0 01.06.1800
B   1800    3   9   52fd    tztime0 01.06.1800
B   1800    4   7   52fd    tztime0 01.06.1800
B   1900    1   10  52fd    tztime1 01.06.1900
B   1900    2   9   52fd    tztime1 01.06.1900
B   1900    2   10  52fd    tztime1 01.06.1900
B   1900    2   9   52fd    tztime1 01.06.1900
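", header = FALSE)  # the text has no header row, so do not consume the first data row
library(data.table)
setDT(test)
# setnames() renames by reference, without copying the table
setnames(test, c("ID", "Year", "Count", "value", "A", "B", "C"))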

In this simplified dataset, I have two individuals (A and B), for different but possibly overlapping years. A Count is given, as well as a value.

I would like to delete the observations for each ID within each Year and Count group that are duplicates and fulfill a certain condition (see below). For example, for the group:

A   1900    7   4
A   1900    7   10
A   1900    7   9

I would like to delete all observations whose value is larger than the minimum value within each group. In this case I would like to have only

A   1900    7   4

as a remainder.

Note that my real dataset is very large and has many more columns. Therefore, if possible, I am looking for a memory-efficient solution.

I hope that was clear enough. If not, feel free to ask for any information that is missing.

Edit: my real dataset has many more columns than displayed here, so I am looking for a solution that keeps the information of all columns in the result. For example, assume the dataset also contains the columns A, B and C, which I have added in the latest edit. They are not needed for the grouping or filtering, but should still be part of the final result. The currently proposed solution does not account for this.

Austin

In R, you can do this with the following (note that the column is named value, in lower case): test[, .(value = min(value)), by = .(ID, Year, Count)]

This finds the minimum value for each combination of ID, Year, and Count, using the syntax of the data.table package. Note that the result contains only the grouping columns and value; any other columns are dropped.
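
If the remaining columns have to survive, one possible approach (a sketch, not necessarily the fastest option for millions of rows) is to collect the row numbers of the group minima with .I and subset the table once. The small table below is a made-up stand-in for the question's data, and keep and result are just illustrative names:

```r
library(data.table)

# Made-up stand-in for the question's data; column names follow the question.
test <- data.table(
  ID    = c("A", "A", "A", "B", "B"),
  Year  = c(1900L, 1900L, 1900L, 1800L, 1800L),
  Count = c(7L, 7L, 7L, 4L, 3L),
  value = c(4L, 10L, 9L, 7L, 8L),
  A     = c("45tz", "45tz", "45tz", "52fd", "52fd")
)

# .I holds the row numbers of `test` that belong to the current group.
# Keep only the row numbers whose value equals the group minimum.
keep <- test[, .I[value == min(value)], by = .(ID, Year, Count)]$V1

# Subset once by row number, so every column is carried along.
result <- test[keep]
print(result)
```

Only an integer vector of row numbers is built per group, so the extra columns are not touched until the final subset. If several rows tie for the minimum (as in the question's group B 1900 2), this keeps all of them; replacing value == min(value) with which.min(value) should keep just the first one per group.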
