How to remove inconsistencies from dataframe (time series)

Yatrosin

Let's say that we have this dataframe:

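x <- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                         c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
                         c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
                         c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x) <- c("ID", "Visit", "Time", "State")
# cbind() with a character vector coerces every column to character,
# so convert the numeric columns back before comparing values
num_cols <- c("Visit", "Time", "State")
x[num_cols] <- lapply(x[num_cols], function(col) as.numeric(as.character(col)))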

Column ID indicates subject ID.

Column Visit indicates the visit number within a subject's series of visits.

Column Time indicates the time that has passed to reach a certain "State".

Column State indicates severity of a certain disease, where 5 means death. That means that you can fluctuate from worse states to better states, but you can never improve from category 5, since you are dead.

I would like to identify only those rows where a subject improved from category 5 to a better one, since these are errors in the data frame (i.e. rows 13 and 16).

Additionally, I would like to remove those rows where a subject seems to have died more than once (i.e. row 18).

I asked a similar question before, but it was too general: it implied that all fluctuations to a better state should be removed from the dataset, which is not what I actually want.

Uwe

Answer to modified question

The OP has modified the question substantially by requesting that all rows which appear after the first occurrence of State 5 (death) within a subject be considered erroneous. This includes false recoveries (as in rows 13 and 16) as well as "duplicated deaths" (as in rows 17 and 18).

An answer to this requires a completely different approach. One possibility is to use a non-equi join:

library(data.table)
# flag every row whose Visit is later than the first State 5 visit of the same ID
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]

    ID Visit Time State error
 1:  A     1 10.0     1    NA
 2:  A     2 12.5     3    NA
 3:  A     3 15.0     4    NA
 4:  B     1  2.0     1    NA
 5:  B     2  3.4     2    NA
 6:  B     3  5.7     3    NA
 7:  B     2  8.0     4    NA
 8:  B     3  9.5     3    NA
 9:  C     1  1.0     2    NA
10:  C     2  5.6     2    NA
11:  C     3  8.9     3    NA
12:  C     4 10.0     5    NA
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3    NA
15:  D     2  3.4     5    NA
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5  TRUE
18:  D     5 10.5     5  TRUE

The number of the first visit with State 5 is returned by

x[, first(Visit[State == 5]), by = ID]

   ID V1
1:  C  4
2:  D  2

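In the subsequent non-equi join, only those rows are marked whose Visit is greater than V1 for the same ID, i.e. the rows which appear after the first State 5 event. Subjects A and B never reach State 5, so they are missing from the lookup table and none of their rows are marked.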
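The question also asks for the erroneous rows to be removed, not only marked. A minimal sketch of that last step, assuming the `error` column is created as in the join above; the names `first_death` and `cleaned` are illustrative and not part of the original answer:

```r
library(data.table)

x <- data.table(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# lookup table: first Visit with State 5 per ID (IDs without State 5 drop out)
first_death <- x[, first(Visit[State == 5]), by = ID]

# non-equi update join as in the answer: flag rows after the first death
x[first_death, on = .(ID, Visit > V1), error := TRUE]

# keep only the unflagged rows and drop the helper column from the result
cleaned <- x[is.na(error)][, error := NULL]
```

Rows that were never flagged carry `NA` in `error`, which is why the filter uses `is.na(error)` rather than `!error`. The original `x` keeps its `error` column, so the flagged rows can still be inspected.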

Data

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
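
For comparison, the same rows can probably be found without data.table by counting State 5 events cumulatively within each ID. This is a base R sketch and not part of the original answer. It assumes the rows are already sorted in visit order within each ID, because it relies on row order rather than on the Visit number; `after_death` and `cleaned` are illustrative names.

```r
x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# Within each ID, cumsum(died) is >= 1 from the first death onwards.
# A second cumsum exceeds 1 only on rows strictly after that first death.
after_death <- ave(x$State == 5, x$ID,
                   FUN = function(died) cumsum(cumsum(died)) > 1)

# drop every row recorded after a subject's first State 5
cleaned <- x[!after_death, ]
```

For this data, `which(after_death)` gives rows 13, 16, 17 and 18, which matches the rows flagged by the non-equi join.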

Edited 2021-06-2
