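Working in R 3.1.1.
I have a dataset of transaction data. Every customer has purchased at least twice (I have already subset the original data). What I want to do is label each transaction as either a "first time buyer" transaction or a "repeat buyer" transaction. The catch is that I want to define a "repeat buyer" transaction as one that falls within a specific time frame of a past transaction, so it is not as simple as labeling each customer's first transaction "first" and the rest "repeat". If a customer has not shopped for more than a year (52.25 weeks), I want him/her to be treated as a first-time buyer again!
The best way I could think of to do this is terribly inefficient (full disclosure: it is still running, so it may well be wrong to boot). I am using nested for loops... :(
Any suggestions on how to do this more efficiently? Thanks in advance for your help and suggestions! The code is commented throughout, so I will let it speak for itself, but let me know if anything is unclear!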
#let's ensure the repdata is ordered by date first
attach(repdata)
repdata <- repdata[order(date),]
detach(repdata)
#now, we loop through repdata and decide whether purchase
#is a first time or repeat buyer
#setting time frame to 1 year (52.25 weeks as we use week as units below)
timeframe = 52.25
#add new column to repdata that we will use below
repdata$rpt52wk <- ""
#for each row in repdata, do the following
for(i in seq_along(repdata$date))
{
  #assume that this is a first purchase; set rpt52wk var for [i] to "FIRST TIME BUYER"
  repdata$rpt52wk[i] = "FIRST TIME BUYER"
  #look at all previous transactions
  #we can ignore higher indexed transactions (we sorted the data, ascending by date)
  for (j in seq_along(repdata$date[1:(i-1)]))
  {
    #if a transaction is found in which the same member bought within the timeframe
    if(repdata$MEMBER_ID[i] == repdata$MEMBER_ID[j] &
       (difftime(repdata$date[i],repdata$date[j],units="weeks")<timeframe))
    {
      #then this is a repeat buyer; set rpt var for [i] appropriately
      repdata$rpt52wk[i]="REPEAT BUYER"
    }
  }
}
Adding test data that fails, at least with both of the solutions presented so far.
MEMBER_ID date
1 2011-04-13
2 2011-04-22
3 2011-04-17
3 2011-04-26
4 2011-04-13
4 2011-04-16
4 2011-04-16
5 2011-04-20
5 2011-04-13
5 2011-04-18
6 2011-04-13
7 2011-04-13
8 2011-04-25
8 2011-04-20
9 2011-04-14
10 2011-04-14
11 2011-04-18
12 2011-04-15
13 2011-04-15
14 2011-04-13
#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC",
"2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC",
"2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC",
"2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC",
"2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata
(Note that I realize there is a bug in the code for i = 1. I am ignoring it for now, to avoid adding another if statement inside the for loop.)
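For reference, the i = 1 problem comes from `1:(i-1)` evaluating to `c(1, 0)` when `i` is 1: R drops the zero index, so `repdata$date[1:(i-1)]` still has one element and the inner loop runs once, comparing the first row with itself. A minimal sketch of the difference, using base R's `seq_len()` (which is not in the original code) and a small made-up `dates` vector:

```r
# Hypothetical two-element date vector, just for illustration
dates <- as.Date(c("2011-04-13", "2011-04-14"))
i <- 1

# 1:(i - 1) is c(1, 0); indexing with it keeps element 1 and drops index 0,
# so the loop range built this way has length 1 rather than 0
range_colon <- seq_along(dates[1:(i - 1)])
print(length(range_colon))

# seq_len(0) is an empty sequence, so a for loop over it never executes
range_seq_len <- seq_len(i - 1)
print(length(range_seq_len))

iterations <- 0
for (j in seq_len(i - 1)) iterations <- iterations + 1
print(iterations)
```

With `seq_len(i - 1)` as the inner loop range, no extra if statement should be needed for the first row.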
You could try using ddply.
First, generate a dataset ordered by date, and define a time frame of 52 weeks.
#TEST SET GENERATION:
library(lubridate)
MEMBER_ID <- c(1,2,3,3,4,4,4,5,5,5,6,7,8,8,9,10,11,12,13,14)
date <- ymd(c("2011-04-13 UTC", "2011-04-22 UTC", "2011-04-17 UTC", "2011-04-26 UTC",
"2011-04-13 UTC", "2011-04-16 UTC", "2011-04-16 UTC", "2011-04-20 UTC",
"2011-04-13 UTC", "2011-04-18 UTC", "2011-04-13 UTC", "2011-04-13 UTC",
"2011-04-25 UTC", "2011-04-20 UTC", "2011-04-14 UTC", "2011-04-14 UTC",
"2011-04-18 UTC", "2011-04-15 UTC", "2011-04-15 UTC", "2011-04-13 UTC"))
rm(repdata)
repdata <- data.frame(MEMBER_ID, date)
repdata <- repdata[order(repdata$date),]
repdata
# define a timeframe of 52 weeks
timeframe <- as.difftime(52, units = "weeks")
Then adapt the following code:
library(plyr)
first.buyers <- ddply(repdata, .(MEMBER_ID),
function(x) x[c(TRUE, diff(x$date) > timeframe),])
first.buyers <- mutate(first.buyers, rpt52wk = "FIRST TIME BUYER")
final <- merge(repdata,first.buyers, all = TRUE)
final[is.na(final$rpt52wk),"rpt52wk"] <- "REPEAT BUYER"
We get the following result:
MEMBER_ID date rpt52wk
1 1 2011-04-13 FIRST TIME BUYER
2 2 2011-04-22 FIRST TIME BUYER
3 3 2011-04-17 FIRST TIME BUYER
4 3 2011-04-26 REPEAT BUYER
5 4 2011-04-13 FIRST TIME BUYER
6 4 2011-04-16 REPEAT BUYER
7 4 2011-04-16 REPEAT BUYER
8 5 2011-04-13 FIRST TIME BUYER
9 5 2011-04-18 REPEAT BUYER
10 5 2011-04-20 REPEAT BUYER
11 6 2011-04-13 FIRST TIME BUYER
12 7 2011-04-13 FIRST TIME BUYER
13 8 2011-04-20 FIRST TIME BUYER
14 8 2011-04-25 REPEAT BUYER
15 9 2011-04-14 FIRST TIME BUYER
16 10 2011-04-14 FIRST TIME BUYER
17 11 2011-04-18 FIRST TIME BUYER
18 12 2011-04-15 FIRST TIME BUYER
19 13 2011-04-15 FIRST TIME BUYER
20 14 2011-04-13 FIRST TIME BUYER
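ddply splits the data frame by MEMBER_ID and applies a function to each subset. Each subset is a data frame with a single MEMBER_ID and dates in ascending order. The first row always corresponds to a first-time buyer; for each subsequent row, you have to determine whether the time elapsed since the previous transaction is greater than your threshold (if it is, the member can be considered a first-time buyer again).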
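The same per-member comparison can also be sketched in base R without plyr. This is only an illustration of the idea, not the code used above: it relies on `ave()` to compute the gap to each member's previous purchase, the column name `gap_weeks` is made up, and the 2012-06-01 row is a hypothetical addition (not in the test data) so that the reset to "first time buyer" is visible.

```r
# Small example; the last row is hypothetical, added to show a gap over 52 weeks
repdata <- data.frame(
  MEMBER_ID = c(3, 3, 4, 4, 4),
  date = as.Date(c("2011-04-17", "2011-04-26",
                   "2011-04-13", "2011-04-16", "2012-06-01"))
)
repdata <- repdata[order(repdata$MEMBER_ID, repdata$date), ]
timeframe_weeks <- 52

# Weeks since the same member's previous purchase; NA for a member's first row.
# as.numeric() on a Date gives days, so dividing the differences by 7 gives weeks.
gap_weeks <- ave(as.numeric(repdata$date), repdata$MEMBER_ID,
                 FUN = function(d) c(NA, diff(d)) / 7)

# First purchase (NA gap) or a gap above the threshold counts as first time
repdata$rpt52wk <- ifelse(is.na(gap_weeks) | gap_weeks > timeframe_weeks,
                          "FIRST TIME BUYER", "REPEAT BUYER")
print(repdata)
```

Because the label is written straight onto `repdata`, this variant does not need the merge step, which may matter when a member has several transactions on the same date.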
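In the code above, when making the comparison diff(x$date) > timeframe, you should check that the time units are consistent (this depends on the date format).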
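One way to make the units explicit is to convert the differences to plain numbers in a known unit before comparing. The sketch below assumes the dates are of class Date (for which `diff()` reports days); the variable names and the 2012-06-01 date are made up for illustration.

```r
# Hypothetical dates for one member
d <- as.Date(c("2011-04-13", "2011-04-20", "2012-06-01"))

gaps <- diff(d)
print(units(gaps))   # unit chosen by diff() for Date input

# Convert to a plain numeric vector expressed in weeks
gaps_weeks <- as.numeric(gaps, units = "weeks")
print(gaps_weeks)

# Compare against a threshold that is also expressed in weeks
timeframe_weeks <- 52
exceeds <- gaps_weeks > timeframe_weeks
print(exceeds)
```

With date-time (POSIXct) input the unit picked by `diff()` can differ, which is why stating it explicitly in `as.numeric()` is the safer habit.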
Once you have found the first-time buyers, I think the next steps are fairly clear.