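I have a dataframe containing time-series data from a gyroscope, sampled at 20 Hz (every 50 ms). I need to use a 4-second moving window to compute the DTW distance to a 4-second reference signal.
I am using this code: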
df['Gyro_Z_DTW']=df['Gyro_Z'].rolling(window='4s',min_periods=80).apply(DTWDistanceWindowed,raw=False)
The function DTWDistanceWindowed() is as follows:
import math

def DTWDistanceWindowed(entry):
    w = 10
    s1 = entry
    s2 = reference  # global 4-second reference signal
    DTW = {}
    # the warping window must be at least as wide as the length difference
    w = max(w, abs(len(s1) - len(s2)))
    print('window = ', w)
    # initialise every cell (including the -1 border) to infinity
    for i in range(-1, len(s1)):
        for j in range(-1, len(s2)):
            DTW[(i, j)] = float('inf')
    DTW[(-1, -1)] = 0
    for i in range(len(s1)):
        for j in range(max(0, i - w), min(len(s2), i + w)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i - 1, j)], DTW[(i, j - 1)], DTW[(i - 1, j - 1)])
    return math.sqrt(DTW[len(s1) - 1, len(s2) - 1])
# adapted from http://alexminnaar.com/2014/04/16/Time-Series-Classification-and-Clustering-with-Python.html
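It works, but I could save some time if the moving window slid by 500 ms at each step instead of 50 ms.
Is there a way to do this?
If you know a better approach, I am open to suggestions other than rolling.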
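One approach is to check whether the first index (or really any index) of entry is a multiple of 500 ms, and return np.nan if it is not. The "expensive" computation then runs only once every 500 ms. So the function becomes: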
def DTWDistanceWindowed(entry):
    if bool(entry.index[0].microsecond % 500000):
        return np.nan
    w = 10
    s1 = entry
    # ... same as your function from here on
Interestingly, pd.Timestamp (the type of entry.index[0]) has a microsecond attribute but no millisecond attribute, which is why % 500000 is used.
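As a small illustration of that check, here is a sketch on a made-up index of 21 timestamps spaced 50 ms apart (the names idx, on_grid and kept are only for this example). It keeps the timestamps whose sub-second part is a multiple of 500 ms:

```python
import pandas as pd

# Timestamps 50 ms apart, like the 20 Hz gyroscope data (illustrative index)
idx = pd.date_range('2020-05-15', freq='50ms', periods=21)

# .microsecond holds the whole sub-second part (0..999999),
# so 500 ms appears as 500000 and the modulo is 0 on the 500 ms grid
on_grid = [ts.microsecond % 500000 == 0 for ts in idx]

# Only the timestamps at .000 and .500 of each second survive
kept = idx[on_grid]
print(kept)
```

Every other timestamp gives a non-zero remainder, which bool() turns into True, so the function returns np.nan for those windows.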
Edit: now, if you want to speed up the function itself, you can use a NumPy array like this:
import math
import numpy as np
import pandas as pd

# sample data
np.random.seed(6)
nb = 200
df = pd.DataFrame({'Gyro_Z': np.random.random(nb)},
                  index=pd.date_range('2020-05-15', freq='50ms', periods=nb))
reference = np.random.random(10)
# create a, the baseline result, with your function
a = df['Gyro_Z'].rolling(window='4s', min_periods=80).apply(DTWDistanceWindowed, raw=False)
Define the function with NumPy:
def DTWDistanceWindowed_np(entry):
    # skip windows whose first timestamp is not on the 500 ms grid
    if bool(entry.index[0].microsecond % 500000):
        return np.nan
    w = 10
    s1 = entry.to_numpy()
    l1 = len(s1)  # calculate the length of s1 once
    # definition of s2 and its length
    s2 = np.array(reference)
    l2 = len(s2)
    w = max(w, abs(l1 - l2))
    # create an array of inf and initialise the origin
    DTW = np.full((l1 + 1, l2 + 1), np.inf)
    DTW[0, 0] = 0
    # compute all squared differences once instead of inside the loop
    s1ms2 = (s1[:, None] - s2) ** 2
    # same loop as before; the bounds are shifted by one because of the extra border row/column
    for i in range(1, l1 + 1):
        for j in range(max(1, i - w), min(l2 + 1, i + w)):
            DTW[i, j] = s1ms2[i - 1, j - 1] + min(DTW[i - 1, j], DTW[i, j - 1], DTW[i - 1, j - 1])
    return math.sqrt(DTW[l1, l2])
# use it to create b
b = df['Gyro_Z'].rolling(window='4s', min_periods=80).apply(DTWDistanceWindowed_np, raw=False)
# compare every 10th non-NaN row of a with the non-NaN rows of b
print((b.dropna() == a.dropna()[::10]).all())
# True
Timing:
#original solution
%timeit df['Gyro_Z'].rolling(window='4s',min_periods=80).apply(DTWDistanceWindowed,raw=False)
3.31 s ± 422 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# with numpy and 1 out of 10 rows
%timeit df['Gyro_Z'].rolling(window='4s',min_periods=80).apply(DTWDistanceWindowed_np,raw=False)
41.7 ms ± 9.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
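So the if bool(... check alone already makes it almost 10 times faster, and using NumPy makes it about 9 times faster again. The speedup may depend on the size of the reference; I have not really checked.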
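Since the question also asks for alternatives to rolling, here is one possible sketch that drops rolling entirely and steps through the raw array 10 samples (500 ms) at a time. It assumes the samples are evenly spaced with no gaps, so that 80 consecutive samples always span 4 seconds. The helper names dtw_distance and strided_dtw are made up for this example, and sliding_window_view needs NumPy 1.20 or later. I have not timed it against the versions above.

```python
import math
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def dtw_distance(s1, s2, w=10):
    """Windowed DTW distance between two 1-D arrays (same recurrence as above)."""
    l1, l2 = len(s1), len(s2)
    w = max(w, abs(l1 - l2))
    dtw = np.full((l1 + 1, l2 + 1), np.inf)
    dtw[0, 0] = 0.0
    cost = (s1[:, None] - s2) ** 2  # all squared differences at once
    for i in range(1, l1 + 1):
        for j in range(max(1, i - w), min(l2 + 1, i + w)):
            dtw[i, j] = cost[i - 1, j - 1] + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return math.sqrt(dtw[l1, l2])

def strided_dtw(values, reference, window=80, step=10):
    """DTW distance for every `step`-th window of `window` samples (hypothetical helper)."""
    # each row is one window; [::step] keeps one window per 500 ms at 20 Hz
    windows = sliding_window_view(values, window)[::step]
    return np.array([dtw_distance(win, reference) for win in windows])

# illustrative random data: 200 samples and a full 4-second (80-sample) reference
rng = np.random.default_rng(6)
signal = rng.random(200)
reference = rng.random(80)
distances = strided_dtw(signal, reference)
print(distances.shape)
```

With a real dataframe the input would be df['Gyro_Z'].to_numpy(), and result k belongs to the window that ends at row k * step + window - 1, so the distances can be written back to those rows if a column aligned with the index is needed.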