I have the following DataFrame:
import pandas as pd, numpy as np
df = pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})
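For each row, I want to extract the midquote 5 minutes after that row's time. An exact 5-minute difference is usually not available, so the time closest to the 5-minute mark (it has to be less than 5 minutes) will do.
Desired output: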
pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285],
'timestamp':['2020-01-02 09:59:50.024000','2020-01-02 09:59:50.623000','2020-01-02 10:00:15.124000','2020-01-02 10:00:15.148000','2020-01-02 10:00:42.552000','2020-01-02 10:04:49.782000','2020-01-02 10:05:14.896000','2020-01-02 10:05:41.719000','2020-01-02 10:05:45.050000','2020-01-02 10:08:19.170000'],
'timestamp_5_min':['2020-01-02 10:04:49.782000','2020-01-02 10:04:49.782000','2020-01-02 10:05:14.896000','2020-01-02 10:05:14.896000','2020-01-02 10:05:41.719000','nan','nan','nan','nan','nan'],
'midquote_5_min':[14.360,14.305,14.235,14.295,14.285,'nan','nan','nan','nan','nan']})
I tried the following code, but it did not work:
df['timestamp'] = pd.to_datetime(df.Date + ' ' + df.Time)
df.sort_values(by=['timestamp'], inplace=True)
df['midquote'].fillna(0, inplace=True)
mid = df[['Instrument','timestamp','midquote']]
mid['5_min_timestamp'] = mid.timestamp
mid.rename(columns={'midquote':'5_min_midquote'}, inplace=True)
pd.merge_asof(df, mid, on='timestamp', by=['Instrument'], tolerance=pd.Timedelta("5min"), direction='forward', allow_exact_matches=False)
Also, if I do not call sort_values, it raises the error: left keys must be sorted
Please help.
Thanks.
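The crux of the solution is the .searchsorted method. For each value in timestamp_5_min (that is, Time + 5 minutes), it returns the index + 1 of the closest earlier date/time in the Time column, so subtracting 1 gives the row that is wanted. From there, I do some messy merging, column renaming and column dropping that could probably be cleaner, but it gets the job done.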
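To make the index arithmetic concrete, here is a minimal sketch of the searchsorted step on its own. The names times, targets and pos are illustrative and not part of the answer's code; the four timestamps are a subset of the question's data.

```python
import pandas as pd

# Sorted times (a subset of the question's data).
times = pd.Series(pd.to_datetime([
    '2020-01-02 09:59:50.024',
    '2020-01-02 10:00:15.124',
    '2020-01-02 10:04:49.782',
    '2020-01-02 10:05:14.896',
]))

# Target for each row: its own time plus 5 minutes.
targets = times + pd.Timedelta(minutes=5)

# With the default side='left', searchsorted returns the position of the
# first element >= target, so subtracting 1 gives the last element < target.
pos = times.searchsorted(targets.to_numpy()) - 1

print(list(pos))  # positions into `times`, one per row
```

Every target that lies beyond the last time maps to the final position, which is why the answer's code blanks out the rows that point at the maximum index afterwards.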
Input:
import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Instrument':['A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552','10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})
Code:
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
df['timestamp_5_min'] = pd.to_datetime(df['Time'] + pd.Timedelta(minutes=5))
df['timestamp_5_min_new'] = df['Time'].searchsorted(df['timestamp_5_min'])-1
df1 = df[['Time', 'midquote']]
df2 = pd.merge(df,df1,how='left',left_on='timestamp_5_min_new', right_index=True)
df2.loc[df2['timestamp_5_min_new'] == max(df2['timestamp_5_min_new']), ['timestamp_5_min_new', 'Time_y', 'midquote_y']] = np.nan
df2 = df2.drop(['timestamp_5_min_new','timestamp_5_min', 'midquote_x'], axis=1).rename({'Time_x': 'Time', 'Time_y': 'timestamp_5_min', 'midquote_y' : 'midquote'}, axis=1)
df2
Output:
   Instrument        Date                     Time          timestamp_5_min  midquote
0         A2M  2020-01-02  2020-07-16 09:59:50.024  2020-07-16 10:04:49.782    14.360
1         A2M  2020-01-02  2020-07-16 09:59:50.623  2020-07-16 10:04:49.782    14.360
2         A2M  2020-01-02  2020-07-16 10:00:15.124  2020-07-16 10:05:14.896    14.305
3         A2M  2020-01-02  2020-07-16 10:00:15.148  2020-07-16 10:05:14.896    14.305
4         A2M  2020-01-02  2020-07-16 10:00:42.552  2020-07-16 10:05:41.719    14.235
5         A2M  2020-01-02  2020-07-16 10:04:49.782                      NaT       NaN
6         A2M  2020-01-02  2020-07-16 10:05:14.896                      NaT       NaN
7         A2M  2020-01-02  2020-07-16 10:05:41.719                      NaT       NaN
8         A2M  2020-01-02  2020-07-16 10:05:45.050                      NaT       NaN
9         A2M  2020-01-02  2020-07-16 10:08:19.170                      NaT       NaN
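As a possible alternative to searchsorted, the merge_asof attempt from the question can probably be made to work by shifting the lookup key instead of searching forward: a forward merge picks the first later row, not the one closest to the 5-minute mark. The sketch below assumes a literal reading of the requirement (the last quote strictly earlier than timestamp + 5 minutes, and later than the row itself). The names left, right, target and self_match are illustrative. Under that reading rows 5 to 8 match the final quote at 10:08:19.170, which differs from the desired output in the question, where those rows are empty.

```python
import pandas as pd

df = pd.DataFrame({
    'Instrument': ['A2M'] * 10,
    'Date': ['2020-01-02'] * 10,
    'Time': ['09:59:50.024', '09:59:50.623', '10:00:15.124', '10:00:15.148',
             '10:00:42.552', '10:04:49.782', '10:05:14.896', '10:05:41.719',
             '10:05:45.050', '10:08:19.170'],
    'midquote': [0.0, 0.0, 14.145, 14.145, 14.245,
                 14.360, 14.305, 14.235, 14.295, 14.285],
})
df['timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
# merge_asof requires both frames to be sorted on their key columns.
df = df.sort_values('timestamp').reset_index(drop=True)

# Left side: look up at timestamp + 5 minutes (still sorted after the shift).
left = df.assign(target=df['timestamp'] + pd.Timedelta(minutes=5))
# Right side: the quotes themselves, renamed so the columns do not clash.
right = df[['Instrument', 'timestamp', 'midquote']].rename(
    columns={'timestamp': 'timestamp_5_min', 'midquote': 'midquote_5_min'})

# Backward search without exact matches: last quote strictly before target.
out = pd.merge_asof(left, right, left_on='target', right_on='timestamp_5_min',
                    by='Instrument', direction='backward',
                    allow_exact_matches=False)

# A row that only matched itself (or an earlier row) has no later quote.
self_match = out['timestamp_5_min'] <= out['timestamp']
out['timestamp_5_min'] = out['timestamp_5_min'].where(~self_match)
out['midquote_5_min'] = out['midquote_5_min'].where(~self_match)

print(out[['timestamp', 'timestamp_5_min', 'midquote_5_min']])
```

Passing by='Instrument' keeps the lookup within each instrument, so the same sketch should extend to data with several instruments as long as both frames stay sorted on their time keys.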
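Answer #2, grouping by Instrument. This makes it more challenging, but the code below gets the job done.
I have changed some of the sample data so that there are now two categories, and I changed the time difference requirement to 25 seconds.
Step #1: Use .groupby together with .transform and lambda x, then searchsorted, to find the index of the closest time value, which for this exercise is 25 seconds later (the argument passed to pd.Timedelta).
Step #2: The line that creates the dictionary a is really there to find the max index per Instrument group, which is 3 and 9 respectively.
Step #3: From there, create a temp column that holds the max index for each instrument, and find any rows where the timestamp_5_min_new column is greater than that max index. Essentially, this means there are no rows 25 seconds later within that group, so the np.where() statements turn those rows to NaN. From there, just clean up the columns and column names.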
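Steps #2 and #3 can be sketched in isolation on a small made-up frame. The pos column stands in for the positions that searchsorted would return, and the names last_rows, max_index and midquote_next are illustrative rather than taken from the answer's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Instrument': ['B3M', 'B3M', 'A2M', 'A2M', 'A2M'],
    'midquote': [0.0, 14.145, 14.245, 14.360, 14.305],
    # Hypothetical lookup positions, standing in for the searchsorted result.
    'pos': [1, 2, 3, 4, 5],
})

# Step #2: keep the last row of each instrument (index label -> instrument),
# then invert the mapping to instrument -> last index label.
last_rows = df['Instrument'].drop_duplicates(keep='last').to_dict()
max_index = {v: k for k, v in last_rows.items()}

# Step #3: a position past the group's last index points outside the group.
df['temp'] = df['Instrument'].map(max_index)
outside_group = df['pos'] > df['temp']

# Look up the midquote at each position (NaN if the position does not exist),
# then blank out lookups that landed in another instrument's rows.
looked_up = df['midquote'].reindex(df['pos']).to_numpy()
df['midquote_next'] = np.where(outside_group, np.nan, looked_up)

print(max_index)
print(df)
```

This relies on the rows of each instrument being contiguous and on the default integer index, as in the sample data.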
Input:
import pandas as pd, numpy as np, datetime as dt
df = pd.DataFrame({'Instrument':
['B3M','B3M','B3M','B3M','A2M','A2M','A2M','A2M','A2M','A2M'],
'Date':['2020-01-02','2020-01-02','2020-01-02','2020-01-02', '2020-01-02',
'2020-01-02','2020-01-02','2020-01-02','2020-01-02','2020-01-02'],
'Time':['09:59:50.024','09:59:50.623','10:00:15.124','10:00:15.148','10:00:42.552',
'10:04:49.782','10:05:14.896','10:05:41.719','10:05:45.050','10:08:19.170'],
'midquote': [0.0,0.0,14.145,14.145,14.245,14.360,14.305,14.235,14.295,14.285]})
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Code:
df['timestamp_5_min'] = pd.to_datetime(df['Time'] + pd.Timedelta(seconds=25))
df['timestamp_5_min_new'] = df.groupby(['Instrument'])['Time'].transform(lambda x: x).searchsorted(df.groupby(['Instrument'])['timestamp_5_min'].transform(lambda x: x))
a = {v: k for k, v in df['Instrument'].drop_duplicates(keep='last').to_dict().items()}
df['temp'] = df['Instrument'].map(a)
df1 = df[['Time', 'midquote']]
df2 = pd.merge(df,df1,how='left',left_on='timestamp_5_min_new', right_index=True)
df2['Time_y'] = np.where((df2['timestamp_5_min_new'] > df2['temp']), np.datetime64('NaT'), df2['timestamp_5_min'])
df2['midquote_y'] = np.where((df2['timestamp_5_min_new'] > df2['temp']), np.nan, df2['midquote_y'])
df2 = df2.drop(['timestamp_5_min_new','timestamp_5_min', 'midquote_x', 'temp'], axis=1).rename({'Time_x': 'Time', 'Time_y': 'timestamp_5_min', 'midquote_y' : 'midquote'}, axis=1)
df2
Output:
   Instrument        Date                     Time          timestamp_5_min  midquote
0         B3M  2020-01-02  2020-07-17 09:59:50.024  2020-07-17 10:00:15.024    14.145
1         B3M  2020-01-02  2020-07-17 09:59:50.623                      NaT       NaN
2         B3M  2020-01-02  2020-07-17 10:00:15.124                      NaT       NaN
3         B3M  2020-01-02  2020-07-17 10:00:15.148                      NaT       NaN
4         A2M  2020-01-02  2020-07-17 10:00:42.552  2020-07-17 10:01:07.552    14.360
5         A2M  2020-01-02  2020-07-17 10:04:49.782  2020-07-17 10:05:14.782    14.305
6         A2M  2020-01-02  2020-07-17 10:05:14.896  2020-07-17 10:05:39.896    14.235
7         A2M  2020-01-02  2020-07-17 10:05:41.719  2020-07-17 10:06:06.719    14.285
8         A2M  2020-01-02  2020-07-17 10:05:45.050  2020-07-17 10:06:10.050    14.285
9         A2M  2020-01-02  2020-07-17 10:08:19.170                      NaT       NaN