I have a large DataFrame with two index-like columns, 'date' and 'con':
In [28]: df = pd.read_csv('~/futures_min_all.csv')
In [29]: df
Out[29]:
open close high low tvr oi vol ticker date tme con
0 2854.0 2850.0 2854.0 2850.0 5696.0 1226 2 MA1506 20140618 93000 MA
1 2854.0 2854.0 2854.0 2854.0 5698.0 1228 2 MA1506 20140618 93100 MA
2 2854.0 2854.0 2854.0 2854.0 0.0 1228 0 MA1506 20140618 93200 MA
3 2854.0 2854.0 2854.0 2854.0 0.0 1228 0 MA1506 20140618 93300 MA
4 2854.0 2851.0 2854.0 2851.0 5698.0 1228 2 MA1506 20140618 93400 MA
... ... ... ... ... ... ... ... ... ... ... ..
27277818 2359.0 2359.0 2359.0 2359.0 0.0 0 0 PM2105 20201223 145600 PM
27277819 2359.0 2359.0 2359.0 2359.0 0.0 0 0 PM2105 20201223 145700 PM
27277820 2359.0 2359.0 2359.0 2359.0 0.0 0 0 PM2105 20201223 145800 PM
27277821 2359.0 2359.0 2359.0 2359.0 0.0 0 0 PM2105 20201223 145900 PM
27277822 2359.0 2359.0 2359.0 2359.0 0.0 0 0 PM2105 20201223 150000 PM
The date range is different for each con:
In [30]: df.groupby(['con']).apply(lambda x: set(x['date']))
Out[30]:
con
A {20160512, 20160513, 20160516, 20160517, 20160...
AG {20160512, 20160513, 20160516, 20160517, 20160...
AL {20160512, 20160513, 20160516, 20160517, 20160...
AP {20181008, 20181009, 20181010, 20181011, 20181...
AU {20160512, 20160513, 20160516, 20160517, 20160...
...
WH {20160512, 20160513, 20160516, 20160517, 20160...
WR {20160512, 20160513, 20160516, 20160517, 20160...
Y {20160512, 20160513, 20160516, 20160517, 20160...
ZC {20160512, 20160513, 20160516, 20160517, 20160...
ZN {20160512, 20160513, 20160516, 20160517, 20160...
Length: 68, dtype: object
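I want to split the DataFrame into two parts:
the first 70% and the last 30%.
Let me explain:
For con == 'IC', the 'date' range is [20200101-20201230] (about 365 days),
so the first 70% for IC is roughly [20200101-20200820].
For con == 'NI', the 'date' range might be [20200801-20201230],
so in the first 70% the 'NI' rows would only keep 'date' values in [20200801-20201020].
P.S. I don't think head(0.7 * len(df)) solves this, because I want 70% of set(date) for each con, not 70% of the rows of df.
I think there must be some elegant way to split the original df into what I want. Could you help?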
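To make the per-con cutoff concrete, here is a minimal sketch on a made-up miniature of the data. The frame `mini` and the helper `cutoff_date` are hypothetical names introduced only for illustration, and the dates are invented integers in the same YYYYMMDD form as the real column.

```python
import pandas as pd

# Hypothetical miniature of the data: IC has 10 dates, NI only 4
mini = pd.DataFrame({
    "con": ["IC"] * 10 + ["NI"] * 4,
    "date": list(range(20200101, 20200111)) + list(range(20200801, 20200805)),
})

def cutoff_date(dates, frac=0.7):
    """Last date that still belongs to the first `frac` of the sorted unique dates."""
    uniq = sorted(dates.unique())
    k = int(len(uniq) * frac)  # number of dates kept, rounded down
    return uniq[k - 1] if k > 0 else None

# One cutoff per con, each computed from that con's own date range
cutoffs = mini.groupby("con")["date"].agg(cutoff_date)
```

Each con gets its own cutoff, so a con that starts trading late is split relative to its own history rather than the global calendar.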
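The difference between "70% of the rows" and "70% of the dates" shows up as soon as dates have unequal row counts. The toy frame below is invented for illustration (one con, 10 dates, with the first date holding most of the rows); it is a sketch of the problem, not of the real data.

```python
import pandas as pd

# Toy data (hypothetical): the first date has 91 rows, the other 9 dates have 1 row each
toy = pd.DataFrame({
    "con": "IC",
    "date": [20200101] * 91 + list(range(20200102, 20200111)),
})

# 70% of the rows: every selected row comes from the first date
by_rows = toy.head(int(0.7 * len(toy)))

# 70% of the unique dates: all rows belonging to the first 7 dates
dates = sorted(toy["date"].unique())
first_dates = dates[: int(0.7 * len(dates))]
by_dates = toy[toy["date"].isin(first_dates)]
```

Here `by_rows` covers a single date while `by_dates` covers seven, which is why a row-based `head` cannot stand in for a date-based split.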
Order is not guaranteed for sets, so use pd.unique instead. Here is a modified solution for the first 70% of the unique dates in each con:
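# Keep the rows whose date is among the first 70% of each con's sorted unique dates
f = lambda x: x[x['date'].isin(sorted(pd.unique(x['date']))[:int(x['date'].nunique() * .7)])]
# No reset_index here: the original index is needed for the filter below
df1 = df.groupby('con', group_keys=False).apply(f)
Finally, select all rows that are not in df1 (the remaining 30%):
df2 = df[~df.index.isin(df1.index)]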
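On a frame with tens of millions of rows, a per-group Python function can be slow. As an alternative sketch (not the answer's own method), the same split can be expressed as a boolean mask built from a dense rank of the dates within each con. The function name `split_by_date_fraction`, its parameters, and the `demo` frame are assumptions made for this example.

```python
import pandas as pd

def split_by_date_fraction(df, frac=0.7, key="con", date="date"):
    """Split df so each group's first `frac` of unique dates lands in the first part."""
    # Dense rank: 1 for the earliest date within each group, 2 for the next, ...
    rank = df.groupby(key)[date].rank(method="dense")
    # Number of distinct dates per group, broadcast back to every row
    n_dates = df.groupby(key)[date].transform("nunique")
    # A row is kept when its date rank is within the first frac of its group's dates
    mask = rank <= (n_dates * frac).astype(int)
    return df[mask], df[~mask]

# Made-up data: con A has 10 dates with 2 rows each, con B has 5 dates with 1 row each
demo = pd.DataFrame({
    "con": ["A"] * 20 + ["B"] * 5,
    "date": [d for d in range(20200101, 20200111) for _ in range(2)]
            + list(range(20200801, 20200806)),
})
first, rest = split_by_date_fraction(demo)
```

Because the mask is aligned with the original index, both parts keep their original row labels and all columns, and no second index lookup is needed to obtain the remainder.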