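I have 6,500 .mat files containing ECG data.
I want to read the data from these files and do some processing on it, but reading them takes far longer than I expected and far longer than tqdm initially estimated.
So I am wondering whether something is wrong with my code.
Here is an example of one .mat file: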
# The values in every array are shown as identical for convenience; in the real files they all differ
mat1 = scipy.io.loadmat('Train/TRAIN0001.mat')
mat1
{'I': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'II': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'III': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'V1': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'V2': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'V3': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'V4': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'V5': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'V6': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'__globals__': [],
'__header__': b'MATLAB 5.0 MAT-file Platform: nt, Created on: Mon May 6 16:56:48 2019',
'__version__': '1.0',
'aVF': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'aVL': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'aVR': array([[-0.02928, -0.02928, -0.02928, ... , 0.46848, 0.53192, 0.5856]]),
'age': array([[63]], dtype=int32),
'sex': array(['FEMALE'], dtype='<U6'),
}
Here is the code:
import pandas as pd
import scipy.io
from tqdm import tqdm


def read_mat(mat_path, index):
    mat = scipy.io.loadmat(mat_path)
    # One column per ECG lead, suffixed with the file index
    mat_df = pd.DataFrame({
        'I_' + str(index): mat['I'][0],
        'II_' + str(index): mat['II'][0],
        'III_' + str(index): mat['III'][0],
        'V1_' + str(index): mat['V1'][0],
        'V2_' + str(index): mat['V2'][0],
        'V3_' + str(index): mat['V3'][0],
        'V4_' + str(index): mat['V4'][0],
        'V5_' + str(index): mat['V5'][0],
        'V6_' + str(index): mat['V6'][0],
        'aVF_' + str(index): mat['aVF'][0],
        'aVL_' + str(index): mat['aVL'][0],
        'aVR_' + str(index): mat['aVR'][0]
    })
    age = pd.DataFrame({'age': mat['age'][0]})
    sex = pd.DataFrame({'sex': mat['sex']})
    # Encode sex as 1 (male), 0 (female) or 2 (anything else).
    # The sample file stores 'FEMALE' in upper case, so compare case-insensitively.
    sex['sex'] = sex['sex'].apply(
        lambda x: 1 if x.lower() == 'male' else (0 if x.lower() == 'female' else 2))
    return mat_df, age, sex


def read_data():
    # target.csv stores the label for each person
    tar = pd.read_csv('target.csv')
    # The ECG has 5000 samples for each person, so I want to treat every sample as a feature
    train = pd.DataFrame(columns=[i for i in range(0, 5000)])
    for i in tqdm(range(1, 6501)):
        tmp_filename = 'TRAIN' + str(i).zfill(4)
        train_tmp, age, sex = read_mat('Train/' + tmp_filename, i)
        train_tmp = train_tmp.transpose()
        train_tmp['age'] = age['age'][0]
        train_tmp['sex'] = sex['sex'][0]
        train_tmp['target'] = tar['label'][i-1]
        # add the 5000 samples from each mat file to the train DataFrame
        train = train.append(train_tmp)
        del train_tmp, age, sex
    target = pd.Series()
    target = train['target']
    return train, target, tar
Here is the time cost:
0% | 11/6500 [00:00<01:01, 105.36it/s]
0% | 19/6500 [00:00<01:08, 94.25it/s]
...
...
10% | 636/6500 [02:14<39:37, 2.47it/s]
10% | 640/6500 [02:15<39:52, 2.45it/s]
...
...
20% | 1322/6500 [09:25<1:12:56, 1.18it/s]
20% | 1328/6500 [09:30<1:13:27, 1.17it/s]
...
...
30% | 1918/6500 [20:02<1:13:53, 1.23s/it]
...
...
40% | 2586/6500 [35:52<1:44:42, 1.61s/it]
...
...
50% | 3237/6500 [2:08:11<10:58:41, 12.09s/it]
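By the time 50% of the .mat files had been read, the estimated remaining time was more than 10 hours.
I would like to know what is wrong with my code that makes it take so long.
Can anyone give me some hints about my code?
Thanks in advance.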
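Disclaimer: the proper way to check is to run your code through a profiler, which I have not done (it would require faking input data of the right length, and so on).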
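For reference, a profiler run could look roughly like the sketch below. It uses the standard-library `cProfile` and `pstats` modules; `slow_append` is a hypothetical stand-in for the loop under test, not the code from the question.

```python
import cProfile
import pstats


def slow_append(n):
    # Hypothetical stand-in for the loop being investigated.
    out = []
    for i in range(n):
        # Rebuilds the whole list on every iteration, much like
        # repeatedly appending to a DataFrame.
        out = out + [i]
    return out


profiler = cProfile.Profile()
# runcall profiles a single call and returns that call's result.
result = profiler.runcall(slow_append, 2000)

# Show the five entries with the largest cumulative time.
stats = pstats.Stats(profiler).sort_stats('cumulative')
stats.print_stats(5)
```

In the real script, the call to profile would be `read_data`, ideally on a reduced number of files so the run finishes quickly.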
Looking at the body of the for loop, the only line that could plausibly make the execution time grow is
train = train.append(train_tmp)
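The documentation specifically advises against this (presumably because of the Schlemiel the Painter situation):
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.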
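Each `append` copies every row accumulated so far, so the total work grows roughly quadratically with the number of files, which is consistent with the falling it/s in the progress output. A minimal sketch of the list-then-concatenate approach follows. The helpers `mat_to_frame`, `build_train` and `fake_mat` are names made up for this illustration, and the input is synthetic data shaped like the `loadmat` output shown in the question rather than the real files.

```python
import numpy as np
import pandas as pd

LEADS = ['I', 'II', 'III', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6',
         'aVF', 'aVL', 'aVR']


def mat_to_frame(mat, index):
    """Turn one loadmat-style dict into a 12-row frame, one row per lead."""
    rows = np.vstack([mat[lead][0] for lead in LEADS])
    frame = pd.DataFrame(rows, index=[f'{lead}_{index}' for lead in LEADS])
    frame['age'] = mat['age'][0][0]
    frame['sex'] = mat['sex'][0]
    return frame


def build_train(mats):
    """Collect the per-file frames in a list, then concatenate exactly once."""
    frames = [mat_to_frame(mat, i) for i, mat in enumerate(mats, start=1)]
    return pd.concat(frames)


def fake_mat(rng, n_samples=5000):
    # Synthetic stand-in for scipy.io.loadmat output (assumed layout).
    mat = {lead: rng.normal(size=(1, n_samples)) for lead in LEADS}
    mat['age'] = np.array([[63]], dtype=np.int32)
    mat['sex'] = np.array(['FEMALE'])
    return mat


rng = np.random.default_rng(0)
train = build_train(fake_mat(rng) for _ in range(20))
print(train.shape)  # (240, 5002)
```

With the real data, `mats` would be the dictionaries returned by `scipy.io.loadmat` for each file, and the `target` column could be assigned inside `mat_to_frame` in the same way as `age` and `sex`.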