The file contains 2,000,000 lines; each line has 208 comma-separated columns, like this:
0.0863314058048,0.0208767447842,0.03358010485,0.0,1.0,0.0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
The program reads this file into a numpy ndarray, so I expected it to consume about 2,000,000 * 208 * 8 B ≈ 3.2 GB of memory. However, when the program reads the file, it actually consumes about 20 GB. Why does my program use so much more memory than expected?
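For reference, a minimal sketch of how such a file is typically loaded (the question does not show the exact call; np.loadtxt and the file name are assumptions for illustration):

import numpy as np

# Hypothetical file name; np.loadtxt is one of the standard readers
data = np.loadtxt('data.csv', delimiter=',')
print(data.shape)   # (2000000, 208)
print(data.nbytes)  # 2000000 * 208 * 8 = 3,328,000,000 bytes once loaded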
I am using numpy 1.9.0, and the memory inefficiency of np.loadtxt() and np.genfromtxt() seems to be directly related to the fact that they store the data in temporary Python lists while parsing.
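A rough way to see that overhead (a minimal sketch, not from the original answer; the exact numbers are CPython- and platform-specific) is to compare one row stored as a list of Python float objects against the same row as a contiguous float64 array:

import sys
import numpy as np

# One row of 208 values as a Python list of distinct float objects
row = [float(i) for i in range(208)]
# List object (pointers) plus ~24 bytes per float object on 64-bit CPython
list_bytes = sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)

# The same row as a contiguous float64 array: exactly 208 * 8 bytes of data
arr_bytes = np.empty(208, dtype=np.float64).nbytes

print(list_bytes)  # several thousand bytes per row
print(arr_bytes)   # 1664 bytes

Scaled to 2,000,000 rows, that per-row overhead alone plausibly accounts for much of the gap between the expected ~3.2 GB and the observed ~20 GB.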
By knowing the shape of the array beforehand, you can write a file reader that consumes an amount of memory very close to the theoretical amount (3.2 GB in this case), by storing the data directly with the corresponding dtype:
import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    with open(path) as f:
        # First pass: count the rows so the array can be preallocated
        nrows = sum(1 for line in f)
        f.seek(0)
        # Peek at the first line to get the number of columns
        ncols = len(next(f).split(delimiter))
        # Allocate the output array once, with the final dtype
        out = np.empty((nrows, ncols), dtype=dtype)
        f.seek(0)
        # Second pass: parse each line directly into the preallocated array
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)
    return out
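Because the reader makes two passes over the file (one to count rows, one to fill the preallocated array), the only significant allocation is the output array itself. A hypothetical call for the file in the question (file name assumed):

data = read_large_txt('data.csv', delimiter=',', dtype=np.float64)
print(data.shape)   # (2000000, 208)
print(data.nbytes)  # 3,328,000,000 bytes, roughly the expected 3.2 GB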