Pandas Dataframe memory read_csv


There are three screenshots below. The first two show the change in free memory caused by a single command that reads a CSV into a dataframe (pandas.read_csv).

The third one is the dataframe's .info() stating how much memory is being used by the dataframe.

The numbers don't add up.




Specifically, .info() says ~200 MB is used by the dataframe, but free memory drops by ~700 MB (and I'm looking at the middle row, as per the famous website).

That's terrible! This is repeatable. Is this a bug? Or is something not being released at the end of the pandas.read_csv method?



Create a simple frame of float and object dtypes. Create a similar frame with Categoricals as well. (The session assumes import numpy as np, import pandas as pd, and from pandas import DataFrame, Series.)

In [1]: df_object = DataFrame({'A' : np.random.randn(5), 'B' : Series(['a','foo','bar','a really long string','baz'])})

In [2]: df_cat = df_object.copy()

In [3]: df_cat['B'] = df_cat['B'].astype('category')

In [4]: df_object = pd.concat([df_object]*100000,ignore_index=True)

In [5]: df_cat = pd.concat([df_cat]*100000,ignore_index=True)

Here is what .info() shows in 0.15.1. Note the '+' after the memory figure.

In [10]: df_object.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns (total 2 columns):
A    500000 non-null float64
B    500000 non-null object
dtypes: float64(1), object(1)
memory usage: 11.4+ MB

The '+' indicates that the figure covers only the object pointers (which are 8 bytes each, like int64's), but NOT the actual string storage.
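Later pandas releases than the 0.15.1 shown here can count the string storage directly. The following is a minimal sketch, assuming a pandas version whose memory_usage accepts the deep argument. The frame is a scaled-down stand-in for df_object, and the string column is forced to object dtype so the comparison matches the case above.

```python
import numpy as np
import pandas as pd

# Scaled-down stand-in for df_object (an assumption for illustration).
# dtype=object is forced so column B holds pointers to Python strings.
df = pd.DataFrame({
    "A": np.random.randn(5),
    "B": pd.Series(["a", "foo", "bar", "a really long string", "baz"], dtype=object),
})
df = pd.concat([df] * 1000, ignore_index=True)

# Shallow: counts only the array buffers (pointer-sized slots for column B).
shallow_bytes = int(df.memory_usage(index=True, deep=False).sum())

# Deep: additionally asks each Python object in object columns for its size.
deep_bytes = int(df.memory_usage(index=True, deep=True).sum())

print("shallow: %d bytes" % shallow_bytes)
print("deep:    %d bytes" % deep_bytes)
```

In those releases, df.info(memory_usage='deep') prints the deep figure in the summary, without the '+'.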

In [6]: def as_mb(v):
   ...:     return "%.1f MB" % (v/(1024.0*1024))

Here is the memory Python actually uses to store the strings themselves. This is in addition to the usage above. IOW, the total is the frame PLUS the storage of the objects. (It IS possible that Python 3 uses less, as it may optimize this somewhat.)

In [13]: import sys

In [14]: as_mb(sum(map(sys.getsizeof,df_object['B'].values)))
Out[14]: '20.5 MB'
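To see why the object storage is so much larger than the text it holds, it helps to look at the per-string cost. This is a small sketch on just the five distinct values; the exact sizes depend on the Python version and platform, so no particular numbers are assumed here.

```python
import sys

strings = ["a", "foo", "bar", "a really long string", "baz"]

# Characters actually stored vs. what the interpreter reports per object.
payload_bytes = sum(len(s) for s in strings)
object_bytes = sum(sys.getsizeof(s) for s in strings)

for s in strings:
    # Each string object carries a fixed header on top of its characters.
    print("%-22r len=%2d getsizeof=%d" % (s, len(s), sys.getsizeof(s)))

print("payload: %d bytes, objects: %d bytes" % (payload_bytes, object_bytes))
```

For short strings the fixed per-object header dominates, which is why the getsizeof total is several times the raw character count.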

If you could represent this as variable-length strings (not currently possible, but instructive), only the characters themselves would be stored:

In [16]: as_mb(sum([ len(b) for b in df_object['B'] ]))
Out[16]: '2.9 MB'

If you convert this to a numpy fixed-length dtype (pandas will convert it back to object, so this is not currently possible in pandas):

In [17]: df_object['B'].values.astype(str).dtype
Out[17]: dtype('S20')

# note that this is marginal (i.e. in addition to the above). I have subtracted out
# the int64 pointers to avoid double counting
In [19]: as_mb(df_object['B'].values.astype(str).nbytes - 8*len(df_object['B']))
Out[19]: '5.7 MB'
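The S20 above is what a Python 2 session produces. On Python 3, astype(str) yields a fixed-width unicode dtype instead, which NumPy stores at 4 bytes per character, so the same conversion costs more. A rough sketch on the five distinct strings only, with the byte-string array built by explicit ASCII encoding (an illustrative choice, not something the session above does):

```python
import numpy as np

strings = ["a", "foo", "bar", "a really long string", "baz"]
values = np.array(strings, dtype=object)

# Fixed-length byte strings: every slot is as wide as the longest value.
as_bytes = np.array([s.encode("ascii") for s in strings])

# Fixed-length unicode strings (what astype(str) gives on Python 3):
# same width in characters, but 4 bytes per character.
as_unicode = values.astype(str)

print(as_bytes.dtype, as_bytes.nbytes)
print(as_unicode.dtype, as_unicode.nbytes)
```

Either way, every element pays for the longest string in the column, so one long outlier inflates the whole array.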

If you convert to a Categorical type, the usage drops. Note that the memory usage is a function of the number of categories; IOW, if you have completely unique values this will not help much.

In [11]: df_cat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns (total 2 columns):
A    500000 non-null float64
B    500000 non-null category
dtypes: category(1), float64(1)
memory usage: 8.1 MB
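The dependence on the number of categories can be checked with a short comparison. This is a sketch under assumptions: it relies on a pandas version that supports memory_usage(deep=True), and high_card is a hypothetical column of all-unique values made up for illustration.

```python
import pandas as pd

n = 10000
base = ["a", "foo", "bar", "a really long string", "baz"]

# Few distinct values repeated many times (like column B above).
low_card = pd.Series(base * (n // 5), dtype=object)

# Every value distinct: a hypothetical worst case for Categorical.
high_card = pd.Series(["value_%d" % i for i in range(n)], dtype=object)

def cat_ratio(s):
    """Deep size of the categorical version relative to the object version."""
    obj_bytes = s.memory_usage(index=False, deep=True)
    cat_bytes = s.astype("category").memory_usage(index=False, deep=True)
    return cat_bytes / float(obj_bytes)

low_ratio = cat_ratio(low_card)
high_ratio = cat_ratio(high_card)

print("low cardinality:  categorical is %.3f of object size" % low_ratio)
print("high cardinality: categorical is %.3f of object size" % high_ratio)
```

A Categorical stores each distinct value once plus a small integer code per row, so the saving shrinks as the number of distinct values approaches the number of rows.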

