Pandas Dataframe memory read_csv

user3659451

There are three screenshots below. The first two show free memory before and after running a single command that reads a CSV into a dataframe (pandas.read_csv).

The third one is the dataframe's .info() stating how much memory is being used by the dataframe.

The numbers don't add up.

  1. https://www.dropbox.com/s/9bda421ukwewoef/Screenshot%202014-12-08%2018.09.35.png?dl=0

  2. https://www.dropbox.com/s/bxx0wczdz7sfhcn/Screenshot%202014-12-08%2018.13.11.png?dl=0

  3. https://www.dropbox.com/s/qf20yhpn7w9fmld/Screenshot%202014-12-08%2018.13.44.png?dl=0

Specifically, the df.info() command says there are ~200 MB used by the dataframe. The difference in free memory is ~700 MB (and I'm looking at the middle row as per the famous linuxatemyram.com website).

That's terrible! This is repeatable. Is this a bug, or is something not being released at the end of the pandas.read_csv method?

THANKS.

Jeff

Create a simple frame of float and object dtypes. Create a similar frame with Categoricals as well.

# Assumes: import numpy as np; import pandas as pd; from pandas import DataFrame, Series

In [1]: df_object = DataFrame({'A' : np.random.randn(5), 'B' : Series(['a','foo','bar','a really long string','baz'])})

In [2]: df_cat = df_object.copy()

In [3]: df_cat['B'] = df_cat['B'].astype('category')

In [4]: df_object = pd.concat([df_object]*100000,ignore_index=True)

In [5]: df_cat = pd.concat([df_cat]*100000,ignore_index=True)

Here is what .info() shows in pandas 0.15.1. Note the '+' after the memory usage figure:

In [10]: df_object.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns (total 2 columns):
A    500000 non-null float64
B    500000 non-null object
dtypes: float64(1), object(1)
memory usage: 11.4+ MB

The '+' means the reported figure covers only the object pointers (8 bytes each, the same size as an int64), NOT the actual string storage.
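Releases of pandas later than the 0.15.1 shown here expose this distinction directly through memory_usage(deep=True) (and df.info(memory_usage='deep')). The following is a minimal sketch, assuming a reasonably recent pandas; the small frame and the variable names are illustrative and are not part of the session above.

```python
import numpy as np
import pandas as pd

strings = ['a', 'foo', 'bar', 'a really long string', 'baz']
df = pd.DataFrame({
    'A': np.random.randn(len(strings) * 1000),
    # dtype=object is forced so the column holds plain Python str objects
    'B': pd.Series(strings * 1000, dtype=object),
})

# Shallow: for column B this counts only the array of object pointers.
shallow = df.memory_usage(deep=False)
# Deep: additionally asks every Python object for its size (sys.getsizeof).
deep = df.memory_usage(deep=True)

pointer_size = np.dtype(object).itemsize  # 8 on a 64-bit build
print("B shallow:", shallow['B'], "bytes; B deep:", deep['B'], "bytes")
```

The shallow figure for B should equal the pointer size times the row count, which is the quantity that the '+' in the .info() output flags as a lower bound.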

In [6]: def as_mb(v):
   ...:     return "%.1f MB" % (v/(1024.0*1024))
   ...: 

Here is the memory that Python actually uses for the string objects themselves. This is in addition to the figure above; in other words, the total is the frame PLUS the storage of the objects. (It IS possible that Python 3 uses less, as it may optimize this somewhat.)

In [13]: import sys

In [14]: as_mb(sum(map(sys.getsizeof,df_object['B'].values)))
Out[14]: '20.5 MB'

If you could represent this as variable-length strings (not currently possible, but instructive), only the characters themselves would need storing:

In [16]: as_mb(sum([ len(b) for b in df_object['B'] ]))
Out[16]: '2.9 MB'
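The gap between the two figures above is per-object overhead: each Python string carries a header on top of its characters. A minimal sketch using only the standard library and the five sample strings from the frame; the exact header size depends on the Python version and build, so no specific total is claimed here.

```python
import sys

strings = ['a', 'foo', 'bar', 'a really long string', 'baz']

# Total size of the Python string objects, object header included.
object_bytes = sum(sys.getsizeof(s) for s in strings)
# Bytes needed for the characters alone (these strings are pure ASCII).
payload_bytes = sum(len(s.encode('ascii')) for s in strings)

# Average header cost paid per string, on top of its characters.
per_string_overhead = (object_bytes - payload_bytes) / len(strings)
print(object_bytes, payload_bytes, per_string_overhead)
```

For short strings like these, the header dominates, which is why the object total is several times the raw character count.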

If you convert this to a numpy fixed-length string dtype (pandas will convert it back to object, so this is not currently possible inside pandas):

In [17]: df_object['B'].values.astype(str).dtype
Out[17]: dtype('S20')

# note that this is marginal (e.g. in addition to the above). I have subtracted out
# the int64 pointers to avoid double counting
In [19]: as_mb(df_object['B'].values.astype(str).nbytes - 8*len(df_object['B']))
Out[19]: '5.7 MB'
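The fixed-width arithmetic can be checked on the five sample strings alone. This is a sketch under one assumption worth stating: on Python 3, astype(str) yields a Unicode dtype ('<U20', 4 bytes per character) rather than the 'S20' shown above, so the byte-string dtype 'S' is requested explicitly here.

```python
import numpy as np

strings = ['a', 'foo', 'bar', 'a really long string', 'baz']

# Fixed-width byte strings: every element is padded to the longest one (20).
fixed = np.array(strings, dtype='S')

# Size of the pointers an object array of the same length would need.
pointer_bytes = np.dtype(object).itemsize * len(strings)

# Marginal cost over the pointer array, as in the calculation above.
marginal_bytes = fixed.nbytes - pointer_bytes
print(fixed.dtype, fixed.nbytes, marginal_bytes)
```

Every element costs 20 bytes regardless of its own length, so a single long value inflates the whole column.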

If you convert to a Categorical type, the memory usage is a function of the number of categories. In other words, if the values are completely unique, this will not help much.

In [11]: df_cat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 499999
Data columns (total 2 columns):
A    500000 non-null float64
B    500000 non-null category
dtypes: category(1), float64(1)
memory usage: 8.1 MB
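To illustrate the dependence on the number of categories, here is a minimal sketch comparing a column with five repeated values against a column of entirely distinct values. It assumes a pandas version that supports memory_usage(deep=True); the helper deep_bytes and the generated 'value_N' strings are hypothetical and only for illustration.

```python
import pandas as pd

n = 10000
labels = ['a', 'foo', 'bar', 'a really long string', 'baz']

# Few distinct values repeated many times: the case categoricals suit.
repeated = pd.Series(labels * (n // len(labels)), dtype=object)
# Every value distinct: the category table is as large as the data itself.
unique = pd.Series(['value_%d' % i for i in range(n)], dtype=object)

def deep_bytes(s):
    """Bytes used by the values of s (index excluded), objects included."""
    return s.memory_usage(index=False, deep=True)

repeated_cat = repeated.astype('category')
unique_cat = unique.astype('category')

print("repeated:", deep_bytes(repeated), "->", deep_bytes(repeated_cat))
print("unique:  ", deep_bytes(unique), "->", deep_bytes(unique_cat))
```

With only five categories the codes fit in one byte per row, so the repeated column shrinks sharply, while the all-unique column still has to store every string once in its category table.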
