pandas, dataframe, groupby, std

LetMeSOThat4U

I'm new to pandas. I have a (trivial) problem: a table of hosts, operations, and execution times. I want to calculate the standard deviation of execution time per host, and then per host+operation pair. Seems simple?

It works for grouping by a single column:

df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial    132564  non-null values
host           132564  non-null values
idnum          132564  non-null values
operation      132564  non-null values
time           132564  non-null values
...
dtypes: float32(1), int64(2), object(6)



byhost = df.groupby('host')


byhost.std()
Out[362]:
                 datespecial         idnum      time
host
ahost1.test  11946.961952  40367.033852  0.003699
host1.test   15484.975077  38206.578115  0.008800
host10.test           NaN  37644.137631  0.018001
...

Nice. Now:

byhostandop = df.groupby(['host', 'operation'])

byhostandop.std()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
    386         # todo, implement at cython level?
    387         if ddof == 1:
--> 388             return self._cython_agg_general('std')
    389         else:
    390             f = lambda x: x.std(ddof=ddof)

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
   1615
   1616     def _cython_agg_general(self, how, numeric_only=True):
-> 1617         new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
   1618         return self._wrap_agged_blocks(new_blocks)
   1619

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
   1653                 values = com.ensure_float(values)
   1654
-> 1655             result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
   1656
   1657             # see if we can cast the block back to the original dtype

/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
    838                 if is_numeric:
    839                     result = lib.row_bool_subset(result,
--> 840                                                  (counts > 0).view(np.uint8))
    841                 else:
    842                     result = lib.row_bool_subset_object(result,

/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()

ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'

Huh?? Why do I get this exception?

More questions:

  • how do I calculate std deviation on dataframe.groupby([several columns])?

  • how can I limit calculation to a selected column? E.g. it obviously doesn't make sense to calculate std dev on dates/timestamps here.

Roman Pekar

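It's important to know which versions of pandas and Python you are using. This exception looks like it can arise in pandas versions < 0.10 (see the question "ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'"). To avoid it, you can cast your float columns to float64. Note that astype() returns a new object, so assign the result back, and cast only the numeric column, since calling astype('float64') on the whole frame would fail on the string columns:

# 'time' is the only float32 column in the frame above
df['time'] = df['time'].astype('float64')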
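If you would rather not name the columns by hand, here is a minimal sketch that upcasts every float32 column and leaves the rest untouched. The small frame is a made-up stand-in for the one in the question, and select_dtypes() may not be available in very old pandas releases:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: string columns plus one float32 column.
df = pd.DataFrame({
    "host": ["h1", "h1", "h2", "h2"],
    "operation": ["read", "write", "read", "write"],
    "time": np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
})

# Collect the names of all float32 columns.
float32_cols = df.select_dtypes(include=["float32"]).columns

# astype() accepts a {column: dtype} mapping and returns a new frame,
# so only the listed columns are converted.
df = df.astype({col: "float64" for col in float32_cols})

print(df.dtypes)
```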

To calculate std() on selected columns, just select columns :)

>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> df
   a   b  c  g
0  0  10  a  1
1  1  11  b  1
2  2  12  c  1
3  3  13  d  2
4  4  14  e  2
5  5  15  f  2
6  6  16  g  3
7  7  17  h  3
8  8  18  i  3
9  9  19  j  3
>>> df.groupby('g')[['a', 'b']].std()
          a         b
g                    
1  1.000000  1.000000
2  1.000000  1.000000
3  1.290994  1.290994
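The same column selection works when grouping by several columns. A small sketch with made-up data shaped like the question's (the host, operation and time values here are invented for illustration):

```python
import pandas as pd

# Hypothetical data: two hosts, two operations, float64 execution times.
df = pd.DataFrame({
    "host": ["h1", "h1", "h1", "h1", "h2", "h2"],
    "operation": ["read", "read", "write", "write", "read", "read"],
    "time": [1.0, 3.0, 2.0, 2.0, 5.0, 9.0],
})

# Standard deviation of 'time' only, per host.
per_host = df.groupby("host")["time"].std()

# Standard deviation of 'time' only, per (host, operation) pair.
# The result is a Series indexed by a two-level MultiIndex.
per_pair = df.groupby(["host", "operation"])["time"].std()

print(per_host)
print(per_pair)
```

Selecting ['time'] before calling std() also keeps date-like and id columns out of the calculation.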

Update

As far as I can tell, std() calls aggregate() on the groupby result, which runs into a subtle bug (see "Python Pandas: Using Aggregate vs Apply to define new columns"). To avoid this, you can use apply():

byhostandop['time'].apply(lambda x: x.std())
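On a pandas version where std() works directly, the apply() workaround should give the same numbers, so it is a safe fallback. A quick sketch to check this on made-up data (the frame below is hypothetical, not the one from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data with at least two rows per (host, operation) group,
# so that no group produces NaN.
df = pd.DataFrame({
    "host": ["h1", "h1", "h1", "h1", "h2", "h2"],
    "operation": ["read", "read", "write", "write", "read", "read"],
    "time": [0.10, 0.30, 0.25, 0.35, 1.00, 2.00],
})

byhostandop = df.groupby(["host", "operation"])

# Workaround: call Series.std() on each group's 'time' values.
via_apply = byhostandop["time"].apply(lambda x: x.std())

# Built-in groupby aggregation.
via_std = byhostandop["time"].std()

# Both use the sample standard deviation (ddof=1) by default.
print(np.allclose(via_apply.values, via_std.values))
```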
