Python pandas: returning the results of a groupby function to the parent table

Matthijs

[Using Python 3] I am using pandas to read a csv file, group the DataFrame, apply a function to the grouped data, and add the results back to the original DataFrame.

My input looks like this:

email                   cc  timebucket  total_value
[email protected]           us  1           110.50
[email protected]     uk  3           208.84
...                     ... ...         ...

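Basically, I want to group by cc and calculate the percentile rank of each total_value within its group. Secondly, I would like to apply a control-flow (if/else) statement to these results to assign each row a rank bucket. The results need to be added back to the original/parent DataFrame, so that it looks something like this: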

email                   cc  timebucket  total_value     percentrank rankbucket
[email protected]           us  1           110.50          48.59       mid50
[email protected]     uk  3           208.84          99.24       top25
...                     ... ...         ...             ...         ...

The code below gives me an AssertionError and I cannot figure out why. I am new to both Python and pandas, which may well explain it.

Code:

import pandas as pd
import numpy as np
from scipy.stats import rankdata

def percentilerank(frame, groupkey='cc', rankkey='total_value'):
    from pandas.compat.scipy import percentileofscore

    # Technically the below percentileofscore function should do the trick but I cannot
    # get that to work, hence the alternative below. It would be great if the answer would
    # include both so that I can understand why one works and the other doesnt.
    # func = lambda x, score: percentileofscore(x[rankkey], score, kind='mean')

    func = lambda x: (rankdata(x.total_value)-1)/(len(x.total_value)-1)*100
    frame['percentrank'] = frame.groupby(groupkey).transform(func)


def calc_and_write(filename):
    """
    Function reads the file (must be tab-separated) and stores in a pandas DataFrame.
    Next, the percentile rank score is calculated from total_value, within each country.
    Secondly, based on the percentile rank score (prs) a row is assigned to one of three buckets:
        rankbucket = 'top25' if prs > 75
        rankbucket = 'mid50' if 25 < prs < 75
        rankbucket = 'bottom25' if prs < 25
    """

    # Define headers for pandas to read in DataFrame, stored in a list
    headers = [
        'email',            # 0
        'cc',               # 1
        'last_trans_date',  # 3
        'timebucket',       # 4
        'total_value',      # 5
    ]

    # Reading csv file in chunks and creating an iterator (is supposed to be much faster than reading at once)
    tp = pd.read_csv(filename, delimiter='\t', names=headers, iterator=True, chunksize=50000)
    # Concatenating the chunks and sorting the full DataFrame by cc and total_value
    df = pd.concat(tp, ignore_index=True).sort(['cc', 'total_value'], ascending=False)

    percentilerank(df)

Edit: as requested, here is the traceback:

Traceback (most recent call last):
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 85, in <module>
    print(calc_and_write('tsv/test.tsv'))
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 74, in calc_and_write
    percentilerank(df)
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 33, in percentilerank
    frame['percentrank'] = frame.groupby(groupkey).transform(func)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1844, in transform
    axis=self.axis, verify_integrity=False)
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 894, in concat
    verify_integrity=verify_integrity)
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 964, in __init__
    self.new_axes = self._get_new_axes()
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 1124, in _get_new_axes
    assert(len(self.join_axes) == ndim - 1)
AssertionError

Jeff

Try this. Your example returns a Series from the transform function, when it should return a single value. (FYI, this also uses the pandas rank function.)

In [33]: df
Out[33]: 
                 email  cc  timebucket  total_value
0        [email protected]  us           1       110.50
1  [email protected]  uk           3       208.84
2          [email protected]  us           2        50.00

In [34]: df.groupby('cc')['total_value'].apply(lambda x: 100*x.rank()/len(x))
Out[34]: 
0    100
1    100
2     50
dtype: float64

In [35]: df['prank'] = df.groupby('cc')['total_value'].apply(lambda x: 100*x.rank()/len(x))

In [36]: df
Out[36]: 
                 email  cc  timebucket  total_value  prank
0        [email protected]  us           1       110.50    100
1  [email protected]  uk           3       208.84    100
2          [email protected]  us           2        50.00     50
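
The session above stops at the percentile rank, and the question also asks for a rankbucket column. The following is a minimal sketch of both steps, not a definitive implementation. It assumes the thresholds from the question's docstring (above 75 is top25, below 25 is bottom25, otherwise mid50) and uses placeholder e-mail values. It swaps apply for transform on the selected column, since transform returns a result aligned to the original index and can be assigned straight back to the parent DataFrame. Using pd.cut for the buckets is an illustrative choice; how a score of exactly 25 or 75 is bucketed is an assumption, because the docstring does not say.

```python
import pandas as pd

# Toy data mirroring the session above; the e-mail values are placeholders.
df = pd.DataFrame({
    "email": ["a", "b", "c"],
    "cc": ["us", "uk", "us"],
    "timebucket": [1, 3, 2],
    "total_value": [110.50, 208.84, 50.00],
})

# Percentile rank within each country, using the same formula as the
# answer: 100 * rank / group size. transform keeps the original row
# index, so the result lines up with df row by row.
df["prank"] = (
    df.groupby("cc")["total_value"]
      .transform(lambda s: 100 * s.rank() / len(s))
)

# Bucket the scores. The bin edges come from the question's docstring;
# with the default right-closed bins, exactly 25 falls in 'bottom25'
# and exactly 75 in 'mid50' (an assumption, the docstring leaves the
# boundaries open). include_lowest=True keeps a score of 0 in range.
df["rankbucket"] = pd.cut(
    df["prank"],
    bins=[0, 25, 75, 100],
    labels=["bottom25", "mid50", "top25"],
    include_lowest=True,
)

print(df)
```

Note that the question's own formula, (rank - 1) / (n - 1) * 100, evaluates to 0/0 for a group with a single row (such as uk here), so it would need a special case for such groups.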

Edited on 2021-05-28
