Filter a DataFrame based on matching values in a column and the min/max timestamps of the matching values

Posted on debugcn, in Dev

CoderGuru

I have a list of email addresses that I want to find matches for in an ordered dictionary that has been turned into a DataFrame.

Here is my list of email addresses:

email_list = ['[email protected]','[email protected]','[email protected]','[email protected]']

Here is my dictionary turned into a DataFrame (df2):

              sender   type                       _time
0  [email protected]  email  2020-12-09 19:45:48.013140
1  [email protected]  email  2020-13-09 19:45:48.013140
2  [email protected]  email  2020-12-09 19:45:48.013140
3  [email protected]  email  2020-14-11 19:45:48.013140

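I want to create a new DataFrame, grouped by matching sender, with a column for the sender, the number of matches (count), the first-seen date, and the last-seen date. The first-seen date is the minimum timestamp in the _time column for that sender, and the last-seen date is the maximum timestamp in the _time column for that sender.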

The sample output after the script runs would look like this:

              sender  count   type                  first_seen                   last_seen
0  [email protected]      2  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140
1  [email protected]      1  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140
2  [email protected]      1  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140
3  [email protected]      0  email                          NA                          NA

Here is my Python so far:

import pandas as pd

# Collect the list of email addresses I want to find in df2
email_list = ['[email protected]','[email protected]','[email protected]','[email protected]']

# Turn email list into a dataframe
df1 = pd.DataFrame(email_list, columns=['sender'])

# Collect the table that holds the dictionary of emails sent
email_result_dict = {'sender': ['[email protected]','[email protected]','[email protected]','[email protected]',], 'type': ['email','email','email','email'], '_time': [' 2020-12-09 19:45:48.013140','2020-13-09 19:45:48.013140','2020-12-09 19:45:48.013140','2020-14-09 19:45:48.013140']}

# Turn dictionary into dataframe
df2 = pd.DataFrame.from_dict(email_result_dict)

# Calculate stats
c = df2.loc[df2['sender'].isin(df1['sender'].values)].groupby('sender').size().reset_index()
output = df1.merge(c, on='sender', how='left').fillna(0)
output['first_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmin)]  # Get the earliest value in the '_time' column
output['last_seen'] = df2.iloc[df2.groupby('sender')['_time'].agg(pd.Series.idxmax)]  # Get the latest value in the '_time' column

# Set the columns of the new dataframe
output.columns = ['sender', 'count','first_seen', 'last_seen']

Any ideas or suggestions on how to get the expected output in a DataFrame? I have tried everything and keep getting stuck on obtaining the first_seen and last_seen values for each match whose count is greater than 0.

Mayank Porwal

Taking your input as df, you can do this with Groupby.agg:

In [1190]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1191]: res
Out[1191]: 
      sender   type                       _time                                  
                                            min                         max count
0  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140     1
1  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140     2
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140     1
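One caveat worth noting: in the sample data _time holds strings, so min and max compare them lexicographically rather than as dates. Below is a minimal sketch of parsing them first. It assumes the values are in year-day-month order (a guess based on values such as 2020-13-09), and both the format string and the sender address are hypothetical placeholders, not taken from the post.

```python
import pandas as pd

# Hypothetical placeholder data; the real addresses are not shown in the post.
df = pd.DataFrame({
    'sender': ['a@example.com', 'a@example.com'],
    'type': ['email', 'email'],
    # Assumed year-day-month order: 1 December 2020 and 2 January 2020.
    '_time': ['2020-01-12 19:45:48.013140', '2020-02-01 19:45:48.013140'],
})

# As plain strings, the December row sorts first because '01' < '02'.
string_min = df.groupby('sender')['_time'].min().iloc[0]

# Parse with an explicit format (an assumption) so min/max compare real dates.
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%d-%m %H:%M:%S.%f')
parsed_min = df.groupby('sender')['_time'].min().iloc[0]

print(string_min)
print(parsed_min)
```

If the timestamps are actually in a different layout, only the format argument needs to change; the groupby call stays the same.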

Edit: to drop the nested column level, do the following:

In [1206]: res.columns = res.columns.droplevel()

In [1207]: res
Out[1207]: 
                                            min                         max  count
0  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
1  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1
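Note that droplevel() leaves the sender and type columns with empty names. As an alternative sketch (not part of the original answer), pandas named aggregation, available since pandas 0.25, produces flat and descriptive column names in one step. The sender addresses below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical placeholder addresses; timestamps copied from the question.
df = pd.DataFrame({
    'sender': ['a@example.com', 'a@example.com', 'b@example.com', 'c@example.com'],
    'type': ['email'] * 4,
    '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
              '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140'],
})

# Each keyword names an output column: (source column, aggregation).
res = (
    df.groupby(['sender', 'type'])
      .agg(count=('_time', 'count'),
           first_seen=('_time', 'min'),
           last_seen=('_time', 'max'))
      .reset_index()
)

print(res.columns.tolist())
```

This avoids a MultiIndex on the columns altogether, so no level has to be dropped afterwards.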

Edit 2: to also use df1:

In [1246]: df = df1.merge(df, how='left')
In [1254]: df.type = df.type.fillna('email')

In [1259]: res = df.groupby(['sender', 'type']).agg(['min', 'max', 'count']).reset_index()

In [1260]: res.columns = res.columns.droplevel()

In [1261]: res
Out[1261]: 
                                            min                         max  count
0  [email protected]  email                         NaN                         NaN      0
1  [email protected]  email  2020-14-11 19:45:48.013140  2020-14-11 19:45:48.013140      1
2  [email protected]  email  2020-12-09 19:45:48.013140  2020-13-09 19:45:48.013140      2
3  [email protected]  email  2020-12-09 19:45:48.013140  2020-12-09 19:45:48.013140      1
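To get the exact layout requested in the question (sender, count, type, first_seen, last_seen, in the order of the email list, with a zero count for unmatched addresses), one possible variation is to aggregate first and then left-merge the result onto df1. This is a sketch rather than the answerer's code: the addresses are hypothetical placeholders, and filling the missing type with 'email' is an assumption carried over from the edit above.

```python
import pandas as pd

# Hypothetical placeholder addresses standing in for the hidden ones.
email_list = ['a@example.com', 'b@example.com', 'c@example.com', 'd@example.com']
df1 = pd.DataFrame(email_list, columns=['sender'])

df2 = pd.DataFrame({
    'sender': ['a@example.com', 'a@example.com', 'b@example.com', 'c@example.com'],
    'type': ['email'] * 4,
    '_time': ['2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
              '2020-12-09 19:45:48.013140', '2020-14-11 19:45:48.013140'],
})

# Per-sender statistics, with flat column names via named aggregation.
stats = (
    df2.groupby('sender')
       .agg(count=('_time', 'count'),
            type=('type', 'first'),
            first_seen=('_time', 'min'),
            last_seen=('_time', 'max'))
       .reset_index()
)

# A left merge keeps every address in df1, in its original order.
output = df1.merge(stats, on='sender', how='left')
output['count'] = output['count'].fillna(0).astype(int)  # unmatched -> 0
output['type'] = output['type'].fillna('email')          # assumed default

print(output)
```

Senders that never appear in df2 keep NaN in first_seen and last_seen, which corresponds to the NA entries in the expected output.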


Edited on 2021-04-5
