I have a pandas DataFrame with columns ip, mac, hostname, and os. I want to convert that DataFrame into a new DataFrame where the ip column is unique, and the other columns list all the unique values that appear with the given ip.
For example, a DataFrame as follows
ip       mac          hostname         os
8.8.8.8  00:00:00:ff  dns.google.com   linux
8.8.8.8  00:00:ff:ff  dns2.google.com  windows
8.8.8.8  00:00:ff:ff  dns2.google.com  windows
8.8.4.4  00:00:00:ff  dns.google.com   linux
8.8.4.4  00:00:ff:ff  dns2.google.com  windows
8.8.4.4  00:00:ff:ff  dns2.google.com  windows
should be converted into the following
ip       mac                       hostname                         os
8.8.8.8  00:00:00:ff, 00:00:ff:ff  dns.google.com, dns2.google.com  linux, windows
8.8.4.4  00:00:00:ff, 00:00:ff:ff  dns.google.com, dns2.google.com  linux, windows
I can accomplish the desired behavior by running
df.groupby('ip').agg(set)
but the data set is very large, and groupby is very memory intensive: a 500MB dataset ends up consuming 3-4GB of memory. Is there an alternative way of doing this that is not so memory intensive?
Code for creating the input data:
import pandas as pd

df = pd.DataFrame(
    {
        'ip': ['8.8.8.8', '8.8.8.8', '8.8.8.8', '8.8.4.4', '8.8.4.4', '8.8.4.4'],
        'mac': ['00:00:00:ff', '00:00:ff:ff', '00:00:ff:ff', '00:00:00:ff', '00:00:ff:ff', '00:00:ff:ff'],
        'hostname': ['dns.google.com', 'dns2.google.com', 'dns2.google.com', 'dns.google.com', 'dns2.google.com', 'dns2.google.com'],
        'os': ['linux', 'windows', 'windows', 'linux', 'windows', 'windows'],
    }
)
df.groupby('ip').agg(lambda x: ', '.join(sorted(set(x))))
Joining the unique elements into a single string reduces memory, since each cell then stores one string instead of a set of string objects.
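A further memory reduction, sketched below under the assumption that the large dataset contains many fully duplicated rows (as the example does): dropping exact duplicates before grouping shrinks the input the groupby has to hold, and the string-join aggregation then runs on far fewer rows. The `deduped`/`result` names are illustrative, not from the original post.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'ip': ['8.8.8.8', '8.8.8.8', '8.8.8.8', '8.8.4.4', '8.8.4.4', '8.8.4.4'],
        'mac': ['00:00:00:ff', '00:00:ff:ff', '00:00:ff:ff', '00:00:00:ff', '00:00:ff:ff', '00:00:ff:ff'],
        'hostname': ['dns.google.com', 'dns2.google.com', 'dns2.google.com', 'dns.google.com', 'dns2.google.com', 'dns2.google.com'],
        'os': ['linux', 'windows', 'windows', 'linux', 'windows', 'windows'],
    }
)

# Remove fully duplicated rows first, so the groupby sees fewer rows
# and the per-group sets stay small.
deduped = df.drop_duplicates()

# Same aggregation as before: one sorted, comma-joined string per cell.
result = deduped.groupby('ip').agg(lambda x: ', '.join(sorted(set(x))))
print(result)
```

This does not change the output, because `set(x)` already collapses duplicates within each group; it only reduces how much data the groupby carries around.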