I have a pandas DataFrame with columns ip, mac, hostname, and os. I want to convert that DataFrame into a new DataFrame where the ip column is unique, and the other columns list all the unique values that appear with the given ip.
For example, a DataFrame as follows
ip       mac          hostname         os
8.8.8.8  00:00:00:ff  dns.google.com   linux
8.8.8.8  00:00:ff:ff  dns2.google.com  windows
8.8.8.8  00:00:ff:ff  dns2.google.com  windows
8.8.4.4  00:00:00:ff  dns.google.com   linux
8.8.4.4  00:00:ff:ff  dns2.google.com  windows
8.8.4.4  00:00:ff:ff  dns2.google.com  windows
should be converted into the following
ip       mac                       hostname                         os
8.8.8.8  00:00:00:ff, 00:00:ff:ff  dns.google.com, dns2.google.com  linux, windows
8.8.4.4  00:00:00:ff, 00:00:ff:ff  dns.google.com, dns2.google.com  linux, windows
I can accomplish the desired behavior by running
df.groupby('ip').agg(set)
but the data set is very large, and groupby is very memory intensive: a 500MB dataset ends up consuming 3-4GB of memory. Is there an alternative way of doing this that is not so memory intensive?
Code for creating the input data:
import pandas as pd

df = pd.DataFrame(
    {
        'ip': ['8.8.8.8', '8.8.8.8', '8.8.8.8', '8.8.4.4', '8.8.4.4', '8.8.4.4'],
        'mac': ['00:00:00:ff', '00:00:ff:ff', '00:00:ff:ff', '00:00:00:ff', '00:00:ff:ff', '00:00:ff:ff'],
        'hostname': ['dns.google.com', 'dns2.google.com', 'dns2.google.com', 'dns.google.com', 'dns2.google.com', 'dns2.google.com'],
        'os': ['linux', 'windows', 'windows', 'linux', 'windows', 'windows'],
    }
)
df.groupby('ip').agg(lambda x: ', '.join(sorted(set(x))))
Joining the unique elements into a single string reduces memory, since each cell then stores one string instead of a set of string objects.
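A further memory reduction, sketched below under the assumption that the large dataset contains many fully duplicated rows (as the example does): dropping exact duplicates before grouping shrinks the input the groupby has to hold, and the string-join aggregation then runs on far fewer rows. The `deduped`/`result` names are illustrative, not from the original post.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'ip': ['8.8.8.8', '8.8.8.8', '8.8.8.8', '8.8.4.4', '8.8.4.4', '8.8.4.4'],
        'mac': ['00:00:00:ff', '00:00:ff:ff', '00:00:ff:ff', '00:00:00:ff', '00:00:ff:ff', '00:00:ff:ff'],
        'hostname': ['dns.google.com', 'dns2.google.com', 'dns2.google.com', 'dns.google.com', 'dns2.google.com', 'dns2.google.com'],
        'os': ['linux', 'windows', 'windows', 'linux', 'windows', 'windows'],
    }
)

# Remove fully duplicated rows first, so the groupby sees fewer rows
# and the per-group sets stay small.
deduped = df.drop_duplicates()

# Same aggregation as before: one sorted, comma-joined string per cell.
result = deduped.groupby('ip').agg(lambda x: ', '.join(sorted(set(x))))
print(result)
```

This does not change the output, because `set(x)` already collapses duplicates within each group; it only reduces how much data the groupby carries around.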