Match one to many columns in Pandas dataframe

gogasca

I have 2 datasets in CSV file, using pandas each file is converted into 2 different dataframes.

I want to find similar companies based on their url. I'm able to find similar companies based on 1 field (Rule1), but I want to compare more efficiently as following:

Dataset 1

uuid, company_name, website
YAHOO,Yahoo,yahoo.com    
CSCO,Cisco,cisco.com
APPL,Apple,

Dataset 2

company_name, company_website, support_website, privacy_website
Yahoo,,yahoo.com,yahoo.com
Google,google.com,,
Cisco,,,cisco.com

Result Dataset

company_name, company_website, support_website, privacy_website, uuid
Yahoo,,yahoo.com,yahoo.com,YAHOO
Google,google.com,,
Cisco,,,cisco.com,CSCO
  • Dataset1 contains ~50K records.
  • Dataset2 contains ~4M records.

Rules

  1. If field website in dataset 1 is the same as field company_website in dataset 2, extract identifier.

  2. If not match, check if field website in dataset 1 is the same as field support_website in dataset 2, extract identifier.

  3. If not match, check if field website in dataset 1 is the same as field privacy_website in dataset 2, extract identifier.

  4. If not match, check if field company_name in dataset 1 is the same as field company_name in dataset 2, extract identifier.

  5. If not matches return record and identifier field (UUID) will be empty.

Here is my current function:

def MatchCompanies(
    companies: pandas.Dataframe,
    competitor_companies: pandas.Dataframe) -> Optional[Sequence[str]]:
  """Find Competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with Competitor information from CSV file.

  Returns:
    A sequence of matched companies and their UUID.

  Raises:
    ValueError: No companies found.
  """

  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields. Use extra space to avoid matching on empty TLD.
  companies.fillna({'website': ' '}, inplace=True)
  competitor_companies = competitor_companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  # Rename column to TLD to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Rename column to TLD since Competitor already contains TLD in company_website.
  competitor_companies.rename(columns={'company_website': 'tld'}, inplace=True)
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')
  # Extracts UUID for company name matches.
  competitor_companies = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
      merge_tld['uuid'])
  match_companies = len(
      competitor_companies[competitor_companies['uuid'].notnull()])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.rename(columns={'tld': 'company_website'}, inplace=True)
  return competitor_companies

Looking for advise in which function to use?

jezrael

Use map by Series with combine_first, but one requrement is necessary - always unique values in df1['website'] and df1['company_name']:

df1 = df1.dropna()
s1 = df1.set_index('website')['uuid']
s2 = df1.set_index('company_name')['uuid']

w1 = df2['company_website'].map(s1)
w2 = df2['support_website'].map(s1)
w3 = df2['privacy_website'].map(s1)
c = df2['company_name'].map(s2)

df2['uuid'] = w1.combine_first(w2).combine_first(w3).combine_first(c)
print (df2)
  company_name company_website support_website privacy_website   uuid
0        Yahoo             NaN       yahoo.com       yahoo.com  YAHOO
1       Google      google.com             NaN             NaN    NaN
2        Cisco             NaN             NaN       cisco.com   CSCO

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

moving from one column in a dataframe in pandas to many or many columns to one

From Dev

Match rows in one Pandas dataframe to another based on three columns

From Dev

Merging pandas columns (many-to-one)

From Dev

Identify where pandas dataframe columns match string

From Dev

Checking the type of relationship between columns in python/pandas? (one-to-one, one-to-many, or many-to-many)

From Dev

Replacing Columns from one dataframe with columns from another dataframe in pandas

From Dev

Replacing Columns from one dataframe with columns from another dataframe in pandas

From Dev

Statistics of one hot encoded columns in pandas dataframe

From Dev

Define two columns with one map in Pandas DataFrame

From Dev

Pandas, DataFrame: Splitting one column into multiple columns

From Dev

Statistics of one hot encoded columns in pandas dataframe

From Dev

Pandas, DataFrame: Splitting one column into multiple columns

From Dev

Dropping a number of columns in a pandas DataFrame on one line

From Dev

Adding Columns to pandas dataframe & iterating through one of the columns

From Dev

How to sum up the columns of a pandas dataframe according to the elements in one of the columns

From Dev

How to change the column order in a pandas dataframe when there are too many columns?

From Dev

How to split a pandas dataframe into many columns after groupby

From Dev

String Formatting using many pandas columns to create a new one

From Dev

Pandas: join multiple columns of one row to many rows (1:n)

From Java

How to create a pandas dataframe with 2 dataframes one as columns and one as rows

From Dev

Only allow one to one mapping between two columns in pandas dataframe

From Dev

Split one column to two columns depending one the content in pandas dataframe

From Dev

Pandas: match on one set of columns and populate with third value

From Dev

Pandas: Approximate join on one column, exact match on other columns

From Java

Replace or map values in a pandas dataframe column with a one to many dictionary

From Dev

Grouping by many columns in Pandas

From Dev

Efficiently select rows that match one of several values in Pandas DataFrame

From Java

pivot only two columns in pandas dataframe and add one of them

From Java

pivot one column in pandas dataframe and create 4 new columns

Related Related

  1. 1

    moving from one column in a dataframe in pandas to many or many columns to one

  2. 2

    Match rows in one Pandas dataframe to another based on three columns

  3. 3

    Merging pandas columns (many-to-one)

  4. 4

    Identify where pandas dataframe columns match string

  5. 5

    Checking the type of relationship between columns in python/pandas? (one-to-one, one-to-many, or many-to-many)

  6. 6

    Replacing Columns from one dataframe with columns from another dataframe in pandas

  7. 7

    Replacing Columns from one dataframe with columns from another dataframe in pandas

  8. 8

    Statistics of one hot encoded columns in pandas dataframe

  9. 9

    Define two columns with one map in Pandas DataFrame

  10. 10

    Pandas, DataFrame: Splitting one column into multiple columns

  11. 11

    Statistics of one hot encoded columns in pandas dataframe

  12. 12

    Pandas, DataFrame: Splitting one column into multiple columns

  13. 13

    Dropping a number of columns in a pandas DataFrame on one line

  14. 14

    Adding Columns to pandas dataframe & iterating through one of the columns

  15. 15

    How to sum up the columns of a pandas dataframe according to the elements in one of the columns

  16. 16

    How to change the column order in a pandas dataframe when there are too many columns?

  17. 17

    How to split a pandas dataframe into many columns after groupby

  18. 18

    String Formatting using many pandas columns to create a new one

  19. 19

    Pandas: join multiple columns of one row to many rows (1:n)

  20. 20

    How to create a pandas dataframe with 2 dataframes one as columns and one as rows

  21. 21

    Only allow one to one mapping between two columns in pandas dataframe

  22. 22

    Split one column to two columns depending one the content in pandas dataframe

  23. 23

    Pandas: match on one set of columns and populate with third value

  24. 24

    Pandas: Approximate join on one column, exact match on other columns

  25. 25

    Replace or map values in a pandas dataframe column with a one to many dictionary

  26. 26

    Grouping by many columns in Pandas

  27. 27

    Efficiently select rows that match one of several values in Pandas DataFrame

  28. 28

    pivot only two columns in pandas dataframe and add one of them

  29. 29

    pivot one column in pandas dataframe and create 4 new columns

HotTag

Archive