Comparing and matching product names from different stores/suppliers

victorhooi

I’m trying to write a simple program to compare prices for products from different suppliers. Different suppliers may call the same product different things.

For example, the following three strings refer to the same product:

  • A2 Full Cream Milk Bottle 2l
  • A2 Milk Full Cream 2L
  • A2 Full Cream Milk 2L

Or the following two strings are the same product:

  • Ambi Pur Air Freshener Car Voyage 8mL. Fresh Vanilla Flower fragrance. - 1 each
  • Ambi Pur Air Freshener Voyage Primary 8ml

Furthermore, some products are not identical but are similar (for example, “Full Cream 2L Milk” may cover several similar products).

The only bits of information I have on each product are the title, and a price.

What are currently recommended techniques for matching product strings like this?

From my Googling and reading other SO threads, I found:

  • Some people recommend using Bayesian filtering techniques.
  • Some recommend doing feature extraction on all the product strings. So you might extract things like brand (e.g. “A2”), product (“Milk”) and capacity (“2L”) from each title, then compute distance vectors between products, and use something like a binary classifier to match products (SVM was mentioned). However, I’m not sure how to achieve this without a whole bunch of rules or regexes. I’m assuming there are probably smarter unsupervised learning methods for attacking this problem? Price could probably be another “feature” to include in the distance vector as well.
  • Some people recommended using neural-network approaches, however, I wasn't able to find much in terms of concrete code or examples here.
  • Others recommended using string similarity algorithms, such as Levenshtein distance, or the Jaro-Winkler distance.
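To make the string-similarity option concrete, here is a minimal sketch of Levenshtein distance applied to normalized titles. The helper names (`normalize`, `levenshtein`, `similarity`) and the choice to sort tokens before comparing are illustrative assumptions, not something taken from a particular library or from the threads mentioned above.

```python
import re


def normalize(title: str) -> str:
    """Lowercase, keep alphanumeric tokens, and sort them so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(sorted(tokens))


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 1.0 means identical after normalization."""
    na, nb = normalize(a), normalize(b)
    longest = max(len(na), len(nb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(na, nb) / longest


print(similarity("A2 Milk Full Cream 2L", "A2 Full Cream Milk 2L"))
print(similarity("A2 Full Cream Milk 2L",
                 "Ambi Pur Air Freshener Voyage Primary 8ml"))
```

Sorting the tokens makes the two reordered milk titles compare as equal, but extra words such as “Bottle” still cost edits, so a cut-off would have to be tuned on real data.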

Would you use one of the above techniques, or would you use a different technique?

Also, does anybody know of any example code, or even libraries for this sort of problem? I couldn't seem to find any.

(For example, I saw that some people were having performance problems with calculating the Jaro-Winkler distance for large data-sets. I was hoping there might be a distributed implementation of the algorithm (e.g. with Mahout), but wasn’t able to find anything concrete.)

Raff.Edward

Would you use one of the above techniques, or would you use a different technique?

If I were doing this for real, I wouldn't use much machine learning. I'm sure most big companies have a database of brand and product names, and use that to match things up fairly easily. Some data sanitization might be needed, but it's not much of an ML problem.

If you don't have that database, I'd say go simple. Convert everything to a feature vector and do a nearest-neighbor search. Use that to create a tool that helps you build the database. That is, you mark the first "A2 Whole Milk 2L" as "milk" yourself, and then check whether its nearest neighbors are milk too. Give yourself a way to quickly mark "yes" or "needs review", or some similar set of options.
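One way the feature-vector and nearest-neighbor step could look is sketched below. The answer does not say which features to use, so character trigrams with cosine similarity are an assumption here, and `trigram_vector`, `cosine` and `nearest_neighbors` are hypothetical names. The search is brute force, which should be fine for a labeling tool over a few thousand titles.

```python
import math
import re
from collections import Counter


def trigram_vector(title: str) -> Counter:
    """Bag of character trigrams over the lowercased, punctuation-free title."""
    text = " ".join(re.findall(r"[a-z0-9]+", title.lower()))
    padded = f"  {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def nearest_neighbors(query: str, titles: list, k: int = 3) -> list:
    """Return the k titles most similar to the query as (score, title) pairs."""
    q = trigram_vector(query)
    scored = [(cosine(q, trigram_vector(t)), t) for t in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


titles = [
    "A2 Full Cream Milk Bottle 2l",
    "A2 Milk Full Cream 2L",
    "Ambi Pur Air Freshener Voyage Primary 8ml",
]
# Show the candidates for manual "yes" / "needs review" marking.
for score, title in nearest_neighbors("A2 Full Cream Milk 2L", titles):
    print(f"{score:.2f}  {title}")
```

In a labeling tool, the printed neighbors would be the candidates you confirm or flag for review after labeling the first item by hand.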

For simple data like your examples, where this will work about 90% of the time, you should be able to get through the data with ease. I've used a similar approach to label several thousand documents in a day.

Once you have your own database, resolving these should be pretty straightforward. You could reuse the code that built your database to handle "unseen" data.
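A rough sketch of resolving unseen titles against such a database follows. It assumes the database is a plain dict from known title to a label you assigned, and it swaps in `difflib.SequenceMatcher` from the Python standard library as the similarity measure. The `resolve` function, the labels and the 0.8 threshold are all hypothetical and would need tuning.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, strip punctuation, and sort tokens so word order is ignored."""
    return " ".join(sorted(re.findall(r"[a-z0-9]+", title.lower())))


def resolve(title: str, database: dict, threshold: float = 0.8):
    """Return (label, score) for the best match, or (None, score) if below threshold."""
    query = normalize(title)
    best_label, best_score = None, 0.0
    for known_title, label in database.items():
        score = SequenceMatcher(None, query, normalize(known_title)).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score  # no confident match: send to manual review


# Hypothetical hand-built database: known title -> label.
database = {
    "A2 Full Cream Milk 2L": "a2-full-cream-milk-2l",
    "Ambi Pur Air Freshener Voyage Primary 8ml": "ambi-pur-voyage-8ml",
}

print(resolve("A2 Milk Full Cream 2L", database))
print(resolve("Colgate Total Toothpaste 110g", database))
```

Anything that comes back with `None` goes into the "needs review" pile, and once labeled it can be added to the database so the next lookup succeeds.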
