Matching two data tables (Vlookup, dplyr, match(), left_join) keeping number of rows

debugcn Published at Dev

Amleto

I want to include the table "tech_distance" into my "first occurrences" table. The two data tables:

head(first_occurrences)
# A tibble: 6 x 4
# Groups:   Main, Second [6]
   year  Main Second  occurrence
   <int> <chr>  <chr>      <int>
1  1991  C09D   C08F          1
2  2002  A47C   A47D          1
3  2002  G10K   H05K          1
4  2004  G06G   C07K          1
5  2015  B64F   B64D          1
6  2015  H02G   B29C          1


head(tech_distance)
# A tibble: 6 x 2
    Main  tech_distance
   <fctr>         <dbl>
1   C09D           0.3
2   A47C           0.0
3   G10K           0.5
4   G06G           0.5
5   B64F           0.0
6   H02G           0.5

This is the result I want to get:

head(first_occurrences)
   Main year Second occurrence tech_distance
 1 A01B 2004   E21B          1           0.7
 2 A01B 2004   E21B          1           0.5
 3 A01B 2004   E21B          1           0.7
 4 A01B 2004   E21B          1           0.5
 5 A01B 2004   E21B          1           0.5
 6 A01B 2004   E21B          1           1.0

I have used mutate in dplyr:

first_occurrences <- data %>% 
 select(year = X3,Main = X7,Second = X8) %>% 
 group_by(Main,Second) %>% 
 mutate(occurrence = n(), tech_distance) %>% 
 filter(occurrence >= 0, occurrence <= 1, !(Main == Second))

But I get this error:

 Error in mutate_impl(.data, dots) : 
 Column `tech_distance` must be length 24 (the group size) or one, not 2

So I tried using merge():

first_occurrences <- merge(first_occurrences, tech_distance, by.x = "Main", by.y = "Main", all.x=T)

This seems to work but I get an enormous number of rows (240,217 entries)

 str(first_occurrences)
 'data.frame':  240217 obs. of  5 variables:
  $ Main         : chr  "A01B" "A01B" "A01B" "A01B" ...
  $ year         : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
  $ Second       : chr  "E21B" "E21B" "E21B" "E21B" ...
  $ occurrence   : int  1 1 1 1 1 1 1 1 1 1 ...
  $ tech_distance: num  0.7 0.5 0.7 0.5 0.5 1 0.5 0.7 0.3 0 ...

while the datasets before were:

str(first_occurrences)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 8015 obs. of  4 variables:
 $ year      : int  1991 2002 2002 2004 2015 2015 2015 2015 2015 2015 ...
 $ Main      : chr  "C09D" "A47C" "G10K" "G06G" ...
 $ Second    : chr  "C08F" "A47D" "H05K" "C07K" ...
 $ occurrence: int  1 1 1 1 1 1 1 1 1 1 ...

str(tech_distance)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   8015 obs. of  2 variables:
 $ Main         : Factor w/ 815 levels "A01B","A01C",..: 345 62 684 651 265 749 328 735 173 788 ...
 $ tech_distance: num  0.3 0 0.5 0.5 0 0.5 0.5 0 0.5 0.5 ...

Does anybody have any idea on how to merge two dataframes keeping the same number of rows?

A Duv

Based on the above comment;

If tech_distance varies by multiple things, such as main and second, I would actually create a new column then use that to do the left_join.

    first_occurrences <- mutate(first_occurrences, ID = paste0(main, "_", second, "_", year)
    tech_distance <- mutate(tech_distance, ID = paste0(main, "_", second, "_", year)  
    combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "ID")

As for reordering columns, you can simply use select(#order of columns separated by names, -ID)

For others who may be reading this:

Assuming the tech_distance is specific per main, and not anything else, I would use:

combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "main")

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-08-7

Comments

0 comments

From Dev

Matching data from two tables with unequal number of rows

From Dev

Using vlookup match to check existence of number across two tables

From Dev

Matching two files and keeping blocks that contain the match

From Dev

Get matching rows between two tables and NULL otherwise on left table

From Dev

Count number of matching data in columns on the basis of primary key in two tables

From Dev

Sum data from two tables with different number of rows

From Dev

Sum data from two tables with different number of rows

From Dev

Join two tables on matching ID?

From Dev

How to create a random matching between the rows of two data.tables (or data.frames)

From Dev

Select data from two tables with LEFT JOIN

From Dev

R Match data tables using string matching

From Dev

Oracle SQL to compare data and retrieve un-matching rows between two tables in Oracle DB

From Dev

dplyr left_join() by rownames

From Java

match row names of two data frames and subset only matching rows in R

From Dev

mysql - LEFT JOIN two tables with columns that almost match

From Dev

efficient way substracting two very large data frames with different number of rows (matching XYZ)

From Dev

Computing number of bits that are set to 1 for matching rows in terms of hamming distance between two data frames

From Dev

MySQL: JOIN two tables with matching row names

From Dev

LEFT JOIN on two tables

From Dev

Join two tables using multiple rows in the join

From Dev

Join two dataframes and overwrite matching rows [R]

From Dev

join two table with non-matching rows

From Dev

How to join two tables on two columns normally, but if one of columns contain null, then result must contain rows that match another column only?

From Dev

Spring data JPA left join on two unrelated tables

From Dev

Left join on two tables with different conditions on part of table data

From Dev

LIMIT number of rows in a JOIN between MySQL tables

From Dev

Incorrect behavior with dplyr's left_join?

From Dev

left_join (dplyr) using a function

From Dev

SQL Find most rows that match between two tables

Related Related

Article