I want to include the table "tech_distance" into my "first occurrences" table. The two data tables:
head(first_occurrences)
# A tibble: 6 x 4
# Groups: Main, Second [6]
year Main Second occurrence
<int> <chr> <chr> <int>
1 1991 C09D C08F 1
2 2002 A47C A47D 1
3 2002 G10K H05K 1
4 2004 G06G C07K 1
5 2015 B64F B64D 1
6 2015 H02G B29C 1
head(tech_distance)
# A tibble: 6 x 2
Main tech_distance
<fctr> <dbl>
1 C09D 0.3
2 A47C 0.0
3 G10K 0.5
4 G06G 0.5
5 B64F 0.0
6 H02G 0.5
This is the result I want to get:
head(first_occurrences)
Main year Second occurrence tech_distance
1 A01B 2004 E21B 1 0.7
2 A01B 2004 E21B 1 0.5
3 A01B 2004 E21B 1 0.7
4 A01B 2004 E21B 1 0.5
5 A01B 2004 E21B 1 0.5
6 A01B 2004 E21B 1 1.0
I have used mutate in dplyr:
first_occurrences <- data %>%
select(year = X3,Main = X7,Second = X8) %>%
group_by(Main,Second) %>%
mutate(occurrence = n(), tech_distance) %>%
filter(occurrence >= 0, occurrence <= 1, !(Main == Second))
But I get this error:
Error in mutate_impl(.data, dots) :
Column `tech_distance` must be length 24 (the group size) or one, not 2
So I tried using merge():
first_occurrences <- merge(first_occurrences, tech_distance, by.x = "Main", by.y = "Main", all.x=T)
This seems to work but I get an enormous number of rows (240,217 entries)
str(first_occurrences)
'data.frame': 240217 obs. of 5 variables:
$ Main : chr "A01B" "A01B" "A01B" "A01B" ...
$ year : int 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
$ Second : chr "E21B" "E21B" "E21B" "E21B" ...
$ occurrence : int 1 1 1 1 1 1 1 1 1 1 ...
$ tech_distance: num 0.7 0.5 0.7 0.5 0.5 1 0.5 0.7 0.3 0 ...
while the datasets before were:
str(first_occurrences)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 8015 obs. of 4 variables:
$ year : int 1991 2002 2002 2004 2015 2015 2015 2015 2015 2015 ...
$ Main : chr "C09D" "A47C" "G10K" "G06G" ...
$ Second : chr "C08F" "A47D" "H05K" "C07K" ...
$ occurrence: int 1 1 1 1 1 1 1 1 1 1 ...
str(tech_distance)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 8015 obs. of 2 variables:
$ Main : Factor w/ 815 levels "A01B","A01C",..: 345 62 684 651 265 749 328 735 173 788 ...
$ tech_distance: num 0.3 0 0.5 0.5 0 0.5 0.5 0 0.5 0.5 ...
Does anybody have any idea on how to merge two dataframes keeping the same number of rows?
Based on the above comment;
If tech_distance varies by multiple things, such as main and second, I would actually create a new column then use that to do the left_join
.
first_occurrences <- mutate(first_occurrences, ID = paste0(main, "_", second, "_", year)
tech_distance <- mutate(tech_distance, ID = paste0(main, "_", second, "_", year)
combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "ID")
As for reordering columns, you can simply use select(#order of columns separated by names, -ID)
For others who may be reading this:
Assuming the tech_distance is specific per main, and not anything else, I would use:
combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "main")
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments