fuzzyjoin with dates in R
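I am working on a project in which I analyze individual-level survey data within countries based on the outcomes of those countries' sports matches, but I am not sure what the most efficient way to produce the merge I want is.
I am working with two separate data sets. One contains individual-level data nested within countries. The data might look like this: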
country <- c(rep("Country A", 4), rep("Country B", 6))
date <- c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", rep("2000-01-01", 2), "2000-01-02", rep("2000-01-03", 3))
outcome <- rnorm(10)
individual_data <- cbind.data.frame(country, date, outcome)
rm(country, date, outcome)
The other has country-match level data and looks like this:
date <- rep("2000-01-02", 2)
country <- c("Country A", "Country B")
opponent <- c("Country B", "Country A")
match_outcome <- c("L", "W")
match_data <- cbind.data.frame(date, country, opponent, match_outcome)
rm(date, country, opponent, match_outcome)
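In this example, only one match took place, on January 2, 2000, and Country A lost to Country B. I would like to perform a fuzzy_join in place of the left_join here, so that match_data is matched to individual_data even when the dates are not exactly equal.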
# incorrect
merged <- left_join(individual_data, match_data)
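To make the problem with the exact join concrete, here is a minimal sketch. It uses base R's merge() with all.x = TRUE as a stand-in for left_join() (an assumption made only to avoid extra packages; the behaviour on this example should be analogous). Only survey rows dated exactly 2000-01-02 pick up match information, and every other row is left with NA.

```r
# Rebuild the two example data sets (outcome values are random).
individual_data <- data.frame(
  country = c(rep("Country A", 4), rep("Country B", 6)),
  date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04",
           rep("2000-01-01", 2), "2000-01-02", rep("2000-01-03", 3)),
  outcome = rnorm(10)
)
match_data <- data.frame(
  date = rep("2000-01-02", 2),
  country = c("Country A", "Country B"),
  opponent = c("Country B", "Country A"),
  match_outcome = c("L", "W")
)

# Exact left join on the shared columns (country and date).
exact <- merge(individual_data, match_data,
               by = c("country", "date"), all.x = TRUE)

# Count the rows that actually received match information.
n_matched <- sum(!is.na(exact$opponent))
print(n_matched)
```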
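I would like to do this within a 3-day range, and I would like an indicator for the number of days before or after the match within that range. The final product would look something like this: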
country <- c(rep("Country A", 4), rep("Country B", 6))
date <- c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", rep("2000-01-01", 2), "2000-01-02", rep("2000-01-03", 3))
outcome <- rnorm(10)
opponent <- c(rep("Country B", 4), rep("Country A", 6))
match_outcome <- c(rep("L", 4), rep("W", 6))
match_date <- rep("2000-01-02", 10)
difference <- c(-1, 0, 1, 2, -1, -1, 0, rep(1, 3))
desired_output <- cbind.data.frame(country, date, outcome, opponent, match_outcome, match_date, difference)
rm(country, date, outcome, opponent, match_outcome, match_date, difference)
Can anyone help me? I have been struggling with how to get this done. Here is what I have tried so far:
match_data$match_date_minus3 <- ymd(match_data$date) - days(3)
match_data$match_date_plus3 <- ymd(match_data$date) + days(3)
test_output <- fuzzy_left_join(individual_data, match_data,
                               by = c("country" = "country",
                                      "match_date_minus3" = "date",
                                      "match_date_plus3" = "date"),
                               match_fun = list("==", ">", "<"))
But I get the following error: Error in which(m) : argument to 'which' is not logical
For reference, in case anyone is familiar with it, I am trying to replicate the results of Depeteris-Chauvin et al. 2018.
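There are three issues:
1. The double quotes in match_fun should be replaced with backticks, so that the list holds the comparison functions themselves rather than character strings.
2. The by values should be reversed: the names on the left refer to columns of individual_data and the values on the right to columns of match_data.
3. The 'date' column should be converted to the Date class.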
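The first point comes down to the difference between a character string and a function object. The following base R sketch illustrates it on its own, without fuzzyjoin; how fuzzyjoin treats strings internally is not shown here, and the variable names are only illustrative.

```r
# "==" in double quotes is just a length-one character vector.
quoted <- "=="
is.function(quoted)      # FALSE

# `==` in backticks refers to the comparison function itself.
backticked <- `==`
is.function(backticked)  # TRUE

# The function can be called like any other and returns logical values.
eq_result <- backticked(c(1, 2, 3), c(1, 5, 3))
print(eq_result)         # TRUE FALSE TRUE

# The same holds for `>` and `<` once the inputs are Date objects.
after_lower <- `>`(as.Date("2000-01-01"), as.Date("1999-12-30"))
print(after_lower)       # TRUE
```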
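library(fuzzyjoin)
library(dplyr)
# Compare real dates rather than character strings
individual_data$date <- as.Date(individual_data$date)
# The window bounds are the columns created in the question's attempt;
# as.Date() makes sure they are of class Date
match_data$match_date_minus3 <- as.Date(match_data$match_date_minus3)
match_data$match_date_plus3 <- as.Date(match_data$match_date_plus3)
# Names in `by` are columns of individual_data, values are columns of match_data
fuzzy_left_join(individual_data, match_data,
                by = c("country" = "country",
                       "date" = "match_date_minus3",
                       "date" = "match_date_plus3"),
                match_fun = list(`==`, `>`, `<`)) %>%
  # Drop the helper columns and the .x/.y suffixes added by the join
  select(country = country.x, date = date.x, outcome,
         opponent, match_outcome, match_date = date.y)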
#     country       date    outcome  opponent match_outcome match_date
#1  Country A 2000-01-01  1.4003662 Country B             L 2000-01-02
#2  Country A 2000-01-02  0.5526607 Country B             L 2000-01-02
#3  Country A 2000-01-03  0.4316405 Country B             L 2000-01-02
#4  Country A 2000-01-04 -0.1171910 Country B             L 2000-01-02
#5  Country B 2000-01-01  1.3433921 Country A             W 2000-01-02
#6  Country B 2000-01-01 -1.1773011 Country A             W 2000-01-02
#7  Country B 2000-01-02 -0.6953120 Country A             W 2000-01-02
#8  Country B 2000-01-03  1.3484053 Country A             W 2000-01-02
#9  Country B 2000-01-03 -0.7266405 Country A             W 2000-01-02
#10 Country B 2000-01-03 -0.9139988 Country A             W 2000-01-02
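The result above does not yet contain the difference column requested in the question. One possible way to add it is sketched below in base R, on a hand-built data frame shaped like the joined result. The name joined is illustrative, and the outcome, opponent and match_outcome columns are left out for brevity.

```r
# Hand-built stand-in for the joined result (only the columns needed here).
joined <- data.frame(
  country = c(rep("Country A", 4), rep("Country B", 6)),
  date = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04",
                   rep("2000-01-01", 2), "2000-01-02", rep("2000-01-03", 3))),
  match_date = as.Date("2000-01-02")
)

# Days between the survey date and the match date:
# negative = before the match, zero = match day, positive = after it.
joined$difference <- as.numeric(
  difftime(joined$date, joined$match_date, units = "days")
)

print(joined$difference)
# -1  0  1  2 -1 -1  0  1  1  1
```

In the dplyr pipeline the same calculation could be a mutate() step after select(), provided match_date is first converted with as.Date(), because date.y comes from the character date column of match_data. Note also that the strict operators `>` and `<` exclude survey dates that fall exactly 3 days before or after a match; `>=` and `<=` would make the window inclusive.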