Merge data by group if available, ignore otherwise

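I have journal data, overview, with year, issue, and title information. I have also scraped some_content and am looking for a way to merge the two. There are two complications: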

The data look roughly like this:

# A title like "A" can re-occur in different issues, in different years.
# A title is unique within one year-issue.
overview <- data.frame(Year = c(rep(2018,4), rep(2019,4)), 
                        Issue = c(1,1,1,2,1,2,3,3), 
                        Title = c("A", "B", "F", "A", "F", "L", "A", "F"))

  Year Issue Title
1 2018     1     A
2 2018     1     B
3 2018     1     F
4 2018     2     A
5 2019     1     F
6 2019     2     L
7 2019     3     A
8 2019     3     F

# The scraped titles include punctuation, like .  !  ?
some_content <- data.frame(Year = c(2018, 2018, 2019, 2019, 2019), 
                             Issue = c(1,1,2,3,3), 
                             Title = c("A.", "B!", "L?", "A.", "F"),
                             Content = c("helloworld", NA, "match", "lorem", NA))

  Year Issue Title    Content
1 2018     1    A. helloworld
2 2018     1    B!       <NA>
3 2019     2    L?      match
4 2019     3    A.      lorem
5 2019     3     F       <NA>

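Take the "helloworld" row as an example. The first issue of 2018 has several titles. Title A in overview clearly corresponds to A. in some_content, even though the two strings are not identical. Whenever a title from overview can be detected within the same year-issue combination in some_content, the Content from some_content should be merged into the overview data frame. The result should look like this: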

merge_data <- data.frame(Year = c(rep(2018,4), rep(2019,4)), 
                         Issue = c(1,1,1,2,1,2,3,3), 
                         Title = c("A", "B", "F", "A", "F", "L", "A", "F"),
                         Content = c("helloworld", NA, NA, NA, NA, "match", "lorem", NA))

  Year Issue Title    Content
1 2018     1     A helloworld
2 2018     1     B       <NA>
3 2018     1     F       <NA>
4 2018     2     A       <NA>
5 2019     1     F       <NA>
6 2019     2     L      match
7 2019     3     A      lorem
8 2019     3     F       <NA>

First, I suggest removing the punctuation from the scraped titles like this:

some_content$Title <- gsub("[[:punct:]]", "", some_content$Title)
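
As a side note, the pattern above removes every punctuation character, including any that sits in the middle of a title. A minimal sketch, assuming only trailing punctuation should be dropped: anchor the pattern at the end of the string. The helper name strip_trailing_punct and the sample title "Re: A?" are made up for illustration and do not come from the question.

```r
# Hypothetical helper: remove punctuation only at the end of a title
strip_trailing_punct <- function(x) sub("[[:punct:]]+$", "", x)

titles <- c("A.", "B!", "L?", "F", "Re: A?")
cleaned <- strip_trailing_punct(titles)
print(cleaned)
# "A" "B" "L" "F" "Re: A"
```

For the example data in the question both patterns give the same cleaned titles, so either one works there.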

After that, you can do a simple left_join on Year, Issue, and Title:

library(dplyr)
left_join(overview, some_content, by = c("Year", "Issue", "Title"))

Output:

  Year Issue Title    Content
1 2018     1     A helloworld
2 2018     1     B       <NA>
3 2018     1     F       <NA>
4 2018     2     A       <NA>
5 2019     1     F       <NA>
6 2019     2     L      match
7 2019     3     A      lorem
8 2019     3     F       <NA>
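
If you would rather not depend on dplyr, the same lookup can be sketched in base R with match() on a combined key. This is an alternative technique, not the left_join shown above, and it assumes each Year/Issue/Title combination appears at most once in some_content, since match() only returns the first hit. The helper make_key and the "|" separator are illustrative choices.

```r
overview <- data.frame(Year = c(rep(2018, 4), rep(2019, 4)),
                       Issue = c(1, 1, 1, 2, 1, 2, 3, 3),
                       Title = c("A", "B", "F", "A", "F", "L", "A", "F"))
some_content <- data.frame(Year = c(2018, 2018, 2019, 2019, 2019),
                           Issue = c(1, 1, 2, 3, 3),
                           Title = c("A.", "B!", "L?", "A.", "F"),
                           Content = c("helloworld", NA, "match", "lorem", NA))

# Hypothetical helper: one string key per row built from Year, Issue
# and the punctuation-free Title
make_key <- function(d) {
  paste(d$Year, d$Issue, gsub("[[:punct:]]", "", d$Title), sep = "|")
}

# For each overview row, match() returns the position of the first
# some_content row with the same key, or NA when there is none
idx <- match(make_key(overview), make_key(some_content))

# Copy overview and pull Content across; unmatched rows stay NA
merged <- overview
merged$Content <- some_content$Content[idx]
print(merged)
```

This leaves some_content$Title untouched and keeps the rows of overview in their original order.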