Merge data by group if available, ignore otherwise
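I have journal data, overview, containing year, issue, and title information. I have scraped some_content and am looking for a way to merge the two. I have two problems: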
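- Missing data. overview can be considered complete, but my scrape has several gaps. Whenever content is available for a title in a given issue of a given year, it should be merged.
- No perfect title matches. In my complete overview the titles are shorter and carry no punctuation. (In one version I removed the punctuation from the scraped titles, but I do not think that is safe, because the punctuation is not always at the end of the title.)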
The data look roughly like this:
# A title like "A" can re-occur in different issues, in different years.
# A title is unique within one year-issue.
overview <- data.frame(Year  = c(rep(2018, 4), rep(2019, 4)),
                       Issue = c(1, 1, 1, 2, 1, 2, 3, 3),
                       Title = c("A", "B", "F", "A", "F", "L", "A", "F"))

  Year Issue Title
1 2018     1     A
2 2018     1     B
3 2018     1     F
4 2018     2     A
5 2019     1     F
6 2019     2     L
7 2019     3     A
8 2019     3     F

# The scraped titles include punctuation, like . ! ?
some_content <- data.frame(Year    = c(2018, 2018, 2019, 2019, 2019),
                           Issue   = c(1, 1, 2, 3, 3),
                           Title   = c("A.", "B!", "L?", "A.", "F"),
                           Content = c("helloworld", NA, "match", "lorem", NA))

  Year Issue Title    Content
1 2018     1    A. helloworld
2 2018     1    B!       <NA>
3 2019     2    L?      match
4 2019     3    A.      lorem
5 2019     3     F       <NA>

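Take "helloworld" as an example. The first issue of 2018 has several titles. Title "A" in overview definitely corresponds to "A." in some_content, even though the two are not identical. Whenever a title from a year-issue combination in overview can be detected in the same year-issue combination in some_content, the Content from some_content should be merged into the overview data frame. The result should look like this: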
merge_data <- data.frame(Year    = c(rep(2018, 4), rep(2019, 4)),
                         Issue   = c(1, 1, 1, 2, 1, 2, 3, 3),
                         Title   = c("A", "B", "F", "A", "F", "L", "A", "F"),
                         Content = c("helloworld", NA, NA, NA, NA, "match", "lorem", NA))

  Year Issue Title    Content
1 2018     1     A helloworld
2 2018     1     B       <NA>
3 2018     1     F       <NA>
4 2018     2     A       <NA>
5 2019     1     F       <NA>
6 2019     2     L      match
7 2019     3     A      lorem
8 2019     3     F       <NA>

First, I would suggest removing the punctuation from the scraped titles with:
some_content$Title <- gsub("[[:punct:]]", "", some_content$Title)
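
Note that this pattern removes punctuation anywhere in the string, not only at the end, which addresses the concern about punctuation in the middle of a title. A small sketch of what it does, using made-up titles that are not part of the question's data:

```r
# Made-up titles, for illustration only (not from the question's data).
titles <- c("A.", "B!", "What now? A reply", "Part 1: Intro")

# [[:punct:]] matches any punctuation character, wherever it occurs.
cleaned <- gsub("[[:punct:]]", "", titles)
print(cleaned)
```

Spaces and letters are left alone, so only the punctuation characters themselves disappear. If the original scraped titles are still needed, the result can be stored in a separate column instead of overwriting Title.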
After that you can do a simple left_join like this:
library(dplyr)
left_join(overview, some_content, by = c("Year", "Issue", "Title"))
Output:

  Year Issue Title    Content
1 2018     1     A helloworld
2 2018     1     B       <NA>
3 2018     1     F       <NA>
4 2018     2     A       <NA>
5 2019     1     F       <NA>
6 2019     2     L      match
7 2019     3     A      lorem
8 2019     3     F       <NA>
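
The join above only works when the cleaned titles are identical. If the scraped titles are longer than the overview titles in other ways too, one option is to pair rows within each year-issue and keep a pair when the overview title is contained in the cleaned scraped title. This is a minimal base R sketch under that assumption, not a drop-in replacement: the helper columns Row and Title.scraped are names made up for this example, and keeping the first hit when several scraped titles match is an arbitrary choice.

```r
overview <- data.frame(Year  = c(rep(2018, 4), rep(2019, 4)),
                       Issue = c(1, 1, 1, 2, 1, 2, 3, 3),
                       Title = c("A", "B", "F", "A", "F", "L", "A", "F"),
                       stringsAsFactors = FALSE)
some_content <- data.frame(Year    = c(2018, 2018, 2019, 2019, 2019),
                           Issue   = c(1, 1, 2, 3, 3),
                           Title   = c("A.", "B!", "L?", "A.", "F"),
                           Content = c("helloworld", NA, "match", "lorem", NA),
                           stringsAsFactors = FALSE)

# Remember the original row order; merge() may reorder rows.
overview$Row <- seq_len(nrow(overview))

# Pair every overview row with every scraped row of the same year-issue.
pairs <- merge(overview, some_content, by = c("Year", "Issue"),
               all.x = TRUE, suffixes = c("", ".scraped"))

# Strip punctuation from the scraped title, then test whether the
# overview title occurs in it as a literal (fixed) substring.
clean <- gsub("[[:punct:]]", "", pairs$Title.scraped)
hit <- !is.na(clean) &
  mapply(grepl, pairs$Title, clean,
         MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# Keep one match per overview row (assumption: the first hit wins).
matched <- pairs[hit, c("Row", "Content")]
matched <- matched[!duplicated(matched$Row), ]

# Rows without a match get NA, which mirrors the left join.
overview$Content <- matched$Content[match(overview$Row, matched$Row)]
overview$Row <- NULL
print(overview)
```

A substring test can match very short titles by accident, so on real data it is worth checking rows where more than one scraped title is a hit before trusting the result.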