Extract strings from filename and create new columns using mutate
I have a data.frame with two columns. The second column contains file names.
df <- data.frame(paragraph = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
filename = "./data/RevCon_2015_C1_Austria_05_06.txt", stringsAsFactors = FALSE)
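How can I extract certain strings from the second column (using stringr) and add them (using dplyr::mutate) as additional variables (conference, year, country, etc.) so that I get the following result: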
df2 <- data.frame(paragraph = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
filename = "./data/RevCon_2015_C1_Austria_05_06.txt", conference = "RevCon", year = "2015", country= "Austria", date = "06.05.2015", stringsAsFactors = FALSE)
We can use tidyr::separate as follows:
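library(tidyverse)

df %>%
    # Strip the leading "./data/" and the trailing ".txt"
    mutate(tmp = gsub("(\\./data/|\\.txt)", "", filename)) %>%
    # By default separate() splits on runs of non-alphanumeric characters (here "_")
    separate(
        tmp,
        into = c("conference", "year", "ignored", "country", "month", "day")) %>%
    mutate(date = paste(day, month, year, sep = "/")) %>%
    select(-ignored, -month, -day)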
# paragraph filename conference year
#1 Lorem ipsum [...] ./data/RevCon_2015_C1_Austria_05_06.txt RevCon 2015
# country date
#1 Austria 06/05/2015
Note that this assumes filename follows the pattern ./data/{conference}_{year}_{ignored}_{country}_{month}_{day}.txt
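If some file names might not follow that layout, it may help to check them before splitting. The following is only a sketch in base R: the second file name and the variable names are made up for illustration, and the pattern is my own reading of the layout above (six underscore-separated alphanumeric fields).

```r
# Hypothetical file names: the second one lacks the month and day fields
filenames <- c("./data/RevCon_2015_C1_Austria_05_06.txt",
               "./data/RevCon_2015_C1_Austria.txt")

# Six underscore-separated alphanumeric fields between "./data/" and ".txt"
pattern <- "^\\./data/[[:alnum:]]+(_[[:alnum:]]+){5}\\.txt$"

# TRUE where the file name matches the assumed layout
follows_pattern <- grepl(pattern, filenames)
follows_pattern

# File names that would not split cleanly into six fields
filenames[!follows_pattern]
```

Rows flagged this way would otherwise produce missing values (and a warning) in separate, so filtering or fixing them first keeps the result predictable.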
Sample data
df <- data.frame(
paragraph = "Lorem ipsum [...]",
filename = "./data/RevCon_2015_C1_Austria_05_06.txt",
stringsAsFactors = FALSE)
Here are two different approaches, using separate and extract from tidyr:
library(dplyr)
library(tidyr)

df %>%
  # Rebuild the base name as conference_year_country_dd.mm.yyyy
  mutate(filename2 = gsub("^(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$",
                          "\\1_\\2_\\3_\\5.\\4.\\2", basename(filename))) %>%
  separate(filename2, c("conference", "year", "country", "date"), sep = "_")
Or with extract:
df %>%
  # Each capture group becomes its own column
  extract(filename, c("conference", "year", "country", "day", "month"),
          "^.+/(\\w+)_(\\d+)_.+?_(\\w+)_(\\d{2})_(\\d{2}).+$",
          remove = FALSE) %>%
  unite(date, month, day, year, sep = ".", remove = FALSE) %>%
  select(paragraph, filename, conference, year, country, date)
Result:
paragraph
1 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
filename conference year country date
1 ./data/RevCon_2015_C1_Austria_05_06.txt RevCon 2015 Austria 06.05.2015
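Notes:
- The first approach uses gsub to match each "column" we want with a capture group and reorders the groups as needed. Note the _ inserted between the groups to delimit the columns.
- I used basename to extract everything after the last /, then used separate to split the elements into actual columns, with _ as the separator.
- The second approach uses essentially the same regular expression (with a leading ^.+/ in place of basename), but instead of rearranging the groups, extract treats each capture group as a separate column.
- unite binds month, day and year together without removing the original columns.
- Finally, select drops day and month and reorders the columns.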
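Since the question asked specifically for stringr together with dplyr::mutate, the same idea can also be sketched with stringr::str_match. This is not taken from the answers above; the regular expression and the helper name parts are my own, and it assumes the layout {conference}_{year}_{ignored}_{country}_{month}_{day}.txt with no underscores inside the individual fields.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  paragraph = "Lorem ipsum [...]",
  filename = "./data/RevCon_2015_C1_Austria_05_06.txt",
  stringsAsFactors = FALSE)

# Assumed layout: {conference}_{year}_{ignored}_{country}_{month}_{day}.txt
pattern <- "^([^_]+)_(\\d{4})_[^_]+_([^_]+)_(\\d{2})_(\\d{2})\\.txt$"

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 to 6 are the capture groups (NA where nothing matched)
parts <- str_match(basename(df$filename), pattern)

df2 <- df %>%
  mutate(
    conference = parts[, 2],
    year = parts[, 3],
    country = parts[, 4],
    # day.month.year, as in the desired output
    date = str_c(parts[, 6], parts[, 5], parts[, 3], sep = "."))
df2
```

Using [^_]+ instead of \\w+ keeps each group from running across an underscore, which makes the match less dependent on backtracking, at the cost of failing for fields that themselves contain an underscore.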