Text capture using a pattern in R - regular expression
I am trying to extract the required words by pattern matching.
Below is sample data from the object table
+-----------+-------------------------------------------------------------------------------------------------+
| Unique_Id | Text |
+-----------+-------------------------------------------------------------------------------------------------+
| Ax23z12 | Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 |
+-----------+-------------------------------------------------------------------------------------------------+
Using the code below
regmatches(table[1,2], gregexpr("2000-\\d{4}", table[1,2]))
I am able to extract the output as
[[1]]
[1] "2000-0511" "2000-0511"
But the output I am looking for is as follows
+-----------+---------------------------------------------------------------------------+-----------+-----------+
| Unique_Id | Text | Column1 | Column2 |
+-----------+---------------------------------------------------------------------------+-----------+-----------+
| Ax23z12 | Tool generated code 2015-8134 upon further validation, the tool confirmed | 2015-8134 | 2015-8134 |
| | the code as 2015-8134 | | |
+-----------+---------------------------------------------------------------------------+-----------+-----------+
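The data under the Text column contains this number multiple times (up to 7 times), so I am looking for a dynamic solution.
Thanks a lot
Something like this may work for you
df[apply(df, 1, function(x) any(grepl("2000-\\d{4}", x))), ]
See this reproducible example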
iris[apply(iris, 1, function(x) any(grepl("set", x))), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# etc
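Using stringr and data.table:
1) Use str_match_all to extract all matched patterns;
2) Use transpose to convert the extracted patterns into columns;
3) Combine the extracted columns with the original columns to build the new data frame.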
library(stringr)
library(data.table)
lst = transpose(str_match_all(df$Text, "2015-\\d{4}"))
data.frame(df, setNames(lst, paste0("Column", seq_along(lst))))
# Unique_Id Text Column1 Column2
#1 Ax23z12 Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134 2015-8134 2015-8134
#2 By56m22 Tool generated code 2015-8134 upon further validation 2015-8134 <NA>
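For comparison, the same three steps can be sketched in base R alone, without stringr or data.table. This is only a sketch under assumptions: the data frame df below is rebuilt from the two rows shown in the output above, and the names matches, n_max, padded and result are illustrative, not part of the original answer.

```r
# Sample data assumed for illustration; mirrors the two rows shown above
df <- data.frame(
  Unique_Id = c("Ax23z12", "By56m22"),
  Text = c("Tool generated code 2015-8134 upon further validation, the tool confirmed the code as 2015-8134",
           "Tool generated code 2015-8134 upon further validation"),
  stringsAsFactors = FALSE
)

# 1) Extract every match per row (a list of character vectors)
matches <- regmatches(df$Text, gregexpr("2015-\\d{4}", df$Text))

# 2) Pad each vector with NA up to the longest length and stack the rows
n_max <- max(lengths(matches), 1L)
padded <- do.call(rbind, lapply(matches, function(m) m[seq_len(n_max)]))
colnames(padded) <- paste0("Column", seq_len(n_max))

# 3) Bind the new columns to the original data frame
result <- cbind(df, padded, stringsAsFactors = FALSE)
```

Because n_max is taken from the data, the number of Column* columns grows with the row that has the most matches, so the same code should cover the case of up to 7 matches per row.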
Here is one approach for you. I used the following sample data, called foo.
# id text
# <int> <chr>
#1 1 Here is my code, 2015-8134. Here is your code, 2015-1111.
#2 2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666
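I first extracted the numbers from text with stri_extract_all_regex(). With simplify = TRUE this returns a matrix, so I converted it to a data frame. Then I used bind_cols() to combine it with the original data set. The last job was to modify the column names: I used gsub() to replace X in the column names with Column.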
library(dplyr)
library(stringi)
out <- stri_extract_all_regex(str = foo$text, pattern = "\\d+-\\d+", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  bind_cols(foo, .)
names(out) <- names(out) %>%
  gsub(pattern = "X", replacement = "Column")
# id text Column1 Column2 Column3
# <int> <chr> <chr> <chr> <chr>
#1 1 Here is my code, 2015-8134. Here is your code, 2015-1111. 2015-8134 2015-1111
#2 2 His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666 2016-8888 2016-7777 2016-6666
DATA
foo <- structure(list(id = 1:2, text = c("Here is my code, 2015-8134. Here is your code, 2015-1111.",
"His code is 2016-8888, her code is 2016-7777, and your code is 2016-6666"
)), .Names = c("id", "text"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L))