In R, how do I find the location of a word in a string?
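How can I find the first position of a specific word in a data frame cell and save the result in a new column of the same data frame?
Ideally, I want the first match for each word in the dictionary.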
df <- data.frame(text = c("omg coke is so awsme","i always preferred pepsi", "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")
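The location can be either the number of characters or the number of words preceding the dictionary word.
I have been working with the code found here, but I can't get it to work.
For example, this code does the job, but only for one word and one string (rather than a df and a dictionary):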
my_string = "omg coke is so awsme"
unlist(gregexpr("coke", my_string))[1]
Desired output:
                                                       text location
1                                      omg coke is so awsme        2
2                                  i always preferred pepsi        4
3 mozart is so overrated by yeah fanta makes my day, always        7
As I said, the location can also be a character position rather than a word position, if that is easier.
Just run:
c(regexpr(paste0(dict,collapse = '|'), df$text))
[1] 5 20 32
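Since the question asks for the result to be stored in a new column of the same data frame, the call above can be assigned directly. The following is only a minimal sketch: the column name `location` is borrowed from the desired output, and turning a missing match into `NA` (instead of the `-1` that `regexpr` returns) is an assumption about what is convenient downstream.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

# Character position of the first match of any dictionary word in each row.
# as.integer() drops the match.length attributes that regexpr() attaches.
pos <- as.integer(regexpr(paste0(dict, collapse = "|"), df$text))

# regexpr() signals "no match" with -1; replace it with NA (an assumption,
# not part of the original answer).
pos[pos == -1L] <- NA_integer_

df$location <- pos
df$location
```

Note that the dictionary words are pasted into a regular expression here, so a word containing characters such as `.` or `(` would need escaping first.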
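Edit:
If you want the word position instead:
library(tidyverse)
# Lazy ".*?" stops at the first dictionary word; a greedy ".*" would run to the last one.
pat <- sprintf(".*?(%s)", paste0(dict, collapse = '|'))
df %>%
  # Count the words up to and including the first match.
  mutate(loc = str_count(str_extract(text, pat), "\\w+"))
                                                       text loc
1                                      omg coke is so awsme   2
2                                  i always preferred pepsi   4
3 mozart is so overrated by yeah fanta makes my day, always   7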
Here is a simple for loop:
for(i in dict) {
  df[[i]] = stringi::stri_locate_first_fixed(df$text, i)[, 1]
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    5    NA    NA
# 2                                  i always preferred pepsi   NA    20    NA
# 3 mozart is so overrated by yeah fanta makes my day, always   NA    NA    32
Or use regexpr (part of base R, so no dependencies):
for(i in dict) {
  df[[i]] = regexpr(i, df$text, fixed = TRUE)
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    5    -1    -1
# 2                                  i always preferred pepsi   -1    20    -1
# 3 mozart is so overrated by yeah fanta makes my day, always   -1    -1    32
Below is a solution for the word number, though I recommend removing all punctuation before using it:
df$words = strsplit(df$text, split = " ")
for(i in dict) {
  df[[i]] = sapply(df$words, \(x) match(i, unlist(x)))
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    2    NA    NA
# 2                                  i always preferred pepsi   NA     4    NA
# 3 mozart is so overrated by yeah fanta makes my day, always   NA    NA     7
#                                                                 words
# 1                                            omg, coke, is, so, awsme
# 2                                         i, always, preferred, pepsi
# 3 mozart, is, so, overrated, by, yeah, fanta, makes, my, day,, always
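The recommendation to strip punctuation before splitting could look roughly like the sketch below. It is not part of the answer above: the `[[:punct:]]` character class, the single `location` column, and the choice to report the earliest position of any dictionary word per row are all assumptions made for illustration.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

# Remove punctuation so that a token like "day," becomes "day",
# then split on runs of whitespace.
clean <- gsub("[[:punct:]]", "", df$text)
words <- strsplit(trimws(clean), "\\s+")

# For each row, take the smallest word index at which any dictionary
# word occurs, or NA if none of them occurs.
df$location <- sapply(words, function(w) {
  idx <- match(dict, w)
  if (all(is.na(idx))) NA_integer_ else min(idx, na.rm = TRUE)
})
df$location
```

Without the cleanup step, a dictionary word followed directly by a comma or full stop would not be found by `match()`, because the token would still carry the punctuation mark.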
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")
corp <- corpus(df)
toks <- tokens(corp)
index(toks, dict)
#> docname from to pattern
#> 1 text1 2 2 coke
#> 2 text2 4 4 pepsi
#> 3 text3 7 7 fanta
Created on 2022-05-27 by the reprex package (v2.0.1)