String splitting data.table column produces NAs
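This is my first question on SO, so please let me know if it can be improved. I am working on a natural language processing project in R and am trying to build a data.table that contains test cases. Here is a much simplified example: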
library(data.table)

texts.dt <- data.table(string = c("one",
                                  "two words",
                                  "three words here",
                                  "four useless words here",
                                  "five useless meaningless words here",
                                  "six useless meaningless words here just",
                                  "seven useless meaningless words here just to",
                                  "eigth useless meaningless words here just to fill",
                                  "nine useless meaningless words here just to fill up",
                                  "ten useless meaningless words here just to fill up space"),
                       word.count = 1:10,
                       stop.at.word = c(0, 1, 2, 2, 4, 3, 3, 6, 7, 5))
This returns the data.table we will be working on:
                                                      string word.count stop.at.word
 1:                                                      one          1            0
 2:                                                two words          2            1
 3:                                         three words here          3            2
 4:                                  four useless words here          4            2
 5:                      five useless meaningless words here          5            4
 6:                  six useless meaningless words here just          6            3
 7:             seven useless meaningless words here just to          7            3
 8:        eigth useless meaningless words here just to fill          8            6
 9:      nine useless meaningless words here just to fill up          9            7
10: ten useless meaningless words here just to fill up space         10            5
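In the real application, the values in the stop.at.word column are determined at random (with an upper bound of word.count - 1). Also, the strings are not sorted by length, but that should not make any difference.
The code should add two columns, input and output, where input contains the substring from position 1 up to word number stop.at.word, and output contains the word that follows (a single word), like this: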
>desired_result
                                                      string word.count stop.at.word                                       input
 1:                                                      one          1            0
 2:                                                two words          2            1                                         two
 3:                                         three words here          3            2                                 three words
 4:                                  four useless words here          4            2                                four useless
 5:                      five useless meaningless words here          5            4              five useless meaningless words
 6:                  six useless meaningless words here just          6            3                     six useless meaningless
 7:             seven useless meaningless words here just to          7            3                   seven useless meaningless
 8:        eigth useless meaningless words here just to fill          8            6   eigth useless meaningless words here just
 9:      nine useless meaningless words here just to fill up          9            7 nine useless meaningless words here just to
10: ten useless meaningless words here just to fill up space         10            5          ten useless meaningless words here
    output
 1:
 2:  words
 3:   here
 4:  words
 5:   here
 6:  words
 7:  words
 8:     to
 9:   fill
10:   just
Unfortunately, what I get is this:
                                                      string word.count stop.at.word input output
 1:                                                      one          1            0
 2:                                                two words          2            1    NA     NA
 3:                                         three words here          3            2    NA     NA
 4:                                  four useless words here          4            2    NA     NA
 5:                      five useless meaningless words here          5            4    NA     NA
 6:                  six useless meaningless words here just          6            3    NA     NA
 7:             seven useless meaningless words here just to          7            3    NA     NA
 8:        eigth useless meaningless words here just to fill          8            6    NA     NA
 9:      nine useless meaningless words here just to fill up          9            7    NA     NA
10: ten useless meaningless words here just to fill up space         10            5   ten     NA
Note the inconsistent results: row 1 contains empty strings, and row 10 returns "ten".
Here is the code I used:
texts.dt[, c("input", "output") := .(
  substr(string,
         1,
         sapply(gregexpr(" ", string), "[", stop.at.word) - 1),
  substr(string,
         sapply(gregexpr(" ", string), "[", stop.at.word),
         sapply(gregexpr(" ", string), "[", stop.at.word + 1) - 1)
)]
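I ran many tests, and the substr instruction works fine when I try it on a single string in the console, but it fails when applied to the data.table.
I suspect I am missing something related to scoping in data.table, but I have not been using this package for long, so I am quite confused.
I would greatly appreciate your help.
Thanks in advance!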
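A likely explanation for the NAs, sketched in base R on three illustrative strings (the sample values below are assumptions, not the question's full table): sapply() passes the entire stop.at.word vector as the index for every string, so it returns a matrix rather than one position per string. substr() then reads its stop values from that flattened matrix, which would account for the NA rows and for the stray "ten" in row 10. Pairing each string with its own index, for instance with mapply(), avoids this.

```r
# Three sample strings and the word index to stop at for each (illustrative values).
string <- c("two words", "three words here", "four useless words here")
stop.at.word <- c(1, 1, 2)

# Positions of every space: a list holding one integer vector per string.
spaces <- gregexpr(" ", string)

# sapply() hands the WHOLE stop.at.word vector to `[` for each list element,
# so every string yields three values and the result is a 3 x 3 matrix.
bad <- sapply(spaces, "[", stop.at.word)
print(dim(bad))

# mapply() pairs the i-th list element with the i-th index,
# giving exactly one space position per string.
good <- mapply(function(pos, n) pos[n], spaces, stop.at.word)
print(good)

# With one stop position per string, substr() behaves as it does on a single string.
print(substr(string, 1, good - 1))
```

Rows with stop.at.word equal to 0 would still need separate handling, since indexing with 0 returns an empty vector.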
I would probably do:
texts.dt[stop.at.word > 0, c("input","output") := {
  sp = strsplit(string, " ")
  list(
    mapply(function(p,n) paste(p[seq_len(n)], collapse = " "), sp, stop.at.word),
    mapply(`[`, sp, stop.at.word+1L)
  )
}]
# partial result
head(texts.dt, 4)
                    string word.count stop.at.word        input output
1:                     one          1            0           NA     NA
2:               two words          2            1          two  words
3:        three words here          3            2  three words   here
4: four useless words here          4            2 four useless  words
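Or:
library(stringi)
texts.dt[stop.at.word > 0, c("input","output") := {
  # one pattern per row: the first stop.at.word words, a space, then the rest
  patt = paste0("((\\w+ ){", stop.at.word-1, "}\\w+) (.*)")
  m = stri_match(string, regex = patt)
  # column 1 is the full match, so capture groups 1 and 3 sit in columns 2 and 4;
  # group 3, (.*), holds everything after the split point, not just the next word
  list(m[, 2], m[, 4])
}]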
An alternative to @Frank's mapply solution is to use by = 1:nrow(texts.dt) together with strsplit and paste:
library(data.table)
texts.dt[, `:=` (input = paste(strsplit(string, ' ')[[1]][1:stop.at.word][stop.at.word>0],
                               collapse = " "),
                 output = strsplit(string, ' ')[[1]][stop.at.word + 1]),
         by = 1:nrow(texts.dt)]
This gives:
> texts.dt
                                                      string word.count stop.at.word                                       input output
 1:                                                      one          1            0                                                one
 2:                                                two words          2            1                                         two  words
 3:                                         three words here          3            2                                 three words   here
 4:                                  four useless words here          4            2                                four useless  words
 5:                      five useless meaningless words here          5            4              five useless meaningless words   here
 6:                  six useless meaningless words here just          6            3                     six useless meaningless  words
 7:             seven useless meaningless words here just to          7            3                   seven useless meaningless  words
 8:        eigth useless meaningless words here just to fill          8            6   eigth useless meaningless words here just     to
 9:      nine useless meaningless words here just to fill up          9            7 nine useless meaningless words here just to   fill
10: ten useless meaningless words here just to fill up space         10            5          ten useless meaningless words here   just
Instead of using [[1]], you can also wrap strsplit in unlist, like this: unlist(strsplit(string, ' ')) (instead of strsplit(string, ' ')[[1]]). This gives you the same result.
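As a small base R check of that equivalence (the sample strings are only illustrative): the two forms agree when a single string is split, which is the situation inside a by = 1:nrow(texts.dt) call, but they differ as soon as several strings are passed at once.

```r
# One string at a time: both forms return the same character vector.
s <- "three words here"
via_index  <- strsplit(s, " ")[[1]]
via_unlist <- unlist(strsplit(s, " "))
print(identical(via_index, via_unlist))

# Several strings at once: [[1]] keeps only the first string's words,
# while unlist() concatenates the words of all strings.
both <- c("two words", "three words here")
first_only <- strsplit(both, " ")[[1]]
all_words  <- unlist(strsplit(both, " "))
print(first_only)
print(all_words)
```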
Two other options:
1) With the stringi package:
library(stringi)
texts.dt[, `:=`(input = paste(stri_extract_all_words(string[stop.at.word>0],
                                                     simplify = TRUE)[1:stop.at.word],
                              collapse = " "),
                output = stri_extract_all_words(string[stop.at.word>0],
                                                simplify = TRUE)[stop.at.word+1]),
         1:nrow(texts.dt)]
2) Or, adapted from this answer:
texts.dt[stop.at.word>0,
         c('input','output') := tstrsplit(string,
                                          split = paste0("(?=(?>\\s+\\S*){",
                                                         word.count - stop.at.word,
                                                         "}$)\\s"),
                                          perl = TRUE)
         ][, output := sub('(\\w+).*','\\1',output)]
Both give:
> texts.dt
                                                      string word.count stop.at.word                                       input output
 1:                                                      one          1            0                                          NA     NA
 2:                                                two words          2            1                                         two  words
 3:                                         three words here          3            2                                 three words   here
 4:                                  four useless words here          4            2                                four useless  words
 5:                      five useless meaningless words here          5            4              five useless meaningless words   here
 6:                  six useless meaningless words here just          6            3                     six useless meaningless  words
 7:             seven useless meaningless words here just to          7            3                   seven useless meaningless  words
 8:        eigth useless meaningless words here just to fill          8            6   eigth useless meaningless words here just     to
 9:      nine useless meaningless words here just to fill up          9            7 nine useless meaningless words here just to   fill
10: ten useless meaningless words here just to fill up space         10            5          ten useless meaningless words here   just
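To see what the look-ahead pattern in option 2 does, it may help to try it on a single string in base R. This sketch uses strsplit() in place of data.table's tstrsplit(), and the values of s, word.count and stop.at.word are picked by hand for illustration.

```r
# One sample row, with its word count and stop position written out by hand.
s <- "four useless words here"
word.count <- 4
stop.at.word <- 2

# Split at the single whitespace character that is followed by exactly
# (word.count - stop.at.word) "whitespace + word" groups up to the end of the string.
patt <- paste0("(?=(?>\\s+\\S*){", word.count - stop.at.word, "}$)\\s")
parts <- strsplit(s, patt, perl = TRUE)[[1]]
print(parts)

# Keep only the first word of the second part, as the sub() call in option 2 does.
next_word <- sub("(\\w+).*", "\\1", parts[2])
print(next_word)
```

Because the number of trailing words differs per row, the pattern has to be built per row, which is why option 2 pastes word.count - stop.at.word into it.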
# dt here stands for the question's table (texts.dt) before any columns were added
dt[, `:=`(input  = sub(paste0('((\\s*\\w+){', stop.at.word, '}).*'), '\\1', string),
          output = sub(paste0('(\\s*\\w+){', stop.at.word, '}\\s*(\\w+).*'), '\\2', string))
   , by = stop.at.word][]
#                                                      string word.count stop.at.word
# 1:                                                      one          1            0
# 2:                                                two words          2            1
# 3:                                         three words here          3            2
# 4:                                  four useless words here          4            2
# 5:                      five useless meaningless words here          5            4
# 6:                  six useless meaningless words here just          6            3
# 7:             seven useless meaningless words here just to          7            3
# 8:        eigth useless meaningless words here just to fill          8            6
# 9:      nine useless meaningless words here just to fill up          9            7
#10: ten useless meaningless words here just to fill up space         10            5
#                                          input output
# 1:                                                one
# 2:                                         two  words
# 3:                                 three words   here
# 4:                                four useless  words
# 5:              five useless meaningless words   here
# 6:                     six useless meaningless  words
# 7:                   seven useless meaningless  words
# 8:   eigth useless meaningless words here just     to
# 9: nine useless meaningless words here just to   fill
#10:          ten useless meaningless words here   just
I am not sure I understand the logic of having nothing in the first row of output, but if that is really needed, the trivial fix is left to the OP.
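For readers who want to check the two sub() patterns outside data.table, here is a minimal base R sketch on one string. The names s and n are hypothetical stand-ins for a single value of string and stop.at.word.

```r
# One sample string; n plays the role of stop.at.word for this string.
s <- "four useless words here"
n <- 2

# First pattern: capture the first n words (group 1) and drop the rest.
input <- sub(paste0("((\\s*\\w+){", n, "}).*"), "\\1", s)

# Second pattern: skip the first n words, capture the next word (group 2), drop the rest.
output <- sub(paste0("(\\s*\\w+){", n, "}\\s*(\\w+).*"), "\\2", s)

print(input)
print(output)
```

sub() works with a single pattern at a time, which is presumably why the data.table call groups by stop.at.word: within each group paste0() produces one distinct pattern.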