Getting the last 10 words from a string, applied on a vector of strings
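I have a vector of text in a data frame (df1$text), and I am trying to create a new vector (df1$last.ten) holding the last 10 words of each text. I tried the following without success: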
df1$last.ten = mapply(function(x,y) paste(word(x,y), collapse=" "), df1$text, -1:-10)
But I only get one word instead of a string of ten words:
> df1$last.ten[1]
[1] "end."
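It works fine when I give it a single string, so I seem to be using mapply incorrectly. I also tried gsub for this but couldn't figure out the syntax. I would prefer a word() or gsub() solution.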
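A likely reason for the single word: mapply() steps through its vector arguments in parallel, so each text is paired with just one index from -1:-10 rather than with all ten. The sketch below illustrates that pairing on hypothetical data (not the asker's data frame) and then uses the fact that stringr's word() is vectorised over strings and accepts negative start/end positions counted from the last word. It assumes the stringr package is installed and that every text has at least 10 words.

```r
library(stringr)  # provides word(); assumed to be installed

# Hypothetical example data, not the asker's actual data frame
df1 <- data.frame(text = c(
  "one two three four five six seven eight nine ten eleven twelve",
  "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu nu"
))

# mapply() pairs element i of the text with element i of the index vector:
# the first text only ever sees -1, the second only sees -2.
one.word <- mapply(function(x, y) word(x, y), df1$text, c(-1, -2),
                   USE.NAMES = FALSE)

# word() already works on a whole vector; negative positions count
# backwards from the last word, so no mapply() is needed.
df1$last.ten <- word(df1$text, start = -10, end = -1)
```

Here one.word holds a single word per text, which matches the symptom in the question, while df1$last.ten holds ten words per text.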
Here is a base R option:
#example data
df1 <- data.frame(text = c('This is a long text which consists of words more than 10',
'This is another one which is similar to first one but even longer'))
# split each string on whitespace and paste its last 10 words back together
df1$last.ten <- sapply(strsplit(df1$text, '\\s+'), function(x)
  paste0(tail(x, 10), collapse = ' '))
df1
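The same strsplit()/tail() idea can be wrapped in a small helper so the edge cases are explicit. This is only a sketch: the name last_n_words and its handling of NA and surrounding whitespace are assumptions for illustration, not part of the answer above.

```r
# Hypothetical helper built on the strsplit()/tail() idea above.
last_n_words <- function(x, n = 10) {
  x <- as.character(x)                   # also accepts factor columns
  words <- strsplit(trimws(x), "\\s+")   # trim first to avoid an empty leading token
  out <- vapply(words,
                function(w) paste(tail(w, n), collapse = " "),
                character(1))
  out[is.na(x)] <- NA_character_         # keep missing values missing
  out
}

short  <- last_n_words("one two", 10)    # fewer than n words: returned unchanged
padded <- last_n_words("  a b c  ", 2)
```

Because tail() simply returns what is available, texts shorter than n words come back whole instead of raising an error.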
If this is your data frame (toy data):
df1
text
1 one two three four five six seven eight nine ten eleven twelve
2 one two three four five six seven eight nine ten eleven twelve
3 one two three four five six seven eight nine ten eleven twelve
then extract the last 10 words like this:
rnge <- 10:1
df1$last.ten <- apply( t(apply( as.data.frame(df1$text), 1, function(x)
rev( unlist( strsplit(x, " ") ) ) )[rnge,]), 1, paste, collapse=" " )
df1
text
1 one two three four five six seven eight nine ten eleven twelve
2 one two three four five six seven eight nine ten eleven twelve
3 one two three four five six seven eight nine ten eleven twelve
last.ten
1 three four five six seven eight nine ten eleven twelve
2 three four five six seven eight nine ten eleven twelve
3 three four five six seven eight nine ten eleven twelve
If you adjust the range rnge, this extracts words from any position, counted from the end:
rnge <- 5:3
df1$mid <- apply( t(apply( as.data.frame(df1$text), 1, function(x)
rev( unlist( strsplit(x, " ") ) ) )[rnge,]), 1, paste, collapse=" " )
df1
text
1 one two three four five six seven eight nine ten eleven twelve
2 one two three four five six seven eight nine ten eleven twelve
3 one two three four five six seven eight nine ten eleven twelve
last.ten mid
1 three four five six seven eight nine ten eleven twelve eight nine ten
2 three four five six seven eight nine ten eleven twelve eight nine ten
3 three four five six seven eight nine ten eleven twelve eight nine ten
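One caveat with the matrix approach: the inner apply() only returns a matrix when every row has the same number of words, as in the toy data. A minimal sketch of the same reverse-and-index idea applied row by row, so rows may differ in length, could look like this. The helper name words_from_end and the choice to drop out-of-range positions are assumptions made for illustration.

```r
# Hypothetical helper: pick words by position counted from the end
# (1 = last word), one row at a time.
words_from_end <- function(x, rnge) {
  vapply(strsplit(x, " ", fixed = TRUE), function(w) {
    r <- rev(w)[rnge]                     # same indexing as the matrix version
    paste(r[!is.na(r)], collapse = " ")   # drop positions the text does not have
  }, character(1))
}

texts <- c("one two three four five six seven eight nine ten eleven twelve",
           "a b c d")
mid      <- words_from_end(texts, 5:3)
last.ten <- words_from_end(texts, 10:1)
```

For the twelve-word row this gives the same values as the last.ten and mid columns above; the four-word row just yields whatever words exist in the requested range.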
I made some sample data. Perhaps you don't need an apply function at all.
library(stringr)
df1 <- data.frame(text = c("one two three four five six seven eight nine ten eleven","one two three four five six seven eight nine ten eleven twelve"))
# count the words in each string, then take the words from (count - 9) to count
df1$last.ten <- word(df1[[1]], str_count(df1[[1]], '\\w+') - 9, str_count(df1[[1]], '\\w+'))
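One thing to watch with this approach: str_count() with '\\w+' counts runs of word characters, while word() splits on single spaces by default, so the two can disagree on text containing apostrophes or hyphens. The sketch below, on a made-up sentence, shows the mismatch and one possible adjustment (counting runs of non-whitespace with '\\S+'). It assumes stringr is installed and that words are separated by single spaces.

```r
library(stringr)  # assumed to be installed

# Made-up sentence with an apostrophe and two hyphenated words
x <- "I don't like half-baked answers to well-posed questions about text in R"

n.w <- str_count(x, "\\w+")   # "don't" counts as two, "half-baked" as two, ...
n.s <- str_count(x, "\\S+")   # counts space-separated tokens, like word() does

# Using the space-based count keeps start and end inside the range word() sees
last.ten <- word(x, n.s - 9, n.s)
```

With the '\\w+' count the computed end position would overshoot the number of space-separated words in this sentence, so the '\\S+' count is the safer pairing with word()'s default separator.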