Take tokens from the same line in R programming
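Using R, I need to extract tokens with ngram = 2 (bigrams) from a file.
The problem is that it merges the lines, so some tokens have one part at the end of a line and the other part at the start of the next line.
Req_tok <- jobs %>% unnest_tokens(ngram, POSITION, token = "ngrams", n = 2)
In the file jobs, the first two lines are: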
it architect
it helpdesk support agents
The tokens I get are:
it architect
architect it
it helpdesk
and so on ....
How can I avoid getting tokens like "architect it"?
I want to tokenize each line separately.
Just add collapse = FALSE to unnest_tokens:
library(tidytext)
library(dplyr)
jobs %>%
unnest_tokens(ngram, POSITION, token = "ngrams", n = 2, collapse = FALSE)
Result:
ngram
1 it architect
2 it helpdesk
2.1 helpdesk support
2.2 support agents
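As a package-independent cross-check of what "tokenize each line separately" should produce, here is a minimal base R sketch. It is not how tidytext works internally, and bigrams_per_line is a hypothetical helper name introduced only for illustration. It can be handy because the meaning of the collapse argument appears to have changed across tidytext releases (the package documentation describes newer versions as not collapsing rows when collapse is NULL), so it is worth checking ?unnest_tokens for the version you have installed.

```r
# Hypothetical helper (not part of tidytext): bigrams built per line, base R only.
bigrams_per_line <- function(lines) {
  # Lowercase and split each line on whitespace, giving one word vector per line.
  words <- strsplit(tolower(trimws(lines)), "\\s+")
  lapply(words, function(w) {
    # A line with fewer than two words yields no bigram.
    if (length(w) < 2) return(character(0))
    # Pair each word with its successor within the same line only.
    paste(w[-length(w)], w[-1])
  })
}

jobs <- data.frame(
  POSITION = c("it architect", "it helpdesk support agents"),
  stringsAsFactors = FALSE
)

per_line <- bigrams_per_line(jobs$POSITION)

# One row per bigram, keeping the number of the line it came from.
result <- data.frame(
  line  = rep(seq_along(per_line), lengths(per_line)),
  ngram = unlist(per_line),
  stringsAsFactors = FALSE
)
print(result)
```

Because each line is split and paired on its own, no bigram such as "architect it" can span two lines.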
If POSITION is a factor variable, remember to convert it to character first, otherwise unnest_tokens will throw an error.
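The conversion mentioned above can be sketched as follows. The data frame name jobs_factor is an assumption made for this example; it mimics what you get when the file is read with stringsAsFactors = TRUE.

```r
# Hypothetical data frame whose text column was read in as a factor.
jobs_factor <- data.frame(
  POSITION = c("it architect", "it helpdesk support agents"),
  stringsAsFactors = TRUE
)

# Check the column type before tokenizing.
is.factor(jobs_factor$POSITION)

# Convert the factor to plain character strings.
jobs_factor$POSITION <- as.character(jobs_factor$POSITION)

# The column is now safe to pass to a tokenizer that expects text.
is.character(jobs_factor$POSITION)
```

Using as.character() on the factor returns the labels themselves, not the underlying integer codes, so the text is preserved.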
Data:
jobs = data.frame(POSITION = c("it architect", "it helpdesk support agents"), stringsAsFactors = FALSE)