R - Document Term Matrix with comma separated text column entries
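I have a data frame with a column of strings (project_skills) listing the skills offered for a given job (job_id). I want to split this string for each job to get a vector of the skills it offers, and then create a document term matrix that indicates which skills (out of all possible skills) each job offers.
I have the following data frame: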
job_id project_skills
107182 CSS,HTML,Joomla,PHP
108169 XTCommerce,Magento,Prestashop,VirtueMart,osCommerce
112969 Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#
114660 Marketing,Email Marketing
118686 PHP
The result should look like this (essentially a document term matrix of the comma-separated phrases):
project_skills
job_id CSS HTML PHP Google Search Console Google Analytics Java ...
107182 1 0 0 ...
108169 0 0 0 0 0
112969 0 0 0 1 1 ...
114660 0 0 0 ...
118686 0 0 1 ...
I tried the following:
df <- data.frame(job_id = c(107182, 108169, 112969, 114660, 118686), project_skills = c("CSS,HTML,Joomla,PHP", "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce", "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#", "Marketing,Email Marketing", "PHP"))
library(tm)
corpus <- Corpus(VectorSource(df$project_skills))
corpus <- tm_map(corpus, function(x) {
  PlainTextDocument(
    strsplit(x, ",")[[1]],  # "\," is not a valid escape in an R string; a plain comma is enough
    id = ID(x)
  )
})
inspect(corpus)
dtm <- DocumentTermMatrix(corpus)
as.matrix(dtm)
Unfortunately, this splits on every word rather than on the commas (for example, Google Search Console should be treated as a single term in the DTM).
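To make the difference concrete, here is a minimal base R sketch. The word-level step is only a simplified imitation of a tokenizer that strips punctuation and splits on whitespace; it is an assumption for illustration, not tm's actual code.

```r
skills <- "Google Search Console,Google Analytics,C++,Java,C#"

# Simplified word-level tokenization (assumption: an imitation, not tm's code):
# replace punctuation with spaces, then split on runs of whitespace.
words <- strsplit(gsub("[[:punct:]]", " ", skills), "[[:space:]]+")[[1]]

# Phrase-level tokenization: split on literal commas only.
phrases <- strsplit(skills, ",", fixed = TRUE)[[1]]

print(words)
print(phrases)
```

In the word-level result, "Google Search Console" falls apart into three tokens and "C++" and "C#" both collapse to "C", whereas the comma split keeps each skill intact.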
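tm (and some other text mining packages) split on words (whitespace) and, if you are not careful, tend to remove punctuation such as + and #. The simplest option is just to use strsplit. Below I show one option using tidyr and dplyr. First group by job_id, then split the column. This creates a nested list column which, when unnested, yields a long data.frame. I then add a value of 1 for every entry, which acts as the presence indicator in a document term matrix. Finally, spread to wide format to get the expected output. If you inspect the structure of the result, the column names are what you would expect, without the tilde (~) that the truncated printout shows.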
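library(tidyr)
library(dplyr)

# df1 is the data frame from the question, with project_skills stored as character (not factor)
outcome <- df1 %>%
  group_by(job_id) %>%
  mutate(project_skills = strsplit(project_skills, ",")) %>%  # list column of skills per job
  unnest(project_skills) %>%                                  # one row per job/skill pair
  mutate(value = 1) %>%                                       # add 1 for every value
  spread(key = project_skills, value = value)                 # use fill = 0 if you don't want NAs
head(outcome)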
# A tibble: 5 x 18
# Groups: job_id [5]
job_id `C#` `C++` CSS `Email Marketin~ `Google Analyti~ `Google Search ~ `Google Webmast~ HTML Java Joomla Magento Marketing
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 107182 NA NA 1 NA NA NA NA 1 NA 1 NA NA
2 108169 NA NA NA NA NA NA NA NA NA NA 1 NA
3 112969 1 1 NA NA 1 1 1 NA 1 NA NA NA
4 114660 NA NA NA 1 NA NA NA NA NA NA NA 1
5 118686 NA NA NA NA NA NA NA NA NA NA NA NA
# ... with 5 more variables: osCommerce <dbl>, PHP <dbl>, Prestashop <dbl>, VirtueMart <dbl>, XTCommerce <dbl>
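For comparison, the same idea can be sketched in base R alone: split with strsplit, build a long table of job/skill pairs, and cross-tabulate. This is an illustrative alternative rather than part of the answer above; the names `skills`, `long` and `dtm` are made up for the example, and absent skills come out as 0 rather than NA.

```r
# Data from the question; stringsAsFactors = FALSE keeps project_skills as character.
df <- data.frame(
  job_id = c(107182, 108169, 112969, 114660, 118686),
  project_skills = c(
    "CSS,HTML,Joomla,PHP",
    "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce",
    "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#",
    "Marketing,Email Marketing",
    "PHP"
  ),
  stringsAsFactors = FALSE
)

# Split on literal commas only, so multi-word skills stay intact.
skills <- strsplit(df$project_skills, ",", fixed = TRUE)

# Long format: one (job_id, skill) pair per row.
long <- data.frame(
  job_id = rep(df$job_id, lengths(skills)),
  skill  = unlist(skills)
)

# Cross-tabulate into a job-by-skill count matrix,
# then cap counts at 1 so the matrix records presence/absence.
dtm <- unclass(table(long$job_id, long$skill))
dtm[dtm > 1] <- 1L

dim(dtm)                                   # jobs x distinct skills
dtm["112969", "Google Search Console"]     # is this skill listed for job 112969?
dtm["118686", c("CSS", "PHP")]
```

Row and column names are character strings, so jobs are looked up as `"112969"` rather than by number.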
There are many solutions for this, but strsplit is your friend. That is exactly what the following code does:
library(udpipe)
df <- data.frame(job_id = c(107182, 108169, 112969, 114660, 118686), project_skills = c("CSS,HTML,Joomla,PHP", "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce", "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#", "Marketing,Email Marketing", "PHP"),
stringsAsFactors = FALSE)
dtm <- document_term_frequencies(x = df$project_skills, document = df$job_id, split = ",")
dtm <- document_term_matrix(dtm)
colnames(dtm)
[1] "C#" "C++" "CSS" "Email Marketing"
[5] "Google Analytics" "Google Search Console" "Google Webmaster Central" "HTML"
[9] "Java" "Joomla" "Magento" "Marketing"
[13] "osCommerce" "PHP" "Prestashop" "VirtueMart"
[17] "XTCommerce"
rownames(dtm)
[1] "107182" "108169" "112969" "114660" "118686"
dim(dtm)
[1] 5 17