R: How to create clusters based on row strings

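I am trying to create clusters from my data based on the string value of each row. I am using R. By "cluster" I mean a broad theme (= family) that can describe each keyword. I imagine something generated automatically from the keywords, perhaps using lemmatization or n-grams.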

For example, the keywords "cloud services" and "the cloud service" should both end up in the "service" cluster.

Here is my input vector:

    keywords_df <- c("cloud storage", "cloud computing", "google cloud storage", "the cloud service",
                     "free cloud storage", "what is cloud computing", "best cloud storage", "cloud computing definition",
                     "amazon cloud services", "cloud service providers", "cloud services", "google cloud computing",
                     "cloud computing services", "benefits of cloud computing")

Here is the expected output data frame:

| Keyword                     | Thematic  |
|-----------------------------|:---------:|
| cloud storage               | storage   |
| cloud computing             | computing |
| google cloud storage        | storage   |
| the cloud service           | service   |
| free cloud storage          | storage   |
| what is cloud computing     | computing |
| best cloud storage          | storage   |
| cloud computing definition  | computing |
| amazon cloud services       | service   |
| cloud service providers     | service   |
| cloud services              | service   |
| google cloud computing      | computing |
| cloud computing services    | service   |
| benefits of cloud computing | computing |

The goal is to clean the data in the "Keyword" column and automatically extract a kind of lemma or n-gram from it.

Here is what I have done so far:

  1. Create a "Thematic" column from the Keyword column:

    library(dplyr)
    # Assumes keywords_df is a data frame with a Keyword column at this point
    keywords_df <- mutate(keywords_df, Thematic = Keyword)
    keywords_df$Thematic <- as.character(keywords_df$Thematic)
    
  2. Remove stop words:

    library(tm)
    stopwords_list <- c("cloud")  # also remove the main word
    # Combine the English stop words from tm with the custom list
    all_stopwords <- c(stopwords(kind = "en"), stopwords_list)
    keywords_df$Thematic <- removeWords(keywords_df$Thematic, all_stopwords)
    

You can use grepl() to check whether certain words are present, for example storage, computing, and service. This way, you can check whether a given word occurs in df:
    # df is the character vector of keywords (keywords_df in the question)
    fams   <- c("storage", "computing", "service")
    family <- rep("empty_fam", length(df))

    # Later entries in fams overwrite earlier ones when a keyword matches several
    for (fam in fams) {
      family[grepl(fam, df)] <- fam
    }
    cbind(df, family)
    #      df                            family
    # [1,] "cloud storage"               "storage"
    # [2,] "cloud computing"             "computing"
    # ...
    #[13,] "cloud computing services"    "service"
    #[14,] "benefits of cloud computing" "computing"
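
The loop above can be wrapped in a small function that returns a data frame in the shape the question asks for. This is only a minimal sketch: the helper name `assign_family` and the keyword "cloud pricing" are made up for illustration, the latter to show what happens when no family word matches.

```r
# Hypothetical helper (not from any package): tag each keyword with a family word.
# When a keyword contains several family words, the last one in fams wins.
assign_family <- function(keywords, fams, default = "empty_fam") {
  family <- rep(default, length(keywords))
  for (fam in fams) {
    # fixed = TRUE treats fam as a literal substring rather than a regex
    family[grepl(fam, keywords, fixed = TRUE)] <- fam
  }
  data.frame(Keyword = keywords, Thematic = family, stringsAsFactors = FALSE)
}

# "cloud pricing" is an invented keyword that matches no family
keywords <- c("cloud storage", "cloud computing services", "cloud pricing")
result <- assign_family(keywords, c("storage", "computing", "service"))
print(result)
```

Keywords that match nothing keep the placeholder, so they are easy to filter out or inspect afterwards.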

There is certainly a better way to do this, though.


Edit: a better way, using stringr:

    library(stringr)
    # str_extract() returns the first match in each string, or NA if there is none
    family <- str_extract(df, pattern = "storage|computing|service")
    cbind(df, family)
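
One difference from the grepl() loop is worth checking on your own data: `str_extract()` keeps the first match found in each string, whereas the loop keeps the last family word in `fams` that matches. The sketch below illustrates this on three keywords; "cloud pricing" is an invented example with no match, and taking the last element of `str_extract_all()` is just one possible way to pick a different match.

```r
library(stringr)

# "cloud pricing" is an invented keyword that matches no family
keywords <- c("cloud storage", "cloud computing services", "cloud pricing")
pattern  <- "storage|computing|service"

# First match per keyword; NA where nothing matches
first_match <- str_extract(keywords, pattern)
print(first_match)
# "storage" "computing" NA

# All matches per keyword, then keep the last one (NA if there is none)
all_matches <- str_extract_all(keywords, pattern)
last_match <- vapply(
  all_matches,
  function(m) if (length(m) > 0) m[length(m)] else NA_character_,
  character(1)
)
print(last_match)
# "storage" "service" NA
```

Which of the two is right depends on which word should define the family when a keyword such as "cloud computing services" contains more than one candidate.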

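Edit 2: I saw your latest edit, which indicates that you are looking for family descriptions that are not specified in advance. In that case, the first approach that comes to mind is Latent Dirichlet Allocation (LDA, not to be confused with Linear Discriminant Analysis).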

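LDA analyzes a corpus of documents, identifies latent topics as distributions over words (shown by terms(lda.output) below), and identifies which document belongs to which topic (shown by topics(lda.output) below):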

    library(topicmodels)
    library(tm)

    # Preliminary text mining
    corpus <- Corpus(VectorSource(df))
    corpus <- tm_map(corpus, removeWords, "cloud")
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stemDocument)  # needs the SnowballC package installed

    # Term frequency matrix
    TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))

    # LDA starts from a random initialization, so results can differ between runs
    lda.output <- LDA(TF, k = 3)
    terms(lda.output)
    # Topic 1  Topic 2  Topic 3
    # "servic" "comput" "storag"

    # Label each keyword with the top term of its assigned topic
    cbind(df, terms(lda.output)[topics(lda.output)])
    #            df
    #Topic 3 "cloud storage"               "storag"
    #Topic 2 "cloud computing"             "comput"
    #Topic 3 "google cloud storage"        "storag"
    #Topic 1 "cloud services"              "servic"
    #Topic 3 "free cloud storage"          "storag"
    #Topic 2 "what is cloud computing"     "comput"
    #Topic 3 "best cloud storage"          "storag"
    #Topic 1 "cloud computing definition"  "servic"
    #Topic 1 "amazon cloud services"       "servic"
    #Topic 3 "cloud service providers"     "storag"
    #Topic 2 "google cloud services"       "comput"
    #Topic 2 "google cloud computing"      "comput"
    #Topic 1 "cloud computing services"    "servic"
    #Topic 2 "benefits of cloud computing" "comput"
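
Because the assignments above can change from run to run, it may help to fix a seed and collect the result in a data frame. The following is a sketch under a few assumptions: the seed value 1234 is arbitrary, the SnowballC package is assumed to be installed for `stemDocument`, and the column names `Keyword`, `Topic`, and `Thematic` are chosen to mirror the question rather than taken from the original answer.

```r
library(tm)
library(topicmodels)

keywords <- c("cloud storage", "cloud computing", "google cloud storage", "the cloud service",
              "free cloud storage", "what is cloud computing", "best cloud storage",
              "cloud computing definition", "amazon cloud services", "cloud service providers",
              "cloud services", "google cloud computing", "cloud computing services",
              "benefits of cloud computing")

# Same preprocessing idea as above: drop "cloud", then stem
corpus <- Corpus(VectorSource(keywords))
corpus <- tm_map(corpus, removeWords, "cloud")
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)

# Fixing the seed makes repeated runs give the same fit (the value is arbitrary)
fit <- LDA(dtm, k = 3, control = list(seed = 1234))

topic_id <- unname(topics(fit))   # most likely topic per keyword
top_term <- unname(terms(fit))    # most likely term per topic

result <- data.frame(
  Keyword  = keywords,
  Topic    = topic_id,
  Thematic = top_term[topic_id],
  stringsAsFactors = FALSE
)
print(result)
```

With only fourteen very short documents the topics are not guaranteed to line up with storage, computing, and service, so the table is worth inspecting by eye before relying on it.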

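Final note: if you want to get "computing" rather than "comput", and so on, you should change the stemming step in the text mining. You can also leave it out, but then "service" and "services" will not be recognized as the same word. You can, however, manually replace "service" with "services", or vice versa.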
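
As a rough illustration of the manual replacement mentioned above, the plural can be mapped to the singular with `gsub()` before building the corpus, so that the stemming step can be skipped. This sketch handles only the one word pair from this example; other plurals would need their own rules.

```r
keywords <- c("cloud services", "the cloud service",
              "cloud service providers", "cloud computing")

# \\b limits the match to whole words; perl = TRUE selects PCRE syntax
normalized <- gsub("\\bservices\\b", "service", keywords, perl = TRUE)
print(normalized)
# "cloud service" "the cloud service" "cloud service providers" "cloud computing"
```

The normalized vector can then be passed to `VectorSource()` in place of the original keywords.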