
Remove rows with character(0) from a data.frame before proceeding to dtm

I am analyzing a data frame of product reviews that contains some empty entries and some text written in foreign languages. The data also includes some customer attributes that can be used as "features" in later analysis.

My plan is to first convert the reviews column into a DocumentTermMatrix and then into lda format. I then intend to pass the documents and vocab objects generated by the lda process, together with selected columns from the original data frame, into stm's prepDocuments() function, so that I can use that package's more versatile estimation functions, with customer attributes as features for predicting topic salience.

However, some empty cells, punctuation, and foreign characters get removed during preprocessing, which creates character(0) rows in lda's documents object and makes those reviews impossible to match to the corresponding rows in the original data frame. Ultimately, this prevents me from generating the desired stm object with prepDocuments().

There are indeed ways to remove empty documents (such as the one recommended in a previous thread), but I wonder whether there is also a way to remove the rows of the original data frame that correspond to the empty documents, so that the number of lda documents and the row dimension of the data frame that will serve as meta in the stm functions stay aligned. Would indexing help?

Part of my data is listed below.

df = data.frame(reviews = c("buenisimoooooo", "excelente", "excelent", 
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone", 
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase", 
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late", 
"good phone good reception home fringe area screen lovely just right size good buy", "@#haha", "phone verizon contract phone buyer beware", "这东西太棒了", 
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund", 
"good phone price fine", "phone star battery little soon yes"), 
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1), 
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon", 
           "amazon", "zappos", "newegg", "amazon", "amazon", 
           "amazon", "amazon", "amazon", "zappos", "amazon", 
           "amazon", "newegg", "amazon", "amazon", "amazon"))

Adopting tidy data principles really offers a good solution here. First, "annotate" the data frame you shared with a new column that tracks doc_id, i.e. which document each word belongs to, and then use unnest_tokens() to convert it into a tidy data structure.

library(tidyverse)
library(tidytext)
library(stm)

df <- tibble(reviews = c("buenisimoooooo", "excelente", "excelent", 
                         "awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone", 
                         "phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase", 
                         "//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late", 
                         "good phone good reception home fringe area screen lovely just right size good buy", "@#haha", "phone verizon contract phone buyer beware", "这东西太棒了", 
                         "excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund", 
                         "good phone price fine", "phone star battery little soon yes"), 
             rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1), 
             source = c("amazon", "bestbuy", "amazon", "newegg", "amazon", 
                        "amazon", "zappos", "newegg", "amazon", "amazon", 
                        "amazon", "amazon", "amazon", "zappos", "amazon", 
                        "amazon", "newegg", "amazon", "amazon", "amazon"))


tidy_df <- df %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, reviews)

tidy_df
#> # A tibble: 154 x 4
#>    rating source  doc_id word          
#>     <dbl> <chr>    <int> <chr>         
#>  1      4 amazon       1 buenisimoooooo
#>  2      4 bestbuy      2 excelente     
#>  3      4 amazon       3 excelent      
#>  4      4 newegg       4 awesome       
#>  5      4 newegg       4 phone         
#>  6      4 newegg       4 awesome       
#>  7      4 newegg       4 price         
#>  8      4 newegg       4 almost        
#>  9      4 newegg       4 month         
#> 10      4 newegg       4 issue         
#> # … with 144 more rows

Notice that you still have all the information you had before; it is all still there, just arranged in a different structure. You can fine-tune the tokenization process to fit your particular analysis needs, perhaps handling the non-English text however is appropriate, or keeping/not keeping punctuation, etc. This is where the empty documents get dropped, if that is appropriate for you.
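For example, one way to do that fine-tuning (a sketch, not part of the original answer; the regular expressions and the choice to drop numeric and non-ASCII tokens are illustrative assumptions) is to add filter() steps after unnest_tokens():

```r
library(dplyr)
library(stringr)
library(tidytext)

tidy_df_filtered <- df %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, reviews) %>%
  # drop purely numeric tokens such as "1111111"
  filter(!str_detect(word, "^[0-9]+$")) %>%
  # drop tokens containing non-ASCII characters (e.g. the CJK review)
  filter(!str_detect(word, "[^\\x01-\\x7F]"))
```

Documents whose every token is filtered out simply disappear from the tidy data frame, which is what keeps the later steps aligned.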

Next, cast this tidy data structure into a sparse matrix, for use in topic modeling. The columns correspond to words and the rows correspond to documents.

sparse_reviews <- tidy_df %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)

colnames(sparse_reviews) %>% head()
#> [1] "buenisimoooooo" "excelente"      "excelent"       "almost"        
#> [5] "awesome"        "blu"
rownames(sparse_reviews) %>% head()
#> [1] "1" "2" "3" "4" "5" "8"

Next, from the tidy dataset you already have, make a data frame of covariate (i.e. meta) information to use in the topic modeling.

covariates <- tidy_df %>%
  distinct(doc_id, rating, source)

covariates
#> # A tibble: 18 x 3
#>    doc_id rating source 
#>     <int>  <dbl> <chr>  
#>  1      1      4 amazon 
#>  2      2      4 bestbuy
#>  3      3      4 amazon 
#>  4      4      4 newegg 
#>  5      5      4 amazon 
#>  6      8      4 newegg 
#>  7      9      1 amazon 
#>  8     10      4 amazon 
#>  9     11      3 amazon 
#> 10     12      1 amazon 
#> 11     13      4 amazon 
#> 12     14      3 zappos 
#> 13     15      1 amazon 
#> 14     16      2 amazon 
#> 15     17      4 newegg 
#> 16     18      4 amazon 
#> 17     19      1 amazon 
#> 18     20      1 amazon
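Because covariates is built with distinct() from the same tidy_df that produced the sparse matrix, its rows line up with the matrix rows, and the empty documents (6, 7, and 14 here) are absent from both. If you want to verify that alignment explicitly, a quick sanity check (my addition, not part of the original answer) is:

```r
# the row names of the sparse matrix are the surviving doc_ids;
# they should match the covariate data frame row for row
stopifnot(identical(rownames(sparse_reviews),
                    as.character(covariates$doc_id)))
```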

Now you can put these together into stm(). For example, if you want to train a topic model with document-level covariates, looking at whether topics change a) with source and b) smoothly with rating, you would do something like this:

topic_model <- stm(sparse_reviews, K = 0, init.type = "Spectral",
                   prevalence = ~source + s(rating),
                   data = covariates,
                   verbose = FALSE)
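Once the model is fitted, you can inspect it and estimate covariate effects with stm's own helpers. A sketch (not part of the original answer; the topic indices 1:3 are illustrative, since K = 0 with Spectral initialization lets the algorithm choose the number of topics):

```r
# highest-probability and FREX words per topic
labelTopics(topic_model, n = 7)

# estimate how topic prevalence varies with the covariates
effects <- estimateEffect(1:3 ~ source + s(rating),
                          topic_model,
                          metadata = covariates)
summary(effects)
```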

Created on 2019-08-03 by the reprex package (v0.3.0)