ngram text as separate columns in R

Question

I got a list of several bigram texts from ngram and want to add them as columns to the original data table.

> prep_test
                                                                                          prep_test
 1:                      Women Athletic,Athletic Apparel,Apparel Pants,Pants Tights,Tights Leggings
 2:                                                                       Beauty Makeup,Makeup Face
 3:                                                                       Beauty Makeup,Makeup Face
 4:     Electronics Cell,Cell Phones,Phones Accessories,Accessories Cases,Cases Covers,Covers Skins
 5:                                                                         Women Shoes,Shoes Boots
 6:                                                   Men Men,Men s,s Accessories,Accessories Belts
 7: Electronics Cell,Cell Phones,Phones Accessories,Accessories Cell,Cell Phones,Phones Smartphones
 8:                                                           Women Tops,Tops Blouses,Blouses Other
 9:                      Women Athletic,Athletic Apparel,Apparel Pants,Pants Tights,Tights Leggings
10:                                                Home Home,Home DÃ,DÃ cor,cor Home,Home Fragrance



str(prep_test)
Classes ‘data.table’ and 'data.frame':  10 obs. of  1 variable:
 $ prep_test:List of 10
  ..$ : chr  "Women Athletic" "Athletic Apparel" "Apparel Pants" "Pants Tights" ...
  ..$ : chr  "Beauty Makeup" "Makeup Face"
  ..$ : chr  "Beauty Makeup" "Makeup Face"
  ..$ : chr  "Electronics Cell" "Cell Phones" "Phones Accessories" "Accessories Cases" ...
  ..$ : chr  "Women Shoes" "Shoes Boots"
  ..$ : chr  "Men Men" "Men s" "s Accessories" "Accessories Belts"
  ..$ : chr  "Electronics Cell" "Cell Phones" "Phones Accessories" "Accessories Cell" ...
  ..$ : chr  "Women Tops" "Tops Blouses" "Blouses Other"
  ..$ : chr  "Women Athletic" "Athletic Apparel" "Apparel Pants" "Pants Tights" ...
  ..$ : chr  "Home Home" "Home DÃ" "DÃ cor" "cor Home" ...
 - attr(*, ".internal.selfref")=<externalptr>

Current code that generates the n-grams for the column

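library(ngram)       # provides ngram_asweka()
library(data.table)

bigram_fun <- function(y){
  # Collapse runs of punctuation and blanks into a single space
  y <- gsub("[[:punct:][:blank:]]+", " ", y)
  # Tokenize into bigrams (n-grams of exactly two words)
  y <- ngram_asweka(y, min=2, max=2)
  #y <- str_split_fixed(y, ",", n=Inf)
  #y <- unlist(y)
  return(y)
}

# all is the full data set; column 9 holds the text to tokenize
prep_test <- all[1:10, 9]
# One character vector of bigrams per row, so the result is a list
prep_test <- apply(prep_test, 1, bigram_fun)
prep_test <- data.table(prep_test)
prep_test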

dput output here

> dput(prep_test)
list(c("Women Athletic", "Athletic Apparel", "Apparel Pants", 
"Pants Tights", "Tights Leggings"), c("Beauty Makeup", "Makeup Face"
), c("Beauty Makeup", "Makeup Face"), c("Electronics Cell", "Cell Phones", 
"Phones Accessories", "Accessories Cases", "Cases Covers", "Covers Skins"
), c("Women Shoes", "Shoes Boots"), c("Men Men", "Men s", "s Accessories", 
"Accessories Belts"), c("Electronics Cell", "Cell Phones", "Phones Accessories", 
"Accessories Cell", "Cell Phones", "Phones Smartphones"), c("Women Tops", 
"Tops Blouses", "Blouses Other"), c("Women Athletic", "Athletic Apparel", 
"Apparel Pants", "Pants Tights", "Tights Leggings"), c("Home Home", 
"Home DÃ", "DÃ cor", "cor Home", "Home Fragrance"))

Expected result

Bigram 1           Bigram 2           Bigram 3             Bigram 4            ...
"Women Athletic"   "Athletic Apparel" "Apparel Pants"      "Pants Tights"      ...
"Beauty Makeup"    "Makeup Face"      NA                   NA                  ...
"Beauty Makeup"    "Makeup Face"      NA                   NA                  ...
"Electronics Cell" "Cell Phones"      "Phones Accessories" "Accessories Cases" ...
"Women Shoes"      "Shoes Boots"      NA                   NA                  ...

Thanks for any answers, and sorry for the poorly asked question; I am a newbie here.

Answer 1

This should work:

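library(plyr)
# mylist is the list of bigram vectors (the dput output in the question)
# Each vector becomes a one-row data frame; rbind.fill pads missing columns with NA
df <- rbind.fill(lapply(mylist, function(x) as.data.frame(t(x), stringsAsFactors = FALSE)))
colnames(df) <- paste0("Bigram ", seq_len(ncol(df)))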

Output:

           Bigram 1         Bigram 2           Bigram 3          Bigram 4        Bigram 5           Bigram 6
1    Women Athletic Athletic Apparel      Apparel Pants      Pants Tights Tights Leggings               <NA>
2     Beauty Makeup      Makeup Face               <NA>              <NA>            <NA>               <NA>
3     Beauty Makeup      Makeup Face               <NA>              <NA>            <NA>               <NA>
4  Electronics Cell      Cell Phones Phones Accessories Accessories Cases    Cases Covers       Covers Skins
5       Women Shoes      Shoes Boots               <NA>              <NA>            <NA>               <NA>
6           Men Men            Men s      s Accessories Accessories Belts            <NA>               <NA>
7  Electronics Cell      Cell Phones Phones Accessories  Accessories Cell     Cell Phones Phones Smartphones
8        Women Tops     Tops Blouses      Blouses Other              <NA>            <NA>               <NA>
9    Women Athletic Athletic Apparel      Apparel Pants      Pants Tights Tights Leggings               <NA>
10        Home Home          Home DÃ             DÃ cor          cor Home  Home Fragrance               <NA>

Hope this helps!
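The question also asks for the bigram columns to be attached to the original table, which the snippet above stops short of. Here is a minimal base-R sketch of that last step, with no plyr dependency. The names `bigram_list`, `original`, and `item_id` are made up for illustration, and it assumes the list has one element per row of the original table, in the same order.

```r
# Hypothetical stand-in for the list of bigram vectors (ragged lengths).
bigram_list <- list(
  c("Women Athletic", "Athletic Apparel", "Apparel Pants"),
  c("Beauty Makeup", "Makeup Face"),
  c("Women Shoes", "Shoes Boots")
)

# Hypothetical stand-in for the original table, one row per list element.
original <- data.frame(item_id = 1:3)

# Pad every vector with NA up to the longest length, then stack them as rows.
max_len <- max(lengths(bigram_list))
padded <- lapply(bigram_list, function(x) { length(x) <- max_len; x })
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("Bigram ", seq_len(max_len))

# Attach the new columns to the original table by row position.
result <- cbind(original, wide)
print(result)
```

Because the columns are attached by position, this only makes sense if no rows were dropped or reordered between building the list and binding it back.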

Answer 2

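We can convert each set of bigrams to a data frame, bind them into one melted (long-format) data frame, and then cast that to a wide-format tidy data set, as follows.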

theBigrams <- list(c("Women Athletic", "Athletic Apparel", "Apparel Pants", 
"Pants Tights", "Tights Leggings"), c("Beauty Makeup", "Makeup Face"),
 c("Beauty Makeup", "Makeup Face"), c("Electronics Cell", "Cell Phones", 
"Phones Accessories", "Accessories Cases", "Cases Covers", "Covers Skins"
), c("Women Shoes", "Shoes Boots"), c("Men Men", "Men s", "s Accessories", 
"Accessories Belts"), c("Electronics Cell", "Cell Phones", "Phones Accessories", 
"Accessories Cell", "Cell Phones", "Phones Smartphones"), c("Women Tops", 
"Tops Blouses", "Blouses Other"), c("Women Athletic", "Athletic Apparel", 
"Apparel Pants", "Pants Tights", "Tights Leggings"), c("Home Home", 
"Home DÃ", "DÃ cor", "cor Home", "Home Fragrance"))

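# Build one long-format data frame: a row per (id, bigram position, value)
meltedBigrams <- do.call(rbind,lapply(seq_along(theBigrams),function(i) {
     x <- theBigrams[[i]]
     bigram <- seq_along(x)
     id <- rep(i,length(x))
     data.frame(id,bigram,value=x,stringsAsFactors=FALSE)
}))
library(reshape2)
# Cast to wide format: one row per id, one column per bigram position
castData <- dcast(meltedBigrams,id ~ bigram, value.var="value")
castData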

...and the output:

> castData
   id                1                2                  3                 4               5                  6
1   1   Women Athletic Athletic Apparel      Apparel Pants      Pants Tights Tights Leggings               <NA>
2   2    Beauty Makeup      Makeup Face               <NA>              <NA>            <NA>               <NA>
3   3    Beauty Makeup      Makeup Face               <NA>              <NA>            <NA>               <NA>
4   4 Electronics Cell      Cell Phones Phones Accessories Accessories Cases    Cases Covers       Covers Skins
5   5      Women Shoes      Shoes Boots               <NA>              <NA>            <NA>               <NA>
6   6          Men Men            Men s      s Accessories Accessories Belts            <NA>               <NA>
7   7 Electronics Cell      Cell Phones Phones Accessories  Accessories Cell     Cell Phones Phones Smartphones
8   8       Women Tops     Tops Blouses      Blouses Other              <NA>            <NA>               <NA>
9   9   Women Athletic Athletic Apparel      Apparel Pants      Pants Tights Tights Leggings               <NA>
10 10        Home Home          Home DÃ             DÃ cor          cor Home  Home Fragrance               <NA>
>
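If reshape2 is not available, the same long-to-wide step can be sketched with base R's `reshape()`. This is only an illustration under stated assumptions: `long` is a made-up, shortened stand-in with the same three columns as `meltedBigrams`, and the final renaming to "Bigram n" is a display choice borrowed from the question's expected result.

```r
# Hypothetical long-format data in the same shape as meltedBigrams.
long <- data.frame(
  id     = c(1, 1, 1, 2, 2),
  bigram = c(1, 2, 3, 1, 2),
  value  = c("Women Athletic", "Athletic Apparel", "Apparel Pants",
             "Beauty Makeup", "Makeup Face"),
  stringsAsFactors = FALSE
)

# One row per id, one column per bigram position; missing cells become NA.
wide <- reshape(long, idvar = "id", timevar = "bigram", direction = "wide")

# reshape() names the new columns value.1, value.2, ...; rename for display.
names(wide) <- sub("^value\\.", "Bigram ", names(wide))
rownames(wide) <- NULL
print(wide)
```

The `id` column is kept, which makes it straightforward to merge the wide table back onto the original data by key rather than by row position.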

Tags: r, n-gram