How to apply findAssocs to each row of a data.frame

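I created a data.frame to hold my words and their frequencies. Now I want to run findAssocs on every row of the data frame, but I can't get my code to work. Any help is appreciated.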

Here is my example data.frame, term.df:

term.df <- data.frame(word = names(v),freq=v)

word freq
ounce 8917
pack 6724
count 4992
organic 3696
frozen 2534
free 1728

I created a TermDocumentMatrix, tdm, and the following code works as expected.

findAssocs(tdm, 'frozen', 0.20) 

I would like to append the output of findAssocs as a new column.

Here is the code I tried:

library(dplyr)
library(tm)
library(pbapply)

#I would like to append all findings in a new column

res <- merge(do.call(rbind.data.frame, pblapply(term.df, findAssocs(tdm, term.df$word , 0.18))),
              term.df[, c("word")], by.x="list.q", by.y="word", all.x=TRUE)

Edit: As for the output, a single call like the one above gives me something like this.

$yogurt
  greek ellenos     fat chobani  dannon    fage yoplait  nonfat wallaby 
   0.62    0.36    0.25    0.24    0.24    0.24    0.24    0.22    0.20 

I was hoping to add a column (ASSOC) to my original table that holds the results as comma-separated name:value pairs, but I am really open to ideas.

I think the easiest structure to work with is a nested list:

lapply(seq_len(nrow(text.df)), function(i) {
  list(word=text.df$word[i],
       freq=text.df$freq[i],
       assoc=findAssocs(tdm, as.character(text.df$word[i]), 0.7)[[1]])
})
# [[1]]
# [[1]]$word
# [1] "oil"
# 
# [[1]]$freq
# [1] 3
# 
# [[1]]$assoc
#      15.8      opec   clearly      late    trying       who    winter  analysts 
#      0.87      0.87      0.80      0.80      0.80      0.80      0.80      0.79 
#      said   meeting     above emergency    market     fixed      that    prices 
#      0.78      0.77      0.76      0.75      0.75      0.73      0.73      0.72 
# agreement    buyers 
#      0.71      0.70 
# 
# 
# [[2]]
# [[2]]$word
# [1] "opec"
# 
# [[2]]$freq
# [1] 2
# 
# [[2]]$assoc
#    meeting  emergency        oil       15.8   analysts     buyers      above 
#       0.88       0.87       0.87       0.85       0.85       0.83       0.82 
#       said    ability       they    prices.  agreement        but    clearly 
#       0.82       0.80       0.80       0.79       0.76       0.74       0.74 
#  december.   however,       late production       sell     trying        who 
#       0.74       0.74       0.74       0.74       0.74       0.74       0.74 
#     winter      quota       that    through        bpd     market 
#       0.74       0.73       0.73       0.73       0.70       0.70 
# 
# 
# [[3]]
# [[3]]$word
# [1] "xyz"
# 
# [[3]]$freq
# [1] 1
# 
# [[3]]$assoc
# numeric(0)

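In my experience this is easier to work with than a nested string, because you can still get at the word associations for each row of the original text.df object by accessing the corresponding element of the output list.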
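To illustrate what accessing the corresponding element looks like, here is a minimal sketch. It does not call tm at all: res is a hand-built stand-in for the list produced by the lapply() call above, with only the first three associations per word copied from the printed output. The names res, oil_opec, n_assoc and top_term are made up for this example.

```r
# Hand-built stand-in for the nested list returned by the lapply() call above.
# Only the first few associations from the printed output are copied here.
res <- list(
  list(word = "oil",  freq = 3, assoc = c("15.8" = 0.87, opec = 0.87, clearly = 0.80)),
  list(word = "opec", freq = 2, assoc = c(meeting = 0.88, emergency = 0.87, oil = 0.87)),
  list(word = "xyz",  freq = 1, assoc = numeric(0))
)

# Name the elements by word so that a row can be looked up by its term.
names(res) <- vapply(res, function(r) r$word, character(1))

# Association between "oil" and "opec".
oil_opec <- res[["oil"]]$assoc[["opec"]]

# Number of associated terms per word (0 when findAssocs found nothing).
n_assoc <- vapply(res, function(r) length(r$assoc), integer(1))

# Strongest associated term per word, NA when there is none.
top_term <- vapply(res, function(r) {
  if (length(r$assoc) == 0) NA_character_ else names(r$assoc)[which.max(r$assoc)]
}, character(1))
```

Because the associations stay numeric, filtering or sorting them needs no string parsing, which is the main advantage over packing them into a single text column.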

If you really want to keep the data frame structure, you can easily convert the findAssocs output to a string representation, for example with toJSON:

library(RJSONIO)
text.df$assoc <- sapply(text.df$word, function(x) toJSON(findAssocs(tdm, x, 0.7)[[1]], collapse=""))
text.df
#   word freq
# 1  oil    3
# 2 opec    2
# 3  xyz    1
#                                                                                                                                                                                                                                                                                                                                                                                                                                                                        assoc
# 1 { "15.8":   0.87,"opec":   0.87,"clearly":    0.8,"late":    0.8,"trying":    0.8,"who":    0.8,"winter":    0.8,"analysts":   0.79,"said":   0.78,"meeting":   0.77,"above":   0.76,"emergency":   0.75,"market":   0.75,"fixed":   0.73,"that":   0.73,"prices":   0.72,"agreement":   0.71,"buyers":    0.7 }
# 2 { "meeting":   0.88,"emergency":   0.87,"oil":   0.87,"15.8":   0.85,"analysts":   0.85,"buyers":   0.83,"above":   0.82,"said":   0.82,"ability":    0.8,"they":    0.8,"prices.":   0.79,"agreement":   0.76,"but":   0.74,"clearly":   0.74,"december.":   0.74,"however,":   0.74,"late":   0.74,"production":   0.74,"sell":   0.74,"trying":   0.74,"who":   0.74,"winter":   0.74,"quota":   0.73,"that":   0.73,"through":   0.73,"bpd":    0.7,"market":    0.7 }
# 3 [  ]
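If you would rather have the comma-separated name:value format mentioned in the question than JSON, a base R sketch along these lines might do. format_assoc is a hypothetical helper written for this example, not part of tm, and the sample vector is typed in by hand from the yogurt output shown in the question.

```r
# Hypothetical helper: turn a named numeric vector, such as one element of
# the list returned by findAssocs(), into a "name:value, name:value" string.
format_assoc <- function(x) {
  # Terms without any association yield numeric(0); return an empty string.
  if (length(x) == 0) return("")
  paste(names(x), x, sep = ":", collapse = ", ")
}

# Sample values copied by hand from the yogurt output in the question.
yogurt <- c(greek = 0.62, ellenos = 0.36, fat = 0.25)

yogurt_str <- format_assoc(yogurt)
empty_str  <- format_assoc(numeric(0))
```

With the tdm and text.df objects from the data section, the column could then presumably be filled with text.df$assoc <- vapply(text.df$word, function(w) format_assoc(findAssocs(tdm, w, 0.7)[[1]]), character(1)).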

Data:

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
(text.df <- data.frame(word=c("oil", "opec", "xyz"), freq=c(3, 2, 1), stringsAsFactors=FALSE))
#   word freq
# 1  oil    3
# 2 opec    2
# 3  xyz    1