Looping to tokenize using text2vec

Edited to shorten and provide sample data.

I have text data from 8 questions that were asked of a number of participants twice. I want to use text2vec to compare the similarity of their answers to these questions at the two time points (duplicate detection). Here is the structure of my initial data (in this example there are only 3 participants, 4 questions instead of 8, and 2 quarters/time periods). I want to run a similarity comparison of each participant's quarter 1 response against their quarter 2 response. I intend to use the psim command from the text2vec package to do this.

df<-read.table(text="ID,Quarter,Question,Answertext
               Joy,1,And another question,adsfjasljsdaf jkldfjkl
               Joy,2,And another question,dsadsj jlijsad jkldf 
               Paul,1,And another question,adsfj aslj sd afs dfj ksdf
               Paul,2,And another question,dsadsj jlijsad
               Greg,1,And another question,adsfjasljsdaf
               Greg,2,And another question, asddsf asdfasd sdfasfsdf
               Joy,1,this is the first question that was asked,this is joys answer to this question
               Joy,2,this is the first question that was asked,this is joys answer to this question
               Paul,1,this is the first question that was asked,this is Pauls answer to this question
               Paul,2,this is the first question that was asked,Pauls answer is different 
               Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
               Greg,2,this is the first question that was asked,this is Gregs answer to this question
               Joy,1,This is the text of another question,more random text
               Joy,2,This is the text of another question, adkjjlj;ds sdafd
               Paul,1,This is the text of another question,more random text
               Paul,2,This is the text of another question, adkjjlj;ds sdafd
               Greg,1,This is the text of another question,more random text
               Greg,2,This is the text of another question,sdaf asdfasd asdff
               Joy,1,this was asked second.,some random text
               Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
               Paul,1,this was asked second.,some random text
               Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
               Greg,1,this was asked second.,some random text
               Greg,2,this was asked second.,ada dasdffasdf asdf  asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE,sep=',')

Having thought about it some more, I believe the correct approach is to split the dataframe into a list of dataframes rather than into separate objects.

questlist<-split(df,f=df$Question)

Then write a function to create the vocabulary for each question.

library(text2vec)

vocabmkr<-function(x) { itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary() %>% prune_vocabulary(term_count_min = 2) %>% vocab_vectorizer() }

test<-lapply(questlist, vocabmkr)

But then I thought I would need to split the original dataframe into question-quarter combinations and apply the vocabulary from the other list to each of them, and I'm not sure how to go about that.

Ultimately, I want a similarity score that tells me whether participants repeated some or all of their quarter 1 answer in quarter 2.

Edit: Here is how I would do this for a single question, starting from the dataframe above.

library(dplyr) # filter() and full_join() below come from dplyr

quest1 <- filter(df,Question=="this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids=quest1$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()

quest1q1<-filter(quest1,Quarter==1)
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1

quest1q2<-filter(quest1,Quarter==2) 
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2

#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)

similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.

b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
endproduct<-full_join(b,quest1)
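
One thing worth guarding against (a suggested check, not something this sample necessarily needs): psim2 compares the two matrices row by row, so the participant IDs must come out in the same order in both quarters. A quick assertion on the two DTMs would catch a mismatch:

# quest1q1 and quest1q2 are now DTMs whose rownames are the participant IDs
stopifnot(identical(rownames(quest1q1), rownames(quest1q2)))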

Edit: OK, I have made some progress with lapply.

df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question

vocabmkr<-function(x) {
  itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}

vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.

dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each list is a combination of question and quarter (4 questions over 2 quarters)

How do I apply the vocab list (4 elements) to the dfqq list (8 elements)?
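
For reference, one way to line the two lists up is base R's Map() over the question-quarter subsets, paired with the vocabulary list repeated once per quarter. This is only a sketch, and it assumes split() has kept the Question levels in the same order in both dfqq and vocab (Question varies fastest, so the first four elements of dfqq are quarter 1 and the last four are quarter 2):

dtm_list <- Map(function(d, vectorizer) {
  it <- itoken(as.character(d$Answertext), ids = d$ID)
  create_dtm(it, vectorizer)   # apply this question's vectorizer to one question-quarter subset
}, dfqq, rep(vocab, times = 2))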

I gave up and did it manually, one dataframe at a time. I'm sure there is an easy way to do it as a list, but I can't for the life of me figure out how to apply a list of functions (the vocab vectorizers) to the "Answertext" column of the dataframes in the list.

For all its power, R sorely lacks a simple for loop that lets you swap text into a command (à la Stata's "foreach"). I know there is a different workflow that involves breaking the dataframe into a list and iterating over it, but for some tasks that complicates things enormously, requiring elaborate indexing to refer not only to the list but to the particular vectors contained in it. I also recognize that assign and paste0 can be used to get Stata-like behavior, but, like most code in R, it is clumsy and obtuse. Sigh.
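
(For what it's worth, the assign/paste0 pattern alluded to above would look roughly like the sketch below, reusing the vocabmkr function from earlier; purely illustrative, and arguably no less clumsy than the list-based approach.)

for (q in unique(as.character(df$Question))) {
  # build one vectorizer per question and stash it in a variable whose name is
  # derived from the question text, e.g. vocab_This.is.the.text.of.another.question
  assign(paste0("vocab_", make.names(q)), vocabmkr(df[df$Question == q, ]))
}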

Sorry it has been so frustrating. If you have more of these to do and really want a more automated way, here is an approach that might work for you:

First, convert your example code for a single dataframe into a function:

analyze_vocab <- function(df_) {
  quest1vocab =
    itoken(as.character(df_$Answertext), ids = df_$ID) %>%
    create_vocabulary() %>%
    prune_vocabulary(term_count_min = 1) %>%
    vocab_vectorizer()

  quarter1 = filter(df_, Quarter == 1)
  quarter1 = itoken(as.character(quarter1$Answertext), 
                    ids = quarter1$ID)

  quarter2 = filter(df_, Quarter == 2)
  quarter2 = itoken(as.character(quarter2$Answertext),
                    ids = quarter2$ID)

  q1mat = create_dtm(quarter1, quest1vocab)
  q2mat = create_dtm(quarter2, quest1vocab)

  similarity = psim2(q1mat, q2mat, method = "jaccard", norm = "none")

  b = data.frame(
    ID = names(similarity),
    Similarity = similarity)

  output <- full_join(b, df_)
  return(output)
}

Now, if you want, you can split and then lapply, like this: lapply(split(df, df$Question), analyze_vocab). However, you already seem comfortable with pipes, so you might as well go with this approach:

similarity_df <- df %>% 
  group_by(Question) %>%
  do(analyze_vocab(.))

Output:

> head(similarity_df, 12)
# A tibble: 12 x 5
# Groups:   Question [2]
   ID    Similarity Quarter Question                                  Answertext                                           
   <fct>      <dbl>   <int> <fct>                                     <fct>                                                
 1 Joy        0           1 And another question                      adsfjasljsdaf jkldfjkl                               
 2 Joy        0           2 And another question                      "dsadsj jlijsad jkldf "                              
 3 Paul       0           1 And another question                      adsfj aslj sd afs dfj ksdf                           
 4 Paul       0           2 And another question                      dsadsj jlijsad                                       
 5 Greg       0           1 And another question                      adsfjasljsdaf                                        
 6 Greg       0           2 And another question                      " asddsf asdfasd sdfasfsdf"                          
 7 Joy        1           1 this is the first question that was asked this is joys answer to this question                 
 8 Joy        1           2 this is the first question that was asked this is joys answer to this question                 
 9 Paul       0.429       1 this is the first question that was asked this is Pauls answer to this question                
10 Paul       0.429       2 this is the first question that was asked "Pauls answer is different "                         
11 Greg       0.667       1 this is the first question that was asked this is Gregs answer to this question nearly the same
12 Greg       0.667       2 this is the first question that was asked this is Gregs answer to this question 

The values in Similarity match the ones shown in your example endproduct (note that the displayed values are rounded for the tibble printout), so it appears to be working as intended.
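
As a rough sanity check on what those scores mean (my reading of psim2 with method = "jaccard" and norm = "none" is shared unique terms over total unique terms across the two answers): Paul's two answers to the first question share 3 terms out of 7 distinct terms in total, and 3/7 ≈ 0.429, which matches the value shown for Paul above. A base-R illustration:

a1 <- unique(strsplit("this is Pauls answer to this question", " ")[[1]])
a2 <- unique(strsplit("Pauls answer is different", " ")[[1]])
length(intersect(a1, a2)) / length(union(a1, a2))  # 3 shared / 7 total = 0.4285714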