Understanding another's text-mining function that removes similar strings

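I'm trying to replicate the methodology from this article, 538 Post about Most Repetitive Phrases, in which the author mined US presidential debate transcripts to determine the most repetitive phrases for each candidate.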

I'm trying to implement this methodology on another dataset in R, using the tm package.

Most of the code (GitHub repository) concerns mining the transcripts and assembling counts of each ngram, but I get lost at the prune_substrings() function code below:

def prune_substrings(tfidf_dicts, prune_thru=1000):

    pruned = tfidf_dicts

    for candidate in range(len(candidates)):
        # growing list of n-grams in list form
        so_far = []

        ngrams_sorted = sorted(tfidf_dicts[candidate].items(), key=operator.itemgetter(1), reverse=True)[:prune_thru]
        for ngram in ngrams_sorted:
            # contained in a previous aka 'better' phrase
            for better_ngram in so_far:
                if overlap(list(better_ngram), list(ngram[0])):
                    #print "PRUNING!! "
                    #print list(better_ngram)
                    #print list(ngram[0])

                    pruned[candidate][ngram[0]] = 0
            # not contained, so add to so_far to prevent future subphrases
            else:
                so_far += [list(ngram[0])]

    return pruned 

The function's input, tfidf_dicts, is a collection of dictionaries (one per candidate) with ngrams as keys and tf-idf scores as values. For example, Trump's tf-idf dict begins like this:

trump.tfidf.dict = {"we don't win": 83.2, 'you have to': 72.8, ... }

So the input is structured like this:

tfidf_dicts = {trump.tfidf.dict, rubio.tfidf.dict, etc }
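To make the input and the sorting step concrete, here is a minimal sketch in valid Python. It assumes the collection is a list holding one dict per candidate (which is what indexing by position in prune_substrings suggests) and that the keys are plain strings as in the example above. The first two scores come from that example; every other phrase, score, and variable name is made up for illustration.

```python
import operator

# Hypothetical input: one dict per candidate, mapping ngram -> tf-idf score.
# Only the first two entries come from the example above; the rest are invented.
trump_tfidf = {"we don't win": 83.2, "you have to": 72.8, "made-up phrase": 10.0}
rubio_tfidf = {"another made-up phrase": 55.1, "yet another one": 12.3}
tfidf_dicts = [trump_tfidf, rubio_tfidf]

prune_thru = 2  # small cutoff so the slice is visible

# Same expression as in prune_substrings: sort the (ngram, score) pairs by
# score, highest first, then keep only the first prune_thru pairs.
ngrams_sorted = sorted(
    tfidf_dicts[0].items(), key=operator.itemgetter(1), reverse=True
)[:prune_thru]

print(ngrams_sorted)
```

Because of reverse=True, the resulting list runs from the highest score down, so phrases that appear earlier in it are the "better" ones that later phrases get compared against.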

My understanding is that prune_substrings does the following, but I'm stuck on the else if clause, which is a pythonic thing I don't understand yet.

A. create list : pruned as tfidf_dicts; a list of tfidf dicts for each candidate

B. loop through each candidate:

  1. so_far = start an empty list of ngrams gone through so far
  2. ngrams_sorted = sorted member's tf-idf dict from biggest to smallest (reverse=True), cut to the top prune_thru
  3. loop through each ngram in sorted
    • loop through each better_ngram in so_far
      1. IF overlap b/w (below) == TRUE:
        • better_ngram (from so_far) and
        • ngram (from ngrams_sorted)
        • THEN zero out tf-idf for ngram
      2. ELSE if (WHAT?!?)
        • add ngram to list, so_far

C. return pruned, i.e. list of unique ngrams sorted in order

Any help is greatly appreciated!

Please mind the indentation in the code: the else is lined up with the second for, not with the if. This is a for-else construct, not an if-else.

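In this case the else effectively initializes the inner loop: it executes the first time through, when so_far is still empty, and again every time the inner loop runs out of items to iterate over. Since the inner loop contains no break, that means on every pass.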

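I'm not sure this is the most efficient way to implement these comparisons, but conceptually you can get a sense of the flow from this snippet: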

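s = []
for j in "ABCD":
    # print everything collected so far on one line
    for i in s:
        print(i, end=" ")
    else:
        # runs once the inner loop is exhausted (there is no break)
        print("\nelse")
        s.append(j)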

Output:

else
A 
else
A B 
else
A B C 
else
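
As a side note, the else block of a for loop is skipped only when the loop is left through break. The sketch below shows both outcomes; the function name describe and its messages are made up for illustration and are not part of the original code.

```python
def describe(numbers, target):
    """Report whether the loop below ended via break or ran to completion."""
    for n in numbers:
        if n == target:
            result = "found: loop left via break, else skipped"
            break
    else:
        # Reached only when the loop finished without hitting break,
        # which includes the case of an empty sequence.
        result = "not found: loop ran to completion, else executed"
    return result


print(describe([1, 2, 3], 2))
print(describe([1, 2, 3], 9))
print(describe([], 9))
```

prune_substrings has no break in its inner loop, so its else block runs for every ngram, whether or not that ngram was just zeroed out.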

I think there are better ways to do this in R than nested loops.

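Four months later, but here's my solution. I'm sure there is a more efficient one, but it worked for my purposes. The pythonic for-else doesn't translate to R, so the steps are different.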

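  1. Take the top n ngrams.
  2. Create a list, t, in which each element is a logical vector of length n indicating whether the ngram in question overlaps each of the other ngrams (with elements 1:x automatically set to FALSE).
  3. cbind the elements of t into a table, t2.
  4. Return only the ngrams whose row in t2 sums to zero, i.e. those with no overlap once elements 1:x have been set to FALSE.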

Voilà!
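
For readers coming from the Python side of the question, the four steps can be sketched roughly as follows. This is an illustration under assumptions, not a port of the R code: the names prune_overlapping and shares_word are hypothetical, the overlap test is passed in as a function, and shares_word is only a stand-in that treats any shared word as an overlap.

```python
def prune_overlapping(ngrams, overlap, prune_thru=100):
    """ngrams: strings already sorted best-first; overlap: (a, b) -> bool."""
    # Step 1: keep only the top-ranked ngrams.
    top = ngrams[:prune_thru]
    n = len(top)

    # Step 2: one boolean vector per ngram x. Entry i is True only when
    # ngram i is ranked below x and overlaps it (entries up to x stay False).
    columns = [
        [i > x and overlap(top[x], top[i]) for i in range(n)]
        for x in range(n)
    ]

    # Steps 3 and 4: read the vectors as columns of a table and keep the
    # rows that sum to zero, i.e. ngrams no higher-ranked ngram overlaps.
    keep = [sum(columns[x][i] for x in range(n)) == 0 for i in range(n)]
    return [ngram for ngram, k in zip(top, keep) if k]


def shares_word(a, b):
    # Stand-in overlap test for illustration only: True if a and b share a word.
    return bool(set(a.split()) & set(b.split()))


ranked = ["we will win", "win so much", "great again", "believe me"]
print(prune_overlapping(ranked, shares_word))
```

With the made-up list above, "win so much" is dropped because it shares a word with the higher-ranked "we will win", and the other three phrases are kept.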

PrunedList function

library(dplyr)  # provides %>% and select()

#' GetPrunedList
#' 
#' takes a word freq df with columns Words and LenNorm, returns df of nonoverlapping strings
GetPrunedList <- function(wordfreqdf, prune_thru = 100) {
        #take only first n items in list
        tmp <- head(wordfreqdf, n = prune_thru) %>%
                select(ngrams = Words, tfidfXlength = LenNorm)
        #for each ngram in list:
        t <- (lapply(1:nrow(tmp), function(x) {
                #find overlap between ngram and all items in list (overlap = TRUE)
                idx <- overlap(tmp[x, "ngrams"], tmp$ngrams)
                #set overlap as false for itself and higher-scoring ngrams
                idx[1:x] <- FALSE
                idx
        }))
        
        #bind each ngram's overlap vector together to make a matrix
        t2 <- do.call(cbind, t)   
        
        #find rows (i.e. ngrams) that no higher-scoring ngram overlaps
        idx <- rowSums(t2) == 0
        pruned <- tmp[idx,]
        rownames(pruned) <- NULL
        pruned
}

overlap function

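library(stringr)  # provides word(), str_detect() and coll()

#' overlap
#' OBJ: takes two ngrams (as strings) and checks whether they overlap
#' INPUT: a,b ngrams as strings
#' OUTPUT: TRUE if overlap
overlap <- function(a, b) {
        # CountWords() is a helper that is not shown in this post
        max_overlap <- min(3, CountWords(a), CountWords(b))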
        
        a.beg <- word(a, start = 1L, end = max_overlap)
        a.end <- word(a, start = -max_overlap, end = -1L)
        b.beg <- word(b, start = 1L, end = max_overlap)
        b.end <- word(b, start = -max_overlap, end = -1L)
        
        # b contains a's beginning
        w <- str_detect(b, coll(a.beg, TRUE))
        # b contains a's end
        x <- str_detect(b, coll(a.end, TRUE))
        # a contains b's beginning
        y <- str_detect(a, coll(b.beg, TRUE))
        # a contains b's end
        z <- str_detect(a, coll(b.end, TRUE))
        
        #return TRUE if any of above are true
        (w | x | y | z)
}
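
For comparison, here is a rough Python sketch of the same overlap test for a single pair of strings. It rests on a few assumptions: words are separated by whitespace (standing in for CountWords, which is not shown), case-insensitive matching is approximated with lower() rather than the collation-based comparison that coll() performs, and the parameter name max_words is made up.

```python
def overlap(a, b, max_words=3):
    """True if the first or last few words of one ngram occur in the other."""
    a_words, b_words = a.lower().split(), b.lower().split()
    k = min(max_words, len(a_words), len(b_words))
    if k == 0:
        return False  # an empty string overlaps nothing

    # Normalised full strings plus the first and last k words of each.
    a_full, b_full = " ".join(a_words), " ".join(b_words)
    a_beg, a_end = " ".join(a_words[:k]), " ".join(a_words[-k:])
    b_beg, b_end = " ".join(b_words[:k]), " ".join(b_words[-k:])

    # Substring checks, mirroring the four str_detect() calls above.
    return (
        a_beg in b_full
        or a_end in b_full
        or b_beg in a_full
        or b_end in a_full
    )


print(overlap("we don't win", "we don't win anymore"))
print(overlap("you have to", "believe me"))
```

As in the R version, these are character-level substring checks, so a fragment can also match inside a longer word.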