Understanding another's text-mining function that removes similar strings
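I am trying to replicate the methodology from this article, 538 Post about Most Repetitive Phrases, in which the author mined US presidential debate transcripts to determine the most repetitive phrases for each candidate.
I am trying to implement this methodology on another dataset in R with the tm package.
Most of the code (GitHub repository) concerns mining the transcripts and assembling counts of each ngram, but I get lost at the prune_substrings() function code below: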
def prune_substrings(tfidf_dicts, prune_thru=1000):
    pruned = tfidf_dicts
    for candidate in range(len(candidates)):
        # growing list of n-grams in list form
        so_far = []
        ngrams_sorted = sorted(tfidf_dicts[candidate].items(), key=operator.itemgetter(1), reverse=True)[:prune_thru]
        for ngram in ngrams_sorted:
            # contained in a previous aka 'better' phrase
            for better_ngram in so_far:
                if overlap(list(better_ngram), list(ngram[0])):
                    #print "PRUNING!! "
                    #print list(better_ngram)
                    #print list(ngram[0])
                    pruned[candidate][ngram[0]] = 0
            # not contained, so add to so_far to prevent future subphrases
            else:
                so_far += [list(ngram[0])]
    return pruned
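The input of the function, tfidf_dicts, is an array of dictionaries (one for each candidate) with ngrams as keys and tf-idf scores as values. For example, Trump's tf-idf dict begins like this:
trump.tfidf.dict = {"we don't win": 83.2, 'you have to': 72.8, ... }
So the structure of the input is like this:
tfidf_dicts = {trump.tfidf.dict, rubio.tfidf.dict, etc }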
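To make that shape concrete, here is a small runnable sketch. Only the two Trump scores come from the example above; every other entry is a made-up placeholder, top_ngrams is a hypothetical helper name, and I am assuming the container is a list indexed by position, which is what tfidf_dicts[candidate] in the function suggests.

```python
import operator

# Per-candidate dicts: n-gram -> tf-idf score.
# Only the first two Trump entries come from the post; the rest are placeholders.
trump_tfidf = {"we don't win": 83.2, "you have to": 72.8, "placeholder phrase": 1.0}
rubio_tfidf = {"another placeholder": 2.0, "yet another placeholder": 5.0}

# Assumed container: a list, so tfidf_dicts[candidate] works with an integer index.
tfidf_dicts = [trump_tfidf, rubio_tfidf]


def top_ngrams(tfidf_dict, prune_thru=1000):
    """Return (ngram, score) pairs, highest score first, cut off at prune_thru."""
    ranked = sorted(tfidf_dict.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[:prune_thru]


print(top_ngrams(tfidf_dicts[0], prune_thru=2))
```

This mirrors the sorted(...) line in prune_substrings: because of reverse=True, the best-scoring ngrams come first, so anything compared later is a lower-scoring candidate.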
My understanding is that prune_substrings does the following, but I'm stuck on the else if clause, which is a pythonic thing I don't understand yet.
A. create list: pruned as tfidf_dicts; a list of tfidf dicts for each candidate
B. loop through each candidate:
    - so_far = start an empty list of ngrams gone through so far
    - ngrams_sorted = sorted member's tf-idf dict from biggest to smallest
    - loop through each ngram in sorted
        - loop through each better_ngram in so_far
            - IF overlap b/w (below) == TRUE:
                - better_ngram (from so_far) and
                - ngram (from ngrams_sorted)
                - THEN zero out tf-idf for ngram
        - ELSE if (WHAT?!?)
            - add ngram to list, so_far
C. return pruned, i.e. list of unique ngrams sorted in order
Any help is much appreciated!
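Note the indentation in the code... the else is lined up with the second for, not the if. This is a for-else construct, not an if-else.
In this case, the else is used to initialize the inner loop: it executes the first time, when so_far is empty, and again each time the inner loop runs out of items to iterate over. (A for loop's else branch runs whenever the loop finishes without hitting a break, and this loop has no break.)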
I'm not sure this is the most efficient way to implement these comparisons, but conceptually you can follow the flow with this snippet:
s = []
for j in "ABCD":
    for i in s:
        print(i, end=" ")
    else:
        print("\nelse")
        s.append(j)
Output:
else
A
else
A B
else
A B C
else
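To see why the else always fires in this code, it may help to compare a loop that never breaks with one that does. This is a minimal sketch, not the 538 code; loop_without_break and loop_with_break are hypothetical names made up for the illustration.

```python
def loop_without_break(items):
    """The else branch runs every time, because the loop never breaks."""
    log = []
    for item in items:
        log.append(item)
    else:
        log.append("else")
    return log


def loop_with_break(items, stop_at):
    """The else branch is skipped when the loop exits through break."""
    log = []
    for item in items:
        if item == stop_at:
            log.append("break")
            break
        log.append(item)
    else:
        log.append("else")
    return log


print(loop_without_break([]))            # empty loop: else still runs
print(loop_without_break(["A", "B"]))    # loop exhausted: else runs
print(loop_with_break(["A", "B"], "B"))  # break taken: else skipped
print(loop_with_break(["A", "B"], "Z"))  # no break: else runs
```

Since prune_substrings has no break in its inner loop, its else body runs for every ngram, so each ngram is appended to so_far whether or not it was just zeroed out. If the intent were to skip pruned ngrams, a break after the zeroing line would do that, but that is a guess about intent, not something the code shown here does.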
I think there's a better way to do this in R than nested loops....
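4 months later, but here's my solution. I'm sure there is a more efficient one, but for my purposes it worked. The pythonic for-else doesn't translate to R, so the steps are different.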
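- Take the top n ngrams.
- Create a list, t, where each element is a logical vector of length n that says whether the ngram in question overlaps each of the other ngrams (but with elements 1:x fixed to FALSE automatically).
- cbind every element of t together into a table, t2.
- Return only the ngrams whose row in t2 sums to zero.
- Set elements 1:n to FALSE (i.e. no overlap).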
Voila!
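For readers more at home in Python, the steps above can be sketched roughly as follows. This is an illustration under assumptions, not a port of the R code: shares_word is a hypothetical stand-in for the real overlap test, prune_by_overlap_matrix is a made-up name, the example phrases are invented, and the input is assumed to be a list of ngrams already sorted best-first.

```python
def shares_word(a, b):
    """Hypothetical overlap test: True if the two ngrams share any word."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def prune_by_overlap_matrix(ngrams, overlaps=shares_word, prune_thru=100):
    """Keep ngrams that overlap no higher-ranked ngram (input sorted best-first)."""
    top = ngrams[:prune_thru]
    n = len(top)
    # Column x: does ngram x overlap each ngram? Entries for rows 0..x are
    # forced to False so an ngram is never flagged by itself or by worse ones.
    columns = [
        [row > x and overlaps(top[x], top[row]) for row in range(n)]
        for x in range(n)
    ]
    # Row sums count the higher-ranked ngrams that overlap each row's ngram.
    row_sums = [sum(columns[x][row] for x in range(n)) for row in range(n)]
    return [ngram for ngram, total in zip(top, row_sums) if total == 0]


# Invented example phrases, listed best-first.
ranked = ["we will win", "win so much", "you have to", "have to say", "believe me"]
print(prune_by_overlap_matrix(ranked))
```

Building the full n-by-n table costs the same number of comparisons as the nested loops; the gain is only that the logic becomes a mask plus a row sum, which is the shape that vectorizes nicely in R.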
GetPrunedList function
library(dplyr) # provides %>% and select()

#' GetPrunedList
#'
#' takes a word freq df with columns Words and LenNorm, returns df of nonoverlapping strings
GetPrunedList <- function(wordfreqdf, prune_thru = 100) {
  # take only first n items in list
  tmp <- head(wordfreqdf, n = prune_thru) %>%
    select(ngrams = Words, tfidfXlength = LenNorm)
  # for each ngram in list:
  t <- lapply(seq_len(nrow(tmp)), function(x) {
    # find overlap between ngram and all items in list (overlap = TRUE)
    idx <- overlap(tmp[x, "ngrams"], tmp$ngrams)
    # set overlap as FALSE for itself and higher-scoring ngrams
    idx[1:x] <- FALSE
    idx
  })
  # bind each ngram's overlap vector together to make a matrix
  t2 <- do.call(cbind, t)
  # keep rows (i.e. ngrams) that overlap with no higher-scoring ngram
  idx <- rowSums(t2) == 0
  pruned <- tmp[idx, ]
  rownames(pruned) <- NULL
  pruned
}
Overlap function
library(stringr) # provides word(), str_detect() and coll()

#' overlap
#' OBJ: takes two ngrams (as strings) and checks whether they overlap
#' INPUT: a,b ngrams as strings
#' OUTPUT: TRUE if overlap
#' NOTE: CountWords() is a helper that is not shown in this post
overlap <- function(a, b) {
  max_overlap <- min(3, CountWords(a), CountWords(b))
  a.beg <- word(a, start = 1L, end = max_overlap)
  a.end <- word(a, start = -max_overlap, end = -1L)
  b.beg <- word(b, start = 1L, end = max_overlap)
  b.end <- word(b, start = -max_overlap, end = -1L)
  # b contains a's beginning
  w <- str_detect(b, coll(a.beg, ignore_case = TRUE))
  # b contains a's end
  x <- str_detect(b, coll(a.end, ignore_case = TRUE))
  # a contains b's beginning
  y <- str_detect(a, coll(b.beg, ignore_case = TRUE))
  # a contains b's end
  z <- str_detect(a, coll(b.end, ignore_case = TRUE))
  # return TRUE if any of above are true
  (w | x | y | z)
}
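For comparison with the Python in the question, here is a rough rendering of the same heuristic for two single strings. It is a sketch under assumptions: I am guessing that CountWords() counts whitespace-separated words, plain lower() plus substring matching only approximates what coll() with ignore_case does, and the max_words parameter and the empty-string rule are my own additions.

```python
def overlap(a, b, max_words=3):
    """True if the first or last few words of one ngram appear inside the other."""
    a_words, b_words = a.split(), b.split()
    # Compare at most max_words words, and never more than the shorter ngram has.
    k = min(max_words, len(a_words), len(b_words))
    if k == 0:
        return False  # assumption: an empty ngram overlaps nothing
    # Normalise spacing and case before doing substring checks.
    a_low = " ".join(a_words).lower()
    b_low = " ".join(b_words).lower()
    a_beg = " ".join(a_words[:k]).lower()
    a_end = " ".join(a_words[-k:]).lower()
    b_beg = " ".join(b_words[:k]).lower()
    b_end = " ".join(b_words[-k:]).lower()
    return (a_beg in b_low) or (a_end in b_low) or (b_beg in a_low) or (b_end in a_low)


print(overlap("you have to", "have to"))               # shorter phrase sits inside the longer one
print(overlap("We Don't Win", "we don't win anymore")) # case is ignored
print(overlap("you have to", "have to say"))           # equal length, shifted by one word
print(overlap("believe me", "you have to"))            # nothing in common
```

One thing this makes visible: when both ngrams have three or fewer words and the same length, the heuristic only fires if they are identical, because the compared window is the whole phrase. It mainly catches sub-phrases and longer ngrams that extend a shorter one.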