Computing symmetric Kullback-Leibler divergence between two documents
I have computed the KLD between two text datasets following the paper here and the code here (it implements symmetric KLD with the back-off model proposed in the first paper, link). To test whether each probability distribution sums to 1, I changed the for loop at the end to return the probability distributions of the two datasets:
import re, math, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', \
                 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1
    return tokens
#end of tokenize
def kldiv(_s, _t):
    if (len(_s) == 0):
        return 1e33
    if (len(_t) == 0):
        return 1e33
    ssum = 0. + sum(_s.values())
    slen = len(_s)
    tsum = 0. + sum(_t.values())
    tlen = len(_t)
    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)
    """ epsilon """
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001
    """ gamma """
    gamma = 1 - lenvocabdiff * epsilon
    """ Check if distribution probabilities sum to 1"""
    sc = sum([v/ssum for v in _s.itervalues()])
    st = sum([v/tsum for v in _t.itervalues()])
    ps = []
    pt = []
    for t, v in _s.iteritems():
        pts = v / ssum
        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        ps.append(pts)
        pt.append(ptt)
    return ps, pt
I tested it with:
d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""
d2 = """In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""
and got sum(ps) = 1, but sum(pt) is far less than 1. Is there something incorrect in the code? Thanks!
Update:

To make both pt and ps sum to 1, I had to change the code to:
    vocab = Counter(_s) + Counter(_t)
    ps = []
    pt = []
    for t, v in vocab.iteritems():
        if t in _s:
            pts = gamma * (_s[t] / ssum)
        else:
            pts = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        else:
            ptt = epsilon
        ps.append(pts)
        pt.append(ptt)
    return ps, pt
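As a self-contained check, here is a Python 3 sketch of the updated union-vocabulary loop (dict `items()` in place of `iteritems()`; the function name `smoothed_dists` is my own) showing that both sums come out at approximately 1:

```python
from collections import Counter

def smoothed_dists(_s, _t):
    # Sketch of the updated loop: iterate over the union vocabulary,
    # back off to epsilon for words missing from one document.
    ssum = float(sum(_s.values()))
    tsum = float(sum(_t.values()))
    epsilon = min(min(_s.values()) / ssum, min(_t.values()) / tsum) * 0.001
    gamma = 1 - len(set(_s) - set(_t)) * epsilon
    vocab = Counter(_s) + Counter(_t)
    ps, pt = [], []
    for w in vocab:
        ps.append(gamma * (_s[w] / ssum) if w in _s else epsilon)
        pt.append(gamma * (_t[w] / tsum) if w in _t else epsilon)
    return ps, pt

# toy word-count dicts with partial overlap
s = {'aa': 2.0, 'bb': 1.0, 'cc': 1.0}
t = {'bb': 3.0, 'dd': 1.0}
ps, pt = smoothed_dists(s, t)
# both sums are within epsilon-scale error of 1
```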
The sums of the probability distributions of each document are stored in the variables sc and st, and they are close to 1.
sum(ps) and sum(pt) are the total probability masses of _s and _t over the support of s (by "support of s" I mean all words that appear in _s, regardless of the words that appear in _t). This means that

- sum(ps) == 1, because the for loop sums over all words in _s.
- sum(pt) <= 1, with equality if the support of t is a subset of the support of s (that is, if all words in _t also appear in _s). Moreover, sum(pt) can be close to 0 if the overlap between the words in _s and _t is small. Specifically, if the intersection of _s and _t is the empty set, then sum(pt) == epsilon * len(_s).
So, I don't think there is anything wrong with the code.
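The empty-intersection case can be checked directly with a toy pair (Python 3 sketch of the question's original loop; `items()` replaces `iteritems()` and the function name is my own):

```python
def support_s_dists(_s, _t):
    # Sketch of the question's original loop, which iterates only over _s,
    # so pt only covers words in the support of s.
    ssum = float(sum(_s.values()))
    tsum = float(sum(_t.values()))
    epsilon = min(min(_s.values()) / ssum, min(_t.values()) / tsum) * 0.001
    gamma = 1 - len(set(_s) - set(_t)) * epsilon
    ps, pt = [], []
    for w, v in _s.items():
        ps.append(v / ssum)
        pt.append(gamma * (_t[w] / tsum) if w in _t else epsilon)
    return ps, pt, epsilon

# disjoint vocabularies: sum(pt) collapses to epsilon * len(_s)
s = {'aa': 1.0, 'bb': 1.0}
t = {'cc': 1.0, 'dd': 1.0}
ps, pt, eps = support_s_dists(s, t)
# sum(ps) == 1 exactly, while sum(pt) == eps * len(s), a tiny number
```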
Also, contrary to the title of the question, kldiv() does not compute the symmetric KL divergence, but rather the KL divergence between smoothed versions of _s and _t.
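A symmetric version over the union vocabulary can be sketched as follows (Python 3; this reuses the question's epsilon/gamma back-off smoothing, and `sym_kldiv` is my own name). It uses the identity KL(p||q) + KL(q||p) = Σ (p − q)·log(p/q). Note that the question's gamma only discounts for words of _s missing from _t, so the smoothing itself is not perfectly symmetric:

```python
import math

def sym_kldiv(_s, _t):
    # Symmetric KLD: KL(p||q) + KL(q||p) over the union vocabulary,
    # with the question's epsilon/gamma back-off smoothing.
    ssum = float(sum(_s.values()))
    tsum = float(sum(_t.values()))
    epsilon = min(min(_s.values()) / ssum, min(_t.values()) / tsum) * 0.001
    gamma = 1 - len(set(_s) - set(_t)) * epsilon
    div = 0.0
    for w in set(_s) | set(_t):
        p = gamma * (_s[w] / ssum) if w in _s else epsilon
        q = gamma * (_t[w] / tsum) if w in _t else epsilon
        # (p - q) * log(p / q) == p*log(p/q) + q*log(q/p)
        div += (p - q) * math.log(p / q)
    return div
```

On token-count dicts the divergence is strictly positive for different documents and exactly 0 for identical ones, as expected of a (pseudo-)distance.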