How to calculate conditional_frequency_distribution and conditional_probability_distribution for trigrams in NLTK (Python)
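I want to compute a conditional probability distribution for my language model, but I can't, because it requires a conditional frequency distribution that I am unable to generate. Here is my code: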
# -*- coding: utf-8 -*-
import io
import nltk
from nltk.util import ngrams
from nltk.tokenize import sent_tokenize
from preprocessor import utf8_to_ascii

with io.open("mypet.txt", 'r', encoding='utf8') as utf_file:
    file_content = utf_file.read()

ascii_content = utf8_to_ascii(file_content)
sentence_tokenize_list = sent_tokenize(ascii_content)

all_trigrams = []
for sentence in sentence_tokenize_list:
    sentence = sentence.rstrip('.!?')
    tokens = nltk.re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", sentence)
    trigrams = ngrams(tokens, 3, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol="</s>")
    all_trigrams.extend(trigrams)

conditional_frequency_distribution = nltk.ConditionalFreqDist(all_trigrams)
conditional_probability_distribution = nltk.ConditionalProbDist(conditional_frequency_distribution, nltk.MLEProbDist)

for trigram in all_trigrams:
    print "{0}: {1}".format(conditional_probability_distribution[trigram[0]].prob(trigram[1]), trigram)
But I get this error:
line 23, in <module>
ValueError: too many values to unpack
Here is my preprocessor.py file, which handles the UTF-8 characters:
# -*- coding: utf-8 -*-
import json


def utf8_to_ascii(utf8_text):
    with open("utf_to_ascii.json") as data_file:
        data = json.load(data_file)
    utf_table = data["chars"]
    for key, value in utf_table.items():
        utf8_text = utf8_text.replace(key, value)
    return utf8_text.encode('ascii')
Here is the utf_to_ascii.json file I use to replace UTF-8 characters with ASCII characters:
{
    "chars": {
        "“": "",
        "”": "",
        "’": "'",
        "—": "-",
        "–": "-"
    }
}
Can anyone suggest how to calculate a conditional frequency distribution for trigrams in NLTK?
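I finally figured out how to do it. In the code above, I converted the trigrams into bigrams. For example, I take ('I', 'am', 'going') and convert it to (('I', 'am'), 'going'). The result is a pair with two elements, where the first element is itself a tuple of two words. nltk.ConditionalFreqDist expects (condition, sample) pairs, which is why passing raw 3-tuples raised the ValueError. To achieve this, I changed just a few lines of the code: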
trigrams_as_bigrams = []
for sentence in sentence_tokenize_list:
    ....
    ....
    trigrams = ngrams(tokens, 3, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol="</s>")
    trigrams_as_bigrams.extend([((t[0], t[1]), t[2]) for t in trigrams])
    ....
    ....
The rest of the code is the same as before. It works fine for me. Thanks for your efforts.
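For reference, here is a minimal end-to-end sketch of the approach described above, written for Python 3. It is not the original poster's exact script: the two token lists in `sentences` are made-up stand-ins for the tokenized contents of mypet.txt, and the names `sentences`, `pairs`, `cfd`, and `cpd` are chosen only for illustration. It regroups each padded trigram into a ((w1, w2), w3) pair and then derives maximum-likelihood probabilities from the resulting counts.

```python
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, MLEProbDist
from nltk.util import ngrams

# Hypothetical, already-tokenized sentences standing in for the real corpus.
sentences = [
    ["I", "am", "going", "home"],
    ["I", "am", "here"],
]

pairs = []
for tokens in sentences:
    trigrams = ngrams(
        tokens, 3,
        pad_left=True, pad_right=True,
        left_pad_symbol="<s>", right_pad_symbol="</s>",
    )
    # Regroup (w1, w2, w3) as ((w1, w2), w3): the two-word history is the
    # condition, and the third word is the sample observed after it.
    pairs.extend(((w1, w2), w3) for w1, w2, w3 in trigrams)

# Count how often each third word follows each two-word history.
cfd = ConditionalFreqDist(pairs)

# Turn the counts for every history into a maximum-likelihood distribution.
cpd = ConditionalProbDist(cfd, MLEProbDist)

for history, word in pairs:
    print("{0}: {1}".format(cpd[history].prob(word), (history, word)))
```

In this toy data the history ('I', 'am') is followed once by 'going' and once by 'here', so `cpd[("I", "am")].prob("going")` comes out as 0.5. MLEProbDist assigns zero probability to any word never seen after a given history, so a smoothed estimator may be a better fit for a real language model.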