Natural language corpus string to int

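Extract a sample of sentences from each of corpus 1, corpus 2 and corpus 3, and display the average length (measured as the number of characters in a sentence).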

So I have 3 corpora, and sample_raw_sents is a function I defined that returns random sentences:

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50
for sentence in tcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    print(len(sentence))  

This code prints all the lengths, but how do I sum() them?

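You can store the lengths of all the sentences in a list and then sum them. Dividing the sum by the number of stored lengths gives the average.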

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50

lengths = []
for sentence in tcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))

print(sum(lengths) / len(lengths))
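If a separate average per corpus is wanted, as the question's wording suggests, the same idea can be wrapped in a small helper. This is only a sketch: StubCorpus is a hypothetical stand-in for corpus1(), corpus2() and corpus3(), and it assumes sample_raw_sents(k) returns k sentences as plain strings.

```python
import random
from statistics import mean


class StubCorpus:
    """Hypothetical stand-in for the corpus objects in the question."""

    def __init__(self, sentences):
        self._sentences = list(sentences)

    def sample_raw_sents(self, k):
        # Assumption: returns k randomly chosen raw sentence strings.
        return random.choices(self._sentences, k=k)


def average_length(corpus, sample_size):
    """Mean number of characters per sampled sentence."""
    return mean(len(s) for s in corpus.sample_raw_sents(sample_size))


corpora = {
    "corpus1": StubCorpus(["A short one.", "A slightly longer sentence."]),
    "corpus2": StubCorpus(["Tiny.", "Another sentence of medium length."]),
    "corpus3": StubCorpus(["Yes.", "No.", "Maybe, depending on the sample."]),
}

# One average per corpus rather than one pooled figure.
for name, corpus in corpora.items():
    print(name, average_length(corpus, sample_size=50))
```

With the real corpus objects, only the dictionary values would need to change.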

Use zip, which lets you draw one sentence from each corpus at a time.

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50

zipped = zip(tcr.sample_raw_sents(sample_size),
             rcr.sample_raw_sents(sample_size),
             mcr.sample_raw_sents(sample_size))

for s1, s2, s3 in zipped:
    summed = len(s1) + len(s2) + len(s3)
    average = summed/3
    print(summed, average)
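One thing to keep in mind with zip is that it stops at the shortest input, so if one sample comes back with fewer sentences, the extra sentences from the others are silently dropped. A small sketch with plain lists standing in for the sampled sentences (per_draw_stats is a hypothetical helper name, not part of any library):

```python
def per_draw_stats(*samples):
    """Return (summed, average) for each group of sentences drawn together."""
    stats = []
    for group in zip(*samples):
        summed = sum(len(s) for s in group)
        stats.append((summed, summed / len(group)))
    return stats


# Equal-sized samples: one result per position.
print(per_draw_stats(["ab", "abcd"], ["abc", "a"], ["a", "abcdef"]))

# Unequal samples: zip stops after the first pair, "abcd" is ignored.
print(per_draw_stats(["ab", "abcd"], ["abc"]))
```

Dividing by len(group) instead of a literal 3 keeps the helper correct for any number of corpora.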
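
Alternatively, keep a running total of the lengths and divide it by the number of sentences sampled:

tcr = corpus1()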
rcr = corpus2()
mcr = corpus3()  
sample_size=50
s = 0
for sentence in tcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in rcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in mcr.sample_raw_sents(sample_size):
    s = s + len(sentence)

average = s / (3 * sample_size)  # 3 corpora, sample_size sentences each
print('average: {}'.format(average))
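Rather than fixing the divisor in advance, the count can be tracked alongside the total, which stays correct even if a sample returns fewer sentences than requested. A minimal sketch, using plain lists in place of the sample_raw_sents results (running_average is a hypothetical helper name):

```python
from itertools import chain


def running_average(*samples):
    """Average sentence length over all samples, counted in one pass."""
    total = 0
    count = 0
    for sentence in chain(*samples):
        total += len(sentence)
        count += 1
    if count == 0:
        raise ValueError("no sentences were sampled")
    return total / count


average = running_average(["ab"], ["abcd"], ["abc"])
print('average: {}'.format(average))
```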