How to import and read the WSJ corpus in Python

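I have code that builds an n-gram model and tests next-word prediction against a supplied corpus. How can I replace the given corpus so that the WSJ corpus is read as the training corpus? Part of the program is shown below.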

# Import the libraries needed and read the dataset
import nltk, re, pprint, string
from nltk import word_tokenize, sent_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords

# Characters to strip: ASCII punctuation plus curly quotes and the em dash.
# '.' is kept so that sent_tokenize can still find sentence boundaries.
# A local variable is used instead of overwriting string.punctuation.
punctuation = (string.punctuation + '“”’‘—').replace('.', '')

# Read the whole corpus into one string and close the file afterwards
with open('./corpus.txt', encoding='utf8') as f:
    file = f.read()

# Preprocess the data: replace newlines with spaces, then drop punctuation
file_nl_removed = file.replace("\n", " ")
file_p = "".join(char for char in file_nl_removed if char not in punctuation)

#nltk.download('punkt')
sents = nltk.sent_tokenize(file_p)
print("The number of sentences is", len(sents)) 

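If you want to use the WSJ corpus that ships with the nltk package (a sample of the Penn Treebank WSJ section), it is available as soon as you download it: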

import nltk
nltk.download('treebank')
from nltk.corpus import treebank
print(treebank.fileids()[:10])
print(treebank.words('wsj_0003.mrg')[:10])

Output:

['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg', 'wsj_0005.mrg', 'wsj_0006.mrg', 'wsj_0007.mrg', 'wsj_0008.mrg', 'wsj_0009.mrg', 'wsj_0010.mrg']
['A', 'form', 'of', 'asbestos', 'once', 'used', '*', '*', 'to', 'make']
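
Note the '*' entries in the output: the treebank files contain trace (null) elements that are not real words, and in the tagged view they carry the tag '-NONE-'. To use the corpus as n-gram training data you probably want to drop them. Below is a minimal sketch of one way to do that, not a definitive implementation. The helper names clean_tagged_sentence, load_wsj_sentences and bigram_counts are hypothetical and are not part of nltk, and the tags in the sample sentence are hand-written for illustration.

```python
from collections import Counter, defaultdict


def clean_tagged_sentence(tagged_sent):
    """Drop trace tokens (tag '-NONE-') and lowercase the remaining words."""
    return [word.lower() for word, tag in tagged_sent if tag != '-NONE-']


def load_wsj_sentences():
    """Return the NLTK treebank sample as a list of cleaned token lists.

    nltk is imported here so the helpers above work without it installed.
    """
    import nltk
    from nltk.corpus import treebank
    nltk.download('treebank', quiet=True)
    return [clean_tagged_sentence(sent) for sent in treebank.tagged_sents()]


def bigram_counts(sentences):
    """Count, for each word, which words follow it (a plain bigram table)."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


# Hand-written sample in the (word, tag) shape that tagged_sents() yields;
# the tags are illustrative only.
sample = [('A', 'DT'), ('form', 'NN'), ('of', 'IN'), ('asbestos', 'NN'),
          ('once', 'RB'), ('used', 'VBN'), ('*', '-NONE-'), ('*', '-NONE-'),
          ('to', 'TO'), ('make', 'VB')]
cleaned = clean_tagged_sentence(sample)
sample_counts = bigram_counts([cleaned])
print(cleaned)
```

In your program, sents = load_wsj_sentences() would take the place of reading corpus.txt and calling sent_tokenize, because the treebank reader already returns tokenized sentences. Punctuation tokens such as ',' are still present in the result, so filter them as well if your model should ignore them.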