NLTK PlaintextCorpusReader: reading files in and splitting them on delimiters

I want to split the input text on a delimiter and extract only specific parts to process with NLTK. Here is a sample of the input:

[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
##repost from january 13 , 2004 with a better fit title . 
support[-3][u]##apex does n't answer the phone . 
player[-2][p]##unfortunately it turns out to be the " disposable " type . 
battery[+2]##i treat the battery well and it has lasted . 
sound quality[+2], fm[+1], earpiece[+1]##while i had the phone , the positive features were : good sound quality and an excellent fm phone and earpiece . 
speakerphone[+3][u]##you can be up to about 3 feet away from it and it will still work perfectly . 
size[+2],weight[+2]##i like the size and weight of this little critter . 
[t]excellent picture quality / color 
canon g3[+3]##i bought my canon g3 about a month ago and i have to say i am very satisfied . 
zoom[+2],lense[+2]##the extended zoom range and faster lense put it at the top of it 's class . 

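I am trying to use NLTK to split the file into lines and then keep only the part after ##. Here is my attempt, but I could not find a good way to split the file on the delimiter: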

# Import the Natural Language Toolkit library
import nltk
# Importing Operator Module
import operator

from nltk.corpus import PlaintextCorpusReader

# Root folder where the text files are located
corpus_root = "Data"

# Read the list of files
filelists = PlaintextCorpusReader(corpus_root, '.*', encoding='utf-8')

# List down the IDs of the files read from the local storage
filelists.fileids()

# Read the text from a specific file.
# Plaintext corpora support methods to read the corpus as
# raw text, a list of words, a list of sentences, or a list of paragraphs.
rawlist = filelists.raw('text.txt')
wordslist = filelists.words('text.txt')
sentslist = filelists.sents('text.txt')
paraslist = filelists.paras('text.txt')

print("a list of filenames:")
print(filelists.fileids(),'\n')
print("a list of words:")
print(wordslist,'\n')
print("a list of sentences:")
print(sentslist,'\n')
print("a list of paragraphs:")
print(paraslist,'\n')
print("a list of raw text:")
print(rawlist,'\n')

Expected output:

troubleshooting ad-2500 and ad-2600 no picture scrolling b/w . 
repost from january 13 , 2004 with a better fit title .
apex does n't answer the phone . 
unfortunately it turns out to be the " disposable " type . 
i treat the battery well and it has lasted . 
while i had the phone , the positive features were : good sound quality and an excellent fm phone and earpiece . 
you can be up to about 3 feet away from it and it will still work perfectly . 
i like the size and weight of this little critter . 
excellent picture quality / color 
i bought my canon g3 about a month ago and i have to say i am very satisfied . 
the extended zoom range and faster lense put it at the top of it 's class . 
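
The mapping from the sample input to the expected output can be sketched with plain string handling, before any NLTK processing. This is only a minimal sketch: the helper name extract_text is hypothetical, and it assumes every line is either a title starting with [t] or an annotated sentence whose text follows the first ##.

```python
def extract_text(line):
    """Return the review text of one line, or None for a blank line."""
    line = line.strip()
    if not line:
        return None
    if line.startswith("[t]"):
        # Title line: drop the "[t]" marker and keep the rest.
        return line[len("[t]"):].strip()
    # Everything after the first "##" is the sentence; the part before it
    # holds feature annotations such as "battery[+2]".
    _, sep, text = line.partition("##")
    # Assumption: a line without "##" is kept unchanged.
    return text.strip() if sep else line


sample = """[t] troubleshooting ad-2500 and ad-2600 no picture scrolling b/w .
##repost from january 13 , 2004 with a better fit title .
support[-3][u]##apex does n't answer the phone .
[t]excellent picture quality / color
"""

cleaned = [t for t in map(extract_text, sample.splitlines()) if t is not None]
for text in cleaned:
    print(text)
```

With the reader from the attempt above, the same helper could be applied to filelists.raw('text.txt').splitlines(), and the cleaned lines could then be passed on to an NLTK tokenizer.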

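I used NLTK's existing corpus import functionality to work with the files for this project. First, I located the actual directory behind from nltk.corpus import product_reviews_1, since product_reviews_1 is a known module in the current NLTK data package. Running nltk.corpus.product_reviews_1.abspaths() then gave me the exact path of that folder. After that I copied the folder into the corpus directory.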
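
A related approach, shown here only as a sketch, is to point the reader class behind product_reviews_1 (ReviewsCorpusReader) at your own folder. This assumes the file uses the same [t] / feature[+n]##sentence layout as that corpus. The temporary folder and the shortened sample text are stand-ins, not the original data.

```python
import os
import tempfile

from nltk.corpus.reader import ReviewsCorpusReader

# Hypothetical stand-in for the real data: a few lines in the same layout
# as product_reviews_1, written to a temporary folder.
sample = (
    "[t]excellent picture quality / color\n"
    "canon g3[+3]##i bought my canon g3 about a month ago .\n"
    "battery[+2]##i treat the battery well and it has lasted .\n"
)
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "text.txt"), "w", encoding="utf-8") as f:
    f.write(sample)

reader = ReviewsCorpusReader(corpus_root, r".*\.txt", encoding="utf-8")

lines = []
for review in reader.reviews("text.txt"):
    # The title is the text after "[t]".
    lines.append(review.title)
    for review_line in review.review_lines:
        # review_line.sent is the tokenized text after "##".
        lines.append(" ".join(review_line.sent))

for line in lines:
    print(line)
```

The reader tokenizes each sentence with WordPunctTokenizer by default, so re-joined tokens may differ from the raw text for forms such as n't or ad-2500. If the exact original strings are needed, splitting the raw lines on ## is the safer route.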