Unstructured data, NLP Lemmatize Book Review

I am trying to read the contents of a file, say 'book1.txt', from which I have to remove all special characters and punctuation, tokenize the content with nltk's word tokenizer, lemmatize those tokens with WordNetLemmatizer, and write the tokens one by one to a csv file. This is the code I am using; it obviously doesn't work, but I would appreciate some advice.

    import nltk
    from nltk.stem import WordNetLemmatizer
    import csv
    from nltk.tokenize import word_tokenize

    file_out=open('data.csv','w')
    with open('book1.txt','r') as myfile:
      for s in myfile:
        words = nltk.word_tokenize(s)
        words=[word.lower() for word in words if word.isalpha()]
        for word in words:
          token=WordNetLemmatizer().lemmatize(words,'v')
          filtered_sentence=[""]
          for n in words:
            if n not in token:
              filtered_sentence.append(""+n)
            file_out.writelines(filtered_sentence+["\n"])

There are a few problems here, the most obvious being the last two for loops.

The way you have written it, the output comes out like this:

    word1
    word1word2
    word1word2word3
    word1word2word3word4
    ........etc
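That pattern comes from calling writelines inside the innermost loop, so the growing filtered_sentence list gets re-written on every iteration instead of once at the end. A minimal sketch of that behavior, using print to stand in for the file write:

    # filtered_sentence keeps growing, and the "write" happens on
    # every inner-loop iteration instead of once after the loop
    words = ["word1", "word2", "word3"]
    filtered_sentence = [""]
    for n in words:
        filtered_sentence.append(n)
        print("".join(filtered_sentence))
    # word1
    # word1word2
    # word1word2word3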

I'm guessing that is not the intended output. I assume the intended output is:

    word1
    word2
    word3
    word4
    ........etc (without creating duplicates)

I applied the code below to a 3-paragraph Cat Ipsum file. Note that I changed some of the variable names to match my own naming conventions.

    import nltk
    nltk.download('punkt')
    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from pprint import pprint


    # read the text into a single string.
    with open("book1.txt") as infile:
        text = ' '.join(infile.readlines())
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]


    # create the lemmatized word list
    lemmatizer = WordNetLemmatizer()  # create once, not on every iteration
    results = []
    for word in words:
        # you were using words instead of word below
        token = lemmatizer.lemmatize(word, "v")
        # check if token is not already in results
        if token not in results:
            results.append(token)


    # sort results, just because :)
    results.sort()

    # print and save the results, one token per line
    # (writelines does not add newlines by itself)
    pprint(results)
    print(len(results))
    with open("nltk_data.csv", "w") as outfile:
        outfile.writelines(token + "\n" for token in results)
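As a side note, since your question imports csv: if you want a real CSV file rather than plain newline-separated tokens, the csv module writes one token per row. A minimal sketch reusing the same results list and file name from above:

    import csv

    # write one lemma per row; newline="" is the csv-module convention
    # to avoid extra blank lines on Windows
    with open("nltk_data.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        for token in results:
            writer.writerow([token])

Also, if the file is large, checking token not in results scans the whole list each time; keeping a parallel set of seen tokens would make that check constant-time.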