Read from a txt file and split into words

I want to create a program in Python that reads a txt file supplied by the user. I then want the program to split the words of the text, as in the example below:

At the time of his accession, the Swedish Riksdag held more power than the monarchy but was bitterly divided between rival parties.

I would like the program to save the result in a different file. Any ideas?

You haven't specified in what format you want the text saved in the other file. Assuming you want one triplet per line, this would do it:

def only_letters(word):
    # keep only ASCII letters, stripping punctuation such as commas and periods
    return ''.join(c for c in word if 'a' <= c <= 'z' or 'A' <= c <= 'Z')

with open('input.txt') as f, open('output.txt', 'w') as w:
    s = f.read()
    words = [only_letters(word) for word in s.split()]
    # every window of 3 consecutive words; empty if there are fewer than 3 words
    triplets = [words[i:i + 3] for i in range(len(words) - 2)]
    for triplet in triplets:
        w.write(' '.join(triplet) + '\n')
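To see what the sliding-window logic above produces, here is a minimal demo that uses the sentence from the question directly instead of reading from a file:

```python
def only_letters(word):
    # keep only ASCII letters, stripping punctuation such as commas and periods
    return ''.join(c for c in word if 'a' <= c <= 'z' or 'A' <= c <= 'Z')

s = ("At the time of his accession, the Swedish Riksdag held more power "
     "than the monarchy but was bitterly divided between rival parties.")
words = [only_letters(word) for word in s.split()]
# each entry is a list of 3 consecutive words
triplets = [words[i:i + 3] for i in range(len(words) - 2)]
print(' '.join(triplets[0]))   # At the time
print(' '.join(triplets[1]))   # the time of
```

Each line of the output file would then hold one of these space-joined triplets.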

You can try this; note that it will fail if you give it fewer than 3 words.

def get_words():
    with open("file.txt", "r") as f:
        # read() instead of readline() so every line of the file is processed,
        # and split() without an argument so newlines and repeated spaces are handled
        for word in f.read().split():
            yield word.replace(",", "").replace(".", "")

with open("output.txt", "w") as f:
    it = get_words()
    # prime the window; raises StopIteration if there are fewer than 3 words
    current = [""] + [next(it) for _ in range(2)]
    for word in it:
        current = current[1:] + [word]
        f.write(" ".join(current) + "\n")

My understanding is that you want to generate n-grams, which is a common practice for text vectorization before doing any NLP. Here is a simple implementation:

from sklearn.feature_extraction.text import CountVectorizer

string = ["At the time of his accession, the Swedish Riksdag held more power than the monarchy but was bitterly divided between rival parties."]
# you can change the ngram_range to get any combination of words
vectorizer = CountVectorizer(encoding='utf-8', ngram_range=(3, 3))

X = vectorizer.fit_transform(string)
print(list(vectorizer.get_feature_names_out()))  # get_feature_names() was removed in scikit-learn 1.2

This gives you a list of n-grams of length 3, but the original order is lost:

['accession the swedish', 'at the time', 'between rival parties', 'bitterly divided between', 'but was bitterly', 'divided between rival', 'held more power', 'his accession the', 'monarchy but was', 'more power than', 'of his accession', 'power than the', 'riksdag held more', 'swedish riksdag held', 'than the monarchy', 'the monarchy but', 'the swedish riksdag', 'the time of', 'time of his', 'was bitterly divided']
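If you do need the trigrams in their original order, `CountVectorizer.build_analyzer()` returns the preprocessing + n-gram pipeline as a plain function, and that callable emits n-grams in document order (a small sketch, using the sentence from the question; note the default `lowercase=True` still applies):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ("At the time of his accession, the Swedish Riksdag held more power "
            "than the monarchy but was bitterly divided between rival parties.")

# build_analyzer() exposes the tokenizer/n-gram step without fitting a vocabulary,
# so the trigrams come back in the order they occur in the text
analyzer = CountVectorizer(ngram_range=(3, 3)).build_analyzer()
ordered_trigrams = analyzer(sentence)
print(ordered_trigrams[:2])  # ['at the time', 'the time of']
```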