Using Python to create a (random) sample of n words from text files

For my PhD project, I am evaluating all existing named-entity recognition taggers for Dutch. To check the precision and recall of these taggers, I want to manually annotate all named entities in a random sample from my corpus. The manually annotated sample will serve as a 'gold standard' against which I will compare the results of the different taggers.

My corpus consists of 170 Dutch novels. I am writing a Python script to generate a random sample of a specific number of words for each novel (which I will use for annotating afterwards). All the novels will be stored in the same directory. The following script is meant to generate a random sample of n lines for each novel in that directory:

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)  

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
             # number of lines from txt file
             random_sample_input = random.sample(f.readlines(),100) 

    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR: 
            raise 


# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8') 
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()

There are two problems with this code:

  1. At the moment there are two novels (.txt files) in the directory, but the code only outputs a random sample for one of the two novels.

  2. At the moment, the code draws a random sample of LINES from each .txt file, but I would prefer to generate a random sample of WORDS for each .txt file. Ideally, I would like to sample the first or last 100 words of each of the 170 .txt files. In that case the sample would not be random at all; but so far I haven't found a way to create a sample without using the random library.

Can anyone suggest solutions to these two problems? I am new to Python and to programming in general (I am a literary scholar), so I would be happy to learn different approaches. Many thanks in advance!
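On the second point: a non-random sample of the first or last 100 words needs no random library at all, because plain list slicing on the word list is enough. A minimal sketch, using an inline string in place of one novel's file contents:

```python
# Minimal sketch: first/last n words without the random module,
# using an inline string in place of one novel's file contents.
text = "Dit is de eerste zin. Dit is de tweede zin."
words = text.split()      # split on whitespace into a list of words
first_n = words[:5]       # the first 5 words
last_n = words[-5:]       # the last 5 words
print(first_n)            # → ['Dit', 'is', 'de', 'eerste', 'zin.']
```

For a real novel, `text` would come from `f.read()` and the slice bound would be 100 instead of 5.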

You just have to split your lines into words, store them somewhere, and then, once all the files have been read and their words stored, pick 100 of them with random.sample. That is what I did in the code below. However, I am not sure it will scale well to 170 novels, since it may use a lot of memory.

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)
words = []

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
             # collect every word from the txt file
             for line in f:
                 for word in line.split():
                     words.append(word)

    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR: 
            raise 

random_sample_input = random.sample(words, 100)

# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8') 
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()

In the code above, the more words a novel has, the more likely it is to be represented in the output sample. That may or may not be the desired behaviour. If you want every novel to have the same weight, you can select, say, 100 words from each novel and add them to the words variable, then select 100 words from that at the end. As a side effect, this also uses less memory, since only one novel at a time needs to be stored.

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)
words = []

for text in files:
    try:
        novel = []
        with open(text, 'rt', encoding='utf-8') as f:
             # collect every word of this novel
             for line in f:
                 for word in line.split():
                     novel.append(word)
             # keep 100 random words per novel, so every book has equal weight
             words += random.sample(novel, 100)


    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR: 
            raise 


random_sample_input = random.sample(words, 100)

# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8') 
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()

A third version; this one works with sentences instead of words, and keeps the punctuation. Moreover, each book has the same "weight" in the final selection, regardless of its size. Keep in mind that sentence detection is done by an algorithm that is quite clever, but not infallible.

import random
import os
import glob
import sys
import errno
import nltk.data

path = '/home/clement/Documents/randomPythonScripts/data/*.txt'
files = glob.glob(path)

sentence_detector = nltk.data.load('tokenizers/punkt/dutch.pickle')
listOfSentences = []

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
            fullText = f.read()
        listOfSentences += [x.replace("\n", " ").replace("  "," ").strip() for x in random.sample(sentence_detector.tokenize(fullText), 30)]

    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR:
            raise

random_sample_input = random.sample(listOfSentences, 15)
print(random_sample_input)

# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8')
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()

A few suggestions:

Sample random sentences, not words or lines. NE taggers will work much better if the input consists of grammatical sentences, so you need to use a sentence splitter.

As you iterate over the files, random_sample_input only contains the lines from the last file. You should move the block of code that writes the selected content to a file inside the for loop. You can then write the selected sentences either to one file or to separate files. For example:

out = open("selected-sentences.txt", "w", encoding='utf-8')

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
             sentences = sentence_splitter.tokenize(f.read())
             for sentence in random.sample(sentences, 100):
                 print(sentence, file=out)

    except IOError as exc:
    # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR: 
            raise 

out.close()
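To write one sample file per novel instead, which also addresses the first problem, the writing can be wrapped in a small helper that derives an output name from the input name. This is a sketch under two assumptions: the helper name `write_novel_sample` is made up for illustration, and a plain whitespace split stands in for the NLTK sentence splitter so the snippet is self-contained. The corpus path is the one from the question:

```python
import glob
import os
import random

def write_novel_sample(path, n, out_dir=None):
    """Write a random sample of n words from one novel to its own
    '<name>-sample.txt' file, and return the sampled words."""
    with open(path, 'rt', encoding='utf-8') as f:
        words = f.read().split()  # whitespace split instead of NLTK, for brevity
    # min() avoids a ValueError when a file has fewer than n words
    sample = random.sample(words, min(n, len(words)))
    base, _ = os.path.splitext(os.path.basename(path))
    out_dir = out_dir or os.path.dirname(path)
    out_path = os.path.join(out_dir, base + '-sample.txt')
    with open(out_path, 'w', encoding='utf-8') as out:
        out.write('\n'.join(sample) + '\n')
    return sample

# one sample file per novel in the corpus directory
for text in glob.glob('/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'):
    write_novel_sample(text, 100)
```

For sentence-level samples, the whitespace split inside the helper would be replaced by `sentence_splitter.tokenize(...)` as shown below.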

[edit] Here is how you would use the NLTK sentence splitter:

import nltk.data
sentence_splitter = nltk.data.load("tokenizers/punkt/dutch.pickle")
text = "Dit is de eerste zin. Dit is de tweede zin."
print(sentence_splitter.tokenize(text))

Which prints:

['Dit is de eerste zin.', 'Dit is de tweede zin.']

Note that you first need to download the Dutch tokenizer by calling nltk.download() from the interactive console.

This solves both problems:

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)

with open("randomsample", "w", encoding='utf-8') as random_sample_output:
    for text in files:
        try:
            with open(text, 'rt', encoding='utf-8') as f:
                # sample 10 random words from the txt file
                random_sample_input = random.sample(f.read().split(), 10)

        except IOError as exc:
            # Do not fail if a directory is found, just ignore it.
            if exc.errno != errno.EISDIR:
                raise
            continue  # nothing was read, so skip the write below

        # This block of code writes the result of the previous to a new file
        random_sample_input = map(lambda x: x + "\n", random_sample_input)
        random_sample_output.writelines(random_sample_input)