Randomly Select Sentences from Text File, Find Corresponding ID Number

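I am helping one of my professors with a research project that involves randomly sampling a thousand sentences from a set of 20 text files. All of the data comes from the Corpus of Contemporary American English, if anyone is familiar with working with it. In these text files, the data is arranged as follows: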

##4000348 I must begin by saying this : In preparation for this lecture , I read ( or in some cases reread ) a number of the writings of Sidney Hook . I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook . But instead I found myself infused with a set of ideas that were relevant to a different setting , a different occasion .

##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College . That was the reason news of my appointment appeared in the Wall Street Journal and the National Review , which does n't usually happen to deans of Yale College , and does n't help them much when it does .

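So there are hundreds of paragraphs, each beginning with a number preceded by "##". That number identifies the source of the sentences. I need to pull random sentences from these files and also get the number that identifies where each one came from. Ideally, I would end up with something like: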

##4000348 I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook

##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College .

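I have managed to pull random sentences from the files (with the help of some kind people on Stack Overflow), but I don't know how to get the number attached to them (for example, if I pull a sentence from the middle of a paragraph, how can I get the number from the beginning of that paragraph?). Can anyone help me figure out a way to do this? Here is the code I have so far, which successfully extracts the sentences.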

# -*- coding: utf-8 -*-

import re
from random import sample

sentences = []
for i in range(1990,2013):
    with open('w_acad_{}.txt'.format(i)) as f:
        sentences += re.findall(r".*?[\.\!\?]+", f.read())

selected = sample(sentences, 2000)
with open('out.txt', 'w') as f:
    f.write('\n'.join(selected))

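Perhaps you could use a regular expression to extract each paragraph together with its source ID, and then extract the sentences from that paragraph, much as you are doing now. This should help you grab the paragraphs: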

# with open... etc.
for source_id, paragraph in re.findall(r"(##\d+)([^#]+)", f.read()):
    sentences += [(source_id, sentence) for sentence in re.findall(r".*?[\.\!\?]+", paragraph)]

Now sentences should be a list of tuples like ('##123', 'A sentence.'), which you can sample from just as before.
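To make the idea concrete, here is a minimal, self-contained sketch of that approach. It runs on a short in-memory string instead of the real corpus files; the `text` sample and the `labeled_sentences` helper name are made up for illustration and are not part of the original code.

```python
import re
from random import sample

# Hypothetical stand-in for the contents of one corpus file.
text = (
    "##4000348 I must begin by saying this . I read them solely .\n\n"
    "##4000349 I would like to think so . That was the reason !\n"
)

def labeled_sentences(text):
    """Return (source_id, sentence) pairs for every sentence in text."""
    pairs = []
    # Each match is a "##<digits>" label plus everything up to the next "#".
    for source_id, paragraph in re.findall(r"(##\d+)([^#]+)", text):
        for sentence in re.findall(r".*?[.!?]+", paragraph):
            pairs.append((source_id, sentence.strip()))
    return pairs

pairs = labeled_sentences(text)
selected = sample(pairs, 2)  # pick 2 of the 4 labeled sentences
for source_id, sentence in selected:
    print(source_id, sentence)
```

One caveat with this pattern: `[^#]+` stops at the first "#" it meets, so a paragraph that contains a "#" character in its body would be cut short at that point.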

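In general, to avoid loading a (possibly large) file into memory all at once, you could use a reservoir sampling algorithm: just pass it an iterator that yields sentences labeled with their ##-numbers: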

#!/usr/bin/env python
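import itertools
import re
import nltk  # $ pip install nltk

def paragraphs(file):
    """Yield blank-line separated paragraphs labeled with ##-numbers."""
    lines = []
    # A trailing '' acts as a final blank line so the last paragraph is yielded too.
    for line in itertools.chain(file, ['']):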
        if line.strip():
            lines.append(line)
        elif lines:  # blank line, the end of a non-empty paragraph
            paragraph = ''.join(lines)
            numbers = re.findall(r'##([0-9]+)', paragraph)  # only ASCII-digits
            assert len(numbers) == 1  # only one ##-number per paragraph
            yield int(numbers[0]), paragraph
            del lines[:]

def sentences(filenames):
    for filename in filenames:
        with open(filename) as file:
            for number, paragraph in paragraphs(file):
                for sentence in nltk.sent_tokenize(paragraph):
                    yield number, sentence

filenames = ('w_acad_%d.txt' % n for n in range(1990, 2013))
print(reservoir_sample(sentences(filenames), 2000))

where reservoir_sample() is a reservoir sampling function defined separately.
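Since the definition is not reproduced in this thread, here is one possible sketch of such a function, using the classic "Algorithm R" form of reservoir sampling. It is not necessarily the implementation the answer had in mind; the `rng` parameter and the sample `stream` are assumptions added for illustration.

```python
import random

def reservoir_sample(iterable, k, rng=random):
    """Return k items chosen uniformly at random from iterable (Algorithm R).

    If the iterable yields fewer than k items, all of them are returned.
    """
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage with a generator of hypothetical (number, sentence) pairs.
stream = ((4000000 + n, "sentence %d ." % n) for n in range(10000))
picked = reservoir_sample(stream, 5)
print(picked)
```

Only `k` items are held in memory at any time, so the input can be an arbitrarily long generator such as the sentences() iterator above.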

nltk.sent_tokenize() may be a more robust solution than the r".*?[\.\!\?]+" regular expression.
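As a small illustration of where the regular expression can go wrong, the snippet below runs it on a made-up sentence containing an abbreviation. The sample text is hypothetical and no nltk call is made here.

```python
import re

SENTENCE_RE = re.compile(r".*?[.!?]+")

# Hypothetical text in the corpus style (space before final punctuation).
text = "Dr. Hook spoke at Yale . The lecture was long ."
pieces = [piece.strip() for piece in SENTENCE_RE.findall(text)]
print(pieces)  # ['Dr.', 'Hook spoke at Yale .', 'The lecture was long .']
```

The period after "Dr" is treated as the end of a sentence, so two sentences come back as three pieces. A tokenizer that knows about common abbreviations is less likely to split there.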