Randomly Select Sentences from Text File, Find Corresponding ID Number
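I am helping one of my professors with a research project that involves randomly sampling a thousand sentences from a set of 20 text files. All of the data comes from the Corpus of Contemporary American English, if anyone is familiar with working with it. In these text files, the data is arranged like this: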
##4000348 I must begin by saying this : In preparation for this lecture , I read ( or in some cases reread ) a number of the writings of Sidney Hook . I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook . But instead I found myself infused with a set of ideas that were relevant to a different setting , a different occasion .
##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College . That was the reason news of my appointment appeared in the Wall Street Journal and the National Review , which does n't usually happen to deans of Yale College , and does n't help them much when it does .
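So there are hundreds of paragraphs, each beginning with a numeric ID preceded by "##" (seven digits in the excerpt above). That number identifies the source of the sentences. I need to pull random sentences from these files and get the ID number that identifies where each one came from. Ideally, I would end up with something like this: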
##4000348 I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook
##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College .
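I have already managed to get random sentences from the files (with the help of the kind folks on Stack Overflow), but I don't know how to get the number attached to them (for example, if I pull a sentence from the middle of a paragraph, how can I get the number from the beginning of that paragraph?). Can anyone help me think of a way to do this? This is the code I have so far, which successfully extracts the sentences.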
# -*- coding: utf-8 -*-
import re
from random import sample

sentences = []
for i in range(1990, 2013):
    with open('w_acad_{}.txt'.format(i)) as f:
        sentences += re.findall(r".*?[\.\!\?]+", f.read())
selected = sample(sentences, 2000)
with open('out.txt', 'w') as f:
    f.write('\n'.join(selected))
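Perhaps you could use a regular expression to extract each paragraph together with its source ID, and then extract the sentences from that paragraph, much as you are doing now. This should help you capture the paragraphs: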
# with open... etc.
for source_id, paragraph in re.findall(r"(##\d+)([^#]+)", f.read()):
    sentences += [(source_id, sentence)
                  for sentence in re.findall(r".*?[\.\!\?]+", paragraph)]
Now sentences should be a list of tuples like ('##123', 'A sentence.'), from which you can sample just as before.
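As a minimal, self-contained sketch of this approach, the following runs the same two regular expressions on an in-memory string instead of the corpus files. The helper name labeled_sentences and the shortened sample text are assumptions made for illustration, and the strip() call is an addition that trims the leading space each match would otherwise keep.

```python
import re
from random import sample

# The two patterns suggested in the answer.
PARAGRAPH_RE = re.compile(r"(##\d+)([^#]+)")
SENTENCE_RE = re.compile(r".*?[\.\!\?]+")


def labeled_sentences(text):
    """Return a list of (source_id, sentence) tuples found in text."""
    pairs = []
    for source_id, paragraph in PARAGRAPH_RE.findall(text):
        for sentence in SENTENCE_RE.findall(paragraph):
            pairs.append((source_id, sentence.strip()))
    return pairs


# Hypothetical, shortened sample in the same layout as the corpus files.
text = (
    "##4000348 I must begin by saying this . I read them solely . \n"
    "##4000349 I would like to think . That was the reason . \n"
)

pairs = labeled_sentences(text)
selected = sample(pairs, 2)  # every sampled sentence keeps its source ID
for source_id, sentence in selected:
    print(source_id, sentence)
```

Note that the paragraph pattern stops at the next "#" character, so a paragraph whose text itself contains "#" would be cut short at that point.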
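More generally, to avoid loading the (potentially large) files into memory all at once, you could use a reservoir sampling algorithm. Just pass it an iterator that yields sentences labeled with their ##-numbers: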
#!/usr/bin/env python
import re
from itertools import chain

import nltk  # $ pip install nltk


def paragraphs(file):
    """Yield blank-line separated paragraphs labeled with ##-numbers."""
    lines = []
    # The trailing '' acts as a final blank line, so the last paragraph
    # is yielded even if the file does not end with a blank line.
    for line in chain(file, ['']):
        if line.strip():
            lines.append(line)
        elif lines:  # blank line, the end of a non-empty paragraph
            paragraph = ''.join(lines)
            numbers = re.findall(r'##([0-9]+)', paragraph)  # only ASCII-digits
            assert len(numbers) == 1  # only one ##-number per paragraph
            yield int(numbers[0]), paragraph
            del lines[:]


def sentences(filenames):
    """Yield (number, sentence) pairs from all files, one at a time."""
    for filename in filenames:
        with open(filename) as file:
            for number, paragraph in paragraphs(file):
                for sentence in nltk.sent_tokenize(paragraph):
                    yield number, sentence


filenames = ('w_acad_%d.txt' % n for n in range(1990, 2013))
# reservoir_sample() must be defined separately.
print(reservoir_sample(sentences(filenames), 2000))
where reservoir_sample() is defined elsewhere (the original answer links to its definition).
nltk.sent_tokenize() may be a more robust solution than the r".*?[\.\!\?]+" regular expression.
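The call above relies on reservoir_sample(), whose definition is not included in this post. The following is a minimal sketch of one standard variant (often called Algorithm R), assuming the signature reservoir_sample(iterable, k) implied by that call. The rng parameter is a hypothetical extra for testing, and this is not necessarily the implementation the answer linked to.

```python
import random


def reservoir_sample(iterable, k, rng=random):
    """Return up to k items chosen uniformly at random from iterable.

    Only k items are held in memory at any time, so the input can be
    a generator over arbitrarily many sentences.
    """
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage with a generator of (number, sentence) pairs (made-up data).
labeled = ((n, "sentence %d" % n) for n in range(10000))
picked = reservoir_sample(labeled, 5)
print(picked)
```

If the input holds fewer than k items, this sketch simply returns all of them, unlike random.sample(), which raises ValueError in that case.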