Preprocessing the data from a text file with a split method
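I have written a sample text below. What I want is to append this text to a list data structure in Python. I first split the text using '<EOS>' as the delimiter, then append each element of the split result to a list.
The problem I am facing is that the split method seems to split the text on both '\n' and '<EOS>'. As a result, single lines are added to the list instead of whole sections.
Please look at the code after the sample text below and let me know what I am doing wrong.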
Old Major, the old boar on the Manor Farm, summons the animals on the farm together for a meeting, during which he refers to humans as "enemies" and teaches the animals a revolutionary song called "Beasts of England".
When Major dies, two young pigs, Snowball and Napoleon, assume command and consider it a duty to prepare for the Rebellion.<EOS>
Alex is a 15-year-old living in near-future dystopian England who leads his gang on a night of opportunistic, random "ultra-violence".
Alex's friends ("droogs" in the novel's Anglo-Russian slang, 'Nadsat') are Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and Pete, who mostly plays along as the droogs indulge their taste for ultra-violence.
Characterised as a sociopath and a hardened juvenile delinquent, Alex also displays intelligence, quick wit, and a predilection for classical music; he is particularly fond of Beethoven, referred to as "Lovely Ludwig Van".
Python code to read the documents into a list:
    f = open('./plots')
    documents = []
    for x in f:
        documents.append(x.split('<EOS>'))
    print(documents[0])
    # documents[0] should start at 'Old Major' and stop at 'Rebellion'.
Iterating over f splits the file's contents on newlines. Use this instead:
    with open('./plots') as f:
        documents = f.read().split('<EOS>')
    print(documents[0])
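Note that splitting on '<EOS>' alone leaves the newline that follows each marker at the start of the next piece, and a trailing marker produces an empty final element. A minimal sketch of one way to tidy that up follows; the helper name split_documents and the sample string are made up for illustration and are not part of the original code.

```python
def split_documents(text, sep='<EOS>'):
    """Split text on sep, strip surrounding whitespace, drop empty pieces."""
    pieces = (part.strip() for part in text.split(sep))
    return [piece for piece in pieces if piece]


# Hypothetical stand-in for the contents of './plots'.
sample = "First line.\nSecond line.<EOS>\nThird line.\nFourth line.<EOS>\n"
documents = split_documents(sample)
print(documents[0])
```

Newlines inside a section are kept; only leading and trailing whitespace of each section is removed.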
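split('<EOS>') splits only on <EOS>, as you expect. However, for x in f: works line by line, so it effectively performs an implicit split on your file.
Instead, perhaps do this:
    with open('./plots') as f:
        documents = f.read().split('<EOS>')
    print(documents[0])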
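To see the difference without touching a file, here is a small sketch that uses io.StringIO as a stand-in for the opened file. The sample string is invented for illustration; only the two ways of consuming it matter.

```python
import io

# Hypothetical three-line input with one <EOS> marker.
text = "line one\nline two<EOS>\nline three\n"

# Iterating over a file-like object yields one line at a time,
# so split('<EOS>') only ever sees a single line.
per_line = [line.split('<EOS>') for line in io.StringIO(text)]

# Reading the whole content first lets split('<EOS>') see everything.
whole = io.StringIO(text).read().split('<EOS>')

print(per_line)  # [['line one\n'], ['line two', '\n'], ['line three\n']]
print(whole)     # ['line one\nline two', '\nline three\n']
```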
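split() does not split the text on both '\n' and '<EOS>'; it splits only on the latter. It is for x in f: that effectively splits the file's contents on newlines (\n).
The code below is roughly equivalent to yours and better illustrates what is happening:
    with open('./plots') as f:
        documents = []
        for x in f:
            documents.append(x.split('<EOS>'))

    for i, document in enumerate(documents):
        print('documents[{}]: {!r}'.format(i, document))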
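If the file is too large to read in one go, one possible alternative is to keep iterating line by line but accumulate lines until a marker shows up. This is only a sketch: the generator name iter_documents is hypothetical, and it assumes the marker itself never contains a newline.

```python
import io


def iter_documents(lines, sep='<EOS>'):
    """Yield stripped, non-empty sections from an iterable of lines."""
    buffer = ''
    for line in lines:
        buffer += line
        # A single line may contain more than one marker.
        while sep in buffer:
            doc, buffer = buffer.split(sep, 1)
            doc = doc.strip()
            if doc:
                yield doc
    # Whatever follows the last marker is the final section.
    tail = buffer.strip()
    if tail:
        yield tail


# io.StringIO stands in for an opened file here.
documents = list(iter_documents(io.StringIO("a\nb<EOS>\nc\nd")))
print(documents)  # ['a\nb', 'c\nd']
```

With a real file, the same generator could be fed the object returned by open('./plots').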