Error generating a model reading corpus from a big .txt file
I am trying to read the file corpus.txt (the training set) and generate a model; the output must be called lexic.txt and contain each word, its tag, and the number of occurrences. It works for a small training set, but with the training set given by my university (a 30 MB .txt file with millions of lines) the code does not work. I suppose it is an efficiency problem and the system runs out of memory... Can anyone help me with the code?
Here is my code:
from collections import Counter
file = open('corpus.txt', 'r')
data = file.readlines()
file.close()
palabras = []
count_list = []
for linea in data:
    linea.decode('latin_1').encode('UTF-8')  # for the accents
    palabra_tag = linea.split('\n')
    palabras.append(palabra_tag[0])
cuenta = Counter(palabras)  # dictionary that counts occurrences of each word + tag
# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:
            count_list.append([palabras[i], str(cuenta[palabraTag])])
# We delete repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i)
outfile = open('lexic.txt', 'w')
outfile.write('Palabra\tTag\tApariciones\n')
for i in range(len(finalList)):
    outfile.write(finalList[i][0]+'\t'+finalList[i][1]+'\n')  # finalList[i][0] is the word + tag and finalList[i][1] is the number of occurrences
outfile.close()
Here you can see a sample of corpus.txt:
Al Prep
menos Adv
cinco Det
reclusos Adj
murieron V
en Prep
las Det
últimas Adj
24 Num
horas NC
en Prep
las Det
cárceles NC
de Prep
Valencia NP
y Conj
Barcelona NP
en Prep
incidentes NC
en Prep
los Det
que Pron
su Det
Thanks in advance!
You might be able to reduce memory usage by combining these two blocks of code:
# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag:
            count_list.append([palabras[i], str(cuenta[palabraTag])])
# We delete repeated ones
finalList = []
for i in count_list:
    if i not in finalList:
        finalList.append(i)
You could check whether an item already exists in count_list before appending it, instead of adding duplicates and removing them afterwards. That should reduce your memory usage. See below:
# Assign to every word + tag the number of times it appears
for palabraTag in palabras:
    for i in range(len(palabras)):
        if palabras[i] == palabraTag and \
           [palabras[i], str(cuenta[palabraTag])] not in count_list:
            count_list.append([palabras[i], str(cuenta[palabraTag])])
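If lookup time is also a concern, a set of already-seen keys avoids the linear `not in count_list` scans over a growing list. This is a minimal sketch, not the answer's original code: the name `visto` and the sample data are mine.

```python
from collections import Counter

# Made-up sample of "word tag" strings, shaped like the question's data
palabras = ['en Prep', 'las Det', 'en Prep']
cuenta = Counter(palabras)

count_list = []
visto = set()  # keys already emitted; set membership is O(1) vs. O(n) for a list
for palabraTag in palabras:
    if palabraTag not in visto:
        visto.add(palabraTag)
        count_list.append([palabraTag, str(cuenta[palabraTag])])

print(count_list)  # [['en Prep', '2'], ['las Det', '1']]
```

Note this also drops the inner `for i in range(len(palabras))` loop entirely, since `cuenta` already holds the totals.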
In the end I improved the code using a dictionary; this is the final version, which works 100%:
file = open('corpus.txt', 'r')
data = file.readlines()
file.close()
diccionario = {}
for linea in data:
    linea.decode('latin_1').encode('UTF-8')  # for the accents
    palabra_tag = linea.split('\n')
    cadena = str(palabra_tag[0])
    if diccionario.has_key(cadena):
        aux = diccionario.get(cadena)
        aux += 1
        diccionario.update({cadena: aux})
    else:
        diccionario.update({cadena: 1})
outfile = open('lexic.txt', 'w')
outfile.write('Palabra\tTag\tApariciones\n')
for key, value in diccionario.iteritems():
    s = str(value)
    outfile.write(key + '\t' + s + '\n')
outfile.close()
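For anyone on Python 3 (where `has_key` and `iteritems` no longer exist), the same counting can be done in one streaming pass with `collections.Counter`. This is a sketch under my own assumptions: the helper name `contar_apariciones` and the sample lines are mine, not from the original post.

```python
from collections import Counter

def contar_apariciones(lineas):
    """Count identical 'word tag' lines in a single pass."""
    return Counter(linea.rstrip('\n') for linea in lineas)

# Streaming the file line by line avoids readlines() holding everything in memory:
# with open('corpus.txt', encoding='latin_1') as f:
#     cuenta = contar_apariciones(f)

muestra = ['en Prep\n', 'las Det\n', 'en Prep\n']  # made-up sample lines
cuenta = contar_apariciones(muestra)
print(cuenta['en Prep'])  # 2
```

Passing the open file object directly to the helper means only one line is held in memory at a time, which is what matters for the 30 MB corpus.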