IndexError: string index out of range with Python when reading big .txt file
I am trying to write a beginner-level program in Python, but I get the following error when reading a large .txt file:
Traceback (most recent call last):
  File "P4.py", line 58, in <module>
    maximo = diccionario.get(keyword[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
IndexError: string index out of range
For small documents the program works fine, but with the document provided in class (> 200000 lines, ~2/3Mb) I get the error.
This is the code I wrote:
file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()
diccionario = {}
"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t') # Here we separate the String in a list: [word, tag, ocurrences], word=sintagma[0], tag=sintagma[1], ocurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): #We are not interested in the first line of the file, this is the filter
        if (diccionario.has_key(sintagma[0])): #Here we check it the word was included before in the dictionary
            aux_list = diccionario.get(sintagma[0]) #We know the name already exists in the dic, so we create a List for every value
            aux_list.append([sintagma[1], sintagma[2]]) #We add to the list the tag and th ocurrences for this concrete word
            diccionario.update({sintagma[0]:aux_list}) #Update the value with the new list (new list = previous list + new appended element to the list)
        else: #If in the dic do not exist the key, que add the values to the empty list (no need to append)
            aux_list_else = ([sintagma[1],sintagma[2]])
            diccionario.update({sintagma[0]:aux_list_else})
"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])
For retrieve the information from diccionario, we have to keep in mind:
In case we have more than 1 Tag associated to a word (keyword), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:
diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())#We create a dictionary with the keys from diccionario and we set all the values to None
for keyword in diccionario:
    tagSugerido = diccionario.get(keyword[0]) #tagSugerido is the tag with more ocurrences for a concrete keyword
    maximo = diccionario.get(keyword[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
    if ((len(diccionario.get(keyword))) > 2): #in case we have > 2 tags for a concrete word
        suma = float(diccionario.get(keyword)[1])
        for i in range (2, len(diccionario.get(keyword))):
            suma += float(diccionario.get(keyword)[i][1])
            if (diccionario.get(keyword)[i][1] > maximo):
                tagSugerido = diccionario.get(keyword)[i][0]
                maximo = float(diccionario.get(keyword)[i][1])
        probabilidad = float(maximo/suma);
        diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})
    else:
        diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})
Finally, here is a sample of the input (imagine 200000 more lines):
Palabra Tag Apariciones
Jordi_Savall NP 5
LIma NP 3
LIma NC 8
LIma V 65
Participaron V 1
Tejkowski NP 1
Tejkowski NC 400
Tejkowski V 23
Iglesia_Catolica NP 1
Feria_Internacional_del_Turismo NP 4
38,5 Num 3
concertada Adj 7
ríspida Adj 1
8.035 Num 1
José_Luis_Barbagelata NP 1
lunes_tres Data 1
misionero NC 1
457.500 Num 1
El_Goloso NP 1
suplente NC 7
colocada Adj 18
Frankfurter_Allgemeine NP 2
reducía V 2
descendieron V 21
escuela NC 113
.56 Num 9
curativos Adj 1
Varios Pron 5
delincuencia NC 48
ratito NC 1
conservamos V 1
dirigí V 1
CECA NP 6
formación NC 317
experiencias NC 48
Going by your own comments, you wrote:
create a dictionary with *a word as a key* and a List as a value
So each key in your dictionary diccionario is a word. But in your second for loop you have this:
for keyword in diccionario:
    tagSugerido = diccionario.get(keyword[0])
    maximo = diccionario.get(keyword[1])
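This means you are taking the first letter of the actual key (keyword[0]; per your comment the key is a word) and then its second letter (keyword[1]), and using those single letters to look up values in the dictionary. I don't think that is what you intended. Moreover, as soon as some line has a keyword that is only one character long, keyword[1] is out of range, which is the IndexError you are seeing. That would also explain why small files work: they probably contain no one-character words.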
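To make the point concrete, here is a minimal sketch of how the second loop could index the value stored under the key rather than the key itself. It is written for Python 3 (the code in the question is Python 2), and the names build_index and suggest_tags, the (tag, count) tuple layout, and the one-letter sample word "a" are my own assumptions for illustration, not something taken from your file. Counts are converted to int up front so that comparisons are numeric rather than string-based.

```python
from collections import defaultdict


def build_index(lines):
    """Map each word to a list of (tag, count) pairs."""
    index = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # Skip the header row and anything that is not word/tag/count.
        if len(parts) != 3 or parts[0] == "Palabra":
            continue
        word, tag, count = parts
        index[word].append((tag, int(count)))
    return index


def suggest_tags(index):
    """Map each word to [most frequent tag, total count, probability]."""
    result = {}
    for word, pairs in index.items():
        # Index into the value (pairs), never into the key (word).
        total = sum(count for _, count in pairs)
        best_tag, best_count = max(pairs, key=lambda pair: pair[1])
        probability = best_count / total if total else 0.0
        result[word] = [best_tag, total, probability]
    return result


# Hypothetical sample: three rows from the question plus a one-letter word.
sample = [
    "Palabra\tTag\tApariciones\n",
    "LIma\tNP\t3\n",
    "LIma\tNC\t8\n",
    "LIma\tV\t65\n",
    "a\tPrep\t2\n",
]
index = build_index(sample)
suggestions = suggest_tags(index)
```

With this layout a one-character word such as "a" is handled like any other key, because nothing ever slices the key string.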