Getting IndexError: list index out of range when calculating Euclidean distance
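I am trying to apply the code from https://towardsdatascience.com/3-basic-distance-measurement-in-text-mining-5852becff1d7. When I use it with my own data, I seem to be accessing part of a list that does not exist, but I cannot work out where I am making that mistake: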
File "D:/Proj/projects/eucledian_1/nltk_headline_1.py", line 190, in eucledian
score = sklearn.metrics.pairwise.euclidean_distances([transformed_results[i]], [transformed_results[0]])[0][0]
IndexError: list index out of range
I don't think I am exceeding the length of only_event or transformed_results. Here is my code:
def eucledian(self):
    print('inside eucledian', only_event)
    for i, news_headline in enumerate(only_event):  # range(len(only_event))
        print('*******', only_event[i])
        print('this is transformed results: ', transformed_results[i])  # prints
        score = sklearn.metrics.pairwise.euclidean_distances([transformed_results[i]], [transformed_results[0]])[0][0]
        print('-----', only_event[i])  # prints
        print('Score: %.2f, Comparing Sentence: %s' % (score, news_headline))  # prints
The data that I read from the database and store in the list only_event (length = 2) looks like this:
['Perhaps this code is incomplete or mistyped in some way.', 'Use one of the following methods:\n* Ensure that the power is turned on.\n* Only concatenate a user-supplied value into a query, or if it must be a Boolean or numeric type.\n']
The print statements produce output, but the line that calls euclidean_distances raises IndexError: list index out of range. transformed_results (length = 1) looks like this:
[array([329., 2., 57., 44., 44., 44., 88., 57., 44., 44., 44.,
57., 13., 13., 88., 1., 2., 13., 136., 13., 13., 13.,
220., 44., 44., 44., 88., 88., 44., 44., 89., 2., 13.,
88., 13., 44., 132., 26., 4., 4., 132., 44., 1., 13.,
48., 27., 88., 132., 88., 44., 44., 132., 13., 4., 13.,
44., 13., 158., 15., 13., 162., 4., 44., 44., 26., 13.,
1., 44., 1., 57., 13., 1., 44., 44., 45., 44., 44.,
4., 13., 44., 1., 13., 44., 44., 44., 44., 336., 44.,
51., 2., 235., 13., 132., 132., 70., 26., 44., 13., 13.,
13., 44., 4., 1., 57., 44., 44., 2., 44.])]
Thanks in advance for taking a look.
Updated to include reproducible code, @dzang:
import numpy as np
import sklearn.preprocessing
import sklearn.metrics
token_event_obj = ['perhaps', 'this', 'code', 'is', 'incomplete', 'or', 'mistyped', 'in', 'some', 'way', 'use', 'one', 'of', 'the', 'following', 'methodsn', 'use', 'a', 'querypreparation', 'api', 'to', 'safely', 'construct', 'the', 'sql', 'query', 'containing', 'usersupplied', 'valuesn', 'only', 'concatenate', 'a', 'usersupplied', 'value', 'into', 'a', 'query', 'if', 'it', 'has', 'been', 'checked', 'against', 'a', 'whitelist', 'of', 'safe', 'string', 'values', 'or', 'if', 'it', 'must', 'be', 'a', 'boolean', 'or', 'numeric', 'typen']
only_event = ['Perhaps this code is incomplete or mistyped in some way.', 'Use one of the following methods:\n* Use a query-preparation API to safely construct the SQL query containing user-supplied values.\n* Only concatenate a user-supplied value into a query if it has been checked against a whitelist of safe string values, or if it must be a Boolean or numeric type.\n']

def transform(headlines):
    print('inside transform', headlines)
    tokens = [w for s in headlines for w in s]
    print()
    print('All Tokens:')
    print(tokens)
    results = []
    label_enc = sklearn.preprocessing.LabelEncoder()
    onehot_enc = sklearn.preprocessing.OneHotEncoder()
    encoded_all_tokens = label_enc.fit_transform(list(set(tokens)))
    encoded_all_tokens = encoded_all_tokens.reshape(len(encoded_all_tokens), 1)
    onehot_enc.fit(encoded_all_tokens)
    for headline_tokens in headlines:
        print()
        print(headline_tokens)
        print('Original Input:', headline_tokens)
        encoded_words = label_enc.transform(headline_tokens)
        print('Encoded by Label Encoder:', encoded_words)
        encoded_words = onehot_enc.transform(encoded_words.reshape(len(encoded_words), 1))
        print('Encoded by OneHot Encoder:')
        # print(encoded_words)
        results.append(np.sum(encoded_words.toarray(), axis=0))
    print('Transform results:', results)
    return results

def eucledian():
    print('inside eucledian', len(only_event))
    for i, news_headline in enumerate(only_event):  # range(len(only_event))
        print('*******', only_event[i])
        print('this is transformed results: ', transformed_results)
        # print('len', len(sklearn.metrics.pairwise.euclidean_distances([transformed_results[i]], [transformed_results[0]])[0]))
        print(type(transformed_results), len(transformed_results))
        score = sklearn.metrics.pairwise.euclidean_distances([transformed_results[i]], [transformed_results[0]])[0]
        print('-----', only_event[i])
        print('Score: %.2f, Comparing Sentence: %s' % (score, news_headline))

transformed_results = transform([token_event_obj])
eucledian()
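The error in the example you provided is that transformed_results is a list with a single element, containing the tokenized sentence 1. only_event, however, has 2 sentences, and you are using it to supply i. So i will be 0 and then 1, and when i is 1, transformed_results[i] raises the error.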
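The length mismatch described above can be caught before the loop ever indexes past the end of the shorter list. The following is only a minimal sketch: `paired` is a hypothetical helper (not part of the question's code or of scikit-learn), and the vector values are made-up stand-ins.

```python
# Stand-ins for the two lists in the question: two sentences but only one vector.
only_event = ["first sentence", "second sentence"]
transformed_results = [[1.0, 2.0, 3.0]]  # hypothetical vector; the list has length 1


def paired(sentences, vectors):
    """Pair each sentence with its vector, failing early if the counts differ."""
    if len(sentences) != len(vectors):
        raise ValueError(
            "got %d sentences but %d vectors" % (len(sentences), len(vectors))
        )
    return list(zip(sentences, vectors))


try:
    paired(only_event, transformed_results)
    message = ""
except ValueError as err:
    # The mismatch is reported here instead of surfacing later as an IndexError.
    message = str(err)

print(message)
```

Iterating over the pairs returned by such a helper, rather than indexing transformed_results with an i taken from only_event, keeps the two lists from drifting apart silently.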
If you tokenize both sentences in only_event, for example:
headlines = [''.join([c for c in s.replace('\n', '').lower() if c not in ['.', '*', ':', '-']]).split() for s in only_event]
which gives:
[['perhaps', 'this', 'code', 'is', 'incomplete', 'or', 'mistyped', 'in', 'some', 'way'], ['use', 'one', 'of', 'the', 'following', 'methods', 'use', 'a', 'querypreparation', 'api', 'to', 'safely', 'construct', 'the', 'sql', 'query', 'containing', 'usersupplied', 'values', 'only', 'concatenate', 'a', 'usersupplied', 'value', 'into', 'a', 'query', 'if', 'it', 'has', 'been', 'checked', 'against', 'a', 'whitelist', 'of', 'safe', 'string', 'values,', 'or', 'if', 'it', 'must', 'be', 'a', 'boolean', 'or', 'numeric', 'type']]
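then transformed_results will also have length 2.
You will then compare the Euclidean distance of both sentences against the reference sentence, including the reference sentence itself.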
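Putting the pieces together, here is a rough end-to-end sketch of the fix. To keep it dependency-free it swaps the LabelEncoder/OneHotEncoder pipeline for plain per-word count vectors built with collections.Counter, and swaps sklearn's euclidean_distances for math.dist (Python 3.8+). Summing one-hot rows yields per-word counts, so the vectors should be equivalent, but treat this as an illustration under that assumption. `tokenize` and `count_vector` are hypothetical helper names.

```python
import math
from collections import Counter

only_event = [
    'Perhaps this code is incomplete or mistyped in some way.',
    'Use one of the following methods:\n'
    '* Use a query-preparation API to safely construct the SQL query '
    'containing user-supplied values.\n'
    '* Only concatenate a user-supplied value into a query if it has been '
    'checked against a whitelist of safe string values, or if it must be '
    'a Boolean or numeric type.\n',
]


def tokenize(sentence):
    """Lower-case, drop newlines and the characters . * : - and split on whitespace."""
    cleaned = ''.join(
        c for c in sentence.replace('\n', '').lower() if c not in '.*:-'
    )
    return cleaned.split()


# One token list per sentence, so both lists have the same length.
headlines = [tokenize(s) for s in only_event]

# Shared vocabulary across all sentences, in a fixed order.
vocabulary = sorted({w for tokens in headlines for w in tokens})


def count_vector(tokens):
    """Bag-of-words counts over the shared vocabulary."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocabulary]


transformed_results = [count_vector(tokens) for tokens in headlines]

# Compare every sentence with the first one (the reference sentence).
reference = transformed_results[0]
scores = [math.dist(vector, reference) for vector in transformed_results]

for sentence, score in zip(only_event, scores):
    print('Score: %.2f, Comparing Sentence: %s' % (score, sentence))
```

Because every sentence now has its own vector, iterating with zip never indexes past the end of transformed_results, and the reference sentence scores 0 against itself.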