ElasticSearch | TypeError: string indices must be integers
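I am using this Notebook, with the Apply DocumentClassifier section changed as shown below.
Jupyter Labs, kernel: conda_mxnet_latest_p37.
I understand that this error means I am passing a str where an int is expected. However, that should not be a problem, because the same code works with the other .pdf/.txt files from the original notebook.
Code cell: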
doc_dir = "GRIs/"  # contains 2 .pdfs

with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]

doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=2)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')
Error output:
INFO - haystack.modeling.utils - Using devices: CUDA
INFO - haystack.modeling.utils - Number of GPUs: 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-82b54cd162ff> in <module>
     14
     15 # classify using gpu, batch_size makes sure we do not run out of memory
---> 16 classified_docs = doc_classifier.predict(docs_to_classify)
     17
     18 # let's see how it looks: there should be a classification result in the meta entry containing labels and scores.

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in predict(self, documents)
    144         for prediction, doc in zip(predictions, documents):
    145             if self.task == 'zero-shot-classification':
--> 146                 prediction["label"] = prediction["labels"][0]
    147             doc.meta["classification"] = prediction
    148

TypeError: string indices must be integers
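For context, here is a minimal sketch of one way this exact message can come up. It assumes, purely for illustration, that `predictions` arrives as a single result dict rather than a list of result dicts; nothing in the traceback confirms that this is what happened here. The function `first_label`, the `documents` list, and the sample values are hypothetical stand-ins, not Haystack objects.

```python
# Hypothetical stand-ins for illustration only; these are not Haystack objects.
documents = ["doc A"]

# Assumption: one result dict instead of a list of result dicts.
predictions = {"sequence": "doc A", "labels": ["x", "y"], "scores": [0.9, 0.1]}


def first_label(predictions, documents):
    """Mimic the shape of the failing loop: pair each prediction with a document."""
    labels = []
    for prediction, doc in zip(predictions, documents):
        # If `prediction` is a str, indexing it with a string key raises TypeError.
        labels.append(prediction["labels"][0])
    return labels


error_message = None
try:
    # Iterating a dict yields its keys, so `prediction` is the string "sequence".
    first_label(predictions, documents)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Wrapping the dict in a list gives the loop a real dict to index.
fixed = first_label([predictions], documents)
print(fixed)
```

Under that assumption, the loop never sees a dict: `zip` hands it a dictionary key, which is a plain string, and indexing a string with `"labels"` raises the TypeError.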
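Please let me know if there is anything else I should add to the post or clarify.
I replaced the variable docs_sliding_window with my_dsw.
my_dsw keeps only the entries whose content is at most 1000 characters long. This fits the shape of my data better.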
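my_dsw = []
# note: range() stops at len(docs_sliding_window) - 2, so the last entry is never examined
for dsw in range(0, len(docs_sliding_window)-1):
    # keep only entries whose content is at most 1000 characters long
    if len(docs_sliding_window[dsw]['content']) <= 1000:
        my_dsw.append(docs_sliding_window[dsw])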
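Because `range(0, len(x) - 1)` stops one index short, the loop above never looks at the final entry. If that is not intended, the same filter can be sketched as a list comprehension that checks every entry. The helper `keep_short`, the constant `MAX_CHARS`, and the sample data are hypothetical and use plain dicts with a `content` key, mirroring the shape used above.

```python
MAX_CHARS = 1000  # same threshold as in the loop above


def keep_short(docs, max_chars=MAX_CHARS):
    """Return the dicts whose 'content' is at most max_chars characters long."""
    return [d for d in docs if len(d["content"]) <= max_chars]


# Hypothetical sample data: the last entry is short and should be kept.
sample = [
    {"content": "a" * 10},
    {"content": "b" * 1001},
    {"content": "c" * 1000},
]

short = keep_short(sample)
print(len(short))
```

Iterating over the list directly avoids index arithmetic, so no entry is skipped by accident.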
I swapped it in on the docs_to_classify line:
# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in my_dsw]
Admittedly, I am not sure exactly how this relates to the error, but it does help the data fit better; I can now increase to batch_size=4.