HuggingFace BERT Sentiment Analysis
I get the following error when I run classifier(encoded):

AssertionError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

My text is of type str, so I'm not sure what I'm doing wrong. Any help is much appreciated.
import torch
from transformers import AutoTokenizer, BertTokenizer, BertModel, BertForMaskedLM, AutoModelForSequenceClassification, pipeline
# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
# Load pre-trained model tokenizer (vocabulary)
# used the cased instead of uncased to account for cases like BAD.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# alternative? what is the difference between these two tokenizers?
#tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")
model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")
# feed the model and the tokenizer into the pipeline
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
#---------------sample raw input passage--------
text = "Who was Jim Henson ? Jim Henson was a puppeteer. He is simply awful."
# tokenized_text = tokenizer.tokenize(text)
#----------Tokenization and Padding---------
# Encode the sentences to get tokenized and add padding stuff
encoded = tokenizer.encode_plus(
    text=text,                   # the sentence to be encoded
    add_special_tokens=True,     # add [CLS] and [SEP]
    max_length=64,               # maximum length of a sentence (TODO: figure out the longest passage length)
    pad_to_max_length=True,      # add [PAD]s up to max_length
    return_attention_mask=True,  # generate the attention mask
    truncation=True,             # explicitly truncate examples to max_length
    return_tensors='pt',         # ask the function to return PyTorch tensors
)
#-------------------------------------------
# view the IDs
for key, value in encoded.items():
    print(f"{key}: {value.numpy().tolist()}")
#-------------------------------------------
classifier(encoded)
The pipeline already includes the encoder, so it tokenizes the input for you. Instead of

classifier(encoded)

do

classifier(text)
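For reference, here is a minimal end-to-end sketch of the corrected usage. It also loads the tokenizer that matches the checkpoint, since the original code pairs a bert-base-cased tokenizer with an uncased model, which can hurt predictions even once the error is fixed. The exact label strings in the output depend on the checkpoint's config, so treat them as an assumption here:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# load the tokenizer and model from the same checkpoint so vocabularies match
tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")
model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

text = "Who was Jim Henson ? Jim Henson was a puppeteer. He is simply awful."
print(classifier(text))
# e.g. [{'label': ..., 'score': ...}]  -- label names and scores come from the model config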
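If you do want to call encode_plus yourself, pass the resulting tensors to the model rather than to the pipeline. A minimal sketch, assuming a recent transformers version where the model returns an output object with a .logits attribute:

import torch

with torch.no_grad():
    output = model(**encoded)  # encoded is the dict returned by tokenizer.encode_plus(...)

probs = torch.softmax(output.logits, dim=-1)
print(probs)  # class probabilities; which index is positive/negative depends on the model's label mapping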