How to fix the error "ValueError: could not convert string to float" in an NLP project in Python?
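I am writing Python code in a Jupyter notebook that trains and tests a model on a dataset so that it returns the correct sentiment.
When I try to predict the sentiment of a phrase, the code fails with the following error:
ValueError: could not convert string to float: 'this book was so interstening it made me not happy'
Note that my dataset is imbalanced, so I use SMOTE to oversample it.
Code: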
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE  # for the imbalanced dataset
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
df = pd.read_csv("data/Apple-Twitter-Sentiment-DFE.csv",encoding="ISO-8859-1")
df
# data is cleaned using preprocessing functions
# Handle the imbalanced dataset using SMOTE
vectorizer = TfidfVectorizer()
vect_df =vectorizer.fit_transform(df["clean_text"])
oversample = SMOTE(random_state = 42)
x_smote,y_smote = oversample.fit_resample(vect_df, df["sentiment"])
print("shape x before SMOTE: {}".format(vect_df.shape))
print("shape x after SMOTE: {}".format(x_smote.shape))
print("balance of target field %")
y_smote.value_counts(normalize = True)*100
# split the dataset into train and test
x_train,x_test,y_train,y_test = train_test_split(x_smote,y_smote,test_size = 0.2,random_state =42)
logreg = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(n_jobs=1, C=1e5)),
])
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred))
# Make prediction
exl = "this book was so interstening it made me not happy"
logreg.predict(exl)
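The pipeline was fitted on TF-IDF vectors, not on raw text, so predict expects numeric features. You should define the variable exl as follows:
exl = vectorizer.transform(["this book was so interstening it made me not happy"])
Then make the prediction with logreg.predict(exl).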
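To make the fix above concrete, here is a minimal sketch that reproduces the error and then applies vectorizer.transform before predicting. The toy texts and labels are hypothetical stand-ins for df["clean_text"] and df["sentiment"], SMOTE and the train/test split are left out to keep it short, and max_iter=1000 is my own choice rather than something from the question.

```python
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for df["clean_text"] / df["sentiment"].
texts = [
    "i love this phone",
    "great battery and great screen",
    "this made me happy",
    "i hate this phone",
    "terrible battery and bad screen",
    "this made me not happy",
]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

# Fit the vectorizer on the training text, as in the question.
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(texts)

# Same pipeline shape as in the question: it is fitted on numeric features.
logreg = Pipeline([
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
logreg.fit(x_train, labels)

phrase = "this book was so interstening it made me not happy"

# Passing the raw string reproduces the failure from the question.
raw_string_failed = False
try:
    logreg.predict(phrase)
except ValueError as err:
    raw_string_failed = True
    print("raw string rejected:", err)

# Vectorize a one-element list with transform (not fit_transform), then predict.
exl = vectorizer.transform([phrase])
prediction = logreg.predict(exl)
print(exl.shape, prediction)
```

The list matters: transform expects an iterable of documents, so a bare string would not be treated as a single document.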
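First, put the test data in a list. The vectorizer then transforms it using the features (vocabulary) extracted from the training data, and the result can be passed to predict.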
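As a rough illustration of why the already fitted vectorizer has to be reused, the sketch below compares transform on the fitted vectorizer with fitting a fresh one on the new phrase. The training sentences are hypothetical; only the test phrase comes from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training sentences; the real code uses df["clean_text"].
train_texts = [
    "i love this phone",
    "this made me happy",
    "i hate this phone",
    "this made me not happy",
]

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)

new_texts = ["this book was so interstening it made me not happy"]

# transform reuses the training vocabulary, so the column count matches x_train.
x_new = vectorizer.transform(new_texts)
print("train columns:", x_train.shape[1], "new columns:", x_new.shape[1])

# Words that never appeared in training have no column and are ignored.
unseen = [w for w in new_texts[0].split() if w not in vectorizer.vocabulary_]
print("ignored words:", unseen)

# Fitting a fresh vectorizer on the new text builds a different feature space,
# which the trained model cannot use.
x_refit = TfidfVectorizer().fit_transform(new_texts)
print("refit columns:", x_refit.shape[1])
```

Because the model's coefficients are tied to the training columns, only the output of vectorizer.transform lines up with what the classifier learned.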