CountVectorizer() : AttributeError: 'numpy.float64' object has no attribute 'lower'
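I am trying to fit a dataset that has an event_type column and a notes (free text) column. Before calling the MultinomialNB model, I process the text and convert it to an array in order to vectorize it and compute tf-idf, using the code below:
Converting the event type from string to integer for easier processing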
ACLED['category_id'] = ACLED['event_type'].factorize()[0]
category_id_ACLED = ACLED[['event_type', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_ACLED.values)
id_to_category = dict(category_id_ACLED[['category_id', 'event_type']].values)
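For readers unfamiliar with factorize(), here is a minimal sketch of the same label encoding on a toy frame. The rows in df are made up for illustration and are not taken from the ACLED data; the two lookup dictionaries are built from the uniques returned by factorize(), which is a slightly different route than the drop_duplicates() approach above but should give the same mapping.

```python
import pandas as pd

# Toy stand-in for the ACLED frame; these rows are invented for illustration.
df = pd.DataFrame({
    "event_type": ["Riots/Protests", "Remote violence",
                   "Riots/Protests", "Strategic development"],
    "notes": ["protest in the capital", "shelling reported",
              "students marched", "new base announced"],
})

# factorize() returns (codes, uniques); codes follow order of first appearance.
codes, uniques = df["event_type"].factorize()
df["category_id"] = codes

# Build both lookup tables directly from the unique labels.
id_to_category = dict(enumerate(uniques))
category_to_id = {name: idx for idx, name in id_to_category.items()}

print(df[["event_type", "category_id"]])
print(category_to_id)
```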
Text representation
I also converted notes and category_id into features and labels, as follows:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(ACLED.notes).toarray()
labels = ACLED.category_id
print(features.shape)
Then I split the dataset into training and test sets using the features and labels:
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
print('Original dataset shape {}'.format(Counter(y_train)))
Output
Original dataset shape Counter({1: 1280, 2: 819, 0: 676, 3: 593, 4: 138, 5: 53, 7: 50, 6: 21, 8: 10})
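Because the classes are imbalanced, I used SMOTE to oversample the minority classes by creating synthetic samples.
Applying oversampling to overcome the class imbalance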
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_sample(X_train, y_train)
print('Resampled dataset shape {}'.format(Counter(y_resampled)))
Output after oversampling
Resampled dataset shape Counter({3: 1280, 1: 1280, 2: 1280, 0: 1280, 7: 1280, 6: 1280, 4: 1280, 5: 1280, 8: 1280})
Everything worked fine up to this point, until I tried to compute term frequencies with CountVectorizer(), as follows:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_resampled)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
Error output:
'numpy.ndarray' object has no attribute 'lower'
I tried flattening the array with ravel(), but the error persists. Any ideas? Thanks in advance.
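The error can be reproduced in isolation, which may help explain why ravel() only changes the type named in the message. CountVectorizer expects an iterable of raw strings and lowercases each document by default, so a numeric matrix fails at that step. The sketch below uses a small made-up array (numeric_features) in place of the resampled tf-idf matrix, and a hypothetical helper (lowercase_error) that just captures the exception message.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for the tf-idf matrix: rows of floats, not strings.
numeric_features = np.array([[0.0, 0.5], [0.3, 0.0]])


def lowercase_error(docs):
    """Return the AttributeError message raised when fitting on docs, else None."""
    try:
        CountVectorizer().fit_transform(docs)
    except AttributeError as exc:
        return str(exc)
    return None


# 2-D array: each "document" is a row, i.e. a numpy.ndarray.
error_2d = lowercase_error(numeric_features)
# Flattened array: each "document" is a single numpy.float64.
error_1d = lowercase_error(numeric_features.ravel())

# Raw strings are what the vectorizer is designed to receive.
counts = CountVectorizer().fit_transform(
    ["protest in the capital", "shelling near the capital"]
)

print(error_2d)
print(error_1d)
print(counts.shape)
```

In other words, the features were already vectorized by TfidfVectorizer, so passing them through CountVectorizer a second time cannot work regardless of the array's shape.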
I found a solution to this problem. Instead of using the features and labels, I split the dataset directly:
X_train, X_test, y_train, y_test = train_test_split(ACLED['notes'], ACLED['event_type'], random_state=0)
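Then I moved SMOTE to after the CountVectorizer step, because SMOTE has its own pipeline and needs numeric features as input, not raw text:
Vectorizing the notes column of the training set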
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
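As a side note, CountVectorizer followed by TfidfTransformer is documented as equivalent to a single TfidfVectorizer with the same settings. The sketch below checks that on a few made-up notes (train_docs and test_docs are illustrative, not ACLED rows), and also shows how the held-out split would be transformed with the already fitted objects rather than refitted.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up notes standing in for the training and test splits.
train_docs = [
    "protesters marched in the capital",
    "shelling reported near the border",
    "protesters blocked the border road",
]
test_docs = ["shelling in the capital"]

# Two-step route: raw counts first, then tf-idf weighting.
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
two_step = tfidf_transformer.fit_transform(count_vect.fit_transform(train_docs))

# One-step route with default settings.
tfidf_vect = TfidfVectorizer()
one_step = tfidf_vect.fit_transform(train_docs)

same_matrix = np.allclose(two_step.toarray(), one_step.toarray())

# The test split is only transformed, never fitted, so it reuses the
# vocabulary and idf weights learned from the training split.
test_tfidf = tfidf_transformer.transform(count_vect.transform(test_docs))

print(same_matrix)
print(test_tfidf.shape)
```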
Applying oversampling to overcome the class imbalance
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_sample(X_train_tfidf, y_train)
print('Resampled dataset shape {}'.format(Counter(y_resampled)))
Output
Original dataset shape Counter({'Riots/Protests': 1280, 'Battle-No change of territory': 819, 'Remote violence': 676, 'Violence against civilians': 593, 'Strategic development': 138, 'Battle-Government regains territory': 53, 'Battle-Non-state actor overtakes territory': 50, 'Non-violent transfer of territory': 21, 'Headquarters or base established': 10})
Resampled dataset shape Counter({'Violence against civilians': 1280, 'Riots/Protests': 1280, 'Battle-No change of territory': 1280, 'Remote violence': 1280, 'Battle-Non-state actor overtakes territory': 1280, 'Non-violent transfer of territory': 1280, 'Strategic development': 1280, 'Battle-Government regains territory': 1280, 'Headquarters or base established': 1280})