AttributeError: 'numpy.ndarray' object has no attribute 'lower' in pipeline

AttributeError: 'numpy.ndarray' object has no attribute 'lower' in pipeline

我正在做一些 nlp class化,我想做一个堆叠集成。

我的原始数据包含每个 class 的不同级别的描述。例如,对于 1 个实例,我们最初可以有一列包含其名称,一列包含简短描述,一列包含其子类别的描述,等等。

我在上面的代码中的 X_train 是每列包含某种粒度的所有单词的地方。例如。第一列可以是简短描述,第二列是子类别描述的词和来自其他来源的词,第三列是来自更细化类别的更多词。

我将 pipepipe_2 的工作流程包含在 StackingClassifier 中,因为这是我正在尝试做的,但如果我只是尝试 运行 将 pipe_1 作为一个独立的(直接安装在 pipe_1 上)。

我尝试更改 X_trainy_train 格式(使用 ravel().tolist()),但我认为可能出现了格式问题当管道使用 ColumnSelector 时,我不确定如何处理它。

X_train(<class 'pandas.core.frame.DataFrame'>)和y_train(<class 'pandas.core.series.Series'>)的类型和我成功的不堆叠运行时一样。对于成功的运行,传递给fit方法的是一个<class 'scipy.sparse.csr.csr_matrix'>。我想在堆叠示例中也是如此,我希望 TfidfVectorizer 能够做到这一点。我看到的主要区别(我认为它可能会造成每行有问题的 numpy.ndarray,因为有多于一列?)是对于堆叠一列,X_train 有多于一列。但我本来希望 make_pipeline 中的 ColumnSelector 到 "take charge of that problem".

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier

# creating my toy trainset and testset
start = [
    ['apple this is painful Two wrongs make a right ok',
     'just a batch of suspicious words and banana',
     'another batch of fake words and another apple'],
    ['Fortune favors the italic sunny sunshine',
     'name of a company and then its description',
     'is it all sunshine or doomed to fail to no sunshine'],
    ['this was it when in rome do as the romans and make fortune',
     'well again the same thing and those descriptions',
     'lets make that work and bring the fortune'],
    ['Ok this is the last one and then its the end',
     'is it the beggining of the end or the end of the beggining',
     'allelouia']
]

X_train = pd.DataFrame(
    start, columns=['High_level', 'Mid_level', 'Low_level'])
y_train = ['A', 'B', 'C', 'D']
X_test = pd.DataFrame([['mostly apple'], ['bunch of apple'],
                       ['lot of fortune'], ['make fortune and bring the'],
                       ['beggining of the end']])
y_true = ['A', 'A', 'C', 'C', 'D']

错误出现在下一行:

pipe_1 = make_pipeline(ColumnSelector(cols=(1,)), TfidfVectorizer(min_df=1),
                     LogisticRegression(multi_class='multinomial'))
pipe_2 = make_pipeline(ColumnSelector(cols=(2,)), TfidfVectorizer(min_df=1),
                     LogisticRegression(multi_class='multinomial'))
sclf = StackingClassifier(
        classifiers=[pipe_1, pipe_2],
        meta_classifier=LogisticRegression(
            solver='lbfgs', multi_class='multinomial',
            C=1.0, class_weight='balanced', tol=1e-6, max_iter=1000,
            n_jobs=-1))
predictions = sclf.fit(X_train, y_train).predict(X_test)

这是完整的错误:

Traceback (most recent call last):
  File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
    predictions = sclf.fit(X_train, y_train).predict(X_test)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
    clf.fit(X, y)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
    **fit_params_steps[name])
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Process finished with exit code 1

如果我在 TfidfVectorizer 中更改为 lowercase=False,我会得到另一种错误:

Traceback (most recent call last):
  File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
    predictions = sclf.fit(X_train, y_train).predict(X_test)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
    clf.fit(X, y)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
    **fit_params_steps[name])
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 265, in <lambda>
    return lambda doc: token_pattern.findall(doc)
TypeError: cannot use a string pattern on a bytes-like object

我遇到了同样的问题。我通过将 drop_axis = True 添加到 ColumnSelector 来解决它。当只选择一列时需要添加此参数。

请参考这里的API:http://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/#api