How to make a function to check if the combination and/or redundancy of words has a correlation with the number of sales?

In my dataframe of products sold on the Internet, I have a column containing the description of each product sold.

I want to build an algorithm that checks whether the combination and/or redundancy of words correlates with the number of sales.

However, I want to be able to filter out words that are too redundant, such as the product type. For example, my dataframe deals with wine sales, so the algorithm must not take the word "wine" in the descriptions into account.

My df has 700 rows and consists of 4 columns:

Edit: I am adding the following.

A sample of my DataFrame:

import pandas as pd

data = {'Product_id': [1, 2, 3, 4, 5],
        'Price': [24, 13.5, 12.9, 34, 26],
        'Total_sales': [28, 34, 29, 42, 10],
        'CA': [672, 459, 374.1, 1428, 260],
        'Product_description': ["Fruity wine, perfect as a starter",
                                "Dry and full-bodied wine",
                                "Fresh and perfect wine as a starter",
                                "Wine combining strength and character",
                                "Wine with a ruby ​​color, full-bodied "]}

df = pd.DataFrame(data)
df

Edit 2:

wordslist = df["description"].str.split()
comp = re.compile('|'.join(stopwords))
z = [re.sub(comp, '', i).strip() for i in words_split]

print(z)

I get:

TypeError: expected string or bytes-like object
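
A note on this error: df["description"].str.split() returns a list of tokens for each row, so each i in the comprehension is a list rather than a string, which is what re.sub rejects (there is also a naming mismatch between wordslist and words_split). A minimal sketch of a working variant, assuming df is the sample DataFrame built above, that the column is Product_description, and that stopwords is a plain list of words:

import re

stopwords = ["wine", "and", "a", "as", "with"]  # hypothetical stop word list
# build one whole-word pattern from the stop words
comp = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b')
# each description is a single string here, which is what re.sub expects
z = [comp.sub('', desc).strip() for desc in df["Product_description"]]
print(z)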

Is this a good way to check whether the use of a word / a combination of words has an impact on product sales (provided I find a solution for the error)?

Could you give me some hints?

Edit 3: Thanks to @maaniB's great help I have taken a big step towards the final solution, but I still have a little way to go; here is where I stand.

I am French, so for the stop_words cleaning step I replaced nltk with spacy:

import re
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
    "\/",
    "\[",
    "\]",
    "\:",
    "\|",
    '\"',
    "\?",
    "\<",
    "\>",
    "\,",
    "\(",
    "\)",
    "\\",
    "\.",
    "\+",
    "\-",
    "\!",
    "\$",
    "\`",
    "\،",
    "\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
    lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
    axis=1
)
# replace stop words with 2 list
stop_words = list(fr_stop) + list(en_stop)
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
    lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
from sklearn.feature_extraction.text import TfidfVectorizer
# change the ngram_range to make combinations of words
tfidf_vector = TfidfVectorizer(stop_words=stop_words,
                               ngram_range=(1, 4),
                               encoding="utf-8")
tpl_cntvec = tfidf_vector.fit_transform(df_produits_en_ligne['post_excerpt'])
df_cntvec = pd.DataFrame(tpl_cntvec.toarray(),
                         columns=tfidf_vector.get_feature_names(),
                         index=df_produits_en_ligne.index)
df_total_bow = pd.concat([df_produits_en_ligne['total_sales'], df_cntvec],
                         axis=1)
df_total_bow

I am stuck on the last step. I tried @maaniB's nice version with the ordinary least squares method:

import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())

To run it and get results in a Jupyter notebook, I had to change --NotebookApp.iopub_data_rate_limit from the command line:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

It worked after about 3 minutes of processing, but I am completely lost with the result: it returns 46987 rows and I do not know how to interpret them. Here is a screenshot of my results.

Can someone explain to me how to interpret it?
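
One way to make that huge summary digestible, sketched under the assumption that fit and tfidf_vector are the objects created in the snippets above: each coefficient in the OLS output corresponds to one n-gram column of df_total_bow, so the coefficients can be paired with the n-gram names and ranked by p-value to see which word combinations are most associated with sales.

import pandas as pd

# pair each OLS coefficient with its n-gram and rank by p-value
coefs = pd.DataFrame({
    'ngram': tfidf_vector.get_feature_names(),
    'coef': fit.params,
    'p_value': fit.pvalues,
})
print(coefs.sort_values('p_value').head(20))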

I tried another method, but after an hour of processing without a result I cancelled it:

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
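
A possible reason this run never finishes usefully: Total_sales is a count, so RepeatedStratifiedKFold and a DecisionTreeClassifier scored with accuracy treat every distinct sales value as its own class. A regression variant of the same pipeline, sketched here with x and y as defined above, would look roughly like this:

from numpy import mean, std
from sklearn.model_selection import cross_val_score, KFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline

# same structure as above, but with a regressor and a regression metric
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s', rfe), ('m', model)])
cv = KFold(n_splits=10, shuffle=True, random_state=1)
n_scores = cross_val_score(pipeline, x, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (-mean(n_scores), std(n_scores)))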

Edit 4:

import seaborn as sns
import matplotlib.pyplot as plt

tx = df_total_bow[df_total_bow.drop('total_sales', 1).columns].to_numpy()
ty = df_total_bow['total_sales'].to_numpy()
n = len(df_produits_en_ligne)
indep = tx.dot(ty) / n

c = df_total_bow.fillna(0)
measure = (c - indep)**2 / indep
xi_n = measure.sum().sum()
table = measure / xi_n
sns.heatmap(table.iloc[:-1, :-1], annot=c.iloc[:-1, :-1])
plt.show()

But I get:

ValueError: shapes (714,46987) and (714,) not aligned: 46987 (dim 1) != 714 (dim 0)
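
The immediate cause of this ValueError is the dot product: tx has shape (714, 46987) and ty has shape (714,), so tx.dot(ty) is not defined. Transposing gives one aggregate per n-gram column, as in the minimal sketch below; the rest of the contingency-table logic would still need to be reworked so that measure lines up with the columns of df_total_bow.

# hypothetical fix for the shape error only: one value per n-gram column
indep = tx.T.dot(ty) / n   # shape (46987,)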

Your question is a combination of several text mining tasks, which I try to address briefly here. As in any NLP and text mining project, the first step is cleaning, including the removal of stop words, stop characters, etc.:

import re

import pandas as pd
from nltk.corpus import stopwords

# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
    "\/",
    "\[",
    "\]",
    "\:",
    "\|",
    '\"',
    "\?",
    "\<",
    "\>",
    "\,",
    "\(",
    "\)",
    "\\",
    "\.",
    "\+",
    "\-",
    "\!",
    "\$",
    "\`",
    "\،",
    "\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
    lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
    axis=1
)
# replace stop words
stop_words = stopwords.words('english')
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
    lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)

#   Product_id  Price  Total_sales      CA           Product_description
# 0           1   24.0           28   672.0        fruity perfect starter
# 1           2   13.5           34   459.0                dry fullbodied
# 2           3   12.9           29   374.1         fresh perfect starter
# 3           4   34.0           42  1428.0  combining strength character
# 4           5   26.0           10   260.0         ruby color fullbodied

Next, you need to extract the features (you mentioned counts of words and phrases):

from sklearn.feature_extraction.text import CountVectorizer
# change the ngram_range to make combinations of words
count_vector = CountVectorizer(ngram_range=(1, 4), encoding="utf-8")
tpl_cntvec = count_vector.fit_transform(df['Product_description'])
df_cntvec = pd.DataFrame(
    tpl_cntvec.toarray(), columns=count_vector.get_feature_names(), index=df.index
)
df_total_bow = pd.concat([df['Total_sales'], df_cntvec], axis = 1)
df_total_bow
#   Total_sales  character  color  color fullbodied  combining  ...  ruby color  ruby color fullbodied  starter  strength  strength character
# 0           28          0      0                 0          0  ...           0                      0        1         0                   0
# 1           34          0      0                 0          0  ...           0                      0        0         0                   0
# 2           29          0      0                 0          0  ...           0                      0        1         0                   0
# 3           42          1      0                 0          1  ...           0                      0        0         1                   1
# 4           10          0      1                 1          0  ...           1                      1        0         0                   0

Finally, you can fit a model on the data:

import statsmodels.api as sm
# Here, I used ordinary least square regression method
x = df_total_bow.drop(columns='Total_sales').to_numpy()
y = df_total_bow['Total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
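
A small caveat about this model: statsmodels' OLS does not include an intercept unless you add one explicitly, so if you want a constant term, wrap x with sm.add_constant:

# optional: include an intercept in the regression
x_const = sm.add_constant(x)
fit_const = sm.OLS(y, x_const).fit()
print(fit_const.summary())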

Regarding your other questions:

  • There are various statistical methods to find the importance of words in a text and their correlation with some other variable. CountVectorizer is just a simple feature_extraction method; there are better ones, such as TfidfTransformer (see the sketch after this list).
  • The type of statistical test or model depends on your problem. Since you only need to find out the correlation of word combinations with the sales statistics, a simple regression-based approach with feature extraction is helpful. To rank the features (to find the word combinations with the highest correlation and importance), recursive feature elimination (sklearn.feature_selection.RFE) can be practical.
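
For instance, a minimal sketch of the TfidfTransformer idea mentioned above, re-weighting the CountVectorizer counts from the earlier snippet (this assumes tpl_cntvec and count_vector are still in scope):

from sklearn.feature_extraction.text import TfidfTransformer

# re-weight the raw n-gram counts by TF-IDF instead of using them directly
tfidf = TfidfTransformer()
tpl_tfidf = tfidf.fit_transform(tpl_cntvec)
df_tfidf = pd.DataFrame(tpl_tfidf.toarray(),
                        columns=count_vector.get_feature_names(),
                        index=df.index)
df_total_tfidf = pd.concat([df['Total_sales'], df_tfidf], axis=1)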