How to make a function to check whether the combination and/or redundancy of words has a correlation with the number of sales?
In my dataframe highlighting online product sales, I have a column containing the description of each product sold.
I would like to create an algorithm that checks whether the combination and/or redundancy of words correlates with the number of sales.
But I want to be able to filter out words that are too redundant, like the product type. For example, my dataframe deals with wine sales, so the algorithm must not take the word "wine" in the descriptions into account.
In my df, I have 700 rows consisting of 4 columns:
- product_id: the id of each product
- product_price: the price of the product
- total_sales: the total number of sales of the product
- product_description: the description of the product (e.g. "Fruity wine, perfect as a starter"; "Dry and full-bodied wine"; "Fresh and perfect wine as a starter"; "Combining strength and character"; "Wine with a ruby color, full-bodied"; etc.)
Edit:
I have added:
- a column 'CA': total sales of the product * price of the product
- an example of my df
Example of my DataFrame:
import pandas as pd
data = {'Product_id': [1, 2, 3, 4, 5],
'Price': [24, 13.5, 12.9, 34, 26],
'Total_sales': [28, 34, 29, 42, 10],
'CA': [672, 459, 374.1, 1428, 260],
'Product_description': ["Fruity wine, perfect as a starter",
"Dry and full-bodied wine",
"Fresh and perfect wine as a starter",
"Wine combining strength and character",
"Wine with a ruby color, full-bodied "]}
df = pd.DataFrame(data)
df
Edit 2:
- To find out whether the correlation between certain words (and/or combinations of words) has an impact on the number of sales. I think that for this I could create a heatmap, with the ordered distinct values of my column ["total_sales"], and the most frequent words of the column ["product_description"] on the abscissa. I think an ANOVA could help me check the correlation between these two variables, or a chi-square test...
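For illustration, here is a minimal sketch of that chi-square idea on the example df above (scipy is assumed; "starter" is just an arbitrary word picked for the demo):
from scipy.stats import chi2_contingency
import pandas as pd
# does the presence of one word co-vary with a binned sales level?
word = "starter"
has_word = df["Product_description"].str.contains(word, case=False)
sales_level = pd.qcut(df["Total_sales"], q=2, labels=["low", "high"])
contingency = pd.crosstab(has_word, sales_level)
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")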
My course of action:
- Find the number of unique values of my column ["total_sales"]; I have 43 distinct values
- Create a list of stop words = [list of redundant words (e.g. 'the', 'by', etc.)]
- Split the words of all the rows of my column ["description"]:
wordslist = df["description"].str.split()
- I could not filter the result of the wordslist variable with the stop words:
comp = re.compile('|'.join(stopwords))
z = [re.sub(comp, '', i).strip() for i in words_split]
print(z)
I get the following error (a possible fix is sketched at the end of this edit):
TypeError: expected string or bytes-like object
- After that I plan to get the frequency of occurrence of each word in the column df["description"]
- The words with a significant frequency should appear on the abscissa of my heatmap, with the ordered numbers of sales
Is this a good way (provided I find a solution to my error) to check whether the use of a word / a combination of words has an impact on the sales of a product?
Can you give me some hints?
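For reference, the TypeError above happens because df["description"].str.split() returns a list of tokens for each row, so re.sub receives a list instead of a string (the snippet also defines wordslist but iterates over words_split). A minimal sketch of one way to filter, shown on the example df with a purely illustrative stop-word list:
stop_words = ["the", "a", "as", "and", "with", "wine"]  # example list only
# str.split() yields one list of tokens per row, so filter the tokens directly
# (punctuation would still need to be stripped first, as done later in Edit 3)
wordslist = df["Product_description"].str.lower().str.split()
filtered = wordslist.apply(lambda tokens: [w for w in tokens if w not in stop_words])
print(filtered)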
Edit 3:
Thanks to @maaniB's great help I have taken a big step towards the final solution, but I still have a little way to go; here is where I am:
I am French, so for the stop_words cleaning method I replaced nltk with spacy:
import re
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\/",
"\[",
"\]",
"\:",
"\|",
'\"',
"\?",
"\<",
"\>",
"\,",
"\(",
"\)",
"\\\\",  # doubled so the '|'.join(...) pattern keeps a valid alternation
"\.",
"\+",
"\-",
"\!",
"\$",
"\`",
"\،",
"\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
axis=1
)
# replace stop words using the two lists (French + English)
stop_words = list(fr_stop) + list(en_stop)
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
- To extract the features I tried using CountVectorizer and TfidfVectorizer (which I had confused with TfidfTransformer), and I found the results of TfidfVectorizer better:
from sklearn.feature_extraction.text import TfidfVectorizer
# change the ngram_range to make combinations of words
tfidf_vector = TfidfVectorizer(stop_words=stop_words,
ngram_range=(1, 4),
encoding="utf-8")
tpl_cntvec = tfidf_vector.fit_transform(df_produits_en_ligne['post_excerpt'])
df_cntvec = pd.DataFrame(tpl_cntvec.toarray(),
columns=tfidf_vector.get_feature_names(),
index=df_produits_en_ligne.index)
df_total_bow = pd.concat([df_produits_en_ligne['total_sales'], df_cntvec],
axis=1)
df_total_bow
I am stuck on the last step; I tried @maaniB's nice version using the ordinary least squares method:
import statsmodels.api as sm
# Here, I used the ordinary least squares regression method
x = df_total_bow[df_total_bow.drop('total_sales', axis=1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
To run it and produce the result in a Jupyter notebook, I had to change --NotebookApp.iopub_data_rate_limit via the command line:
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
It worked after a 3-minute run, but I am completely lost with the result: it returns 46987 rows and I don't know how to interpret it.
Here is a screenshot of my results.
Can someone explain to me how to interpret it?
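For what it is worth, one way to make that summary digestible (a sketch assuming the fitted fit and the df_total_bow from above) is to put the coefficients and p-values into a DataFrame labelled by the n-gram names and sort them; with roughly 46987 features for only 714 rows the OLS fit is badly under-determined, so the values are indicative at best:
import pandas as pd
feature_names = df_total_bow.drop('total_sales', axis=1).columns
coefs = pd.DataFrame({'ngram': feature_names,
                      'coef': fit.params,
                      'p_value': fit.pvalues})
# n-grams with the largest absolute coefficients
print(coefs.reindex(coefs['coef'].abs().sort_values(ascending=False).index).head(20))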
I tried another approach, but after an hour of processing with no result I cancelled it:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow[df_total_bow.drop('total_sales', axis=1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
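One likely reason this run never finished is that accuracy and RepeatedStratifiedKFold assume a classification target, whereas total_sales is a numeric count, and that RFE with its default step removes features one at a time out of tens of thousands. Below is a hedged sketch of a regression-flavoured variant of the same idea (estimator, scoring and elimination step swapped in as assumptions):
from numpy import mean, std
from sklearn.model_selection import cross_val_score, KFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow.drop('total_sales', axis=1).to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline: drop 20% of the features per RFE iteration to keep it tractable
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5, step=0.2)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s', rfe), ('m', model)])
# evaluate with plain K-fold and an error metric suited to a numeric target
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, x, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (-mean(scores), std(scores)))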
Edit 4:
- I tried, in vain, to make a heatmap with df_total_bow:
import seaborn as sns
import matplotlib.pyplot as plt
tx = df_total_bow[df_total_bow.drop('total_sales', axis=1).columns].to_numpy()
ty = df_total_bow['total_sales'].to_numpy()
n = len(df_produits_en_ligne)
indep = tx.dot(ty) / n
c = df_total_bow.fillna(0)
measure = (c - indep)**2 / indep
xi_n = measure.sum().sum()
table = measure / xi_n
sns.heatmap(table.iloc[:-1, :-1], annot=c.iloc[:-1, :-1])
plt.show()
But I get:
ValueError: shapes (714,46987) and (714,) not aligned: 46987 (dim 1) != 714 (dim 0)
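A simpler alternative that avoids the shape mismatch (a sketch assuming the df_total_bow built in Edit 3): correlate each of the most frequent n-grams with total_sales and plot only those correlations:
import matplotlib.pyplot as plt
import seaborn as sns
features = df_total_bow.drop('total_sales', axis=1)
# keep the 30 n-grams with the largest total weight so the plot stays readable
top_cols = features.sum().sort_values(ascending=False).head(30).index
corr = features[top_cols].corrwith(df_total_bow['total_sales']).sort_values()
sns.heatmap(corr.to_frame('corr_with_total_sales'), annot=True, cmap='coolwarm')
plt.show()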
Your question is a combination of text mining tasks, which I try to address briefly here. As in any NLP and text mining project, the first step is cleaning, including removing stop words, stop characters, etc.:
import re
import pandas as pd
from nltk.corpus import stopwords
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\/",
"\[",
"\]",
"\:",
"\|",
'\"',
"\?",
"\<",
"\>",
"\,",
"\(",
"\)",
"\\\\",  # doubled so the '|'.join(...) pattern keeps a valid alternation
"\.",
"\+",
"\-",
"\!",
"\$",
"\`",
"\،",
"\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
axis=1
)
# replace stop words
stop_words = stopwords.words('english')
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
# Product_id Price Total_sales CA Product_description
# 0 1 24.0 28 672.0 fruity perfect starter
# 1 2 13.5 34 459.0 dry fullbodied
# 2 3 12.9 29 374.1 fresh perfect starter
# 3 4 34.0 42 1428.0 combining strength character
# 4 5 26.0 10 260.0 ruby color fullbodied
Next, you need to extract the features (you mentioned counts of words and phrases).
from sklearn.feature_extraction.text import CountVectorizer
# change the ngram_range to make combinations of words
count_vector = CountVectorizer(ngram_range=(1, 4), encoding="utf-8")
tpl_cntvec = count_vector.fit_transform(df['Product_description'])
df_cntvec = pd.DataFrame(
tpl_cntvec.toarray(), columns=count_vector.get_feature_names(), index=df.index
)
df_total_bow = pd.concat([df['Total_sales'], df_cntvec], axis = 1)
df_total_bow
#    Total_sales  character  color  color fullbodied  combining  ...  ruby color  ruby color fullbodied  starter  strength  strength character
# 0           28          0      0                 0          0  ...           0                      0        1         0                   0
# 1           34          0      0                 0          0  ...           0                      0        0         0                   0
# 2           29          0      0                 0          0  ...           0                      0        1         0                   0
# 3           42          1      0                 0          1  ...           0                      0        0         1                   1
# 4           10          0      1                 1          0  ...           1                      1        0         0                   0
Finally, you can build a model on the data:
import statsmodels.api as sm
# Here, I used the ordinary least squares regression method
x = df_total_bow[df_total_bow.drop('Total_sales', axis=1).columns].to_numpy()
y = df_total_bow['Total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
Regarding your other questions:
- There are various statistical methods to find the importance of words in a text, and their correlation with some other variable. CountVectorizer is just a simple method of feature_extraction. There are better methods, e.g. TfidfTransformer.
- The type of statistical test or model depends on the problem. Since you just need to find out the correlation of word combinations with the sales statistics, simple regression-based methods with feature extraction are helpful. To rank the features (to find the word combinations with the highest correlation and importance), recursive feature elimination (sklearn.feature_selection.RFE) may be practical (a minimal sketch follows below).
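For example, a minimal sketch of the RFE idea on the df_total_bow built above (LinearRegression is just one possible estimator here, chosen for illustration):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
X = df_total_bow.drop('Total_sales', axis=1)
y = df_total_bow['Total_sales']
# keep the 5 n-grams the linear model finds most useful
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5, step=0.2)
rfe.fit(X, y)
print("Selected n-grams:", list(X.columns[rfe.support_]))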