How to make a function to check whether the combination and/or redundancy of words has a correlation with the number of sales?
In my dataframe highlighting online product sales, I have a column containing the description of each product sold.
I would like to create an algorithm that checks whether the combination and/or redundancy of words correlates with the number of sales.
But I want to be able to filter out words that are too redundant, like the product type. For example, my dataframe deals with wine sales, so the algorithm must not take the word "wine" in the descriptions into account.
In my df, I have 700 rows consisting of 4 columns:
- product_id: the id of each product
- product_price: the price of the product
- total_sales: the total number of sales of the product
- product_description: the description of the product (e.g. "Fruity wine, perfect as a starter"; "Dry and full-bodied wine"; "Fresh and perfect wine as a starter"; "Combining strength and character"; "Wine with a ruby color, full-bodied"; etc.)
Edit:
I have added:
- a column 'CA': total sales of the product * price of the product
- an example of my df
Example of my DataFrame:
import pandas as pd
data = {'Product_id': [1, 2, 3, 4, 5],
'Price': [24, 13.5, 12.9, 34, 26],
'Total_sales': [28, 34, 29, 42, 10],
'CA': [672, 459, 374.1, 1428, 260],
'Product_description': ["Fruity wine, perfect as a starter",
"Dry and full-bodied wine",
"Fresh and perfect wine as a starter",
"Wine combining strength and character",
"Wine with a ruby color, full-bodied "]}
df = pd.DataFrame(data)
df
Edit 2:
- To find out whether the correlation between certain words (and/or combinations of words) has an impact on the number of sales. I think that for this I could create a heatmap, with the ordered distinct values of my column ["total_sales"], and the most frequent words of the column ["product_description"] on the abscissa. I think an ANOVA could help me check the correlation between these two variables, or a chi-square test...
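For illustration, here is a minimal sketch of that chi-square idea on the example df above (scipy is assumed; "starter" is just an arbitrary word picked for the demo):
from scipy.stats import chi2_contingency
import pandas as pd
# does the presence of one word co-vary with a binned sales level?
word = "starter"
has_word = df["Product_description"].str.contains(word, case=False)
sales_level = pd.qcut(df["Total_sales"], q=2, labels=["low", "high"])
contingency = pd.crosstab(has_word, sales_level)
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")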
My course of action:
- Find the number of unique values of my column ["total_sales"]; I have 43 distinct values
- Create a list of stop words = [list of redundant words (e.g. 'the', 'by', etc.)]
- Split the words of all the rows of my column ["description"]:
wordslist = df["description"].str.split()
- I could not filter the result of the wordslist variable with the stop words:
comp = re.compile('|'.join(stopwords))
z = [re.sub(comp, '', i).strip() for i in words_split]
print(z)
I get the following error (a possible fix is sketched at the end of this edit):
TypeError: expected string or bytes-like object
- After that I plan to get the frequency of occurrence of each word in the column df["description"]
- The words with a significant frequency should appear on the abscissa of my heatmap, with the ordered numbers of sales
Is this a good way (provided I find a solution to my error) to check whether the use of a word / a combination of words has an impact on the sales of a product?
Can you give me some hints?
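For reference, the TypeError above happens because df["description"].str.split() returns a list of tokens for each row, so re.sub receives a list instead of a string (the snippet also defines wordslist but iterates over words_split). A minimal sketch of one way to filter, shown on the example df with a purely illustrative stop-word list:
stop_words = ["the", "a", "as", "and", "with", "wine"]  # example list only
# str.split() yields one list of tokens per row, so filter the tokens directly
# (punctuation would still need to be stripped first, as done later in Edit 3)
wordslist = df["Product_description"].str.lower().str.split()
filtered = wordslist.apply(lambda tokens: [w for w in tokens if w not in stop_words])
print(filtered)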
Edit 3:
Thanks to @maaniB's great help I have taken a big step towards the final solution, but I still have a little way to go; here is where I am:
I am French, so for the stop_words cleaning method I replaced nltk with spacy:
import re
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\/",
"\[",
"\]",
"\:",
"\|",
'\"',
"\?",
"\<",
"\>",
"\,",
"\(",
"\)",
"\\\\",  # doubled so the '|'.join(...) pattern keeps a valid alternation
"\.",
"\+",
"\-",
"\!",
"\$",
"\`",
"\،",
"\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
axis=1
)
# replace stop words using the two lists (French + English)
stop_words = list(fr_stop) + list(en_stop)
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
- To extract the features I tried using CountVectorizer and TfidfVectorizer (which I had confused with TfidfTransformer), and I found the results of TfidfVectorizer better:
from sklearn.feature_extraction.text import TfidfVectorizer
# change the ngram_range to make combinations of words
tfidf_vector = TfidfVectorizer(stop_words=stop_words,
ngram_range=(1, 4),
encoding="utf-8")
tpl_cntvec = tfidf_vector.fit_transform(df_produits_en_ligne['post_excerpt'])
df_cntvec = pd.DataFrame(tpl_cntvec.toarray(),
columns=tfidf_vector.get_feature_names(),
index=df_produits_en_ligne.index)
df_total_bow = pd.concat([df_produits_en_ligne['total_sales'], df_cntvec],
axis=1)
df_total_bow
I am stuck on the last step; I tried @maaniB's nice version using the ordinary least squares method:
import statsmodels.api as sm
# Here, I used the ordinary least squares regression method
x = df_total_bow[df_total_bow.drop('total_sales', axis=1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
To run it and produce the result in a Jupyter notebook, I had to change --NotebookApp.iopub_data_rate_limit via the command line:
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
It worked after a 3-minute run, but I am completely lost with the result: it returns 46987 rows and I don't know how to interpret it.
Here is a screenshot of my results.
Can someone explain to me how to interpret it?
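For what it is worth, one way to make that summary digestible (a sketch assuming the fitted fit and the df_total_bow from above) is to put the coefficients and p-values into a DataFrame labelled by the n-gram names and sort them; with roughly 46987 features for only 714 rows the OLS fit is badly under-determined, so the values are indicative at best:
import pandas as pd
feature_names = df_total_bow.drop('total_sales', axis=1).columns
coefs = pd.DataFrame({'ngram': feature_names,
                      'coef': fit.params,
                      'p_value': fit.pvalues})
# n-grams with the largest absolute coefficients
print(coefs.reindex(coefs['coef'].abs().sort_values(ascending=False).index).head(20))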
I tried another approach, but after an hour of processing with no result I cancelled it:
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow[df_total_bow.drop('total_sales', axis=1).columns].to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
model = DecisionTreeClassifier()
pipeline = Pipeline(steps=[('s',rfe),('m',model)])
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
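One likely reason this run never finished is that accuracy and RepeatedStratifiedKFold assume a classification target, whereas total_sales is a numeric count, and that RFE with its default step removes features one at a time out of tens of thousands. Below is a hedged sketch of a regression-flavoured variant of the same idea (estimator, scoring and elimination step swapped in as assumptions):
from numpy import mean, std
from sklearn.model_selection import cross_val_score, KFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
# define dataset
x = df_total_bow.drop('total_sales', axis=1).to_numpy()
y = df_total_bow['total_sales'].to_numpy()
# create pipeline: drop 20% of the features per RFE iteration to keep it tractable
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5, step=0.2)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s', rfe), ('m', model)])
# evaluate with plain K-fold and an error metric suited to a numeric target
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, x, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (-mean(scores), std(scores)))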
Edit 4:
- I tried, in vain, to make a heatmap with df_total_bow:
import seaborn as sns
import matplotlib.pyplot as plt
tx = df_total_bow[df_total_bow.drop('total_sales', axis=1).columns].to_numpy()
ty = df_total_bow['total_sales'].to_numpy()
n = len(df_produits_en_ligne)
indep = tx.dot(ty) / n
c = df_total_bow.fillna(0)
measure = (c - indep)**2 / indep
xi_n = measure.sum().sum()
table = measure / xi_n
sns.heatmap(table.iloc[:-1, :-1], annot=c.iloc[:-1, :-1])
plt.show()
But I get:
ValueError: shapes (714,46987) and (714,) not aligned: 46987 (dim 1) != 714 (dim 0)
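A simpler alternative that avoids the shape mismatch (a sketch assuming the df_total_bow built in Edit 3): correlate each of the most frequent n-grams with total_sales and plot only those correlations:
import matplotlib.pyplot as plt
import seaborn as sns
features = df_total_bow.drop('total_sales', axis=1)
# keep the 30 n-grams with the largest total weight so the plot stays readable
top_cols = features.sum().sort_values(ascending=False).head(30).index
corr = features[top_cols].corrwith(df_total_bow['total_sales']).sort_values()
sns.heatmap(corr.to_frame('corr_with_total_sales'), annot=True, cmap='coolwarm')
plt.show()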
Your question is a combination of text mining tasks, which I try to address briefly here. As in any NLP and text mining project, the first step is cleaning, including removing stop words, stop characters, etc.:
import re
import pandas as pd
from nltk.corpus import stopwords
# to lowercase
df['Product_description'] = df['Product_description'].str.lower()
# replace_stop_characters
stop_chars = [
"\/",
"\[",
"\]",
"\:",
"\|",
'\"',
"\?",
"\<",
"\>",
"\,",
"\(",
"\)",
"\\\\",  # doubled so the '|'.join(...) pattern keeps a valid alternation
"\.",
"\+",
"\-",
"\!",
"\$",
"\`",
"\،",
"\_",
]
stop_chars_pattern = "|".join(stop_chars)
df['Product_description'] = df.apply(
lambda row: re.sub(stop_chars_pattern, "", row["Product_description"]),
axis=1
)
# replace stop words
stop_words = stopwords.words('english')
stop_words.extend(['wine']) # extend the list as you wish
df['Product_description'] = df['Product_description'].map(
lambda x: ' '.join([w for w in x.split() if w not in stop_words])
)
print(df)
# Product_id Price Total_sales CA Product_description
# 0 1 24.0 28 672.0 fruity perfect starter
# 1 2 13.5 34 459.0 dry fullbodied
# 2 3 12.9 29 374.1 fresh perfect starter
# 3 4 34.0 42 1428.0 combining strength character
# 4 5 26.0 10 260.0 ruby color fullbodied
Next, you need to extract the features (you mentioned counts of words and phrases).
from sklearn.feature_extraction.text import CountVectorizer
# change the ngram_range to make combinations of words
count_vector = CountVectorizer(ngram_range=(1, 4), encoding="utf-8")
tpl_cntvec = count_vector.fit_transform(df['Product_description'])
df_cntvec = pd.DataFrame(
tpl_cntvec.toarray(), columns=count_vector.get_feature_names(), index=df.index
)
df_total_bow = pd.concat([df['Total_sales'], df_cntvec], axis = 1)
df_total_bow
#    Total_sales  character  color  color fullbodied  combining  ...  ruby color  ruby color fullbodied  starter  strength  strength character
# 0           28          0      0                 0          0  ...           0                      0        1         0                   0
# 1           34          0      0                 0          0  ...           0                      0        0         0                   0
# 2           29          0      0                 0          0  ...           0                      0        1         0                   0
# 3           42          1      0                 0          1  ...           0                      0        0         1                   1
# 4           10          0      1                 1          0  ...           1                      1        0         0                   0
Finally, you can build a model on the data:
import statsmodels.api as sm
# Here, I used the ordinary least squares regression method
x = df_total_bow[df_total_bow.drop('Total_sales', axis=1).columns].to_numpy()
y = df_total_bow['Total_sales'].to_numpy()
ols = sm.OLS(y, x)
fit = ols.fit()
print(fit.summary())
Regarding your other questions:
- There are various statistical methods to find the importance of words in a text, and their correlation with some other variable. CountVectorizer is just a simple method of feature_extraction. There are better methods, e.g. TfidfTransformer.
- The type of statistical test or model depends on the problem. Since you just need to find out the correlation of word combinations with the sales statistics, simple regression-based methods with feature extraction are helpful. To rank the features (to find the word combinations with the highest correlation and importance), recursive feature elimination (sklearn.feature_selection.RFE) may be practical (a minimal sketch follows below).
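For example, a minimal sketch of the RFE idea on the df_total_bow built above (LinearRegression is just one possible estimator here, chosen for illustration):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
X = df_total_bow.drop('Total_sales', axis=1)
y = df_total_bow['Total_sales']
# keep the 5 n-grams the linear model finds most useful
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5, step=0.2)
rfe.fit(X, y)
print("Selected n-grams:", list(X.columns[rfe.support_]))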