Bi-grams by date
I have the following dataset:

        Date        D
0 01/18/2020 shares recipes ... - news updates · breaking news emails · lives to remem...
1 01/18/2020 both sides of the pineapple slices with olive oil. ... some of my other support go-to's i...
2 01/18/2020 honey, tea tree oil ...learn more from webmd about honey ...
3 01/18/2020 years of downtown arts | times leaderas the local community dealt with concerns, pet...
4 01/18/2020 brooklyn, ny | opentableblood orange, arugula, hazelnuts, on toast. charcuterie. .00. smoked ...
5 01/19/2020 santa maria di leuca - we the italiansthe sounds of the taranta, the smell of tomatoes, olive oil...
6 01/19/2020 abuse in amish communities : nprit's been a minute with sam sanders · code switch · throughline ...
7 01/19/2020 fast, healthy recipe ideas – cbs new ...toss the pork cubes with chili powder, oregano, cumin, c...
9 01/19/2020 100; 51-100 | csnyi have used oregano oil, coconut oil, famciclovir, an..
I am interested in a dataframe that shows bi-gram frequencies by date.
Currently I am doing it this way:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
word_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer='word', stop_words=stop_words)
sparse_matrix = word_vectorizer.fit_transform(df['D'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies,
             index=word_vectorizer.get_feature_names(),  # get_feature_names_out() on scikit-learn >= 1.0
             columns=['Frequency']).sort_values(by='Frequency', ascending=False)
But it does not show the bi-grams by date, only their overall frequencies.
I would expect something like this (expected output):
Date        Bi-gram     Frequency
01/18/2020  bi-gram_1   43
            bi-gram_2   12
            ...
01/19/2020  bi-gram_5   42
            bi-gram_6   23

and so on; bi-gram_1, bi-gram_2, ... are just placeholders.
Any suggestions on how to get such a dataframe?
My approach to solving this was to reorganize your original dataframe so that the primary key is the date, and each date maps to a list of sentences:
new_df = {}
for index, row in df.iterrows():
    if row[0] not in new_df:
        new_df[row[0]] = []
    new_df[row[0]].append(row[1])
row[0] is the date and row[1] is the data.
The output will look like this:
{'1/18/20': ['shares recipes news updates breaking news google',
             'shares recipes news updates breaking news seo'],
 '1/19/20': ['shares recipes news updates breaking news emails',
             'shares recipes news updates breaking news web']}
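As a side note, the dictionary-building step above can also be written with `collections.defaultdict`, which removes the explicit key check. A minimal sketch on toy stand-in data (the column values here are assumptions, not the real dataset):

```python
from collections import defaultdict

import pandas as pd

# Toy data standing in for the original dataframe
df = pd.DataFrame({'Date': ['1/18/20', '1/18/20', '1/19/20'],
                   'D': ['shares recipes', 'breaking news', 'olive oil']})

new_df = defaultdict(list)
for _, row in df.iterrows():
    # row.iloc[0] is the date, row.iloc[1] the text
    # (positional row[0] still works but is deprecated in recent pandas)
    new_df[row.iloc[0]].append(row.iloc[1])

print(dict(new_df))
# {'1/18/20': ['shares recipes', 'breaking news'], '1/19/20': ['olive oil']}
```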
Now you can iterate over each date and get the frequency of every bi-gram within that date. The results are stored in a dataframe similar to the one above and appended to a list. At the end, that list will hold n dataframes, where n is the number of dates in your dataset:
word_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer='word',
                                  stop_words=stop_words)
frames = []
for date, values in new_df.items():
    sparse_matrix = word_vectorizer.fit_transform(values)
    frequencies = sum(sparse_matrix).toarray()[0]
    results = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(),
                           columns=['Frequency']).sort_values(by='Frequency', ascending=False)
    frames.append(results)
Or, if you want the date to appear on every row of the dataframe, you can modify step 2 to:
word_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer='word',
                                  stop_words=stop_words)
frames = []
for date, values in new_df.items():
    sparse_matrix = word_vectorizer.fit_transform(values)
    frequencies = sum(sparse_matrix).toarray()[0]
    results = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(),
                           columns=['Frequency']).sort_values(by='Frequency', ascending=False)
    results["Date"] = date  # broadcast the date to every row
    frames.append(results)
Finally, you can concatenate the dataframes:
pd.concat(frames, keys=list(new_df.keys()))
*** One improvement you could make is to re-index the dataframe within pandas itself instead of building a new dictionary.