Daily Mentions of a Word

Question

I have the following df, which contains daily articles from different sources:

print(df)

Date         content

2018-11-01    Apple Inc. AAPL 1.54% reported its fourth cons...
2018-11-01    U.S. stocks climbed Thursday, Apple is a real ...
2018-11-02    GONE are the days when smartphone manufacturer...
2018-11-03    To historians of technology, the story of the ...
2018-11-03    Apple Inc. AAPL 1.54% reported its fourth cons...
2018-11-03    Apple is turning to traditional broadcasting t...

(...)

I want to count the total number of daily mentions of the word "Apple", aggregated by date. How can I create "final_df"?

print(final_df) 

    2018-11-01    2
    2018-11-02    0
    2018-11-03    2
    (...)

Answer 1

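You can GroupBy the different dates, use str.count to count the occurrences of Apple in each article, and aggregate with sum to get the total count in each group. The chained call is wrapped in parentheses so it can span several lines:

(df.groupby('Date')
   .apply(lambda x: x.content.str.count('Apple').sum())
   .reset_index(name='counts'))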

      Date     counts
0 2018-11-01       2
1 2018-11-02       0
2 2018-11-03       2
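
For anyone who wants to run this end to end, here is a minimal sketch. It assumes the frame can be rebuilt from the truncated snippets printed in the question (the full article texts are not available), and it selects the content column before applying, which is a small variation on the answer above rather than the answerer's exact code.

```python
import pandas as pd

# Sample frame rebuilt from the truncated snippets shown in the question.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2018-11-01", "2018-11-01", "2018-11-02",
        "2018-11-03", "2018-11-03", "2018-11-03",
    ]),
    "content": [
        "Apple Inc. AAPL 1.54% reported its fourth cons",
        "U.S. stocks climbed Thursday, Apple is a real",
        "GONE are the days when smartphone manufacturer",
        "To historians of technology, the story of the",
        "Apple Inc. AAPL 1.54% reported its fourth cons",
        "Apple is turning to traditional broadcasting t",
    ],
})

# Count occurrences per article, then add them up within each date.
final_df = (
    df.groupby("Date")["content"]
      .apply(lambda s: s.str.count("Apple").sum())
      .reset_index(name="counts")
)
print(final_df)
```

Selecting "content" first means the lambda only ever sees the text column, so the grouping column is not passed into apply.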

Answer 2

Use str.count to build a new Series of per-article counts, then group it by the column df['Date'] and aggregate with sum:

df1 = df['content'].str.count('Apple').groupby(df['Date']).sum().reset_index(name='count')
print(df1)
         Date  count
0  2018-11-01      2
1  2018-11-02      0
2  2018-11-03      2
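
One thing worth knowing is that str.count treats its pattern as a regular expression, so 'Apple' also matches inside longer tokens such as "AppleCare". The sketch below is one way to count whole-word mentions only, with optional case-insensitivity. The helper name daily_mentions and the sample sentences are hypothetical, made up for illustration and not taken from the question.

```python
import re
import pandas as pd

def daily_mentions(df, word, ignore_case=False):
    """Count whole-word occurrences of `word` per date (hypothetical helper)."""
    # \b marks a word boundary; re.escape protects words containing regex symbols.
    pattern = r"\b" + re.escape(word) + r"\b"
    flags = re.IGNORECASE if ignore_case else 0
    # Treat missing articles as empty text so the counts stay integers.
    counts = df["content"].fillna("").str.count(pattern, flags=flags)
    return counts.groupby(df["Date"]).sum().reset_index(name="count")

# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({
    "Date": ["2018-11-01", "2018-11-01", "2018-11-02"],
    "content": [
        "Apple and apple pie",
        "AppleCare is an Apple service",
        None,
    ],
})

strict = daily_mentions(sample, "Apple")
relaxed = daily_mentions(sample, "Apple", ignore_case=True)
print(strict)
print(relaxed)
```

Whether "AppleCare" or a lowercase "apple" should count as a mention depends on the use case, so both choices are left as parameters here.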

Answer 3

As an alternative, you can try str.contains together with groupby. Note that this counts the articles that mention the word on each day, not the total number of occurrences. The boolean result has to be aggregated with sum, because count would only return the number of rows per date.

>>> df
         Date                                         content
0  2018-11-01  Apple Inc. AAPL 1.54% reported its fourth cons
1  2018-11-01   U.S. stocks climbed Thursday, Apple is a real
2  2018-11-02  GONE are the days when smartphone manufacturer
3  2018-11-03   To historians of technology, the story of the
4  2018-11-03  Apple Inc. AAPL 1.54% reported its fourth cons
5  2018-11-03  Apple is turning to traditional broadcasting t

Solution:

df.content.str.contains("Apple").groupby(df['Date']).sum().reset_index(name="count")

         Date  count
0  2018-11-01      2
1  2018-11-02      0
2  2018-11-03      2

# Variant that is explicit about case sensitivity and treats missing content as "no match":
# df["content"].str.contains('Apple', case=True, na=False).groupby(df['Date']).sum()
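
To see how the choice of aggregation changes the result of the str.contains approach, the following sketch puts count and sum side by side on the sample rows printed above. The column names rows and articles_with_match are hypothetical labels chosen for this comparison.

```python
import pandas as pd

# Sample frame rebuilt from the truncated snippets printed above.
df = pd.DataFrame({
    "Date": [
        "2018-11-01", "2018-11-01", "2018-11-02",
        "2018-11-03", "2018-11-03", "2018-11-03",
    ],
    "content": [
        "Apple Inc. AAPL 1.54% reported its fourth cons",
        "U.S. stocks climbed Thursday, Apple is a real",
        "GONE are the days when smartphone manufacturer",
        "To historians of technology, the story of the",
        "Apple Inc. AAPL 1.54% reported its fourth cons",
        "Apple is turning to traditional broadcasting t",
    ],
})

# One boolean per article: does it mention the word at all?
mentions = df["content"].str.contains("Apple", na=False)
by_day = mentions.groupby(df["Date"])

comparison = pd.DataFrame({
    "rows": by_day.count(),                # number of articles per date
    "articles_with_match": by_day.sum(),   # number of articles that are True
})
print(comparison)
```

Here count tallies every row regardless of whether it is True or False, while sum adds up only the True values, which is why only sum reflects the matches.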

python

nlp

word-count

pandas