How to aggregate data based on multiple columns in Python

Question

I am trying to aggregate a text field based on the date and category columns.
This is what the initial dataset looks like:

created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics

The expected output I want to get is as follows:

created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan,sports
7/29/2021,President Made a clear statement An election is to be kept next year,politics
7/29/2021,A terrorist attack have killed 10 people,crime
8/29/2021,Srilanka have lost the T20 series Australia have won the series,sports
8/29/2021,Minister have given up his role last monday President is challenging the opposite leader,politics

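As the example shows, I want to aggregate the tweet text by both date and category. Below is how I previously aggregated the data without taking the category into account. I need to change it so that it produces the output above. It would be really helpful if someone could answer this.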

import pandas as pd

def aggregated():
    tweets = pd.read_csv(r'data_set.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    pd.set_option('display.max_colwidth', 0)
    df = df.groupby(pd.Grouper(key='created_at', freq='1D')).agg(lambda x: ' '.join(set(x)))
    return df


# Driver code
if __name__ == '__main__':
    print(aggregated())
    aggregated().to_csv(r'agg-1.csv',index = True, header=True)

Answer 1

Here df is your example data. First use groupby to collect the tweet column into a list per group, then join each list with apply.

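# Collect the tweets of each (created_at, category) group into a list
df = df.groupby(["created_at", "category"], as_index=False)["tweet"].agg(lambda x: list(x))
# Join each list into one space-separated string
df["tweet"] = df["tweet"].apply(lambda x: " ".join(x))
# Restore the original column order
df = df.reindex(columns=["created_at", "tweet", "category"])
df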

Output:

    created_at  tweet   category
0   7/29/2021   A terrorist attack have killed 10 people    crime
1   7/29/2021   President Made a clear statement An election i...   politics
2   7/29/2021   Great Sunny day for Cricket at London Great Sc...   sports
3   8/29/2021   Minister have given up his role last monday Pr...   politics
4   8/29/2021   Srilanka have lost the T20 series Australia ha...   sports
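
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It assumes the sample rows from the question, loaded from an in-memory string instead of data_set.csv, and it passes " ".join directly to agg instead of building lists first. The names CSV_TEXT and result are only illustrative. Note that groupby sorts the group keys by default, which is why the crime row comes first in the output above.

```python
import io

import pandas as pd

# Sample rows copied from the question (in-memory stand-in for data_set.csv)
CSV_TEXT = """created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Group on both key columns and join the tweets of each group with spaces
result = (
    df.groupby(["created_at", "category"])["tweet"]
    .agg(" ".join)
    .reset_index()
)

# Put the columns back in the order used in the question
result = result[["created_at", "tweet", "category"]]
print(result)
```

If the original row order of the groups matters, passing sort=False to groupby should keep the groups in order of first appearance.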

Answer 2

You can use:

out = df.groupby(['created_at', 'category'], sort=False, as_index=False)['tweet'] \
        .apply(lambda x: ' '.join(x))[df.columns]
print(out)

Output:

>>> out
  created_at                                                                                                    tweet  category
0  7/29/2021  Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan    sports
1  7/29/2021                                     President Made a clear statement An election is to be kept next year  politics
2  7/29/2021                                                                 A terrorist attack have killed 10 people     crime
3  8/29/2021                                          Srilanka have lost the T20 series Australia have won the series    sports
4  8/29/2021                 Minister have given up his role last monday President is challenging the opposite leader  politics
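
The function in the question also parses the dates and drops duplicate tweets with set(), which does not preserve the order of the tweets. Below is a rough sketch of how those steps might be combined with the two-column grouping. It is not the asker's exact method: it swaps pd.Grouper for dt.normalize() to bucket by day, and it uses dict.fromkeys for de-duplication that keeps first-seen order. The names join_unique, aggregate_tweets and sample are hypothetical, and the date format is assumed to be month/day/year as in the sample data.

```python
import pandas as pd


def join_unique(texts):
    """Join texts with spaces, dropping repeats but keeping first-seen order."""
    # dict.fromkeys preserves insertion order, unlike set()
    return " ".join(dict.fromkeys(texts))


def aggregate_tweets(df):
    """Aggregate tweet text per (day, category) pair."""
    df = df.copy()
    # Assumption: dates look like 7/29/2021 (month/day/year)
    df["created_at"] = pd.to_datetime(df["created_at"], format="%m/%d/%Y")
    # Truncate any time-of-day part so rows from the same day share a key
    df["created_at"] = df["created_at"].dt.normalize()
    df["tweet"] = df["tweet"].astype(str)
    out = (
        df.groupby(["created_at", "category"], sort=False)["tweet"]
        .agg(join_unique)
        .reset_index()
    )
    return out[["created_at", "tweet", "category"]]


# Small illustrative input with one repeated tweet
sample = pd.DataFrame(
    {
        "created_at": ["7/29/2021", "7/29/2021", "7/29/2021", "8/29/2021"],
        "tweet": [
            "Great Sunny day for Cricket at London",
            "President Made a clear statement",
            "Great Sunny day for Cricket at London",
            "Australia have won the series",
        ],
        "category": ["sports", "politics", "sports", "sports"],
    }
)

out = aggregate_tweets(sample)
print(out)
```

To write the result to a file as in the question, out.to_csv(r'agg-1.csv', index=False) should work, since the grouping keys are already ordinary columns here.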
