How to aggregate data based on multiple columns in Python

I am trying to aggregate a text field based on the date and category columns.
Below is what the initial dataset looks like:

created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics

The expected output I want to get is as follows:

created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan,sports
7/29/2021,President Made a clear statement An election is to be kept next year,politics
7/29/2021,A terrorist attack have killed 10 people,crime
8/29/2021,Srilanka have lost the T20 series Australia have won the series,sports
8/29/2021,Minister have given up his role last monday President is challenging the opposite leader,politics

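As the example shows, I want to aggregate the tweet text by both date and category. Below is how I previously aggregated without considering the category; I need to aggregate so that I get the output above. It would be really helpful if someone could answer this.
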
import pandas as pd

def aggregated():
    tweets = pd.read_csv(r'data_set.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    pd.set_option('display.max_colwidth', 0)
    df = df.groupby(pd.Grouper(key='created_at', freq='1D')).agg(lambda x: ' '.join(set(x)))
    return df


# Driver code
if __name__ == '__main__':
    print(aggregated())
    aggregated().to_csv(r'agg-1.csv',index = True, header=True)

Assuming df holds your sample data, first use groupby to collect the tweet column into lists, then join each list with apply:

df = df.groupby(["created_at", "category"], as_index=False)["tweet"].agg(lambda x: list(x))
df["tweet"] = df["tweet"].apply(lambda x: " ".join(x))
df = df.reindex(columns=["created_at", "tweet", "category"])
df

Output:

    created_at  tweet   category
0   7/29/2021   A terrorist attack have killed 10 people    crime
1   7/29/2021   President Made a clear statement An election i...   politics
2   7/29/2021   Great Sunny day for Cricket at London Great Sc...   sports
3   8/29/2021   Minister have given up his role last monday Pr...   politics
4   8/29/2021   Srilanka have lost the T20 series Australia ha...   sports
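
The two steps above (collect into lists, then join) can probably be collapsed into one by passing the join straight to agg. The following is only a minimal sketch: it assumes the question's sample rows are loaded from an inline string rather than from data_set.csv, and the names CSV and joined are made up for illustration. Because groupby sorts the group keys by default, the rows come back in key order (crime, politics, sports within each date) rather than in order of first appearance.

```python
import io
import pandas as pd

# Sample rows copied from the question; loaded from a string here
# instead of the asker's data_set.csv (an assumption for illustration).
CSV = """created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics
"""

df = pd.read_csv(io.StringIO(CSV))

# Group on both key columns and join each group's tweets in one step.
joined = (
    df.groupby(["created_at", "category"], as_index=False)["tweet"]
      .agg(lambda s: " ".join(s))
      .reindex(columns=["created_at", "tweet", "category"])
)
print(joined)
```

Note that created_at is still a plain string here, so the default sort is lexicographic; that happens to be harmless for these two dates but may not be for others.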

You can use:

out = df.groupby(['created_at', 'category'], sort=False, as_index=False)['tweet'] \
        .apply(lambda x: ' '.join(x))[df.columns]
print(out)

Output:

>>> out
  created_at                                                                                                    tweet  category
0  7/29/2021  Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan    sports
1  7/29/2021                                     President Made a clear statement An election is to be kept next year  politics
2  7/29/2021                                                                 A terrorist attack have killed 10 people     crime
3  8/29/2021                                          Srilanka have lost the T20 series Australia have won the series    sports
4  8/29/2021                 Minister have given up his role last monday President is challenging the opposite leader  politics
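
For completeness, here is one way the asker's original function might be adapted to group by category as well. This is a sketch under assumptions, not a definitive implementation: the function name aggregate_tweets, the source parameter, and the SAMPLE string (including its deliberately repeated row) are hypothetical. The original code used set(x) to drop repeated tweets, which does not keep their order; dict.fromkeys is swapped in here because it removes duplicates while preserving first-appearance order.

```python
import io
import pandas as pd

def aggregate_tweets(source):
    """Join tweet text per (date, category).

    `source` is a hypothetical parameter: any path or buffer that
    pandas.read_csv accepts.
    """
    df = pd.read_csv(source, usecols=["created_at", "tweet", "category"])
    # Parse dates so grouping works on real dates, not strings.
    df["created_at"] = pd.to_datetime(df["created_at"], format="%m/%d/%Y")
    df["tweet"] = df["tweet"].astype(str)
    out = (
        # sort=False keeps groups in order of first appearance.
        df.groupby(["created_at", "category"], sort=False, as_index=False)["tweet"]
          # dict.fromkeys drops repeated tweets but keeps their order.
          .agg(lambda s: " ".join(dict.fromkeys(s)))
    )
    return out[["created_at", "tweet", "category"]]

# Illustrative data only: the third row repeats the first on purpose.
SAMPLE = """created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Olympic is to held in Japan,sports
8/29/2021,Australia have won the series,sports
"""

result = aggregate_tweets(io.StringIO(SAMPLE))
print(result)
```

To write the result out as in the original driver code, result.to_csv(r'agg-1.csv', index=False) should work; index=False is suggested because the grouping keys are already regular columns.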