How to aggregate data based on multiple columns in Python
I am trying to aggregate a text field based on the date and category columns.
Below is what the initial dataset looks like:
created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics
The expected output I want to get is as follows:
created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan,sports
7/29/2021,President Made a clear statement An election is to be kept next year,politics
7/29/2021,A terrorist attack have killed 10 people,crime
8/29/2021,Srilanka have lost the T20 series Australia have won the series,sports
8/29/2021,Minister have given up his role last monday President is challenging the opposite leader,politics
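As the example shows, I want to aggregate the tweet text by both date and category. Below is how I previously aggregated by date alone, without considering the category. I now need to aggregate so that I get the output above. It would be very helpful if someone could answer this.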
import pandas as pd

def aggregated():
    tweets = pd.read_csv(r'data_set.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    pd.set_option('display.max_colwidth', 0)
    # Group into daily buckets and join the unique tweets of each day
    df = df.groupby(pd.Grouper(key='created_at', freq='1D')).agg(lambda x: ' '.join(set(x)))
    return df

# Driver code
if __name__ == '__main__':
    print(aggregated())
    aggregated().to_csv(r'agg-1.csv', index=True, header=True)
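Taking df as your example data:
Use groupby to collect the tweet column into a list for each (created_at, category) pair, then join each list with apply.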
df = df.groupby(["created_at", "category"], as_index=False)["tweet"].agg(lambda x: list(x))
df["tweet"] = df["tweet"].apply(lambda x: " ".join(x))
df = df.reindex(columns=["created_at", "tweet", "category"])
df
Output:
created_at tweet category
0 7/29/2021 A terrorist attack have killed 10 people crime
1 7/29/2021 President Made a clear statement An election i... politics
2 7/29/2021 Great Sunny day for Cricket at London Great Sc... sports
3 8/29/2021 Minister have given up his role last monday Pr... politics
4 8/29/2021 Srilanka have lost the T20 series Australia ha... sports
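The list-then-join steps can probably be collapsed into a single aggregation. Here is a minimal, self-contained sketch, assuming the sample rows from the question are inlined through io.StringIO instead of being read from data_set.csv (the names CSV_TEXT and result are only illustrative):

```python
import io

import pandas as pd

# Sample rows copied from the question; inlined so the sketch runs on its own.
CSV_TEXT = """created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Group on both key columns and join the tweets of each group directly,
# skipping the intermediate list column.
result = (
    df.groupby(["created_at", "category"], as_index=False)["tweet"]
      .agg(" ".join)
      .reindex(columns=["created_at", "tweet", "category"])
)

pd.set_option("display.max_colwidth", None)  # show full tweet text when printing
print(result)
```

Because groupby sorts the keys by default, the rows come back ordered by date string and then category, as in the output above. Within each group the tweets keep their original order.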
You can use:
out = df.groupby(['created_at', 'category'], sort=False, as_index=False)['tweet'] \
.apply(lambda x: ' '.join(x))[df.columns]
print(out)
Output:
>>> out
created_at tweet category
0 7/29/2021 Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan sports
1 7/29/2021 President Made a clear statement An election is to be kept next year politics
2 7/29/2021 A terrorist attack have killed 10 people crime
3 8/29/2021 Srilanka have lost the T20 series Australia have won the series sports
4 8/29/2021 Minister have given up his role last monday President is challenging the opposite leader politics
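If you want to keep the daily bucketing and de-duplication from the code in the question, one option is to pass pd.Grouper together with the category column. This is only a sketch under a few assumptions: the dates are in month/day/year form, the sample rows are inlined rather than read from data_set.csv, and dict.fromkeys is swapped in for set so that duplicates are dropped without losing the original tweet order (CSV_TEXT and daily are illustrative names):

```python
import io

import pandas as pd

# Sample rows copied from the question; inlined so the sketch runs on its own.
CSV_TEXT = """created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))
# Assumption: dates are month/day/year, as in the sample data.
df["created_at"] = pd.to_datetime(df["created_at"], format="%m/%d/%Y")
df["tweet"] = df["tweet"].astype(str)

daily = (
    # Daily buckets on created_at, further split by category.
    df.groupby([pd.Grouper(key="created_at", freq="1D"), "category"])["tweet"]
      # dict.fromkeys drops repeated tweets but keeps first-seen order.
      .agg(lambda s: " ".join(dict.fromkeys(s)))
      .reset_index()
      .reindex(columns=["created_at", "tweet", "category"])
)
print(daily)
```

Note that created_at is a datetime column here, so it is written to CSV in ISO form (2021-07-29) unless you format it back yourself.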