How to calculate the Pearson correlation between two pandas timeline vectors
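I have a database of user posts from a social network. Using a pandas DataFrame, I computed the number of posts per month for each user, which gives a two-column table per user: the month and the number of posts. I want to compute the correlation of the monthly counts between different users, given that the monthly timelines of any two users differ (only some months overlap).
Here is the code that creates the monthly table (agg):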
# Create an empty dataframe
df = pd.DataFrame()
# Create a column from the datetime variable
df['datetime'] = date_list
# Convert that column into a datetime datatype
df['datetime'] = pd.to_datetime(df['datetime'])
# Add the post counts as the score column
df['score'] = count
# Set the datetime column as the index
df.index = df['datetime']
# this is the table containing posts count for each month
agg = df['score'].resample('M').sum().to_frame()
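So basically I need to apply a correlation function to two "agg" variables, but I can't find an intuitive way to do it.
Here are two examples of agg variables belonging to two different users (first column: month, second column: number of posts).
User A: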
2018-04-30 39
2018-05-31 41
2018-06-30 19
2018-07-31 46
2018-08-31 61
2018-09-30 57
2018-10-31 33
2018-11-30 18
User B:
2017-11-30 0
2017-12-31 3
2018-01-31 0
2018-02-28 0
2018-03-31 22
2018-04-30 3
2018-05-31 11
Here is a solution for calculating the Pearson correlation:
import pandas as pd
data = """
datetime score
2018-04-30 39
2018-05-31 41
2018-06-30 19
2018-07-31 46
2018-08-31 61
2018-09-30 57
2018-10-31 33
2018-11-30 18
"""
datb = """
datetime score
2017-11-30 0
2017-12-31 3
2018-01-31 0
2018-02-28 0
2018-03-31 22
2018-04-30 3
2018-05-31 11
"""
# pd.compat.StringIO is not available in recent pandas versions; use io.StringIO
from io import StringIO
dfa = pd.read_csv(StringIO(data), sep=r'\s+')
dfb = pd.read_csv(StringIO(datb), sep=r'\s+')
dfa['datetime'] = pd.to_datetime(dfa['datetime'])
dfb['datetime'] = pd.to_datetime(dfb['datetime'])
dfa.index = dfa['datetime']
dfb.index = dfb['datetime']
agga = dfa['score'].resample('M').sum().to_frame()
aggb = dfb['score'].resample('M').sum().to_frame()
print(agga,aggb)
# Intersection of the two DataFrames on datetime (inner join)
inter = agga.merge(aggb, on='datetime')
print(inter)
result = inter['score_x'].corr(inter['score_y'])
print(result)
dfa
score
datetime
2018-04-30 39
2018-05-31 41
2018-06-30 19
2018-07-31 46
2018-08-31 61
2018-09-30 57
2018-10-31 33
2018-11-30 18
dfb
score
datetime
2017-11-30 0
2017-12-31 3
2018-01-31 0
2018-02-28 0
2018-03-31 22
2018-04-30 3
2018-05-31 11
inter
score_x score_y
datetime
2018-04-30 39 3
2018-05-31 41 11
result
0.9999999999999999
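Note that only two months overlap here, and the Pearson correlation of any two points with different values is always exactly +1 or -1, so this result says very little. As a minimal sketch (the series names and the threshold of 3 are assumptions for illustration, not part of the original answer), Series.corr can be called directly: it aligns both series on their index and uses only the shared dates, and its min_periods argument returns NaN when the overlap is too small.

```python
import pandas as pd

# Monthly counts from the example above, indexed by month end.
a = pd.Series(
    [39, 41, 19, 46, 61, 57, 33, 18],
    index=pd.to_datetime([
        "2018-04-30", "2018-05-31", "2018-06-30", "2018-07-31",
        "2018-08-31", "2018-09-30", "2018-10-31", "2018-11-30",
    ]),
    name="user_a",
)
b = pd.Series(
    [0, 3, 0, 0, 22, 3, 11],
    index=pd.to_datetime([
        "2017-11-30", "2017-12-31", "2018-01-31", "2018-02-28",
        "2018-03-31", "2018-04-30", "2018-05-31",
    ]),
    name="user_b",
)

# Months present in both series.
overlap = a.index.intersection(b.index)

# Series.corr aligns on the index, so no explicit merge is needed.
r = a.corr(b)

# Require at least 3 shared months (an arbitrary threshold chosen here);
# with fewer, pandas returns NaN instead of a misleading value.
r_guarded = a.corr(b, min_periods=3)

print(len(overlap), r, r_guarded)
```

Which threshold is reasonable depends on the data; the point is only that very short overlaps should not be trusted.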
If you want to use the union instead:
union = pd.merge(agga, aggb, on='datetime', how='outer').fillna(0)
Output of the union:
score_x score_y
datetime
2018-04-30 39.0 3.0
2018-05-31 41.0 11.0
2018-06-30 19.0 0.0
2018-07-31 46.0 0.0
2018-08-31 61.0 0.0
2018-09-30 57.0 0.0
2018-10-31 33.0 0.0
2018-11-30 18.0 0.0
2017-11-30 0.0 0.0
2017-12-31 0.0 3.0
2018-01-31 0.0 0.0
2018-02-28 0.0 0.0
2018-03-31 0.0 22.0
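To get a correlation from the union, the same corr call can be applied to the zero-filled columns. The sketch below rebuilds the example data and uses pd.concat instead of merge (an equivalent outer join on the index, chosen here for brevity). Treating a month with no record as zero posts is an assumption about the data, and it changes the answer: for this example the union-based correlation comes out negative, while the intersection gave +1.

```python
import pandas as pd

# Same monthly counts as in the example above.
a = pd.Series(
    [39, 41, 19, 46, 61, 57, 33, 18],
    index=pd.to_datetime([
        "2018-04-30", "2018-05-31", "2018-06-30", "2018-07-31",
        "2018-08-31", "2018-09-30", "2018-10-31", "2018-11-30",
    ]),
    name="user_a",
)
b = pd.Series(
    [0, 3, 0, 0, 22, 3, 11],
    index=pd.to_datetime([
        "2017-11-30", "2017-12-31", "2018-01-31", "2018-02-28",
        "2018-03-31", "2018-04-30", "2018-05-31",
    ]),
    name="user_b",
)

# Outer join on the date index; months missing for one user become 0.
# Assumption: "no record for that month" means "zero posts".
union = pd.concat([a, b], axis=1).fillna(0)

# Pearson correlation over all 13 months of the union.
r_union = union["user_a"].corr(union["user_b"])

print(union.shape, r_union)
```

If a missing month really means the user was not yet registered (or had already left), the intersection is probably the safer choice.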
A good link for understanding merge.
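Since the question asks about many users, one possible extension (a sketch, not part of the original answer) is to put each user's monthly series in its own column and call DataFrame.corr, which computes all pairwise correlations and skips months where either user has no value. The column names and the min_periods values are illustrative assumptions.

```python
import pandas as pd

# Same monthly counts as in the example above.
a = pd.Series(
    [39, 41, 19, 46, 61, 57, 33, 18],
    index=pd.to_datetime([
        "2018-04-30", "2018-05-31", "2018-06-30", "2018-07-31",
        "2018-08-31", "2018-09-30", "2018-10-31", "2018-11-30",
    ]),
)
b = pd.Series(
    [0, 3, 0, 0, 22, 3, 11],
    index=pd.to_datetime([
        "2017-11-30", "2017-12-31", "2018-01-31", "2018-02-28",
        "2018-03-31", "2018-04-30", "2018-05-31",
    ]),
)

# One column per user; the index is the union of all months and
# months without data for a user are left as NaN (not filled with 0).
counts = pd.DataFrame({"user_a": a, "user_b": b})

# Pairwise Pearson correlation over the months each pair shares.
pairwise = counts.corr(min_periods=2)

# A stricter threshold turns pairs with too little overlap into NaN.
strict = counts.corr(min_periods=3)

print(pairwise)
print(strict)
```

More users can be added as further columns of the same DataFrame; the result is a users-by-users correlation matrix.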