How to count daily word frequency from Twitter?
I have a Twitter data frame like this:
>>> twitdata = pd.read_csv(r'D:\twit-data.csv')
>>> twitdata
tweet_id user_id user_name t_date t_time tweets
4.05323E+17 82142636 1nvestor 11/26/2013 8:12:00 Fidelity reports that $TSN stock gets called away. Position now closed.
2.53585E+17 22042454 Kiplinger 10/3/2012 15:57:00 Did you know that every 0 bump in avg. home prices lifts consumer spending by ? http://t.co/zXRbWJzR
...
I want to count the daily frequency of a specific word, for example iphone, and get a result of its daily frequency like this:
date frequency
2011-01-01 530
2011-01-02 550
...
How can I design a program to do this?
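I created a data frame from random data, but it should give you an idea of how to get started. I set the count to calendar days with D; you can change the offset as needed.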
import pandas as pd
import io # only needed to import sample data
data = """
date tweet_id tweet
2015-10-31 50230 tweet_1
2015-10-31 48646 tweet_2
2015-10-31 48748 tweet_3
2015-10-31 46992 tweet_4
2015-11-01 46491 tweet_5
2015-11-01 45347 tweet_6
2015-11-01 45681 tweet_7
2015-11-01 46430 tweet_8
"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+',
                 index_col=False, parse_dates=['date'])
# Tweet count starts here
# 'D' is the calendar-day offset; call .count() on the resampler
# (current pandas no longer accepts the how= argument)
df_count = df.set_index('date').resample('D').count()
# Keep a single column of counts and name it
df_count = df_count.drop(df_count.columns[1:], axis=1)
df_count.columns = ['count']
print(df)
Just to check what your original df looks like:
date tweet_id tweet
0 2015-10-31 50230 tweet_1
1 2015-10-31 48646 tweet_2
2 2015-10-31 48748 tweet_3
3 2015-10-31 46992 tweet_4
4 2015-11-01 46491 tweet_5
5 2015-11-01 45347 tweet_6
6 2015-11-01 45681 tweet_7
7 2015-11-01 46430 tweet_8
After we apply resample:
print(df_count)
count
date
2015-10-31 4
2015-11-01 4
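The resample above counts tweets per day, not occurrences of a word. To get the daily frequency of one word such as iphone, one option is to count matches in each tweet first and then sum those counts per day. This is only a sketch: the sample rows and the helper name daily_word_frequency are made up for illustration, and it assumes t_date is in month/day/year form as in the question. It matches whole words, case-insensitively, so "iphones" is not counted.

```python
import re
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
tweets = pd.DataFrame({
    "t_date": ["11/26/2013", "11/26/2013", "11/27/2013", "11/29/2013"],
    "tweets": [
        "New iPhone arrived, love this iphone",
        "Nothing to see here",
        "Is the iphone worth it?",
        "iphones everywhere",
    ],
})

def daily_word_frequency(frame, word, freq="D"):
    """Sum whole-word, case-insensitive matches of `word` per period."""
    # Assumes month/day/year dates, as shown in the question's sample.
    dates = pd.to_datetime(frame["t_date"], format="%m/%d/%Y")
    pattern = r"\b" + re.escape(word) + r"\b"
    # Number of matches inside each individual tweet
    per_tweet = frame["tweets"].str.count(pattern, flags=re.IGNORECASE)
    per_tweet.index = pd.DatetimeIndex(dates, name="date")
    # Days without any tweet show up with a frequency of 0
    return per_tweet.resample(freq).sum().rename("frequency")

print(daily_word_frequency(tweets, "iphone"))
```

Passing freq="W" instead of the default "D" gives weekly totals, which is what changing the offset amounts to.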
I have solved this problem myself; here is my solution.
import operator

# One row per distinct date; only its index (the dates) is used below
result = twitdata.groupby('t_date').first()
allFreq = {}
for date in range(0, result.shape[0]):
    # Tweets posted on this date
    df = twitdata[twitdata.t_date == result.index[date]].loc[:, ['t_date', 'tweets']]
    # Join the day's tweets into one string
    A = ''
    for i in df.index:
        A = A + ' ' + df.loc[i, 'tweets']
    # Write the joined text to a file and read it back
    with open("A.txt", "w+") as text_file:
        text_file.write("%s" % A)
    with open('A.txt') as f:
        words = f.read()
    # Count each word, treating commas as separators
    wordfreq = {}
    for word in words.replace(',', ' ').split():
        wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    # Sort by count, most frequent first
    sorted_x = sorted(wordfreq.items(), key=operator.itemgetter(1), reverse=True)
    allFreq[result.index[date]] = sorted_x
>>> allFreq['2012-06-01']
[('the', 248),
('to', 201),
('of', 143),
('a', 137),
('in', 127),
('and', 107),
('for', 100),
('you', 95),
('is', 93),
('I', 81),
...]
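For comparison, the same per-day word counts can be built without the temporary file by using collections.Counter from the standard library. This is a sketch under assumptions rather than a drop-in replacement: the sample rows and the function name word_counts_by_day are invented for illustration, and words are split the same way as above (commas replaced by spaces, then split on whitespace).

```python
from collections import Counter
import pandas as pd

# Hypothetical sample rows using the question's column names.
sample = pd.DataFrame({
    "t_date": ["6/1/2012", "6/1/2012", "6/2/2012"],
    "tweets": ["the iphone is out", "I want the iphone, now", "back to work"],
})

def word_counts_by_day(frame):
    """Map each t_date value to a Counter of the words tweeted that day."""
    counts = {}
    for day, texts in frame.groupby("t_date")["tweets"]:
        # Same tokenization as above: commas become spaces, split on whitespace
        words = " ".join(texts.astype(str)).replace(",", " ").split()
        counts[day] = Counter(words)
    return counts

all_freq = word_counts_by_day(sample)
print(all_freq["6/1/2012"].most_common(3))  # most frequent words that day
print(all_freq["6/2/2012"]["iphone"])       # a missing word counts as 0
```

Looking up a single word such as iphone in each day's Counter gives the daily frequency the question asks for, and most_common(n) reproduces the sorted list.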