如何使用 tweepy 获取标签中的所有推文?
how to take all tweets in a hashtag with tweepy?
我正在尝试将每个打开的推文都放入一个标签中,但我的代码不会超过 299 条推文。
我也尝试从特定时间线获取推文,例如仅在 2015 年 5 月和 2016 年 7 月的推文。有什么方法可以在主流程中完成,还是我应该为此编写一些代码?
这是我的代码:
# if this is the first time, creates a new array which
# will store max id of the tweets for each keyword
if not os.path.isfile("max_ids.npy"):
max_ids = np.empty(len(keywords))
# every value is initialized as -1 in order to start from the beginning the first time program run
max_ids.fill(-1)
else:
max_ids = np.load("max_ids.npy") # loads the previous max ids
# if there is any new keywords added, extends the max_ids array in order to correspond every keyword
if len(keywords) > len(max_ids):
new_indexes = np.empty(len(keywords) - len(max_ids))
new_indexes.fill(-1)
max_ids = np.append(arr=max_ids, values=new_indexes)
count = 0
for i in range(len(keywords)):
since_date="2015-01-01"
sinceId = None
tweetCount = 0
maxTweets = 5000000000000000000000 # maximum tweets to find per keyword
tweetsPerQry = 100
searchQuery = "#{0}".format(keywords[i])
while tweetCount < maxTweets:
if max_ids[i] < 0:
if (not sinceId):
new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
else:
new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
since_id=sinceId)
else:
if (not sinceId):
new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
max_id=str(max_ids - 1))
else:
new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
max_id=str(max_ids - 1),
since_id=sinceId)
if not new_tweets:
print("Keyword: {0} No more tweets found".format(searchQuery))
break
for tweet in new_tweets:
count += 1
print(count)
file_write.write(
.
.
.
)
item = {
.
.
.
.
.
}
# instead of using mongo's id for _id, using tweet's id
raw_data = tweet._json
raw_data["_id"] = tweet.id
raw_data.pop("id", None)
try:
db["Tweets"].insert_one(item)
except pymongo.errors.DuplicateKeyError as e:
print("Already exists in 'Tweets' collection.")
try:
db["RawTweets"].insert_one(raw_data)
except pymongo.errors.DuplicateKeyError as e:
print("Already exists in 'RawTweets' collection.")
tweetCount += len(new_tweets)
print("Downloaded {0} tweets".format(tweetCount))
max_ids[i] = new_tweets[-1].id
np.save(arr=max_ids, file="max_ids.npy") # saving in order to continue mining from where left next time program run
查看 twitter api 文档,可能它只允许解析 300 条推文。
我建议忘记 api,通过流媒体请求来实现。 api 是有限制的请求的实现。
看看这个:https://tweepy.readthedocs.io/en/v3.5.0/cursor_tutorial.html
然后试试这个:
import tweepy
auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_SECRET)
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search, q='#python', rpp=100).items():
# Do something
pass
在您的情况下,您可以获取最大数量的推文,因此根据链接的教程,您可以执行以下操作:
import tweepy
MAX_TWEETS = 5000000000000000000000
auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_SECRET)
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search, q='#python', rpp=100).items(MAX_TWEETS):
# Do something
pass
如果您想要在给定 ID 之后发送推文,您也可以传递该参数。
抱歉,我无法在评论中回答,太长了。 :)
当然 :) 检查这个例子:
高级搜索 #data 关键字 2015 年 5 月 - 2016 年 7 月
得到这个 url:https://twitter.com/search?l=&q=%23data%20since%3A2015-05-01%20until%3A2016-07-31&src=typd
session = requests.session()
keyword = 'data'
date1 = '2015-05-01'
date2 = '2016-07-31'
session.get('https://twitter.com/search?l=&q=%23+keyword+%20since%3A+date1+%20until%3A+date2&src=typd', streaming = True)
现在我们有所有请求的推文,
可能你可能会遇到 'pagination' 的问题
分页 url ->
也许你可以放一个随机的推文 ID,或者你可以先解析,或者从推特请求一些数据。可以的。
使用 Chrome 的网络选项卡查找所有请求的信息:)
这段代码对我有用。
import tweepy
import pandas as pd
import os
#Twitter Access
auth = tweepy.OAuthHandler( 'xxx','xxx')
auth.set_access_token('xxx-xxx','xxx')
api = tweepy.API(auth,wait_on_rate_limit = True)
df = pd.DataFrame(columns=['text', 'source', 'url'])
msgs = []
msg =[]
for tweet in tweepy.Cursor(api.search, q='#bmw', rpp=100).items(10):
msg = [tweet.text, tweet.source, tweet.source_url]
msg = tuple(msg)
msgs.append(msg)
df = pd.DataFrame(msgs)
我正在尝试将每个打开的推文都放入一个标签中,但我的代码不会超过 299 条推文。
我也尝试从特定时间线获取推文,例如仅在 2015 年 5 月和 2016 年 7 月的推文。有什么方法可以在主流程中完成,还是我应该为此编写一些代码?
这是我的代码:
# if this is the first time, creates a new array which
# will store max id of the tweets for each keyword
if not os.path.isfile("max_ids.npy"):
max_ids = np.empty(len(keywords))
# every value is initialized as -1 in order to start from the beginning the first time program run
max_ids.fill(-1)
else:
max_ids = np.load("max_ids.npy") # loads the previous max ids
# if there is any new keywords added, extends the max_ids array in order to correspond every keyword
if len(keywords) > len(max_ids):
new_indexes = np.empty(len(keywords) - len(max_ids))
new_indexes.fill(-1)
max_ids = np.append(arr=max_ids, values=new_indexes)
count = 0
for i in range(len(keywords)):
since_date="2015-01-01"
sinceId = None
tweetCount = 0
maxTweets = 5000000000000000000000 # maximum tweets to find per keyword
tweetsPerQry = 100
searchQuery = "#{0}".format(keywords[i])
while tweetCount < maxTweets:
if max_ids[i] < 0:
if (not sinceId):
new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
else:
new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
since_id=sinceId)
else:
if (not sinceId):
new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
max_id=str(max_ids - 1))
else:
new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
max_id=str(max_ids - 1),
since_id=sinceId)
if not new_tweets:
print("Keyword: {0} No more tweets found".format(searchQuery))
break
for tweet in new_tweets:
count += 1
print(count)
file_write.write(
.
.
.
)
item = {
.
.
.
.
.
}
# instead of using mongo's id for _id, using tweet's id
raw_data = tweet._json
raw_data["_id"] = tweet.id
raw_data.pop("id", None)
try:
db["Tweets"].insert_one(item)
except pymongo.errors.DuplicateKeyError as e:
print("Already exists in 'Tweets' collection.")
try:
db["RawTweets"].insert_one(raw_data)
except pymongo.errors.DuplicateKeyError as e:
print("Already exists in 'RawTweets' collection.")
tweetCount += len(new_tweets)
print("Downloaded {0} tweets".format(tweetCount))
max_ids[i] = new_tweets[-1].id
np.save(arr=max_ids, file="max_ids.npy") # saving in order to continue mining from where left next time program run
查看 twitter api 文档,可能它只允许解析 300 条推文。 我建议忘记 api,通过流媒体请求来实现。 api 是有限制的请求的实现。
看看这个:https://tweepy.readthedocs.io/en/v3.5.0/cursor_tutorial.html
然后试试这个:
import tweepy
auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_SECRET)
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search, q='#python', rpp=100).items():
# Do something
pass
在您的情况下,您可以获取最大数量的推文,因此根据链接的教程,您可以执行以下操作:
import tweepy
MAX_TWEETS = 5000000000000000000000
auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_SECRET)
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search, q='#python', rpp=100).items(MAX_TWEETS):
# Do something
pass
如果您想要在给定 ID 之后发送推文,您也可以传递该参数。
抱歉,我无法在评论中回答,太长了。 :)
当然 :) 检查这个例子: 高级搜索 #data 关键字 2015 年 5 月 - 2016 年 7 月 得到这个 url:https://twitter.com/search?l=&q=%23data%20since%3A2015-05-01%20until%3A2016-07-31&src=typd
session = requests.session()
keyword = 'data'
date1 = '2015-05-01'
date2 = '2016-07-31'
session.get('https://twitter.com/search?l=&q=%23+keyword+%20since%3A+date1+%20until%3A+date2&src=typd', streaming = True)
现在我们有所有请求的推文, 可能你可能会遇到 'pagination' 的问题 分页 url ->
也许你可以放一个随机的推文 ID,或者你可以先解析,或者从推特请求一些数据。可以的。
使用 Chrome 的网络选项卡查找所有请求的信息:)
这段代码对我有用。
import tweepy
import pandas as pd
import os
#Twitter Access
auth = tweepy.OAuthHandler( 'xxx','xxx')
auth.set_access_token('xxx-xxx','xxx')
api = tweepy.API(auth,wait_on_rate_limit = True)
df = pd.DataFrame(columns=['text', 'source', 'url'])
msgs = []
msg =[]
for tweet in tweepy.Cursor(api.search, q='#bmw', rpp=100).items(10):
msg = [tweet.text, tweet.source, tweet.source_url]
msg = tuple(msg)
msgs.append(msg)
df = pd.DataFrame(msgs)