Scrape tweets by tweet location and user location

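I am trying to use tweepy to download tweets by tweet location rather than user location. Currently I can download tweets along with the user's location, but I cannot get the tweet location, even when geo_enabled returns True.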

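For example, suppose user_a is from New York but tweets from California. I want both the user location (New York) and the tweet location (California).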

Code:

import tweepy
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import pandas as pd
import json
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf8')

ckey = 'key'
csecret = 'secret'
atoken = 'token'
asecret = 'secret'
#csvfile = open('StreamSearch.csv','a')
#csvwriter = csv.writer(csvfile, delimiter = ',')

class StdOutListener(StreamListener):
    def __init__(self, api=None):
        super(StdOutListener, self).__init__()
        self.num_tweets = 0

    def on_data(self, data):
        self.num_tweets += 1
        if self.num_tweets < 5: #Remove the limit of no. of tweets to 5
            print data
            return True
        else:
            return False

    def on_error(self, status):
        print status


l = StdOutListener()
auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
stream = Stream(auth, l)
stream.filter(locations = [80.10,12.90,80.33,13.24] ) #user location 

Output:

userLocation, userTimezone, Coordinates,GeoEnabled, Language, TweetPlace
London,UK      Amsterdam                  FALSE      en         null
Aachen,Germany  Berlin                    TRUE       de         null
Kewaunee Wi                               TRUE       en         null
Connecticut, Eastern Time (US & Canada)   TRUE       en         null
                                          TRUE       en         null
Lahore, City of Gardens London            TRUE       en         null
NAU class of 2018.  Arizona               FALSE      en         null
                                          FALSE      en         null
    Pacific Time (US & Canada)            FALSE      en         null

The output above is a cleaned-up version of a much larger dataset. Even when geolocation is enabled, I cannot get the tweet location or the coordinates.

  1. Why don't tweets with geo_enabled == True provide a tweet location?

According to this, if place or coordinates is None, it means the user did not grant permission for that tweet. Users with geo_enabled turned on still have to give explicit permission for their exact location to be displayed. Also, the documentation states:

geo_enabled: When true, indicates that the user has enabled the possibility of geotagging their Tweets. This field must be true for the current user to attach geographic data when using POST statuses/update.
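
To make the distinction concrete, here is a minimal sketch of where each piece of location data sits in a decoded tweet payload. It assumes the standard v1.1 tweet JSON layout (`user.location`, `user.geo_enabled`, `place.full_name`, and a GeoJSON `coordinates` object); the helper name `extract_locations` and the sample dict are made up for illustration.

```python
def extract_locations(tweet):
    """Split a decoded tweet payload into profile location and per-tweet geodata."""
    user = tweet.get("user") or {}
    place = tweet.get("place") or {}
    coords = tweet.get("coordinates") or {}
    point = coords.get("coordinates")  # GeoJSON order: [longitude, latitude]
    return {
        "user_location": user.get("location"),          # free text from the profile
        "geo_enabled": bool(user.get("geo_enabled")),   # account-level setting only
        "tweet_place": place.get("full_name"),          # None unless the tweet was tagged
        "tweet_lon_lat": tuple(point) if point else None,
    }


# Hypothetical payload: geo_enabled is True, yet this tweet carries no geodata.
sample = {
    "user": {"location": "New York", "geo_enabled": True},
    "place": None,
    "coordinates": None,
}
print(extract_locations(sample))
```

The profile location and geo_enabled come from the user object, while place and coordinates belong to the individual tweet, which is why the first two can be filled in while the last two are null.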

  2. How do I filter by tweet location? Check here

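If you filter by location, only tweets that fall within the requested bounding box are included; the user's location field is not used to filter tweets. If both coordinates and place are empty, the tweet will not pass the filter.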

# filter all tweets from San Francisco
# note: the keyword argument is `locations` (plural)
myStream.filter(locations=[-122.75, 36.8, -121.75, 37.8])

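
If you want to double-check a geotagged tweet on the client side, the sketch below tests a point against the same kind of flat list. It assumes each box is given as longitude/latitude pairs with the south-west corner first, i.e. [west, south, east, north]; the function name `in_bounding_boxes` is hypothetical, and it only looks at exact point coordinates, so it is not a reproduction of the matching Twitter does on its side.

```python
def in_bounding_boxes(lon, lat, locations):
    """Return True if (lon, lat) lies inside any box of a flat locations list."""
    if len(locations) % 4 != 0:
        raise ValueError("locations must contain groups of 4 numbers")
    # Walk the list one box (4 numbers) at a time.
    for i in range(0, len(locations), 4):
        west, south, east, north = locations[i:i + 4]
        if west <= lon <= east and south <= lat <= north:
            return True
    return False


SAN_FRANCISCO = [-122.75, 36.8, -121.75, 37.8]

print(in_bounding_boxes(-122.42, 37.77, SAN_FRANCISCO))  # a point in San Francisco
print(in_bounding_boxes(-74.0, 40.7, SAN_FRANCISCO))     # a point in New York
```
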
  3. How do I filter by both user location and tweet location?

You can capture the tweets that pass the filter and then check the author's location against the area you are interested in.

import json

class StdOutListener(StreamListener):
    def __init__(self, api=None):
        super(StdOutListener, self).__init__()
        self.num_tweets = 0

    def on_data(self, data):
        # on_data receives the raw JSON string, so decode it first
        tweet = json.loads(data)
        # first check the location is not None
        location = (tweet.get('user') or {}).get('location')
        if location and 'New York' in location:
            self.num_tweets += 1
            print(data)
        # stop after 5 matching tweets; remove this check to lift the limit
        if self.num_tweets < 5:
            return True
        else:
            return False

    def on_error(self, status):
        print(status)

  4. How do I avoid being limited to the Twitter API filters?

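Remember that the filter lets a tweet through as long as it matches any one of the parameters, so if you need something more restrictive, just add conditional clauses inside def on_data(self, data), as I did in (3) for the author's location.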
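
As a rough sketch of that idea, the conditions can be collected in one predicate that on_data calls before keeping a tweet. The function name `passes_strict_filter` and its default arguments are hypothetical; the fields it reads (`user.location`, `lang`, `coordinates`, `place`) are assumed to follow the standard v1.1 tweet JSON.

```python
import json


def passes_strict_filter(raw, city="New York", lang="en"):
    """Return True only if every condition holds (logical AND, unlike the API's OR)."""
    tweet = json.loads(raw)
    user = tweet.get("user") or {}
    location = user.get("location") or ""
    # The tweet itself must carry geodata, either exact coordinates or a place.
    has_geotag = (tweet.get("coordinates") is not None
                  or tweet.get("place") is not None)
    return city in location and tweet.get("lang") == lang and has_geotag


# Hypothetical raw message, shaped like what on_data receives.
raw = json.dumps({
    "lang": "en",
    "user": {"location": "New York, NY"},
    "place": {"full_name": "San Francisco, CA"},
    "coordinates": None,
})
print(passes_strict_filter(raw))
```

Inside on_data you would then write something like `if passes_strict_filter(data): ...` and only count or print the tweet in that branch.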