Encoding error with TwitterSearch in Python

Question

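I want to use TwitterSearch to import tweets into a CSV file. However, the script does not capture special characters (for example, accented letters in French). I have tried several things, such as adding .encode('utf-8'), without success.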

If I try to write:

tweet_text = tweet['text'].strip().encode('utf-8', 'ignore')

then I get:

Traceback (most recent call last):
  File "/Users/usr/Documents/Python/twitter_search2.py", line 56, in <module>
    get_tweets(query, max_tweets)
  File "/Users/usr/Documents/Python/twitter_search2.py", line 44, in get_tweets
    print('@%s: %s' % (user, tweet_text))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 32: ordinal not in range(128)

Does anyone have an idea?

I'm on Python 2.7. The code is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from TwitterSearch import *
import csv


def get_tweets(query, max = 10):

    i = 0
    search = query

    with open(search+'.csv', 'wb') as outf:
        writer = csv.writer(outf)
        writer.writerow(['user','time','tweet','latitude','longitude'])
        try:
            tso = TwitterSearchOrder()
            tso.set_keywords([search])
            tso.set_include_entities(True)

            # tso.set_language('fr')

            ts = TwitterSearch(
                consumer_key = 'YOUR CONSUMER KEY',
                consumer_secret = 'YOUR CONSUMER SECRET',
                access_token = 'YOUR ACCESS TOKEN',
                access_token_secret = 'YOUR ACCESS TOKEN SECRET'
            )

            for tweet in ts.search_tweets_iterable(tso):
                lat = None
                long = None
                time = tweet['created_at']
                user = tweet['user']['screen_name']
                tweet_text = tweet['text'].strip().encode('ascii', 'ignore')
                tweet_text = ''.join(tweet_text.splitlines())
                print i,time,
                if tweet['geo'] != None and tweet['geo']['coordinates'][0] != 0.0: # avoiding bad values
                    lat = tweet['geo']['coordinates'][0]
                    long = tweet['geo']['coordinates'][1]
                    print('@%s: %s' % (user, tweet_text)), lat, long
                else:
                    print('@%s: %s' % (user, tweet_text))

                writer.writerow([user, time, tweet_text, lat, long])
                i += 1
                if i > max:
                    return()

        except TwitterSearchException as e:
            print(e)


query = raw_input ("Recherche : ")
max_tweets = 10
get_tweets(query, max_tweets)

Thank you very much for your help!

Answer 1

You are interpolating the encoded tweet together with the username:

print('@%s: %s' % (user, tweet_text))

If the user object is a Unicode string, this fails:

>>> user = u'Héllo'
>>> tweet_text = u'Héllo'.encode('utf8')
>>> '@%s: %s' % (user, tweet_text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

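because you are mixing types. Python 2 implicitly decodes the tweet_text byte string with the default ASCII codec to make it a unicode object again, and that decode fails on the first non-ASCII byte (0xc3 here).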
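The implicit step described above can be made visible by performing the same decode by hand. This is only a sketch: implicit_decode is a hypothetical helper that imitates what Python 2 does behind the scenes, and the snippet is written so that it also runs on Python 3.

```python
user = u'H\xe9llo'                          # unicode text, like tweet['user']['screen_name']
tweet_bytes = u'H\xe9llo'.encode('utf-8')   # UTF-8 byte string, like the encoded tweet_text


def implicit_decode(data):
    """Imitate the ASCII decode Python 2 applies when bytes meet unicode."""
    return data.decode('ascii')


try:
    implicit_decode(tweet_bytes)
    error = None
except UnicodeDecodeError as exc:
    # Same failure as in the traceback: byte 0xc3 is outside the ASCII range.
    error = exc

# Decoding explicitly with the codec the bytes were encoded with works,
# and both operands of % are then unicode.
combined = u'@%s: %s' % (user, tweet_bytes.decode('utf-8'))
```

The explicit decode is shown only to illustrate the mechanism; in the script it is simpler not to encode tweet_text before formatting in the first place.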

Stick to one type: either encode everything, or keep everything as Unicode and encode only at the last possible moment.

You have to encode the user value for the CSV file anyway, so leave encoding the tweet until that point as well:

# Keep the tweet as a unicode object while processing it.
tweet_text = tweet['text'].strip()
tweet_text = u''.join(tweet_text.splitlines())
print i, time,
if tweet['geo'] and tweet['geo']['coordinates'][0]:  # skip missing or zero coordinates
    lat, long = tweet['geo']['coordinates'][:2]
    print u'@%s: %s' % (user, tweet_text), lat, long
else:
    print u'@%s: %s' % (user, tweet_text)

# Encode to UTF-8 bytes only when writing to the CSV file.
writer.writerow([user.encode('utf8'), time.encode('utf8'),
                 tweet_text.encode('utf8'), lat, long])
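If more columns are added later, the per-field .encode('utf8') calls can be collected in one place. The following is a minimal sketch under that assumption: encode_row is a hypothetical helper (it is not part of TwitterSearch or the csv module), and the row values are made-up sample data.

```python
TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3


def encode_row(row, encoding='utf-8'):
    """Return a copy of row with every text value encoded to bytes.

    Non-text values (None, floats, ...) are passed through unchanged,
    so lat and long can stay as they are.
    """
    return [value.encode(encoding) if isinstance(value, TEXT_TYPE) else value
            for value in row]


# Sample data standing in for [user, time, tweet_text, lat, long].
row = [u'H\xe9llo', u'sample timestamp', u'caf\xe9 cr\xe8me', None, 2.35]
encoded = encode_row(row)
```

In the loop this would be used as writer.writerow(encode_row([user, time, tweet_text, lat, long])). The helper only makes sense for the Python 2 csv module, which expects byte strings; on Python 3 the csv module accepts text directly and the file is opened in text mode with encoding='utf-8' instead.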
