Tweepy: Always display emoji in "\ud83d\ude4c" format from tweet text

Question

My Problem

When streaming data using tweepy, I receive the expected result of

Tweet Contents: RT @ChickSoPretty: Zendaya tho \ud83d\ude4c https:....

when using the code

def on_data(self, data):
    username = data.split(',"screen_name":"')[1].split('","location"')[0]
    tweet = data.split(',"text":"')[1].split('","source')[0]
    print("Tweet Contents: " + tweet)

--- I am currently tracking u'\U0001f64c', the code for an emoji. ---

However, when I try to output the rest of the user's recent tweets...

for status in tweepy.Cursor(api.user_timeline, id=username).items(20):
    tweet = status.text
    print("Tweet Contents: " + tweet)

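where 'username' is the user who most recently used the emoji, my program crashes.

This is understandable, since I am now trying to print the emoji character itself to the console, rather than what I was doing originally, which was displaying the JavaScript escape code, \ud83d\ude4c.

My question is: how do I read the user's statuses and output their tweets in the first format?

Purpose of My Code

My long-term goal is to iterate through a user's statuses and check how many emoji they have used in their 20 most recent tweets (including retweets and replies).

I have "successfully created" some messy code that detects emoji in a tweet when they are displayed in JavaScript/Java escape format, as follows...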

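iteration = -1
numberOfEmojis = 0
emojiCode = ""
tweetLength = len(tweet)
for character in tweet:
    iteration = iteration + 1
    # Only look ahead while the full 12-character escape fits in the tweet
    if iteration + 11 < tweetLength:
        # An escaped surrogate pair looks like \uXXXX\uXXXX
        if tweet[iteration] == '\\' and tweet[iteration + 1] == 'u' and tweet[iteration + 6] == '\\' and tweet[iteration + 7] == 'u':
            for x in range(0, 12):
                emojiCode += tweet[iteration + x]
            numberOfEmojis = numberOfEmojis + 1
            print("Emoji Code Found: " + emojiCode)
            emojiCode = ""
            iteration = iteration + 7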

Wow, what a mess. However, it works for what I need it to do (English-only tweets).

Is there a better way? Should I scrap it and use

tweet.encode('utf-8')

and try to find the emoji in the following output format?

b'@Jathey3 @zachnahra31 this hard\xf0\x9f\x98\x82 we gotta do this https:...'

I am using Python 3.4.2.

Answer 1

Is there a better way?

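Yes: don't try to process JSON-formatted data with low-level, character-by-character string fiddling. The standard library has tools that do this more quickly and more reliably.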

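Searching for a character in its JSON-string-literal-encoded form is tricky because you don't know whether it will be included as \ud83d\ude4c or just as the raw character 🙌 (U+1F64C Person Raising Both Hands In Celebration). Any other, non-emoji character may also be encoded as a \u escape; for example, \u0061\u0061 is aa. There are also rules about what happens with a double backslash or an escaped quote that are hard to handle while searching for characters, and attribute order and whitespace formatting cause plenty of problems when you are trying to find the property you want.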
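The ambiguity is easy to demonstrate. In this minimal sketch (the variable names are made up for illustration), the standard json module decodes an escaped and a raw spelling of the same text to an identical Python string:

```python
import json

# Two JSON string literals for the same text: one using \u escapes
# (a surrogate pair), one containing the raw character.
escaped = '"Zendaya tho \\ud83d\\ude4c"'
raw = '"Zendaya tho \U0001F64C"'

decoded_escaped = json.loads(escaped)
decoded_raw = json.loads(raw)

# Both decode to the same string, with the emoji as one code point.
print(decoded_escaped == decoded_raw)    # True
print(len(decoded_escaped))              # 13 on Python 3.3+

# Ordinary characters can be escaped as well: \u0061\u0061 is just "aa".
print(json.loads('"\\u0061\\u0061"'))    # aa
```

A hand-written scanner would have to treat all of these spellings as equivalent; the decoder already does.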

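Avoid all these pitfalls by using the json module's loads function to decode the JSON string into a Python dictionary containing raw strings that you can inspect directly.

Then, to find characters within a certain range, there are regular expressions, provided by the re module.

Finally, if you want to display the output in JSON format, like \ud83d\ude4c, you can encode that output back to JSON with the json.dumps function.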

# Assuming input like:
json_input = '{"screen_name":"fred","location":"home","text":"Here is an emoji: 🙌 and here is another one 💩"}'

import json, re
emoji_pattern = re.compile('[\U0001F300-\U0001F64F]')

dict_input = json.loads(json_input)
text = dict_input['text']
screen_name = dict_input['screen_name']
emojis = emoji_pattern.findall(text)

print(len(emojis), 'chars found in post by', screen_name)
for emoji in emojis:
    print('Character: ' + json.dumps(emoji))

2 chars found in post by fred
Character: "\ud83d\ude4c"
Character: "\ud83d\udca9"

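(This assumes that only characters in the range U+1F300 to U+1F64F count as real emoji. There are other characters that could be classed as emoji, but that's another can of worms. Plus, future Unicode versions may add more new characters.)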
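Applied to the timeline loop from the question, the same idea might look like the sketch below. The helper name count_emoji and the sample strings are hypothetical, and the emoji range is the same assumption as above. json.dumps (with its default ensure_ascii=True) is used so that only ASCII text reaches the console.

```python
import json
import re

# Same assumed emoji range as in the example above.
emoji_pattern = re.compile('[\U0001F300-\U0001F64F]')

def count_emoji(texts):
    """Return the total number of emoji found across an iterable of strings."""
    total = 0
    for text in texts:
        total += len(emoji_pattern.findall(text))
        # json.dumps escapes non-ASCII characters, so this print is ASCII-only.
        print("Tweet Contents: " + json.dumps(text))
    return total

# Stand-in data (hypothetical). With tweepy, the texts could instead come
# from the question's own loop, e.g.
#   (status.text for status in
#    tweepy.Cursor(api.user_timeline, id=username).items(20))
sample = ["Zendaya tho \U0001F64C", "no emoji here", "\U0001F602\U0001F602"]
print(count_emoji(sample), "emoji found")
```

Because status.text is already a decoded Python string, no JSON parsing is needed at that point; json.dumps only serves to produce the escaped display form.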

(Side note: \U in re won't work properly for users of "narrow" builds of Python before 3.3.)
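For context on why narrow builds struggle here: they store characters above U+FFFF as a UTF-16 surrogate pair, which is also exactly what the \ud83d\ude4c JSON escape spells out. A small worked sketch of that arithmetic (the function name to_surrogates is made up for illustration):

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

high, low = to_surrogates(0x1F64C)
print("\\u%04x\\u%04x" % (high, low))  # \ud83d\ude4c
```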
