将字节对象转换为字符串对象时,我似乎丢失了一些字符

I seem to lose some characters when converting a byte object into a string object

我目前正在开发一个 python 程序,该程序将过滤掉 JSON 文件的 "text" 标记中的一些关键字。我的系统的转换如下:.gz --> 在 rb 模式下使用 gzip 打开 --> 将 b'' 转换为 str --> json.load(str)

def gzworker(fullpath, condition):
    """Worker opens one .gz file"""
    print('Opening {}'.format(fullpath))
    buffer = []
    with gzip.open(fullpath, 'rb') as infile:
        for _line in infile:
            result = filter(json.loads(str(_line).split('|',1)[1][:-5]), condition)
            if result:
                buffer.append(result)
    print('Closing {}'.format(fullpath))
    return buffer

过滤函数将参数设为一个JSON文件

在 运行 多次通过这段代码后,我意识到实际上它不起作用的原因是一些逗号似乎消失了。有谁知道在这个过程中是否有可能丢失一些信息?

我使用之前的方法得到的结果(无效JSON)[如果我使用解码得到相同的结果]

{"created_at":"Thu Apr 17 04:45:03 +0000 2014","id":456654551114735616,"id_str":"456654551114735616","text":"@cam_clay1 come visit us soon plz \ud83d\ude18","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":456654343781892098,"in_reply_to_status_id_str":"456654343781892098","in_reply_to_user_id":427007607,"in_reply_to_user_id_str":"427007607","in_reply_to_screen_name":"cam_clay1","user":{"id":335107310,"id_str":"335107310","name":"Roger Krick","screen_name":"roger_krick","location":"Atlanta GA","url":null,"description":"I pushed Regina George in front of the bus.","protected":false,"followers_count":772,"friends_count":235,"listed_count":3,"created_at":"Thu Jul 14 04:49:29 +0000 2011","favourites_count":7192,"utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"verified":false,"statuses_count":9518,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/335107310\/1352964715","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.75781394,-84.38479358]},"coordinates":{"type":"Point","coordinates":[-84.38479358,33.75781394]},"place":{"id":"8173485c72e78ca5","url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[-84.5464728,33.647845],[-84.5464728,33.8868859],[-84.289385,33.8868859],[-84.289385,33.647845]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"cam_clay1","name":"Cameron Clay","id":427007607,"id_str":"427007607","indices":[0,10]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}

我应该得到什么(有效JSON):

{"created_at":"Thu Apr 17 04:45:03 +0000 2014","id":456654551114735616,"id_str":"456654551114735616","text":"@cam_clay1 come visit us soon plz \ud83d\ude18","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":456654343781892098,"in_reply_to_status_id_str":"456654343781892098","in_reply_to_user_id":427007607,"in_reply_to_user_id_str":"427007607","in_reply_to_screen_name":"cam_clay1","user":{"id":335107310,"id_str":"335107310","name":"Roger Krick","screen_name":"roger_krick","location":"Atlanta GA","url":null,"description":"I pushed Regina George in front of the bus.","protected":false,"followers_count":772,"friends_count":235,"listed_count":3,"created_at":"Thu Jul 14 04:49:29 +0000 2011","favourites_count":7192,"utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"verified":false,"statuses_count":9518,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/378800000021719152\/28971ed1e15e606fb52ef9e7af736e60.jpeg","profile_background_tile":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/453031044393222144\/7vIvMWvk_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/335107310\/1352964715","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":{"type":"Point","coordinates":[33.75781394,-84.38479358]},"coordinates":{"type":"Point","coordinates":[-84.38479358,33.75781394]},"place":{"id":"8173485c72e78ca5","url":"https:\/\/api.twitter.com\/1.1\/geo\/id\/8173485c72e78ca5.json","place_type":"city","name":"Atlanta","full_name":"Atlanta, GA","country_code":"US","country":"United States","contained_within":[],"bounding_box":{"type":"Polygon","coordinates":[[[-84.5464728,33.647845],[-84.5464728,33.8868859],[-84.289385,33.8868859],[-84.289385,33.647845]]]},"attributes":{}},"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"cam_clay1","name":"Cameron Clay","id":427007607,"id_str":"427007607","indices":[0,10]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}

您对字节的解码有误:

str(_line)

将对象转换为 表示,这对调试很有用,但对处理数据没有用:

>>> 'Føo'.encode('utf8')
b'F\xc3\xb8o'
>>> str('Føo'.encode('utf8'))
"b'F\xc3\xb8o'"

注意 b' 前缀、' 后缀和转义序列!

解码 字节对象:

_line.decode('utf8')

我假设因为这是 JSON 数据,它使用的是 UTF-8 编码(JSON 标准声明这是默认设置,唯一允许的其他选项是 UTF -16 和 UTF-32)。

更好的是,使用 io.TextIOWrapper() object 为您处理解码。

接下来,您似乎颠倒了您的条件和数据filter() 先取条件,再取数据序列。

更正后的代码:

def gzworker(fullpath, condition):
    """Worker opens one .gz file"""
    print('Opening {}'.format(fullpath))
    buffer = []
    with gzip.open(fullpath, 'rb') as infile:
        decoded = io.TextIOWrapper(infile, encoding='utf8')
        for line in decoded:
            json_data = line.split('|', 1)[1][:-4]
            result = filter(condition, json.loads(json_data))
            if result:
                buffer.append(result)
    print('Closing {}'.format(fullpath))
    return buffer

我调整了你的切片操作,假设你之前切掉了 str() 调用引入的 ' 字符。