Identifying partial character encoding/compression in text content

I have a CSV (extracted from a BZ2 archive) in which only some of the values appear to be encoded:

hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0

The |, 0 and 1 characters appear exactly as expected, but the other values are clearly encoded. In fact, they look like text-compression substitutions, which could mean the CSV values were compressed individually and the file was then compressed as a whole into BZ2 on top of that.

The result is the same whether I extract the BZ2 with 7zip and open the CSV in a text editor, open it with Python's bz2 module, or read it with Pandas' read_csv:

import bz2

# Decompress with the standard library and decode the bytes
with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

import pandas as pd

# Or let pandas handle the decompression itself
contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")
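
A quick way to test both hypotheses (an exotic character encoding, or a second compression layer inside the fields) is to check whether the decoded text is plain ASCII and whether one of the suspicious fields decompresses on its own. This is only a minimal sketch of such a check, reusing the file name from above:

import bz2
import zlib

with bz2.open("test-balanced.csv.bz2") as f:
    text = f.read().decode("utf-8")

# Pure ASCII means there is no unusual character encoding left to guess (Python 3.7+)
print("plain ASCII:", text.isascii())

# Take one suspicious field from the first row and see if it is itself a compressed stream
field = text.splitlines()[0].split("|")[1].split()[0].encode()
for name, decompress in (("bz2", bz2.decompress), ("zlib", zlib.decompress)):
    try:
        decompress(field)
        print(f"{name}: field decompressed, double compression confirmed")
    except Exception as exc:  # bz2 raises OSError, zlib raises zlib.error
        print(f"{name}: not a {name} stream ({exc})")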

How can I determine which kind of encoding I need in order to decode these values?


Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main

Source file: test-balanced.csv.bz2

First 100 lines of the extracted CSV file: https://pastebin.com/mgW8hKdh

I asked the original author of the CSV/dataset, but they have not responded, which is understandable.

From readme.txt:

File Guide:

  • raw/key.csv: column key for raw/sarc.csv
  • raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json
  • */comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format
  • */*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is
    post_id comment_id … comment_id|response_id … response_id|label … label
    where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response.
    Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces.
    The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.
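
Applying that description to one of the sample rows from the question comes down to two plain string splits. The following is only a small illustration using the sample data; resolving the ids themselves is what comments.json is for:

# One sample row: comment chain | responses | labels
row = "hqa1x|c1xiujs c1xj4e2|1 0"

chain, responses, labels = row.split("|")    # the three '|'-delimited entries
print(chain.split())                         # ['hqa1x']               post_id (+ comment_ids)
print(responses.split())                     # ['c1xiujs', 'c1xj4e2']  response_ids
print(labels.split())                        # ['1', '0']              1 = sarcastic response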

Translating the readme description into a full Python snippet that looks the ids up in comments.json:

import pandas as pd
import json
from pprint import pprint

file_csv = r"D:\bat\SO596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts','responses','labels'],
                       encoding='utf-8')

file_json = r"D:\bat\SO596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)          # {comment_id: data} lookup table

print(f'{chr(0x20)*30} First csv line decoded:')
# first entry: a post_id followed by 0 or more comment_ids, space-delimited
for post_id in data_csv['posts'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} post_id: {post_id}')
    pprint(data_json[post_id])

# second entry: the response_ids, which are also keys into comments.json
for response_id in data_csv['responses'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} response_id: {response_id}')
    pprint(data_json[response_id])

Note that the files were downloaded (manually) from the pol directory, which is of an acceptable size (pol: contains the subset of the main dataset corresponding to comments in /r/politics).

Output of D:\bat\SO596864.py:

                               First csv line decoded:
                               post_id: hqa1x
{'author': 'joshlamb619',
 'created_utc': 1307053256,
 'date': '2011-06',
 'downs': 359,
 'score': 274,
 'subreddit': 'politics',
 'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
         'candidates during recall elections.',
 'ups': 633}
                               response_id: c1xiujs
{'author': 'Artisane',
 'created_utc': 1307077221,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "And we're upset since the Democrats would *never* try something as "
         'sneaky as this, right?',
 'ups': -2}
                               response_id: c1xj4e2
{'author': 'stellarfury',
 'created_utc': 1307080843,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
         "Picture this we were makin' up candidates Being huge election whores",
 'ups': -2}