Identifying partial character encoding/compression in text content

Question

I have a CSV (extracted from a BZ2) in which only some of the values are encoded:

hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0

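The |, 0 and 1 characters definitely appear as intended, but the other values are clearly encoded. In fact, they look like text-compression substitutions, which could mean that the CSV's values were compressed and the file as a whole was then also compressed to BZ2.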

I get the same result whether I extract the BZ2 with 7zip and open the CSV in a text editor, open it with the Python bz2 module, or read it with Pandas and read_csv:

import bz2

with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

import pandas as pd

contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")

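How can I identify which type of encoding I need to decode?

Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main

Source file: test-balanced.csv.bz2

First 100 lines of the extracted CSV: https://pastebin.com/mgW8hKdh

I asked the original authors of the CSV/dataset, but they have not responded, which is understandable.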

Answer 1

From readme.txt:

File Guide:

raw/key.csv: column key for raw/sarc.csv

raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json

*/comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format

*/*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is
post_id comment_id … comment_id|response_id … response_id|label … label
where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response.
Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces.
The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.

Converting the above into a Python code snippet:

import pandas as pd
import json
from pprint import pprint

# Each row has three '|'-delimited entries: comment chain, responses, labels.
file_csv = r"D:\bat\SO596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts','responses','labels'],
                       encoding='utf-8')

# comments.json maps each post, comment and response id to its text and metadata.
file_json = r"D:\bat\SO596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)

print(f'{chr(0x20)*30} First csv line decoded:')
# Look up every id in the comment chain of the first row.
for post_id in data_csv['posts'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} post_id: {post_id}')
    pprint(data_json[post_id])

# Look up every response id of the first row.
for response_id in data_csv['responses'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} response_id: {response_id}')
    pprint(data_json[response_id])

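Note that the files were downloaded (manually) from the pol directory because of their manageable size (pol: contains the subset of the main dataset corresponding to comments in /r/politics).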

Result: D:\bat\SO596864.py

                               First csv line decoded:
                               post_id: hqa1x
{'author': 'joshlamb619',
 'created_utc': 1307053256,
 'date': '2011-06',
 'downs': 359,
 'score': 274,
 'subreddit': 'politics',
 'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
         'candidates during recall elections.',
 'ups': 633}
                               response_id: c1xiujs
{'author': 'Artisane',
 'created_utc': 1307077221,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "And we're upset since the Democrats would *never* try something as "
         'sneaky as this, right?',
 'ups': -2}
                               response_id: c1xj4e2
{'author': 'stellarfury',
 'created_utc': 1307080843,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
         "Picture this we were makin' up candidates Being huge election whores",
 'ups': -2}
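
The row format quoted from readme.txt can also be unpacked without pandas. The following is only a minimal sketch under that description: parse_row is a hypothetical helper name (it is not part of the dataset's tooling), and it assumes that every row has exactly three '|'-delimited entries whose elements are separated by single spaces. It pairs each response_id with its label, which is the correspondence the readme describes.

```python
def parse_row(line):
    """Split one row into post id, comment ids and (response_id, label) pairs.

    Hypothetical helper; assumes the 'chain|responses|labels' layout
    described in readme.txt.
    """
    chain, responses, labels = line.rstrip("\n").split("|")
    chain_ids = chain.split(" ")
    response_ids = responses.split(" ")
    label_values = [int(value) for value in labels.split(" ")]

    # The readme states that responses and labels have the same length.
    if len(response_ids) != len(label_values):
        raise ValueError("responses and labels differ in length")

    return {
        "post_id": chain_ids[0],        # first element is always the post
        "comment_ids": chain_ids[1:],   # zero or more comments in the chain
        "responses": list(zip(response_ids, label_values)),  # label 1 = sarcastic
    }


# Fourth sample row from the question.
row = parse_row("hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0")
print(row)
```

Each id in the returned dictionary can then be used as a key into comments.json, exactly as data_json[post_id] is used in the snippet above.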
