Identifying partial character encoding/compression in text content
I have a CSV (extracted from a BZ2) in which only some of the values appear to be encoded:
hoxvh|c1x6nos c1x6e26|0 1
hqa1x|c1xiujs c1xj4e2|1 0
hpopn|c1xeuca c1xdepf|0 1
hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0
hqdex|c1xls27 c1xjvjx|1 0
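The |, 0 and 1 characters definitely appear as intended, but the other values are clearly encoded. In fact, they look like text-compression substitutions, which could mean that the values in the CSV were compressed and then the file as a whole was also compressed to BZ2.
I get the same result whether I extract the BZ2 with 7zip and open the CSV in a text editor, open it with the Python bz2 module, or use Pandas and read_csv: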
import bz2

with bz2.open("test-balanced.csv.bz2") as f:
    contents = f.read().decode()

import pandas as pd

contents = pd.read_csv("test-balanced.csv.bz2", compression="bz2", encoding="utf-8")
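How can I determine which type of encoding to decode with?
Source directory: https://nlp.cs.princeton.edu/SARC/2.0/main
First 100 lines of the extracted CSV file: https://pastebin.com/mgW8hKdh
I asked the original authors of the CSV/dataset, but they did not respond, which is understandable.
From readme.txt: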
File Guide:
- raw/key.csv: column key for raw/sarc.csv
- raw/sarc.csv: contains sarcastic and non-sarcastic comments of authors in authors.json
- */comments.json: dictionary in JSON format containing text and metadata for each comment in {comment_id: data} format
- */*.csv: CSV where each row contains a sequence of comments following a post, a set of responses to the last comment in that sequence, and sarcastic/non-sarcastic labels for those responses. The format is
  post_id comment_id … comment_id|response_id … response_id|label … label
  where *_id is a key to */comments.json and label 1 indicates the respective response_id maps to a sarcastic response. Thus each row has three entries (comment chain, responses, labels) delimited by '|', and each of these entries has elements delimited by spaces. The first entry always contains a post_id and 0 or more comment_ids. The second and third entries have the same number of elements, with the first response_id corresponding to the first label and so on.
Translating the above into a Python code snippet:
import pandas as pd
import json
from pprint import pprint

# Each row has three '|'-delimited entries: comment chain, responses, labels.
file_csv = r"D:\bat\SO596864\test-balanced.csv"
data_csv = pd.read_csv(file_csv,
                       sep='|',
                       names=['posts', 'responses', 'labels'],
                       encoding='utf-8')

# comments.json maps every ID used in the CSV to its text and metadata.
file_json = r"D:\bat\SO596864\comments.json"
with open(file_json, mode='r', encoding='utf-8') as f:
    data_json = json.load(f)

print(f'{chr(0x20)*30} First csv line decoded:')
# The IDs within each entry are separated by single spaces.
for post_id in data_csv['posts'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} post_id: {post_id}')
    pprint(data_json[post_id])

for response_id in data_csv['responses'][0].split(chr(0x20)):
    print(f'{chr(0x20)*30} response_id: {response_id}')
    pprint(data_json[response_id])
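Note that the files were downloaded (manually) from the pol directory because of its acceptable size (pol: contains a subset of the main dataset corresponding to comments in /r/politics).
Result: D:\bat\SO596864.py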
                               First csv line decoded:
                               post_id: hqa1x
{'author': 'joshlamb619',
 'created_utc': 1307053256,
 'date': '2011-06',
 'downs': 359,
 'score': 274,
 'subreddit': 'politics',
 'text': 'Wisconsin GOP caught red handed, looking to run fake Democratic '
         'candidates during recall elections.',
 'ups': 633}
                               response_id: c1xiujs
{'author': 'Artisane',
 'created_utc': 1307077221,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "And we're upset since the Democrats would *never* try something as "
         'sneaky as this, right?',
 'ups': -2}
                               response_id: c1xj4e2
{'author': 'stellarfury',
 'created_utc': 1307080843,
 'date': '2011-06',
 'downs': 0,
 'score': -2,
 'subreddit': 'politics',
 'text': "Oooh baby you caught me red handed Creepin' on the senate floor "
         "Picture this we were makin' up candidates Being huge election whores",
 'ups': -2}
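In other words, the values are not encoded at all: they are IDs that key into comments.json. As a minimal sketch of the row format quoted from the readme, the following uses only the standard library to read '|'-delimited rows from BZ2-compressed bytes and pair each response_id with its label. The names SAMPLE, parse_row and read_rows are hypothetical and not part of the dataset tooling. The two sample rows are copied from the question and compressed in memory only to stand in for the real .bz2 file.

```python
import bz2
import csv
import io

# Two rows copied from the question, compressed in memory so the sketch
# runs without the real test-balanced.csv.bz2 (an assumption for illustration).
SAMPLE = (
    "hoxvh|c1x6nos c1x6e26|0 1\n"
    "hpibh c1xcjy1|c1xe4yn c1xd1gh|1 0\n"
)
compressed = bz2.compress(SAMPLE.encode("utf-8"))


def parse_row(fields):
    """Turn one row's three entries into (post_id, comment_ids, pairs).

    pairs is a list of (response_id, is_sarcastic) tuples, following the
    readme's rule that the n-th label belongs to the n-th response_id.
    """
    chain, responses, labels = (field.split(" ") for field in fields)
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have the same length")
    # The first element of the chain is the post_id; the rest are comment_ids.
    post_id, *comment_ids = chain
    pairs = [(rid, label == "1") for rid, label in zip(responses, labels)]
    return post_id, comment_ids, pairs


def read_rows(raw_bz2):
    """Decompress BZ2 bytes and parse every non-empty '|'-delimited row."""
    text = bz2.decompress(raw_bz2).decode("utf-8")
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [parse_row(fields) for fields in reader if fields]


rows = read_rows(compressed)
for post_id, comment_ids, pairs in rows:
    print(post_id, comment_ids, pairs)
```

With the real file, the bytes could instead be read with open(path, "rb").read(), and each ID could then be looked up in the dictionary loaded from comments.json as in the snippet above.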