Use a Generator To Convert JSON and TSV Data into a Dictionary
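We need to load the data from the file file.data into a DataFrame. The problem is that each line of the file is in either JSON or tab-separated values (TSV) format.
The JSON lines are correctly formatted; they only need to be converted to native Python dicts.
The TSV lines need to be converted into dicts that match the JSON format.
Here is a sample of the file: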
{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons Persistent contextually-based standardization 018.666.0600 America/Los_Angeles 492
Ferguson-Garner Multi-layered tertiary neural-net (086)401-8955x53502 America/Los_Angeles 528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": "+1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}
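Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each value in the correct format: a dict with the keys:
- company
- catch_phrase
- phone
- timezone
- client_count
My code so far: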
import json
import pandas as pd

df = pd.read_csv("file.data", sep="\t")
for col in df[["company"]]:
    obj = df[col]
    for item in obj.values:
        json_obj = json.loads(item)
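Don't use pandas to read the whole file. Instead, read the file line by line and collect the JSON lines as a list of dicts and the TSV lines as a list of lists. Then use pandas to build the DataFrame.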
import json

dict_data = []
tsv_data = []
with open('file.data', 'r') as f:
    for line in f:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_data.append(d)
        except json.JSONDecodeError:  # json.loads() raises JSONDecodeError on a TSV line
            tsv_data.append(line.split("\t"))  # Split the line on tabs and append it to the TSV list
After this, we have
dict_data = [{'company': 'Watkins Inc',
'catch_phrase': 'Integrated radical installation',
'phone': '7712422719',
'timezone': 'America/New_York',
'client_count': 442},
{'company': 'Pennington PLC',
'catch_phrase': 'Future-proofed tertiary frame',
'phone': '+1-312-296-2956x137',
'timezone': 'America/Indiana/Indianapolis',
'client_count': 638}]
tsv_data = [['Bennett and Sons',
'Persistent contextually-based standardization',
'018.666.0600',
'America/Los_Angeles',
'492'],
['Ferguson-Garner',
'Multi-layered tertiary neural-net',
'(086)401-8955x53502',
'America/Los_Angeles',
'528']]
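One caveat with the try/except classification used above: json.loads() also accepts bare values, so a line containing only 492 parses successfully as an int rather than raising. The following is a minimal sketch of a more defensive check, assuming you only want JSON objects treated as records. The helper name classify_line is hypothetical and not part of the original answer.

```python
import json


def classify_line(line):
    """Return ("json", dict) or ("tsv", list_of_fields) for one line of text.

    classify_line is a hypothetical helper used only for illustration.
    """
    text = line.rstrip("\r\n")
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not valid JSON at all, so treat it as tab-separated fields.
        return "tsv", text.split("\t")
    # json.loads also accepts bare values such as 492 or "abc";
    # only a JSON object is treated as a ready-made record.
    if isinstance(parsed, dict):
        return "json", parsed
    return "tsv", text.split("\t")


print(classify_line('{"company": "Watkins Inc", "client_count": 442}'))
print(classify_line("Bennett and Sons\t018.666.0600\t492"))
print(classify_line("492"))
```

With the sample file this check changes nothing, since every TSV line there has five fields and cannot be mistaken for JSON. It only matters for input with single-field lines.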
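Note that everything in tsv_data is a string, so we'll have to fix that at some point.
Now, create a DataFrame from each of the two lists dict_data and tsv_data, change the data types of the TSV DataFrame to match, and then concatenate the two.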
import pandas as pd

data_cols = list(dict_data[0].keys())
df_dict = pd.DataFrame(dict_data)
df_tsv = pd.DataFrame(tsv_data, columns=data_cols)
for column in df_tsv:
    df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_all = pd.concat([df_dict, df_tsv], ignore_index=True)
df_all looks like this:
| | company | catch_phrase | phone | timezone | client_count |
|---|---|---|---|---|---|
| 0 | Watkins Inc | Integrated radical installation | 7712422719 | America/New_York | 442 |
| 1 | Pennington PLC | Future-proofed tertiary frame | +1-312-296-2956x137 | America/Indiana/Indianapolis | 638 |
| 2 | Bennett and Sons | Persistent contextually-based standardization | 018.666.0600 | America/Los_Angeles | 492 |
| 3 | Ferguson-Garner | Multi-layered tertiary neural-net | (086)401-8955x53502 | America/Los_Angeles | 528 |
Applying this to the generator function you originally wanted:
def parse_file(file_iterator):
    dict_keys_types = None
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types
            # so you can parse the TSV lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError:  # json.loads() raises JSONDecodeError on a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the TSV data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
Now you can pass the file iterator (here f, the open file object) to this function, and it will yield the dicts you want:
list(parse_file(f))
[{'company': 'Watkins Inc',
'catch_phrase': 'Integrated radical installation',
'phone': '7712422719',
'timezone': 'America/New_York',
'client_count': 442},
{'company': 'Bennett and Sons',
'catch_phrase': 'Persistent contextually-based standardization',
'phone': '018.666.0600',
'timezone': 'America/Los_Angeles',
'client_count': 492},
{'company': 'Ferguson-Garner',
'catch_phrase': 'Multi-layered tertiary neural-net',
'phone': '(086)401-8955x53502',
'timezone': 'America/Los_Angeles',
'client_count': 528},
{'company': 'Pennington PLC',
'catch_phrase': 'Future-proofed tertiary frame',
'phone': '+1-312-296-2956x137',
'timezone': 'America/Indiana/Indianapolis',
'client_count': 638}]
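The inference-based parse_file generator raises a TypeError when the first line of the file is not a JSON dict, because dict_keys_types is still None at that point and there are no keys or data types to apply. Instead of inferring the keys and types from the first JSON dict you see, you can either hardcode the keys and data types, or put the TSV lines that come before the first dict into a separate list to be parsed later.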
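To make that failure concrete, here is a small sketch that feeds the inference-based generator a made-up two-line input in which the TSV line comes first. The io.StringIO object stands in for an open file and is an assumption of this example; the generator body is a condensed copy of the one shown earlier.

```python
import io
import json


def parse_file(file_iterator):
    # Condensed copy of the inference-based generator, for the demonstration only.
    dict_keys_types = None
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_keys_types = [(key, type(value)) for key, value in d.items()]
            yield d
        except json.JSONDecodeError:
            tsv_data = line.split("\t")
            # dict_keys_types is still None here if no JSON line came first,
            # so zip() cannot iterate over it.
            yield {key: dtype(value)
                   for value, (key, dtype) in zip(tsv_data, dict_keys_types)}


# Made-up input: the TSV line precedes the JSON line.
tsv_first = io.StringIO(
    "Bennett and Sons\tPersistent contextually-based standardization\t"
    "018.666.0600\tAmerica/Los_Angeles\t492\n"
    '{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", '
    '"phone": "7712422719", "timezone": "America/New_York", "client_count": 442}\n'
)

try:
    list(parse_file(tsv_first))
except TypeError as exc:
    print("TypeError:", exc)
```

The exact wording of the message depends on the Python version, but the exception type is TypeError in each case.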
The hardcoded approach:
def parse_file(file_iterator):
    dict_keys_types = [('company', str),
                       ('catch_phrase', str),
                       ('phone', str),
                       ('timezone', str),
                       ('client_count', int)]
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            yield d
        except json.JSONDecodeError:  # json.loads() raises JSONDecodeError on a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the TSV data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
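One thing to keep in mind with the hardcoded version is that zip() stops at the shorter of its inputs, so a TSV line with too few or too many fields yields a dict with missing or dropped values instead of failing. If you would rather have such lines rejected, a possible sketch is below. The names FIELD_TYPES and parse_file_strict are hypothetical, and raising ValueError on a bad field count (and skipping blank lines) is an assumption about the behavior you want.

```python
import json

# Hypothetical schema constant; keys and types mirror the hardcoded list above.
FIELD_TYPES = [
    ("company", str),
    ("catch_phrase", str),
    ("phone", str),
    ("timezone", str),
    ("client_count", int),
]


def parse_file_strict(file_iterator, field_types=FIELD_TYPES):
    """Yield one dict per line; reject TSV lines with the wrong field count."""
    for line_number, line in enumerate(file_iterator, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines (an assumption of this sketch)
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            fields = line.split("\t")
            if len(fields) != len(field_types):
                raise ValueError(
                    f"line {line_number}: expected {len(field_types)} fields, "
                    f"got {len(fields)}"
                )
            record = {key: dtype(value)
                      for value, (key, dtype) in zip(fields, field_types)}
        yield record


sample = [
    "Bennett and Sons\tPersistent contextually-based standardization\t"
    "018.666.0600\tAmerica/Los_Angeles\t492\n",
]
print(list(parse_file_strict(sample)))
```

Stripping only the line ending, rather than calling strip(), keeps a trailing empty field from being silently dropped before the count is checked.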
The save-for-later approach:
def parse_file(file_iterator):
    dict_keys_types = None
    unused_tsv_lines = []
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types
            # so you can parse the TSV lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError:  # json.loads() raises JSONDecodeError on a TSV line
            tsv_data = line.split("\t")
            if dict_keys_types:  # Check if this is set already
                # If it is, iterate over tsv_data and dict_keys_types
                # to convert the TSV data to a dict with the correct types
                tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                yield tsv_dict
            else:  # Otherwise add to unused_tsv_lines
                unused_tsv_lines.append(tsv_data)
    # After you've finished reading the file, try to reparse the lines
    # you skipped before
    if dict_keys_types:  # Before parsing, make sure dict_keys_types was set
        for tsv_data in unused_tsv_lines:
            # With each line, do the same thing as before
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
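Note that the save-for-later version yields the deferred TSV lines at the very end, so the output order no longer matches the file. If order matters, one possible variation is to flush the buffered lines as soon as the first JSON dict is seen. This is only a sketch: the name parse_file_in_order is hypothetical, and unlike the version above it infers the keys and types from the first JSON line only.

```python
import json


def parse_file_in_order(file_iterator):
    """Infer keys and types from the first JSON line, keeping the file's order.

    parse_file_in_order is a hypothetical name used only for this sketch.
    """
    dict_keys_types = None
    pending_tsv = []  # TSV rows seen before any JSON line

    def to_dict(fields):
        # Reads dict_keys_types from the enclosing scope at call time.
        return {key: dtype(value)
                for value, (key, dtype) in zip(fields, dict_keys_types)}

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
        except json.JSONDecodeError:
            fields = line.split("\t")
            if dict_keys_types is None:
                pending_tsv.append(fields)  # cannot convert yet
            else:
                yield to_dict(fields)
            continue
        if dict_keys_types is None:
            dict_keys_types = [(key, type(value)) for key, value in d.items()]
            # Flush the buffered rows first so they stay ahead of this dict.
            for fields in pending_tsv:
                yield to_dict(fields)
            pending_tsv.clear()
        yield d


lines = ["A\t1\n", '{"name": "B", "n": 2}\n', "C\t3\n"]
print(list(parse_file_in_order(lines)))
```

As with the original, if the input contains no JSON line at all, the buffered TSV rows are never yielded because there is nothing to infer the keys from.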