Use a Generator To Convert JSON and TSV Data into a Dictionary

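We need to load the data from the file file.data into a DataFrame. The problem is that each line of the file is in either JSON or tab-separated values (TSV) format.

The JSON lines are correctly formatted; they only need to be converted to native Python dictionaries.

The TSV lines need to be converted into dictionaries that match the JSON format.

Here is a sample of the file: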

{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons    Persistent contextually-based standardization   018.666.0600    America/Los_Angeles 492
Ferguson-Garner Multi-layered tertiary neural-net   (086)401-8955x53502 America/Los_Angeles 528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": "+1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}

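Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each one in the correct format: a dictionary with the keys company, catch_phrase, phone, timezone, and client_count.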

My code so far:

import json
import pandas as pd

df = pd.read_csv("file.data", sep="\t")
for col in df[["company"]]:
    obj = df[col]
    for item in obj.values:
        json_obj = json.loads(item)

Don't use pandas to read the whole file. Instead, read the file line by line and build a list of dictionaries. Then use pandas to get the DataFrame.

import json

dict_data = []
tsv_data = []
with open('file.data', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:  # Skip blank lines
            continue
        try:
            d = json.loads(line)
            dict_data.append(d)
        except json.JSONDecodeError:  # Raised when you try to loads() a TSV line
            tsv_data.append(line.split("\t"))  # Split the line by tabs, append to the tsv list

After this, we have:

dict_data = [{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': '+1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]

tsv_data = [['Bennett and Sons',
  'Persistent contextually-based standardization',
  '018.666.0600',
  'America/Los_Angeles',
  '492'],
 ['Ferguson-Garner',
  'Multi-layered tertiary neural-net',
  '(086)401-8955x53502',
  'America/Los_Angeles',
  '528']]

Note that everything in tsv_data is a string, so we will have to fix that at some point.
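As a minimal sketch of that fix, one TSV row can be cast field by field by pairing it with the key and type of the matching JSON field. The `schema` list here is an assumption, written out by hand from the JSON rows above; it is not part of the original code.

```python
# Hypothetical schema: (key, type) pairs copied from the JSON rows above.
schema = [
    ("company", str),
    ("catch_phrase", str),
    ("phone", str),
    ("timezone", str),
    ("client_count", int),
]

# One already-split TSV row, as stored in tsv_data.
row = [
    "Bennett and Sons",
    "Persistent contextually-based standardization",
    "018.666.0600",
    "America/Los_Angeles",
    "492",
]

# Pair each string with its key and type, then cast it.
record = {key: cast(value) for (key, cast), value in zip(schema, row)}
print(type(record["client_count"]).__name__)  # int
```

Only client_count changes type here; the phone number stays a string, which is what the JSON rows use as well.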

Now create a DataFrame from each of the two lists, dict_data and tsv_data, change the data types of the TSV DataFrame to match, and then concatenate the two.

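import pandas as pd

data_cols = list(dict_data[0].keys())
df_dict = pd.DataFrame(dict_data)
df_tsv = pd.DataFrame(tsv_data, columns=data_cols)

# Cast each TSV column to the dtype of the matching JSON column
for column in df_tsv:
    df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_all = pd.concat([df_dict, df_tsv], ignore_index=True)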

df_all looks like this:

            company                                   catch_phrase                phone                      timezone  client_count
0       Watkins Inc                Integrated radical installation           7712422719              America/New_York           442
1    Pennington PLC                  Future-proofed tertiary frame  +1-312-296-2956x137  America/Indiana/Indianapolis           638
2  Bennett and Sons  Persistent contextually-based standardization         018.666.0600           America/Los_Angeles           492
3   Ferguson-Garner              Multi-layered tertiary neural-net  (086)401-8955x53502           America/Los_Angeles           528

Applying this to the generator function you originally wanted:

import json

def parse_file(file_iterator):
    dict_keys_types = None

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, record its keys and value types
            # so you can parse the TSV lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError:  # Raised when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Pair each TSV value with its key and type to build a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict

Now you can pass a file iterator, such as the open file object f, to this function, and it will yield the dictionaries you want:

list(parse_file(f))

[{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Bennett and Sons',
  'catch_phrase': 'Persistent contextually-based standardization',
  'phone': '018.666.0600',
  'timezone': 'America/Los_Angeles',
  'client_count': 492},
 {'company': 'Ferguson-Garner',
  'catch_phrase': 'Multi-layered tertiary neural-net',
  'phone': '(086)401-8955x53502',
  'timezone': 'America/Los_Angeles',
  'client_count': 528},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': '+1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]


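If the first line of the file is not a JSON dictionary, this raises an error (a TypeError, because dict_keys_types is still None), since there are no keys or data types to work with yet. Instead of inferring the keys and types from the first JSON dictionary you see, you can either hardcode the keys and data types, or set aside the TSV lines that come before the first dictionary in a separate list and parse them later.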
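To make the failure concrete, here is a small sketch that feeds a condensed copy of the inferring generator a made-up input whose first line is TSV. The `lines` list is an assumption built from the sample rows; any iterable of strings behaves like the file object.

```python
import json

def parse_file(file_iterator):
    """Condensed copy of the schema-inferring generator above."""
    dict_keys_types = None
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_keys_types = [(key, type(value)) for key, value in d.items()]
            yield d
        except json.JSONDecodeError:
            tsv_data = line.split("\t")
            # dict_keys_types is still None if no JSON line has been seen yet
            yield {key: dtype(value)
                   for value, (key, dtype) in zip(tsv_data, dict_keys_types)}

# Made-up input in which the TSV line comes first.
lines = [
    "Bennett and Sons\tPersistent contextually-based standardization\t018.666.0600\tAmerica/Los_Angeles\t492\n",
    '{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}\n',
]

try:
    list(parse_file(lines))
    failed = False
except TypeError as exc:
    failed = True
    print("TypeError:", exc)
```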

The hardcoded approach:

def parse_file(file_iterator):
    dict_keys_types = [('company', str),
         ('catch_phrase', str),
         ('phone', str),
         ('timezone', str),
         ('client_count', int)]

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            yield d
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
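For a quick check of the hardcoded approach, the sketch below feeds the generator a list of lines whose first line is TSV. The function is a condensed copy under the hypothetical name `parse_file_hardcoded`, used only to keep the example self-contained, and the input lines are taken from the sample file.

```python
import json

# Keys and types written out by hand, as in the hardcoded approach.
DICT_KEYS_TYPES = [
    ("company", str),
    ("catch_phrase", str),
    ("phone", str),
    ("timezone", str),
    ("client_count", int),
]

def parse_file_hardcoded(file_iterator):
    for line in file_iterator:
        line = line.strip()
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            tsv_data = line.split("\t")
            yield {key: dtype(value)
                   for value, (key, dtype) in zip(tsv_data, DICT_KEYS_TYPES)}

# The TSV line comes first, which the inferring version cannot handle.
lines = [
    "Ferguson-Garner\tMulti-layered tertiary neural-net\t(086)401-8955x53502\tAmerica/Los_Angeles\t528\n",
    '{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": "+1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}\n',
]

records = list(parse_file_hardcoded(lines))
print(records[0]["company"], records[0]["client_count"])
```

The trade-off is that the key list has to be kept in sync with the file by hand.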

The save-for-later approach:

def parse_file(file_iterator):
    dict_keys_types = None
    unused_tsv_lines = []
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types 
            # So you can parse the tsv lines
            dict_keys_types = [
                (key, type(value))
                for key, value in d.items()
            ]
            yield d
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            if dict_keys_types: # Check if this is set already
                # If it is, 
                # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
                tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
                yield tsv_dict
            else: # Else add to unused_tsv_lines
                unused_tsv_lines.append(tsv_data)

    # After you've finished reading the file, try to reparse the lines
    # you skipped before
    if dict_keys_types: # Before parsing, make sure dict_keys_types was set
        for tsv_data in unused_tsv_lines:
            # With each line, do the same thing as before
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
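One consequence of this version is that TSV lines read before the first JSON line are yielded last, so the output order can differ from the file order. If order matters, a possible variant is to flush the buffered lines as soon as the first JSON line supplies the keys and types. This is a sketch under that assumption, not part of the original answer; `parse_file_ordered` and `to_dict` are hypothetical names.

```python
import json

def to_dict(tsv_data, dict_keys_types):
    """Convert one split TSV row using (key, type) pairs."""
    return {key: dtype(value)
            for value, (key, dtype) in zip(tsv_data, dict_keys_types)}

def parse_file_ordered(file_iterator):
    dict_keys_types = None
    pending = []  # TSV rows read before any JSON line
    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
        except json.JSONDecodeError:
            tsv_data = line.split("\t")
            if dict_keys_types is None:
                pending.append(tsv_data)  # No schema yet, keep for later
            else:
                yield to_dict(tsv_data, dict_keys_types)
            continue
        dict_keys_types = [(key, type(value)) for key, value in d.items()]
        # Flush rows that were waiting for a schema, in file order.
        for tsv_data in pending:
            yield to_dict(tsv_data, dict_keys_types)
        pending.clear()
        yield d

lines = [
    "Bennett and Sons\tPersistent contextually-based standardization\t018.666.0600\tAmerica/Los_Angeles\t492\n",
    '{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}\n',
    "Ferguson-Garner\tMulti-layered tertiary neural-net\t(086)401-8955x53502\tAmerica/Los_Angeles\t528\n",
]

records = list(parse_file_ordered(lines))
print([r["company"] for r in records])
# ['Bennett and Sons', 'Watkins Inc', 'Ferguson-Garner']
```

As in the version above, a file with no JSON lines at all yields nothing, because the keys and types are never learned.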