Correctly reading data from a text file that contains JSON-formatted records

Suppose I have a text file containing the following two observations:

liame@ziggo.nl:horse22| homeAddress = {
  "city": "AMSTERDAM",
  "houseNumber": "5",
  "houseNumberAddition": null,
  "postalCode": "1111 AN",
  "street": "Walker",
  "__typename": "ShopperAddress"
}
johndoe@live.nl:pizzalover1 | homeAddress = {
  "city": "NEW YOK",
  "houseNumber": "23",
  "houseNumberAddition": null,
  "postalCode": "9999 HV",
  "street": "Marie Curie",
  "__typename": "ShopperAddress"
}

Is there a way to read this text file into a DataFrame that looks like the following:

username1       username2    city        housenumber  housenumber_addition  postalcode   street      typename
liame@ziggo.nl  horse22      AMSTERDAM   5            null                  1111 AN      Walker      ShopperAddress
johndoe@live.nl pizzalover1  NEW YOK     23           null                  9999 HV      Marie Curie ShopperAddress

Thanks.

Your text file suggests a pattern in how the data is encoded:

<username1>:<username2> | homeAddress = {
    <json_data>
}

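We will parse the file in two passes: the first pass separates one record from another, and the second pass picks out the fields within each record:

  • A record ends on a line that contains only a single "}" character
  • A regular expression separates the fields within a record
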
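import json
import re

import pandas as pd

data = []
# Groups: username1, username2 and the JSON object (DOTALL lets "." span lines)
pattern = re.compile(r"(.+?):(.+?)\s*\|\s*homeAddress = (.+)", re.DOTALL)

with open('data.txt') as fp:
    record = ""
    for line in fp:
        record += line

        # A line holding only "}" closes the record; rstrip() also covers
        # a final line that has no trailing newline
        if line.rstrip() == "}":
            m = pattern.match(record)
            if m:
                username1 = m.group(1)
                username2 = m.group(2)
                home_address = json.loads(m.group(3))
                data.append({
                    "username1": username1,
                    "username2": username2,
                    **home_address
                })
            record = ""

# Rename the JSON keys to the column names requested in the question
df = pd.DataFrame(data).rename(columns={"houseNumber": "housenumber",
                                        "houseNumberAddition": "housenumber_addition",
                                        "postalCode": "postalcode",
                                        "__typename": "typename"})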
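
To try the two-pass parser on the sample from the question without creating data.txt, a sketch along the following lines could be used. The helper name parse_records and the in-memory SAMPLE string are assumptions made for illustration; the regular expression and the end-of-record rule are the ones described above.

```python
import io
import json
import re

import pandas as pd

# Sample text copied from the question (two records)
SAMPLE = """liame@ziggo.nl:horse22| homeAddress = {
  "city": "AMSTERDAM",
  "houseNumber": "5",
  "houseNumberAddition": null,
  "postalCode": "1111 AN",
  "street": "Walker",
  "__typename": "ShopperAddress"
}
johndoe@live.nl:pizzalover1 | homeAddress = {
  "city": "NEW YOK",
  "houseNumber": "23",
  "houseNumberAddition": null,
  "postalCode": "9999 HV",
  "street": "Marie Curie",
  "__typename": "ShopperAddress"
}
"""

PATTERN = re.compile(r"(.+?):(.+?)\s*\|\s*homeAddress = (.+)", re.DOTALL)


def parse_records(fp):
    """Hypothetical helper: return one dict per record from a file-like object."""
    rows = []
    record = ""
    for line in fp:
        record += line
        # First pass: a line holding only "}" ends the current record
        if line.rstrip() == "}":
            # Second pass: split the record into usernames and JSON payload
            m = PATTERN.match(record)
            if m:
                rows.append({
                    "username1": m.group(1),
                    "username2": m.group(2),
                    **json.loads(m.group(3)),
                })
            record = ""
    return rows


df = pd.DataFrame(parse_records(io.StringIO(SAMPLE))).rename(columns={
    "houseNumber": "housenumber",
    "houseNumberAddition": "housenumber_addition",
    "postalCode": "postalcode",
    "__typename": "typename",
})
print(df)
```

Because the values go through json.loads here, house numbers stay strings ("5", "23") and the null additions become None rather than being converted by pandas.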

You can make a few modifications to the raw text so that it becomes a valid dictionary/JSON and feed it to pandas.read_json:

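import io, re
import pandas as pd

with open('data.txt') as fp:
    text = fp.read()
# \1 and \2 carry the two usernames into the JSON object; [1:] drops the leading comma
json_text = '[%s]' % re.sub(r'([^:\n]+):([^\|:]+)\s*\|\s*homeAddress = {',
                            r',{\n  "username1":"\1",\n  "username2":"\2",',
                            text)[1:]
# StringIO: recent pandas versions deprecate passing a literal JSON string
(pd.read_json(io.StringIO(json_text))
   .rename(columns={'houseNumber': 'housenumber',
                    'houseNumberAddition': 'housenumber_addition',
                    'postalCode': 'postalcode',
                    '__typename': 'typename'})
)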

Output:

         username1     username2       city  housenumber  housenumber_addition postalcode       street        typename
0   liame@ziggo.nl       horse22  AMSTERDAM            5                   NaN    1111 AN       Walker  ShopperAddress
1  johndoe@live.nl  pizzalover1     NEW YOK           23                   NaN    9999 HV  Marie Curie  ShopperAddress

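Intermediate reworked data (note that the second username2 keeps its trailing space, because the pattern captures everything up to the "|"):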

[{
  "username1":"liame@ziggo.nl",
  "username2":"horse22",
  "city": "AMSTERDAM",
  "houseNumber": "5",
  "houseNumberAddition": null,
  "postalCode": "1111 AN",
  "street": "Walker",
  "__typename": "ShopperAddress"
}
,{
  "username1":"johndoe@live.nl",
  "username2":"pizzalover1 ",
  "city": "NEW YOK",
  "houseNumber": "23",
  "houseNumberAddition": null,
  "postalCode": "9999 HV",
  "street": "Marie Curie",
  "__typename": "ShopperAddress"
}]
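
One way to check that the reworked text really is valid JSON is to run the substitution on the sample and load the result with json.loads. The sketch below assumes the replacement string uses the backreferences \1 and \2 for the two usernames, which is what the intermediate data above implies; SAMPLE, HEADER and records are illustrative names. It also strips the trailing space from username2 as a separate clean-up step.

```python
import json
import re

# Sample text copied from the question (two records)
SAMPLE = """liame@ziggo.nl:horse22| homeAddress = {
  "city": "AMSTERDAM",
  "houseNumber": "5",
  "houseNumberAddition": null,
  "postalCode": "1111 AN",
  "street": "Walker",
  "__typename": "ShopperAddress"
}
johndoe@live.nl:pizzalover1 | homeAddress = {
  "city": "NEW YOK",
  "houseNumber": "23",
  "houseNumberAddition": null,
  "postalCode": "9999 HV",
  "street": "Marie Curie",
  "__typename": "ShopperAddress"
}
"""

# Matches the "<username1>:<username2> | homeAddress = {" header of each record
HEADER = re.compile(r'([^:\n]+):([^\|:]+)\s*\|\s*homeAddress = {')

# Replace each header with the opening of a JSON object; [1:] drops the
# comma that the replacement puts in front of the first record
reworked = "[%s]" % HEADER.sub(
    r',{\n  "username1":"\1",\n  "username2":"\2",', SAMPLE)[1:]

# json.loads raises json.JSONDecodeError if the reworked text is not valid JSON
records = json.loads(reworked)

# Optional clean-up: the pattern leaves whitespace before "|" in username2
for rec in records:
    rec["username2"] = rec["username2"].strip()

print(records[1]["username2"])
```

The resulting list of dicts can also be passed straight to pd.DataFrame if the automatic type conversion done by read_json is not wanted.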