Correctly reading data from a text file that contains JSON-formatted data
Suppose I have a text file containing the following two observations:
liame@ziggo.nl:horse22| homeAddress = {
"city": "AMSTERDAM",
"houseNumber": "5",
"houseNumberAddition": null,
"postalCode": "1111 AN",
"street": "Walker",
"__typename": "ShopperAddress"
}
johndoe@live.nl:pizzalover1 | homeAddress = {
"city": "NEW YOK",
"houseNumber": "23",
"houseNumberAddition": null,
"postalCode": "9999 HV",
"street": "Marie Curie",
"__typename": "ShopperAddress"
}
Is there a way to read this text file into a data frame like the one shown below?
username1 username2 city housenumber housenumber_addition postalcode street typename
liame@ziggo.nl horse22 AMSTERDAM 5 null 1111 AN Walker ShopperAddress
johndoe@live.nl pizzalover1 NEW YOK 23 null 9999 HV Marie Curie ShopperAddress
Thanks.
Your text file suggests that the data is encoded following a pattern:
<username1>:<username2> | homeAddress = {
<json_data>
}
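We'll parse the file in two passes: the first pass separates one record from another, and the second pass picks out the fields within each record:
- A record ends at a line that contains only a single "}" character
- A regular expression separates the fields within a record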
import json, re
import pandas as pd

data = []
# Groups: username1, username2, and the JSON object that follows "homeAddress = "
pattern = re.compile(r"(.+?):(.+?)\s*\|\s*homeAddress = (.+)", re.DOTALL)

with open('data.txt') as fp:
    record = ""
    for line in fp:
        record += line
        # A line holding only "}" closes the current record; strip() also
        # covers a final line that has no trailing newline
        if line.strip() == "}":
            m = pattern.match(record)
            if m:
                username1 = m.group(1)
                username2 = m.group(2)
                home_address = json.loads(m.group(3))
                data.append({
                    "username1": username1,
                    "username2": username2,
                    **home_address
                })
            record = ""

df = pd.DataFrame(data).rename(columns={"__typename": "typename"})
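The code above renames only `__typename`, so the other columns keep their camelCase JSON keys. A minimal sketch of the same two-pass idea that also produces the column names asked for in the question follows. The `parse_records` helper, the `COLUMN_NAMES` mapping, and the in-memory `SAMPLE` string are assumptions made for illustration; a real script would pass an open file instead of `io.StringIO`.

```python
import io
import json
import re

import pandas as pd

# Sample input copied from the question (assumption: the real data lives in a file).
SAMPLE = '''liame@ziggo.nl:horse22| homeAddress = {
"city": "AMSTERDAM",
"houseNumber": "5",
"houseNumberAddition": null,
"postalCode": "1111 AN",
"street": "Walker",
"__typename": "ShopperAddress"
}
johndoe@live.nl:pizzalover1 | homeAddress = {
"city": "NEW YOK",
"houseNumber": "23",
"houseNumberAddition": null,
"postalCode": "9999 HV",
"street": "Marie Curie",
"__typename": "ShopperAddress"
}
'''

PATTERN = re.compile(r"(.+?):(.+?)\s*\|\s*homeAddress = (.+)", re.DOTALL)

# Hypothetical mapping from the JSON keys to the column names in the question.
COLUMN_NAMES = {
    "houseNumber": "housenumber",
    "houseNumberAddition": "housenumber_addition",
    "postalCode": "postalcode",
    "__typename": "typename",
}


def parse_records(fp):
    """Yield one flat dict per record read from the file-like object fp."""
    record = ""
    for line in fp:
        record += line
        if line.strip() == "}":  # a lone closing brace ends the record
            m = PATTERN.match(record)
            if m:
                yield {
                    "username1": m.group(1).strip(),
                    "username2": m.group(2).strip(),
                    **json.loads(m.group(3)),
                }
            record = ""


df = pd.DataFrame(list(parse_records(io.StringIO(SAMPLE)))).rename(columns=COLUMN_NAMES)
print(df)
```

Because `json.loads` is used directly, `houseNumber` stays a string here; convert it with `df["housenumber"].astype(int)` if numeric values are needed.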
You can make a few modifications to the original text so that it becomes valid JSON, then feed it to pandas.read_json:
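from io import StringIO

# text holds the full contents of the file, e.g. text = open('data.txt').read()
# \1 and \2 insert the two captured usernames into the generated JSON;
# StringIO avoids the deprecation of literal JSON strings in recent pandas
(pd.read_json(StringIO('[%s]'%re.sub(r'([^:\n]+):([^\|:]+)\s*\|\s*homeAddress = {',
                                     r',{\n "username1":"\1",\n "username2":"\2",',
                                     text)[1:]))
   .rename(columns={'houseNumber': 'housenumber',
                    'houseNumberAddition': 'housenumber_addition',
                    'postalCode': 'postalcode',
                    '__typename': 'typename'})
)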
Output:
username1 username2 city housenumber housenumber_addition postalcode street typename
0 liame@ziggo.nl horse22 AMSTERDAM 5 NaN 1111 AN Walker ShopperAddress
1 johndoe@live.nl pizzalover1 NEW YOK 23 NaN 9999 HV Marie Curie ShopperAddress
Intermediate reworked data:
[{
"username1":"liame@ziggo.nl",
"username2":"horse22",
"city": "AMSTERDAM",
"houseNumber": "5",
"houseNumberAddition": null,
"postalCode": "1111 AN",
"street": "Walker",
"__typename": "ShopperAddress"
}
,{
"username1":"johndoe@live.nl",
"username2":"pizzalover1 ",
"city": "NEW YOK",
"houseNumber": "23",
"houseNumberAddition": null,
"postalCode": "9999 HV",
"street": "Marie Curie",
"__typename": "ShopperAddress"
}]
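Note that `"pizzalover1 "` in the reworked data keeps a trailing space, because the greedy group `([^\|:]+)` also consumes the whitespace before the `|`. One possible fix, sketched here as an assumption rather than as part of the original answer, is to make that group lazy so `\s*` absorbs the space. The shortened `text` sample and the `header` name are made up for the illustration.

```python
import json
import re

# Shortened sample with the same header layout as the question's file.
text = '''liame@ziggo.nl:horse22| homeAddress = {
"city": "AMSTERDAM"
}
johndoe@live.nl:pizzalover1 | homeAddress = {
"city": "NEW YOK"
}
'''

# The lazy quantifier +? stops the second group before any trailing whitespace.
header = re.compile(r'([^:\n]+):([^\|:]+?)\s*\|\s*homeAddress = \{')

# Rewrite each header into the opening of a JSON object, drop the leading
# comma, and wrap everything in a JSON array.
reworked = '[%s]' % header.sub(r',{\n "username1":"\1",\n "username2":"\2",', text)[1:]

records = json.loads(reworked)
print(records)
```

Alternatively, the whitespace can be removed after loading, for example with `df["username2"].str.strip()`.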