Nested dictionary to pandas df concatenating rows
Given the following dictionary:
j = {
  "source": "https://example.com",
  "timestamp": "2021-04-12T19:34:24Z",
  "durationInTicks": 1082400000,
  "duration": "PT1M48.24S",
  "combinedRecognizedPhrases": [
    {
      "channel": 0,
      "lexical": "aaa",
      "itn": "aaa",
      "maskedITN": "aaa",
      "display": "aaa"
    }
  ],
  "recognizedPhrases": [
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 1,
      "offset": "PT2.18S",
      "duration": "PT3.88S",
      "offsetInTicks": 21800000,
      "durationInTicks": 38800000,
      "nBest": [
        {
          "confidence": 0.9306252,
          "lexical": "gracias por llamar",
          "itn": "gracias por llamar",
          "maskedITN": "gracias por llamar",
          "display": "¿Gracias por llamar",
          "words": [
            {
              "word": "gracias",
              "offset": "PT2.18S",
              "duration": "PT0.37S",
              "offsetInTicks": 21800000,
              "durationInTicks": 3700000,
              "confidence": 0.930625
            },
            {
              "word": "por",
              "offset": "PT2.55S",
              "duration": "PT0.18S",
              "offsetInTicks": 25500000,
              "durationInTicks": 1800000,
              "confidence": 0.930625
            },
            {
              "word": "llamar",
              "offset": "PT2.73S",
              "duration": "PT0.22S",
              "offsetInTicks": 27300000,
              "durationInTicks": 2200000,
              "confidence": 0.930625
            }
          ]
        }
      ]
    },
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 2,
      "offset": "PT6.85S",
      "duration": "PT5.63S",
      "offsetInTicks": 68500000,
      "durationInTicks": 56300000,
      "nBest": [
        {
          "confidence": 0.9306253,
          "lexical": "quiero hacer un pago",
          "itn": "quiero hacer un pago",
          "maskedITN": "quiero hacer un pago",
          "display": "quiero hacer un pago"
        }
      ]
    },
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 2,
      "offset": "PT13.29S",
      "duration": "PT3.81S",
      "offsetInTicks": 132900000,
      "durationInTicks": 38100000,
      "nBest": [
        {
          "confidence": 0.93062526,
          "lexical": "no sé bien la cantidad",
          "itn": "no sé bien la cantidad",
          "maskedITN": "no sé bien la cantidad",
          "display": "no sé bien la cantidad"
        }
      ]
    }
  ]
}
Goal: get the information of interest into a single row of a df.
What have I done so far?:
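import pandas as pd
# One row per nBest entry, carrying the file-level fields and the speaker as metadata
df = pd.json_normalize(j, record_path=['recognizedPhrases', 'nBest'], meta=['source', 'durationInTicks', 'duration', ['recognizedPhrases', 'speaker']])
# Join everything a speaker said into one string, then keep one row per speaker
df['speech'] = df.groupby(['source', 'recognizedPhrases.speaker'])['display'].transform(lambda x: ' '.join(x))
df = df.drop_duplicates(subset=['recognizedPhrases.speaker'])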
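Resulting df:
Why am I not satisfied with the output?: My output is a df with two rows (one per recognizedPhrases.speaker). I need all the information in a single row, with one column holding what speaker 1 said and another column holding what speaker 2 said.
Additional information: performance is an important factor, since I will run this process on thousands of files.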
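Since thousands of files are involved, one option worth timing is to flatten each file with plain Python and build the DataFrame only once at the end. The following is a minimal sketch, not a benchmarked solution: `flatten_transcript` is a hypothetical helper name, `sample` is a trimmed stand-in for `j` that keeps only the fields read here, and the code assumes every phrase has a `speaker` and a non-empty `nBest` list.

```python
def flatten_transcript(j):
    """Collapse one transcript dict into a single flat row (hypothetical helper)."""
    row = {key: j[key] for key in ('source', 'durationInTicks', 'duration')}
    speech = {}
    for phrase in j['recognizedPhrases']:
        for candidate in phrase['nBest']:
            # Group each display string under the speaker of its phrase
            speech.setdefault(phrase['speaker'], []).append(candidate['display'])
    for speaker in sorted(speech):
        row[f'recognizedPhrases.speaker{speaker}'] = ' '.join(speech[speaker])
    return row


# Trimmed stand-in for the dictionary in the question (assumption: same shape)
sample = {
    "source": "https://example.com",
    "durationInTicks": 1082400000,
    "duration": "PT1M48.24S",
    "recognizedPhrases": [
        {"speaker": 1, "nBest": [{"display": "¿Gracias por llamar"}]},
        {"speaker": 2, "nBest": [{"display": "quiero hacer un pago"}]},
        {"speaker": 2, "nBest": [{"display": "no sé bien la cantidad"}]},
    ],
}

# One flat dict per file; a real run would loop over the loaded files instead
rows = [flatten_transcript(t) for t in [sample]]
```

Passing `rows` to `pd.DataFrame(rows)` then gives one row per file. Whether this beats calling `json_normalize` per file depends on the data, so it is worth measuring on a real batch.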
Edit 1:
My expected result looks like this:
expected_dict = {'source': {0: 'https://example.com'},
'durationInTicks': {0: 1082400000},
'duration': {0: 'PT1M48.24S'},
'recognizedPhrases.speaker1': {0: '¿Gracias por llamar'},
'recognizedPhrases.speaker2': {0: 'quiero hacer un pago no sé bien la cantidad'}}
expected_df = pd.DataFrame(expected_dict)
You can pivot() into the expected output:
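# Columns that identify the file, the column to spread out, and the values to place
index = ['source', 'durationInTicks', 'duration']
columns = ['recognizedPhrases.speaker']
values = ['speech']
df = df[index + columns + values].pivot(index=index, columns=columns, values=values[0])
# Prefix each speaker number with the original column name, e.g. recognizedPhrases.speaker1
df.columns = [f'{df.columns.name}{column}' for column in df.columns]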
source | durationInTicks | duration | recognizedPhrases.speaker1 | recognizedPhrases.speaker2 |
---|---|---|---|---|
https://example.com | 1082400000 | PT1M48.24S | ¿Gracias por llamar | quiero hacer un pago no sé bien la cantidad |
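To try the pivot approach end to end, here is a minimal sketch on a trimmed copy of the question's dictionary (keeping only the fields the pipeline reads is my simplification). The final `reset_index()` is an addition that is not part of the answer above: after `pivot`, `source`, `durationInTicks` and `duration` are index levels, and resetting turns them back into ordinary columns as in `expected_df`. Passing a list as `index` to `pivot` needs pandas 1.1 or later.

```python
import pandas as pd

# Trimmed copy of the question's input: only the fields this pipeline reads
j = {
    "source": "https://example.com",
    "durationInTicks": 1082400000,
    "duration": "PT1M48.24S",
    "recognizedPhrases": [
        {"speaker": 1, "nBest": [{"display": "¿Gracias por llamar"}]},
        {"speaker": 2, "nBest": [{"display": "quiero hacer un pago"}]},
        {"speaker": 2, "nBest": [{"display": "no sé bien la cantidad"}]},
    ],
}

index = ['source', 'durationInTicks', 'duration']
speaker = 'recognizedPhrases.speaker'

# One row per nBest entry, with file-level fields and the speaker as metadata
df = pd.json_normalize(j, record_path=['recognizedPhrases', 'nBest'],
                       meta=index + [['recognizedPhrases', 'speaker']])

# Join each speaker's phrases, then keep a single row per speaker
df['speech'] = df.groupby(['source', speaker])['display'].transform(lambda x: ' '.join(x))
df = df.drop_duplicates(subset=[speaker])

# Spread the speakers into columns and rename them speaker1, speaker2, ...
wide = df.pivot(index=index, columns=speaker, values='speech')
wide.columns = [f'{wide.columns.name}{column}' for column in wide.columns]

# Addition: move the index levels back into ordinary columns
wide = wide.reset_index()
```

Note that `drop_duplicates` here looks only at the speaker column, which is fine for a single file; if several files were concatenated before this step, `source` would presumably need to be part of `subset` as well.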