Azure transcription JSON to pandas DataFrame
I am trying to convert the output of the Azure Speech-to-Text transcription service (JSON) into a pandas DataFrame.
Here is an example of the JSON I get:
{
  "source": "https://batchtranscriptionstore1.blob.core.windows.net/recordings/20210221-1022043b576ef4.wav?fakecredentials123456789",
  "timestamp": "2020-06-16T09:30:21Z",
  "durationInTicks": 41200000,
  "duration": "PT4.12S",
  "combinedRecognizedPhrases": [
    {
      "channel": 0,
      "lexical": "hello world",
      "itn": "hello world",
      "maskedITN": "hello world",
      "display": "Hello world."
    }
  ],
  "recognizedPhrases": [
    {
      "recognitionStatus": "Success",
      "speaker": 1,
      "channel": 0,
      "offset": "PT0.07S",
      "duration": "PT1.59S",
      "offsetInTicks": 700000,
      "durationInTicks": 15900000,
      "nBest": [
        {
          "confidence": 0.898652852,
          "lexical": "hello world",
          "itn": "hello world",
          "maskedITN": "hello world",
          "display": "Hello world.",
          "words": [
            {
              "word": "hello",
              "offset": "PT0.09S",
              "duration": "PT0.48S",
              "offsetInTicks": 900000,
              "durationInTicks": 4800000,
              "confidence": 0.987572
            },
            {
              "word": "world",
              "offset": "PT0.59S",
              "duration": "PT0.16S",
              "offsetInTicks": 5900000,
              "durationInTicks": 1600000,
              "confidence": 0.906032
            }
          ]
        }
      ]
    }
  ]
}
With the following code, I managed to build a df with these columns: source, timestamp, durationInTicks, duration, combinedRecognizedPhrases
import json
import pandas as pd

with open('file.json') as json_data:
    data = json.load(json_data)
ll = pd.DataFrame(dict(list(data.items())[0:5]))  # first five top-level keys only
But I also need the individual values of "combinedRecognizedPhrases" in separate columns. How can I do that?
Try pd.json_normalize() with record_path, then join:
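import json
import pandas as pd

with open('file.json', 'r') as f:
    j = json.load(f)
# Top-level fields in a single row
df = pd.json_normalize(j, max_level=1)
# One row per combinedRecognizedPhrases entry
df1 = pd.json_normalize(j, max_level=1, record_path=['combinedRecognizedPhrases'])
df2 = df[['source', 'timestamp', 'durationInTicks', 'duration']].join(df1)  # joined on index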
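To try the join approach without a file on disk, here is a minimal sketch that uses an inline dict in place of `file.json`. The short `source` value and the second-channel entry are made up for illustration and are not from the question. It also shows one caveat: `join` aligns on the index and defaults to a left join, so when the top-level frame has a single row, only the first phrase row is kept.

```python
import pandas as pd

# Trimmed stand-in for the transcription output; "recording.wav" is a placeholder.
sample = {
    "source": "recording.wav",
    "timestamp": "2020-06-16T09:30:21Z",
    "durationInTicks": 41200000,
    "duration": "PT4.12S",
    "combinedRecognizedPhrases": [
        {"channel": 0, "lexical": "hello world", "display": "Hello world."}
    ],
}

top_cols = ["source", "timestamp", "durationInTicks", "duration"]

# Top-level fields: a single row.
top = pd.json_normalize(sample, max_level=1)
# One row per entry in combinedRecognizedPhrases.
phrases = pd.json_normalize(sample, record_path=["combinedRecognizedPhrases"])
# Index-aligned left join: row 0 of `top` meets row 0 of `phrases`.
joined = top[top_cols].join(phrases)
print(joined.columns.tolist())

# Hypothetical two-channel variant: a second phrase entry is appended.
stereo = dict(sample)
stereo["combinedRecognizedPhrases"] = sample["combinedRecognizedPhrases"] + [
    {"channel": 1, "lexical": "good morning", "display": "Good morning."}
]
top2 = pd.json_normalize(stereo, max_level=1)
phrases2 = pd.json_normalize(stereo, record_path=["combinedRecognizedPhrases"])
joined2 = top2[top_cols].join(phrases2)
# `phrases2` has two rows, but the left join keeps only the index present in `top2`.
print(len(phrases2), len(joined2))
```

If a file can contain more than one combined phrase, passing the top-level fields through the `meta` argument of `pd.json_normalize` is one way to keep every phrase row.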
Based on the answer suggested by @Manakin and the following [link][1], I came up with this solution:
import json
import pandas as pd

with open('file.json', 'r') as f:
    j = json.load(f)
zz = pd.json_normalize(j, record_path=['combinedRecognizedPhrases'], meta=['source', 'durationInTicks', 'duration'])
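The `record_path` plus `meta` call can be checked on an inline dict, as in the rough sketch below. The sample is a trimmed copy of the question's JSON with a placeholder `source` value. The first call also lists `timestamp` in `meta`, which the call above leaves out. The second call goes beyond what was asked and is only an assumption about what might be useful next: it uses a nested `record_path` to get one row per recognized word.

```python
import pandas as pd

# Trimmed copy of the question's JSON; "recording.wav" is a placeholder.
sample = {
    "source": "recording.wav",
    "timestamp": "2020-06-16T09:30:21Z",
    "durationInTicks": 41200000,
    "duration": "PT4.12S",
    "combinedRecognizedPhrases": [
        {"channel": 0, "lexical": "hello world", "display": "Hello world."}
    ],
    "recognizedPhrases": [
        {
            "speaker": 1,
            "channel": 0,
            "nBest": [
                {
                    "confidence": 0.898652852,
                    "words": [
                        {"word": "hello", "offset": "PT0.09S", "confidence": 0.987572},
                        {"word": "world", "offset": "PT0.59S", "confidence": 0.906032},
                    ],
                }
            ],
        }
    ],
}

# One row per combined phrase; top-level fields are repeated as meta columns.
phrases = pd.json_normalize(
    sample,
    record_path=["combinedRecognizedPhrases"],
    meta=["source", "timestamp", "durationInTicks", "duration"],
)
print(phrases.columns.tolist())

# One row per word: record_path walks down the nesting, and a meta entry
# written as a list names a field found along that path.
words = pd.json_normalize(
    sample,
    record_path=["recognizedPhrases", "nBest", "words"],
    meta=["source", ["recognizedPhrases", "speaker"]],
)
print(words.columns.tolist())
```

Meta names must not collide with the record columns. In the full JSON each word has its own `duration` and `durationInTicks`, so adding the top-level `duration` to `meta` in the word-level call would need a `meta_prefix` to tell them apart.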
[1]: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd