Data preprocessing of JSON data with Python (Jupyter notebook)
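I am trying to run some preprocessing commands on a JSON dataset. With .csv files this is easy, but I can't figure out how to apply preprocessing commands such as isnull(), fillna(), dropna(), and the Imputer class.
Below are the commands I have run so far. I could not perform the operations above because I can't figure out how to work with a dataset loaded from a JSON file.
Dataset link: https://drive.google.com/file/d/1puNNrRaV-Jt_kt709fuYGCvDW9-EuwoB/view?usp=sharing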
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
dataset = pd.read_json('moviereviews.json', orient='columns')
print(dataset)
movies = pd.read_json( ( dataset).to_json(), orient='index')
print(movies)
print(type(movies))
movie = pd.read_json( ( dataset['12 Strong']).to_json(), orient='index')
print(movie)
movie_name = [
"12 Strong",
"A Ciambra",
"All The Money In The World",
"Along With The Gods: The Two Worlds",
"Bilal: A New Breed Of Hero",
"Call Me By Your Name",
"Condorito: La Película",
"Darkest Hour",
"Den Of Thieves",
"Downsizing",
"Father Figures",
"Film Stars Don'T Die In Liverpool",
"Forever My Girl",
"Happy End",
"Hostiles",
"I, Tonya",
"In The Fade (Aus Dem Nichts)",
"Insidious: The Last Key",
"Jumanji: Welcome To The Jungle",
"Mary And The Witch'S Flower",
"Maze Runner: The Death Cure",
"Molly'S Game",
"Paddington 2",
"Padmaavat",
"Phantom Thread",
"Pitch Perfect 3",
"Proud Mary",
"Star Wars: Episode Viii - The Last Jedi",
"Star Wars: The Last Jedi",
"The Cage Fighter",
"The Commuter",
"The Final Year",
"The Greatest Showman",
"The Insult (L'Insulte)",
"The Post",
"The Shape Of Water",
"Una Mujer Fantástica",
"Winchester"
]
print(movie_name)
data = []
for moviename in movie_name:
    movie = pd.read_json( ( dataset[moviename]).to_json(), orient='index')
    data.append(movie)
print(data)
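One of your challenges with this dataset is that it uses different key names for the same data, such as 'Tomato Score' and 'tomatoscore'. The solution below is not the best and could be optimized a lot, but I wrote it this way so that the steps taken to make the data consistent are easy to follow: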
import json
import pandas as pd

with open('moviereviews.json', "r") as read_file:
    dataset = json.load(read_file)

data = []
# dataset is a list of dicts; each dict maps a movie name to its fields
for index in range(len(dataset)):
    for key in dataset[index]:
        movie_name = key
        if 'Genre' in dataset[index][key]:
            genre = dataset[index][key]['Genre']
        else:
            genre = None
        if 'Gross' in dataset[index][key]:
            gross = dataset[index][key]['Gross']
        else:
            gross = None
        if 'IMDB Metascore' in dataset[index][key]:
            imdb = dataset[index][key]['IMDB Metascore']
        else:
            imdb = None
        # The next three fields appear under two different key spellings
        if 'Popcorn Score' in dataset[index][key]:
            popcorn = dataset[index][key]['Popcorn Score']
        elif 'popcornscore' in dataset[index][key]:
            popcorn = dataset[index][key]['popcornscore']
        else:
            popcorn = None
        if 'Rating' in dataset[index][key]:
            rating = dataset[index][key]['Rating']
        elif 'rating' in dataset[index][key]:
            rating = dataset[index][key]['rating']
        else:
            rating = None
        if 'Tomato Score' in dataset[index][key]:
            tomato = dataset[index][key]['Tomato Score']
        elif 'tomatoscore' in dataset[index][key]:
            tomato = dataset[index][key]['tomatoscore']
        else:
            tomato = None
        # One row per movie, always with the same set of columns
        data.append({'Movie Name': movie_name,
                     'Genre': genre,
                     'Gross': gross,
                     'IMDB Metascore': imdb,
                     'Popcorn Score': popcorn,
                     'Rating': rating,
                     'Tomato Score': tomato})
df = pd.DataFrame(data)
df
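The same key-unifying step can probably be written more compactly with an alias map. The following is only a sketch: the `sample` data, the `ALIASES` and `COLUMNS` names, and the `normalize` helper are all hypothetical, and the sample merely imitates the list-of-dicts layout that the code above assumes. Unlike the if/elif chain, it does not prefer one spelling over the other if both happen to be present for the same movie.

```python
import pandas as pd

# Hypothetical sample imitating the assumed layout:
# a list of dicts, each mapping a movie name to its fields.
sample = [{
    "12 Strong": {"Genre": "Action", "IMDB Metascore": "54",
                  "Popcorn Score": 72, "Rating": "R", "Tomato Score": 54},
    "All The Money In The World": {"popcornscore": 72, "rating": "R",
                                   "tomatoscore": 76},
}]

# Map each alternative key spelling to the canonical column name.
ALIASES = {"popcornscore": "Popcorn Score",
           "rating": "Rating",
           "tomatoscore": "Tomato Score"}
COLUMNS = ["Genre", "Gross", "IMDB Metascore",
           "Popcorn Score", "Rating", "Tomato Score"]

def normalize(dataset):
    """Build one row per movie, renaming aliased keys to canonical names."""
    rows = []
    for block in dataset:
        for name, fields in block.items():
            row = {"Movie Name": name}
            row.update({ALIASES.get(k, k): v for k, v in fields.items()})
            rows.append(row)
    # reindex guarantees every expected column exists (missing ones become NaN)
    return pd.DataFrame(rows).reindex(columns=["Movie Name"] + COLUMNS)

df_norm = normalize(sample)
print(df_norm)
```

Fields that a movie lacks come out as NaN rather than None, which is what isnull(), fillna() and dropna() look for.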
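You can split the items in the dictionary and read them separately, filling every NaN with the string 'None' in one pass.
If your loaded JSON is called data, then: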
df = pd.DataFrame(data[0].values()).fillna('None')
df['Movie Name'] = pd.DataFrame(data[0].keys())
df.set_index('Movie Name', inplace=True)
df.head()
                                     Genre      Gross     IMDB Metascore  Popcorn Score  Rating   Tomato Score  popcornscore  rating  tomatoscore
Movie Name
12 Strong                            Action     ,465,000  54              72             R        54            None          None    None
A Ciambra                            Drama      unknown   70              unknown        unrated  unkown        None          None    None
All The Money In The World           None       None      None            None           None     None          72.0          R       76.0
Along With The Gods: The Two Worlds  None       None      None            None           None     None          90.0          NR      50.0
Bilal: A New Breed Of Hero           Animation  unknown   52              unknown        unrated  unkown        None          None    None
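Once the data is in a DataFrame, the usual pandas calls from the question apply. One caveat visible in the output above is that missing values appear as placeholder strings ('unknown', 'unkown', 'None'), which isnull() does not count as missing. A minimal sketch, assuming a small hand-made frame `df_demo` whose values are illustrative only (the other names here are hypothetical too):

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the output above; values are illustrative.
df_demo = pd.DataFrame(
    {
        "Genre": ["Action", "Drama", "Animation"],
        "IMDB Metascore": ["54", "70", "52"],
        "Popcorn Score": ["72", "unknown", "unknown"],
        "Tomato Score": ["54", "unkown", "unkown"],
    },
    index=pd.Index(
        ["12 Strong", "A Ciambra", "Bilal: A New Breed Of Hero"],
        name="Movie Name",
    ),
)

# Turn the placeholder strings into real missing values.
clean = df_demo.replace(["unknown", "unkown", "None"], np.nan)

# Convert score columns to numbers; anything unparseable becomes NaN.
score_cols = ["IMDB Metascore", "Popcorn Score", "Tomato Score"]
clean[score_cols] = clean[score_cols].apply(pd.to_numeric, errors="coerce")

# isnull(): count missing values per column.
missing_per_column = clean.isnull().sum()

# fillna(): fill one column with its median (a simple form of imputation).
filled = clean.fillna({"Popcorn Score": clean["Popcorn Score"].median()})

# dropna(): keep only rows with no missing values at all.
complete_rows = clean.dropna()

print(missing_per_column)
```

Filling with the column median is only one possible choice; which strategy is appropriate depends on the data.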