Cleaning data in Pandas
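Background
I am loading data into Pandas from csv/xlsx files created by a text-to-data application. The automated reading saves time and is mostly accurate.
Below is a simplified example that illustrates the specific problem I am struggling to sort out: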
import pandas as pd
from tabulate import tabulate
df_is = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,"",""],
"2022":[125,-55,70,-15,-10,45,-10,25],
"":["","","","","","",-15,30]}
df_want = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,-10,25],
"2022":[125,-55,70,-15,-10,45,-15,30]}
print(tabulate(df_is))
print()
print(tabulate(df_want))
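Problem
As running the code shows, the application did not read the data in the first table correctly: the last two data points of the second and third columns ended up in the third and last columns, respectively.
The second table shows how I want the data to look. The real problem is more complex and more general, so a solution that overwrites values locally is not feasible. Something like what I would do in Excel, deleting the empty cells in the second column and shifting all other data in the row left/right (depending on the task), would be great.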
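What I tried
As a newbie, I have tried searching for solutions, but none of my search terms seemed to lead to a relevant one.
I also used df.iloc to create variables for the four misplaced data cells and then tried to append them to columns 1 and 2. That only added two rows that were copies of the last one.
Any advice would be much appreciated!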
Versions
conda 4.11.0
Python 3.9.7
Pandas 1.3.4
Please try this:
import pandas as pd
import numpy as np
f_is = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,"",""],
"2022":[125,-55,70,-15,-10,45,-10,25],
"":["","","","","","",-15,30]}
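input_df = pd.DataFrame(f_is)
# Transpose so each original row becomes a column, turn '' into NaN,
# drop the NaNs so the remaining values pack together, then transpose back.
output_df = input_df.T.replace('', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy())).T
output_df.columns = ['Var','2021','2022']
output_df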
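The one-liner above hard-codes the final column names, which may not suit the more general case described in the question. Below is a minimal sketch of the same "delete empty cells and shift left" idea as a reusable function. It assumes that empty cells are either '' or NaN, that the leftmost column labels are the ones to keep, and that columns left completely empty after the shift can be dropped. shift_left is a hypothetical helper name used for illustration, not a Pandas function.

```python
import numpy as np
import pandas as pd


def shift_left(df, empty=""):
    """Pack each row's non-empty cells to the left (hypothetical helper).

    Assumes a cell is empty if it is NaN or equal to `empty`.
    """
    width = df.shape[1]
    rows = []
    # name=None yields plain tuples, so labels like "" or "2021" cause no trouble.
    for values in df.itertuples(index=False, name=None):
        kept = [v for v in values if not (pd.isna(v) or v == empty)]
        # Pad on the right so every row keeps the original width.
        rows.append(kept + [np.nan] * (width - len(kept)))
    out = pd.DataFrame(rows, index=df.index, columns=df.columns)
    # Drop columns that are entirely empty after the shift.
    return out.dropna(axis=1, how="all")


raw = pd.DataFrame({
    "Var": ["Sales", "Gogs", "Op prof", "Depreciation", "Net fin", "PBT", "Tax", "PAT"],
    "2021": [100, -50, 50, -10, -5, 35, "", ""],
    "2022": [125, -55, 70, -15, -10, 45, -10, 25],
    "": ["", "", "", "", "", "", -15, 30],
})
cleaned = shift_left(raw)
print(cleaned)
```

Shifting right instead would only require putting the padding before the kept values, although the column labels that survive would then need to be reviewed for the task at hand.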