Cleaning data in pandas

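Background: I load data into pandas from csv/xlsx files created by a text-to-data application. The automated reading saves time and is very accurate overall. Below I have simplified one load to illustrate a specific problem that I am struggling to sort out: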

import pandas as pd
from tabulate import tabulate

df_is = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,"",""],
"2022":[125,-55,70,-15,-10,45,-10,25],
"":["","","","","","",-15,30]}

df_want = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,-10,25],
"2022":[125,-55,70,-15,-10,45,-15,30]}

print(tabulate(df_is))
print()
print(tabulate(df_want))

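Problem: As running the code shows, the data in the first table was not read correctly by the application. The last two data points of the second and third columns ended up in the third and last columns, respectively.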

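The second table shows how I would like the data to appear. The real problem is more complex and more general, so a solution that overwrites individual values is not feasible. A solution like the one in Excel would be great: there I would delete the empty cells in the second column and shift all the other data in the row left or right (depending on the task).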

Tried: As a newbie, I have searched for solutions, but none of my search terms seem to lead to a relevant one.

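I also used df.iloc to create variables for the four misplaced data cells and then tried to append them to columns 1 and 2. That only added two rows containing copies of the last one.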

Any suggestions would be much appreciated!

Versions: Conda 4.11.0, Python 3.9.7, pandas 1.3.4

Please try this:

import pandas as pd
import numpy as np
f_is = {"Var":["Sales","Gogs","Op prof","Depreciation","Net fin","PBT","Tax","PAT"],
"2021":[100,-50,50,-10,-5,35,"",""],
"2022":[125,-55,70,-15,-10,45,-10,25],
"":["","","","","","",-15,30]}
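input_df = pd.DataFrame(f_is)

# Transpose so that each original row becomes a column, turn empty strings
# into NaN, drop the NaNs in each column (which packs the remaining values
# together), then transpose back to the original orientation.
output_df = input_df.T.replace('', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy())).T
# The trailing unnamed column is gone, so restore the three real headers.
output_df.columns = ['Var','2021','2022']
output_df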
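
If you would rather shift the values row by row without transposing, the same idea can be sketched as a small helper. This is only a minimal sketch: shift_left is a hypothetical name (not a pandas function), the three-row sample frame is a shortened stand-in for your data, and it assumes that empty cells are empty strings and that every row has the same number of non-empty values. The numeric columns come back as object dtype, so they are converted with pd.to_numeric at the end.

```python
import numpy as np
import pandas as pd


def shift_left(df, empty=""):
    """Hypothetical helper: drop empty cells in each row and pack the rest to the left."""
    cleaned = df.replace(empty, np.nan)
    # For every row, keep only the non-missing values and renumber them 0, 1, 2, ...
    packed = cleaned.apply(lambda row: pd.Series(row.dropna().to_numpy()), axis=1)
    # Reuse the original column labels for as many columns as survive.
    packed.columns = df.columns[: packed.shape[1]]
    return packed


# Shortened sample with the same kind of misalignment as in the question.
raw = pd.DataFrame({
    "Var": ["Sales", "Tax", "PAT"],
    "2021": [100, "", ""],
    "2022": [125, -10, 25],
    "": ["", -15, 30],
})

result = shift_left(raw)
# The packed values are stored as generic objects; convert the year columns back to numbers.
result[["2021", "2022"]] = result[["2021", "2022"]].apply(pd.to_numeric)
print(result)
```

Note that this always packs values to the left. If some rows need to be shifted right instead, the helper would need a different rule for where the surviving values go.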