Pandas: How can I preserve row ordering after subtraction?

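I have two data frames with three columns each and identical column names. I want to subtract the values in the third column wherever the values in the first and second columns match. I tried the following: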

import pandas as pd

# Common column names
columns = ["month", "category", "sum"]

# First data frame
data1 = [("jan", "j", 10), ("feb", "f", 20)]
df1 = pd.DataFrame.from_records(data1, columns=columns)

# Second data frame
data2 = [("jan", "j", 9.5), ("mar", "m", 30)]
df2 = pd.DataFrame.from_records(data2, columns=columns)

print(df1)  # Observe order of `month`s: jan, feb
print(df2)  # Observe order of `month`s: jan, mar

# Subtract `sum` where `month`, and `category` match:
df1.set_index(["month", "category"]).subtract(df2.set_index(["month", "category"])).reset_index()

This produces the following output. Observe that the rows are sorted alphabetically on `month`:

  month category  sum
0   feb        f  NaN
1   jan        j  0.5
2   mar        m  NaN

How can I preserve the row order of the left-hand operand? That is, how can I get the following output (or something similar):

  month category  sum
1   jan        j  0.5
0   feb        f  NaN
2   mar        m  NaN

You can make the column categorical and specify whatever order you see fit:

df1['month'] = pd.Categorical(df1['month'], categories=['jan', 'feb', 'mar'], ordered=True)
df2['month'] = pd.Categorical(df2['month'], categories=['jan', 'feb', 'mar'], ordered=True)

# Subtract `sum` where `month`, and `category` match:
res = df1.set_index(["month", "category"]).subtract(df2.set_index(["month", "category"])).reset_index()
print(res)

Output:

  month category  sum
0   jan        j  0.5
1   feb        f  NaN
2   mar        m  NaN
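
As a variant of the categorical approach above, the ordering can also be applied to the result after the subtraction, so it does not depend on how the indexes are aligned. This is only a sketch: the list `month_order` is an assumed desired order, not something the data provides.

```python
import pandas as pd

columns = ["month", "category", "sum"]
df1 = pd.DataFrame.from_records([("jan", "j", 10), ("feb", "f", 20)], columns=columns)
df2 = pd.DataFrame.from_records([("jan", "j", 9.5), ("mar", "m", 30)], columns=columns)

keys = ["month", "category"]
month_order = ["jan", "feb", "mar"]  # assumption: the order you want to see

# Subtract first; the aligned result may come back in any row order.
res = df1.set_index(keys).subtract(df2.set_index(keys)).reset_index()

# Turn `month` into an ordered categorical on the result, then sort by it.
res["month"] = pd.Categorical(res["month"], categories=month_order, ordered=True)
res = res.sort_values("month").reset_index(drop=True)
print(res)
```

Because the sort happens on the final frame, `df1` and `df2` themselves are left unchanged.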

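pd.merge will keep the order of the left operand, after which you can compute the difference between the two columns. Note that the pandas documentation describes how="outer" as sorting the join keys lexicographically, so the row order you get may depend on your pandas version. For example, you could do: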

df3 = pd.merge(df1, df2, on=["month", "category"], how="outer")
df3.loc[:, "difference"] = df3["sum_x"] - df3["sum_y"]

Result with your data:

  month category  sum_x  sum_y  difference
0   jan        j   10.0    9.5         0.5
1   feb        f   20.0    NaN         NaN
2   mar        m    NaN   30.0         NaN
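
If the merged rows do not come back in the left operand's order on your pandas version, one possible workaround is to carry an explicit position column through the merge and sort on it afterwards. The sketch below assumes a helper column named `_pos`, which is purely hypothetical and is dropped at the end.

```python
import pandas as pd

columns = ["month", "category", "sum"]
df1 = pd.DataFrame.from_records([("jan", "j", 10), ("feb", "f", 20)], columns=columns)
df2 = pd.DataFrame.from_records([("jan", "j", 9.5), ("mar", "m", 30)], columns=columns)

keys = ["month", "category"]

# `_pos` (hypothetical helper) records each left row's original position.
left = df1.assign(_pos=range(len(df1)))
merged = pd.merge(left, df2, on=keys, how="outer")

# Rows that exist only in df2 have NaN in `_pos`, so they are placed last.
merged = merged.sort_values("_pos", na_position="last", kind="stable")

merged["difference"] = merged["sum_x"] - merged["sum_y"]
result = merged.drop(columns="_pos").reset_index(drop=True)
print(result)
```

Rows found only in the right operand end up after all left rows.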

Try this:

df1.sort_index(inplace=True)

This simply forces the data frame to be sorted by its index. More documentation is available here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html

Since pandas version 1.1.0, sort_values accepts a key argument. You can use that argument to pass the desired order:

order = {"jan": 0, "feb": 1, "mar": 2}
df1.set_index(["month", "category"]).subtract(df2.set_index(["month", "category"])).reset_index().sort_values(by=['month'], key=lambda x: x.map(order))

Output:

    month   category    sum
1     jan          j    0.5
0     feb          f    NaN
2     mar          m    NaN
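
Rather than hard-coding the `order` mapping, it could be derived from the data. The sketch below assumes the desired order is "months in order of first appearance, left operand first", and it needs pandas 1.1.0 or later for the `key` argument.

```python
import pandas as pd

columns = ["month", "category", "sum"]
df1 = pd.DataFrame.from_records([("jan", "j", 10), ("feb", "f", 20)], columns=columns)
df2 = pd.DataFrame.from_records([("jan", "j", 9.5), ("mar", "m", 30)], columns=columns)

keys = ["month", "category"]

# Months in order of first appearance: df1's months, then any new ones from df2.
all_months = pd.concat([df1["month"], df2["month"]], ignore_index=True)
order = {month: pos for pos, month in enumerate(all_months.unique())}

res = (
    df1.set_index(keys)
    .subtract(df2.set_index(keys))
    .reset_index()
    # Map each month to its position and sort on that instead of the label.
    .sort_values(by="month", key=lambda s: s.map(order), kind="stable")
)
print(res)
```

With the sample data this builds the same mapping as the hand-written one: jan, feb, mar.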