Average DataFrame price across rows up to a specific date

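Basically, I have a DataFrame with a lot of real estate data. Every day, new data is added for each property; the most important fields are its price, the location the property is in, and the date the property was added to the DataFrame. For each location, I want to compute how the price develops day by day. I get my DataFrame from a database like this: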

import pandas as pd

# `conn` is an already open database connection (not shown here)
data1 = pd.read_sql_query(
     "SELECT REAL_ESTATE.UNIQUE_RE_NUMBER, REAL_ESTATE.TYP_ID, ADDRESS.ADDRSS, ADDRESS.LOCATION, PRICE.RE_PRICE, MAX(PRICE.UPDATE_DATE) AS UPDATE_DATE, HOUSEINFO.RE_POLOHA, HOUSEINFO.RE_DRUH, HOUSEINFO.RE_TYP, HOUSEINFO.RE_UPLOCHA "
     "FROM REAL_ESTATE INNER JOIN ADDRESS, PRICE, HOUSEINFO ON REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=HOUSEINFO.INF_ID GROUP BY REAL_ESTATE.ID ",
     conn)

data2 = pd.read_sql_query(
     "SELECT REAL_ESTATE.UNIQUE_RE_NUMBER, REAL_ESTATE.TYP_ID, ADDRESS.ADDRSS, ADDRESS.LOCATION, PRICE.RE_PRICE, MAX(PRICE.UPDATE_DATE) AS UPDATE_DATE, FLATINFO.RE_DISPOZICE, FLATINFO.RE_DRUH, FLATINFO.RE_PPLOCHA "
     "FROM REAL_ESTATE INNER JOIN ADDRESS, PRICE, FLATINFO ON REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=FLATINFO.INF_ID GROUP BY REAL_ESTATE.ID ",
     conn)

data3 = pd.read_sql_query(
     "SELECT REAL_ESTATE.UNIQUE_RE_NUMBER, REAL_ESTATE.TYP_ID, ADDRESS.ADDRSS, ADDRESS.LOCATION, PRICE.RE_PRICE, MAX(PRICE.UPDATE_DATE) AS UPDATE_DATE, LANDINFO.RE_PLOCHA, LANDINFO.RE_DRUH, LANDINFO.RE_SITE, LANDINFO.RE_KOMUNIKACE "
     "FROM REAL_ESTATE INNER JOIN ADDRESS, PRICE, LANDINFO ON REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=LANDINFO.INF_ID GROUP BY REAL_ESTATE.ID ",
     conn)

df = [data1, data2, data3]

dff = pd.concat(df)
dff = dff.reset_index(drop=True)

To compute the average, I use this command:

dff['LOC_DATE_AVG'] = dff.groupby(['LOCATION', 'UPDATE_DATE'])['RE_PRICE'].transform('mean')

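This only gives the average price of the properties added on each day, but I want the overall average of all properties added up to a given date. So with data from 1.1.2021, 2.1.2021, and 3.1.2021, the average as of 2.1.2021 should be computed over the data from 1.1.2021 and 2.1.2021. Is that possible?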

It's a bit hard to say without an MRE (see also here). Please add one.

You could try:

dff["UPDATE_DATE"] = pd.to_datetime(dff["UPDATE_DATE"])  # Just to make sure
result = (dff[dff["UPDATE_DATE"] <= pd.Timestamp(year=2021, month=1, day=2)]
          .groupby("LOCATION")["RE_PRICE"]
          .mean())
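To try the filter-then-group idea without the database, here is a minimal sketch on made-up data. The column names come from the question; the frame `sample`, the helper `mean_price_until`, and the prices are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data standing in for the real `dff`
sample = pd.DataFrame(
    {
        "LOCATION": ["A", "A", "A", "B", "B"],
        "UPDATE_DATE": ["2021-01-01", "2021-01-02", "2021-01-03",
                        "2021-01-01", "2021-01-02"],
        "RE_PRICE": [100, 200, 300, 10, 30],
    }
)
sample["UPDATE_DATE"] = pd.to_datetime(sample["UPDATE_DATE"])


def mean_price_until(frame, cutoff):
    """Mean RE_PRICE per LOCATION over rows dated on or before `cutoff`."""
    mask = frame["UPDATE_DATE"] <= pd.Timestamp(cutoff)
    return frame.loc[mask].groupby("LOCATION")["RE_PRICE"].mean()


# Average per location over everything added up to 2 January 2021
until_jan_2 = mean_price_until(sample, "2021-01-02")
print(until_jan_2)
```

Wrapping the filter in a function makes it easy to ask for a different cutoff date, but it still answers the question for one date at a time.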

Regarding your comment: with a sample DataFrame (an MRE :))

df = pd.DataFrame(
    {
        "LOCATION": ["A", "A", "A", "B", "B"],
        "UPDATE_DATE": ["2021-01-01", "2021-01-02", "2021-01-03",
                        "2021-01-01", "2021-01-02"],
        "RE_PRICE": [1, 2, 3, 1, 2]
    }
)
df["UPDATE_DATE"] = pd.to_datetime(df["UPDATE_DATE"])

which looks like this:

  LOCATION UPDATE_DATE  RE_PRICE
0        A  2021-01-01         1
1        A  2021-01-02         2
2        A  2021-01-03         3
3        B  2021-01-01         1
4        B  2021-01-02         2

this code

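def cum_mean(sdf):
    # For every row of the group, average all prices dated on or before
    # that row's date.
    return pd.DataFrame(
               sdf.query("UPDATE_DATE <= @day")["RE_PRICE"].mean()
               for day in sdf['UPDATE_DATE'].values
           )

# The result is assigned back by position, so this assumes the rows of df
# are already ordered by LOCATION and df has a default 0..n-1 index.
df["CUM_MEAN"] = df.groupby("LOCATION").apply(cum_mean).reset_index(drop=True)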

produces

  LOCATION UPDATE_DATE  RE_PRICE  CUM_MEAN
0        A  2021-01-01         1       1.0
1        A  2021-01-02         2       1.5
2        A  2021-01-03         3       2.0
3        B  2021-01-01         1       1.0
4        B  2021-01-02         2       1.5

If the UPDATE_DATE column is sorted in ascending order (sorted within each group is enough), you could also do

grouped = df.groupby("LOCATION")
df["CUM_MEAN"] = grouped["RE_PRICE"].cumsum() / (grouped.cumcount() + 1)

This is probably faster than the other version.
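One caveat worth checking against your data: the cumsum/cumcount version computes a running mean row by row, so when several properties in one location share the same UPDATE_DATE, each of those rows gets a different value. If every row should instead carry the average of everything added up to and including its date, one possible approach is to aggregate per date first. The following is a minimal sketch under that assumption; the names `listings` and `daily` and the sample prices are made up for illustration.

```python
import pandas as pd

# Hypothetical data: unsorted, with several listings per location and date
listings = pd.DataFrame(
    {
        "LOCATION": ["A", "A", "A", "B", "B"],
        "UPDATE_DATE": ["2021-01-02", "2021-01-01", "2021-01-02",
                        "2021-01-01", "2021-01-01"],
        "RE_PRICE": [2, 1, 6, 4, 8],
    }
)
listings["UPDATE_DATE"] = pd.to_datetime(listings["UPDATE_DATE"])

# Total price and number of listings per (location, date), sorted by date
daily = (
    listings.groupby(["LOCATION", "UPDATE_DATE"])["RE_PRICE"]
    .agg(["sum", "count"])
    .sort_index()
)

# Running totals within each location, then running mean = total / count
running = daily.groupby(level="LOCATION").cumsum()
daily["CUM_MEAN"] = running["sum"] / running["count"]

# Attach the per-date value back to every original row (row order is kept)
listings = listings.merge(
    daily["CUM_MEAN"].reset_index(),
    on=["LOCATION", "UPDATE_DATE"],
    how="left",
)
print(listings)
```

Because the aggregation sorts by date itself, this sketch does not depend on the row order of the input, and `daily` doubles as a compact per-location, per-date table of the price development.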