如何使用 PySpark 或 pandas 旋转列以便它们变成行?
How to pivot columns so they turn into rows using PySpark or pandas?
我有一个看起来像下面的数据框,但有数百行。我需要旋转它,以便 Region
之后的每一列都是一行,就像下面的另一列 table 一样。
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|city |city_tier | city_classification | Region | Jan-2022-orders | Feb-2022-orders | Mar-2022-orders|
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|new york | large | alpha | NE | 100000 |195000 | 237000 |
|los angeles | large | alpha | W | 330000 |400000 | 580000 |
我需要使用 PySpark 对它进行旋转,所以我得到了这样的结果:
+--------------+----------+---------------------+----------+-----------+---------+
|city |city_tier | city_classification | Region | month | orders |
+--------------+----------+---------------------+----------+-----------+---------+
|new york | large | alpha | NE | Jan-2022 | 100000 |
|new york | large | alpha | NE | Fev-2022 | 195000 |
|new york | large | alpha | NE | Mar-2022 | 237000 |
|los angeles | large | alpha | W | Jan-2022 | 330000 |
|los angeles | large | alpha | W | Fev-2022 | 400000 |
|los angeles | large | alpha | W | Mar-2022 | 580000 |
P.S.: 使用 pandas 的解决方案也可以。
在 PySpark 中,您当前的示例:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('new york', 'large', 'alpha', 'NE', 100000, 195000, 237000),
('los angeles', 'large', 'alpha', 'W', 330000, 400000, 580000)],
['city', 'city_tier', 'city_classification', 'Region', 'Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders']
)
df2 = df.select(
'city', 'city_tier', 'city_classification', 'Region',
F.expr("stack(3, 'Jan-2022', `Jan-2022-orders`, 'Fev-2022', `Feb-2022-orders`, 'Mar-2022', `Mar-2022-orders`) as (month, orders)")
)
df2.show()
# +-----------+---------+-------------------+------+--------+------+
# | city|city_tier|city_classification|Region| month|orders|
# +-----------+---------+-------------------+------+--------+------+
# | new york| large| alpha| NE|Jan-2022|100000|
# | new york| large| alpha| NE|Fev-2022|195000|
# | new york| large| alpha| NE|Mar-2022|237000|
# |los angeles| large| alpha| W|Jan-2022|330000|
# |los angeles| large| alpha| W|Fev-2022|400000|
# |los angeles| large| alpha| W|Mar-2022|580000|
# +-----------+---------+-------------------+------+--------+------+
启用它的功能是stack
. It does not have a dataframe API, so you need to use expr
访问它。
顺便说一句,这不是旋转,而是相反 - 不旋转。
在pandas中:
df.melt(df.columns[:4], var_name = 'month', value_name = 'orders')
city city_tier city_classification Region month orders
0 york large alpha NE Jan-2022-orders 100000
1 angeles large alpha W Jan-2022-orders 330000
2 york large alpha NE Feb-2022-orders 195000
3 angeles large alpha W Feb-2022-orders 400000
4 york large alpha NE Mar-2022-orders 237000
5 angeles large alpha W Mar-2022-orders 580000
甚至
df.melt(['city', 'city_tier', 'city_classification', 'Region'],
var_name = 'month', value_name = 'orders')
city city_tier city_classification Region month orders
0 york large alpha NE Jan-2022-orders 100000
1 angeles large alpha W Jan-2022-orders 330000
2 york large alpha NE Feb-2022-orders 195000
3 angeles large alpha W Feb-2022-orders 400000
4 york large alpha NE Mar-2022-orders 237000
5 angeles large alpha W Mar-2022-orders 580000
我有一个看起来像下面的数据框,但有数百行。我需要旋转它,以便 Region
之后的每一列都是一行,就像下面的另一列 table 一样。
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|city |city_tier | city_classification | Region | Jan-2022-orders | Feb-2022-orders | Mar-2022-orders|
+--------------+----------+---------------------+----------+------------------+------------------+-----------------+
|new york | large | alpha | NE | 100000 |195000 | 237000 |
|los angeles | large | alpha | W | 330000 |400000 | 580000 |
我需要使用 PySpark 对它进行旋转,所以我得到了这样的结果:
+--------------+----------+---------------------+----------+-----------+---------+
|city |city_tier | city_classification | Region | month | orders |
+--------------+----------+---------------------+----------+-----------+---------+
|new york | large | alpha | NE | Jan-2022 | 100000 |
|new york | large | alpha | NE | Fev-2022 | 195000 |
|new york | large | alpha | NE | Mar-2022 | 237000 |
|los angeles | large | alpha | W | Jan-2022 | 330000 |
|los angeles | large | alpha | W | Fev-2022 | 400000 |
|los angeles | large | alpha | W | Mar-2022 | 580000 |
P.S.: 使用 pandas 的解决方案也可以。
在 PySpark 中,您当前的示例:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('new york', 'large', 'alpha', 'NE', 100000, 195000, 237000),
('los angeles', 'large', 'alpha', 'W', 330000, 400000, 580000)],
['city', 'city_tier', 'city_classification', 'Region', 'Jan-2022-orders', 'Feb-2022-orders', 'Mar-2022-orders']
)
df2 = df.select(
'city', 'city_tier', 'city_classification', 'Region',
F.expr("stack(3, 'Jan-2022', `Jan-2022-orders`, 'Fev-2022', `Feb-2022-orders`, 'Mar-2022', `Mar-2022-orders`) as (month, orders)")
)
df2.show()
# +-----------+---------+-------------------+------+--------+------+
# | city|city_tier|city_classification|Region| month|orders|
# +-----------+---------+-------------------+------+--------+------+
# | new york| large| alpha| NE|Jan-2022|100000|
# | new york| large| alpha| NE|Fev-2022|195000|
# | new york| large| alpha| NE|Mar-2022|237000|
# |los angeles| large| alpha| W|Jan-2022|330000|
# |los angeles| large| alpha| W|Fev-2022|400000|
# |los angeles| large| alpha| W|Mar-2022|580000|
# +-----------+---------+-------------------+------+--------+------+
启用它的功能是stack
. It does not have a dataframe API, so you need to use expr
访问它。
顺便说一句,这不是旋转,而是相反 - 不旋转。
在pandas中:
df.melt(df.columns[:4], var_name = 'month', value_name = 'orders')
city city_tier city_classification Region month orders
0 york large alpha NE Jan-2022-orders 100000
1 angeles large alpha W Jan-2022-orders 330000
2 york large alpha NE Feb-2022-orders 195000
3 angeles large alpha W Feb-2022-orders 400000
4 york large alpha NE Mar-2022-orders 237000
5 angeles large alpha W Mar-2022-orders 580000
甚至
df.melt(['city', 'city_tier', 'city_classification', 'Region'],
var_name = 'month', value_name = 'orders')
city city_tier city_classification Region month orders
0 york large alpha NE Jan-2022-orders 100000
1 angeles large alpha W Jan-2022-orders 330000
2 york large alpha NE Feb-2022-orders 195000
3 angeles large alpha W Feb-2022-orders 400000
4 york large alpha NE Mar-2022-orders 237000
5 angeles large alpha W Mar-2022-orders 580000