Python: group by 2 columns but get records that vary by a different column
I have a dataframe with 3 columns: ZIP_CODE, TERR_NAME, STATE. For a given ZIP_CODE & TERR_NAME there can be only one STATE code. Exact duplicate records may exist, but there must be no records with the same ZIP_CODE/TERR_NAME and 2 different states. How can I get the erroneous records?
I tried grouping by ZIP_CODE/TERR_NAME/STATE, but I don't know how to get these erroneous records.
df1 = sqlContext.createDataFrame(
    [("81A01", "TERR NAME 01", "NJ"), ("81A01", "TERR NAME 01", "CA"),
     ("81A02", "TERR NAME 02", "NY"), ("81A03", "TERR NAME 03", "NY"),
     ("81A03", "TERR NAME 03", "CA"), ("81A04", "TERR NAME 04", "FL"),
     ("81A05", "TERR NAME 05", "NJ"), ("81A06", "TERR NAME 06", "CA"),
     ("81A06", "TERR NAME 06", "CA")],
    ["zip_code", "territory_name", "state"])
df1.createOrReplaceTempView("df1_temp")
+--------+--------------+-----+
|zip_code|territory_name|state|
+--------+--------------+-----+
| 81A01| TERR NAME 01| NJ|
| 81A01| TERR NAME 01| CA|
| 81A02| TERR NAME 02| NY|
| 81A03| TERR NAME 03| NY|
| 81A03| TERR NAME 03| CA|
| 81A04| TERR NAME 04| FL|
| 81A05| TERR NAME 05| NJ|
| 81A06| TERR NAME 06| CA|
| 81A06| TERR NAME 06| CA|
+--------+--------------+-----+
Using spark.sql(), I need a dataframe without those codes, i.e. 81A01 and 81A03, which have the same zip_code and territory_name but different state codes.
Expected new DF:
+--------+--------------+-----+
|zip_code|territory_name|state|
+--------+--------------+-----+
| 81A02| TERR NAME 02| NY|
| 81A04| TERR NAME 04| FL|
| 81A05| TERR NAME 05| NJ|
| 81A06| TERR NAME 06| CA|
| 81A06| TERR NAME 06| CA|
+--------+--------------+-----+
Excluded zip codes:
+--------+--------------+-----+
|zip_code|territory_name|state|
+--------+--------------+-----+
| 81A01| TERR NAME 01| NJ|
| 81A01| TERR NAME 01| CA|
| 81A03| TERR NAME 03| NY|
| 81A03| TERR NAME 03| CA|
+--------+--------------+-----+
Thanks in advance.
# Print the (zip_code, territory_name) keys whose group has more than one
# distinct state. Checking distinct states rather than the raw group size
# avoids flagging legitimate exact duplicates such as 81A06.
for key, group_df in df.groupby(['zip_code', 'territory_name']):
    if group_df['state'].nunique() > 1:
        print(key)
Hope the above code solves your problem.
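If you need the offending rows themselves rather than just the keys, the collected keys can be used to split the frame. A minimal sketch, assuming a pandas DataFrame df with the three columns above:

# Collect the offending keys, then partition the frame on them.
bad_keys = [key for key, g in df.groupby(['zip_code', 'territory_name'])
            if g['state'].nunique() > 1]
mask = df.set_index(['zip_code', 'territory_name']).index.isin(bad_keys)
print(df[~mask])  # clean records
print(df[mask])   # excluded records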
I found the solution myself; posting it here in case it is useful to others:
spark.sql("SELECT zip_code, territory_name, COUNT(distinct state) as COUNT FROM df1_temp GROUP BY zip_code, territory_name having COUNT>1").show()
+--------+--------------+-----+
|zip_code|territory_name|COUNT|
+--------+--------------+-----+
| 81A03| TERR NAME 03| 2|
| 81A01| TERR NAME 01| 2|
+--------+--------------+-----+
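To actually split the data, the bad keys found above can be joined back against the original view. A minimal sketch using spark.sql(), assuming the df1_temp view registered above:

# Keys with more than one distinct state per (zip_code, territory_name)
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW bad_keys AS
    SELECT zip_code, territory_name
    FROM df1_temp
    GROUP BY zip_code, territory_name
    HAVING COUNT(DISTINCT state) > 1
""")

# Clean records: everything except the bad keys
spark.sql("""
    SELECT t.* FROM df1_temp t
    LEFT ANTI JOIN bad_keys b
    ON t.zip_code = b.zip_code AND t.territory_name = b.territory_name
""").show()

# Excluded records: only the bad keys
spark.sql("""
    SELECT t.* FROM df1_temp t
    LEFT SEMI JOIN bad_keys b
    ON t.zip_code = b.zip_code AND t.territory_name = b.territory_name
""").show()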
Thanks.
With PySpark, here is a code snippet for your requirement:
from pyspark.sql.functions import col, collect_set, size
from pyspark.sql.window import Window

df1 = sqlContext.createDataFrame(
    [("81A01", "TERR NAME 01", "NJ"), ("81A01", "TERR NAME 01", "CA"),
     ("81A02", "TERR NAME 02", "NY"), ("81A03", "TERR NAME 03", "NY"),
     ("81A03", "TERR NAME 03", "CA"), ("81A04", "TERR NAME 04", "FL"),
     ("81A05", "TERR NAME 05", "NJ"), ("81A06", "TERR NAME 06", "CA"),
     ("81A06", "TERR NAME 06", "CA")],
    ["zip_code", "territory_name", "state"])

# Collect the set of distinct states per (zip_code, territory_name) group,
# then keep only the rows whose group has exactly one distinct state.
w = Window.partitionBy("zip_code", "territory_name")
df1_v1 = (df1
          .withColumn("states", collect_set("state").over(w))
          .filter(size(col("states")) == 1)
          .orderBy("zip_code")
          .drop("states"))
df1_v1.show()
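To obtain the excluded records instead, the same window can be reused with the filter inverted; a minimal sketch building on the snippet above:

# Excluded records: groups with more than one distinct state.
df1_bad = (df1
           .withColumn("states", collect_set("state").over(w))
           .filter(size(col("states")) > 1)
           .orderBy("zip_code")
           .drop("states"))
df1_bad.show()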
Let me know if you face any issue related to this, and please accept the answer if it resolves your problem.
import pandas as pd

data = {
    "zip_code": ["81A01", "81A01", "81A02", "81A03", "81A03", "81A04", "81A05",
                 "81A06", "81A06"],
    "territory_name": ["TERR NAME 01", "TERR NAME 01", "TERR NAME 02",
                       "TERR NAME 03", "TERR NAME 03", "TERR NAME 04", "TERR NAME 05",
                       "TERR NAME 06", "TERR NAME 06"],
    "state": ["NJ", "CA", "NY", "NY", "CA", "FL", "NJ", "CA", "CA"]
}
df = pd.DataFrame(data)

# Index tuples of rows that share the same (zip_code, territory_name).
duplicate = list(set(
    tuple(df[(df["zip_code"] == df["zip_code"][i]) &
             (df["territory_name"] == df["territory_name"][i])].index)
    for i in range(len(df))
))

# Drop both rows of any pair whose state values disagree.
# Note: this assumes offending groups have exactly two rows, as in the sample.
for i in duplicate:
    if len(i) > 1 and df["state"][i[0]] != df["state"][i[1]]:
        df = df.drop(i[0])
        df = df.drop(i[1])

print(df)
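A shorter, vectorized pandas alternative filters on the number of distinct states per group; a sketch to run on the freshly built df, before the drops above:

# Boolean mask: True for rows whose group has exactly one distinct state.
ok = df.groupby(["zip_code", "territory_name"])["state"].transform("nunique") == 1
print(df[ok])   # clean records
print(df[~ok])  # excluded records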