Keep only duplicates from a DataFrame regarding some fields
I have this Spark DataFrame:
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|AUTRE|     2|null|    08:58:00|    23:29:00|
|TDR|  QWA|     3|null|    08:57:00|    23:28:00|
|ALT| TEST|     4|null|    08:56:00|    23:27:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+
I want to get a new DataFrame containing only the rows that are not unique regarding the 3 fields "ID", "ID2" and "Number".
Meaning I want this DataFrame:
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+
Or maybe a DataFrame containing all the duplicates:
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+
One way to do this is to use a pyspark.sql.Window to add a column that counts the number of duplicates for each row's ("ID", "ID2", "Number") combination, and then select only the rows where the duplicate count is greater than 1.
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('ID', 'ID2', 'Number')
df.select('*', f.count('ID').over(w).alias('dupeCount'))\
    .where('dupeCount > 1')\
    .drop('dupeCount')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA|     2|null|    08:54:00|    23:25:00|
#|ALT|QWA|     2|null|    08:53:00|    23:24:00|
#|ALT|QWA|     6|null|    08:59:00|    23:30:00|
#|ALT|QWA|     6|null|    08:55:00|    23:26:00|
#+---+---+------+----+------------+------------+
I used pyspark.sql.functions.count() to count the number of items in each group. This returns a DataFrame containing all of the duplicates (the second output you showed).
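As a sanity check, the same per-group counts can also be inspected directly with a groupBy aggregation; a minimal sketch (not part of the answer above) for seeing which combinations actually repeat:

import pyspark.sql.functions as f

# Count how often each ('ID', 'ID2', 'Number') combination occurs.
# With the sample data, (ALT, QWA, 6) and (ALT, QWA, 2) each appear twice.
df.groupBy('ID', 'ID2', 'Number').count().orderBy(f.desc('count')).show()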
If instead you wanted only one row per ("ID", "ID2", "Number") combination, you can use a second Window to order the rows. For example, below I add another column for the row_number and select only the rows where the duplicate count is greater than 1 and the row number is equal to 1. This guarantees exactly one row per group.
# NOTE: ordering by the partition keys themselves makes the surviving row
# arbitrary; order by another column (e.g. 'Opening_Hour') if you need a
# deterministic pick.
w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('ID', 'ID2', 'Number')
df.select(
        '*',
        f.count('ID').over(w).alias('dupeCount'),
        f.row_number().over(w2).alias('rowNum')
    )\
    .where('(dupeCount > 1) AND (rowNum = 1)')\
    .drop('dupeCount', 'rowNum')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA|     2|null|    08:54:00|    23:25:00|
#|ALT|QWA|     6|null|    08:59:00|    23:30:00|
#+---+---+------+----+------------+------------+
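A hedged alternative for the one-row-per-group case: run drop_duplicates on the all-duplicates result from the first snippet. The caveat is that drop_duplicates keeps an arbitrary row per group, much like the row_number approach above unless its orderBy picks a meaningful column. A sketch, reusing w from the first snippet:

# Assumption: `w` is the Window from the first snippet,
# Window.partitionBy('ID', 'ID2', 'Number').
dupes = df.select('*', f.count('ID').over(w).alias('dupeCount'))\
    .where('dupeCount > 1')\
    .drop('dupeCount')

# Keep one (arbitrary) row per ('ID', 'ID2', 'Number') combination.
dupes.drop_duplicates(['ID', 'ID2', 'Number']).show()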
To extend on pault's answer: I often need to subset a dataframe down to the entries that repeat x times, and since I need to do this very often, I turned it into a function that I simply import at the start of my scripts along with many other helper functions:
import pyspark.sql.functions as f
from pyspark.sql import Window

def get_entries_with_frequency(df, cols, num):
    """
    This function will filter the dataframe df down to all the rows that
    have the same values in cols num times. Example: If num=3, col="cartype",
    then the function will only return rows where a certain cartype occurs
    exactly 3 times in the dataset. If col "cartype" contains the following:
    ["Mazda", "Seat", "Seat", "VW", "Mercedes", "VW", "VW", "Mercedes", "Seat"],
    then the function will only return rows containing "VW" or "Seat"
    since these occur exactly 3 times.

    df:   Pyspark dataframe
    cols: Either a string column name or a list of strings for multiple columns.
    num:  int - The exact number of times a value (or combination of values,
          if cols is a list) has to appear in df.
    """
    if isinstance(cols, str):
        cols = [cols]
    w = Window.partitionBy(cols)
    return df.select('*', f.count(cols[0]).over(w).alias('dupeCount'))\
        .where("dupeCount = {}".format(num))\
        .drop('dupeCount')
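For example, a usage sketch against the question's DataFrame, asking for key combinations that appear exactly twice (with the sample data this should reproduce the "all duplicates" output from above):

# All rows whose ('ID', 'ID2', 'Number') combination occurs exactly 2 times.
get_entries_with_frequency(df, ['ID', 'ID2', 'Number'], 2).show()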
Here is a way to do it without a Window (DataFrame.exceptAll requires Spark 2.4+).
Get the DataFrame with only the duplicate rows. Since df.drop_duplicates([...]) keeps one representative row per key and exceptAll removes each of those rows once, what remains are the surplus copies:
df.exceptAll(df.drop_duplicates(['ID', 'ID2', 'Number'])).show()
# +---+---+------+------------+------------+
# | ID|ID2|Number|Opening_Hour|Closing_Hour|
# +---+---+------+------------+------------+
# |ALT|QWA|     2|    08:53:00|    23:24:00|
# |ALT|QWA|     6|    08:55:00|    23:26:00|
# +---+---+------+------------+------------+
Get the DataFrame with all of the duplicates, using a left_anti join against the keys that occur only once:
df.join(df.groupBy('ID', 'ID2', 'Number')
          .count().where('count = 1').drop('count'),
        on=['ID', 'ID2', 'Number'],
        how='left_anti').show()
# +---+---+------+------------+------------+
# | ID|ID2|Number|Opening_Hour|Closing_Hour|
# +---+---+------+------------+------------+
# |ALT|QWA|     2|    08:54:00|    23:25:00|
# |ALT|QWA|     2|    08:53:00|    23:24:00|
# |ALT|QWA|     6|    08:59:00|    23:30:00|
# |ALT|QWA|     6|    08:55:00|    23:26:00|
# +---+---+------+------------+------------+
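The same filter can also be written as a plain inner join against the keys whose count is greater than 1; a minimal sketch of the equivalent formulation:

# Equivalent: keep only rows whose key combination occurs more than once.
dupe_keys = df.groupBy('ID', 'ID2', 'Number').count()\
    .where('count > 1').drop('count')
df.join(dupe_keys, on=['ID', 'ID2', 'Number'], how='inner').show()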