Is there an efficient way to fill in correct values for inconsistent data using a second table when the data is large?
I have a table with inconsistent data, as shown below:
Table 1:
flight_id | engine_number | aircraft_tail | year | month |
---|---|---|---|---|
000000_20180121 | 000000 | G-RHBZ | 2018 | 01 |
258741_20171021 | 258741 | H-RZBE | 2017 | 10 |
_20150214 | | V-RDER | 2015 | 02 |
_20110287 | NO-NUMBER | G-EHRK | 2011 | 12 |
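The data is inconsistent because some fields do not follow the specified format. For example, engine_number should never be '000000' or missing (see the first and third rows of Table 1). I want to create another table (an indicator table) that contains the bad field together with the corresponding correct value. I have a second, very large table that can be used to build this indicator table: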
Table 2:
engine_number | aircraft_tail | year | month |
---|---|---|---|
258741 | H-RZBE | 2017 | 10 |
348741 | V-RDER | 2015 | 02 |
348741 | V-RDER | 2015 | 03 |
589745 | G-RHBZ | 2018 | 01 |
587981 | G-EHRK | 2011 | 12 |
The indicator table I want to obtain looks like this:
Table 3: *indicator table*
bad_engine_number | aircraft_tail | year | month | good_engine_number |
---|---|---|---|---|
000000 | G-RHBZ | 2018 | 01 | 589745 |
(empty) | V-RDER | 2015 | 02 | 348741 |
NO-NUMBER | G-EHRK | 2011 | 12 | 587981 |
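As you can see, the two tables (Table 1 and Table 2) share the aircraft_tail, year and month columns. However, I have not been able to merge them to build the indicator table, because I am working with continuous data and my tables are very large. I also tried fuzzy matching (fuzzywuzzy) on aircraft_tail and year to fill in the good engine_number, but that failed too because of the data size. Any ideas on how to create such an indicator table?
I'm new to this field :))) Thank you!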
I think you can go with merge:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
import pandas as pd

df1 = pd.DataFrame(data=[
    {"flight_id": "000000_20180121", "engine_number": "000000",
     "aircraft_tail": "G-RHBZ", "year": "2018", "month": "01"},
    {"flight_id": "258741_20171021", "engine_number": "258741",
     "aircraft_tail": "H-RZBE", "year": "2017", "month": "10"},
    {"flight_id": "_20150214", "engine_number": "",
     "aircraft_tail": "V-RDER", "year": "2015", "month": "02"},
    {"flight_id": "_20110287", "engine_number": "NO-NUMBER",
     "aircraft_tail": "G-EHRK", "year": "2011", "month": "12"},
])

df2 = pd.DataFrame(data=[
    {"engine_number": "258741", "aircraft_tail": "H-RZBE", "year": "2017", "month": "10"},
    {"engine_number": "348741", "aircraft_tail": "V-RDER", "year": "2015", "month": "02"},
    {"engine_number": "348741", "aircraft_tail": "V-RDER", "year": "2015", "month": "03"},
    {"engine_number": "589745", "aircraft_tail": "G-RHBZ", "year": "2018", "month": "01"},
    {"engine_number": "587981", "aircraft_tail": "G-EHRK", "year": "2011", "month": "12"},
])

# Validator function: an engine number is bad if it is empty
# or one of the known placeholder values
def bad_engine_number_detector(engine_number):
    lst_invalid_engine_number = ["000000", "NO-NUMBER"]
    return engine_number == "" or engine_number in lst_invalid_engine_number

# Identify invalid entries on df1
mask = df1["engine_number"].apply(bad_engine_number_detector)

# Merge both tables (df1 filtered only with bad entries)
df1.loc[mask].merge(df2,
                    on=["aircraft_tail", "year", "month"],
                    suffixes=("_bad", "_good"))
flight_id | engine_number_bad | aircraft_tail | year | month | engine_number_good |
---|---|---|---|---|---|
000000_20180121 | 000000 | G-RHBZ | 2018 | 01 | 589745 |
_20150214 | | V-RDER | 2015 | 02 | 348741 |
_20110287 | NO-NUMBER | G-EHRK | 2011 | 12 | 587981 |
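Since the question stresses that the tables are large, here is a rough variant of the same merge that might scale a bit better. It is only a sketch under a few assumptions: the list of placeholder values is complete, a missing engine number may also show up as NaN, and the second table is consistent, so keeping the first row per (aircraft_tail, year, month) key is acceptable. The names `KEYS`, `BAD_VALUES` and `build_indicator_table` are made up for illustration. It replaces the row-by-row `apply` with a vectorized `isin` check, de-duplicates the lookup table before merging so the join cannot multiply rows, and renames the columns to match Table 3.

```python
import pandas as pd

KEYS = ["aircraft_tail", "year", "month"]
# Assumption: these are all the placeholder values that mark a bad engine number.
BAD_VALUES = ["", "000000", "NO-NUMBER"]


def build_indicator_table(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Return one row per bad engine_number in df1 with the value found in df2."""
    # Vectorized check instead of a Python-level apply; NaN also counts as bad.
    engine = df1["engine_number"]
    mask = engine.isna() | engine.isin(BAD_VALUES)
    bad = df1.loc[mask, ["engine_number"] + KEYS]

    # Keep a single reference row per key (assumes df2 is consistent per key).
    lookup = df2.drop_duplicates(subset=KEYS)[KEYS + ["engine_number"]]

    # Left join: bad rows without a match are kept, with NaN as the good value.
    merged = bad.merge(lookup, on=KEYS, how="left",
                       suffixes=("_bad", "_good"), validate="many_to_one")
    merged = merged.rename(columns={
        "engine_number_bad": "bad_engine_number",
        "engine_number_good": "good_engine_number",
    })
    return merged[["bad_engine_number"] + KEYS + ["good_engine_number"]]


# Same sample data as in the question.
df1 = pd.DataFrame(
    [["000000_20180121", "000000", "G-RHBZ", "2018", "01"],
     ["258741_20171021", "258741", "H-RZBE", "2017", "10"],
     ["_20150214", "", "V-RDER", "2015", "02"],
     ["_20110287", "NO-NUMBER", "G-EHRK", "2011", "12"]],
    columns=["flight_id", "engine_number", "aircraft_tail", "year", "month"],
)
df2 = pd.DataFrame(
    [["258741", "H-RZBE", "2017", "10"],
     ["348741", "V-RDER", "2015", "02"],
     ["348741", "V-RDER", "2015", "03"],
     ["589745", "G-RHBZ", "2018", "01"],
     ["587981", "G-EHRK", "2011", "12"]],
    columns=["engine_number", "aircraft_tail", "year", "month"],
)

indicator = build_indicator_table(df1, df2)
print(indicator)
```

If the first table does not fit in memory, one option is to read it in pieces (for example with the `chunksize` argument of `pd.read_csv`) and call the function on each piece, since only the bad rows and the de-duplicated lookup take part in the join.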