Is there an efficient way to correct inconsistent data using a second table when the data is large?

I have a table with inconsistent data, as shown below:

Table 1:

flight_id        engine_number  aircraft_tail  year  month
000000_20180121  000000         G-RHBZ         2018  01
258741_20171021  258741         H-RZBE         2017  10
_20150214                       V-RDER         2015  02
_20110287        NO-NUMBER      G-EHRK         2011  12

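The data is inconsistent because some fields do not follow the specified format. For example, engine_number should be neither '000000' nor missing (rows 1 and 3). I want to create another table (an indicator table) that contains the bad fields and the corresponding correct values. I have another (large) table that can be used to build this indicator table: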

Table 2:

engine_number aircraft_tail year month
258741 H-RZBE 2017 10
348741 V-RDER 2015 02
348741 V-RDER 2015 03
589745 G-RHBZ 2018 01
587981 G-EHRK 2011 12

The indicator table I want to obtain looks like this:

Table 3: *indicator table*

bad_engine_number  aircraft_tail  year  month  good_engine_number
000000             G-RHBZ         2018  01     589745
                   V-RDER         2015  02     348741
NO-NUMBER          G-EHRK         2011  12     587981

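As you can see, the tables (Table 1 and Table 2) share the aircraft_tail, year and month columns. However, I cannot merge them to create the indicator table because I am dealing with continuous data and my tables are very large. I tried fuzzy matching with fuzzywuzzy on aircraft_tail and year to fill in the good engine_number, but that also failed because of the data size. Any ideas on how to create such an indicator table?

I am new to this field :))) Thanks!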

I think you can use merge: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

import pandas as pd

df1 = pd.DataFrame(data=[
    {"flight_id":"000000_20180121","engine_number":"000000",
     "aircraft_tail":"G-RHBZ","year":"2018","month":"01"},
    {"flight_id":"258741_20171021","engine_number":"258741",
     "aircraft_tail":"H-RZBE","year":"2017","month":"10"},
    {"flight_id":"_20150214","engine_number":"",
     "aircraft_tail":"V-RDER","year":"2015","month":"02"},
    {"flight_id":"_20110287","engine_number":"NO-NUMBER", 
     "aircraft_tail":"G-EHRK","year":"2011","month":"12"}]
)
df2 = pd.DataFrame(data=[
    {"engine_number":"258741","aircraft_tail":"H-RZBE","year":"2017","month":"10"},
    {"engine_number":"348741","aircraft_tail":"V-RDER","year":"2015","month":"02"},
    {"engine_number":"348741","aircraft_tail":"V-RDER","year":"2015","month":"03"},
    {"engine_number":"589745","aircraft_tail":"G-RHBZ","year":"2018","month":"01"},
    {"engine_number":"587981","aircraft_tail":"G-EHRK","year":"2011","month":"12"}]
    )

# Validator: True when the engine number is empty or a known placeholder
def bad_engine_number_detector(engine_number):
    lst_invalid_engine_number = ["", "000000", "NO-NUMBER"]
    return engine_number in lst_invalid_engine_number

# Identify invalid entries in df1
mask = df1["engine_number"].apply(bad_engine_number_detector)

# Merge the bad rows of df1 with df2 on the shared key columns;
# the suffixes distinguish the two engine_number columns in the result
df1.loc[mask].merge(df2,
                    on=["aircraft_tail", "year", "month"],
                    suffixes=("_bad", "_good"))

Output:

flight_id        engine_number_bad  aircraft_tail  year  month  engine_number_good
000000_20180121  000000             G-RHBZ         2018  01     589745
_20150214                           V-RDER         2015  02     348741
_20110287        NO-NUMBER          G-EHRK         2011  12     587981
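
Since the question says a plain merge fails on the full tables, one possible variation is to shrink the left side to the bad rows first and then join it against Table 2 one piece at a time. The following is only a sketch under assumptions: the helper `build_indicator`, the constants `KEYS` and `BAD_VALUES`, and the idea of feeding Table 2 as an iterable of chunks are illustrative names and choices, not part of the original answer. It also assumes the key columns have the same dtype in both tables and that at least one chunk is supplied.

```python
import pandas as pd

KEYS = ["aircraft_tail", "year", "month"]
# Assumed markers of an invalid engine_number (taken from the sample data)
BAD_VALUES = ["", "000000", "NO-NUMBER"]


def build_indicator(df1, df2_chunks):
    """Map each bad engine_number in df1 to the value found in df2 for the same keys.

    df2_chunks is any iterable of DataFrames that together make up Table 2.
    """
    # Keep only the bad rows, and only the columns needed for the lookup.
    is_bad = df1["engine_number"].isna() | df1["engine_number"].isin(BAD_VALUES)
    bad = (
        df1.loc[is_bad, ["engine_number", *KEYS]]
        .drop_duplicates()
        .rename(columns={"engine_number": "bad_engine_number"})
    )

    # Join the small "bad" frame against one piece of the big table at a time,
    # so the full Table 2 never has to be merged in one go.
    parts = []
    for chunk in df2_chunks:
        lookup = chunk[[*KEYS, "engine_number"]].rename(
            columns={"engine_number": "good_engine_number"}
        )
        parts.append(bad.merge(lookup, on=KEYS, how="inner"))
    matches = pd.concat(parts, ignore_index=True).drop_duplicates()

    # Left merge so that bad rows without any match are kept (with NaN).
    return bad.merge(matches, on=["bad_engine_number", *KEYS], how="left")


# Small demonstration with the sample data from the question
df1 = pd.DataFrame({
    "engine_number": ["000000", "258741", "", "NO-NUMBER"],
    "aircraft_tail": ["G-RHBZ", "H-RZBE", "V-RDER", "G-EHRK"],
    "year": ["2018", "2017", "2015", "2011"],
    "month": ["01", "10", "02", "12"],
})
df2 = pd.DataFrame({
    "engine_number": ["258741", "348741", "348741", "589745", "587981"],
    "aircraft_tail": ["H-RZBE", "V-RDER", "V-RDER", "G-RHBZ", "G-EHRK"],
    "year": ["2017", "2015", "2015", "2018", "2011"],
    "month": ["10", "02", "03", "01", "12"],
})

# Simulate a large Table 2 arriving in pieces of two rows each
chunks = (df2.iloc[i:i + 2] for i in range(0, len(df2), 2))
indicator = build_indicator(df1, chunks)
print(indicator)
```

If Table 2 lives in a file, the chunks could come from something like `pd.read_csv("table2.csv", dtype=str, chunksize=1_000_000)`, where the file name and chunk size are hypothetical. Note that if one aircraft_tail/year/month combination maps to several engine numbers in Table 2, this yields one row per candidate, so such cases would still need a rule for choosing among them.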