How do I match the fit_transform output of an imputed dataset with the original dataset when handling missing data in an ML model?

I am trying to fill missing values with KNNImputer using the following line of code:

pd.DataFrame(knn_imputer.fit_transform(data),
                        index=data.index,
                        columns=data.columns)

I get this error message:

Traceback (most recent call last):
  File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\database_analyzer.py", line 384, in <module>
    main()
  File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\database_analyzer.py", line 232, in main
    train_data_engineered = missingvalue_handler(train_data_engineered)
  File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\utilities_module.py", line 1268, in missingvalue_handler
    return pd.DataFrame(knn_imputer.fit_transform(new_data),
  File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\frame.py", line 695, in __init__
    mgr = ndarray_to_mgr(
  File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\internals\construction.py", line 351, in ndarray_to_mgr    
    _check_values_indices_shape_match(values, index, columns)
  File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\internals\construction.py", line 422, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (196, 1032), indices imply (196, 1033)

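I understand this happens because the imputer drops one column entirely, so the output has 1032 columns instead of 1033. How can I fix this without knowing which column was dropped?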
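For anyone who does want to know which column is affected: KNNImputer by default discards features that have no observed values at fit time, so listing the all-NaN columns should point to the culprit. A minimal sketch on a made-up frame (the column names and values are hypothetical, not from my dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column "b" contains only missing values.
data = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, np.nan],
        "c": [4.0, 5.0, np.nan],
    }
)

# Columns where every entry is NaN are the ones the imputer would drop.
empty_cols = data.columns[data.isna().all()].tolist()
print(empty_cols)
```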

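I actually figured it out. I don't need to know the exact column name. KNNImputer drops columns that contain only missing values, so I changed the code to drop those same columns from the labels. That way the number of columns in the imputed array matches the number of column labels when building the pandas DataFrame from the imputed dataset.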

Before:

pd.DataFrame(knn_imputer.fit_transform(data),
                        index=data.index,
                        columns=data.columns)

After:

pd.DataFrame(knn_imputer.fit_transform(data),
                        index=data.index,
                        columns=data.dropna(axis=1, how='all').columns)
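Here is a small self-contained sketch of the same fix, in case it helps someone reproduce it. The toy frame, its column names, and n_neighbors=2 are assumptions for illustration only. It relies on the imputer's default behavior of dropping all-NaN columns and on fitting and transforming the same frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; column "b" has no observed values at all.
data = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [np.nan, np.nan, np.nan, np.nan],
        "c": [4.0, 5.0, np.nan, 7.0],
    },
    index=["r0", "r1", "r2", "r3"],
)

knn_imputer = KNNImputer(n_neighbors=2)
imputed = knn_imputer.fit_transform(data)  # all-NaN column "b" is dropped

# Labels for the surviving columns: everything except the all-NaN columns.
kept_columns = data.dropna(axis=1, how="all").columns

# Guard against a silent mismatch before building the frame.
if imputed.shape[1] != len(kept_columns):
    raise ValueError("imputed array and column labels disagree")

result = pd.DataFrame(imputed, index=data.index, columns=kept_columns)
print(result)
```

If the imputer is fitted on one frame and applied to another, the labels should come from the frame used for fitting, since that is what decides which columns are dropped. Recent scikit-learn releases also have a keep_empty_features option on the imputers that keeps such columns instead of dropping them; I have not used it here.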