How to match the fit_transform output of the imputed dataset with the original dataset while handling missing data in an ML model?
When I try to fill in missing values with the KNNImputer algorithm using the following line of code:
pd.DataFrame(knn_imputer.fit_transform(data),
             index=data.index,
             columns=data.columns)
I get this error message:
Traceback (most recent call last):
File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\database_analyzer.py", line 384, in <module>
main()
File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\database_analyzer.py", line 232, in main
train_data_engineered = missingvalue_handler(train_data_engineered)
File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\utilities_module.py", line 1268, in missingvalue_handler
return pd.DataFrame(knn_imputer.fit_transform(new_data),
File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\frame.py", line 695, in __init__
mgr = ndarray_to_mgr(
File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\internals\construction.py", line 351, in ndarray_to_mgr
_check_values_indices_shape_match(values, index, columns)
File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\internals\construction.py", line 422, in _check_values_indices_shape_match
raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (196, 1032), indices imply (196, 1033)
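I know this happens because the imputer drops one column entirely, reducing the column count from 1033 to 1032. How can I fix this without knowing which column was dropped?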
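I actually figured it out. I don't need to know the exact column name. KNNImputer drops columns that contain only missing values, so I applied the same filter to the column labels. That way the number of columns returned by fit_transform matches the number of labels passed to the pandas DataFrame built from the imputed dataset. I changed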
pd.DataFrame(knn_imputer.fit_transform(data),
             index=data.index,
             columns=data.columns)
to
pd.DataFrame(knn_imputer.fit_transform(data),
             index=data.index,
             columns=data.dropna(axis=1, how='all').columns)
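For anyone who wants to reproduce this on a small scale, here is a minimal sketch. The toy DataFrame, its column names, and n_neighbors=2 are made up for illustration and are not from my project. It assumes the imputer is left at its default settings, where columns with no observed values at fit time are dropped from the output.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: column "b" has no observed values at all.
data = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0],
        "b": [np.nan, np.nan, np.nan, np.nan],
        "c": [10.0, np.nan, 30.0, 40.0],
    },
    index=["r1", "r2", "r3", "r4"],
)

knn_imputer = KNNImputer(n_neighbors=2)
imputed = knn_imputer.fit_transform(data)

# The all-NaN column is gone, so the array is narrower than `data`.
print(data.shape, imputed.shape)

# Keep only the labels of columns that have at least one observed value.
# This selects the same labels as data.dropna(axis=1, how='all').columns.
kept_columns = data.columns[data.notna().any(axis=0)]

result = pd.DataFrame(imputed, index=data.index, columns=kept_columns)
print(result)
```

The dropped columns are decided when the imputer is fitted, so if you later call transform on a test set, reuse the column labels computed from the training data rather than recomputing them on the test data. Newer scikit-learn releases also document a keep_empty_features option on the imputers that keeps such columns instead of dropping them; check whether your installed version has it before relying on it.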