Error when doing simple data normalization, TypeError: unsupported operand type(s) for -: 'str' and 'str'

Question

I am trying to normalize a pandas DataFrame with the following function:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result


df_normalized = normalize(df)

where:

filename = 'data.csv'

data = pd.read_csv(filename)

df = pd.DataFrame(data)

df=df.dropna(axis=1,how='all')

But I keep running into this error, which has been bothering me for hours:

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Does anyone know why?

Here is my data: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Answer 1

The error means you are trying to subtract strings, an operation that is not defined.

Essentially, you are trying to do something like "foo" - "bar".
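To make this concrete, here is a minimal sketch that reproduces the error on a small made-up frame. The column names `radius` and `diagnosis` and the values are assumptions for illustration, not the asker's actual data.

```python
import pandas as pd

# Hypothetical frame: one numeric column and one text column.
df = pd.DataFrame({"radius": [17.99, 20.57, 19.69], "diagnosis": ["M", "M", "B"]})

col = df["diagnosis"]
# On a text column, max() and min() compare lexicographically and return str.
max_value, min_value = col.max(), col.min()

try:
    max_value - min_value  # same as "M" - "B"
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(type(max_value).__name__, error_message)
```

The numeric column goes through the same code without trouble, so the loop fails only when it reaches a text column.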

Try converting every operand of the subtraction to float to fix it.

For your code:

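def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        # float() only accepts scalars, so the column itself is converted with astype(float);
        # this raises ValueError if the column holds non-numeric text such as 'M'.
        column = df[feature_name].astype(float)
        max_value = column.max()
        min_value = column.min()
        result[feature_name] = (column - min_value) / (max_value - min_value)
    return result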

Answer 2

Reading from a file does not guarantee that pandas will infer the types you expect; you have to convert explicitly, like this:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        # errors='ignore' is deprecated since pandas 2.2; 'coerce' turns unparseable values into NaN
        column = pd.to_numeric(df[feature_name], errors='coerce')
        max_value = column.max()
        min_value = column.min()
        result[feature_name] = (column - min_value) / (max_value - min_value)
    return result


df_normalized = normalize(df)

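To convert the whole DataFrame at once (this raises ValueError if a column contains non-numeric text):

df.apply(pd.to_numeric)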
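Before converting, it can help to find out which columns cannot be parsed as numbers at all. The following is a rough sketch under assumed data: the frame, its column names, and the `unparseable` helper list are hypothetical.

```python
import pandas as pd

# Hypothetical frame standing in for the CSV contents.
df = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "diagnosis": ["M", "M", "B"],            # genuine text
    "radius": ["17.99", "20.57", "19.69"],   # numbers stored as text
})

# errors="coerce" replaces every value that cannot be parsed with NaN.
converted = df.apply(pd.to_numeric, errors="coerce")

# Columns that had values before but are entirely NaN afterwards hold no numbers.
unparseable = [
    name for name in df.columns
    if converted[name].isna().all() and not df[name].isna().all()
]

print(unparseable)
print(converted.dtypes)
```

Columns listed in `unparseable` are better dropped, encoded, or set aside before normalizing, since coercing them only yields NaN.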

Answer 3

You can first check the dtypes of df:

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'

df = pd.read_csv(url, header=None)
print (df.dtypes)
0       int64
1      object
2     float64
3     float64
...
...
29    float64
30    float64
31    float64
dtype: object

All columns are numeric except the second one, which is object, evidently strings. One possible solution is to use set_index to move all string columns into the index:

df = df.set_index(1)
print (df.head())
         0      2      3       4       5        6        7       8        9   \
1                                                                              
M    842302  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001  0.14710   
M    842517  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869  0.07017   
M  84300903  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974  0.12790   
M  84348301  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414  0.10520   
M  84358402  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980  0.10430   

       10   ...        22     23      24      25      26      27      28  \
1           ...                                                            
M  0.2419   ...     25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119   
M  0.1812   ...     24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416   
M  0.2069   ...     23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504   
M  0.2597   ...     14.91  26.50   98.87   567.7  0.2098  0.8663  0.6869   
M  0.1809   ...     22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000   

       29      30       31  
1                           
M  0.2654  0.4601  0.11890  
M  0.1860  0.2750  0.08902  
M  0.2430  0.3613  0.08758  
M  0.2575  0.6638  0.17300  
M  0.1625  0.2364  0.07678  

[5 rows x 31 columns]

Then everything works; finally, add reset_index:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

df_normalized = normalize(df).reset_index().sort_index(axis=1)
print (df_normalized.head())
         0  1         2         3         4         5         6         7   \
0  0.000915  M  0.521037  0.022658  0.545989  0.363733  0.593753  0.792037   
1  0.000915  M  0.643144  0.272574  0.615783  0.501591  0.289880  0.181768   
2  0.092495  M  0.601496  0.390260  0.595743  0.449417  0.514309  0.431017   
3  0.092547  M  0.210090  0.360839  0.233501  0.102906  0.811321  0.811361   
4  0.092559  M  0.629893  0.156578  0.630986  0.489290  0.430351  0.347893   

         8         9     ...           22        23        24        25  \
0  0.703140  0.731113    ...     0.620776  0.141525  0.668310  0.450698   
1  0.203608  0.348757    ...     0.606901  0.303571  0.539818  0.435214   
2  0.462512  0.635686    ...     0.556386  0.360075  0.508442  0.374508   
3  0.565604  0.522863    ...     0.248310  0.385928  0.241347  0.094008   
4  0.463918  0.518390    ...     0.519744  0.123934  0.506948  0.341575   

         26        27        28        29        30        31  
0  0.601136  0.619292  0.568610  0.912027  0.598462  0.418864  
1  0.347553  0.154563  0.192971  0.639175  0.233590  0.222878  
2  0.483590  0.385375  0.359744  0.835052  0.403706  0.213433  
3  0.915472  0.814012  0.548642  0.884880  1.000000  0.773711  
4  0.437364  0.172415  0.319489  0.558419  0.157500  0.142595  

[5 rows x 32 columns]
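As an alternative to moving the string column into the index, the numeric columns can be selected by dtype and scaled in one vectorized step. This is only a sketch: `normalize_numeric` is a hypothetical helper name, and the three-column frame is made-up data shaped like the one above.

```python
import pandas as pd

def normalize_numeric(df):
    """Min-max scale numeric columns to [0, 1]; leave other columns untouched."""
    result = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    col_min = df[numeric_cols].min()
    # A constant column has a range of 0, so its scaled values come out as NaN.
    col_range = df[numeric_cols].max() - col_min
    result[numeric_cols] = (df[numeric_cols] - col_min) / col_range
    return result

# Hypothetical data: integer column labels as produced by header=None.
df = pd.DataFrame({0: [1, 2, 3], 1: ["M", "B", "M"], 2: [10.0, 20.0, 40.0]})
df_normalized = normalize_numeric(df)
print(df_normalized)
```

Column order is preserved, so no reset_index or sort_index is needed afterwards.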

python

normalization

typeerror