Error when doing simple data normalization, TypeError: unsupported operand type(s) for -: 'str' and 'str'
I'm trying to normalize a pandas DataFrame with the following function:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
df_normalized = normalize(df)
where:
filename = 'data.csv'
data = pd.read_csv(filename)
df = pd.DataFrame(data)
df=df.dropna(axis=1,how='all')
But I keep running into this error, which has been bugging me for hours:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Does anyone know why?
Here is my data: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
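The error means you are trying to subtract one string from another, an operation that is not defined.
Essentially, you are trying to do something like "foo" - "bar".
Try using float() on all the subtraction operands to fix it.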
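As a minimal sketch of what the message means, the snippet below reproduces it with plain Python strings and then converts the operands with float(). The two values are made up for illustration and stand in for numbers that were read as text.

```python
# Hypothetical values that were read as text rather than as numbers.
a, b = "20.57", "17.99"

# Subtraction is not defined for str operands, so Python raises TypeError.
try:
    a - b
except TypeError as exc:
    message = str(exc)

# Converting each operand with float() makes the subtraction valid.
difference = float(a) - float(b)

print(message)
print(round(difference, 2))
```

Note that float() only helps when the text really holds a number; float("M") raises ValueError.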
For your code:
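def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        # float() only accepts scalars, so cast the whole column with astype(float);
        # this raises ValueError if the column holds non-numeric text such as 'M'
        column = df[feature_name].astype(float)
        max_value = column.max()
        min_value = column.min()
        result[feature_name] = (column - min_value) / (max_value - min_value)
    return result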
Reading from a file does not guarantee that pandas will infer the types you expect; you have to convert them explicitly, like this:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        # Convert explicitly; errors='coerce' turns unparseable values into NaN
        # (errors='ignore' is deprecated since pandas 2.2 and leaves strings in place)
        column = pd.to_numeric(df[feature_name], errors='coerce')
        max_value = column.max()
        min_value = column.min()
        result[feature_name] = (column - min_value) / (max_value - min_value)
    return result
df_normalized = normalize(df)
df.apply(pd.to_numeric)
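To see what explicit conversion does to each column, here is a small sketch on a made-up frame (the column names label and radius are hypothetical, not taken from the dataset). It assumes you are happy for values that cannot be parsed to become NaN, which is what errors="coerce" does.

```python
import pandas as pd

# Hypothetical frame: numbers stored as text plus a genuine label column.
df = pd.DataFrame({
    "label": ["M", "B", "M"],
    "radius": ["17.99", "20.57", "11.42"],
})

# errors="coerce" replaces values that cannot be parsed with NaN
# instead of raising, so every column ends up numeric.
converted = df.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted)
```

The text column becomes all NaN here, so a real label column is usually better excluded before converting.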
You can first check the dtypes of df:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
df = pd.read_csv(url, header=None)
print (df.dtypes)
0 int64
1 object
2 float64
3 float64
...
...
29 float64
30 float64
31 float64
dtype: object
All columns are numeric except the second one, which is object (evidently strings), so one possible solution is to use set_index to move all string columns into the index:
df = df.set_index(1)
print (df.head())
          0      2      3       4       5        6        7       8        9  \
1
M    842302  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001  0.14710
M    842517  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869  0.07017
M  84300903  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974  0.12790
M  84348301  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414  0.10520
M  84358402  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980  0.10430

       10  ...     22     23      24      25      26      27      28  \
1          ...
M  0.2419  ...  25.38  17.33  184.60  2019.0  0.1622  0.6656  0.7119
M  0.1812  ...  24.99  23.41  158.80  1956.0  0.1238  0.1866  0.2416
M  0.2069  ...  23.57  25.53  152.50  1709.0  0.1444  0.4245  0.4504
M  0.2597  ...  14.91  26.50   98.87   567.7  0.2098  0.8663  0.6869
M  0.1809  ...  22.54  16.67  152.20  1575.0  0.1374  0.2050  0.4000

       29      30       31
1
M  0.2654  0.4601  0.11890
M  0.1860  0.2750  0.08902
M  0.2430  0.3613  0.08758
M  0.2575  0.6638  0.17300
M  0.1625  0.2364  0.07678

[5 rows x 31 columns]
Then everything works; finally, add reset_index:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
df_normalized = normalize(df).reset_index().sort_index(axis=1)
print (df_normalized.head())
          0  1         2         3         4         5         6         7  \
0  0.000915  M  0.521037  0.022658  0.545989  0.363733  0.593753  0.792037
1  0.000915  M  0.643144  0.272574  0.615783  0.501591  0.289880  0.181768
2  0.092495  M  0.601496  0.390260  0.595743  0.449417  0.514309  0.431017
3  0.092547  M  0.210090  0.360839  0.233501  0.102906  0.811321  0.811361
4  0.092559  M  0.629893  0.156578  0.630986  0.489290  0.430351  0.347893

          8         9  ...        22        23        24        25  \
0  0.703140  0.731113  ...  0.620776  0.141525  0.668310  0.450698
1  0.203608  0.348757  ...  0.606901  0.303571  0.539818  0.435214
2  0.462512  0.635686  ...  0.556386  0.360075  0.508442  0.374508
3  0.565604  0.522863  ...  0.248310  0.385928  0.241347  0.094008
4  0.463918  0.518390  ...  0.519744  0.123934  0.506948  0.341575

         26        27        28        29        30        31
0  0.601136  0.619292  0.568610  0.912027  0.598462  0.418864
1  0.347553  0.154563  0.192971  0.639175  0.233590  0.222878
2  0.483590  0.385375  0.359744  0.835052  0.403706  0.213433
3  0.915472  0.814012  0.548642  0.884880  1.000000  0.773711
4  0.437364  0.172415  0.319489  0.558419  0.157500  0.142595

[5 rows x 32 columns]
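As an alternative to the set_index / reset_index round trip, you could restrict the loop to numeric columns with select_dtypes and leave everything else untouched. This is only a sketch: the function name normalize_numeric and the three-row frame are made up for illustration.

```python
import pandas as pd

def normalize_numeric(df):
    """Min-max scale the numeric columns; copy other columns unchanged."""
    result = df.copy()
    # select_dtypes(include="number") skips object (string) columns.
    for feature_name in df.select_dtypes(include="number").columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

# Hypothetical data shaped like the file: an id, a text label, a measurement.
df = pd.DataFrame({0: [1, 2, 3], 1: ["M", "B", "M"], 2: [10.0, 20.0, 40.0]})
df_normalized = normalize_numeric(df)
print(df_normalized)
```

This keeps the original column order, so no sort_index is needed afterwards. A column whose values are all equal would divide by zero and come out as NaN, the same as in the original function.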