python pandas - generic ways to deal with commas in string to float conversion with astype()
Is there a generic way to tell pandas to use a comma (",") as the decimal separator when converting from strings to types such as float?
import pandas as pd
from datetime import datetime

data = {
    "col_str": ["a", "b", "c"],
    "col_int": ["1", "2", "3"],
    "col_float": ["1,2", "3,2342", "97837,8277"],
    "col_float2": ["13,2", "3234,2342", "263,8277"],
    "col_date": [
        datetime(2020, 8, 1, 0, 3, 4).isoformat(),
        datetime(2020, 8, 2, 2, 4, 5).isoformat(),
        datetime(2020, 8, 3, 6, 8, 4).isoformat(),
    ],
}
conversion_dict = {
    "col_str": str,
    "col_int": int,
    "col_float": float,
    "col_float2": float,
    # explicit unit; recent pandas versions reject a unitless "datetime64"
    "col_date": "datetime64[ns]",
}
df = pd.DataFrame(data=data)
print(df.dtypes)
df = df.astype(conversion_dict, errors="ignore")
print(df.dtypes)
print(df)
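The example above returns object columns for "col_float" and "col_float2", or throws an error if errors is set to "raise".
I would like to use the astype() method directly, without manually replacing the commas with dots.
The data source usually returns floats with a comma as the decimal separator because its locale is set to German.
Is there a generic way to tell pandas that commas in floats (or in any other numeric type with decimals) are fine and should be converted automatically?
PS: I can't use read_csv, where the separator can be specified directly, because the source is a database.
Thanks in advance.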
You can solve this generically with the locale library, using apply() and locale.atof. Just substitute the appropriate locale. I use de_DE here because German uses "," as the decimal separator.
import locale
from datetime import datetime
import pandas as pd

locale.setlocale(locale.LC_ALL, locale="de_DE")

data = {
    "col_str": ["a", "b", "c"],
    "col_int": ["1", "2", "3"],
    "col_float": ["1,2", "3,2342", "97837,8277"],
    "col_float2": ["13,2", "3234,2342", "263,8277"],
    "col_date": [
        datetime(2020, 8, 1, 0, 3, 4).isoformat(),
        datetime(2020, 8, 2, 2, 4, 5).isoformat(),
        datetime(2020, 8, 3, 6, 8, 4).isoformat(),
    ],
}
conversion_dict = {
    "col_str": str,
    "col_int": int,
    # keep the float columns as strings here; locale.atof converts them below
    "col_float": str,
    "col_float2": str,
    # explicit unit; recent pandas versions reject a unitless "datetime64"
    "col_date": "datetime64[ns]",
}
df = pd.DataFrame(data=data)
print(df.dtypes)
df = df.astype(conversion_dict, errors="ignore")
df["col_float"] = df["col_float"].apply(locale.atof)
df["col_float2"] = df["col_float2"].apply(locale.atof)
print(df.dtypes)
print(df)
Output:
col_str object
col_int object
col_float object
col_float2 object
col_date object
dtype: object
col_str object
col_int int64
col_float float64
col_float2 float64
col_date datetime64[ns]
dtype: object
  col_str  col_int   col_float  col_float2            col_date
0       a        1      1.2000     13.2000 2020-08-01 00:03:04
1       b        2      3.2342   3234.2342 2020-08-02 02:04:05
2       c        3  97837.8277    263.8277 2020-08-03 06:08:04
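One caveat with this approach is that locale names differ between platforms and setlocale() changes process-wide state. The following is only a sketch of one way to soften that, not part of the answer above: the helper name numeric_locale and the list of candidate locale names are assumptions made for illustration. It tries several names, checks that the chosen locale really uses "," as the decimal point, and restores the previous setting afterwards.

```python
import locale
from contextlib import contextmanager

# Hypothetical candidate names; which ones exist depends on the operating system.
GERMAN_CANDIDATES = ("de_DE.UTF-8", "de_DE", "German_Germany", "deu_deu")


@contextmanager
def numeric_locale(candidates=GERMAN_CANDIDATES, decimal_point=","):
    """Temporarily switch LC_NUMERIC to the first usable candidate locale."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_NUMERIC, name)
            except locale.Error:
                continue  # this name is not installed, try the next one
            if locale.localeconv()["decimal_point"] == decimal_point:
                break  # found a locale with the expected decimal separator
        else:
            raise locale.Error(f"no usable locale among {candidates!r}")
        yield
    finally:
        # always restore whatever was active before
        locale.setlocale(locale.LC_NUMERIC, saved)


try:
    with numeric_locale():
        values = [locale.atof(s) for s in ["1,2", "3,2342", "97837,8277"]]
    print(values)
except locale.Error as exc:
    print(f"German locale unavailable: {exc}")
```

Inside the with block, df["col_float"].apply(locale.atof) would work the same way as in the answer. The locale change still affects the whole process while the block runs, so this does not make the approach thread-safe.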
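I worked around the problem as follows. It may still break in some cases, but I have not found a way to tell pandas astype() that commas are acceptable. If anyone has another pandas-only solution, please let me know: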
import locale
from datetime import datetime
import pandas as pd

data = {
    "col_str": ["a", "b", "c"],
    "col_int": ["1", "2", "3"],
    "col_float": ["1,2", "3,2342", "97837,8277"],
    "col_float2": ["13,2", "3234,2342", "263,8277"],
    "col_date": [
        datetime(2020, 8, 1, 0, 3, 4).isoformat(),
        datetime(2020, 8, 2, 2, 4, 5).isoformat(),
        datetime(2020, 8, 3, 6, 8, 4).isoformat(),
    ],
}
conversion_dict = {
    "col_str": str,
    "col_int": int,
    "col_float": float,
    "col_float2": float,
    # explicit unit; recent pandas versions reject a unitless "datetime64"
    "col_date": "datetime64[ns]",
}
df = pd.DataFrame(data=data)
throw_error = True
try:
    df = df.astype(conversion_dict, errors="raise")
except ValueError as e:
    error_message = str(e).strip().upper()
    error_search = "COULD NOT CONVERT STRING TO FLOAT:"
    # compare error messages to catch only the string-to-float error, because pandas
    # raises plain ValueErrors that are not datatype specific. This is hacky because
    # error messages could change.
    if error_message.startswith(error_search):
        # convert everything else and ignore errors for the float columns
        df = df.astype(conversion_dict, errors="ignore")
        # go over the conversion dict
        for key, value in conversion_dict.items():
            # print(str(key) + ":" + str(value) + ":" + str(df[key].dtype))
            # only touch convert-to-float columns that are not already float64;
            # without this check, .str.replace() throws an error on numeric columns
            if (value == float or value == "float") and df[key].dtype != "float64":
                # df[key].apply(locale.atof) or anything locale related is platform
                # dependent and therefore bad in my opinion
                # locale settings for atof
                # WINDOWS: locale.setlocale(locale.LC_ALL, 'deu_deu')
                # UNIX: locale.setlocale(locale.LC_ALL, 'de_DE')
                df[key] = pd.to_numeric(df[key].str.replace(',', '.'))
    else:
        if throw_error:
            # or do whatever is best suited for your use case
            raise ValueError(str(e))
        else:
            df = df.astype(conversion_dict, errors="ignore")
print(df.dtypes)
print(df)
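The same idea can be written without matching on the error message. This is a minimal sketch under my own assumptions, not an existing pandas feature: the function name astype_decimal_comma and the FLOAT_TARGETS tuple are made up for illustration. Columns whose target is a float get their decimal separator replaced and go through pd.to_numeric; everything else goes through a plain astype().

```python
import pandas as pd

# Hypothetical list of dict values that should be treated as "convert to float".
FLOAT_TARGETS = (float, "float", "float64")


def astype_decimal_comma(df, conversion_dict, decimal=","):
    """Return a converted copy of df; float targets accept `decimal` as separator."""
    out = df.copy()
    for col, target in conversion_dict.items():
        if target in FLOAT_TARGETS and not pd.api.types.is_float_dtype(out[col]):
            # normalise the separator, then let pandas parse the numbers
            as_text = out[col].astype(str).str.replace(decimal, ".", regex=False)
            out[col] = pd.to_numeric(as_text)
        else:
            out[col] = out[col].astype(target)
    return out


df = pd.DataFrame({
    "col_str": ["a", "b", "c"],
    "col_int": ["1", "2", "3"],
    "col_float": ["1,2", "3,2342", "97837,8277"],
    "col_float2": ["13,2", "3234,2342", "263,8277"],
    "col_date": ["2020-08-01T00:03:04", "2020-08-02T02:04:05", "2020-08-03T06:08:04"],
})
conversion_dict = {
    "col_str": str,
    "col_int": int,
    "col_float": float,
    "col_float2": float,
    "col_date": "datetime64[ns]",
}
result = astype_decimal_comma(df, conversion_dict)
print(result.dtypes)
```

This sketch assumes the strings contain no thousands separators (a value such as "1.234,5" would still fail) and it raises on any value that cannot be parsed, so error handling would have to be added to match the errors="ignore" behaviour.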