How can I remove the thousands comma separator when converting data frame columns?
Given the following data frame:
State,City,Population,Poverty_Rate,Median_Age,
VA,XYZ,.,10.5%,42,
MD,ABC,"12,345",8.9%,.,
NY,.,987,654,.,41,
...
import pandas as pd
df = pd.read_csv("/path... /sample_data")
df.dtypes
returns
State object
City object
Population object
Poverty_Rate object
Median_Age object
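I tried to convert the dtypes of the appropriate columns to int or float:
df = df.astype({"Population": int, "Poverty_Rate": float, "Median_Age": int})
I received
ValueError: invalid literal for int() with base 10: '12,345'
I suspect the comma separators are causing the problem. How can I remove them from my dataset?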
Can you try the following? Run str.replace on the column first, before converting it to an integer:
import pandas as pd
df = pd.DataFrame([
{'value': '123,445'},
{'value': '143,445,788'}
])
df['value'] = df['value'].str.replace(',', '').astype(int)
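The same idea can be extended to a column that also holds placeholder values such as '.', which a plain astype(int) would reject. A minimal sketch, assuming the '.' entries should become missing values; the sample frame and the choice of the nullable Int64 dtype are illustrative assumptions, not something taken from the question:

```python
import pandas as pd

# Hypothetical sample mirroring the Population column from the question.
df = pd.DataFrame({"Population": [".", "12,345", "987,654"]})

# Remove the thousands separators as literal text (regex=False).
cleaned = df["Population"].str.replace(",", "", regex=False)

# Coerce anything non-numeric (here the "." placeholder) to NaN,
# then switch to the nullable Int64 dtype so missing values are allowed.
df["Population"] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")
```

Using pd.to_numeric with errors="coerce" rather than astype(int) keeps the conversion from failing on the placeholder rows.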
pd.read_csv accepts a thousands parameter (for example thousands=','), which defaults to None.
data = """
State City Population Poverty_Rate Median_Age
VA XYZ 500,00 10.5% 42
MD ABC 12,345 8.9% .
NY . 987,654 . 41"""
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(data), sep=r'\s+', thousands=',')
print(df)
State City Population Poverty_Rate Median_Age
0 VA XYZ 50000 10.5% 42
1 MD ABC 12345 8.9% .
2 NY . 987654 . 41
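For a comma-delimited file, the thousands option can be combined with na_values so the '.' placeholders are read as missing at load time. A rough sketch, assuming the comma-containing numbers are quoted in the file (otherwise they would be split into separate fields); the csv_text sample below is hypothetical:

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text; numbers containing commas are quoted
# so the parser does not split them into separate fields.
csv_text = '''State,City,Population,Poverty_Rate,Median_Age
VA,XYZ,.,10.5%,42
MD,ABC,"12,345",8.9%,.
NY,.,"987,654",.,41
'''

# thousands="," strips the separators; na_values=["."] marks "." as missing.
df = pd.read_csv(StringIO(csv_text), thousands=",", na_values=["."])
```

Poverty_Rate still holds strings here because of the '%' sign, so that column needs a separate cleanup step.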
Ideally, you would replace the stray string tokens and then coerce the string columns to integers/floats.
#using your dict.
int_cols = ({"Population": int, "Poverty_Rate": float, "Median_Age": int })
for col in int_cols.keys():
df[col] = pd.to_numeric(df[col].astype(str).str.replace('%',''),errors='coerce')
print(df.dtypes)
State object
City object
Population int64
Poverty_Rate float64
Median_Age float64
dtype: object
print(df)
State City Population Poverty_Rate Median_Age
0 VA XYZ 50000 10.5 42.0
1 MD ABC 12345 8.9 NaN
2 NY . 987654 NaN 41.0
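If the file was read without thousands=',', both the commas and the percent signs can be stripped in one pass before coercing. A small sketch under that assumption; to_number is a hypothetical helper name and the sample frame is made up to match the question's columns:

```python
import pandas as pd


def to_number(series: pd.Series) -> pd.Series:
    """Strip ',' and '%' from string values, then convert to numbers.

    Hypothetical helper: entries that still cannot be parsed
    (such as '.') become NaN.
    """
    text = series.astype(str).str.replace(r"[,%]", "", regex=True)
    return pd.to_numeric(text, errors="coerce")


# Made-up sample with the same columns as in the question.
df = pd.DataFrame({
    "Population": [".", "12,345", "987,654"],
    "Poverty_Rate": ["10.5%", "8.9%", "."],
    "Median_Age": ["42", ".", "41"],
})

for col in ["Population", "Poverty_Rate", "Median_Age"]:
    df[col] = to_number(df[col])
```

Columns that contain a missing value end up as float64, since NaN cannot be stored in a plain int64 column.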