Data munging in pandas
I have a CSV file with rows that look like this:
ID,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
I can read it in with
#!/usr/bin/env python
import pandas as pd
import sys
filename = sys.argv[1]
df = pd.read_csv(filename)
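Given a particular column, I would like to split the rows by ID and then output the mean and standard deviation for each ID.
My first problem is, how can I remove all the non-numeric parts from the numbers such as "100M" and "0N#", which should be 100 and 0 respectively?
I also tried looping over the relevant headers and using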
df[header].replace(regex=True,inplace=True,to_replace=r'\D',value=r'')
as suggested in Pandas DataFrame: remove unwanted parts from strings in a column.
However, this changes 98.4 into 984.
Use str.extract:
In [356]:
import io
import pandas as pd
t="""ID,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[356]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 \
0 ID 98.4 100M 55M 65M 75M 100M 75M 65M 100M 98M 100M 100M 92M
14 15
0 0# 0N#
In [357]:
for col in df.columns[2:]:
    # expand=False returns a Series for a single capture group
    df[col] = df[col].str.extract(r'(\d+)', expand=False).astype(int)
df
Out[357]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 ID 98.4 100 55 65 75 100 75 65 100 98 100 100 92 0 0
If you have floats, you can use the following regex:
In [379]:
t="""ID,98.4,100.50M,55.234M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[379]:
0 1 2 3 4 5 6 7 8 9 10 11 \
0 ID 98.4 100.50M 55.234M 65M 75M 100M 75M 65M 100M 98M 100M
12 13 14 15
0 100M 92M 0# 0N#
In [380]:
for col in df.columns[2:]:
    # the builtin float replaces np.float, which was removed in NumPy 1.24
    df[col] = df[col].str.extract(r'(\d+\.?\d+)', expand=False).astype(float)
df
Out[380]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 ID 98.4 100.5 55.234 65 75 100 75 65 100 98 100 100 92 NaN NaN
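So (\d+\.?\d+) looks for a group consisting of \d+ (one or more digits), \.? (an optional decimal point) and \d+ (one or more further digits). Because each \d+ must match at least one digit, a single-digit value such as 0# does not match, which is why the last two columns come out as NaN.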
Edit
OK, I edited my regex pattern:
In [408]:
t="""Name,97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[408]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 Name 97.7 0A 0A 65M 0A 100M 5M 75M 100M 90M 90M 99M 90M 0#
15
0 0N#
In [409]:
for col in df.columns[2:]:
    # the builtin float replaces np.float, which was removed in NumPy 1.24
    df[col] = df[col].str.extract(r'(\d+\.*\d*)', expand=False).astype(float)
df
Out[409]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 Name 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0
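To see why the first pattern drops single-digit values while the edited one keeps them, both can be tried on a few strings with the standard re module. This is only a small sketch: first_match is a hypothetical helper and the sample strings are taken from the rows above.

```python
import re

OLD = re.compile(r"(\d+\.?\d+)")  # first pattern: needs at least two digits
NEW = re.compile(r"(\d+\.*\d*)")  # edited pattern: one digit is enough


def first_match(pattern, text):
    """Return the first captured group, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


samples = ["100.50M", "55.234M", "65M", "5M", "0#", "0N#"]
for text in samples:
    # print the raw string next to what each pattern extracts
    print(text, first_match(OLD, text), first_match(NEW, text))
```

A value with no match is what str.extract turns into NaN, so the first pattern loses 5M, 0# and 0N# while the edited one keeps them.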
My first problem is, how can I remove all the non-numeric parts from the numbers such as "100M" and "0N#" which should be 100 and 0 respectively.
import re
df = pd.read_csv(yourfile, header=None)
df.columns = ['ID'] + list(df.columns)[1:]
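# strip everything except digits and the decimal point from string cells;
# cells that are already numeric pass through unchanged
df = df.stack().apply(lambda v: re.sub(r'[^0-9.]', '', v)
                      if isinstance(v, str) else v).astype(float).unstack()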
df.groupby('ID').agg(['std', 'mean'])
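Here .stack() converts the dataframe into a Series, .apply() calls the lambda for each value, re.sub() removes any non-numeric characters, .astype() converts to numeric, and .unstack() converts the Series back into a dataframe. This works equally well for integers and floats.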
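The stack/apply/unstack round trip can be followed on a tiny frame. This is a sketch under assumptions: the two-row frame and the clean helper are made up for illustration, and the character class also keeps the decimal point so that a string such as "100.50M" would survive as 100.5.

```python
import re

import pandas as pd

# Hypothetical two-row frame with one string column and one float column
df = pd.DataFrame({"ID": [1, 2], "a": ["100M", "0N#"], "b": [98.4, 55.0]})

# Series indexed by (row label, column name)
stacked = df.stack()


def clean(v):
    # Strings lose every character that is not a digit or a decimal point;
    # values that are already numeric are returned unchanged.
    return re.sub(r"[^0-9.]", "", v) if isinstance(v, str) else v


# Clean each cell, convert to float, then restore the original 2-D shape
result = stacked.apply(clean).astype(float).unstack()
print(result)
```

Note that every column, including ID, ends up as float after the round trip.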
Given a particular column, I would like to split the rows by ID and then output the mean and standard deviation for each ID.
# for all columns
df.groupby('ID').agg(['std', 'mean'])
# for specific column
df.groupby('ID')['<colname>'].agg(['std', 'mean'])
The data used in the example is as follows:
from io import StringIO  # Python 3 location of StringIO
s="""
1,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
1,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
2,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
2,98.4,100M,55M,65M,75M,100M,75M,65M,100M,98M,100M,100M,92M,0#,0N#,
"""
yourfile = StringIO(s)
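Putting the pieces together, a minimal end-to-end sketch might look like the following. The data is a shortened, made-up variant of the rows above (values differ per row so the standard deviation is not zero), and the column names passed to read_csv are hypothetical.

```python
import io
import re

import pandas as pd

# Shortened, hypothetical sample: ID, a plain float, then suffixed values
s = """1,98.4,100M,55M,0#
1,97.0,90M,65M,0N#
2,98.4,100M,75M,0#
2,96.4,80M,65M,0N#
"""

# Column names are assumptions made for this sketch
df = pd.read_csv(io.StringIO(s), header=None,
                 names=["ID", "avg", "t1", "t2", "t3"])


def clean(v):
    # Keep digits and the decimal point in strings; leave numbers alone
    return re.sub(r"[^0-9.]", "", v) if isinstance(v, str) else v


df = df.stack().apply(clean).astype(float).unstack()

# Standard deviation and mean of one column, per ID
stats = df.groupby("ID")["t1"].agg(["std", "mean"])
print(stats)
```

For ID 1 the t1 values are 100 and 90, so the mean is 95; for ID 2 they are 100 and 80, giving a mean of 90.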