Conditional replacement of values in columns A, B, C with the value in column D
I am cleaning up a messy data source that describes a hierarchy, marked up as shown below. I am using Python and pandas.
¦ A ¦ B ¦ C ¦ D ¦
-----------------
¦ x ¦ ¦ ¦ a ¦
¦ ¦ x ¦ ¦ b ¦
¦ ¦ ¦ x ¦ c ¦
¦ ¦ ¦ x ¦ d ¦
¦ x ¦ ¦ ¦ e ¦
¦ ¦ x ¦ ¦ f ¦
¦ ¦ ¦ x ¦ g ¦
¦ ¦ ¦ x ¦ h ¦
I want to generate unique IDs while preserving the hierarchy of the data. (Each parent's name is unique; please don't focus on that part.)
¦ A ¦ B ¦ C ¦ D ¦ ID ¦
-------------------------
¦ x ¦ ¦ ¦ a ¦ a ¦
¦ ¦ x ¦ ¦ b ¦ a.b ¦
¦ ¦ ¦ x ¦ c ¦ a.b.c ¦
¦ ¦ ¦ x ¦ d ¦ a.b.d ¦
¦ x ¦ ¦ ¦ e ¦ e ¦ <-- note, this is NOT e.b.d,
¦ ¦ x ¦ ¦ f ¦ e.f ¦ so when parent changes
¦ ¦ ¦ x ¦ g ¦ e.f.g ¦ fillna must not be applied
¦ ¦ ¦ x ¦ h ¦ e.f.h ¦
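My strategy is:
1. replace the 'x' values in A, B, C with the value from D
2. use pandas' forward fill on the NaNs
3. concatenate A, B and C into the ID column
Steps 2 and 3 are easy, but I can't get past step 1. I can replace the x-es with a single value: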
df[df.loc[:,'A':'C'] == 'x'] = 1
But if I try to pass df.D instead of 1, it does not work.
Please recommend an elegant, pythonic solution.
Source to use:
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
TESTDATA=StringIO("""
A;B;C;D;solution
x;;;x;x
;x;;a;xa
;x;;b;xb
;x;;c;xc
;;x;1;xc1
;;x;2;xc2
;x;;d;xd
;;x;3;xd3
;;x;4;xd4
x;;;y;y
;x;;e;ye
;;x;5;ye5
;;x;6;ye6
;x;;f;yf
;;x;7;yf7
;;x;8;yf8
;;x;9;yf9""")
# header=0: the first non-blank line holds the column names (a bool is not accepted here)
df = pd.read_csv(TESTDATA, sep=";", header=0)
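You can assign column by column with a boolean mask. This answer originally used .ix, which was removed in pandas 1.0; .loc does the same job here:
df.loc[df['A'] == 'x', 'A'] = df.loc[df['A'] == 'x', 'D']
df.loc[df['B'] == 'x', 'B'] = df.loc[df['B'] == 'x', 'D']
df.loc[df['C'] == 'x', 'C'] = df.loc[df['C'] == 'x', 'D']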
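The three assignments follow the same pattern, so they can be written as a loop. This is only a sketch on a hypothetical four-row miniature of the question's frame (the small `df` below is made up for illustration, with None standing in for the empty cells):

```python
import pandas as pd

# Hypothetical miniature of the question's frame (first four rows only).
df = pd.DataFrame({
    "A": ["x", None, None, None],
    "B": [None, "x", None, None],
    "C": [None, None, "x", "x"],
    "D": ["a", "b", "c", "d"],
})

for col in ["A", "B", "C"]:
    mask = df[col] == "x"                  # rows where this level holds the marker
    df.loc[mask, col] = df.loc[mask, "D"]  # copy D's value into the marked cells

print(df)
```

This only covers step 1 of the question; the forward fill and the concatenation still have to follow.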
Here is one approach:
import numpy as np
dt = pd.DataFrame([np.where(df[n]=='x', df['D'], df[n]) for n in ['A','B','C']]).T
dt.ffill().fillna('').apply(lambda x: '.'.join(x), axis=1).str.replace(r'\.+$', '', regex=True)
Out[213]:
0 x
1 x.a
2 x.b
3 x.c
4 x.c.1
5 x.c.2
6 x.d.2
7 x.d.3
8 x.d.4
9 y.d.4
10 y.e.4
11 y.e.5
12 y.e.6
13 y.f.6
14 y.f.7
15 y.f.8
16 y.f.9
dtype: object
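Note that in the output above some rows keep a child from the previous branch, for example row 6 (x.d.2 where the expected ID is x.d) and row 9 (y.d.4 where it should be y), because a plain ffill does not reset when the parent changes. One way around that, sketched here as an ordinary Python loop rather than a vectorized expression, is to track the current path explicitly. The helper name `hierarchy_ids` and its parameters are made up for this illustration, and the sample frame is the eight-row example from the question:

```python
import pandas as pd

def hierarchy_ids(frame, levels=("A", "B", "C"), value_col="D", sep="."):
    """Build dotted IDs by tracking the current path through the hierarchy."""
    path = []   # names of the current ancestors, one entry per level
    ids = []
    for _, row in frame.iterrows():
        # depth = position of the column that holds the 'x' marker
        depth = next(i for i, col in enumerate(levels) if row[col] == "x")
        # drop everything at or below this depth, then append the new name
        path = path[:depth] + [str(row[value_col])]
        ids.append(sep.join(path))
    return pd.Series(ids, index=frame.index)

# The eight-row example from the question; None marks an empty cell.
df = pd.DataFrame({
    "A": ["x", None, None, None, "x", None, None, None],
    "B": [None, "x", None, None, None, "x", None, None],
    "C": [None, None, "x", "x", None, None, "x", "x"],
    "D": list("abcdefgh"),
})
ids = hierarchy_ids(df)
print(ids.tolist())
```

It assumes every row has exactly one 'x' in the level columns; a row without a marker would raise StopIteration. For large frames a vectorized approach is likely faster, but the loop makes the reset on parent change easy to see.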
Not the prettiest thing ever, but something like
w0 = df.iloc[:,:3]
wx = w0 == 'x'
wempty = (wx.cumsum(axis=1) >= 1).shift(axis=1, fill_value=False)
wfilled = w0.where(~wx, df.D, axis=0).ffill()
w = w0.where(wempty, wfilled, axis=1).fillna('')
df["new_solution"] = w.apply('.'.join,axis=1).str.rstrip(".")
gives me
>>> df
A B C D solution new_solution
0 x NaN NaN x x x
1 NaN x NaN a xa x.a
2 NaN x NaN b xb x.b
3 NaN x NaN c xc x.c
4 NaN NaN x 1 xc1 x.c.1
5 NaN NaN x 2 xc2 x.c.2
6 NaN x NaN d xd x.d
7 NaN NaN x 3 xd3 x.d.3
8 NaN NaN x 4 xd4 x.d.4
9 x NaN NaN y y y
10 NaN x NaN e ye y.e
11 NaN NaN x 5 ye5 y.e.5
12 NaN NaN x 6 ye6 y.e.6
13 NaN x NaN f yf y.f
14 NaN NaN x 7 yf7 y.f.7
15 NaN NaN x 8 yf8 y.f.8
16 NaN NaN x 9 yf9 y.f.9
The trick here is the cumsum, which lets us tell the cells that should stay empty apart from the cells that should be filled.
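To see what the cumsum contributes, here is a minimal sketch on a hypothetical three-row frame with one marker per row. The names `seen` and `right_of_marker` are made up for this illustration; the mask is built by combining with `~wx` instead of shifting, which selects the same cells:

```python
import pandas as pd

# Hypothetical three-row frame: one 'x' marker per row, empty elsewhere.
w0 = pd.DataFrame({
    "A": ["x", None, None],
    "B": [None, "x", None],
    "C": [None, None, "x"],
})
wx = w0 == "x"

# Running count of markers from left to right within each row.
seen = wx.cumsum(axis=1)

# True only for cells strictly to the right of the marker:
# these must stay empty instead of being forward-filled.
right_of_marker = (seen > 0) & ~wx

print(right_of_marker)
```

Cells to the left of the marker have a running count of zero, so they remain eligible for the forward fill from the rows above.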
OK, I finally arrived at this solution using some of @DSM's tricks.
It needs only one temporary variable and solves the problem mostly with boolean masking:
# bool mask for empty cells that have non-empty cell before them
nofills = ((df.iloc[:,:3] == 'x').cumsum(axis=1) > 0) & (df.iloc[:,:3] != 'x')
# fill these with empty strings
df[nofills] = ''
# replace 'x'es with values from column D, ffill up NaNs then concat together into a new column
df['solution2'] = df.iloc[:,:3].where(df.iloc[:,:3] != 'x', df.D, axis=0).ffill().apply(''.join, axis=1)
print(df)
Result:
A B C D solution solution2
0 x x x x
1 NaN x a xa xa
2 NaN x b xb xb
3 NaN x c xc xc
4 NaN NaN x 1 xc1 xc1
5 NaN NaN x 2 xc2 xc2
6 NaN x d xd xd
7 NaN NaN x 3 xd3 xd3
8 NaN NaN x 4 xd4 xd4
9 x y y y
10 NaN x e ye ye
11 NaN NaN x 5 ye5 ye5
12 NaN NaN x 6 ye6 ye6
13 NaN x f yf yf
14 NaN NaN x 7 yf7 yf7
15 NaN NaN x 8 yf8 yf8
16 NaN NaN x 9 yf9 yf9
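One side effect visible in the result above is that df[nofills] = '' overwrites cells in A, B and C of the original frame. If the input should stay untouched, a variant along the following lines might work. It is a sketch only, run on the eight-row example from the question, and the names `depth`, `keep` and `parts` are made up for the illustration:

```python
import numpy as np
import pandas as pd

# The eight-row example from the question; None marks an empty cell.
df = pd.DataFrame({
    "A": ["x", None, None, None, "x", None, None, None],
    "B": [None, "x", None, None, None, "x", None, None],
    "C": [None, None, "x", "x", None, None, "x", "x"],
    "D": list("abcdefgh"),
})

levels = ["A", "B", "C"]
is_x = df[levels] == "x"

# Depth of each row: 0 if the marker is in A, 1 for B, 2 for C.
depth = is_x.to_numpy().argmax(axis=1)

# Copy D into the marked cell and forward-fill each level down the rows.
filled = df[levels].where(~is_x, df["D"], axis=0).ffill()

# Keep only the levels at or above each row's own depth; blank the rest.
keep = np.arange(len(levels))[None, :] <= depth[:, None]
parts = filled.where(keep, "")

# Join with '.' and drop the separators left over from blanked levels.
ids = parts.apply(".".join, axis=1).str.rstrip(".")
print(ids.tolist())
```

The rstrip would also remove a trailing '.' that is part of a real name, so this assumes the names in D never end with the separator.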
Any comments or suggestions are much appreciated.