Python pandas method chaining: assign column from strsplit
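I'm having trouble with the assign method when I want to create a new column from the split of another column. If I select (index) a value from the output of split, I get the error ValueError: Length of values does not match length of index. If I just apply the split without selecting any value, I get a new column containing lists.
Here is the output if I don't index the result of the split method: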
import pandas as pd

(
    pd.DataFrame({
        "Gene": ["G1", "G1", "G2", "G2"],
        "Sample": ["H1_T1", "H2_T1", "H1_T1", "H2_T1"]
    })
    # no indexing after split: each cell holds the full list
    .assign(Timepoint = lambda x: x.Sample.str.split("_"))
)
  Gene Sample Timepoint
0   G1  H1_T1  [H1, T1]
1   G1  H2_T1  [H2, T1]
2   G2  H1_T1  [H1, T1]
3   G2  H2_T1  [H2, T1]
Here is an example where I try to select the T1 or T2 value from the Sample column, which raises an error:
(
    pd.DataFrame({
        "Gene": ["G1", "G1", "G2", "G2"],
        "Sample": ["H1_T1", "H2_T1", "H1_T1", "H2_T1"]
    })
    .assign(Timepoint = lambda x: x.Sample.str.split("_")[1])
)
The error I get is:
/home/user/anaconda3/lib/python3.4/site-packages/pandas/core/series.py in _sanitize_index(data, index, copy)
2739
2740 if len(data) != len(index):
-> 2741 raise ValueError('Length of values does not match length of '
2742 'index')
2743
ValueError: Length of values does not match length of index
IIUC, you need an additional call to .str to select the element from each list:
In [234]:
pd.DataFrame({
"Gene": ["G1", "G1", "G2", "G2"],
"Sample": ["H1_T1", "H2_T1", "H1_T1", "H2_T1"]
}).assign(Timepoint = lambda x: x.Sample.str.split("_").str[1])
Out[234]:
  Gene Sample Timepoint
0   G1  H1_T1        T1
1   G1  H2_T1        T1
2   G2  H1_T1        T1
3   G2  H2_T1        T1
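For what it's worth, there are a couple of other spellings that should give the same column inside a chain. This is only a sketch using the same toy frame: .str.get(1) is the method form of .str[1], and split(..., expand=True) returns a DataFrame whose column 1 holds the second piece. The variable names (base, via_get, via_expand) are just mine for illustration.

```python
import pandas as pd

base = pd.DataFrame({
    "Gene": ["G1", "G1", "G2", "G2"],
    "Sample": ["H1_T1", "H2_T1", "H1_T1", "H2_T1"],
})

# .str.get(1) picks element 1 of each list, same as .str[1]
via_get = base.assign(Timepoint=lambda x: x.Sample.str.split("_").str.get(1))

# expand=True spreads the pieces into columns 0, 1, ...; [1] then selects a column
via_expand = base.assign(
    Timepoint=lambda x: x.Sample.str.split("_", expand=True)[1]
)

print(via_get)
print(via_expand)
```

Which one to use is mostly taste; expand=True is handy if you also want the first piece as its own column.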
If we modify your df slightly and look at the output of the split:
In [237]:
df = pd.DataFrame({
"Gene": ["G1", "G1", "G2", "G2"],
"Sample": ["H1_T1", "H2_T2", "H1_T3", "H2_T4"]
})
df['Sample'].str.split("_")
Out[237]:
0 [H1, T1]
1 [H2, T2]
2 [H1, T3]
3 [H2, T4]
dtype: object
Then your attempt amounts to the following:
In [238]:
df['Sample'].str.split("_")[1]
Out[238]:
['H2', 'T2']
You can see that this selects the second row (index label 1) of the Series, whereas what you want is the second element of each row:
In [239]:
df['Sample'].str.split("_").str[1]
Out[239]:
0 T1
1 T2
2 T3
3 T4
dtype: object
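To tie this back to the ValueError: as far as I can tell, the plain [1] hands assign a single Python list of length 2, which it then tries to use as a column for a 4-row frame. A small sketch that reproduces this on the modified df (the names parts, picked and error_message are only for illustration, and the exact wording of the message may differ between pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    "Gene": ["G1", "G1", "G2", "G2"],
    "Sample": ["H1_T1", "H2_T2", "H1_T3", "H2_T4"],
})

parts = df["Sample"].str.split("_")

# [1] is a lookup on the Series index, so it returns one row's list
picked = parts[1]
print(type(picked), len(picked), len(df))

# assigning a length-2 list as a column of a 4-row frame fails
try:
    df.assign(Timepoint=picked)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```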