Truncate rows of a pandas DataFrame
Code to create the sample DataFrame:
import pandas as pd

Sample = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': [[.332, .326], [.058, .138]]},
          {'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': [[.234, .246], [.234, .395], [.013, .592]]},
          {'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': [[.084, .23], [.745, .923], [.925, .843]]}]
df = pd.DataFrame(Sample)
The sample DataFrame, visualized:
df:
account   | Jan | Feb | Mar
Jones LLC | 150 | 200 | [[.332, .326], [.058, .138]]
Alpha Co  | 200 | 210 | [[.234, .246], [.234, .395], [.013, .592]]
Blue Inc  | 50  | 90  | [[.084, .23], [.745, .923], [.925, .843]]
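I am looking for a way to truncate the 'Mar' column so that any row whose shape is larger than (2, x) is cut down to its first two entries, resulting in the following df:
df:
account   | Jan | Feb | Mar
Jones LLC | 150 | 200 | [[.332, .326], [.058, .138]]
Alpha Co  | 200 | 210 | [[.234, .246], [.234, .395]]
Blue Inc  | 50  | 90  | [[.084, .23], [.745, .923]]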
Cell-level operations like this are easy to do with the .apply function combined with a lambda:
df["Mar"] = df["Mar"].apply(lambda x: x[:2])
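One detail worth checking is what happens to rows that are already short. Python slicing never raises on a list shorter than the slice, so those rows should pass through unchanged. A minimal sketch, assuming a made-up Series in which the third row holds a single pair (that row is not part of the question's data):

```python
import pandas as pd

# Hypothetical data: the third row is deliberately shorter than the cut-off.
mar = pd.Series([
    [[0.332, 0.326], [0.058, 0.138]],
    [[0.234, 0.246], [0.234, 0.395], [0.013, 0.592]],
    [[0.084, 0.23]],
])

# Keep at most the first two pairs of every cell.
truncated = mar.apply(lambda pairs: pairs[:2])

# Length of each cell after truncation.
lengths = truncated.apply(len).tolist()
print(lengths)
```

The three-pair row is cut to two, while the one-pair row keeps its single pair.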
The str accessor is designed for string operations, but it also works for slicing other iterables such as lists:
df['Mar'] = df['Mar'].str[:2]
df
Out:
Feb Jan Mar account
0 200 150 [[0.332, 0.326], [0.058, 0.138]] Jones LLC
1 210 200 [[0.234, 0.246], [0.234, 0.395]] Alpha Co
2 90 50 [[0.084, 0.23], [0.745, 0.923]] Blue Inc
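The same accessor can also report how long each list is, which is handy for checking which rows would be affected before slicing. A small sketch, assuming the 'Mar' values from the question are held in a standalone Series (the variable names here are only illustrative):

```python
import pandas as pd

# The 'Mar' column from the question, as a standalone Series.
mar = pd.Series([
    [[0.332, 0.326], [0.058, 0.138]],
    [[0.234, 0.246], [0.234, 0.395], [0.013, 0.592]],
    [[0.084, 0.23], [0.745, 0.923], [0.925, 0.843]],
])

# str.len() counts the elements of each list, not characters.
needs_truncating = mar.str.len() > 2

# str[:2] slices each list to its first two elements.
sliced = mar.str[:2]

print(needs_truncating.tolist())
print(sliced.str.len().tolist())
```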
Pandas does not play well with a Series of lists, so it may be better to pull the data out before processing it:
df['Mar'] = [row[:2] for row in df['Mar'].tolist()]
%timeit results, alongside the approaches from ayhan's and Marjan's excellent answers:
3 rows:
%timeit df['Mar'].str[:2]
10000 loops, best of 3: 154 µs per loop
%timeit df['Mar'].apply(lambda x: x[:2])
10000 loops, best of 3: 133 µs per loop
%timeit [row[:2] for row in df['Mar'].tolist()]
The slowest run took 5.51 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 12 µs per loop
>>>5.51*12
66.12
3,000,000 rows:
%timeit df['Mar'].str[:2]
1 loop, best of 3: 1.23 s per loop
%timeit df['Mar'].apply(lambda x: x[:2])
1 loop, best of 3: 1.25 s per loop
%timeit [row[:2] for row in df['Mar'].tolist()]
1 loop, best of 3: 940 ms per loop
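For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard-library timeit module. The row count and the repeated row are assumptions chosen to keep the run short, so the absolute numbers will not match the ones above.

```python
import timeit

import pandas as pd

# Hypothetical benchmark input: one three-pair row repeated n_rows times.
n_rows = 10_000
row = [[0.234, 0.246], [0.234, 0.395], [0.013, 0.592]]
mar = pd.Series([list(row) for _ in range(n_rows)])

# The three approaches being compared.
candidates = {
    "str accessor": lambda: mar.str[:2],
    "apply + lambda": lambda: mar.apply(lambda x: x[:2]),
    "list comprehension": lambda: [r[:2] for r in mar.tolist()],
}

timings = {}
for name, func in candidates.items():
    # Best of three single runs, similar in spirit to %timeit's "best of 3".
    timings[name] = min(timeit.repeat(func, repeat=3, number=1))
    print(f"{name}: {timings[name] * 1e3:.2f} ms")

# Sanity check: all three approaches should produce the same lists.
results = [list(func()) for func in candidates.values()]
```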
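If you are willing to split the list pairs into two columns, you can get a method that scales better to larger DataFrames than the ones above:
>>> pd.DataFrame(df['Mar'].tolist()).iloc[:, :2]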
0 1
0 [0.332, 0.326] [0.058, 0.138]
1 [0.234, 0.246] [0.234, 0.395]
2 [0.084, 0.23] [0.745, 0.923]
On 3,000,000 rows:
%timeit pd.DataFrame(df['Mar'].tolist()).iloc[:, :2]
1 loop, best of 3: 276 ms per loop
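If the two-column layout is acceptable, the split columns can be attached back to the rest of the frame. A sketch of one way to do that, assuming the column names Mar_0 and Mar_1 (these names are invented for the example, not taken from the answer):

```python
import pandas as pd

sample = [
    {'account': 'Jones LLC', 'Jan': 150, 'Feb': 200,
     'Mar': [[.332, .326], [.058, .138]]},
    {'account': 'Alpha Co', 'Jan': 200, 'Feb': 210,
     'Mar': [[.234, .246], [.234, .395], [.013, .592]]},
    {'account': 'Blue Inc', 'Jan': 50, 'Feb': 90,
     'Mar': [[.084, .23], [.745, .923], [.925, .843]]},
]
df = pd.DataFrame(sample)

# Expand each list into columns, then keep only the first two columns.
pairs = pd.DataFrame(df['Mar'].tolist(), index=df.index).iloc[:, :2]
pairs.columns = ['Mar_0', 'Mar_1']  # hypothetical column names

# Replace the original list column with the two split columns.
result = pd.concat([df.drop(columns='Mar'), pairs], axis=1)
print(result)
```

Passing index=df.index keeps the new columns aligned with the original rows if the frame does not use the default integer index.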