Unpack the list element of DataFrame
I have this df:
l1 = ['a', 'b', 'c']
l2 = ['x', ['y1', 'y2', 'y3'], 'z']
df = pd.DataFrame(list(zip(l1, l2)), columns = ['l1', 'l2'])
Result:
l1 l2
0 a x
1 b [y1, y2, y3]
2 c z
I need to unpack the inner list in l2 and repeat the corresponding l1 value for each of its elements, like this:
l1 l2
0 a x
1 b y1
2 b y2
3 b y3
4 c z
What is the right way to do this?
Thanks
You can use a nested list comprehension combined with itertools.zip_longest.
import pandas as pd
from itertools import zip_longest
l1 = ['a', 'b', 'c']
l2 = ['x', ['y1', 'y2', 'y3'], 'z']
expanded = [(left, right) for outer in zip(l1, l2)
            for left, right in zip_longest(*outer, fillvalue=outer[0])]
pd.DataFrame(expanded)
The result is:
0 1
0 a x
1 b y1
2 b y2
3 b y3
4 c z
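To me, this is on the edge of being too long for a list comprehension. It also assumes that l1 contains no lists, because the l1 value is what zip_longest uses as the fill value.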
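The comprehension above also relies on every non-list value in l2 being a single character, because zip_longest iterates over a string character by character. A minimal sketch of a variant that wraps scalars in a one-element list first; the helper name as_list and the 'zz' test value are hypothetical and not part of the original answer:

```python
import pandas as pd

l1 = ['a', 'b', 'c']
l2 = ['x', ['y1', 'y2', 'y3'], 'zz']  # 'zz' is a multi-character scalar


def as_list(value):
    """Hypothetical helper: wrap a scalar so it is iterated as a single item."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


# Pair each l1 value with every element of the matching l2 entry
expanded = [(left, right)
            for left, items in zip(l1, l2)
            for right in as_list(items)]
result = pd.DataFrame(expanded, columns=['l1', 'l2'])
print(result)
```

Passing columns= also gives the result the original column names instead of 0 and 1.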
Brute force, iterating over the DataFrame:
for idx in df.index:
    # Wrap a scalar in a list so that every item in "l2" can be iterated
    value = df.loc[idx, "l2"]
    item = value if isinstance(value, (list, tuple)) else [value]
    for element in item:
        print(df.loc[idx, "l1"], element)
This prints:
a x
b y1
b y2
b y3
c z
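The loop above only prints the pairs. If a DataFrame is needed, one option is to collect the rows in a list and build the frame once at the end, which avoids growing a DataFrame row by row. A sketch under that assumption; the names rows and unpacked are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'l1': ['a', 'b', 'c'],
                   'l2': ['x', ['y1', 'y2', 'y3'], 'z']})

rows = []
for l1_value, l2_value in zip(df['l1'], df['l2']):
    # Wrap a scalar in a list so both cases are handled by the same loop
    items = l2_value if isinstance(l2_value, (list, tuple)) else [l2_value]
    for element in items:
        rows.append({'l1': l1_value, 'l2': element})

# Build the DataFrame once, after all rows have been collected
unpacked = pd.DataFrame(rows, columns=['l1', 'l2'])
print(unpacked)
```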
For a DataFrame with a variable number of columns, this is what I do now:
l1 = ['a', 'b', 'c']
l2 = ['x', ['y1', 'y2', 'y3'], 'z']
df = pd.DataFrame(list(zip(l1, l2)), columns = ['l1', 'l2'])
Since pandas 0.25.0 there is a built-in explode method that does exactly this, preserving the original index:
df.explode('l2')
Result:
l1 l2
0 a x
1 b y1
1 b y2
1 b y3
2 c z
If you need to reset the index:
df.explode('l2').reset_index(drop=True)
Result:
l1 l2
0 a x
1 b y1
2 b y2
3 b y3
4 c z
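On pandas 1.1.0 or later, explode also accepts ignore_index=True, which should give the same renumbered index in a single call. A short sketch, assuming such a pandas version is installed:

```python
import pandas as pd

df = pd.DataFrame({'l1': ['a', 'b', 'c'],
                   'l2': ['x', ['y1', 'y2', 'y3'], 'z']})

# ignore_index=True (pandas >= 1.1.0) labels the resulting rows 0..n-1
flat = df.explode('l2', ignore_index=True)
print(flat)
```

explode only expands list-like cells, so scalar strings in l2 are left as they are rather than split into characters.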
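Old answer:
df2 = pd.DataFrame(columns=df.columns,index=df.index)[0:0]
for idx in df.index:
    new_row = df.loc[idx, :].copy()
    # .ix and Series.set_value were removed in pandas 1.0;
    # use .loc and plain item assignment instead.
    # Note: a plain string in l2 is iterated character by character.
    for res in df.loc[idx, 'l2']:
        new_row['l2'] = res
        df2.loc[len(df2)] = new_row
It works, but it looks a lot like brute force.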
I think you can use numpy.repeat to repeat the l1 values by the lengths from str.len, and chain to flatten the values of the nested lists:
import numpy as np
import pandas as pd
from itertools import chain

df1 = pd.DataFrame({
    "l1": np.repeat(df.l1.values, df.l2.str.len()),
    "l2": list(chain.from_iterable(df.l2))})
print(df1)
l1 l2
0 a x
1 b y1
2 b y2
3 b y3
4 c z
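Like the zip_longest version, this treats a scalar string in l2 as a sequence of characters, so it matches the expected output only while those strings have length 1. A hedged sketch that normalizes every cell to a list before repeating; the intermediate name as_lists and the 'zz' value are illustrative, and map(len) is used here in place of str.len:

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({'l1': ['a', 'b', 'c'],
                   'l2': ['x', ['y1', 'y2', 'y3'], 'zz']})

# Wrap scalars so that every cell of l2 is a list
as_lists = df['l2'].apply(
    lambda v: list(v) if isinstance(v, (list, tuple)) else [v])

df1 = pd.DataFrame({
    # Repeat each l1 value once per element of its l2 list
    'l1': np.repeat(df['l1'].to_numpy(), as_lists.map(len).to_numpy()),
    # Flatten the lists into one long sequence
    'l2': list(chain.from_iterable(as_lists)),
})
print(df1)
```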
Timings:
import string
# [100000 rows x 2 columns]
np.random.seed(10)
N = 100000
l1 = ['a', 'b', 'c']
l1 = np.random.choice(l1, N)
l2 = [list(tuple(string.ascii_letters[:np.random.randint(1, 10)])) for _ in np.arange(N)]
df = pd.DataFrame({"l1":l1, "l2":l2})
df.l2 = df.l2.apply(lambda x: x if len(x) !=1 else x[0])
#print (df)
In [91]: %timeit (pd.DataFrame([(left, right) for outer in zip(l1, l2) for left, right in zip_longest(*outer, fillvalue=outer[0])]))
1 loop, best of 3: 242 ms per loop
In [92]: %timeit (pd.DataFrame({ "l1": np.repeat(df.l1.values, df.l2.str.len()), "l2": list(chain.from_iterable(df.l2))}))
10 loops, best of 3: 84.6 ms per loop
Conclusion:
numpy.repeat is about 3 times faster than the zip_longest solution on the larger df.
Edit:
Comparing with the loop version requires a smaller df, because it is very slow:
#[1000 rows x 2 columns]
np.random.seed(10)
N = 1000
l1 = ['a', 'b', 'c']
l1 = np.random.choice(l1, N)
l2 = [list(tuple(string.ascii_letters[:np.random.randint(1, 10)])) for _ in np.arange(N)]
df = pd.DataFrame({"l1":l1, "l2":l2})
df.l2 = df.l2.apply(lambda x: x if len(x) !=1 else x[0])
#print (df)
def alexey(df):
    df2 = pd.DataFrame(columns=df.columns,index=df.index)[0:0]
    for idx in df.index:
        new_row = df.loc[idx, :].copy()
        # .ix and Series.set_value were removed in pandas 1.0
        for res in df.loc[idx, 'l2']:
            new_row['l2'] = res
            df2.loc[len(df2)] = new_row
    return df2
print(alexey(df))
In [20]: %timeit (alexey(df))
1 loop, best of 3: 11.4 s per loop
In [21]: %timeit pd.DataFrame([(left, right) for outer in zip(l1, l2) for left, right in zip_longest(*outer, fillvalue=outer[0])])
100 loops, best of 3: 2.57 ms per loop
In [22]: %timeit pd.DataFrame({ "l1": np.repeat(df.l1.values, df.l2.str.len()), "l2": list(chain.from_iterable(df.l2))})
The slowest run took 4.42 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.41 ms per loop
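The timings above were taken with IPython's %timeit and predate explode. To rerun the comparison on a current setup, something like the following sketch with the standard-library timeit module could be used. The function names with_repeat and with_explode are illustrative, every l2 cell is kept as a list so both functions see the same input, and no timing results are claimed here:

```python
import string
import timeit
from itertools import chain

import numpy as np
import pandas as pd

np.random.seed(10)
N = 1000
l1 = np.random.choice(['a', 'b', 'c'], N)
# Every cell of l2 is a list of 1 to 9 single letters
l2 = [list(string.ascii_letters[:np.random.randint(1, 10)]) for _ in range(N)]
df = pd.DataFrame({'l1': l1, 'l2': l2})


def with_repeat(df):
    """numpy.repeat plus itertools.chain, as in the answer above."""
    return pd.DataFrame({
        'l1': np.repeat(df['l1'].to_numpy(), df['l2'].map(len).to_numpy()),
        'l2': list(chain.from_iterable(df['l2'])),
    })


def with_explode(df):
    """Built-in DataFrame.explode (pandas >= 0.25.0)."""
    return df.explode('l2').reset_index(drop=True)


for func in (with_repeat, with_explode):
    seconds = timeit.timeit(lambda: func(df), number=10)
    print(f'{func.__name__}: {seconds / 10 * 1000:.2f} ms per call')
```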