Faster way to flatten list in Pandas Dataframe

Question

I have the dataframe below:

import pandas
df = pandas.DataFrame({"terms" : [[['the', 'boy', 'and', 'the goat'],['a', 'girl', 'and', 'the cat']], [['fish', 'boy', 'with', 'the dog'],['when', 'girl', 'find', 'the mouse'], ['if', 'dog', 'see', 'the cat']]]})

The result I want looks like this:

df2 = pandas.DataFrame({"terms" : ['the boy and the goat','a girl and the cat',  'fish boy with the dog','when girl find the mouse', 'if dog see the cat']})

Is there a simple way to do this without using a for loop to go through every row, every element, and every substring, as I do here:

result = pandas.DataFrame()
for i in range(len(df.terms.tolist())):
    x = df.terms.tolist()[i]
    for y in x:
        z = str(y).replace(",",'').replace("'",'').replace('[','').replace(']','')
        flattened = pandas.DataFrame({'flattened_term':[z]})
        result = result.append(flattened)

print(result)

Thanks.

Answer 1

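There is certainly no way to avoid looping here, at least not implicitly. Pandas was not designed to hold list objects as elements; it handles numeric data very well and strings reasonably well. In any case, your fundamental problem is that you are calling DataFrame.append in a loop, which is a quadratic-time algorithm (the entire data frame is re-created on every iteration). You can probably get away with just the following, which should be much faster: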

>>> df
                                               terms
0  [[the, boy, and, the goat], [a, girl, and, the...
1  [[fish, boy, with, the dog], [when, girl, find...
>>> pandas.DataFrame([' '.join(term) for row in df.itertuples() for term in row.terms])
                          0
0      the boy and the goat
1        a girl and the cat
2     fish boy with the dog
3  when girl find the mouse
4        if dog see the cat
>>>
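
As a follow-up sketch (not part of the original answer): the same idea can be written as a standalone script that builds every joined string first and constructs the DataFrame only once, so nothing is re-created per iteration. The column name 'terms' is chosen here only to match df2 from the question. The second variant assumes pandas 0.25 or later, where Series.explode is available; whether it is actually faster than the list comprehension on your data is something you would need to measure.

```python
import pandas

df = pandas.DataFrame({"terms": [
    [['the', 'boy', 'and', 'the goat'], ['a', 'girl', 'and', 'the cat']],
    [['fish', 'boy', 'with', 'the dog'], ['when', 'girl', 'find', 'the mouse'],
     ['if', 'dog', 'see', 'the cat']],
]})

# Variant 1: join each inner list of words, collect everything in a plain
# Python list, and build the DataFrame a single time at the end.
flat = [' '.join(words) for sentences in df['terms'] for words in sentences]
result = pandas.DataFrame({'terms': flat})

# Variant 2 (assumes pandas >= 0.25): explode turns each inner list into its
# own row, then .str.join glues the words of each row together.
result2 = (
    df['terms']
    .explode()
    .str.join(' ')
    .reset_index(drop=True)
    .to_frame()
)

print(result)
print(result2)
```

Both variants still loop over the data internally; the gain over the code in the question comes from not calling append once per row.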
