Assistance with splitting data frame to new columns

Question

I am having trouble splitting a data frame column on _ and creating new columns from the parts.

Original string

AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt    as section

My current code

df = pd.DataFrame(myList,columns=['section','text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
df['section'] = df['section'].str.replace('excerpt.txt', '', regex=False).str.replace(r'\d{10}_|\d{8}_', '', regex=True)
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',',index=False)

Output:

section                                 text
AMAT_10Q_Filing Section: Risk Factors_  The COVID-19 pandemic and global measures taken in response 
                                        thereto have adversely impacted, and may continue to adversely 
                                        impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_  The COVID-19 pandemic and measures taken in response by 
                                        governments and businesses worldwide to contain its spread, 
                                        
AMAT_10Q_Filing Section: Risk Factors_  The degree to which the pandemic ultimately impacts Applied’s 
                                        financial condition and results of operations and the global 
                                        economy will depend on future developments beyond our control

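I would really like to split 'section' into new columns on "_". I have tried many regex variations to split 'section', but all of them either gave me headers with no values filled in or added the new columns after section and text, which is not useful. I should also add that there are about 100,000 observations.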

Desired result:

Ticker  Filing type  Section                       Text
AMAT    10Q          Filing Section: Risk Factors  The COVID-19 pandemic and global measures taken in response

Any guidance would be appreciated.

Answer 1

If you always know the number of splits, you can do something like this:

import pandas as pd

df = pd.DataFrame({ "a": [ "test_a_b", "test2_c_d" ] })

# Split column "a" on "_" into a list of parts per row
items = df["a"].str.split("_")

# list.pop removes and returns the last element of each list in place,
# so each successive call yields the next part counting from the end.
# Last part goes to "b"
df["b"] = items.apply(list.pop)

# Second-to-last part goes to "c"
df["c"] = items.apply(list.pop)

# Remaining (first) part goes to "d"
df["d"] = items.apply(list.pop)

This way, the dataframe becomes:

           a  b  c      d
0   test_a_b  b  a   test
1  test2_c_d  d  c  test2
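
As a possible alternative that does not rely on mutating the lists, here is a minimal sketch using `str.split` with `expand=True`. It assumes every value in "a" has exactly three parts; the name `parts` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": ["test_a_b", "test2_c_d"]})

# expand=True returns a DataFrame with one column per split part
parts = df["a"].str.split("_", expand=True)

# Name the parts so they match the columns produced by the pop-based version
parts.columns = ["d", "c", "b"]

# Attach the new columns to the original frame (aligned on the index)
df = df.join(parts)
print(df)
```

The new columns are appended in split order here (d, c, b), so a separate reordering step may still be wanted.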

Since you want the columns in a specific order, you can reorder the dataframe's columns as follows:

>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
       d  c  b          a
0   test  a  b   test_a_b
1  test2  c  d  test2_c_d
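
Applied to the file names in the question, the same idea might look like the sketch below. It assumes the raw names always have six "_"-separated parts and that splitting happens before any of the `str.replace` calls. The second sample row, the `text` values, and the names `parts`, `section_names`, and `result` are hypothetical; the section labels are taken from the question's own replacements.

```python
import pandas as pd

# Hypothetical sample rows in the raw file-name format from the question
df = pd.DataFrame({
    "section": [
        "AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt",
        "AMAT_0000006951_10Q_20200726_Item2_excerpt.txt",
    ],
    "text": ["first excerpt", "second excerpt"],
})

# Split once on "_"; expand=True yields one column per part (six here)
parts = df["section"].str.split("_", expand=True)

# Map the item codes to the labels used in the question's code
section_names = {
    "Item1A": "Filing Section: Risk Factors",
    "Item2": "Filing Section: Management Discussion and Analysis",
}

# Build the frame in the desired column order; unmapped codes become NaN
result = pd.DataFrame({
    "Ticker": parts[0],
    "Filing type": parts[2],
    "Section": parts[4].map(section_names),
    "Text": df["text"],
})
print(result)
```

Building `result` directly puts the new columns ahead of the text column, so no reordering is needed afterwards. These are column-wise operations with no explicit Python loop, so they should be manageable for roughly 100,000 rows, though that has not been measured here.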

Tags: python, regex, string, pandas, python-re