Assistance with splitting data frame to new columns
I'm having trouble splitting a data frame column on _ and creating new columns from the pieces.
Original string
AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt as section
My current code
df = pd.DataFrame(myList,columns=['section','text'])
#df['text'] = df['text'].str.replace('•','')
df['section'] = df['section'].str.replace('Item1A', 'Filing Section: Risk Factors')
df['section'] = df['section'].str.replace('Item2_', 'Filing Section: Management Discussion and Analysis')
df['section'] = df['section'].str.replace('excerpt.txt', '', regex=False).str.replace(r'\d{10}_|\d{8}_', '', regex=True)
df.to_csv("./SECParse.csv", encoding='utf-8-sig', sep=',',index=False)
Output:
section text
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and global measures taken in response
thereto have adversely impacted, and may continue to adversely
impact, Applied’s operations and financial results.
AMAT_10Q_Filing Section: Risk Factors_ The COVID-19 pandemic and measures taken in response by
governments and businesses worldwide to contain its spread,
AMAT_10Q_Filing Section: Risk Factors_ The degree to which the pandemic ultimately impacts Applied’s
financial condition and results of operations and the global
economy will depend on future developments beyond our control
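I would really like to split 'section' into new columns on "_".
I have tried many regex variations to split 'section', but all of them either gave me headers with no data under them or added columns after section and text, which is not useful. I should also add that there are about 100,000 observations.
Desired result: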
Ticker Filing type Section Text
AMAT 10Q Filing Section: Risk Factors The COVID-19 pandemic and global measures taken in response
Any guidance would be appreciated.
If you always know the number of splits, you can do this:
import pandas as pd
df = pd.DataFrame({ "a": [ "test_a_b", "test2_c_d" ] })
# Split column "a" on "_"; each row becomes a list of parts
items = df["a"].str.split("_")
# list.pop removes and returns the last part of each list (in place); store it in "b"
df["b"] = items.apply(list.pop)
# Pop the next-to-last part and store it in "c"
df["c"] = items.apply(list.pop)
# Pop the remaining (first) part and store it in "d"
df["d"] = items.apply(list.pop)
This way, the dataframe becomes
a b c d
0 test_a_b b a test
1 test2_c_d d c test2
Since you want the columns in a specific order, you can reorder the data frame's columns like this:
>>> df = df[[ "d", "c", "b", "a" ]]
>>> df
d c b a
0 test a b test_a_b
1 test2 c d test2_c_d
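For the data in the question, the same idea could also be written with str.split(..., expand=True), which returns the pieces as columns directly. This is only a sketch under a few assumptions: every file name has the same six "_"-separated parts as the example string, the split is done on the raw file name before any of the replace calls, and the two sample rows and the section_names mapping below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows shaped like the question's data:
# the raw file name in "section" and placeholder strings in "text".
df = pd.DataFrame(
    {
        "section": [
            "AMAT_0000006951_10Q_20200726_Item1A_excerpt.txt",
            "AMAT_0000006951_10Q_20200726_Item2_excerpt.txt",
        ],
        "text": ["first excerpt", "second excerpt"],
    }
)

# expand=True returns a DataFrame with one column per piece.
# Assumed layout: 0 ticker, 1 ten-digit number, 2 filing type,
# 3 eight-digit date, 4 item code, 5 "excerpt.txt".
parts = df["section"].str.split("_", expand=True)

# Illustrative mapping from item code to the labels used in the question.
section_names = {
    "Item1A": "Filing Section: Risk Factors",
    "Item2": "Filing Section: Management Discussion and Analysis",
}

result = pd.DataFrame(
    {
        "Ticker": parts[0],
        "Filing type": parts[2],
        # Codes missing from the mapping fall back to the raw item code.
        "Section": parts[4].map(section_names).fillna(parts[4]),
        "Text": df["text"],
    }
)

print(result)
```

Because the split is vectorized, there is no per-row Python function call, which may help with around 100,000 rows. If some file names have a different number of parts, the column positions above would no longer line up and would need checking first.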