Extracting table of holdings from (Edgar 13-F filings) TXT (pre-2013) with python
I am trying to extract the table of holdings from EDGAR 13-F filings. Before 2013, the holdings were provided as a txt file (see example).
My target output is a pd.DataFrame with the same shape as the "Form 13F Information Table" in the txt file (10 columns, with each holding on a separate row).
I have tried using BeautifulSoup, which turns the table into a Tag object, but I cannot figure out how to format it into a DataFrame as described above.
Here is my code attempt:
import requests
from bs4 import BeautifulSoup

soup2 = BeautifulSoup(requests.get(filing_url_13f).content, 'lxml')
holdings = soup2.find_all('table')

# This is my attempt to turn the content into a list:
lixt = []
for x in soup2.find_all(['c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c']):
    for line in x:
        lixt.append(line)
x = lixt[1]

l = []
for string in x.strings:
    l.append(repr(string))
el = l[7]
This is where I am stuck, because the following is what el returns. I cannot split it on \n, because company names often contain \n (AMERICAN\n EXPRESS CO).
\nAMERICAN\n EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -\nAMERICAN\n EXPRESS CO COM 025816109 990,116 17,225,400 Shared-Defined 4, 5 17,225,400 - -\nAMERICAN\n EXPRESS CO COM 025816109 48,274 839,832 Shared-Defined 4, 7 839,832 - -\nAMERICAN\n EXPRESS CO COM 025816109 111,689 1,943,100 Shared-Defined 4, 8, 11 1,943,100 - -\nAMERICAN\n EXPRESS CO COM 025816109 459,532 7,994,634 Shared-Defined 4, 10 7,994,634 - -\nAMERICAN\n EXPRESS CO COM 025816109 6,912,308 120,255,879 Shared-Defined 4, 11 120,255,879 - -\nAMERICAN\n EXPRESS CO COM 025816109 80,456 1,399,713 Shared-Defined 4, 13 1,399,713 - -\nARCHER DANIELS\n MIDLAND CO COM 039483102 163,151 5,956,600 Shared-Defined 4, 5 5,956,600 - -\nBANK OF NEW\n YORK MELLON\n CORP COM 064058100 206,661 8,041,300 Shared-Defined 4 8,041,300 - -\nBANK OF NEW\n YORK MELLON\n CORP COM 064058100 46,104 1,793,915 Shared-Defined 2, 4, 11 1,793,915 - -\nBANK OF NEW\n YORK MELLON\n CORP COM 064058100 251,827 9,798,700 Shared-Defined 4, 8, 11 9,798,700 - -\nCOCA COLA CO COM 191216100 29,000 800,000 Shared-Defined 4 800,000 - -\n
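For what it's worth, in the dump above every wrapped name fragment begins with a space right after the `\n`, while a real row boundary does not. Assuming that pattern holds for the whole table, a lookahead substitution can separate the two cases (a sketch; `el` below is a shortened sample of the string shown above):

```python
import re

# Shortened sample of the string above; wrapped fragments
# ("\n EXPRESS") start with a space, real row boundaries do not.
el = ("\nAMERICAN\n EXPRESS CO COM 025816109 112,209 1,952,142 "
      "Shared-Defined 4 1,952,142 - -"
      "\nCOCA COLA CO COM 191216100 29,000 800,000 Shared-Defined 4 "
      "800,000 - -\n")

# Remove only the newlines that are followed by a space (line wraps),
# then split the remainder into one line per holding.
joined = re.sub(r'\n(?= )', '', el)
rows = [r for r in joined.splitlines() if r.strip()]
print(rows[0])  # AMERICAN EXPRESS CO COM 025816109 ...
```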
Any suggestions would be much appreciated.
Yes, these old EDGAR filings are terrible (not that the new ones are much better). This one is particularly bad because longer entries were split across separate lines to make them fit the page.
So the following should get you close enough to what you want:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

req = requests.get('https://www.sec.gov/Archives/edgar/data/1067983/000119312512470800/d434976d13fhr.txt')

# Next is a helper function to stitch those longer entries back together:
def lst_bunch(l, lenth=4):
    i = 0
    while i < len(l):
        # a too-short line is a fragment, so merge the next line into it
        # (guarding against running off the end of the list)
        if len(l[i]) < lenth and i + 1 < len(l):
            l[i] += l.pop(i + 1)
        i += 1
    # if anything is still too short, make another pass
    for item in l:
        if len(item) < lenth:
            lst_bunch(l, lenth)
        else:
            return l

tabs = req.text.replace('<TABLE>', 'xxx<TABLE>').split('xxx')
for tab in tabs[2:]:
    soup = bs(tab, 'lxml')
    table = soup.select_one('table')
    lines = table.text.splitlines()
    lst_bunch(lines, 30)
    for line in lines:
        print(line.strip())
Output:
Name of Issuer Class CUSIP (In Thousands) Amount Discretion Managers Sole Shared None
AMERICAN EXPRESS CO COM 025816109 110,999 1,952,142 Shared-Defined 4 1,952,142 - -
AMERICAN EXPRESS CO COM 025816109 979,436 17,225,400 Shared-Defined 4, 5 17,225,400 - -
etc.
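To get from those printed lines to the 10-column DataFrame the question asks for, one more step is needed. Every field except the issuer name and the managers list is a single whitespace-free token, so a regex with a lazy managers group can pick the columns apart. A sketch under that assumption (the column names here are placeholders I chose, not anything EDGAR mandates):

```python
import re
import pandas as pd

# Placeholder names for the 10 columns of the information table
cols = ['issuer', 'title_of_class', 'cusip', 'value', 'shares',
        'discretion', 'managers', 'sole', 'shared', 'none']

# The managers field ("4, 8, 11") is the only middle field that can
# contain spaces, so it is matched lazily; everything else is either
# the free-form issuer name or a single token.
row_re = re.compile(
    r'^(?P<issuer>.+?)\s+'
    r'(?P<title_of_class>\S+)\s+'
    r'(?P<cusip>[0-9A-Z]{9})\s+'
    r'(?P<value>[\d,]+)\s+'
    r'(?P<shares>[\d,]+)\s+'
    r'(?P<discretion>\S+)\s+'
    r'(?P<managers>[\d, ]+?)\s+'
    r'(?P<sole>[\d,-]+)\s+'
    r'(?P<shared>[\d,-]+)\s+'
    r'(?P<none>[\d,-]+)$'
)

# Two rows from the output above, instead of collecting them in the loop
lines = [
    'AMERICAN EXPRESS CO COM 025816109 110,999 1,952,142 Shared-Defined 4 1,952,142 - -',
    'AMERICAN EXPRESS CO COM 025816109 979,436 17,225,400 Shared-Defined 4, 5 17,225,400 - -',
]
records = [m.groupdict() for m in (row_re.match(ln) for ln in lines) if m]
df = pd.DataFrame(records, columns=cols)
```

In the real loop you would append each matched `groupdict()` instead of printing, and probably also strip the thousands separators before converting the numeric columns.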