Extracting a text section from HTML (EDGAR 10-K filings)
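I am trying to extract a specific section from an HTML file. Specifically, I am looking for the "ITEM 1" section of a 10-K filing (a US annual business report of a company). For example:
https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002
Problem: I cannot locate the "ITEM 1" section, and I do not know how to tell my algorithm to search from that point ("ITEM 1") to another point (e.g. "ITEM 1A") and extract the text in between.
Any help is much appreciated.
Among other things, I have tried the following (and similar variants), but my bd is always empty: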
try:
    # bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
    # bd = soup.find_all(name="ITEM 1")
    # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])
    print(" Business Section (Item 1): ", bd.content)
except:
    print("\n Section not found!")
I am using Python 3.7 and BeautifulSoup4.
Regards, Heka
The heading contains special (whitespace) characters. Remove them first:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
# Collapse any whitespace run after "ITEM" into a single space (raw string for the regex)
doc.loadHtml(doc.replaceReg(doc.html, r'ITEM[\s]+', 'ITEM '))
item1 = doc.getElementByText('ITEM 1')
print(item1)  # {'tag': 'B', 'html': 'ITEM 1. BUSINESS'}
# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
If you are using the latest version, you can use the following method instead:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
doc = SimplifiedDoc(html)
item1 = doc.getElementByReg(r'ITEM[\s]+1')  # Pass in a regex
print(item1, item1.text)  # {'tag': 'B', 'html': 'ITEM\n 1. BUSINESS'} ITEM 1. BUSINESS
# Here's what you might use
table = item1.getParent('TABLE')
trs = table.TRs
for tr in trs:
    print(tr.TDs)
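A likely reason the `^ITEM 1$` pattern from the question never matches is that the heading text is not exactly "ITEM 1": the printed result above shows it as 'ITEM\n 1. BUSINESS', with a line break inside the heading and more text after the number. The following is only a minimal sketch of the same whitespace clean-up using the standard-library `re` module. The names `normalize_ws` and `ITEM1_HEADING` are hypothetical, introduced here for illustration, and the sample string is copied from the printed output above rather than fetched from the filing.

```python
import re

def normalize_ws(text: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends.

    For str patterns in Python 3, \\s also matches the non-breaking space
    that HTML documents often contain.
    """
    return re.sub(r"\s+", " ", text).strip()

# Tolerant heading pattern (hypothetical helper): "ITEM", any whitespace, "1",
# and then anything that is not a letter or digit, so that "ITEM 1A" and
# "ITEM 10" are not mistaken for "ITEM 1".
ITEM1_HEADING = re.compile(r"^ITEM\s+1(?![0-9A-Za-z])", re.IGNORECASE)

raw_heading = "ITEM\n 1. BUSINESS"   # shape shown in the printed output above
clean_heading = normalize_ws(raw_heading)

strict_hit = re.match(r"^ITEM 1$", raw_heading) is not None
tolerant_hit = ITEM1_HEADING.match(clean_heading) is not None

print(repr(clean_heading))
print("strict pattern matches:", strict_hit)
print("tolerant pattern matches:", tolerant_hit)
```

The same tolerant pattern could be passed to whichever parser is in use. Anchoring on an exact string is fragile because filings differ in how they break and pad their headings.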
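As I mentioned in a comment, because of the nature of EDGAR, this may work on one filing but fail on another. The principles, though, should generally apply (after some adjustments...):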
import requests
import lxml.html
url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
# In this filing, Item 1 is hiding in a series of <p> tags following a table that
# contains an <a> tag whose "name" attribute has the value "a_002"
tabs = doc.xpath('//table[./tr/td/font/a[@name="a_002"]]/following-sibling::p/font')
flag = ''
for i in tabs:
    if flag == 'stop':
        break
    if i.text is not None:  # extract the text from each <p> tag, then move to the next
        print(i.text_content().strip().replace('\n', ''))
    nxt = i.getparent().getnext()
    # Detect where the <p> tags of Item 1 end and the next Item begins, then stop
    if nxt is not None and nxt.tag == 'table':
        for j in nxt.iterdescendants():
            if j.tag == 'a' and j.get('name') == 'a_003':
                # An <a> tag whose "name" attribute is "a_003" marks the
                # beginning of the next Item, so we stop
                flag = 'stop'
The output is the text of Item 1 in this filing.
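Since anchor names such as "a_002" are specific to this one filing, a structure-independent fallback may be worth having: flatten the HTML to text and slice between the "ITEM 1." and "ITEM 1A." headings, which is the "from one point to another" idea in the question. This is a rough sketch using only the standard library, not a tested EDGAR parser. `TextExtractor`, `html_to_text`, `extract_between`, `START` and `END` are hypothetical names, `sample_html` is invented for illustration and is not taken from the filing, and keeping the longest candidate is an assumed heuristic for skipping table-of-contents entries that repeat the same headings.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html: str) -> str:
    """Flatten HTML to one line of text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

# Headings are matched loosely: any whitespace between "ITEM" and the number.
START = re.compile(r"ITEM\s+1\s*\.", re.IGNORECASE)
END = re.compile(r"ITEM\s+1A\s*\.", re.IGNORECASE)

def extract_between(text, start=START, end=END):
    """Return the longest span between a start heading and the next end heading.

    Returns None if no start/end pair is found.
    """
    best = None
    for m in start.finditer(text):
        e = end.search(text, m.end())
        if e is None:
            continue
        candidate = text[m.end():e.start()].strip()
        if best is None or len(candidate) > len(best):
            best = candidate
    return best

# Invented sample: a table of contents followed by the actual sections.
sample_html = """
<table><tr><td>Item 1.</td><td>Business</td><td>3</td></tr>
<tr><td>Item 1A.</td><td>Risk Factors</td><td>9</td></tr></table>
<p><b>ITEM
 1. BUSINESS</b></p>
<p>We design and sell widgets.</p>
<p>Our customers are located in the United States.</p>
<p><b>ITEM 1A. RISK FACTORS</b></p>
<p>Our business is subject to risks.</p>
"""

section = extract_between(html_to_text(sample_html))
print(section)
```

Joining text nodes with a space can split a word that is broken up by inline tags, and a filing that omits the period after the item number would need a looser pattern, so treat the result as a starting point to adjust per filing.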