Text scraping (from EDGAR 10-K, Amazon) code not working
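I have the following code to scrape a specific list of words from a financial statement text file (US SEC EDGAR 10-K). I have manually cross-checked the document and found the words in it, but my code does not find any of them. I am using Python 3.5.3. I would appreciate any help.
Thanks in advance.
Given the URL path of a company's (CIK) EDGAR 10-K filing in .txt format for a given year, this code performs a word count: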
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib.request as urllib2
import time
import csv
import sys

CIK = '0001018724'
Year = '2013'
string_match1 = 'edgar/data/1018724/0001193125-13-028520.txt'
url3 = 'https://www.sec.gov/Archives/' + string_match1
response3 = urllib2.urlopen(url3)
words = [
    'anticipate',
    'believe',
    'depend',
    'fluctuate',
    'indefinite',
    'likelihood',
    'possible',
    'predict',
    'risk',
    'uncertain',
]
count = {}  # is a dictionary data structure in Python
for elem in words:
    count[elem] = 0
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word] + elements.count(word)
print(CIK)
print(Year)
print(url3)
print(count)
Here is the script output:
0001018724
2013
https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt
{
'believe': 0,
'likelihood': 0,
'anticipate': 0,
'fluctuate': 0,
'predict': 0,
'risk': 0,
'possible': 0,
'indefinite': 0,
'depend': 0,
'uncertain': 0,
}
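The all-zero result can be reproduced without touching the network. Iterating over the object returned by `urllib.request.urlopen()` yields `bytes` lines, so `line.split()` produces `bytes` tokens, and in Python 3 a `bytes` token never compares equal to a `str` such as `'risk'`. The snippet below is a minimal sketch of that mismatch; `raw_line` is a made-up sample, not text taken from the filing.

```python
# One line as urlopen() would yield it: a bytes object, not a str.
# (raw_line is an invented sample for illustration only.)
raw_line = b"We believe the risk is possible; risk factors follow."

byte_tokens = raw_line.split()                  # list of bytes tokens
text_tokens = raw_line.decode("utf-8").split()  # list of str tokens

print(byte_tokens.count("risk"))  # 0: b'risk' != 'risk'
print(text_tokens.count("risk"))  # 2: exact str tokens match
```

Note that even after decoding, `list.count` only matches exact tokens, so `'possible;'` (with the trailing semicolon) would not be counted as `'possible'`.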
A simplified version of your code seems to work in Python 3.7 with the requests library:
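import requests

url = 'https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt'
response = requests.get(url)
# word list from the question
words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
         'likelihood', 'possible', 'predict', 'risk', 'uncertain']
count = {}  # is a dictionary data structure in Python
info = str(response.content)  # the whole filing as a single string
for elem in words:
    # str.count counts substrings, so 'risk' also matches inside 'risks'
    count[elem] = info.count(elem)
print(count)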
Output:
{'anticipate': 9, 'believe': 32, 'depend': 39, 'fluctuate': 4, 'indefinite': 15, 'likelihood': 15, 'possible': 25,
'predict': 6, 'risk': 55, 'uncertain': 38}
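If you would rather keep `urllib` and count whole words instead of substrings, something along these lines may work. It is only a sketch under a few assumptions: the filing is decoded as UTF-8 with undecodable bytes replaced, matching is case-insensitive, and a "word" is any run of ASCII letters. The helper names `count_words` and `count_words_in_url` are hypothetical, and the `User-Agent` value is a placeholder that you would replace with your own contact details, since sec.gov may reject requests that do not identify the client.

```python
import re
import urllib.request

WORDS = ["anticipate", "believe", "depend", "fluctuate", "indefinite",
         "likelihood", "possible", "predict", "risk", "uncertain"]


def count_words(lines, words=WORDS, encoding="utf-8"):
    """Count whole-word, case-insensitive occurrences of `words`.

    `lines` may yield bytes (as urlopen() does) or str.
    """
    counts = dict.fromkeys(words, 0)
    for line in lines:
        if isinstance(line, bytes):
            # Decode first, otherwise tokens are bytes and never match str.
            line = line.decode(encoding, errors="replace")
        for token in re.findall(r"[a-z]+", line.lower()):
            if token in counts:
                counts[token] += 1
    return counts


def count_words_in_url(url, user_agent="Example Name example@example.com"):
    """Fetch `url` and count the words; the User-Agent is a placeholder."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return count_words(response)
```

Calling `count_words_in_url(url3)` with the URL from the question would run the count against the filing. Expect numbers different from the substring counts above: here `risks` is not counted as `risk`, while capitalized forms such as `Risk` are. The .txt filing also contains markup, so letters inside tags are tokenized along with the prose.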