Text file import in Python
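My question may have been asked before, but I have not found any help for the scenario I am working on.
I have already tried several different approaches without success; any help would be greatly appreciated.
Question
I am trying to load a text file from the URL https://www.sec.gov/Archives/edgar/cik-lookup-data.txt so that I can modify the data and create a dataframe.
Example: data from the link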
1188 BROADWAY LLC:0001372374:
119 BOISE, LLC:0001633290:
11900 EAST ARTESIA BOULEVARD, LLC:0001639215:
11900 HARLAN ROAD LLC:0001398414:
11:11 CAPITAL CORP.:0001463262:
I should get the output below:
Name | number
1188 BROADWAY LLC | 0001372374
119 BOISE, LLC | 0001633290
11900 EAST ARTESIA BOULEVARD, LLC | 0001639215
11900 HARLAN ROAD LLC | 0001398414
11:11 CAPITAL CORP. | 0001463262
The first problem I ran into is loading the text file: I keep getting a 403 for the URL, HTTPError: HTTP Error 403: Forbidden
References:
- Given a URL to a text file, what is the simplest way to read the contents of the text file?
My code:
import urllib.request  # the lib that handles the url stuff
data = urllib.request.urlopen("https://www.sec.gov/Archives/edgar/cik-lookup-data.txt")  # it's a file-like object and works just like a file
for line in data:  # files are iterable
    print(line)
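This is not allowed, so you get response_code = 403.
It is good practice to check the robots.txt file before scraping any website. A robots.txt file tells search engine crawlers which URLs on a site they may access. It is used mainly to keep the site from being overloaded with requests; it is not, however, a mechanism for keeping a web page out of Google.
In your case it is https://www.sec.gov/robots.txt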
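As a rough illustration of that check, the standard library's urllib.robotparser can evaluate a robots.txt against a URL. This is only a sketch: the rules in SAMPLE_ROBOTS_TXT, the "MyCompanyBot" agent name, the example.com URLs and the is_allowed helper are all made up for the example and are not the actual content of the SEC file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It is NOT what https://www.sec.gov/robots.txt actually contains.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCompanyBot", "https://example.com/data/file.txt"))     # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCompanyBot", "https://example.com/private/file.txt"))  # False
```

For a live site, RobotFileParser also has set_url() and read() to download the file itself, although that request can be refused for the same reason as the one in the question.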
The error message returned says:
Your request has been identified as part of a network of automated
tools outside of the acceptable policy and will be managed until
action is taken to declare your traffic. Please declare your
traffic by updating your user agent to include company specific
information.
You can work around it as follows:
import urllib.request  # the request submodule must be imported explicitly
url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
hdr = {'User-Agent': 'Your Company Name admin@domain.com'}  # change as needed
req = urllib.request.Request(url, headers=hdr)
# read the whole response body and split it into a list of bytes lines
data = urllib.request.urlopen(req, timeout=60).read().splitlines()
>>> data[:10]
[b'!J INC:0001438823:',
b'#1 A LIFESAFER HOLDINGS, INC.:0001509607:',
b'#1 ARIZONA DISCOUNT PROPERTIES LLC:0001457512:',
b'#1 PAINTBALL CORP:0001433777:',
b'$ LLC:0001427189:',
b'$AVY, INC.:0001655250:',
b'& S MEDIA GROUP LLC:0001447162:',
b'&TV COMMUNICATIONS INC.:0001479357:',
b'&VEST DOMESTIC FUND II KPIV, L.P.:0001802417:',
b'&VEST DOMESTIC FUND II LP:0001800903:']
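To get from those byte strings to the Name | number layout asked for in the question, one possible approach is to split each line from the right, since names such as 11:11 CAPITAL CORP. contain colons themselves. The sketch below is not part of the original answer: parse_cik_lines is a hypothetical helper, and the latin-1 encoding is an assumption, chosen only because it can decode any byte value.

```python
def parse_cik_lines(raw_lines, encoding="latin-1"):
    """Turn b'NAME:NUMBER:' lines into a list of (name, number) tuples.

    The encoding is an assumption; adjust it if the file turns out to use another one.
    """
    rows = []
    for raw in raw_lines:
        text = raw.decode(encoding).strip()
        # Split from the right so colons inside the name are left alone:
        # 'NAME:NUMBER:' -> ['NAME', 'NUMBER', '']
        parts = text.rsplit(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, number, _trailing = parts
        rows.append((name, number))
    return rows


# Sample lines taken from the question
sample = [
    b"1188 BROADWAY LLC:0001372374:",
    b"11:11 CAPITAL CORP.:0001463262:",
]
rows = parse_cik_lines(sample)
for name, number in rows:
    print(f"{name} | {number}")
# 1188 BROADWAY LLC | 0001372374
# 11:11 CAPITAL CORP. | 0001463262
```

If pandas is installed, passing the result to pandas.DataFrame(rows, columns=["Name", "number"]) should give the dataframe; keeping the number as a string preserves its leading zeros.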