Text file import in Python

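My question may have been asked before, but I could not find any help for the scenario I am working on.

I have already tried several approaches without success. Any help would be appreciated.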

Problem

I am trying to load a text file from the URL https://www.sec.gov/Archives/edgar/cik-lookup-data.txt so that I can modify the data and create a DataFrame.

Example data from the link:

1188 BROADWAY LLC:0001372374:
119 BOISE, LLC:0001633290:
11900 EAST ARTESIA BOULEVARD, LLC:0001639215:
11900 HARLAN ROAD LLC:0001398414:
11:11 CAPITAL CORP.:0001463262:

I expect the output below:

   Name                              | number 
   1188 BROADWAY LLC                 | 0001372374 
   119 BOISE, LLC                    | 0001633290 
   11900 EAST ARTESIA BOULEVARD, LLC | 0001639215 
   11900 HARLAN ROAD LLC             | 0001398414 
   11:11 CAPITAL CORP.               | 0001463262

My first problem is loading the text file: I keep getting a 403 error, HTTPError: HTTP Error 403: Forbidden

References:

  1. Given a URL to a text file, what is the simplest way to read the contents of the text file?

My code:

import urllib.request  # the lib that handles the url stuff

data = urllib.request.urlopen("https://www.sec.gov/Archives/edgar/cik-lookup-data.txt") # it's a file like object and works just like a file
for line in data: # files are iterable
    print (line)

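This is not allowed, so you get response_code = 403. It is good practice to check the robots.txt file before scraping any web page. A robots.txt file tells search engine crawlers which URLs on a site they may access. It is used mainly to avoid overloading the site with requests; it is not a mechanism for keeping a web page out of Google.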

In your case, that file is https://www.sec.gov/robots.txt

The returned error message says:
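As a rough illustration of what checking robots.txt can look like in code, here is a minimal sketch using the standard library's urllib.robotparser. The rules text, the "my-bot" user agent, the example.com URLs and the is_allowed helper are all made up for the example; they are not the contents of the SEC file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, NOT the real https://www.sec.gov/robots.txt content.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, used only to exercise both outcomes.
allowed = is_allowed(SAMPLE_RULES, "my-bot", "https://example.com/public/data.txt")
blocked = is_allowed(SAMPLE_RULES, "my-bot", "https://example.com/private/data.txt")
print(allowed, blocked)
```

To check the real file, the downloaded text of https://www.sec.gov/robots.txt could be passed to the same helper in place of SAMPLE_RULES.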

Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic. Please declare your traffic by updating your user agent to include company specific information.

You can work around it as follows:

import urllib.request  # import the submodule explicitly; a bare "import urllib" may not expose urllib.request

url = "https://www.sec.gov/Archives/edgar/cik-lookup-data.txt"
hdr = {'User-Agent': 'Your Company Name admin@domain.com'} #change as needed

req = urllib.request.Request(url, headers=hdr) 

data = urllib.request.urlopen(req, timeout=60).read().splitlines()

>>> data[:10]
[b'!J INC:0001438823:',
 b'#1 A LIFESAFER HOLDINGS, INC.:0001509607:',
 b'#1 ARIZONA DISCOUNT PROPERTIES LLC:0001457512:',
 b'#1 PAINTBALL CORP:0001433777:',
 b'$ LLC:0001427189:',
 b'$AVY, INC.:0001655250:',
 b'& S MEDIA GROUP LLC:0001447162:',
 b'&TV COMMUNICATIONS INC.:0001479357:',
 b'&VEST DOMESTIC FUND II KPIV, L.P.:0001802417:',
 b'&VEST DOMESTIC FUND II LP:0001800903:']
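To get from these byte strings to the Name / number layout asked for in the question, one possible approach is to split each line from the right, since company names such as "11:11 CAPITAL CORP." contain colons themselves. The sketch below is only an illustration: the parse_cik_lines helper is a hypothetical name, and the latin-1 encoding is an assumption (it never fails to decode, but the file's actual encoding should be confirmed).

```python
def parse_cik_lines(lines, encoding="latin-1"):
    """Turn b'NAME:0001234567:' lines into (name, number) tuples.

    The encoding default is an assumption, not something the SEC file declares.
    """
    rows = []
    for raw in lines:
        text = raw.decode(encoding).strip()
        # Split from the right so colons inside the company name are preserved.
        parts = text.rsplit(":", 2)
        if len(parts) != 3 or parts[2] != "":
            continue  # skip blank or malformed lines
        name, number, _ = parts
        rows.append((name, number))
    return rows


# Sample lines taken from the question.
sample = [
    b"1188 BROADWAY LLC:0001372374:",
    b"119 BOISE, LLC:0001633290:",
    b"11:11 CAPITAL CORP.:0001463262:",
]
rows = parse_cik_lines(sample)
for name, number in rows:
    print(f"{name} | {number}")
```

If pandas is installed, the rows can be passed to pd.DataFrame(rows, columns=["Name", "number"]). Keeping the number as a string preserves its leading zeros.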