How to extract an IP address in web scanning

Question

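When I run the simple IP address extraction on its own, the program works fine. Inside the complete web crawler program, however, it fails and gives inconsistent results.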

Here is my IP address code snippet:

    #!/usr/bin/python3

    import os
    import re 

    def get_ip_address(url):
        command = "host " + url
        process = os.popen(command)
        results = str(process.read())
        marker = results.find("has address") + 12
        n = (results[marker:].splitlines()[0])
        m = re.search('\w+ \w+: \d\([A-Z]+\)', n)
        if m is not None:
            url_new = url[8:]
            command = "host " + url_new
            process = os.popen(command)
            results = str(process.read())
            marker = results.find("has address") + 12
            return results[marker:].splitlines()[0]

    print(get_ip_address("https://www.yahoo.com"))

The complete web scraping program looks like this:

    #!/usr/bin/python3

    from general import *
    from domain_name import *
    from ip_address import *
    from nmap import * 
    from robots_txt import *
    from whois import *

    ROOT_DIR = "companies"
    create_dir(ROOT_DIR)

    def gather_info(name, url):
        domain_name = get_domain_name(url)
        ip_address = get_ip_address(url)
        nmap = get_nmap('-F', ip_address)
        robots_txt = get_robots_txt(url)
        whois = get_whois(domain_name)
        create_report(name, url, domain_name, nmap, robots_txt, whois, ip_address)

    def create_report(name, full_url, domain_name, nmap, robots_txt, whois, ip_address):
        project_dir = ROOT_DIR + '/' + name
        create_dir(project_dir)
        write_file(project_dir + '/full_url.txt', full_url)
        write_file(project_dir + '/domain_name.txt', domain_name)
        write_file(project_dir + '/nmap.txt', nmap)
        write_file(project_dir + '/robots_txt.txt', robots_txt)
        write_file(project_dir + '/whois.txt', whois)
        write_file(project_dir + '/ip_address.txt', ip_address)

    x = input("Enter the Company Name: ")
    y = input("Enter the complete url of the company: ")    
    gather_info( x , y )

The terminal session looks like this:

    root@nitin-Lenovo-G580:~/Desktop/web_scanning# python3 main.py 
    106.10.138.240
    Enter the Company Name: Yahoo
    Enter the complete url of the company: https://www.yahoo.com/
    /bin/sh: 1: Syntax error: "(" unexpected

And the content of ip_address.txt is:

    hoo.com/ not found: 3(NXDOMAIN)

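As you can see, the program seems to run fine and prints the IP as 106.10.138.240, yet something different is saved in ip_address.txt. I also cannot figure out where the /bin/sh syntax error comes from. Please help me...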

Answer 1

Sorry, I don't have enough reputation to add a comment, so I'll post my suggestion here as an answer.

I think the problem is in process = os.popen(command) inside def get_ip_address(url). Try printing command to check whether it is a valid command.
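
One way to follow that advice without touching the network is to replay the string slicing from get_ip_address on a failed lookup. The sketch below assumes host printed a line of the form "Host <name> not found: 3(NXDOMAIN)"; both sample strings are reconstructed from the output shown in the question rather than captured, and slice_after_marker is a hypothetical helper name.

```python
def slice_after_marker(results):
    """Replay the slicing used in get_ip_address on raw `host` output."""
    # str.find() returns -1 when "has address" is absent, so marker becomes 11
    # and the slice starts at an arbitrary position instead of at an IP.
    marker = results.find("has address") + 12
    return results[marker:].splitlines()[0]


# Assumed sample output for a failed lookup (reconstructed, not captured).
failed = "Host www.yahoo.com/ not found: 3(NXDOMAIN)\n"
# Assumed sample output for a successful lookup.
ok = "www.yahoo.com has address 106.10.138.240\n"

print(repr(slice_after_marker(failed)))
print(repr(slice_after_marker(ok)))
```

Under that assumption the failed lookup produces exactly the text saved in ip_address.txt, which suggests that the trailing "/" left over after url[8:] made the second host call fail. The 106.10.138.240 printed before the prompts most likely comes from the print(...) call at the bottom of the snippet, which runs when main.py imports the module.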

Apart from the problem itself, a few suggestions:

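Try not to use * in imports; it makes the code harder for readers to trace.
Learn pdb, the Python debugger. It is simple but powerful and works well for small and even medium-sized projects. The easiest way to use it is to add import pdb; pdb.set_trace() just before the line where you want the program to stop, so that you can then step through your code line by line.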
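
A minimal sketch of the pdb suggestion, assuming a hypothetical find_marker helper and debug flag that are not part of the original program. The breakpoint is guarded so that the function runs normally unless debug=True is passed.

```python
import pdb


def find_marker(results, debug=False):
    """Return the index just past "has address ", or None if it is absent."""
    if debug:
        # Execution pauses here; at the prompt, `p results` prints the value,
        # `n` steps to the next line and `c` continues.
        pdb.set_trace()
    position = results.find("has address")
    if position == -1:
        return None
    return position + len("has address ")
```

Calling find_marker(results, debug=True) from the crawler would open the debugger prompt just before the search, which makes it easy to see what host actually returned.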

Answer 2

I agree with Joe Lin's suggestion not to use wildcards in import statements. They badly pollute your namespace and can produce strange behavior.

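Python is "batteries included", so you should probably take advantage of the requests and urllib3 packages to handle HTTP requests, use subprocess (with care) to execute commands, and look at the scrapy web scraping package. The data returned by their respective objects and methods may already contain what you want to extract.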
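
As one illustration of leaning on existing libraries instead of shelling out, the sketch below resolves a URL's host with the standard-library modules urllib.parse and socket. These are swapped in here for illustration; they are not the requests, urllib3, or scrapy packages named above, and resolve_url is a hypothetical helper name.

```python
import socket
from urllib.parse import urlsplit


def resolve_url(url):
    """Return an IPv4 address for the host in `url`, or None on failure."""
    hostname = urlsplit(url).hostname  # drops scheme, port and path
    if not hostname:
        return None
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # includes socket.gaierror for failed lookups
        return None
```

With this approach a trailing slash in the URL no longer matters, because only the hostname is handed to the resolver and no shell command is built at all.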

Be lazy and lean on "prior art."

In the first few lines of get_ip_address, I noticed the following:

    def get_ip_address(url):
        command = "host " + url
        process = os.popen(command)
        ....

If I execute this command through a shell, it literally amounts to:

    host http://www.foo.com

Run man host and read the manual page:

    host is a simple utility for performing DNS lookups. It is normally
    used to convert names to IP addresses and vice versa. When no arguments
    or options are given, host prints a short summary of its command line
    arguments and options.

    name is the domain name that is to be looked up. It can also be a
    dotted-decimal IPv4 address or a colon-delimited IPv6 address, in which
    case host will by default perform a reverse lookup for that address.
    server is an optional argument which is either the name or IP address
    of the name server that host should query instead of the server or
    servers listed in /etc/resolv.conf.

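You are giving host a URL, whereas it expects only an IP address or a hostname. URLs include a scheme, a hostname, and a path. You will have to extract the hostname explicitly for host to work with it as intended. Given that URLs may or may not include path information, you have to take the URL apart: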

    url = "http://www.yahoo.com/some_random/path"

    # Split on "//" to drop the scheme
    _, host_and_path = url.split("//", 1)

    # partition() also works when the URL has no path after the hostname
    hostname, _, path = host_and_path.partition("/")

    # Use 'hostname' (not the full URL) as input to the command
    command = "host " + hostname
    ...
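
If the host utility is still preferred, a sketch along these lines avoids the shell entirely by passing an argument list to subprocess.run. It assumes host is installed and prints lines containing "has address", as in the question; lookup_with_host and parse_host_output are hypothetical names.

```python
import subprocess
from urllib.parse import urlsplit


def parse_host_output(output):
    """Return the first IPv4 address reported by `host`, or None."""
    for line in output.splitlines():
        if " has address " in line:
            return line.rsplit(" ", 1)[1]
    return None


def lookup_with_host(url):
    """Run `host <hostname>` without a shell and parse its output."""
    hostname = urlsplit(url).hostname
    if not hostname:
        return None
    # A list of arguments is executed directly, so characters such as "("
    # in the input are never interpreted by /bin/sh.
    completed = subprocess.run(
        ["host", hostname], capture_output=True, text=True, check=False
    )
    return parse_host_output(completed.stdout)
```

Returning None on a failed lookup, rather than a slice of the error message, lets the caller decide whether to continue with the remaining steps.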

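I don't think the question shows all of the code relevant to this problem. The error output looks like it comes from a shell rather than being a conventional Python stack trace; probably one of the get_something functions uses Popen to run a shell command on your behalf.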
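
One plausible source of the shell error, assuming get_nmap (not shown in the question) builds a command string such as "nmap -F " + ip_address and runs it through os.popen: the unquoted "(" in the failed lookup text would then be parsed by /bin/sh. The sketch below checks that a value really is an IP address before using it and quotes the argument with shlex.quote as a second line of defense; build_nmap_command is a hypothetical stand-in, not the asker's function.

```python
import ipaddress
import shlex


def is_ip(text):
    """Return True if `text` is a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def build_nmap_command(options, target):
    """Build a shell-safe command string; refuse targets that are not IPs."""
    if not is_ip(target):
        raise ValueError(f"not an IP address: {target!r}")
    return "nmap " + options + " " + shlex.quote(target)
```

With a check like this, the text saved in ip_address.txt would be rejected before any command is built, so the failure would surface as a Python exception instead of a shell syntax error.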

python

web-crawler

python-os