Given list of websites, search and return information in Python

Question

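I wrote a function that returns a list of URLs for a given company name. I would like to know how to search through this list of URLs and find out whether the company is owned by another company.

Example: the company "Marketo" was acquired by Adobe.

I want to return whether a company was acquired and by whom.

Here is what I have so far: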

import requests
from googlesearch import search
from bs4 import BeautifulSoup as BS


def get_url(company_name):
    url_list = []
    for url in search(company_name, stop=10):
        url_list.append(url)
    return url_list


test1 = get_url('Marketo')
print(test1[7])


r = requests.get(test1[7])
html = r.text
soup = BS(html, 'lxml')
stuff = soup.find_all('a')


print(stuff)

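I am new to web scraping, and I don't know how to actually search through each URL (assuming I can) and find the information I am looking for.

The value of test1 is as follows: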

['https://www.marketo.com/', 'https://www.marketo.com/software/marketing-automation/', 'https://blog.marketo.com/', 'https://www.marketo.com/software/', 'https://www.marketo.com/company/', 'https://www.marketo.com/solutions/pricing/', 'https://www.marketo.com/solutions/', 'https://en.wikipedia.org/wiki/Marketo', 'https://www.linkedin.com/company/marketo', 'https://www.cmswire.com/digital-marketing/what-is-marketo-a-marketers-guide/']

Answer 1

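You can find this information on sites such as Crunchbase.

The steps are as follows:

Build the URL that holds the target company's information. Suppose you have found that the following URL contains what you need:

url = 'https://www.example.com/infoaboutmycompany.html'

Use selenium to fetch the HTML, because the site does not let you scrape the page directly. Like this: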

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source

Use BeautifulSoup to get the text from the div that contains the information. The div has a specific class, which you can easily find in the HTML:

bsobj = BeautifulSoup(html, 'lxml')
res = bsobj.find('div', {'class': 'alpha beta gamma'})
res.text.strip()

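That gets it done in fewer than 10 lines of code.

Of course, this means changing your list from a list of URLs to a list of companies, hopefully ones the site covers. It works for Marketo.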
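Going from a list of URLs to a list of companies could look roughly like the sketch below. The base URL and the slug rule (lowercase, words joined with hyphens) are assumptions made for illustration; check how the real site forms its company URLs before relying on it.

```python
from urllib.parse import quote

# Hypothetical URL pattern; replace it with the pattern the real site uses.
BASE_URL = "https://www.example.com/companies/{slug}"


def company_url(name, base_url=BASE_URL):
    """Build a per-company URL from a company name (assumed slug rule)."""
    # Assumption: the site lowercases names and joins words with hyphens.
    slug = "-".join(name.strip().lower().split())
    # Percent-encode anything that is not safe inside a URL path segment.
    return base_url.format(slug=quote(slug, safe="-"))


def company_urls(names):
    """Map each company name to the URL expected to hold its details."""
    return {name: company_url(name) for name in names}


if __name__ == "__main__":
    for name, url in company_urls(["Marketo", "Adobe Systems"]).items():
        print(name, "->", url)
```

Each resulting URL can then be passed to `driver.get(...)` in place of the single hard-coded `url`.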

Answer 2

I want to return whether some company was acquired and by whom

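You can scrape the Crunchbase website to get this information. The drawback is that you limit your search to their site. To extend this, you could perhaps include some other websites as well.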

import requests
from bs4 import BeautifulSoup
import re
while True:
    print()
    organization_name=input('Enter organization_name: ').strip().lower()
    crunchbase_url='https://www.crunchbase.com/organization/'+organization_name
    headers={
        'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
    r=requests.get(crunchbase_url,headers=headers)
    if r.status_code == 404:
        print('This organization is not available\n')
    else:
        soup=BeautifulSoup(r.text,'html.parser')
        overview_h2=soup.find('h2',text=re.compile('Overview'))
        try:
            possible_acquired_by_span=overview_h2.find_next('span',class_='bigValueItemLabelOrData')
            if possible_acquired_by_span.text.strip() == 'Acquired by':
                acquired_by=possible_acquired_by_span.find_next('span',class_='bigValueItemLabelOrData').text.strip()
            else:
                acquired_by=False
        except Exception as e:
            acquired_by=False
            # uncomment below line if you want to see the error
            # print(e)
        if acquired_by:
            print('Acquired By: '+acquired_by+'\n')
        else:
            print('No acquisition information available\n')

    again=input('Do You Want To Continue? ').strip().lower()
    if  again not in ['y','yes']:
        break

Sample output:

Enter organization_name: Marketo
Acquired By: Adobe Systems

Do You Want To Continue? y

Enter organization_name: Facebook
No acquisition information available

Do You Want To Continue? y

Enter organization_name: FakeCompany
This organization is not available

Do You Want To Continue? n

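Notes

Read the Crunchbase terms and get their consent before deploying this in any commercial project.
Also check out the Crunchbase API. I think that would be the legitimate way to go about what you are asking.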
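
To widen the search beyond a single site, one option is to look for acquisition phrasing in the plain text of several pages and keep the most frequent match. This is only a minimal sketch: it assumes the page text has already been downloaded and stripped of HTML, and the phrase list and the "capitalized words" name heuristic are assumptions that will miss many real-world wordings.

```python
import re
from collections import Counter

# Assumed heuristic: a company name is one or more capitalized words.
_NAME = r"([A-Z][\w&]*(?: [A-Z][\w&]*)*)"
# Assumed phrasings; the phrase itself is matched case-insensitively.
ACQUIRER_RE = re.compile(r"\b(?i:acquired by|subsidiary of|owned by) " + _NAME)


def find_acquirers(text):
    """Return every candidate owner name mentioned in one page's text."""
    return ACQUIRER_RE.findall(text)


def most_likely_acquirer(page_texts):
    """Return the most frequently named owner across pages, or None."""
    counts = Counter()
    for text in page_texts:
        counts.update(find_acquirers(text))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up snippets standing in for text scraped from different sites.
    pages = [
        "Marketo was acquired by Adobe in 2018.",
        "Marketo, acquired by Adobe Systems, sells marketing software.",
        "The company is a subsidiary of Adobe.",
    ]
    print(most_likely_acquirer(pages))
```

Returning `None` when nothing matches keeps "no information found" distinct from a real company name.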

Answer 3

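As the other answers say, Crunchbase is a good place to get this kind of information, but you will need a headless browser, driven by something like Selenium, to scrape Crunchbase.

If you are on Ubuntu, installing Selenium is fairly easy. Selenium needs a driver to interact with the browser you choose. For example, Firefox requires geckodriver.

Install Selenium with pip:
sudo pip3 install selenium --upgrade

Install the latest version of geckodriver (the URL below pins v0.24.0; adjust it if a newer release exists):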

wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
tar -xvzf geckodriver*
chmod +x geckodriver

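Add the driver to your PATH so that other tools can find it, or put it in the directory where all your software is installed. Otherwise Selenium throws an error ('geckodriver' executable needs to be in PATH).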

mv geckodriver /usr/bin/
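
Before launching Selenium, it can help to confirm that the driver is actually discoverable. The sketch below uses only the standard library; the helper name `find_driver` is made up for this example.

```python
import shutil


def find_driver(name="geckodriver", path=None):
    """Return the full path of the driver executable, or None if not found."""
    # shutil.which searches the directories in PATH (or in `path` if given).
    return shutil.which(name, path=path)


if __name__ == "__main__":
    location = find_driver()
    if location is None:
        print("geckodriver was not found on PATH")
    else:
        print("geckodriver found at", location)
```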

Code

from bs4 import BeautifulSoup as BS
from selenium import webdriver


baseurl = "https://www.crunchbase.com/organization/{0}"

query = input('type company name : ').strip().lower()
url = baseurl.format(query)

driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BS(html, 'lxml')
acquiredBy = soup.find('div', class_= 'flex-no-grow cb-overflow-ellipsis identifier-label').text


print(acquiredBy)

You can use the same logic to get other information as well: just inspect the class/id and scrape the information.
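
To illustrate the "inspect the class, then pull the text" idea without needing a browser, here is a small stand-in that uses Python's built-in `html.parser` instead of BeautifulSoup. It is a sketch, not a replacement for the code above: the sample HTML is invented, and the nesting logic assumes reasonably well-formed markup in which non-void tags are closed.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of each element carrying all the wanted classes."""

    def __init__(self, class_names):
        super().__init__()
        self.wanted = set(class_names.split())
        self.depth = 0      # greater than 0 while inside a matching element
        self.parts = []     # text fragments of the current match
        self.results = []   # stripped text of every finished match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a match
        else:
            classes = set((dict(attrs).get("class") or "").split())
            if self.wanted <= classes:
                self.depth = 1
                self.parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_by_class(html, class_names):
    """Return the text of every element that has all of `class_names`."""
    parser = ClassTextExtractor(class_names)
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    # Invented markup; real pages will use different class names.
    sample = '<div class="label owner"><span>Adobe</span> Systems</div>'
    print(text_by_class(sample, "owner label"))
```

Changing the class string is all it takes to target a different field, which is the same workflow as swapping the `class_` argument in the BeautifulSoup call.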

python

beautifulsoup

google-search

web-scraping