Parse allowed and disallowed parts of robots.txt file

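I am trying to extract the allowed and disallowed sections for a given user agent from the robots.txt file of the Netflix website, using the following code: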

    robots="""

    User-agent: *
    Disallow: /

    User-agent: googlebot
    User-agent: Googlebot-Video
    User-agent: bingbot
    User-agent: Baiduspider
    User-agent: Baiduspider-mobile
    User-agent: Baiduspider-video
    User-agent: Baiduspider-image
    User-agent: NaverBot
    User-agent: Yeti
    User-agent: Yandex
    User-agent: YandexBot
    User-agent: YandexMobileBot
    User-agent: YandexVideo
    User-agent: YandexWebmaster
    User-agent: YandexSitelinks
    User-agent: SeznamBot
    Allow: /

    Disallow: /accountstatus
    Disallow: /AccountStatus
    Disallow: /aui/inbound
    Disallow: /authenticate
    Disallow: /autologin
    Disallow: /clearcookies
    Disallow: /companies
    Disallow: /dvdterms
    Disallow: /editpayment
    Disallow: /emailunsubscribe
    Disallow: /error
    Disallow: /eula
    Disallow: /geooverride
    Disallow: /help
    Disallow: /imagelibrary
    Disallow: /learnmorelayer
    Disallow: /learnmorelayertv
    Disallow: /login
    Disallow: /loginhelp
    Disallow: /loginhelp/lookup
    Disallow: /loginhelpsucess
    Disallow: /LoginHelp
    Disallow: /password
    Disallow: /logout
    Disallow: /Logout
    Disallow: /mcd
    Disallow: /modernizr
    Disallow: /n/
    Disallow: /notamember
    Disallow: /notfound
    Disallow: /notices
    Disallow: /nrdapp
    Disallow: /optout
    Disallow: /overviewblockseeother
    Disallow: /popup/codewhatisthis
    Disallow: /popupdetails
    Disallow: /PopupDetails
    Disallow: /popupprivacypolicy
    Disallow: /privacypolicychanges
    Disallow: /registration
    Disallow: /rememberme
    Disallow: /signout
    Disallow: /signurl
    Disallow: /subscriptioncancel
    Disallow: /tastesurvey
    Disallow: /termsofusechanges
    Disallow: /tvsignup
    Disallow: /upcomingevents
    Disallow: /verifyidentity
    Disallow: /whysecure

    Disallow: /arabic
    Disallow: /Arabic
    Disallow: /chinese
    Disallow: /Chinese
    Disallow: /korean
    Disallow: /Korean

    Disallow: /airtel
    Disallow: /anan
    Disallow: /bouyguestelecom
    Disallow: /britishairways
    Disallow: /brutus
    Disallow: /comhem
    Disallow: /courts
    Disallow: /csl
    Disallow: /elisa
    Disallow: /entertain
    Disallow: /FireTV
    Disallow: /firetv
    Disallow: /freemonth
    Disallow: /kpn
    Disallow: /lg
    Disallow: /maxis
    Disallow: /Maxis
    Disallow: /meo
    Disallow: /Meo
    Disallow: /orangefrance
    Disallow: /Panasonic
    Disallow: /panasonic
    Disallow: /playstation
    Disallow: /proximus
    Disallow: /qantas
    Disallow: /samsung
    Disallow: /Sony
    Disallow: /sony
    Disallow: /talktalk
    Disallow: /tdc
    Disallow: /telenor
    Disallow: /telfort
    Disallow: /tim
    Disallow: /virginaustralia
    Disallow: /vodafone
    Disallow: /vodafonedemobilelaunch
    Disallow: /xboxone
    Disallow: /xfinity
    Disallow: /xs4all
    Disallow: /ziggo

    Disallow: /accountaccess
    Disallow: /AccountAccess
    Disallow: /activate
    Disallow: /Activate
    Disallow: /app
    Disallow: /BillingActivity
    Disallow: /browse
    Disallow: /browse/*
    Allow: /browse/genre/*
    Disallow: /CancelPlan
    Disallow: /ChangePlan
    Disallow: /changeplan
    Disallow: /deviceManagement
    Disallow: /DoNotTest
    Disallow: /EditProfiles
    Disallow: /email
    Disallow: /EmailPreferences
    Disallow: /entrytrap
    Disallow: /HdToggle
    Disallow: /LanguagePreferences
    Disallow: /ManageDevices
    Disallow: /ManageProfiles
    Disallow: /MoviesYouveSeen
    Disallow: /MyListOrder
    Disallow: /NewWatchInstantlyRSS
    Disallow: /NewWatchInstantlyRSS/*
    Disallow: /payment
    Disallow: /Payment
    Disallow: /phonenumber
    Disallow: /pin
    Disallow: /profiles
    Disallow: /profiles/*
    Disallow: /ProfilesGate
    Disallow: /search
    Disallow: /search/*
    Disallow: /viewingactivity
    Disallow: /WiViewingActivity
    Disallow: /yourAccount
    Disallow: /youraccount
    Disallow: /YourAccount
    Disallow: /YourAccountPayment

    User-agent: AdsBot-Google
    User-agent: Twitterbot
    User-agent: Adidxbot
    Allow: /

    User-agent: Yahoo Pipes 1.0
    User-agent: Facebot
    User-agent: externalfacebookhit
    Disallow: /
    """

    strt=0
    ad=0
    robots=''.join(robots.lower().split(' '))
    for line in robots.split('\n'):
        if line!='':
            if ('user-agent:yeti' in line or strt==1) or ('user-agent' not in line and ad==0):
                strt=1
                print(line)
                if 'allow' in line or 'disallow' in line:
                    ad=1

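I am using this code to print the allowed and disallowed sections for the user agent Yeti, but it is somewhat confusing. Can anyone suggest a regular expression or an improvement to this code? I am using Python here.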

Overview

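The following script reads the robots.txt content from top to bottom, splitting it on newlines. In practice you will most likely not read robots.txt from a string, but from something closer to an iterator.

Once a User-agent tag is found, the script starts building a list of user agents. Multiple user agents can share one set of Allow/Disallow permissions.

When an Allow or Disallow tag is recognized, the script emits that permission once for each user agent associated with the permission block.

Emitting the data this way lets you sort or aggregate it as needed:

  • Group by user agent
  • Group by permission: Allow/Disallow
  • Build a dictionary of paths and their associated permissions or user agents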
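
def robot_permissions(permission_string):
    """Yield a (tag, value, agent) tuple for every rule in a robots.txt string."""
    user_agents = []
    new_block = True
    for line in permission_string.split("\n"):
        clean_line = line.strip()
        # Skip blank lines, comments, and anything that is not "tag: value".
        if not clean_line or clean_line.startswith("#") or ":" not in clean_line:
            continue
        # Split on the first colon only, so values that contain ":" stay intact.
        tag, value = clean_line.split(":", 1)
        tag = tag.strip()
        value = value.strip()
        if tag.lower() == "user-agent":
            # The first User-agent line after a run of rules starts a new block.
            if new_block:
                user_agents = []
                new_block = False
            user_agents.append(value)
        else:
            new_block = True
            # Emit the rule once for every agent that shares this block.
            for agent in user_agents:
                yield (tag, value, agent)

def agent_filter(piter, filter_agent):
    """Pass through only the rules that belong to filter_agent."""
    for tag, value, agent in piter:
        if agent == filter_agent:
            yield (tag, value, agent)

if __name__ == "__main__":
    # `robots` is the multi-line string defined in the question.
    piter = robot_permissions(robots)
    for p in agent_filter(piter, "Yeti"):
        print(p)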
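
The script above takes the whole file as one string, while the overview notes that a real robots.txt is more likely to arrive as an iterator of lines. A minimal sketch of that variant follows. It assumes a hypothetical helper named `iter_permissions` and a small made-up sample wrapped in `io.StringIO`; neither is part of the original answer, and an open file object should work the same way.

```python
import io

def iter_permissions(lines):
    """Yield (tag, value, agent) from any iterable of robots.txt lines."""
    user_agents = []
    new_block = True
    for raw in lines:
        line = raw.strip()
        # Ignore blank lines, comments, and lines without a "tag: value" shape.
        if not line or line.startswith("#") or ":" not in line:
            continue
        tag, value = (part.strip() for part in line.split(":", 1))
        if tag.lower() == "user-agent":
            if new_block:
                user_agents = []
                new_block = False
            user_agents.append(value)
        else:
            new_block = True
            for agent in user_agents:
                yield (tag, value, agent)

# Hypothetical sample standing in for a file object opened on robots.txt.
sample = io.StringIO(
    "User-agent: *\n"
    "Disallow: /\n"
    "\n"
    "User-agent: NaverBot\n"
    "User-agent: Yeti\n"
    "Allow: /\n"
    "\n"
    "Disallow: /login\n"
)

yeti_rules = [(tag, value)
              for tag, value, agent in iter_permissions(sample)
              if agent == "Yeti"]
print(yeti_rules)  # [('Allow', '/'), ('Disallow', '/login')]
```

Because the helper only iterates once over its argument, it never needs the whole file in memory.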

Head of the output from the robot_permissions script for Yeti

('Allow', '/', 'Yeti')
('Disallow', '/accountstatus', 'Yeti')
('Disallow', '/AccountStatus', 'Yeti')
('Disallow', '/aui/inbound', 'Yeti')
('Disallow', '/authenticate', 'Yeti')
('Disallow', '/autologin', 'Yeti')
('Disallow', '/clearcookies', 'Yeti')
('Disallow', '/companies', 'Yeti')
('Disallow', '/dvdterms', 'Yeti')
('Disallow', '/editpayment', 'Yeti')

Tail of the output from the robot_permissions script for Yeti

('Disallow', '/profiles/*', 'Yeti')
('Disallow', '/ProfilesGate', 'Yeti')
('Disallow', '/search', 'Yeti')
('Disallow', '/search/*', 'Yeti')
('Disallow', '/viewingactivity', 'Yeti')
('Disallow', '/WiViewingActivity', 'Yeti')
('Disallow', '/yourAccount', 'Yeti')
('Disallow', '/youraccount', 'Yeti')
('Disallow', '/YourAccount', 'Yeti')
('Disallow', '/YourAccountPayment', 'Yeti')
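
The grouping ideas from the overview (by user agent, by permission, or by path) can be sketched on tuples of this shape. The example below is only an illustration: the `rules` list is a hand-copied sample of a few tuples like those shown above, plus the `Facebot` entry that the last block of the file should yield, and the names `by_agent` and `by_path` are assumptions, not part of the original script.

```python
from collections import defaultdict

# Sample (tag, value, agent) tuples in the shape the script emits.
rules = [
    ("Allow", "/", "Yeti"),
    ("Disallow", "/accountstatus", "Yeti"),
    ("Disallow", "/search", "Yeti"),
    ("Disallow", "/", "Facebot"),
]

# agent -> permission -> list of paths
by_agent = defaultdict(lambda: defaultdict(list))
# path -> list of (permission, agent) pairs
by_path = defaultdict(list)

for tag, value, agent in rules:
    by_agent[agent][tag].append(value)
    by_path[value].append((tag, agent))

print(dict(by_agent["Yeti"]))
# {'Allow': ['/'], 'Disallow': ['/accountstatus', '/search']}
print(by_path["/"])
# [('Allow', 'Yeti'), ('Disallow', 'Facebot')]
```

In a real run, `rules` would be replaced by the generator returned from the parsing function, so the grouping happens in a single pass over the file.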