List index out of range when parsing a website using .lower()

Question

I am parsing a website to count the number of lines that mention a keyword. Everything works fine with the following code:

import time
import urllib2
from urllib2 import urlopen
import datetime

website = 'http://www.dailyfinance.com/2014/11/13/market-wrap-seventh-dow-record-in-eight-days/#!slide=3077515'
topSplit = 'NEW YORK -- '
bottomSplit = "<div class=\"knot-gallery\""

# Count mentions on newlines
def main():
    try:
        x = 0
        sourceCode = urllib2.urlopen(website).read()
        sourceSplit = sourceCode.split(topSplit)[1].split(bottomSplit)[0]
        content = sourceSplit.split('\n') # provides an array
        
        for line in content:
            if 'gain' in line:
                x += 1
        
        print x
    
    except Exception,e:
        print 'Failed in the main loop'
        print str(e)

main()

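However, I want to count every mention of a particular keyword regardless of case (in this example, 'gain' or 'Gain'). To do that, I added .lower() where the source code is read: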

sourceCode = urllib2.urlopen(website).read().lower()

However, this gives me the error:

Failed in the main loop

list index out of range

Assuming .lower() is what throws off the index, why does this happen?

Answer 1

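You are now working with an all-lowercase string (that is what lower() does), but you are still splitting on topSplit = 'NEW YORK -- '. The uppercase separator never matches the lowercased text, so split() returns a list containing a single item: the whole string.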

You then try to access index 1 of that list, which will always fail:

sourceCode.split(topSplit)[1]
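A minimal sketch of the effect, using a short made-up string (page is a hypothetical stand-in for the downloaded source, not the real page content):

```python
# Hypothetical sample text standing in for the downloaded page source.
page = "Intro text NEW YORK -- Stocks gain again."
top_split = 'NEW YORK -- '

# Original case: the separator is found, so there are two pieces.
parts_original = page.split(top_split)

# Lowercased text: the uppercase separator is never found,
# so split() returns the whole string as the only element.
parts_lowered = page.lower().split(top_split)

print(len(parts_original))
print(len(parts_lowered))
```

With only one element in the list, indexing it with [1] raises the IndexError shown in the question.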

To handle both cases, take a look at regular expressions in the re module. Here is an example:

>>> import re
>>> string = "some STRING lol"
>>> re.split("string", string, flags=re.IGNORECASE)
['some ', ' lol']
>>> re.split("STRING", string, flags=re.IGNORECASE)
['some ', ' lol']
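Applied to the code in the question, this could look roughly like the sketch below. The helper name count_keyword_lines and the sample text are assumptions made for illustration; only the two split markers and the keyword come from the question. re.escape is used so that characters in the markers are matched literally rather than as regex syntax.

```python
import re

def count_keyword_lines(source, top, bottom, keyword):
    """Count lines between two markers that mention keyword, ignoring case.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Case-insensitive split on the top marker (escaped to match literally).
    after_top = re.split(re.escape(top), source, flags=re.IGNORECASE)
    if len(after_top) < 2:
        return 0  # top marker not found, nothing to count
    # Keep only the text before the bottom marker.
    body = re.split(re.escape(bottom), after_top[1], flags=re.IGNORECASE)[0]
    # Lowercase each line only for the comparison.
    needle = keyword.lower()
    return sum(1 for line in body.split('\n') if needle in line.lower())

# Made-up sample text standing in for the page source.
sample = ('header\n'
          'NEW YORK -- Stocks gain.\n'
          'Gains were broad.\n'
          'No change here.\n'
          '<div class="knot-gallery">footer gain')

count = count_keyword_lines(sample, 'NEW YORK -- ', '<div class="knot-gallery"', 'gain')
print(count)
```

Because the markers are matched case-insensitively, the function gives the same count whether or not the source has been passed through .lower() first. A simpler alternative is to leave sourceCode untouched and only lowercase each line inside the loop.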


python

nlp

python-2.7