While 循环从 python 中的 HTML 代码获取信息

Question

正在尝试创建一个代码，以便从 Booking.com 获取审阅者的姓名和评论。

我能够获得所有必要的 URL 并从 HTML 代码中分离出审阅者的姓名和评论，但我正在努力创造时间来进行下一次审阅。

while 循环应该将审阅者的名字附加到列表中，移动到下一个名字附加它等等。我也需要同样的评论。

当运行代码没有任何反应时，我不确定我的问题出在哪里。

#Loop parameters
##HTMLs
#Booking.com URL
search_url[0] = 'https://www.booking.com/reviews/us/hotel/shore-cliff.es.html?label=gen173nr-1DEgdyZXZpZXdzKIICOOgHSDNYBGiTAogBAZgBCrgBF8gBDNgBA-gBAYgCAagCA7gC5bPZkQbAAgHSAiQzMTc3NTA4OS00OGRkLTQ5ZjYtYjBhNi1kOWEzYzZhN2QwOWXYAgTgAgE;sid=3e3ae22b47e3df3ac2590eb19d37f888;customer_type=total;hp_nav=0;old_page=0;order=featuredreviews;page=1;r_lang=all;rows=75&'

link = search_urls[0] #Just the first one to try
url = link
html = urllib.request.urlopen(url).read().decode('utf-8') #loading each search page

#Main HTML of first hotel
index=html.find('class="review_list"')
review_list_html = html[index:]

##Lists:
hotels=[]
reviewer_name=[]
review_comment=[]

#Creating counter variable
counter=0
reviewercount =0

                      
#Main HTML of first hotel
index=html.find('class="review_list"')
review_list_html = html[index:]
reviewer_html = review_list_html[review_list_html.find('reviewer_name'):]
review_html = review_list_html[review_list_html.find('class="review_pos ">'):]

#Loop to get reviewer
while review_list_html.find('reviewer_name'):
    #Get reviewer's name
    #Start of reviewers name
    start =reviewer_html.find('<span itemprop="name">')+22 #To ignore <span itemprop="name"> and jump right the name
    start
    #End of reviewers name
    end =reviewer_html.find('</span>')
    #Isolating reviewers name
    reviewer_html=reviewer_html[start:end]
    #Adding reviewer to list
    reviewer_name.append(reviewer_html)

Answer 1

您的问题是每次下一次索引查找都需要从上一个索引开始，否则您将创建永恒循环。通常使用 HTML 解析器如 Beautiful Soup 更常见，但绝对有可能使用您尝试使用的方法解析此页面。

我们可以使用"reviewer_name"作为每个评论块的主要索引。从这个索引开始，我们将得到 "name" 和 </span> 的索引。这些索引之间的文本是审稿人的姓名。为了解析评论正文，我们将在下一个评论块的索引之前找到 "reviewBody" 的所有索引。

完整代码：

from urllib.request import urlopen

link = "https://www.booking.com/reviews/us/hotel/shore-cliff.es.html"
with urlopen(link) as request:
    response = request.read().decode()

reviews = []

name_pos = response.find('"reviewer_name"')  # find first review
while name_pos >= 0:
    name = ""
    review_blocks = []

    start_pos = response.find('"name"', name_pos)
    end_pos = response.find("</span>", start_pos)

    if end_pos > start_pos >= 0:
        name = response[start_pos + 7: end_pos]

    prev_name_pos = name_pos
    name_pos = response.find('"reviewer_name"', name_pos + 1)  # get next review

    start_pos = response.find('"reviewBody"', prev_name_pos, name_pos)
    while start_pos >= 0:
        end_pos = response.find("</span>", start_pos)
        if end_pos > start_pos >= 0:
            review_blocks.append(response[start_pos + 13: end_pos])
        start_pos = response.find('"reviewBody"', start_pos + 1, name_pos)

    reviews.append((name, "\n".join(review_blocks)))

reviews内容：

[
    ('Adriana',
     'Nada para criticar.\n'
     'Impecable lugar, habitación con vistas hermosas cualquiera sea. Camas '
     'confortables, pequeña cocina completa, todo impecable.\n'
     'La atención en recepción excelente, no se pierdan las cookies que convidan '
     'por la tarde allí. El desayuno variado y con unos tamales exquisitos! Cerca '
     'de todo.'),
    ('Ana', 'Todo excelente'),
    ('Lara',
     'simplemente un poco de ruido en el tercer piso pero solo fue un poco antes '
     'de las 10:00pm\n'
     'realmente todo estaba excelente, ese gran detalle de el desayuno se les '
     'agradece mucho.'),
    ('Rodrigo',
     'Todo me gustó solo lo único que me hubiera gustado que también tuvieran es '
     'unas chimeneas.\n'
     'El hotel tiene una hermosa vista y se puede caminar y disfrutar por toda la '
     'orilla de la playa hasta llegar al muelle y mas lejos si uno quiere.'),
    ('May', 'Me encanto q estaba abierta la piscina el mar expectacular'),
    ('Scq', 'Las vistas al Pacífico'),
    ('Eva', 'Desayuno\nUbicación y limpieza'),
    ('Marta',
     'Muy buena ubicación y vistas al mar. Habitaciones modernas, amplias y con '
     'cocina. Buen desayuno y hasta las 10, a diferencia de otros hoteles en los '
     'que estuvimos. Personal muy amable. El chek out es a las 12 por lo que te '
     'permite disfrutar de las piscina y de las vistas y paseo por la costa.'),
    ('Filippo',
     'Habitación enorme, y muy limpio. \n'
     'La habitación con vista al Ocean .... top'),
    ('Enrique', 'La atención del personal'),
    ('Lucia',
     'El lugar para el desayuno es demasiado pequeño y no hay lugar suficiente '
     'para sentarse\n'
     'La vista, los jardines y todo el entorno son preciosos. Y es muy '
     'confortable!'),
    ('Pablo', 'El precio.\nLa ubicación y el desayuno'),
    ('Walter', 'El hotel está bien, la ubicación es buena'),
    ('Anónimo', 'Muy bueno, el personal muy amable\nExcelente lugar muy cómodo'),
    ('Gonzalo', ''),
    ('Maria', ''),
    ('Rosana', ''),
    ('Leticia', ''),
    ('María', ''),
    ('Samantha', '')
]

你可以帮助我的国家，检查my profile info。

While 循环从 python 中的 HTML 代码获取信息

While loop to get information from HTML code in python

html

python

while-loop