Loop through a page of voice samples, downloading each sample and putting the lines into a text file

Here is the page I am trying to do this on. It contains the voice lines of GLaDOS from Portal. Each line is the inner text of an "i" HTML element and is also displayed between quotes on the page. Each one has a direct download link beside it labeled "download". I'm trying to feed the voice lines into the MARY TTS voice synthesizer here, which accepts one of two formats: either each line in its own text file whose file name matches the name of the wav file, or everything in a single text file in the format ( filename "insert line here" ).

I wanted to do this myself, but I've already spent 4 hours on it and only have a short piece of Python code that doesn't work.

from bs4 import BeautifulSoup
import re
import urllib.request
soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('i')
f = open('Lines.txt', 'w')
for t in range(len(tags)):
    f.write(tags[t] + '\n')

f.close()

It returns "TypeError: unsupported operand type(s) for +: 'Tag' and 'str'."

I also tried AutoHotKey.

^g::

IEGet(Name="")        ;Retrieve pointer to existing IE window/tab
{
    IfEqual, Name,, WinGetTitle, Name, ahk_class IEFrame
        Name := ( Name="New Tab - Windows Internet Explorer" ) ? "about:Tabs"
        : RegExReplace( Name, " - (Windows|Microsoft) Internet Explorer" )
    For wb in ComObjCreate( "Shell.Application" ).Windows
        If ( wb.LocationName = Name ) && InStr( wb.FullName, "iexplore.exe" )
            Return wb
} ;written by Jethrow

wb := IEGet()

IELoad(wb)    ;You need to send the IE handle to the function unless you define it as global.
{
    If !wb    ;If wb is not a valid pointer then quit
        Return False
    Loop    ;Otherwise sleep for .1 seconds until the page starts loading
        Sleep,100
    Until (wb.busy)
    Loop    ;Once it starts loading wait until completes
        Sleep,100
    Until (!wb.busy)
    Loop    ;optional check to wait for the page to completely load
        Sleep,100
    Until (wb.Document.Readystate = "Complete")
Return True
}

For IE in ComObjCreate("Shell.Application").Windows ; for each open window
If InStr(IE.FullName, "iexplore.exe") ; check if it's an ie window
break ; keep that window's handle
; this assumes an ie window is available. it won't work if not

IE.Navigate("http://theportalwiki.com/wiki/GLaDOS_voice_lines")
While IE.Busy
    Sleep, 100
Links := IE.Document.Links

Inner := FileOpen("C:\Users\Johnson\Desktop\GLaDOS Voice", "w")
Rows := IE.Document.All.Tags("table")[4].Rows
    Loop % Rows.Length
        Inner.Write(Row[A_Index].InnerText . "`r`n")

Inner.Close()
Return

As far as I can tell, the AutoHotKey script does nothing. I press the hotkey, but nothing happens.

I would prefer Lua because it is consistent and I understand it.

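Your Python code is very close to working. The TypeError occurs because each element of tags is a BeautifulSoup Tag object, not a string, so it cannot be concatenated with '\n'; use its .text attribute to get the inner text. The small fix is below (plus using a context manager for the file):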

from bs4 import BeautifulSoup
import urllib.request
soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('i')
with open('Lines.txt', 'w') as f:
    for tag in tags:
        # .text is the tag's inner text; strip the curly quotes around each line
        f.write(tag.text.strip('“”') + '\n')

Lines.txt:

You just have to look at things objectively, see what you don't need anymore, and trim out the fat.
Portal
Portal 2

Hello and, again, welcome to the Aperture Science computer-aided enrichment center.
...
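
The output above also picks up italic text that is not a voice line ("Portal", "Portal 2"). Since the question says the voice lines are displayed between quotes, one possible way to drop the extras is to keep only italic text that is wrapped in quotation marks. This is a minimal sketch, not tested against the live page: the helper names is_voice_line and clean_lines are hypothetical, and it assumes the quotation marks are part of the "i" element's text.

```python
# Hypothetical helpers: keep only italic text that looks like a quoted voice line.
# Assumes each voice line is wrapped in curly or straight double quotes.
OPENING = '“"'
CLOSING = '”"'


def is_voice_line(text):
    """Return True if the text starts and ends with a double quote character."""
    text = text.strip()
    return len(text) >= 2 and text[0] in OPENING and text[-1] in CLOSING


def clean_lines(texts):
    """Filter out non-quoted entries and strip the surrounding quotes."""
    return [t.strip().strip(OPENING + CLOSING) for t in texts if is_voice_line(t)]


if __name__ == '__main__':
    sample = ['“Hello and, again, welcome.”', 'Portal', 'Portal 2']
    for line in clean_lines(sample):
        print(line)
```

With the code above, this could be called as clean_lines(tag.text for tag in tags) before writing to Lines.txt.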

Edit

To answer the question in the comments below, this should get the download links:

from bs4 import BeautifulSoup
import urllib.request
soup = BeautifulSoup(urllib.request.urlopen("http://theportalwiki.com/wiki/GLaDOS_voice_lines"), "html.parser")
tags = soup.find_all('a')
with open('Downloads.txt', 'w') as f:
    for tag in tags:
        if tag.text == 'Download':
            f.write(tag['href'] + '\n')

Downloads.txt:

http://i1.theportalwiki.net/img/e/e5/GLaDOS_00_part1_entry-1.wav
http://i1.theportalwiki.net/img/d/d7/GLaDOS_00_part1_entry-2.wav
http://i1.theportalwiki.net/img/5/50/GLaDOS_00_part1_entry-3.wav
...
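
To get from these two lists to the single-file format mentioned in the question, ( filename "insert line here" ), something like the following might work. It is only a sketch under assumptions: the function names wav_stem, transcript_entry and save_all are hypothetical, the output names wavs and Transcript.txt are placeholders, and it assumes you already have (url, line) pairs that belong together. How to pair each line with its download link depends on the page markup and is not shown here.

```python
import os
import urllib.request
from urllib.parse import urlparse


def wav_stem(url):
    """Return the wav file name without its extension."""
    name = os.path.basename(urlparse(url).path)
    return os.path.splitext(name)[0]


def transcript_entry(url, line):
    """Format one entry as ( filename "line" ), the format given in the question."""
    return '( {} "{}" )'.format(wav_stem(url), line)


def save_all(pairs, out_dir='wavs', transcript='Transcript.txt'):
    """Download every wav and write one transcript entry per (url, line) pair."""
    os.makedirs(out_dir, exist_ok=True)
    with open(transcript, 'w', encoding='utf-8') as f:
        for url, line in pairs:
            target = os.path.join(out_dir, os.path.basename(urlparse(url).path))
            urllib.request.urlretrieve(url, target)  # performs a network request
            f.write(transcript_entry(url, line) + '\n')


if __name__ == '__main__':
    url = 'http://i1.theportalwiki.net/img/e/e5/GLaDOS_00_part1_entry-1.wav'
    print(transcript_entry(url, 'Hello and, again, welcome.'))
```

Double quotes inside a voice line are not escaped here, so lines that contain them may need extra handling before the file is usable.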