Issue extracting an HTML page's string using bs4

Question

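I'm writing a program that looks up song lyrics. It is almost finished, but I'm having trouble with the bs4 data type. My question: how do I extract plain text from the lyric variable in the last lines of the code?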

import re
import requests
import bs4
from urllib import unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")    
    match = re.search('songmeanings\.com\/[^image].*?\/"',page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:]) # this line used to remove a " at the end of line
    else:
        return(1)       

def getText(link):    
    page = requests.get(str(link))          
    soup = bs4.BeautifulSoup(page.content ,"lxml")     
    return(soup)        

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print (lyric)

Here is the output:

[\n\t\t\t\t\t\tPlease could you stop the noise,
\nI'm trying to get some rest
\nFrom all the unborn chicken voices in my head
\nWhat's that?
\nWhat's that?
\n
\nWhen I am king, you will be first against the wall
\nWith your opinion which is of no consequence at all
\nWhat's that?
\nWhat's that?
\n
...

\nEdit Lyrics\nEdit Wiki\nAdd Video\n
]

Answer 1

First trim the leading and trailing [] with stringvar[1:-1], then call linevar.strip() on each line. That removes the leading and trailing whitespace from every line.
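The slicing-and-strip idea can be sketched as follows. This is only an illustration under assumptions: `raw` is a hypothetical stand-in for the string form of the printed list, and the names `inner`, `lines` and `cleaned` are made up for the example. Note that it works on a plain string, so a bs4 result would first have to be converted to one.

```python
# Hypothetical string form of the printed result (not real page data).
raw = "[\n\t\t\tfirst line,\nsecond line\n\n]"

# Drop the leading "[" and the trailing "]".
inner = raw[1:-1]

# strip() removes leading and trailing whitespace (tabs, spaces, newlines)
# from each individual line.
lines = [line.strip() for line in inner.splitlines()]

# Discard the lines that are empty after stripping and rejoin the rest.
cleaned = "\n".join(line for line in lines if line)
print(cleaned)
```

Dropping the empty lines also removes the blank lines between verses; keep them in the join if the verse breaks matter.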

Answer 2

Add the following line of code:

lyric = ''.join([tag.text for tag in lyric])

after

lyric = Soup.findAll(attrs={"lyric-box"})

You will get output like this:

                        Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?

When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?

...
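To see why joining tag.text works, it may help to run the same idea on a small inline document. The sketch below is not the real page: `SAMPLE_HTML` is a made-up stand-in whose markup is an assumption, and the built-in "html.parser" is used instead of "lxml" only so the example has no extra dependency. find_all returns a list-like ResultSet of Tag objects, which is why printing it shows brackets and escaped newlines; get_text() (the method behind .text) returns an ordinary string.

```python
import bs4

# Hypothetical stand-in for the lyrics page; the real markup may differ.
SAMPLE_HTML = """
<div class="holder lyric-box">
    first line,<br/>
second line<br/>
<div class="editbutton"><a href="#">Edit Lyrics</a></div>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# A ResultSet of Tag objects, not a string.
boxes = soup.find_all(class_="lyric-box")

# get_text() drops the markup; stripping each line removes the indentation,
# and skipping empty lines removes the blank padding.
lines = []
for box in boxes:
    for line in box.get_text().splitlines():
        line = line.strip()
        if line:
            lines.append(line)

lyric_text = "\n".join(lines)
print(lyric_text)
```

The text of nested elements such as the edit link is included as well, so trailing entries like "Edit Lyrics" would still need to be filtered out separately.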

Answer 3

For anyone who likes the idea: after a few small changes, my code ended up looking like this :)

import re
import pycurl
import bs4
from urllib import unquote
from StringIO import StringIO


def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToPage)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    
    pageSTR = buffer.getvalue()
    
    soup = bs4.BeautifulSoup(pageSTR,"lxml")  
    
    tab_content = str(soup.find_all(attrs={"tab-content"}))    
    pattern = r'\"\/\/songmeanings.com\/.+?\"'
    links = re.findall(pattern,tab_content)
    
    # Return the first matched item without the double quotes
    # at the beginning and at the end of the string.
    return("http:"+links[0][1:-1:])

    
def getText(linkToSong):
    
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToSong)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    
    pageSTR = buffer.getvalue()
    
    soup = bs4.BeautifulSoup(pageSTR,"lxml")  
    
    lyric_box = soup.find_all(attrs={"lyric-box"})
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)
    
    
link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)
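One detail of the search URL may be worth a sketch. The code above passes the title through unquote and concatenates it into the query string, so a title containing spaces goes into the URL unencoded. A possible alternative on Python 3 is to let urllib.parse.urlencode build the query. The helper name `build_search_url` is hypothetical, and whether the site accepts "+" for spaces is an assumption.

```python
from urllib.parse import urlencode

BASE_ADDRESS = "https://songmeanings.com/query/"


def build_search_url(title):
    """Build the search URL with the title percent-encoded (hypothetical helper)."""
    # urlencode encodes each value (spaces become "+") and joins the
    # key=value pairs with "&".
    query = urlencode({"query": title, "type": "songtitles"})
    return BASE_ADDRESS + "?" + query


print(build_search_url("Anarchy In The U.K"))
# prints https://songmeanings.com/query/?query=Anarchy+In+The+U.K&type=songtitles
```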
