How to extract text without brackets using the xpath method in Python?
I am building a database that collects news published on newspaper websites, following this code from John Watson Rooney's GitHub site: https://github.com/jhnwr/webscrapenewsarticles/blob/master/newscraper.py
However, when I extract the information during the scraping process, the output comes wrapped in brackets "[]", and I cannot remove them to clean the data and build a DataFrame of news.
'''
import datetime as dt
import pandas as pd

# `articles` holds the article elements found earlier (via inspect element).
# Create a blank list to collect one dict per article.
n = 0
newslist = []
# Loop through each article to find the title, subtitle, link, date and author.
# try/except because repeated articles from other sources have different h tags.
for item in articles:
    try:
        newsitem = item.find('h3', first=True)
        title = newsitem.text
        link = newsitem.absolute_links
        subtitle = item.xpath('//a[@class="epigraph page-link"]//text()')
        author = item.xpath('//span[@class="oculto"]/span//text()')
        date = item.xpath('//meta[@itemprop="datePublished"]/@content')
        date_scrap = dt.datetime.utcnow().strftime("%d/%b/%Y")
        hour_scrap = dt.datetime.utcnow().strftime("%H:%M:%S")
        print(n, '\n', title, '\n', subtitle, '\n', link, '\n', author, '\n', date, '\n', date_scrap, '\n', hour_scrap)
        newsarticle = {
            'title': title,
            'subtitle': subtitle,
            'link': link,
            'autor': author,
            'fecha': date,
            'date_scrap': date_scrap,
            'hour_scrap': hour_scrap
        }
        newslist.append(newsarticle)
        n += 1
    except Exception:
        # Skip articles that do not match the expected structure.
        pass
news_db = pd.DataFrame(newslist)
news_db.to_excel(r'db_article.xlsx', index=False, header=True)
news_db.head(10)
'''
I am not allowed to embed images, but the printed output looks like this:
En Vivo Procuraduría y Fiscalía investigan caso de joven que se suicidó tras detención
['Una joven de 17 años denunció que 4 policías la agredieron sexualmente durante las protestas']
{'https://www.eltiempo.com/justicia/investigacion/investigan-denuncia-de-agresion-sexual-de-policias-a-menor-en-popayan-588429'}
['Here_Author_name']
['2021-05-14']
15/May/2021
18:14:48
I want to remove both kinds of brackets, "[]" and "{}". I tried the following commands, but they turn the values into NaN:
news_db['subtitle'] = news_db['subtitle'].str.strip(']')
news_db['subtitle'] = news_db['subtitle'].str.replace(r"\[.*\]", "")
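The item.xpath method returns a list of the items it finds, for example ['Author'] rather than 'Author', just like item.find. That is useful when you are searching for several elements (for example ['Author1', 'Author2']).
To get only a single value, use the first parameter: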
subtitle = item.xpath('//a[@class="epigraph page-link"]//text()', first=True)
author = item.xpath('//span[@class="oculto"]/span//text()', first=True)
date = item.xpath('//meta[@itemprop="datePublished"]/@content', first=True)
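To make the difference concrete, here is a minimal sketch in plain Python, with no scraping involved. The helper first_or_none is a hypothetical name, not part of requests-html; it assumes that first=True amounts to "take the first match, or None when nothing matched". It may also help for unwrapping values that were already collected as lists:

```python
def first_or_none(matches):
    """Return the first item of a list of matches, or None if it is empty.

    Hypothetical helper, for illustration only.
    """
    return matches[0] if matches else None


# What the original xpath calls produced: one-element lists.
author = ['Here_Author_name']
date = ['2021-05-14']
no_match = []

print(first_or_none(author))    # the bare string, without brackets
print(first_or_none(date))
print(first_or_none(no_match))  # nothing matched
```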
absolute_links is a set, so you can get an arbitrary element from it with:
link = next(iter(newsitem.absolute_links))
# or
link = newsitem.absolute_links.pop()
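The two options are not quite equivalent. As a sketch on a plain Python set (the URL is a placeholder, not taken from the site): next(iter(...)) leaves the set untouched and accepts a default for the empty case, whereas pop() removes the element it returns and raises KeyError on an empty set.

```python
links = {'https://example.com/article-1'}  # placeholder URL, for illustration

# Non-destructive: the set still contains the link afterwards.
link = next(iter(links), None)
print(link, len(links))

# Destructive: pop() removes the returned element from the set.
popped = links.pop()
print(popped, len(links))

# On an empty set, the default avoids an exception.
print(next(iter(links), None))
```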