News scraping multiple URLs inside a dataframe
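I am trying to use Newspaper3k to scrape the content of some websites. In the library, the function Article() takes only one URL. Is it possible to iterate over a dataframe full of URLs so that the scraping happens automatically? My df looks like this: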
df = ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia','https://ekonomi.bisnis.com/read/20210918/98/1443952/pos-indonesia-gandeng-nujek-perluas-segmen-pengiriman','https://www.republika.co.id/berita/qzkxdm380/perkuat-layanan-pt-pos-indonesia-gandeng-kurir-wanita']
I tried several possible answers, such as this:
for x in df.iterrows():
    print(x)
    a = Article(x,language='id')
    b = a.download()
    c = a.parse()
But it raises an error:
AttributeError: 'tuple' object has no attribute 'decode'
I also tried:
a = Article(url=x in df.iterrows(),language='id')
b = a.download()
c = a.parse()
author = a.authors
date = a.publish_date
text = a.text
combine = {'author':author,'date':date,'text':text}
data = pd.DataFrame(data=combine)
But it raises an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I have tried more code than this. I would really appreciate any help. Thanks.
df is not a dataframe, it is a list. Just iterate over the list.
from newspaper import Article
import pandas as pd

urls = ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia','https://ekonomi.bisnis.com/read/20210918/98/1443952/pos-indonesia-gandeng-nujek-perluas-segmen-pengiriman','https://www.republika.co.id/berita/qzkxdm380/perkuat-layanan-pt-pos-indonesia-gandeng-kurir-wanita']

rows = []
for url in urls:
    try:
        # Download and parse one article at a time
        a = Article(url,language='id')
        a.download()
        a.parse()

        author = a.authors
        date = a.publish_date
        text = a.text
        print(author, date, text)

        row = {'url':url,
               'author':author,
               'date':date,
               'text':text}
        rows.append(row)
    except Exception as e:
        # Keep a placeholder row so failed URLs still appear in the result
        print(e)
        row = {'url':url,
               'author':'N/A',
               'date':'N/A',
               'text':'N/A'}
        rows.append(row)

# One row per URL
df = pd.DataFrame(rows)
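If the URLs really are stored in a pandas DataFrame rather than a list, the same loop can run over one column. Note that `iterrows()` yields `(index, row)` tuples, which is why passing `x` straight to `Article()` fails with the tuple error; iterating over the column gives plain strings instead. The following is only a minimal sketch under some assumptions: the column name `url` and the helper names `scrape_article` and `scrape_column` are hypothetical and do not come from the question.

```python
import pandas as pd


def scrape_article(url, language="id"):
    """Download and parse one article with Newspaper3k (hypothetical helper)."""
    # Imported here so scrape_column can be tried with a stand-in scraper
    from newspaper import Article

    article = Article(url, language=language)
    article.download()
    article.parse()
    return {
        "author": article.authors,
        "date": article.publish_date,
        "text": article.text,
    }


def scrape_column(frame, column="url", scraper=scrape_article):
    """Scrape every URL in frame[column]; return one result row per URL."""
    rows = []
    for url in frame[column]:  # each item is a plain string, not a tuple
        try:
            fields = scraper(url)
        except Exception as exc:
            # Keep the URL in the output even when scraping fails
            print(f"failed to scrape {url}: {exc}")
            fields = {"author": None, "date": None, "text": None}
        rows.append({"url": url, **fields})
    return pd.DataFrame(rows)
```

With the list from the question, this could be called as `scrape_column(pd.DataFrame({"url": urls}))`. Passing the scraper as a parameter is just a design choice that makes it possible to check the loop without network access.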