newspaper3k: find the author name in the visible text after the first "by" word

Question

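Newspaper3k is a good Python library for news content extraction, and it works well for the most part. I want to extract the name that follows the first "by" word in the visible text. This is my code; it does not work well. Could someone please help?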

import re
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10 
html1='https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2/'
article = Article(html1.strip(), config=config)
article.download()
article.parse()
soup = BeautifulSoup(article.html, 'html.parser')
## I want to take only visible text
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.get_text('\n')
for line in visible_text.splitlines():
    # Capture one-or-more words after first (By or by) the initial match
    match = re.search(r'By (\S+)', line)

    # Did we find a match?
    if match:
        # Yes, process it to print
        By = match.group(1)
        print('By {}'.format(By))

Answer 1

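This is not a comprehensive answer, but it is one you can build on. You will need to extend this code as you add other sources. As I said before, my Newspaper3k overview document has many extraction examples, so please review it.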

Regular expressions should be a last resort, after you have tried these extraction methods with newspaper3k:

article.authors
meta tags
JSON
soup
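
As a rough illustration of the meta tag and JSON steps in that list, the sketch below pulls author names out of a `<meta name="author">` tag and out of JSON-LD `<script>` blocks. It is an assumption-laden example, not part of the original answer: the names `AuthorMetadataParser`, `find_author_metadata` and `SAMPLE_HTML` are hypothetical, the sample markup is invented, and the standard library's `html.parser` is used in place of BeautifulSoup only so that the sketch runs without extra dependencies. With BeautifulSoup, the equivalent lookups would be `soup.find('meta', attrs={'name': 'author'})` and `soup.find_all('script', type='application/ld+json')`.

```python
import json
from html.parser import HTMLParser


class AuthorMetadataParser(HTMLParser):
    """Collect author hints from <meta name="author"> and JSON-LD blocks."""

    def __init__(self):
        super().__init__()
        self.meta_authors = []     # content of <meta name="author"> tags
        self.jsonld_authors = []   # names found in application/ld+json blocks
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta':
            name = (attrs.get('name') or '').lower()
            if name == 'author' and attrs.get('content'):
                self.meta_authors.append(attrs['content'].strip())
        elif tag == 'script' and attrs.get('type') == 'application/ld+json':
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._in_jsonld:
            self._in_jsonld = False
            try:
                self._collect(json.loads(''.join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD: ignore this block

    def _collect(self, data):
        # JSON-LD may be a single object or a list of objects,
        # and "author" may be a string, an object, or a list of either.
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            author = item.get('author')
            for entry in author if isinstance(author, list) else [author]:
                if isinstance(entry, dict) and entry.get('name'):
                    self.jsonld_authors.append(entry['name'])
                elif isinstance(entry, str):
                    self.jsonld_authors.append(entry)


def find_author_metadata(html):
    """Return meta-tag authors if present, otherwise JSON-LD authors."""
    parser = AuthorMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.meta_authors or parser.jsonld_authors


# Invented sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<meta name="author" content="Jane Doe">
<script type="application/ld+json">
{"@type": "NewsArticle", "author": {"@type": "Person", "name": "Jane Doe"}}
</script>
</head><body><p>By Jane Doe</p></body></html>
"""

print(find_author_metadata(SAMPLE_HTML))  # → ['Jane Doe']
```

In a real run you would pass `article.html` instead of `SAMPLE_HTML`. Which metadata a given site actually provides varies, so treat both lookups as things to try, not as guarantees.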

from newspaper import Config
from newspaper import Article
from bs4 import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

urls = ['https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2',
        'https://www.macleans.ca/education/what-college-students-in-canada-can-expect-during-covid',
        'https://www.cnn.com/2021/02/12/asia/india-glacier-raini-village-chipko-intl-hnk/index.html',
        'https://www.latimes.com/california/story/2021-02-13/wildfire-santa-cruz-boulder-creek-residents-fear-water'
        '-quality',
        'https://foxbaltimore.com/news/local/maryland-lawmakers-move-ahead-with-first-tax-on-internet-ads-02-13-2021']

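for url in urls:
    article = Article(url, config=config)
    article.download()
    article.parse()

    # First choice: the authors that newspaper3k extracted itself.
    authors = article.authors
    if authors:
        print(authors)
        continue

    # Fallback: look for a byline container in the raw HTML.
    soup = BeautifulSoup(article.html, 'html.parser')
    container = soup.find(True, {'class': ['td-post-author-name', 'byline']})
    # Guard against a missing container so 'no author found' is reported
    # instead of an AttributeError being silently swallowed.
    author_tag = container.find(['a', 'span']) if container else None
    if author_tag:
        print(author_tag.get_text().replace('By', '').strip())
    else:
        print('no author found')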
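
If all of the methods above come up empty, the regular-expression fallback from the question could look roughly like the sketch below. It is only a sketch under stated assumptions: `BYLINE_RE` and `first_byline_author` are hypothetical names, the pattern assumes that a byline is "By" or "by" followed by one to four capitalised words on the same line, and the input is assumed to be visible text that has already been extracted (for example with `soup.get_text('\n')` after removing `script` and `style` tags).

```python
import re

# "By"/"by" as a whole word, followed by one to four capitalised words.
# The capital-letter requirement skips phrases such as "stand by me".
BYLINE_RE = re.compile(
    r"\b[Bb]y\s+((?:[A-Z][\w.'-]*)(?:\s+[A-Z][\w.'-]*){0,3})"
)


def first_byline_author(visible_text):
    """Return the name after the first byline-looking 'by', or None."""
    for line in visible_text.splitlines():
        match = BYLINE_RE.search(line)
        if match:
            return match.group(1)  # stop at the first match
    return None


# Invented sample text, for illustration only.
sample = "Food, glorious food\nBy Jane Doe\nPosted by admin on Monday"
print(first_byline_author(sample))  # → Jane Doe
```

Unlike `By (\S+)`, this captures multi-word names and stops at the first hit. It will still misfire on text such as "by Monday", which is why it belongs at the end of the fallback chain.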
