Word cloud generated from URLs in pandas df - one word cloud for 220 articles instead of one word cloud per article

Question

I have the following dataframe:

    Title                      Source       Date              Link
    'Corona news'              NY Times     01-06-2020        nyt.com/corona_news
220 rows × 4 columns

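So I have 220 links to news articles in this df, and I want to use them to make a wordcloud and host it via flask.

I adapted the following code to build word clouds from the links in the df.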

main.py

import base64
import feedparser
import io
import requests
import pandas as pd

from bs4 import BeautifulSoup
from wordcloud import WordCloud
from flask import Flask
from flask import render_template 

app = Flask(__name__)

BBC_FEED = pd.read_excel('All_news_corona.xlsx')
LIMIT = 20

class Article:
    def __init__(self, url, image):
        self.url = url
        self.image = image

@app.route("/")
def home():
    # feed = feedparser.parse(BBC_FEED)
    articles = []

    for article in BBC_FEED['Link'][:LIMIT]:
        print(article)
        text = parse_article(article)
        cloud = get_wordcloud(text)
        articles.append(Article(article, cloud))
    return render_template('home.html', articles=articles)

def parse_article(article_url):
    print("Downloading {}".format(article_url))
    r = requests.get(article_url)
    soup = BeautifulSoup(r.text, "html.parser")
    ps = soup.find_all('p')
    text = "\n".join(p.get_text() for p in ps)
    return text

def get_wordcloud(text):
    pil_img = WordCloud().generate(text=text).to_image()
    img = io.BytesIO()
    pil_img.save(img, "PNG")
    img.seek(0)
    img_b64 = base64.b64encode(img.getvalue()).decode()
    return img_b64

if __name__ == '__main__':
    app.debug = True
    app.run('127.0.0.1')

home.html

<html>
  <head>
    <title>News in WordClouds | Home</title>
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/3.4.1/css/bootstrap.min.css" integrity="sha384-HSMxcRTRxnN+Bdg0JdbxYKrThecOKuH5zCYotlSAcp1+c8xmyTe9GYg1l9a69psu" crossorigin="anonymous">

    <style type="text/css">
      body {padding: 20px;}
      img{padding: 5px;}
    </style>
  </head>

  <body>
    <h1>News Word Clouds</h1>
      <p>Too busy to click on each news article to see what it's about? Below you can see all the articles from the BBC front page, displayed as word clouds. If you want to read more about any particular article, just click on the wordcloud to go to the original article</p>
      {% for article in articles %}
        <a href="{{article.url}}"><img src="data:image/png;base64,{{article.image}}"></a>
      {% endfor %}
  </body>
</html>

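However, instead of making one word cloud per article, I want to make a single word cloud that takes its input from all the articles. Does anyone have a quick fix?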

Answer 1

Concatenate all the parsed texts into one string, then pass that string to your wordcloud function:

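@app.route("/")
def home():
    articles = []
    all_texts = []

    # LIMIT is 20 in the question; drop the [:LIMIT] slice to use all 220 links
    for article in BBC_FEED['Link'][:LIMIT]:
        all_texts.append(parse_article(article))

    # one cloud built from the text of every article
    cloud = get_wordcloud(" ".join(all_texts))
    articles.append(Article(url=None, image=cloud))  # no URL for the "meta-article"
    return render_template('home.html', articles=articles)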
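
With 220 links, a few are likely to fail to download or to return no text, so it may be worth isolating the joining step. Below is a minimal sketch of that step only, not the asker's code: `build_combined_text` and `fake_fetch` are hypothetical names, and the fetch function is passed in as a parameter so the example runs without network access. In the real app you would pass `parse_article` instead.

```python
from typing import Callable, Iterable


def build_combined_text(urls: Iterable[str], fetch_text: Callable[[str], str]) -> str:
    """Join the text of every article that can be fetched into one string."""
    texts = []
    for url in urls:
        try:
            text = fetch_text(url)
        except Exception as exc:  # one dead link should not break the whole cloud
            print("Skipping {}: {}".format(url, exc))
            continue
        # ignore pages that yielded no text at all
        if text and text.strip():
            texts.append(text.strip())
    return " ".join(texts)


# Hypothetical stand-in for parse_article, so the sketch runs offline.
def fake_fetch(url: str) -> str:
    pages = {"a.example/1": "corona news ", "a.example/2": "  vaccine update"}
    if url not in pages:
        raise ValueError("not found")
    return pages[url]


combined = build_combined_text(
    ["a.example/1", "a.example/missing", "a.example/2"], fake_fetch
)
print(combined)  # corona news vaccine update
```

In the Flask view this would look something like `get_wordcloud(build_combined_text(BBC_FEED['Link'], parse_article))`. If every download fails the result is an empty string, so you probably want to check for that before calling `get_wordcloud`.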
