Counting words inside a webpage
I need to count the words inside a webpage using Python 3. Which module should I use? urllib?
Here is my code:
import urllib.request

def web():
    # Fetch the page and print its raw bytes
    f = urllib.request.urlopen("https://americancivilwar.com/north/lincoln.html")
    lu = f.read()
    print(lu)
The self-explanatory code below gives you a good starting point for counting the words in a webpage:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Fetch the page and parse it (naming the parser avoids a bs4 warning)
r = requests.get("https://en.wikiquote.org/wiki/Khalil_Gibran")
soup = BeautifulSoup(r.content, "html.parser")

# Count the words within paragraphs
text_p = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
c_p = Counter(x.rstrip(punctuation).lower() for y in text_p for x in y.split())

# Count the words within divs
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
c_div = Counter(x.rstrip(punctuation).lower() for y in text_div for x in y.split())

# Sum the two counters and list the word counts from most to least common.
# Note: text in a <p> nested inside a <div> is counted by both counters.
total = c_div + c_p
list_most_common_words = total.most_common()
For example, if you want the 10 most common words, you just do:
total.most_common(10)
which in this case outputs:
In [100]: total.most_common(10)
Out[100]:
[('the', 2097),
('and', 1651),
('of', 998),
('in', 625),
('i', 592),
('a', 529),
('to', 529),
('that', 426),
('is', 369),
('my', 365)]
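Since the question asks about urllib, here is a minimal sketch of the same idea using only the standard library. It is not part of the answer above: the names TextExtractor, count_words and count_words_at are hypothetical, and it assumes that skipping the contents of script and style tags is a good enough approximation of "visible text". Because it walks every text node once, nested tags are not counted twice.

```python
from collections import Counter
from html.parser import HTMLParser
from string import punctuation
import urllib.request


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words(html):
    """Return a Counter of lowercased words found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Strip punctuation from both ends of each whitespace-separated token
    words = (w.strip(punctuation).lower()
             for chunk in parser.chunks
             for w in chunk.split())
    # Drop tokens that were punctuation only
    return Counter(w for w in words if w)


def count_words_at(url):
    """Fetch a URL with urllib and count its words (assumes UTF-8 if unspecified)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return count_words(html)
```

With this sketch, counts = count_words_at("https://americancivilwar.com/north/lincoln.html") would give the per-word counts for the page in the question, counts.most_common(10) the ten most frequent words, and sum(counts.values()) the total number of words.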