Python: print/get 每段第一句
Python: print/get first sentence of each paragraph
这是我的代码,但它打印了整个段落。如何只打印第一句话,直到第一个点?
from bs4 import BeautifulSoup
import urllib.request,time
article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'
req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html,'lxml')
def print_intro():
if len(soup.find_all('p')[0].get_text()) > 100:
print(soup.find_all('p')[0].get_text())
此代码打印:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial. The brain is the only kind of object
capable of understanding that the cosmos is even there, or why there
are infinitely many prime numbers, or that apples fall because of the
curvature of space-time, or that obeying its own inborn instincts can
be morally wrong, or that it itself exists. Nor are its unique
abilities confined to such cerebral matters. The cold, physical fact
is that it is the only kind of object that can propel itself into
space and back without harm, or predict and prevent a meteor strike on
itself, or cool objects to a billionth of a degree above absolute
zero, or detect others of its kind across galactic distances.
但我只想打印:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial.
感谢帮助
在那个点上拆分文本;对于单个拆分,使用 str.partition()
is faster than str.split()
和限制:
text = soup.find_all('p')[0].get_text()
if len(text) > 100:
text = text.partition('.')[0] + '.'
print(text)
如果你只需要处理第一个<p>
元素,使用soup.find()
代替:
text = soup.find('p').get_text()
if len(text) > 100:
text = text.partition('.')[0] + '.'
print(text)
然而,对于您给定的 URL,示例文本被发现为 second 段落:
>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
def print_intro():
if len(soup.find_all('p')[0].get_text()) > 100:
paragraph = soup.find_all('p')[0].get_text()
phrase_list = paragraph.split('.')
print(phrase_list[0])
split
第一段 period
。参数 1
指定 MAXSPLIT
并节省不必要的额外拆分时间。
def print_intro():
if len(soup.find_all('p')[0].get_text()) > 100:
my_paragraph = soup.find_all('p')[0].get_text()
my_list = my_paragraph.split('.', 1)
print(my_list[0])
您可以使用 find('.')
,它 return 您要查找的内容第一次出现的索引。
因此,如果该段落存储在名为 paragraph
的变量中
sentence_index = paragraph.find('.')
# add the '.'
sentence += 1
print(paragraph[0: sentence_index])
显然这里缺少控制部分,比如检查 paragraph
变量中包含的字符串是否有 '.'等..无论如何 find() return -1 如果它没有找到你正在寻找的子字符串。
这是我的代码,但它打印了整个段落。如何只打印第一句话,直到第一个点?
from bs4 import BeautifulSoup
import urllib.request,time
article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'
req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html,'lxml')
def print_intro():
if len(soup.find_all('p')[0].get_text()) > 100:
print(soup.find_all('p')[0].get_text())
此代码打印:
To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.
但我只想打印:
To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.
感谢帮助
在那个点上拆分文本;对于单个拆分,使用 str.partition()
is faster than str.split()
和限制:
text = soup.find_all('p')[0].get_text()
if len(text) > 100:
text = text.partition('.')[0] + '.'
print(text)
如果你只需要处理第一个<p>
元素,使用soup.find()
代替:
text = soup.find('p').get_text()
if len(text) > 100:
text = text.partition('.')[0] + '.'
print(text)
然而,对于您给定的 URL,示例文本被发现为 second 段落:
>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
def print_intro():
if len(soup.find_all('p')[0].get_text()) > 100:
paragraph = soup.find_all('p')[0].get_text()
phrase_list = paragraph.split('.')
print(phrase_list[0])
split
第一段 period
。参数 1
指定 MAXSPLIT
并节省不必要的额外拆分时间。
def print_intro():
if len(soup.find_all('p')[0].get_text()) > 100:
my_paragraph = soup.find_all('p')[0].get_text()
my_list = my_paragraph.split('.', 1)
print(my_list[0])
您可以使用 find('.')
,它 return 您要查找的内容第一次出现的索引。
因此,如果该段落存储在名为 paragraph
sentence_index = paragraph.find('.')
# add the '.'
sentence += 1
print(paragraph[0: sentence_index])
显然这里缺少控制部分,比如检查 paragraph
变量中包含的字符串是否有 '.'等..无论如何 find() return -1 如果它没有找到你正在寻找的子字符串。