Converting scraped text into a Pandas DataFrame with BeautifulSoup
I am using the following code to extract some text from a website. I have the result as a single string.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
r = requests.get(URL)
page = r.text
soup = BeautifulSoup(page, 'lxml')

# Locate the heading, then the list that follows it
strong_el = soup.find('strong', text='WHAT RESPONDENTS ARE SAYING …')
ul_tag = strong_el.find_next_sibling('ul')

# Concatenate the text of every child of the <ul> into one string
LI_TAG = ''
for li_tag in ul_tag.children:
    LI_TAG += li_tag.string
print(LI_TAG)
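I am trying to create a data frame with 2 columns: 1) the comment and 2) the industry (the substring inside the parentheses).
When I try to use StringIO, I get the following error: "TypeError: data argument can't be an iterator". How can I convert these comments into a data frame?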
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

LI_TAG = StringIO(LI_TAG)
df = pd.DataFrame(LI_TAG)
It looks like the LI_TAG variable is just one long string, so you have to split it into its parts before storing it in a data frame.
import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
r = requests.get(URL)
page = r.text
soup = BeautifulSoup(page, 'lxml')

strong_el = soup.find('strong', text='WHAT RESPONDENTS ARE SAYING …')
ul_tag = strong_el.find_next_sibling('ul')

LI_TAG = ''
for li_tag in ul_tag.children:
    LI_TAG += li_tag.string

# BeautifulSoup already returns unicode text, so the curly quotation marks
# \u201c and \u201d can be matched directly; no unicode() conversion is needed.
comments = []
industries = []
for string in LI_TAG.strip().split('\n'):
    # Split each line at the closing quotation mark: comment on the left, industry on the right
    comment, industry = string.split(u'\u201d')
    comments.append(comment.strip(u'\u201c'))
    industries.append(industry.strip(' (').strip(')'))

data = pd.DataFrame()
data['Comment'] = comments
data['Industry'] = industries
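The split on the closing quotation mark fails if a line is empty or if a comment itself contains another closing quote. As a rough alternative, here is a minimal sketch that parses each line with a regular expression instead. The SAMPLE text, the LINE_RE pattern, and the parse_comments helper are all hypothetical names made up for illustration, and the sample lines are invented stand-ins for LI_TAG, not the real page content.

```python
import re

# Hypothetical sample standing in for LI_TAG; the real scraped text may differ.
SAMPLE = (
    "\u201cBusiness is steady.\u201d (Chemical Products)\n"
    "\n"
    "\u201cDemand remains \u201cstrong\u201d this quarter.\u201d (Machinery)\n"
)

# Opening curly quote, the comment (greedy, so inner quotes are kept),
# closing curly quote, then the industry inside parentheses at the end of the line.
LINE_RE = re.compile(r"^\s*\u201c(.*)\u201d\s*\((.*)\)\s*$")


def parse_comments(text):
    """Return a list of (comment, industry) tuples, skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((match.group(1).strip(), match.group(2).strip()))
    return rows


rows = parse_comments(SAMPLE)
```

The resulting list of tuples can be handed to pandas in one step, for example pd.DataFrame(rows, columns=['Comment', 'Industry']). Lines that do not fit the pattern are silently dropped here, which may or may not be what you want for the real page.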
Hope this works for you!