TypeError: XXXXX got an unexpected keyword argument 'XXXXXX'
I get an "unexpected keyword argument" error when running code from a tutorial. Source: https://sempioneer.com/python-for-seo/how-to-extract-text-from-multiple-webpages-in-python/
Can anyone help? Thanks.
Running the code below:
single_url = 'https://understandingdata.com/'
text = extract_text_from_single_web_page(url=single_url)
print(text)
gives the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_10260/3606377172.py in <module>
1 single_url = 'https://understandingdata.com/'
----> 2 text = extract_text_from_single_web_page(url=single_url)
3 print(text)
~\AppData\Local\Temp/ipykernel_10260/850098094.py in extract_text_from_single_web_page(url)
42 try:
43 a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True, include_comments = False,
---> 44 date_extraction_params={'extensive_search': True, 'original_date': True})
45 except AttributeError:
46 a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
TypeError: extract() got an unexpected keyword argument 'json_output'
The code for extract_text_from_single_web_page(url=single_url):
def extract_text_from_single_web_page(url):
    downloaded_url = trafilatura.fetch_url(url)
    try:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True, include_comments = False,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    except AttributeError:
        a = trafilatura.extract(downloaded_url, json_output=True, with_metadata=True,
                                date_extraction_params={'extensive_search': True, 'original_date': True})
    if a:
        json_output = json.loads(a)
        return json_output['text']
    else:
        try:
            resp = requests.get(url)
            # We will only extract the text from successful requests:
            if resp.status_code == 200:
                return beautifulsoup_extract_text_fallback(resp.content)
            else:
                # This line will handle for any failures in both the Trafilature and BeautifulSoup4 functions:
                return np.nan
        # Handling for any URLs that don't have the correct protocol
        except MissingSchema:
            return np.nan
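As I suggested in the comments, the best option is to find a tutorial that doesn't use trafilatura, since that seems to be the thing that's broken. However, modifying this particular function to avoid it and use only the fallback is straightforward: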
def extract_text_from_single_web_page(url):
    try:
        resp = requests.get(url)
        # We will only extract the text from successful requests:
        if resp.status_code == 200:
            return beautifulsoup_extract_text_fallback(resp.content)
        else:
            # This line will handle for any failures in the BeautifulSoup4 function:
            return np.nan
    # Handling for any URLs that don't have the correct protocol
    except MissingSchema:
        return np.nan
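If you would still like to try trafilatura first whenever the installed version allows it, one option is to check whether a function accepts a keyword before passing it. Note that the try/except in the question cannot catch this error: it handles AttributeError, while the call raises TypeError. The sketch below uses the standard-library inspect module. The names accepts_keyword, old_extract and new_extract are made up for illustration; the two extract functions are hypothetical stand-ins and not trafilatura's real signatures.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with name=... as a keyword argument."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if param is not None and param.kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-in for an older release that still takes json_output.
def old_extract(html, json_output=False):
    return html


# Hypothetical stand-in for a newer release that dropped json_output.
def new_extract(html, output_format="txt"):
    return html


def safe_call(func, html):
    """Pass json_output=True only when func is known to accept it."""
    if accepts_keyword(func, "json_output"):
        return func(html, json_output=True)
    return func(html)
```

With a guard like this, the call to the newer-style function no longer raises TypeError, at the cost of silently getting a different output format, so the caller still has to handle both cases.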
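Besides agreeing with Samwise that you should try to stick to standard, well-supported Python modules, I think there is a lesson here about version management.
The tutorial you linked simply installs the latest version of each package. That is generally not good practice. In production environments especially, you want to control your versions so that your code doesn't break because someone else changed one of your dependencies.
In your case, trafilatura version 0.7.0 still supports the json_output keyword argument, but later versions have dropped it, for example the latest version at the time of writing, 0.9.3.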
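To pin the version, you can install it explicitly with pip install trafilatura==0.7.0. As a rough sketch of how you might then confirm at runtime which version is actually installed, the snippet below uses the standard-library importlib.metadata module (Python 3.8+). The helper names installed_version, version_tuple and matches_pin are made up for illustration, and version_tuple assumes plain dotted-integer versions such as 0.7.0.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '0.9.3' into (0, 9, 3). Assumes plain dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def matches_pin(package, pinned):
    """Return True only if package is installed at exactly the pinned version."""
    return installed_version(package) == pinned


if __name__ == "__main__":
    # Prints the installed trafilatura version, or None if it is missing.
    print(installed_version("trafilatura"))
    print(matches_pin("trafilatura", "0.7.0"))
```

Comparing tuples rather than strings matters once you compare versions by order, because "0.10.0" sorts before "0.9.3" as a string but after it as a tuple.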