Extract text from an online image URL in Python
I wrote this code based on references from the web and some YouTube videos, but it doesn't work for me, and I can't figure out what the problem might be.
import io
import requests
import pytesseract
from PIL import Image
r = requests.get("http://www.teamjimmyjoe.com/wp-content/uploads/2014/09/Classic-Best-Funny-Text-Messages-earthquake-titties.jpg",stream=True)
# print(type(r))  # <class 'requests.models.Response'>
img = Image.open(io.BytesIO(r.content))
# print( type(img) ) # <class 'PIL.JpegImagePlugin.JpegImageFile'>
text = pytesseract.image_to_string(img)
print(text)
I get this error:
File "F:\Projects\FileExtractor\untitled3.py", line 16, in <module>
img = Image.open(io.BytesIO(r.content))
File "C:\ProgramData\Anaconda3\lib\site-packages\PIL\Image.py", line 2943, in open
raise UnidentifiedImageError(
UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x000001E85C0BAA40>
Please help me solve this.
Thanks
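Always start with the simplest check and work up from there. Here, that means printing what the request actually returned before handing it to PIL.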
import requests
# import pytesseract
# from PIL import Image
r = requests.get("http://www.teamjimmyjoe.com/wp-content/uploads/2014/09/Classic-Best-Funny-Text-Messages-earthquake-titties.jpg",stream=True)
print(r.text)
This produces:
<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
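The problem is that you never downloaded the image: the request was blocked by Mod_Security, so r.content holds this HTML error page, which PIL cannot identify as an image. You need to get past that block before you can get the image and its text.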
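One way to catch this kind of failure early is to inspect the response before passing it to PIL. The sketch below is only an illustration: the helper name explain_non_image is hypothetical, and the checks (status code, Content-Type, a peek at the first byte) are my assumptions about what is worth testing. A "Not Acceptable!" page usually comes with HTTP status 406, but the output above does not show the status, so treat that as a guess.

```python
def explain_non_image(status_code, content_type, body):
    """Return a short reason the response is not an image, or None if it looks fine.

    Hypothetical helper: pass it r.status_code, the Content-Type header,
    and r.content from a requests response.
    """
    if status_code != 200:
        return f"HTTP status {status_code}"
    if not content_type.lower().startswith("image/"):
        return f"unexpected Content-Type {content_type!r}"
    # Image formats such as JPEG and PNG start with binary signatures,
    # so a leading "<" strongly suggests an HTML error page.
    if body.lstrip()[:1] == b"<":
        return "body looks like HTML, not image data"
    return None


# Usage with requests (a network call, so shown as comments):
# r = requests.get(url)
# reason = explain_non_image(
#     r.status_code, r.headers.get("Content-Type", ""), r.content
# )
# if reason:
#     raise RuntimeError(f"did not get an image: {reason}")
```

With a check like this, the script would stop with a readable message instead of PIL's UnidentifiedImageError.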
On second thought, why not spoof a browser's headers? That now seems to work.
import io
import requests
import pytesseract
from PIL import Image
url = 'http://www.teamjimmyjoe.com/wp-content/uploads/2014/09/Classic-Best-Funny-Text-Messages-earthquake-titties.jpg'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
img = Image.open(io.BytesIO(r.content))
# print(type(img))  # <class 'PIL.JpegImagePlugin.JpegImageFile'>
text = pytesseract.image_to_string(img)
print(text)
The output is:
Hey! | just saw on CNN
there was an earthquake
near you. Are you ok?
| Yes! We're all fine!
What did it rate.on the titty
scale?
| Well they only jiggled a |
little bit, so probably not
that high.
HAHAHAHAHAHA | LOVE
YOU
Richter scale. My phone is |
a 12 yr old boy.
—————————r
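For reuse, the working script can be wrapped in a single function. This is only a sketch: the name image_text_from_url, the constant BROWSER_HEADERS, the 10-second timeout, and the raise_for_status() call are my additions, not part of the original answer. It assumes requests, Pillow, and pytesseract are installed, along with the Tesseract binary.

```python
import io

# Same browser-like User-Agent string as in the answer.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/39.0.2171.95 Safari/537.36"
    )
}


def image_text_from_url(url, headers=None, timeout=10):
    """Download an image and return the text Tesseract finds in it."""
    # Imported inside the function so this file can be imported
    # even where the OCR stack is not installed.
    import requests
    import pytesseract
    from PIL import Image

    r = requests.get(url, headers=headers or BROWSER_HEADERS, timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx reply
    with Image.open(io.BytesIO(r.content)) as img:
        return pytesseract.image_to_string(img)
```

Calling image_text_from_url(url) with the URL from the question should return the same text as the script, provided the server still accepts that User-Agent.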