Cannot extract URLs from a text file
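I am trying to parse the contents of an online text file and then extract all the URLs. Everything works except the URL extraction, which simply doesn't happen. I tried the same process on a local file and it worked. What's wrong?
Code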
import requests
import re
from io import StringIO

link = "https://pastebin.com/raw/B8QauiXU"
urls = requests.get(link)
with open(urls.text) as file, io.StringIO() as output:
    for line in file:
        urls = re.findall('https?://[^\s<>"]+[|www\.^\s<>"]+', line)
        print(*urls, file=output)
    urls = output.getvalue()
print(urls)
Output
https://google.com and https://bing.com are both the two largest search engines in the world. They are followed by https://duckduckgo.com.
You didn't escape the //
I fixed the regex for you:
https?:\/\/[^\s<>"]+[|www\.^\s<>"]+
By the way, you should import re.
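As a quick check of the escaping question, here is a small sketch that runs both forms of the pattern against the sample line from the question. In Python's `re` module the forward slash is not a special character, so the two patterns would be expected to match the same text. The `sample` string is copied from the output shown above, and nothing here touches the network.

```python
import re

# Sample line copied from the question's output
sample = (
    "https://google.com and https://bing.com are both the two largest "
    "search engines in the world. They are followed by https://duckduckgo.com."
)

# The same pattern, with and without escaped slashes
unescaped = r'https?://[^\s<>"]+[|www\.^\s<>"]+'
escaped = r'https?:\/\/[^\s<>"]+[|www\.^\s<>"]+'

found_unescaped = re.findall(unescaped, sample)
found_escaped = re.findall(escaped, sample)

print(found_unescaped)
print(found_unescaped == found_escaped)
```

If both lists come out equal, the escaping is not what prevents the extraction, and the way the downloaded text is read is the more likely place to look.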
Make your regex a raw string.
This works fine:
import requests, re
from io import StringIO

with StringIO() as output:
    link = "https://pastebin.com/raw/B8QauiXU"
    # .text is already the file's contents as a string, so no open() is needed
    data = requests.get(link).text
    # Raw string keeps backslashes such as \s intact for the regex engine
    urls = re.findall(r'https?://[^\s<>"]+[|www\.^\s<>"]+', data)
    for i, url in enumerate(urls):
        output.write(f"{i}: {url}\n")
    print(output.getvalue())
Output:
0: https://google.com
1: https://bing.com
2: https://duckduckgo.com.
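One way to see why the original attempt behaves differently from the local-file version: `open()` expects a file path, whereas `requests.get(link).text` is already the file's contents as a string, so the text can be searched directly. Below is a minimal sketch under that assumption. The helper name `extract_urls` and the simplified pattern are illustrative choices, not taken from the answers above. It also strips trailing sentence punctuation, so the last URL comes out without the final period.

```python
import re

# Simplified, hypothetical pattern: the scheme followed by any run of
# characters that are not whitespace, angle brackets, or double quotes
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def extract_urls(text):
    """Return the URLs found in text, with trailing punctuation stripped."""
    return [match.rstrip('.,;:!?') for match in URL_PATTERN.findall(text)]


# Sample line copied from the question's output
sample = (
    "https://google.com and https://bing.com are both the two largest "
    "search engines in the world. They are followed by https://duckduckgo.com."
)

urls = extract_urls(sample)
for i, url in enumerate(urls):
    print(f"{i}: {url}")
```

With the real file, passing `requests.get(link).text` to `extract_urls` should work the same way, since the function only needs a string.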