WinInet only downloading a part of a webpage
I have a function that downloads a webpage to a text file:
#include <iostream>
#include <string>
#include <fstream>
#include <Windows.h>
#include <WinINet.h>
#pragma comment(lib, "WinINet.lib")

void Download(const std::wstring& url)
{
    std::ofstream fout(L"temp.txt");
    HINTERNET hopen = InternetOpen(L"MyAppName",
        INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if (hopen)
    {
        DWORD flags = INTERNET_FLAG_DONT_CACHE;
        if (url.find(L"https://") == 0)
            flags |= INTERNET_FLAG_SECURE;
        HINTERNET hinternet = InternetOpenUrl(hopen, url.c_str(), NULL, 0, flags, 0);
        if (hinternet)
        {
            char buf[1024];
            DWORD received = 0;
            while (InternetReadFile(hinternet, buf, sizeof(buf), &received))
            {
                if (!received) break;
                fout.write(buf, received);
            }
            InternetCloseHandle(hinternet);
        }
        InternetCloseHandle(hopen);
    }
}
When I pass it "https://camelcamelcamel.com/Lodge-LMS3-Miniature-Skillet/product/B000LXA9YI" as the argument, all it outputs is this: https://hastebin.com/gilomexomu.xml (too large to paste inline).
That cuts off most of the webpage. I'm not sure whether the site has some anti-scraping measure or whether the page is simply too large.
It's not your code; it's the website. I believe it will only deliver gzip-compressed data; otherwise it bails out after a few KB. curl shows the site aborting the transfer prematurely:
$ curl https://camelcamelcamel.com/Lodge-LMS3-Miniature-Skillet/product/B000LXA9YI -o text.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15725    0 15725    0     0   4702      0 --:--:--  0:00:03 --:--:--  4702
curl: (18) transfer closed with outstanding read data remaining
So I did two things to make your code mimic a web browser more closely:
- Send exactly the same headers and user-agent that a browser sends.
- Since this site only seems willing to return gzip-encoded content, I had to change your file-saving code to write in binary mode instead of text mode (text mode makes the Windows CRT incorrectly "fix" newline characters).
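The binary-mode point is easy to verify in isolation. Below is a small, portable sketch (the file name `blob.bin` and the helper `binary_round_trip` are mine, not from the original code) showing why a compressed body has to be written byte-for-byte:

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Write bytes in binary mode, read them back, and report whether the
// round trip preserved every byte. In text mode on Windows, the CRT
// rewrites each 0x0A as 0x0D 0x0A on output, which corrupts a gzip
// payload; std::ios::binary (or fopen's "wb") passes bytes through.
bool binary_round_trip(const std::string& path, const std::string& payload)
{
    {
        std::ofstream out(path, std::ios::binary);
        out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
    }
    std::ifstream in(path, std::ios::binary);
    std::string back((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    return back == payload;
}
```

The payload below deliberately contains a bare 0x0A and a 0x0D 0x0A pair, the bytes a text-mode stream would mangle.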
Then, to decode the full HTML, I just ran this from a Bash prompt:
gunzip < temp.txt > temp_final.txt
The result is that temp_final.txt contains the complete HTML response.
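Before piping the file through gunzip, it's worth confirming the body really is gzip: a gzip stream always begins with the two magic bytes 0x1f 0x8b (RFC 1952). A minimal check (the function name `looks_like_gzip` is mine, not from the original code):

```cpp
#include <cstdio>

// Returns true if the file begins with the gzip signature 0x1f 0x8b
// (RFC 1952), i.e. the server really did send a gzip-compressed body.
bool looks_like_gzip(const char* path)
{
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    unsigned char magic[2] = {0, 0};
    const bool ok = std::fread(magic, 1, 2, f) == 2
                 && magic[0] == 0x1f && magic[1] == 0x8b;
    std::fclose(f);
    return ok;
}
```

Running this on temp.txt before gunzip tells you whether a short or garbled result is a transfer problem rather than a decompression problem.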
Here is the adjusted code:
#include <iostream>
#include <cstdio>
#include <string>
#include <Windows.h>
#include <WinINet.h>
#pragma comment(lib, "WinINet.lib")

void Download(const std::wstring& url)
{
    // "wb": binary mode, so the CRT doesn't mangle newline bytes in the gzip stream
    FILE* file = fopen("temp.txt", "wb");
    if (!file)
        return;
    HINTERNET hopen = InternetOpen(L"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if (hopen)
    {
        DWORD flags = INTERNET_FLAG_DONT_CACHE;
        if (url.find(L"https://") == 0)
            flags |= INTERNET_FLAG_SECURE;
        LPCWSTR headers = L"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36\r\n"
            L"DNT: 1\r\n"
            L"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\r\n"
            L"Accept-Encoding: gzip, deflate, br\r\n"
            L"Accept-Language: en-US,en;q=0.9\r\n";
        // Pass -1L as dwHeadersLength so WinInet treats the header string as zero-terminated
        HINTERNET hinternet = InternetOpenUrl(hopen, url.c_str(), headers, (DWORD)-1L, flags, 0);
        if (hinternet)
        {
            char buf[1024];
            DWORD received = 0;
            while (InternetReadFile(hinternet, buf, sizeof(buf), &received))
            {
                if (!received) break;
                printf("%lu\n", received);  // debug: bytes received per chunk
                fwrite(buf, 1, received, file);
            }
            InternetCloseHandle(hinternet);
        }
        InternetCloseHandle(hopen);
    }
    fclose(file);
}

int main()
{
    Download(L"https://camelcamelcamel.com/Lodge-LMS3-Miniature-Skillet/product/B000LXA9YI");
    return 0;
}
I tried removing Accept-Encoding, and also setting it to "identity". In both cases the server sent back half a page and then aborted.