Copy InnerHTML to text file daily using JavaScript

Question

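I am trying to write a JavaScript program that takes the inner HTML of the top news story on the BBC website (http://www.bbc.co.uk/news) and puts it into a txt file. I don't know much JavaScript; I know more .BAT and .VBS, but I know they can't do this.

I'm not sure how to approach it. I'd like it to scan for a fixed piece of outer HTML and then copy the inner content to a txt file.

However, I can't seem to find an outer HTML element that stays the same every day. For example, this is today's headline.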

<span class="title-link__title-text">Benefit plan 'could hit young Britons'</span>

As you can see, it contains the headline.

In case it makes a difference, I am using Firefox.

Any help would be appreciated.

Regards,

Master-chip.

Answer 1

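Do you want to download a txt file whose content comes from the HTML? If so, you can use this: create txt file and download it. If you want to get the text from all of the title spans, you need to do this: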

var txt = "";
// Collect every headline span on the page
var nodeList = document.querySelectorAll(".title-link__title-text");
for (var i = 0; i < nodeList.length; i++) {
   txt += "\n" + nodeList[i].innerText; // one headline per line
}

Then write the txt variable to a file, as in the post I mentioned above.
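The linked post is not reproduced here, so the following is only a minimal sketch of that last step, assuming a browser that supports Blob and the download attribute on links. The helper names headlinesToText and downloadText and the file name headlines.txt are made up for illustration.

```javascript
// Join headline strings into one newline-separated block of text,
// dropping empty entries and surrounding whitespace.
function headlinesToText(headlines) {
  return headlines
    .map(function (h) { return h.trim(); })
    .filter(function (h) { return h.length > 0; })
    .join("\n");
}

// Browser-only: offer the text as a file download (hypothetical helper).
function downloadText(text, fileName) {
  var blob = new Blob([text], { type: "text/plain" });
  var url = URL.createObjectURL(blob);
  var link = document.createElement("a");
  link.href = url;
  link.download = fileName;
  document.body.appendChild(link);
  link.click();
  document.body.removeChild(link);
  // Release the object URL a little later, once the download has started.
  setTimeout(function () { URL.revokeObjectURL(url); }, 1000);
}

// Only touch the DOM when running inside a page.
if (typeof document !== "undefined") {
  var nodes = document.querySelectorAll(".title-link__title-text");
  var titles = Array.prototype.map.call(nodes, function (n) {
    return n.textContent;
  });
  downloadText(headlinesToText(titles), "headlines.txt");
}
```

Pasted into the browser console on the news page, this should prompt a download of the collected headlines. It does not solve the "every day" part by itself.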

Answer 2

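My thoughts:

JS can be used to get data/text from a page, but to save it to a file you would have to use something like Python or PHP on the back end.
Why use JS at all? You can scrape the web perfectly well with cURL. If it is easier for you, use PHP cURL.

You can scrape/download a web page using: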

<?php
    // Defining the basic cURL function
    function curl($url) {
        $ch = curl_init();  // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);    // Setting cURL's URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the data from the function
    }
?>

Then use the function wherever you need it:

<?php
    $scraped_website = curl("http://www.yahoo.com");  // Executing our curl function to scrape the webpage http://www.yahoo.com and return the results into the $scraped_website variable
?>

Reference links:

Web scraping with PHP and CURL

Scraping in PHP with CURL

You can scrape more precisely by targeting DIVs and the HTML elements of individual nodes. Check these: Part1 - Part2 - Part3

Hope this helps. Happy coding!

Answer 3

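Pure client-side browser approach:

OK, I made this fiddle for you; it may help others as well. I found it interesting and challenging. Here are the main points of how I implemented a possible solution:

Use the Blob API to create the text file on the fly.
Load http://www.bbc.co.uk/news in an iframe (cross-origin access; see the notes section below).
In the iframe's load event, use setTimeout or setInterval (commented out; for repeating every hour or every day), and adjust the delay as needed.
Query the text nodes with document.querySelectorAll(".title-link span"), which appears to be generic, judging by the page source.
See the fiddle link.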

Javascript:

 (function () {
    var textFile = null,
        makeTextFile = function (text) {
            var data = new Blob([text], {
                type: 'text/plain'
            });

            // If we are replacing a previously generated file we need to
            // manually revoke the object URL to avoid memory leaks.
            if (textFile !== null) {
                window.URL.revokeObjectURL(textFile);
            }

            textFile = window.URL.createObjectURL(data);

            return textFile;
        };

    var iframe = document.getElementById('frame');    
    var commFunc = function () {
            var iframe2 = document.getElementById('frame'); //This is required to get the fresh updated DOM
            var innerDoc = iframe2.contentDocument || iframe2.contentWindow.document;            
            var getAll = Array.prototype.slice.call(innerDoc.querySelectorAll(".title-link span"));          
            var dummy = "";
            // Index loop: for...in on an array would also visit inherited enumerable properties
            for (var i = 0; i < getAll.length; i++) {
                dummy = dummy.concat("\n" + getAll[i].innerText);
            }
            var link = document.createElement("a");
            link.href = makeTextFile(dummy);
            link.download = "sample.txt"
            link.click();
            console.log("Downloaded the sample.txt file");
        };

    iframe.onload = function () {
        setTimeout(commFunc, 1000); //Adjust the time required to load
        //setInterval(commFunc, 1000);
    };  

    // Also allow a manual download: click the button once the page inside the iframe has loaded
    document.getElementById('create').addEventListener('click', commFunc);
})();

HTML:

<span class="title-link__title-text">Benefit plan 'could hit young Britons'</span>
    <div>
        <iframe id="frame" src="http://www.bbc.co.uk/news"></iframe>
    </div>
    <button id="create">Download</button>

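Notes:

To run the above JavaScript in Chrome you need to disable web security. The script should run fine in Firefox without any tweaks.
This is one possible example that works with pure browser script. The tab has to stay active for periodic scraping.
Targets modern browsers.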
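For the periodic part, here is a minimal sketch of running a task once a day at a fixed local time using only setTimeout. The helper names msUntilNextRun and scheduleDaily are hypothetical, and the sketch assumes the tab (or process) stays open the whole time.

```javascript
// Milliseconds from `now` until the next occurrence of hour:minute (local time).
function msUntilNextRun(now, hour, minute) {
  var next = new Date(now.getTime());
  next.setHours(hour, minute, 0, 0);
  if (next.getTime() <= now.getTime()) {
    next.setDate(next.getDate() + 1); // already passed today, so use tomorrow
  }
  return next.getTime() - now.getTime();
}

// Run `task` once a day at the given local time (hypothetical helper).
function scheduleDaily(task, hour, minute) {
  setTimeout(function () {
    task();
    scheduleDaily(task, hour, minute); // re-arm for the following day
  }, msUntilNextRun(new Date(), hour, minute));
}
```

Called from inside the script above, scheduleDaily(commFunc, 9, 0) would trigger the download at 09:00 each day. Re-arming with a fresh setTimeout each time, rather than using setInterval with a 24-hour period, keeps the run time from drifting.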

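Recommended approach:

Use a node.js server, where you can modify the above script to run standalone.
Or any server-side scripting framework, such as PHP, Java Spring, etc.

Using the Node.js approach: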

Javascript:

var jsdom = require("node-jsdom");
var fs = require("fs");
jsdom.env({
  url: "http://www.bbc.co.uk/news",
  scripts: ["http://code.jquery.com/jquery.js"],
  done: function (errors, window) {
    var $ = window.$;
    console.log("HN Links");
    $(".title-link span").each(function() {
      //console.log(" -", $(this).text());
      // appendFileSync creates the file if it is missing and keeps the writes in order
      fs.appendFileSync("sample.txt", "\n" + $(this).text());
    });
  }
});

Dependencies of the code above:

NodeJS
JSDOM
Jquery
NodeJS Filesystem
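If pulling in JSDOM and jQuery is more than you need, a dependency-free variant can be sketched with a regular expression. This is a naive sketch, not a general HTML parser: it assumes the headline markup looks exactly like the span quoted in the question, and the names extractTitles, dailyFileName and saveTitles, as well as the file naming scheme, are hypothetical.

```javascript
var fs = require("fs");

// Extract the text of every <span class="title-link__title-text"> in an HTML string.
// Assumes the span contains plain text only (no nested tags).
function extractTitles(html) {
  var pattern = /<span class="title-link__title-text">([^<]*)<\/span>/g;
  var titles = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    titles.push(match[1].trim());
  }
  return titles;
}

// Build a per-day file name of the form headlines-YYYY-MM-DD.txt (UTC date).
function dailyFileName(date) {
  return "headlines-" + date.toISOString().slice(0, 10) + ".txt";
}

// Append the headlines found in `html` to the file for the given day.
function saveTitles(html, date) {
  var titles = extractTitles(html);
  fs.appendFileSync(dailyFileName(date), titles.join("\n") + "\n");
  return titles;
}
```

The HTML string would still have to be fetched first, for example with the JSDOM setup above or any HTTP client; saveTitles(html, new Date()) then writes one file per day.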

Hope this helps you and others.
