file_get_contents from HTML, explode, write to a cell of a spreadsheet
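What I am trying to do is pull specific content from the source of a URL with file_get_contents(), explode() around the tags that enclose that content so that only the HTML-formatted content is returned, and then write it into a single cell of a spreadsheet or CSV. Simple, I thought.
Here is what I have: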
<?php
//My .html
$url = 'http://spiderlearning.com/demo/ALG_SA_U1_L1.html';
//Get content
$content = file_get_contents($url);
//Get content sections
$lesson_name = explode( '<section id="nameField" class="editable" contenteditable="false">' , $content);
$section_title1 = explode( '<a onclick="goToByScroll(\'obj0\')" href="#">' , $content);
$challenge_q = explode( '<section id="redactor_content" class="editable" contenteditable="false">' , $content);
//Write content
$write1 = explode("</section>" , $lesson_name[1]);
$write2 = explode("</a>" , $section_title1[1]);
$write3 = explode("</section>" , $challenge_q[1]);
//Into arrays
$line1 = array($write1[0],$write2[0],$write3[0]);
$list = array($line1);
//Open .csv
$file = fopen("data/data.csv", "w");
//Write as line, delimitate with ";"
foreach ($list as $line) fputcsv($file, $line, ';');
//Close
fclose($file);
?>
Which returns:
What I am looking for is:
CSV:
Unit 1 Lesson 1; 1. Challenge Questions; <p><img src="https://s3-eu-west-1.amazonaws.com/teacher-uploads.fishtree.com/SpiderLearning/1428953716a42b06b9-1ce1-4594-badd-4ab8c9b65ac0.jpeg" alt="" rel="float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;" style="float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;"></p><p>Before you begin this lesson, let's see what you already know about the topic. Take a moment to complete the three Challenge Questions that follow.</p>
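The problem, as far as I can tell, is the carriage returns in the formatted content. It also puts enclosing characters around the returned content, and I am not sure where they come from. Is there any way to escape these? I have put together similar functions in the past without any problems, but this is my first time going from file_get_contents() into a CSV, and after a few weeks I have finally hit a wall.
First, to get rid of the line breaks, do the following: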
foreach ($list as $line) fputcsv($file, preg_replace( "/\r|\n/", "", $line), ';');
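As a minimal sketch of the same idea, the snippet below wraps the newline stripping in a small helper and writes to an in-memory stream instead of data/data.csv so it can be run on its own. The helper name writeCsvRowWithoutLineBreaks and the sample row are assumptions made for illustration; they are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): remove CR/LF from every
// field, then write the fields as one semicolon-delimited CSV row.
function writeCsvRowWithoutLineBreaks($handle, array $fields, string $delimiter = ';'): void
{
    // preg_replace() accepts an array subject and returns the cleaned array.
    $clean = preg_replace('/\r|\n/', '', $fields);
    // Enclosure and escape characters are passed explicitly.
    fputcsv($handle, $clean, $delimiter, '"', '\\');
}

// Made-up sample row containing line breaks inside the fields.
$row = ["Unit 1\r\nLesson 1", "<p>Hello</p>\n<p>World</p>"];

$handle = fopen('php://memory', 'w+');
writeCsvRowWithoutLineBreaks($handle, $row);
rewind($handle);
$csv = stream_get_contents($handle);
fclose($handle);

echo $csv;
```

Note that the line breaks are removed, not replaced by a space, so text on either side of a break is joined directly. The only newline left in the output is the one fputcsv appends to end the row.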
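It is better to keep the field enclosures that fputcsv introduces. Without them, any semicolon inside one of the fields would break your CSV. With the enclosures, the CSV you want above looks like this: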
"Unit 1 Lesson 1";"1. Challenge Questions";"<p><img src=""https://s3-eu-west-1.amazonaws.com/teacher-uploads.fishtree.com/SpiderLearning/1428953716a42b06b9-1ce1-4594-badd-4ab8c9b65ac0.jpeg"" alt="""" rel=""float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;"" style=""float: left; width: 171px; height: 113.697826086957px; margin: 0px 10px 10px 0px;""></p><p>Before you begin this lesson, let's see what you already know about the topic. Take a moment to complete the three Challenge Questions that follow.</p>"
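To illustrate why the enclosures matter, here is a small round-trip sketch: a row whose last field contains both semicolons and double quotes is written with fputcsv and read back with fgetcsv. The sample row is a shortened, made-up version of the content above, and the in-memory stream stands in for the real file.

```php
<?php
// Hypothetical row: the third field contains semicolons and double quotes,
// just like the style attribute in the real content.
$row = [
    'Unit 1 Lesson 1',
    '1. Challenge Questions',
    '<img style="float: left; width: 171px;">',
];

$handle = fopen('php://memory', 'w+');
fputcsv($handle, $row, ';', '"', '\\');

// Capture the raw CSV text that was written.
rewind($handle);
$written = stream_get_contents($handle);

// Read the same row back with the same delimiter and enclosure (length 0 = no limit).
rewind($handle);
$readBack = fgetcsv($handle, 0, ';', '"', '\\');
fclose($handle);

echo $written;
var_dump($readBack === $row);
```

Because the field is enclosed and its inner quotes are doubled, the reader still sees three fields, and the semicolons inside the HTML are not treated as delimiters.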
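However, in most cases you cannot open this directly in Excel (there is a global setting somewhere). You need to import the data and then set the following:
Here is an alternative solution based on PHP's DOMDocument class: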
$url = 'http://spiderlearning.com/demo/ALG_SA_U1_L1.html';
// Load HTML via DOMDocument class
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
// Extract the elements of interest
$xpath = new DOMXPath($doc);
$list = [
[
"lesson" => $doc->getElementById('nameField')->textContent,
"section" => $xpath->query("//div[@class='activitySelect']//a")[0]->textContent,
"challenge" => innerHTML($doc->getElementById('redactor_content'))
]
];
// Write CSV (unchanged code)
$file = fopen("php://output", "w");
foreach ($list as $line) fputcsv($file, $line, ';');
fclose($file);
// Utility function
function innerHTML($node) {
return implode(array_map([$node->ownerDocument,"saveHTML"],
iterator_to_array($node->childNodes)));
}
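The DOM-based approach above can be tried without network access. The sketch below runs the same kind of extraction on an inline HTML string; that string is a made-up stand-in for the real page, and the helpers firstById and innerHtmlOrEmpty are hypothetical names. It also swaps getElementById for an XPath lookup and returns an empty string when an element is missing, which is an added safeguard and not part of the answer's code.

```php
<?php
// Hypothetical stand-in for the downloaded page; the real markup may differ.
$html = '<html><body>'
      . '<section id="nameField" class="editable">Unit 1 Lesson 1</section>'
      . '<section id="redactor_content" class="editable">'
      . '<p>Before you <b>begin</b></p><p>Second paragraph</p>'
      . '</section>'
      . '</body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: first element with the given id, or null if absent.
function firstById(DOMXPath $xpath, string $id): ?DOMNode
{
    return $xpath->query('//*[@id="' . $id . '"]')->item(0);
}

// Hypothetical helper: serialized child nodes of $node, or '' if $node is null.
function innerHtmlOrEmpty(?DOMNode $node): string
{
    if ($node === null) {
        return '';
    }
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

$lessonNode = firstById($xpath, 'nameField');
$lesson     = $lessonNode === null ? '' : $lessonNode->textContent;
$challenge  = innerHtmlOrEmpty(firstById($xpath, 'redactor_content'));
$missing    = innerHtmlOrEmpty(firstById($xpath, 'doesNotExist'));

echo $lesson, "\n", $challenge, "\n";
```

The null checks mean that a page whose markup changes yields empty CSV cells rather than a fatal error when a property is read from a missing element.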