Image is always at the top in HTML converted from PDF
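I am using the following code, and all the content of a given PDF page is converted correctly. However, if the PDF page has an image in the middle, the image is shown at the top of the HTML.
PHP code: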
umask(0);
$output = shell_exec('pdftohtml create.pdf create.html');
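Edit:
Please see the PDF I am using for this: https://www.dropbox.com/s/6uy9wq27ff00n0x/create.pdf?dl=0
In this PDF, the image comes after 2 lines.
// Load the converted HTML page. pdftohtml (run via shell_exec) appends 's' to the HTML file name, giving creates.html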
$html = file_get_contents('creates.html');
print_r($html);
// Output
<!DOCTYPE html><html>
<head>
</head>
<body>
<img src="/var/www/html/pdf-sign/public/converted_path/create-1_1.png"/><br/>
Test document PDF <br/> <br/>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ultrices in porttitor <br/>in, accumsan non quam. Nam consectetur porttitor rhoncus. Curabitur eu est et leo feugiat <br/>auctor vel quis lorem. Ut et ligula dolor, sit amet consequat lorem. Aliquam porta eros sed <br/>velit imperdiet egestas. Maecenas tempus eros ut diam ullamcorper id dictum libero <br/>tempor. Donec quis augue quis magna condimentum lobortis. Quisque imperdiet ipsum vel <br/>magna viverra rutrum. Cras viverra molestie urna, vitae vestibulum turpis varius id. <br/>   PLACEHOLDER      <br/>nulla ac dolor. Maecenas urna elit, tincidunt in dapibus nec, vehicula eu dui. Duis lacinia <br/>fringilla massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur <br/>
suscipit felis eget condimentum. Cum sociis natoque penatibus et magnis dis parturient <br/>montes, nascetur ridiculus mus. Integer bibendum sagittis ligula, non faucibus nulla volutpat <br/>vitae. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.  <br/>In aliquet quam et velit bibendum accumsan. Cum sociis natoque penatibus et magnis dis <br/>parturient montes, nascetur ridiculus mus. Vestibulum vitae ipsum nec arcu semper <br/>adipiscing at ac lacus. Praesent id pellentesque orci. Morbi congue viverra nisl nec rhoncus. <br/>Integer mattis, ipsum a tincidunt commodo, lacus arcu elementum elit, at mollis eros ante ac <br/>risus. In volutpat, ante at pretium ultricies, velit magna suscipit enim, aliquet blandit massa <br/>orci nec lorem. Nulla facilisi. Duis eu vehicula arcu. Nulla facilisi. Maecenas pellentesque <br/>volutpat felis, quis tristique ligula luctus vel. Sed nec mi eros. Integer augue enim, sollicitudin <br/>ullamcorper mattis eget, aliquam in est. Morbi sollicitudin libero nec augue dignissim ut <br/>consectetur dui volutpat. Nulla facilisi. Mauris egestas vestibulum neque cursus tincidunt. <br/>Donec sit amet pulvinar orci.  <br/>Quisque volutpat pharetra tincidunt. Fusce sapien arcu, molestie eget varius egestas, <br/>faucibus ac urna. Sed at nisi in velit egestas aliquam ut a felis. Aenean malesuada iaculis nisl, <br/>ut tempor lacus egestas consequat. Nam nibh lectus, gravida sed egestas ut, feugiat quis <br/>dolor. Donec eu leo enim, non laoreet ante. Morbi dictum tempor vulputate. Phasellus <br/>ultricies risus vel augue sagittis euismod. Vivamus tincidunt placerat nisi in aliquam. Cras <br/>quis mi ac nunc pretium aliquam. Aenean elementum erat ac metus commodo rhoncus. <br/>
<hr/>
</body>
</html>
Now look:
<img src="/var/www/html/pdf-sign/public/converted_path/create-1_1.png"/>
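It is right after the BODY tag. That means the image has been moved to the top instead of appearing at the third line.
Instead of print_r, just print the contents of the file: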
<?php
echo file_get_contents('creates.html');
?>
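And make sure this is the only thing your PHP outputs. If you emit some HTML before it, the layout will break.
I ran into this problem too, and I have a solution. First, you need to convert the PDF document to XML: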
$output = shell_exec('pdftohtml -xml create.pdf create.xml');
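If the file names ever come from user input, it may be safer to quote them before they reach the shell. A minimal sketch, assuming the same `pdftohtml -xml` invocation as above; the helper name `buildPdfToXmlCommand` is hypothetical:

```php
<?php
// Hypothetical helper: build the pdftohtml command with shell-quoted paths.
function buildPdfToXmlCommand(string $pdfPath, string $xmlPath): string
{
    return 'pdftohtml -xml '
        . escapeshellarg($pdfPath) . ' '
        . escapeshellarg($xmlPath);
}

$command = buildPdfToXmlCommand('create.pdf', 'create.xml');
echo $command, "\n";

// To actually run the conversion:
// $output = shell_exec($command);
```

The command is only printed here so the sketch runs without pdftohtml installed.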
The XML output looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.33.0">
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
<fontspec id="0" size="16" family="Times" color="#000000"/>
<image top="117" left="51" width="424" height="96" src="converted_path/create1.jpg"/>
<text top="57" left="99" width="144" height="16" font="0">Test document PDF</text>
</page>
</pdf2xml>
Then load this XML string into an object:
$xml = simplexml_load_string($xmlContent);
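The call above assumes `$xmlContent` already holds the contents of create.xml, for example from `file_get_contents('create.xml')`. A minimal sketch of loading it with basic error handling; the helper name `loadPdfXml` is hypothetical, and the inline sample string only stands in for the real file:

```php
<?php
// Hypothetical helper: parse pdftohtml -xml output held in a string.
// Returns a SimpleXMLElement, or throws if the XML is malformed.
function loadPdfXml(string $xmlContent): SimpleXMLElement
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlContent);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        $message = $errors ? trim($errors[0]->message) : 'unknown error';
        throw new RuntimeException('Could not parse XML: ' . $message);
    }
    return $xml;
}

// Sample data standing in for file_get_contents('create.xml').
$xmlContent = '<pdf2xml producer="poppler" version="0.33.0">'
    . '<page number="1" position="absolute" top="0" left="0" height="1262" width="892">'
    . '<image top="117" left="51" width="424" height="96" src="converted_path/create1.jpg"/>'
    . '<text top="57" left="99" width="144" height="16" font="0">Test document PDF</text>'
    . '</page>'
    . '</pdf2xml>';

$xml = loadPdfXml($xmlContent);
echo count($xml->page), " page(s) loaded\n";
```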
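After that, use the `top` attribute of each XML element to work out the exact image position, as follows:
$all_attribute = [];
$pg = 0;
foreach ($xml->page as $page) {
    foreach ($page as $e) {
        // Key each element by its top offset so it can be sorted vertically.
        // Note: elements that share the same top value overwrite each other.
        $all_attribute[$pg][(int)$e['top']] = $e;
    }
    $pg++;
}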
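Once the `top` value of every element has been collected, sort each page by array key:
// Iterate by reference; otherwise ksort() would only sort a copy.
foreach ($all_attribute as &$page) {
    ksort($page);
}
unset($page);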
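Keying the array by `top` drops elements that share the same `top` value, for example two text runs on one line. A possible alternative, sketched under that assumption, collects the elements in a list and sorts with `usort` on `top` and then `left`. The function name `sortPageElements` is hypothetical:

```php
<?php
// Hypothetical helper: return a page's <image> and <text> elements
// ordered top-to-bottom, then left-to-right.
function sortPageElements(SimpleXMLElement $page): array
{
    $elements = [];
    foreach ($page->children() as $e) {
        $name = $e->getName();
        if ($name === 'image' || $name === 'text') {
            $elements[] = $e;
        }
    }
    usort($elements, function ($a, $b) {
        // Compare [top, left] pairs element by element.
        return [(int)$a['top'], (int)$a['left']]
            <=> [(int)$b['top'], (int)$b['left']];
    });
    return $elements;
}

$page = simplexml_load_string(
    '<page number="1">'
    . '<fontspec id="0" size="16" family="Times" color="#000000"/>'
    . '<image top="117" left="51" width="424" height="96" src="converted_path/create1.jpg"/>'
    . '<text top="57" left="99" width="144" height="16" font="0">Test document PDF</text>'
    . '</page>'
);

foreach (sortPageElements($page) as $e) {
    echo $e->getName(), ' top=', (int)$e['top'], "\n";
}
```

With the sample page, the text element (top 57) is listed before the image (top 117).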
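Once all elements are sorted by their `top` value, just build the HTML like this:
$html = '';
// Walk the array keyed by top, not the raw XML.
foreach ($all_attribute as $page) {
    foreach ($page as $p) {
        if ($p->getName() == 'image') {
            $html .= '<img width="'.$p['width'].'" height="'.$p['height'].'" src="'.$p['src'].'" >';
        }
    }
}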
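The loop above only emits images. To keep an image between the lines that surround it, text and images can be written out together in vertical order. A minimal sketch, assuming plain `<br/>`-separated output like the original pdftohtml HTML; the helper name `renderPage` and the second sample text line are hypothetical:

```php
<?php
// Hypothetical helper: turn one <page> into HTML, emitting text and
// images in vertical order so an image stays between its neighbours.
function renderPage(SimpleXMLElement $page): string
{
    $elements = [];
    foreach ($page->children() as $e) {
        if (in_array($e->getName(), ['image', 'text'], true)) {
            $elements[] = $e;
        }
    }
    // Smaller top means higher on the page.
    usort($elements, function ($a, $b) {
        return (int)$a['top'] <=> (int)$b['top'];
    });

    $html = '';
    foreach ($elements as $e) {
        if ($e->getName() === 'image') {
            $html .= sprintf(
                '<img width="%d" height="%d" src="%s"/><br/>' . "\n",
                (int)$e['width'],
                (int)$e['height'],
                htmlspecialchars((string)$e['src'])
            );
        } else {
            $html .= htmlspecialchars((string)$e) . "<br/>\n";
        }
    }
    return $html;
}

// Sample page: the image sits between two text lines.
$page = simplexml_load_string(
    '<page number="1">'
    . '<image top="117" left="51" width="424" height="96" src="converted_path/create1.jpg"/>'
    . '<text top="57" left="99" width="144" height="16" font="0">Test document PDF</text>'
    . '<text top="230" left="99" width="600" height="16" font="0">Lorem ipsum</text>'
    . '</page>'
);

echo renderPage($page);
```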
I think this will help you.
You can also handle the text fonts.
The XML stores every font in a fontspec element and gives it an id:
<fontspec id="0" size="16" family="Times" color="#000000"/>
and this id is referenced by the font attribute of the text element:
<text top="57" left="99" width="144" height="16" font="0">
Now, using these values, process the fonts as follows:
$font = [];
foreach ($xml->page as $page) {
    foreach ($page as $e) {
        if ($e->getName() == 'fontspec') {
            $font[(int)$e['id']]['family'] = (string)$e['family'];
            $font[(int)$e['id']]['size'] = (string)$e['size'];
            $font[(int)$e['id']]['color'] = (string)$e['color'];
        }
    }
}
After that, apply these fonts when generating the HTML:
$html = '';
foreach ($xml->page as $page) {
    foreach ($page as $p) {
        if ($p->getName() == 'text') {
            // Look up the fontspec referenced by this text element.
            $ind = (int)$p['font'];
            $font_size = $font[$ind]['size'];
            $font_color = $font[$ind]['color'];
            $font_family = $font[$ind]['family'];
            $html .= '<span style="font-size:'.$font_size.'px;color:'.$font_color.';font-family:'.$font_family.'; font-weight: 900;">'.(string)$p.'</span>';
        }
    }
}
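The two font steps can be combined into a pair of small functions. This is only a sketch: the names `buildFontMap` and `renderText` are hypothetical, the text is escaped with `htmlspecialchars`, and the hard-coded `font-weight: 900` is left out on the assumption that weight should not be forced on every span.

```php
<?php
// Hypothetical helper: map each fontspec id to a CSS declaration string.
function buildFontMap(SimpleXMLElement $xml): array
{
    $fonts = [];
    foreach ($xml->page as $page) {
        foreach ($page->fontspec as $spec) {
            $fonts[(int)$spec['id']] = sprintf(
                'font-size:%dpx;color:%s;font-family:%s;',
                (int)$spec['size'],
                (string)$spec['color'],
                (string)$spec['family']
            );
        }
    }
    return $fonts;
}

// Hypothetical helper: wrap one <text> element in a styled <span>.
function renderText(SimpleXMLElement $text, array $fonts): string
{
    // Fall back to no styling if the font id is unknown.
    $style = $fonts[(int)$text['font']] ?? '';
    return '<span style="' . htmlspecialchars($style) . '">'
        . htmlspecialchars((string)$text) . '</span>';
}

$xml = simplexml_load_string(
    '<pdf2xml><page number="1">'
    . '<fontspec id="0" size="16" family="Times" color="#000000"/>'
    . '<text top="57" left="99" width="144" height="16" font="0">Test document PDF</text>'
    . '</page></pdf2xml>'
);

$fonts = buildFontMap($xml);
foreach ($xml->page as $page) {
    foreach ($page->text as $text) {
        echo renderText($text, $fonts), "\n";
    }
}
```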