Faster way to determine if a file is a PDF

Looking for some pointers/tips to improve the speed and/or efficacy of the script below. I'm open to other methods, but I only dabble in PowerShell, cmd, and Python.

Credit where credit is due: the script below is a hack job pieced together from other people's work.

I'm not working locally; I'm accessing a network share over a VPN with a terrible connection speed. Roughly speaking, it runs at about 8 seconds per PDF.

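The problem I'm trying to solve: the goal is to make sure every PDF can be read by Adobe. Images saved with a .pdf extension (but which are not real PDFs) will open in some PDF software, but Adobe hates them. I have a way to convert them, but my bottleneck is identifying them.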

$items = Get-ChildItem | Where-Object { $_.Extension -eq ".pdf" }
$array = @()
$logFile = "RESULTS_$(Get-Date -Format yyyyMMdd).log"   # MM = month ('mm' would be minutes)
$badCounter = 0
$goodCounter = 0
$msg = "`n`nProcessing " + $items.Count + " files... "
Write-Host -NoNewline -ForegroundColor Yellow $msg
foreach ($item in $items) {
    try {
        # reads the whole file as text; this is the slow part over the VPN
        $pdfText = Get-Content $item -Raw
        $ptr3 = '%PDF'
        if ($ptr3 -ne $pdfText.Substring([System.Math]::Max(0, $pdfText.IndexOf($ptr3)), 4)) {
            $array += "$item |-failed"
            $badCounter += 1
            $badCounter    # echo the running count
        }
        else {
            $goodCounter += 1
            $goodCounter   # echo the running count
        }
    }
    catch [System.Exception] {
        Write-Output "$item $_"
    }
}
$totalCounter = $badCounter + $goodCounter

Write-Output $array >> $logFile
1..3 | % { Write-Output "" >> $logFile }

Write-Output "Total: $totalCounter / BAD: $badCounter / GOOD: $goodCounter" >> $logFile
Write-Output "DONE!`n`n"

In case it makes any difference: I'm currently running this in PS version 7.1.3, but I also have 5.1.18 available locally.

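Actually, PDF files are not plain-text files at all but binary files, so you shouldn't read them into a string.
What you are looking for is called the FourCC magic number of the file. This four-character code serves as a magic number that identifies the file type. For PDF files, these 4 bytes are 0x25, 0x50, 0x44, 0x46 ("%PDF"), and the file should start with those bytes.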

For true PDF files, you can test with:

# Windows PowerShell 5.1 syntax; in PowerShell 6+ use -AsByteStream instead of -Encoding Byte
[byte[]]$fourCC = Get-Content -Encoding Byte -ReadCount 4 -TotalCount 4 -Path 'X:\TheFile.pdf'
if ([System.Text.Encoding]::ASCII.GetString($fourCC) -ceq '%PDF') {
    Write-Host "This is a true PDF file"
}
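Since the question mentions both PowerShell 7.1.3 and 5.1, a version-neutral variant may be useful: `Get-Content -Encoding Byte` is only accepted by Windows PowerShell 5.1, whereas reading through a .NET `FileStream` behaves the same in both and pulls only four bytes over the network. This is a minimal sketch; the function name `Test-PdfMagic` is made up for illustration.

```powershell
function Test-PdfMagic {
    # Hypothetical helper: returns $true when the file starts with
    # the bytes 0x25 0x50 0x44 0x46 ('%PDF').
    param (
        [Parameter(Mandatory = $true)]
        [string]$Path
    )

    $magic  = [byte[]](0x25, 0x50, 0x44, 0x46)
    $buffer = [byte[]]::new(4)
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        # Read() may return fewer bytes than requested, so loop
        # until we have 4 bytes or reach the end of the file.
        $total = 0
        while ($total -lt 4) {
            $read = $stream.Read($buffer, $total, 4 - $total)
            if ($read -eq 0) { break }
            $total += $read
        }
    }
    finally {
        $stream.Dispose()
    }

    # files shorter than 4 bytes cannot carry the magic number
    if ($total -lt 4) { return $false }

    for ($i = 0; $i -lt 4; $i++) {
        if ($buffer[$i] -ne $magic[$i]) { return $false }
    }
    return $true
}
```

Call it as `Test-PdfMagic -Path 'X:\TheFile.pdf'`. Pass a full path (for example a file's `FullName`), because .NET methods resolve relative paths against the process working directory, which can differ from PowerShell's current location.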

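However, as you commented, "bank PDFs often start with a blank space". To also consider those files "good", you can read six bytes instead of four, which allows up to two leading bytes before the marker: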

# Windows PowerShell 5.1 syntax; in PowerShell 6+ use -AsByteStream instead of -Encoding Byte
[byte[]]$sixCC = Get-Content -Encoding Byte -ReadCount 6 -TotalCount 6 -Path 'X:\TheFile.pdf'
if ([System.Text.Encoding]::ASCII.GetString($sixCC) -cmatch '%PDF') {
    Write-Host "This is a PDF file"
}
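Given the slow VPN link, it may be worth generalising the two checks above so that only a small, bounded prefix of each file is transferred. The sketch below searches the first `$MaxBytes` bytes for `%PDF`. The function name `Test-PdfHeader` and the default window of 1024 bytes are assumptions, not part of the code above; 1024 is the distance within which Acrobat viewers are commonly documented as tolerating a misplaced header, so adjust it as needed.

```powershell
function Test-PdfHeader {
    # Hypothetical helper: $true when '%PDF' occurs within the
    # first $MaxBytes bytes of the file.
    param (
        [Parameter(Mandatory = $true)]
        [string]$Path,
        [int]$MaxBytes = 1024   # assumed search window
    )

    $buffer = [byte[]]::new($MaxBytes)
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        # fill the buffer, or stop early at end of file
        $total = 0
        while ($total -lt $MaxBytes) {
            $read = $stream.Read($buffer, $total, $MaxBytes - $total)
            if ($read -eq 0) { break }
            $total += $read
        }
    }
    finally {
        $stream.Dispose()
    }

    # Codepage 28591 (Latin-1) maps every byte to exactly one char,
    # so the string search is byte-accurate.
    $text = [System.Text.Encoding]::GetEncoding(28591).GetString($buffer, 0, $total)

    # ordinal comparison: case-sensitive and culture-independent
    return $text.IndexOf('%PDF', [System.StringComparison]::Ordinal) -ge 0
}
```

With `-MaxBytes 6` this behaves like the six-byte test above. Larger windows trade a little extra transfer for more tolerance of leading junk, while still avoiding a full download of every file.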

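If you also want to treat files where "%PDF" is found anywhere in the file as "good", you need to read the whole file as a string, but with a one-to-one mapping between bytes and characters. For that, you can use the following helper function: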

function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )

    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    $Stream       = [System.IO.FileStream]::new($Path, 'Open', 'Read')
    $StreamReader = [System.IO.StreamReader]::new($Stream, $Encoding)
    $BinaryText   = $StreamReader.ReadToEnd()

    $StreamReader.Close()
    $Stream.Close()

    return $BinaryText
}
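The 1-to-1 claim in the comment is easy to check for yourself. The following sketch (variable names are just for illustration) pushes all 256 possible byte values through codepage 28591 and back, and contrasts that with UTF-8, which cannot round-trip bytes that do not form valid UTF-8 sequences.

```powershell
# every possible byte value, 0x00 through 0xFF
$allBytes = [byte[]](0..255)

# round trip through codepage 28591 (Latin-1)
$latin1    = [System.Text.Encoding]::GetEncoding(28591)
$asString  = $latin1.GetString($allBytes)
$roundTrip = $latin1.GetBytes($asString)

# one char per byte, and the bytes come back unchanged
$latin1Lossless = ($asString.Length -eq 256) -and
                  (($allBytes -join ',') -eq ($roundTrip -join ','))

# same round trip through UTF-8: invalid sequences are replaced on decoding
$utf8      = [System.Text.Encoding]::UTF8
$utf8Bytes = $utf8.GetBytes($utf8.GetString($allBytes))
$utf8Lossless = ($allBytes -join ',') -eq ($utf8Bytes -join ',')

"Latin-1 lossless: $latin1Lossless"
"UTF-8 lossless:   $utf8Lossless"
```

`$latin1Lossless` comes out as `True` and `$utf8Lossless` as `False`, which is why reading binary data with a text cmdlet's default encoding can corrupt a byte-level search.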

Next, you can use the function like this:

$binString = ConvertTo-BinaryString -Path 'X:\TheFile.pdf'
if ($binString.IndexOf("%PDF") -ge 0) {
    Write-Host "This is a PDF file"
}

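Putting it all together, and assuming you want to mark as good all files with a .PDF extension in which the magic number "%PDF" (case-sensitive) can be found anywhere in the file: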

function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )

    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    $Stream       = [System.IO.FileStream]::new($Path, 'Open', 'Read')
    $StreamReader = [System.IO.StreamReader]::new($Stream, $Encoding)
    $BinaryText   = $StreamReader.ReadToEnd()

    $StreamReader.Close()
    $Stream.Close()

    return $BinaryText
}

$badCounter  = 0
$goodCounter = 0
$logFile     = "RESULTS_{0:yyyyMMdd}.log" -f (Get-Date)

# get an array of pdf file FullNames
$files = @(Get-ChildItem -File -Filter '*.pdf').FullName
Write-Host "Processing $($files.Count) files... " -ForegroundColor Yellow
# loop through the array, test if '%PDF' is found and output strings for the log file
$result = foreach ($item in $files) {
    $pdfText = ConvertTo-BinaryString -Path $item
    if ($pdfText.IndexOf("%PDF") -ge 0) {
        $goodCounter++
        "Success - $item"
    }
    else {
        $badCounter++
        "Fail - $item"
    }
}

# write the output to the log file
$result | Set-Content -Path $logFile
"=" * 25 | Add-Content -Path $logFile
"BAD:   $badCounter"  | Add-Content -Path $logFile
"GOOD:  $goodCounter" | Add-Content -Path $logFile
"Total: $($files.Count)" | Add-Content -Path $logFile

Write-Host "DONE!" -ForegroundColor Green