How to pick out the binary part from a multipart file with PowerShell?
I received a multipart file from a server and need to pick out the PDF part from it. I tried
removing the first x lines and the last 2 lines with:
$content=Get-Content $originalfile
$content[0..($content.length-3)] | Set-Content $outfile
but that corrupts the binary data. What is the right way to get the binary part out of the file? The file looks like this:
MIME-Version: 1.0
Content-Type: multipart/related; boundary=MIME_Boundary;
start="<6624867311297537120--4d6a31bb.16a77205e4d.3282>";
type="text/xml"
--MIME_Boundary
Content-ID: <6624867311297537120--4d6a31bb.16a77205e4d.3282>
Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 8bit
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Body xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"/>
--MIME_Boundary
Content-ID:
Content-Type: application/xml
Content-Disposition: form-data; name="metadata"
<?xml version="1.0" encoding="ISO-8859-1"?>
<metadata><contentLength>64288</contentLength><etag>7e3da21f7ed1b434def94f4b</etag><contentType>application/octet-stream</contentType><properties><property><key>Account</key><value>finance</value></property><property><key>Business Unit</key><value>EU DEBMfg</value></property><property><key>Document Type</key><value>PAYABLES</value></property><property><key>Filename</key><value>test-pdf.pdf</value></property></properties></metadata>
--MIME_Boundary
Content-ID:
Content-Type: application/octet-stream
Content-Disposition: form-data; name="content"
%PDF-1.6
%âãÏÓ
37 0 obj <</Linearized 1/L 20597/O 40/E 14115/N 1/T 19795/H [ 1005 215]>>
endobj
xref
37 34
0000000016 00000 n
0000001386 00000 n
0000001522 00000 n
0000001787 00000 n
0000002250 00000 n
.
.
.
0000062787 00000 n
0000063242 00000 n
trailer
<<
/Size 76
/Prev 116
/Root 74 0 R
/Encrypt 38 0 R
/Info 75 0 R
/ID [ <C21F21EA44C1E2ED2581435FA5A2DCCE> <3B7296EB948466CB53FB76CC134E3E76> ]
>>
startxref
63926
%%EOF
--MIME_Boundary-
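You need to read the file as a series of bytes and treat it as binary.
Next, to locate the PDF part of the file, you need to read it a second time as a String, so that you can run a regular expression on it.
That String must use an encoding that does not alter the bytes in any way. Code page 28591 (ISO 8859-1) does exactly this:
every byte of the original file maps as-is to one character.
To do this, I wrote the following helper function: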
function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )
    $Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'
    # Note: Codepage 28591 (ISO 8859-1) returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText = $StreamReader.ReadToEnd()
    $StreamReader.Close()
    $Stream.Close()
    return $BinaryText
}
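As a quick check of the 1-to-1 claim, the following sketch round-trips all 256 possible byte values through code page 28591. The variable names are illustrative only and are not part of the helper function; the UTF-8 lines are added purely for comparison.

```powershell
# All 256 possible byte values
$bytes  = [byte[]](0..255)
$latin1 = [Text.Encoding]::GetEncoding(28591)

# Decode to a string and encode back again
$text      = $latin1.GetString($bytes)
$roundTrip = $latin1.GetBytes($text)

# One character per byte, and every byte survives the round trip
$text.Length                                                                  # 256
[Convert]::ToBase64String($roundTrip) -eq [Convert]::ToBase64String($bytes)   # True

# For comparison: UTF-8 cannot represent lone bytes of 0x80 and above,
# so the same round trip does not give back the original bytes
$utf8  = [Text.Encoding]::UTF8
$lossy = $utf8.GetBytes($utf8.GetString($bytes))
[Convert]::ToBase64String($lossy) -eq [Convert]::ToBase64String($bytes)       # False
```

Because each character corresponds to exactly one byte, a character index in the string is also a byte offset in the file.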
Using the above function, you should be able to get the binary part out of the multipart file like this:
$inputFile  = 'D:\blah.txt'
$outputFile = 'D:\blah.pdf'
# read the file as byte array
$fileBytes = [System.IO.File]::ReadAllBytes($inputFile)
# and again as string where every byte has a 1-to-1 mapping to the file's original bytes
$binString = ConvertTo-BinaryString -Path $inputFile
# create your regex, all as ASCII byte characters: '%PDF[\x00-\xFF]*%%EOF[\r\n]{0,2}'
$regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
$match = $regex.Match($binString)
# stop here if the file contains no PDF part
if (-not $match.Success) { throw "No PDF data found in '$inputFile'" }
# use a MemoryStream object to store the result;
# the match index and length in the string are also byte offsets in $fileBytes
$stream = New-Object System.IO.MemoryStream
$stream.Write($fileBytes, $match.Index, $match.Length)
# save the binary data of the match as a series of bytes
[System.IO.File]::WriteAllBytes($outputFile, $stream.ToArray())
# clean up
$stream.Dispose()
Regex details:
( Match the regular expression below and capture its match into backreference number 1
\x25 Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
\x50 Match the ASCII or ANSI character with position 0x50 (80 decimal => P) in the character set
\x44 Match the ASCII or ANSI character with position 0x44 (68 decimal => D) in the character set
\x46 Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
[\x00-\xFF] Match a single character in the range between ASCII character 0x00 (0 decimal) and ASCII character 0xFF (255 decimal)
* Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\x25 Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
\x25 Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
\x45 Match the ASCII or ANSI character with position 0x45 (69 decimal => E) in the character set
\x4F Match the ASCII or ANSI character with position 0x4F (79 decimal => O) in the character set
\x46 Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
[\x0D\x0A] Match a single character present in the list below
ASCII character 0x0D (13 decimal)
ASCII character 0x0A (10 decimal)
{0,2} Between zero and 2 times, as many times as possible, giving back as needed (greedy)
)
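To try the pattern without a real file, here is a minimal sketch that runs the same regex over a small in-memory sample. The sample bytes and the variable names ($head, $pdf, $tail, $sample, $extracted) are made up for illustration; only the pattern comes from the code above.

```powershell
$latin1 = [Text.Encoding]::GetEncoding(28591)

# Hypothetical miniature multipart body: part headers, a fake "PDF", closing boundary
$head = $latin1.GetBytes("--MIME_Boundary`r`nContent-Type: application/octet-stream`r`n`r`n")
$pdf  = $latin1.GetBytes("%PDF-1.6`r`n%") + [byte[]](0xE2, 0xE3, 0xCF, 0xD3) + $latin1.GetBytes("`r`n%%EOF`r`n")
$tail = $latin1.GetBytes("--MIME_Boundary--`r`n")
[byte[]]$sample = $head + $pdf + $tail

# Same pattern as above: from the first %PDF up to the last %%EOF plus up to two line-break bytes
$regex = [Regex]'\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2}'
$match = $regex.Match($latin1.GetString($sample))

# Copy the matched range out of the original bytes
$extracted = New-Object byte[] $match.Length
[Array]::Copy($sample, $match.Index, $extracted, 0, $match.Length)

$match.Index -eq $head.Length       # True: the match starts right after the part headers
$extracted.Length -eq $pdf.Length   # True: exactly the PDF bytes were copied
```

Since `[\x00-\xFF]*` is greedy, the match runs to the last `%%EOF` in the input, which matters for PDFs that contain more than one `%%EOF` marker.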