Figure out bytes content

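I am working with a compound file that contains multiple streams, and I am frustrated trying to figure out the content of each stream. I don't know whether these bytes are text, MP3, or video. For example, is there a way to tell what type of data these bytes might be?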

b'\x00\x00\x00\x00\x00\x00\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x0bz\xcc\xc9\xc8\xc0\xc0\x00\xc2?\x82\x1e<\x0ec\xbc*8\x19\xc8i\xb3W_\x0b\x14bH\x00\xb2-\x99\x18\x18\xfe\x03\x01\x88\xcf\xc0\x01\xc4\xe1\x0c\xf9\x0cE\x0c\xd9\x0c\xc5\x0c\xa9\x0c%\x0c\x86`\xcd \x0c\x020\x1a\x00\x00\x00\xff\xff\x02\x080\x00\x96L~\x89W\x00\x00\x00\x00\x80(\B\xefI;\x9e}p\xfe\x1a\xb2\x9b>(\x81\x86/=\xc9xH0:Pwb\xb7\xdck-\xd2F\x04\xd7co'

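Yes, there is a way to figure out the content of each stream. The file extension is unreliable, because it can be removed or wrongly added, but most file formats also carry a signature inside the file itself.

So what is a signature?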

In computing, a file signature is data used to identify or verify the contents of a file. In particular, it may refer to:

  • File magic number: bytes within a file used to identify the format of the file; generally a short sequence of bytes (most are 2-4 bytes long) placed at the beginning of the file; see list of file signatures

  • File checksum or more generally the result of a hash function over the file contents: data used to verify the integrity of the file contents, generally against transmission errors or malicious attacks. The signature can be included at the end of the file or in a separate file.

I use the term "magic number" below. Its definition, copied from Wikipedia:

In computer programming, the term magic number has multiple meanings. It could refer to one or more of the following:

  • Unique values with unexplained meaning or multiple occurrences which could (preferably) be replaced with named constants
  • A constant numerical or text value used to identify a file format or protocol; for files, see List of file signatures
  • Distinctive unique values that are unlikely to be mistaken for other meanings (e.g., Globally Unique Identifiers)

In the second sense, a magic number is a specific byte sequence, such as

PNG (89 50 4E 47 0D 0A 1A 0A) 

BMP (42 4D)
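
To make the idea concrete, here is a minimal Python sketch that checks whether a block of bytes begins with one of the two signatures listed above. The table `SIGNATURES` and the function name `identify_by_prefix` are hypothetical names chosen for illustration, and the table is deliberately tiny rather than exhaustive.

```python
# Signatures taken from the two examples above; illustrative, not exhaustive.
SIGNATURES = {
    "PNG": bytes.fromhex("89504E470D0A1A0A"),
    "BMP": bytes.fromhex("424D"),
}


def identify_by_prefix(data):
    """Return the name of the first format whose signature starts `data`, else None."""
    for name, signature in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


print(identify_by_prefix(b"\x89PNG\r\n\x1a\n" + b"\x00" * 4))  # PNG
print(identify_by_prefix(b"BM" + b"\x00" * 4))                 # BMP
print(identify_by_prefix(b"plain text"))                       # None
```

A prefix check like this only works when the data really starts at the beginning of the file; a stream with leading padding needs a search instead.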

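So how do you find the magic number of a file?

In the article "Investigating File Signatures Using PowerShell", the author wrote a great PowerShell function for getting the magic number. He also mentions a tool; I copied this from his article: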

PowerShell V5 brings in Format-Hex, which can provide an alternative approach to reading the file and displaying the hex and ASCII value to determine the magic number.

From the Format-Hex help, I am copying this description:

The Format-Hex cmdlet displays a file or other input as hexadecimal values. To determine the offset of a character from the output, add the number at the leftmost of the row to the number at the top of the column for that character.

This cmdlet can help you determine the file type of a corrupted file or a file which may not have a file name extension. Run this cmdlet, and then inspect the results for file information.

This tool is also well suited to getting the magic number of a file.
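
For readers who are not on PowerShell, a rough Python analogue of that kind of hex view might look like the sketch below. It is not Format-Hex and does not reproduce its exact layout; `hex_dump` is a hypothetical helper that prints an offset column, the bytes in hex, and the printable ASCII characters.

```python
def hex_dump(data, width=16):
    """Return a hex/ASCII dump of `data`, with `width` bytes per row."""
    pad = width * 3 - 1  # width of the hex column when a row is full
    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in row)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{offset:08X}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


# The first 16 bytes of a typical PNG file: the signature, then the IHDR chunk header.
print(hex_dump(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```

The magic number is then read from the hex column of the first row.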

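Another tool is an online hex editor, but at first I did not understand how to use it.

Now we have the magic number, but how do we know what type of data that file or stream holds? That is the real question. Fortunately, there are many databases of magic numbers. Let me list a few: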

  1. File Signatures
  2. FILE SIGNATURES TABLE
  3. List of file signatures

For example, the first database has a search function. Just enter the magic number without spaces and search.
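
If the bytes are already loaded in Python, the search string can be produced directly. This is a small sketch; `magic_hex` is a hypothetical helper, and the default length of 8 bytes is an arbitrary choice, since signatures vary in length.

```python
def magic_hex(data, length=8):
    """Return the first `length` bytes of `data` as uppercase hex without spaces."""
    return data[:length].hex().upper()


print(magic_hex(b"\x89PNG\r\n\x1a\n rest of file"))  # 89504E470D0A1A0A
print(magic_hex(b"BM\x00\x00", length=2))            # 424D
```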

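After that, you may find a match. Yes, may: quite often you will not find the file type in question directly.

I ran into this problem and solved it by testing the stream against the signature of a specific type. For example, this is how I searched the stream for a PNG: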

# Target magic number for PNG: 89 50 4E 47 0D 0A 1A 0A
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

def GetPngStartingOffset(arr):
    """Return the offset of the first PNG signature in arr, or 0 if there is none."""
    # bytes() accepts bytes, a bytearray, or a list of ints in range(256),
    # so the function works for both raw stream data and list(data).
    offset = bytes(arr).find(PNG_SIGNATURE)

    # find() returns -1 when the signature is absent. 0 is returned in that
    # case, so a caller cannot tell "not found" apart from "found at offset 0".
    return max(offset, 0)

Once this function has found the magic number, I slice the stream at that offset and open the PNG:

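    import io
    from PIL import Image  # provided by the Pillow package

    # stream is one of the streams opened from the compound file
    data = stream.read()
    # Drop everything before the PNG signature
    pngBytes = data[GetPngStartingOffset(data):]
    image = Image.open(io.BytesIO(pngBytes))
    image.show()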
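
The same idea can be extended to test several signatures in one pass. The sketch below is only an illustration: `KNOWN_SIGNATURES` and `find_signatures` are hypothetical names, and the gzip signature (1F 8B) is an entry added here as an assumption about what might be worth testing, not something the steps above covered. The sample is the first 16 bytes of the data shown in the question.

```python
# Illustrative table. PNG comes from the answer above; gzip (1F 8B) is an
# extra entry added for this sketch.
KNOWN_SIGNATURES = {
    "PNG": bytes.fromhex("89504E470D0A1A0A"),
    "gzip": bytes.fromhex("1F8B"),
}


def find_signatures(data):
    """Return sorted (offset, name) pairs for every known signature found in `data`."""
    hits = []
    for name, signature in KNOWN_SIGNATURES.items():
        start = data.find(signature)
        while start != -1:
            hits.append((start, name))
            start = data.find(signature, start + 1)
    return sorted(hits)


# First 16 bytes of the sample from the question.
sample = b"\x00\x00\x00\x00\x00\x00\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00"
print(find_signatures(sample))  # [(8, 'gzip')]
```

Short signatures such as two-byte ones can match by chance inside unrelated data, so a hit is a hint to verify (for example by trying to decode from that offset), not proof of the type.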

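Finally, this is not an end-to-end solution, just one way to figure out the content of a stream. Thanks for reading, and thanks to @Robert Columbia for his patience.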