In Swift, how can I find the byte length of a UTF8 string within a larger buffer

Question

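I have a Data object that contains encoded UTF8 strings along with other kinds of serialized values. The first version of this question assumed that UTF8 has a built-in string terminator, but it does not.

A chunk of UTF8 characters has the same problem as a chunk of ASCII bytes. The length of the string has to be handled either by storing the length explicitly or by using a terminator (such as NUL/0).

If you use a terminator, you must restrict the string content so that it never contains the terminator value. That makes your code unsuitable for encoding every legal Swift string, but that may be fine depending on the application.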
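To make the "store the length explicitly" option concrete, here is a minimal sketch. The helper names and the 4-byte big-endian length prefix are assumptions chosen for illustration; nothing in the question prescribes this layout.

```swift
import Foundation

// Hypothetical layout: a 4-byte big-endian byte count, then the UTF-8 bytes.
func appendLengthPrefixed(_ string: String, to buffer: inout Data) {
    let bytes = Array(string.utf8)
    let length = UInt32(bytes.count)
    // Write the length one byte at a time, most significant byte first.
    buffer.append(contentsOf: [
        UInt8(truncatingIfNeeded: length >> 24),
        UInt8(truncatingIfNeeded: length >> 16),
        UInt8(truncatingIfNeeded: length >> 8),
        UInt8(truncatingIfNeeded: length),
    ])
    buffer.append(contentsOf: bytes)
}

// `offset` counts bytes from `buffer.startIndex` and is advanced past the string.
// Returns nil if the buffer does not hold a complete length-prefixed string.
func readLengthPrefixed(from buffer: Data, at offset: inout Int) -> String? {
    let start = buffer.startIndex + offset
    guard buffer.endIndex - start >= 4 else { return nil }

    // Reassemble the big-endian length.
    var length = 0
    for i in 0..<4 {
        length = (length << 8) | Int(buffer[start + i])
    }

    let bodyStart = start + 4
    guard buffer.endIndex - bodyStart >= length else { return nil }

    let body = buffer[bodyStart..<(bodyStart + length)]
    offset += 4 + length
    return String(decoding: body, as: UTF8.self)
}
```

Because the length is stored separately, the string content is unrestricted: a string containing NUL bytes, or an empty string, round-trips without special handling.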

Answer 1

Try this solution:

str.utf8.count
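For a String you already hold, utf8.count is the number of bytes in its UTF-8 encoding, which can differ from the number of characters. A small sketch with illustrative sample strings:

```swift
let ascii = "Hello"
let accented = "h\u{E9}llo" // "héllo", with a precomposed é (two bytes in UTF-8)

// Character count versus UTF-8 byte count.
print(ascii.count, ascii.utf8.count)
print(accented.count, accented.utf8.count)
```

Note that this only measures a string that has already been decoded; on its own it does not tell you where a string ends inside a larger buffer.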

Answer 2

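A Swift String can contain NUL bytes (for example, "Hello\u{0000}world!" is a valid String), so if you assume that your strings are terminated by a NUL byte, neither of your two approaches is sufficient.

Instead, you will likely want to take the approach that @Larme posted as a comment: split the data first, then create strings from those slices.

If your separator is indeed a single NUL byte, this can be as simple as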

import Foundation

func decode(_ data: Data, separator: UInt8) -> [String] {
    data.split(separator: separator).map { String(decoding: $0, as: UTF8.self) }
}

let data = Data("Hello, world!\u{00}Following string.\u{00}And another one!".utf8)

print(decode(data, separator: 0x00))
// => ["Hello, world!", "Following string.", "And another one!"]

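The split(separator:) method here is Sequence.split(separator:maxSplits:omittingEmptySubsequences:), which takes a separator that is a single Sequence.Element; in this case, a single UInt8. omittingEmptySubsequences defaults to true, so:

If empty strings are valid input and you need to handle them, make sure to pass in false. Otherwise, if your separator is N NUL bytes in a row, this method will still work for you: you'll get N - 1 empty splits, all of which will be thrown away.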
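The effect of omittingEmptySubsequences can be seen in a small sketch. The sample buffer is made up for illustration: it contains two consecutive NUL bytes and a trailing one.

```swift
import Foundation

// Bytes: "a", NUL, NUL, "b", NUL
let sample = Data("a\u{00}\u{00}b\u{00}".utf8)

// Default behaviour: empty pieces are dropped.
let dropped = sample
    .split(separator: 0x00)
    .map { String(decoding: $0, as: UTF8.self) }

// Empty pieces are kept, including the one after the trailing separator.
let kept = sample
    .split(separator: 0x00, omittingEmptySubsequences: false)
    .map { String(decoding: $0, as: UTF8.self) }

print(dropped) // ["a", "b"]
print(kept)    // ["a", "", "b", ""]
```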

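Alternatively, if you don't want to eagerly split the entire buffer up front (for example, you may be looking for a sentinel value that tells you to stop processing), you can use Data.prefix(while:), looping over the buffer and grabbing prefixes terminated by the separator: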

import Foundation

func process(_ data: Data, separator: UInt8, using action: (String) -> Bool) {
    var slice = data[...]
    while !slice.isEmpty {
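        let prefix = slice.prefix(while: { $0 != separator })
        let substring = String(decoding: prefix, as: UTF8.self)
        if !action(substring) {
            break
        }

        // Skip the bytes just consumed plus the separator. Count the raw bytes
        // rather than `substring.utf8`, since decoding repairs invalid UTF-8
        // and can change the byte count.
        slice = slice.dropFirst(prefix.count + 1)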
    }
}

let data = Data("Hello, world!\u{00}Following string.\u{00}And another one!".utf8)
process(data, separator: 0x00) { string in
    print(string)
    return true // continue
}

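If your separator is more complex (for example, several different characters long), you can still use Data methods to find instances of the separator sequence and split around them yourself: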

import Foundation

func decode(_ data: Data, separator: String) -> [String] {
    // `firstRange(of:)` below takes a type conforming to `DataProtocol`.
    // `String.UTF8View` doesn't conform, but `Array` does. This copy should
    // be cheap if the separator is small.
    let separatorBytes = Array(separator.utf8)
    var strings = [String]()
    
    // Slicing the data will give cheap no-copy views into it.
    // This first slice is the full data blob.
    var slice = data[...]

    // As long as there's an instance of `separator` in the data...
    while let separatorRange = slice.firstRange(of: separatorBytes) {
        // ... pull out all of the bytes before it into a String...
        strings.append(String(decoding: slice[..<separatorRange.lowerBound], as: UTF8.self))

        // ... and skip past the separator to keep looking for more.
        slice = slice[separatorRange.upperBound...]
    }
    
    // If there are no separators in the data, or the last string is not
    // itself terminated with a separator, pull out the remaining contents.
    if !slice.isEmpty {
        strings.append(String(decoding: slice, as: UTF8.self))
    }
    
    return strings
}

let separator = "\u{00}\u{20}\u{00}"
let data = Data("Hello, world!\(separator)Following string.\(separator)And another one!".utf8)
print(decode(data, separator: separator))
// => ["Hello, world!", "Following string.", "And another one!"]

Answer 3

@Itai's answer is great and has a lot of extra detail, but I wanted to write a short summary.

This is the code I ended up with:

let buffer: Data = ...
var pos: Int = ... // must be `var`: it is advanced after each read
let separator = UInt8(0x00)
let s = buffer[pos...].split(
    separator: separator,
    maxSplits: 1,
    omittingEmptySubsequences: false
).map { data in
    String(decoding: data, as: UTF8.self)
}[0]
pos += s.utf8.count + 1

Note: if the buffer has a zero at the current position, it is important to pass omittingEmptySubsequences: false so that an empty string is returned.
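As a runnable sketch of the same idea, the snippet can be wrapped in a helper that advances an integer position. The function name, its signature, and the sample buffer are hypothetical, added only for illustration.

```swift
import Foundation

// Hypothetical helper: reads one separator-terminated string starting at `pos`
// and advances `pos` past the string and its terminator.
// Assumes `pos` is a valid index into `buffer` (or its endIndex).
func readTerminatedString(from buffer: Data, at pos: inout Int, separator: UInt8 = 0x00) -> String {
    // With omittingEmptySubsequences: false, split always returns at least
    // one piece, so indexing [0] is safe even if the first byte is a separator.
    let pieces = buffer[pos...].split(
        separator: separator,
        maxSplits: 1,
        omittingEmptySubsequences: false
    )
    let bytes = pieces[0]
    pos += bytes.count + 1
    return String(decoding: bytes, as: UTF8.self)
}

// Sample buffer with an empty string in the middle: "one", "", "three".
let packed = Data("one\u{00}\u{00}three\u{00}".utf8)
var position = packed.startIndex
var decoded = [String]()
while position < packed.endIndex {
    decoded.append(readTerminatedString(from: packed, at: &position))
}
print(decoded) // ["one", "", "three"]
```

The loop checks the position against endIndex before each read, because after the last string the position can land one past the end of the buffer.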

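Note: be careful with Data.suffix(from:). It takes an index, not an offset, and a Data slice keeps the indices of the Data it was sliced from, so the index is always relative to the start of the original backing store. For example: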

let data: Data = ...
let d1 = data.suffix(from: 10)
let d2 = d1.suffix(from: 10)
// d1 and d2 will have the same data.
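The behaviour can be checked with a small sketch. The 30-byte sample buffer is made up for illustration, with each byte equal to its own index.

```swift
import Foundation

// 30 bytes with values 0, 1, 2, ..., 29.
let original = Data((0..<30).map { UInt8($0) })

let s1 = original.suffix(from: 10)
let s2 = s1.suffix(from: 10)

// The slice keeps the indices of the original, so it starts at index 10
// and suffix(from: 10) on it is a no-op.
print(s1.startIndex) // 10
print(s1 == s2)      // true

// To skip 10 more bytes, use an offset-based method instead.
let s3 = s1.dropFirst(10)
print(s3.startIndex) // 20
```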

This is why I chose the approach of keeping an integer position variable.
