Inconsistent output from gcount()

Question

I wrote the following simple MRE, which reproduces a bug in my program:

#include <iostream>
#include <utility>
#include <sstream>
#include <string_view>
#include <array>
#include <vector>
#include <iterator>

// this function is working fine only if string_view contains all the user provided chars and nothing extra like null bytes
std::pair< bool, std::vector< std::string > > tokenize( const std::string_view inputStr, const std::size_t expectedTokenCount )
{
    // unnecessary implementation details

    std::stringstream ss;
    ss << inputStr.data( ); // works for null-terminated strings, but not for the non-null terminated strings

    // unnecessary implementation details
}

int main( )
{
    constexpr std::size_t REQUIRED_TOKENS_COUNT { 3 };
    std::array<char, 50> input_buffer { };

    std::cin.getline( input_buffer.data( ), input_buffer.size( ) ); // stores at most 49 characters plus the terminating '\0'

    const auto [ hasExpectedTokenCount, foundTokens ] { tokenize( { input_buffer.data( ), input_buffer.size( ) }, REQUIRED_TOKENS_COUNT ) };

    for ( const auto& token : foundTokens ) // print the tokens
    {
        std::cout << '\'' << token << "' ";
    }

    std::cout << '\n';
}

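This is a tokenizing program (for the full code, see the Compiler Explorer link below). Also, I am using GCC v11.2.

First, I wanted to avoid using data(), since it is a bit inefficient.

I looked at the assembly in Compiler Explorer, and apparently streaming the result of data() calls strlen(), so it stops when it reaches the first null byte. But what if the string_view object is not null-terminated? That is a bit worrying. So I switched to ss << inputStr;.

Second, when I do ss << inputStr;, the entire 50-character buffer, null bytes included, is inserted into ss. Here are some incorrect sample outputs:

Sample #1: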

1                  2    3
'1' '2' '3                                     ' // '1' and '2' are correct, '3' has lots of null bytes

Sample #2 (in this one I typed a space character after the 3):

1                  2    3
'1' '2' '3' '                                     ' // an extra token consisting of 1 space char and lots of null bytes has been created!

Is there a way to fix this? What should I do so that non-null-terminated strings are supported as well? I came up with the idea of using gcount(), as follows:

    const std::streamsize charCount { std::cin.gcount( ) };
                                                                                        // here I pass charCount instead of the size of buffer
    const auto [ hasExpectedTokenCount, foundTokens ] { tokenize( { input_buffer.data( ), charCount },
                                                                    REQUIRED_TOKENS_COUNT ) };

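But the problem is that when the user enters fewer characters than the buffer size, gcount() returns a value one greater than the number of characters actually entered (e.g., the user enters 5 characters but gcount() returns 6, apparently also counting the '\0').

As a result, the last token also ends with a null byte: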

1   2     3
'1' '2' '3 ' // see the null byte in '3 ', it's NOT a space char

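How can I fix the inconsistent output of gcount()?

Or maybe I should change the tokenize function so that it strips any '\0' from the end of the string_view before tokenizing it.

This sounds like an XY problem, but I really need help deciding what to do.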

Answer 1

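The basic problem you are running into is the choice of operator<< overload. You have already tried two of them:

operator<<(ostream &, const char *) takes characters from the pointer up to (but not including) the next NUL. As you noted, this can be a problem if the pointer comes from a string_view without a terminating NUL.
operator<<(ostream &, const string_view &) takes all the characters in the string_view, including any NULs it may contain.

What you seem to want is to take characters from the string_view up to (but not including) the first NUL or the end of the string_view, whichever comes first. You can use find to construct a substring that ends at the NUL or at the end: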

ss << inputStr.substr(0, inputStr.find('\0'));
