How to confirm no whitespace or trailing data with scanf/sscanf?
sscanf() seems well suited to pulling matched data out of a string, for example:
sscanf ("abc,f,123,234", "%[a-z],%c,%d,%d", str, &chr, &i1, &i2)
But I need to assert that it did not encounter any whitespace:
sscanf ("abc, f , 123 , 234 ", "%[a-z],%c,%d,%d", str, &chr, &i1, &i2)
/* How to tell it to fail on whitespace?? */
I also need to assert that there is no trailing data:
sscanf ("abc,f,123,234__SOMERUBBISH", "%[a-z],%c,%d,%d", str, &chr, &i1, &i2)
/* How to detect trailing rubbish or make sscanf fail */
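How can I make sscanf parse a string more strictly?
This is a university assignment compiled as ANSI C, and I don't have the option of including a regular expression library.
Use getchar() to check whether the next character is a newline or a space.
From the sscanf manual page: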
[
Matches a nonempty sequence of characters from the specified set of
accepted characters; the next pointer must be a pointer to char, and
there must be enough room for all the characters in the string, plus a
terminating null byte. The usual skip of leading white space is
suppressed. The string is to be made up of characters in (or not in) a
particular set; the set is defined by the characters between the open
bracket [ character and a close bracket ] character. The set excludes
those characters if the first character after the open bracket is a
circumflex (^). To include a close bracket in the set, make it the
first character after the open bracket or the circumflex; any other
position will end the set. The hyphen character - is also special;
when placed between two other characters, it adds all intervening
characters to the set. To include a hyphen, make it the last character
before the final close bracket. For instance, [^]0-9-] means the set
"everything except close bracket, zero through nine, and hyphen". The
string ends with the appearance of a character not in the (or, with a
circumflex, in) set or when the field width runs out.
This should give you finer control over the input. (But it is basically a poor man's version of regular expressions.)
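As a small illustration of the scanset behaviour quoted above, here is a minimal sketch. The helper name match_lower and the 10-byte buffer are assumptions made for this example, not part of the question. It relies on two points from the manual page: %[ does not skip leading whitespace, and a field width bounds how much is stored. Strictly speaking, the meaning of a range such as a-z inside a scanset is implementation-defined, although common C libraries treat it as the manual page describes.

```c
#include <stdio.h>

/*
 * Hypothetical helper: matches a run of lowercase letters at the start
 * of s, storing at most 9 of them (plus the terminating null) in out.
 * Returns the number of characters matched, or 0 if none matched.
 */
int match_lower(const char *s, char out[10])
{
    int n = 0;

    /* %9[a-z] stops at the first character outside a-z or after 9 chars */
    if (sscanf(s, "%9[a-z]%n", out, &n) != 1)
        return 0;
    return n;
}
```

Because leading whitespace is not skipped, a field such as " abc" fails to match at all, which is exactly the strictness the question asks for on the first field.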
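In short, if whitespace must not be allowed, you cannot use the direct file I/O functions such as scanf() and its relatives. Each %d conversion allows an arbitrary amount of whitespace, including newlines, before the value. You have to use a string-based function such as sscanf() instead.
You would do best to read lines of data with fgets() or POSIX getline(), then use %n to establish where the conversions finished.
If you have not removed the newline that fgets() or getline() preserves, you can test that the first character after the last match (that is, the first unmatched character) in the input is a newline; otherwise, you can test for a null byte as the first unmatched character.
You still need to check that there is no whitespace before either number; you use %n again for each of them. Note that %n conversion specifications are not counted in the value returned by scanf() and its relatives.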
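The point about %n and the return value can be seen in isolation with a cut-down format. This is only a sketch: the function name two_ints_then_end and the two-integer record layout are assumptions chosen to keep the example short, and it deliberately does not yet check for whitespace before the numbers.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if line is "<int>,<int>" followed
 * immediately by a newline or the end of the string, else 0.
 */
int two_ints_then_end(const char *line, int *a, int *b)
{
    int end = 0;
    int rc = sscanf(line, "%d,%d%n", a, b, &end);

    /* Two conversions were assigned; %n is not counted in rc */
    if (rc != 2)
        return 0;
    /* end is the offset of the first character that was not consumed */
    return line[end] == '\n' || line[end] == '\0';
}
```

The offset is only read after the return value confirms that scanning reached the %n, since %n is left unassigned when an earlier conversion fails.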
ws.c
#include <stdio.h>

int main(void)
{
    /* Sentinel initial values make it obvious which variables were assigned */
    char str[10] = "QQQQQQQQQ";
    char chr = 'Z';
    int i1 = 77;
    int i2 = 88;
    /* Offsets recorded by %n; they keep stale values if the scan stops early */
    int n1;
    int n2;
    int n3;
    char *line = 0;
    size_t linelen = 0;
    int length;

    while ((length = getline(&line, &linelen, stdin)) != -1)
    {
        /* Print the line without its trailing newline */
        printf("Line: <<%.*s>>\n", length - 1, line);
        /* n1 and n2 mark where each number starts; n3 marks where scanning ended */
        int rc = sscanf(line, "%[a-z],%c,%n%d,%n%d%n",
                        str, &chr, &n1, &i1, &n2, &i2, &n3);
        const char *tag = "success";
        if (rc <= 0)
            tag = "total failure";
        else if (rc < 4)
            tag = "partial failure";
        else if (rc > 4)
            tag = "WTF?";   /* cannot happen: %n is not counted */
        printf("rc = %d: %s\n", rc, tag);
        printf("n1 = %d [%c], n2 = %d [%c], n3 = %d [%c]\n",
               n1, line[n1], n2, line[n2], n3, line[n3]);
        printf("<<%s>>,<<%c>>,%d,%d\n", str, chr, i1, i2);
    }
    return 0;
}
This lets you identify where the problems are.
data
Using ☐ to mark the end of each line, consider the data file (data):
abc,f,123,234☐
abc, f , 123 , 234 ☐
abc,f,123,234__SOMERUBBISH☐
xyz,f, 123, 234☐
xyz,f,123 ,234 ☐
Example run
The output from the program above is:
$ ./ws < data
Line: <<abc,f,123,234>>
rc = 4: success
n1 = 6 [1], n2 = 10 [2], n3 = 13 [
]
<<abc>>,<<f>>,123,234
Line: <<abc, f , 123 , 234 >>
rc = 2: partial failure
n1 = 6 [f], n2 = 10 [ ], n3 = 13 [3]
<<abc>>,<< >>,123,234
Line: <<abc,f,123,234__SOMERUBBISH>>
rc = 4: success
n1 = 6 [1], n2 = 10 [2], n3 = 13 [_]
<<abc>>,<<f>>,123,234
Line: <<xyz,f, 123, 234>>
rc = 4: success
n1 = 6 [ ], n2 = 11 [ ], n3 = 15 [
]
<<xyz>>,<<f>>,123,234
Line: <<xyz,f,123 ,234 >>
rc = 3: partial failure
n1 = 6 [1], n2 = 11 [2], n3 = 15 [
]
<<xyz>>,<<f>>,123,234
$
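Clearly, for the lines tagged 'partial failure', you cannot rely on the data after the last successful conversion. But where the conversions succeeded, you can see that inspecting line[n1] and so on reveals the problems.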
ws2.c
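This minor variation on the code gives a slightly extended analysis of the problems. Note that the analysis is not appropriate for partially or wholly unsuccessful scans. It would be best if the program simply reported a problem when the return value from sscanf() is not 4, and analyzed the values only when the scan succeeded. (The modification to do so is not complex.) The code also protects against buffer overflow from a long string in the first field.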
#include <ctype.h>
#include <stdio.h>

#undef isdecint
/* True if c can start a decimal integer: a digit or a sign */
static inline int isdecint(int c)
{
    return (isdigit(c) || c == '+' || c == '-');
}

int main(void)
{
    char str[10] = "QQQQQQQQQ";
    char chr = 'Z';
    int i1 = 77;
    int i2 = 88;
    /* Offsets recorded by %n; they keep stale values if the scan stops early */
    int n1;
    int n2;
    int n3;
    char *line = 0;
    size_t linelen = 0;
    int length;

    while ((length = getline(&line, &linelen, stdin)) != -1)
    {
        printf("Line: <<%.*s>>\n", length - 1, line);
        /* %9[a-z] limits the first field to 9 characters plus the null byte */
        int rc = sscanf(line, "%9[a-z],%c,%n%d,%n%d%n",
                        str, &chr, &n1, &i1, &n2, &i2, &n3);
        const char *tag = "success";
        if (rc <= 0)
            tag = "total failure";
        else if (rc < 4)
            tag = "partial failure";
        else if (rc > 4)
            tag = "WTF?";   /* cannot happen: %n is not counted */
        printf("rc = %d: %s\n", rc, tag);
        printf("n1 = %d [%c], n2 = %d [%c], n3 = %d [%c]\n",
               n1, line[n1], n2, line[n2], n3, line[n3]);
        /* Each number must start exactly where its %n offset points */
        if (!isdecint(line[n1]))
            printf("Invalid char for n1\n");
        if (!isdecint(line[n2]))
            printf("Invalid char for n2\n");
        /* Scanning must have stopped at the newline kept by getline() */
        if (line[n3] != '\n')
            printf("Invalid char for n3\n");
        printf("<<%s>>,<<%c>>,%d,%d\n", str, chr, i1, i2);
    }
    return 0;
}
data2
abc,f,123,234☐
abc, f , 345 , 456 ☐
abc,f,567,678__SOMERUBBISH☐
xyz,f, 1234, 2345☐
xyz,f,-3456 ,-4567 ☐
xyz,f,+5678,+6789☐
xyz,f,+ 5678,- 6789 X☐
Sample run
$ ./ws2 < data2
Line: <<abc,f,123,234>>
rc = 4: success
n1 = 6 [1], n2 = 10 [2], n3 = 13 [
]
<<abc>>,<<f>>,123,234
Line: <<abc, f , 345 , 456 >>
rc = 2: partial failure
n1 = 6 [f], n2 = 10 [ ], n3 = 13 [5]
Invalid char for n1
Invalid char for n2
Invalid char for n3
<<abc>>,<< >>,123,234
Line: <<abc,f,567,678__SOMERUBBISH>>
rc = 4: success
n1 = 6 [5], n2 = 10 [6], n3 = 13 [_]
Invalid char for n3
<<abc>>,<<f>>,567,678
Line: <<xyz,f, 1234, 2345>>
rc = 4: success
n1 = 6 [ ], n2 = 12 [ ], n3 = 17 [
]
Invalid char for n1
Invalid char for n2
<<xyz>>,<<f>>,1234,2345
Line: <<xyz,f,-3456 ,-4567 >>
rc = 3: partial failure
n1 = 6 [-], n2 = 12 [,], n3 = 17 [7]
Invalid char for n2
Invalid char for n3
<<xyz>>,<<f>>,-3456,2345
Line: <<xyz,f,+5678,+6789>>
rc = 4: success
n1 = 6 [+], n2 = 12 [+], n3 = 17 [
]
<<xyz>>,<<f>>,5678,6789
Line: <<xyz,f,+ 5678,- 6789 X>>
rc = 2: partial failure
n1 = 6 [+], n2 = 12 [,], n3 = 17 [8]
Invalid char for n2
Invalid char for n3
<<xyz>>,<<f>>,5678,6789
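The modification described for ws2.c, reporting a plain failure when sscanf() does not return 4 and inspecting the %n offsets only after a fully successful scan, could look roughly like the following. This is a sketch under assumptions: the function name check_record and the REC_* result codes are invented for the example, and the line is taken as an argument (for instance from fgets(), which is available in ANSI C) rather than read with getline().

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical result codes for this sketch */
enum {
    REC_OK = 0,
    REC_SCAN_FAILED,          /* fewer than 4 fields converted */
    REC_SPACE_BEFORE_NUMBER,  /* whitespace in front of a number */
    REC_TRAILING_DATA         /* something other than newline/end follows */
};

int check_record(const char *line)
{
    char str[10];
    char chr;
    int i1, i2;
    int n1 = 0, n2 = 0, n3 = 0;
    int rc = sscanf(line, "%9[a-z],%c,%n%d,%n%d%n",
                    str, &chr, &n1, &i1, &n2, &i2, &n3);

    /* The offsets are meaningful only if every conversion succeeded */
    if (rc != 4)
        return REC_SCAN_FAILED;
    /* %d skips leading whitespace, so look at where each number began */
    if (isspace((unsigned char)line[n1]) || isspace((unsigned char)line[n2]))
        return REC_SPACE_BEFORE_NUMBER;
    /* Accept a line with or without the newline kept by fgets() */
    if (line[n3] != '\n' && line[n3] != '\0')
        return REC_TRAILING_DATA;
    return REC_OK;
}
```

Note that %c accepts any character, including a space, so a stricter version might also test chr itself; that check is left out here to stay close to the format used above.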
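You can read the fields into strings first, check whether each string starts or ends with whitespace, and proceed on that basis.
Here is a sample program that demonstrates the detection logic.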
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Returns nonzero if str is non-empty and begins or ends with whitespace */
int startsOrEndsWithSpace(char str[])
{
    size_t len = strlen(str);

    if (len == 0)
        return 0;
    /* The last character is at len - 1; cast so isspace() gets a valid value */
    return (isspace((unsigned char)str[0]) ||
            isspace((unsigned char)str[len - 1]));
}

void testSccanf(char const source[],
                char str1[],
                char str2[],
                char str3[],
                char str4[])
{
    /* Widths of 49 match the 50-byte buffers declared in main() */
    int n = sscanf(source, "%49[^,],%49[^,],%49[^,],%49[^,]",
                   str1, str2, str3, str4);
    if ( n != 4 )
    {
        /* Problem */
        printf("Found only %d fields.\n", n);
    }
    /* Inspect only the fields that were actually assigned */
    if ( n >= 1 && startsOrEndsWithSpace(str1) )
    {
        printf("1st field is not good\n");
    }
    if ( n >= 2 && startsOrEndsWithSpace(str2) )
    {
        printf("2nd field is not good\n");
    }
    if ( n >= 3 && startsOrEndsWithSpace(str3) )
    {
        printf("3rd field is not good\n");
    }
    if ( n >= 4 && startsOrEndsWithSpace(str4) )
    {
        printf("4th field is not good\n");
    }
}

int main ()
{
    char str1[50];
    char str2[50];
    char str3[50];
    char str4[50];

    testSccanf("abc, f , 123 , 234 ", str1, str2, str3, str4);
    testSccanf("abc,f,123,234", str1, str2, str3, str4);
    return (0);
}
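Once a field has been captured as a string, the numeric fields still have to be converted. One possible way to do that strictly, offered here as an assumption rather than as part of the program above, is strtol() with an end pointer. The helper name parse_strict_int is invented for this sketch.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: converts field to an int only if the whole
 * string is a decimal integer with no surrounding whitespace.
 * Returns 1 and stores the value in *out on success, 0 otherwise.
 */
int parse_strict_int(const char *field, int *out)
{
    char *end;
    long value;

    /* strtol() would skip leading whitespace, so reject it up front */
    if (field[0] == '\0' || isspace((unsigned char)field[0]))
        return 0;
    errno = 0;
    value = strtol(field, &end, 10);
    /* No digits consumed, or something left over after the number */
    if (end == field || *end != '\0')
        return 0;
    /* Out of range for long, or for int */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return 0;
    *out = (int)value;
    return 1;
}
```

With this, a field such as "123 " or "12x" is rejected because the end pointer does not land on the terminating null byte.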
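Simple and easy.
Change the format to locate potential unwanted white-space. Use "%n" to record the scan position in the buffer. Place it before format specifiers (such as "%d", "%s", "%f") that consume optional leading white-space. Add a final "%n" to check for trailing garbage.
First check that 4 variables were scanned. Then check whether any unwanted data appeared.
Note: only "%[]", "%c", and "%n" do not consume optional leading white-space.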
int ws[3];
int cnt = sscanf (buf, "%[a-z],%c,%n%d,%n%d%n", str, &chr, &ws[0], &i1, &ws[1], &i2, &ws[2]);
if (cnt != 4 || isspace(buf[ws[0]]) || isspace(buf[ws[1]]) || buf[ws[2]]) {
Fail();
}