检查数组是否存在于一组数组中

Check if an array is present in a set of arrays

我有这样的定长数组:

char x[10] = "abcdefghij";   // could also contain non-printable char

如何在固定的10000个相同长度的数组集合中高效地测试这个数组是否存在?(注意:这些数组要么都是二进制数据数组,或者都是以null结尾的字符串,开头是固定的)

在 Python 中,我会使用 set/hashtable/dict 而不是列表(因为 O(1) 查找非常快):

s = "abcdefghij"
S = {"foo1234567", "bar1234567", ..., "baz9876543"}
print(s in S)  # True or False

如何在 C 中进行等效操作? (不是 C++)


注意:链接的问题 How to check if a string is in an array of strings in C? 是关于一种简单的方法,没有性能要求(它们遍历所有字符串并使用 strcmp)。这里有所不同,因为有 10k 个数组,需要使用另一种方法来提高性能(也许是哈希表?)。

我认为如果数组的数组已排序,则以下二进制搜索有效(感谢@WeatherVane),然后复杂度为 O(log n)。我们可以用memcmp来比较。

int binarysearch(unsigned char *T, unsigned char *t, int num, int size)  // https://en.wikipedia.org/wiki/Binary_search_algorithm#Algorithm
{
    int a = 0;
    int b = num-1;
    int m, res;
    while (a <= b) {
        m = (a + b) / 2;
        res = memcmp(&T[0] + m*size, &t[0], size);
        if (res < 0) {
            a = m + 1;
        } else if (res > 0) {
            b = m - 1;
        } else { 
            return 1;
        } 
    }
    return 0;
}
int main()
{
    unsigned char T[][4] = { 
        { 0x00, 0x6e, 0xab, 0xcd },
        { 0xea, 0x6e, 0xab, 0xcd },
        { 0xeb, 0x6e, 0xab, 0xcd },
        { 0xec, 0x6e, 0xab, 0xcd },
        { 0xed, 0x6e, 0xab, 0xcd },
        { 0xef, 0x6e, 0xab, 0xcd },
        { 0xfa, 0x6e, 0xab, 0xcd },
        { 0xfb, 0x6e, 0xab, 0xcd },
        { 0xff, 0x6e, 0xab, 0xcd }
    };
    unsigned char t1[4] = { 0x33, 0x6e, 0xab, 0xcd };
    unsigned char t2[4] = { 0xfb, 0x6e, 0xab, 0xcd };
    printf("%s\n", binarysearch(T[0], t1, 9, 4) ? "found" : "not found");
    printf("%s\n", binarysearch(T[0], t2, 9, 4) ? "found" : "not found");
}    

结果:

not found
found

bsearch函数的签名如下:

   void *bsearch(const void *key, const void *base,
                 size_t nmemb, size_t size,
                 int (*compar)(const void *, const void *));

在这里,key 将是 t1t2,而 base 将是 Tnmembsize 分别是 9 和 4。

compar 是指向进行比较的回调函数的指针。这可以只是 memcmp:

的包装
int compare_char4(const void *p1, const void *p2)
{
    return memcmp(p1, p2, 4);
}

然后使用这些参数调用 bsearch

printf("%s\n", bsearch(t1, T, 9, 4, compare_char4) ? "found" : "not found");
printf("%s\n", bsearch(t2, T, 9, 4, compare_char4) ? "found" : "not found");

关于在数组中存储 null-terminated 字符串和二进制数据的注意事项

在您的问题中,您指定 fixed-length 数组可以表示

  • null-terminated 字符串或
  • 二进制数据。

如果您的程序无法知道数组内容表示这两种可能性中的哪一种,那么,如果数组内容应该表示 null-terminated 字符串,您将不得不填充具有空字符的数组的其余部分(例如使用函数 strncpy)。只写一个空终止字符是不够的。

这很重要,因为如果不这样做,您的程序将无法知道在遇到值为 0 的字节后如何解释剩余的数据。它不知道是否

  • 将此字节解释为字符串的 null-terminating 字符并忽略字符串的所有剩余字节,或

  • 将此字节解释为二进制数据并将剩余字节也视为二进制数据(即不忽略剩余字节)。

但是,如果您总是用空字符填充字符串的终止空字符之后的所有剩余字节,那么您的程序就不必担心数组内容代表的是字符串还是二进制数据,因为它可以两者一视同仁。

因此,在我的回答的其余部分,我将假设如果数组内容表示一个字符串,那么字符串末尾之后的所有剩余字节都将填充值为 0 的字节。这样,我可以假设任何字节都不应被忽略。

散列table解法

In Python, I would use a set/hashtable/dict and not a list (because of very fast O(1) lookup):

也可以在 C 中使用 hash table,但您必须自己编程或使用现有的库。 C 标准库不提供任何散列函数。

下面的代码会随机生成10000个fixed-length(不是null-terminated)个长度为10的字符串数组,字符为azAZ09,但它还会将三个 hard-coded 单词放入数组的不同位置,以便您稍后搜索这些单词,在为了测试搜索功能。该程序然后将所有单词插入哈希 table,然后对哈希 table.

执行几次 hard-coded 查找
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdbool.h>

#define NUM_ARRAYS 10000
#define ARRAY_LENGTH 10
#define HASH_TABLE_SIZE 10000

struct hash_table_entry
{
    struct hash_table_entry *next;
    unsigned char data[ARRAY_LENGTH];
};

//function prototype declarations
void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );
size_t hash_array( const unsigned char array[ARRAY_LENGTH] );
void build_hash_table_from_data( unsigned char data[][ARRAY_LENGTH], size_t data_length, struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
void cleanup_hash_table( struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
bool lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
bool verbose_lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );

int main( void )
{
    //declare 2D array for the input data
    static unsigned char data[NUM_ARRAYS][ARRAY_LENGTH];

    //declare hash table and initialize all fields to zero
    static struct hash_table_entry *hash_table[HASH_TABLE_SIZE] = {NULL};

    //seed random number generator
    srand( (unsigned)time( NULL ) );

    //fill array "data" with random data
    printf( "Generating random data..." );
    fflush( stdout );
    fill_with_random_data( data );
    printf( "done\n" );
    fflush( stdout );

    //overwrite a few strings, so that we can search for
    //them later
    _Static_assert( NUM_ARRAYS > 500, "unexpected value" );
    _Static_assert( ARRAY_LENGTH == 10, "unexpected value" );
    memcpy( data[47], "abcdefghij", 10 );
    memcpy( data[218], "klmnopqrst", 10 );
    memcpy( data[419], "uvwxyz0123", 10 );

    //build hash table
    printf( "Building hash table..." );
    fflush( stdout );
    build_hash_table_from_data( data, NUM_ARRAYS, hash_table );
    printf( "done\n" );
    fflush( stdout );

    //perform lookups
    verbose_lookup( (unsigned char *)"uvwxyz0123", hash_table );
    verbose_lookup( (unsigned char *)"hsdkesldhg", hash_table );
    verbose_lookup( (unsigned char *)"abcdefghij", hash_table );
    verbose_lookup( (unsigned char *)"erlsodn3ls", hash_table );
    verbose_lookup( (unsigned char *)"klmnopqrst", hash_table );

    //cleanup
    printf( "Performing cleanup..." );
    fflush( stdout );
    cleanup_hash_table( hash_table );
    printf( "done\n" );
    fflush( stdout );
}

void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    static const unsigned char random_chars[62] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    for ( size_t i = 0; i < NUM_ARRAYS; i++ )
    {
        for ( size_t j = 0; j < ARRAY_LENGTH; j++ )
        {
            data[i][j] = random_chars[rand()%sizeof random_chars];
        }
    }
}

//This is a simple hash function. Depending on the type of data, a
//more complex hashing function may be required, in order to prevent
//hash collisions.
size_t hash_array( const unsigned char array[ARRAY_LENGTH] )
{
    size_t hash = 0;

    for ( size_t i = 0; i < ARRAY_LENGTH; i++ )
    {
        hash = hash * 7 + array[i];
    }

    return hash % HASH_TABLE_SIZE;
}

void build_hash_table_from_data( unsigned char data[][ARRAY_LENGTH], size_t data_length, struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    for ( size_t i = 0; i < data_length; i++ )
    {
        struct hash_table_entry **pp, *p;

        //hash the array and store the result
        size_t hash = hash_array( data[i] );

        //perform hash table lookup
        pp = &hash_table[hash];

        //make pp point to the last null pointer in the list
        while ( ( p = *pp ) != NULL )
            pp = &p->next;

        //allocate memory for linked list node
        p = malloc( sizeof *p );
        if ( p == NULL )
        {
            fprintf( stderr, "Memory allocation failure!\n" );
            exit( EXIT_FAILURE );
        }

        //fill in data for new node
        p->next = NULL;
        memcpy( p->data, data[i], ARRAY_LENGTH );

        //insert new node into linked list
        *pp = p;
    }
}

void cleanup_hash_table( struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    for ( size_t i = 0; i < HASH_TABLE_SIZE; i++ )
    {
        struct hash_table_entry *p;

        p = hash_table[i];

        while ( p != NULL )
        {
            struct hash_table_entry *q = p;

            p = p->next;

            free( q );
        }
    }
}

bool lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    size_t hash;
    struct hash_table_entry *p;

    //find hash
    hash = hash_array( array );

    //index into hash table
    p = hash_table[hash];

    //attempt to find array in linked list
    while ( p != NULL )
    {
        if ( memcmp( array, p->data, ARRAY_LENGTH ) == 0 )
        {
            //array was found
            return true;
        }

        p = p->next;
    }

    //array was not found
    return false;
}

//This function serves as a wrapper function for the
//function "lookup". It prints to "stdout" what it is
//looking for and what the result of the lookup was.
bool verbose_lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    //print message
    printf( "Searching for %.*s...", ARRAY_LENGTH, array );
    fflush( stdout );

    if ( lookup( array, hash_table ) )
    {
        printf( "found\n" );
        fflush( stdout );
        return true;
    }
    else
    {
        printf( "not found\n" );
        fflush( stdout );
        return false;
    }
}

这个程序有以下输出:

Generating random data...done
Building hash table...done
Searching for uvwxyz0123...found
Searching for hsdkesldhg...not found
Searching for abcdefghij...found
Searching for erlsodn3ls...not found
Searching for klmnopqrst...found
Performing cleanup...done

如您所见,它找到了所有显式插入的 3 个字符串,没有找到其他字符串。

在代码中,我使用unsigned char 而不是char,因为在C 中,char * 通常用于传递null-terminated 字符串。因此,使用 unsigned char * 似乎更适合传递可以是二进制或非二进制的数据 null-terminated.

bsearch解决方案

为了比较,这里有一个解决方案,它以相同的方式生成输入数组,但使用 bsearch 而不是散列 table 来搜索它:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdbool.h>

#define NUM_ARRAYS 10000
#define ARRAY_LENGTH 10

//function prototype declarations
void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );
int compare_func( const void *a, const void *b );
bool verbose_search( const unsigned char array[ARRAY_LENGTH], unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );

int main( void )
{
    //declare 2D array for the input data
    static unsigned char data[NUM_ARRAYS][ARRAY_LENGTH];

    //seed random number generator
    srand( (unsigned)time( NULL ) );

    //fill array "data" with random data
    printf( "Generating random data..." );
    fflush( stdout );
    fill_with_random_data( data );
    printf( "done\n" );
    fflush( stdout );

    //overwrite a few strings, so that we can search for
    //them later
    _Static_assert( NUM_ARRAYS > 500, "unexpected value" );
    _Static_assert( ARRAY_LENGTH == 10, "unexpected value" );
    memcpy( data[47], "abcdefghij", 10 );
    memcpy( data[218], "klmnopqrst", 10 );
    memcpy( data[419], "uvwxyz0123", 10 );

    //sort the array
    printf( "Sorting array..." );
    fflush( stdout );
    qsort( data, NUM_ARRAYS, ARRAY_LENGTH, compare_func );
    printf( "done\n" );
    fflush( stdout );

    //perform lookups
    verbose_search( (unsigned char *)"uvwxyz0123", data );
    verbose_search( (unsigned char *)"hsdkesldhg", data );
    verbose_search( (unsigned char *)"abcdefghij", data );
    verbose_search( (unsigned char *)"erlsodn3ls", data );
    verbose_search( (unsigned char *)"klmnopqrst", data );
}

void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    static const unsigned char random_chars[62] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    for ( size_t i = 0; i < NUM_ARRAYS; i++ )
    {
        for ( size_t j = 0; j < ARRAY_LENGTH; j++ )
        {
            data[i][j] = random_chars[rand()%sizeof random_chars];
        }
    }
}

int compare_func( const void *a, const void *b )
{
    return memcmp( a, b, ARRAY_LENGTH );
}

//This function serves as a wrapper function for the
//function "bsearch". It prints to "stdout" what it is
//looking for and what the result of the search was.
bool verbose_search( const unsigned char array[ARRAY_LENGTH], unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    //print message
    printf( "Searching for %.*s...", ARRAY_LENGTH, array );
    fflush( stdout );

    if ( bsearch( array, data, NUM_ARRAYS, ARRAY_LENGTH, compare_func ) != NULL )
    {
        printf( "found\n" );
        fflush( stdout );
        return true;
    }
    else
    {
        printf( "not found\n" );
        fflush( stdout );
        return false;
    }
}

这个程序有以下输出:

Generating random data...done
Sorting array...done
Searching for uvwxyz0123...found
Searching for hsdkesldhg...not found
Searching for abcdefghij...found
Searching for erlsodn3ls...not found
Searching for klmnopqrst...found

哈希 table 解决方案可能比二进制搜索解决方案更快,假设为输入选择了一个好的哈希函数,因此哈希冲突的数量最少。