在 C 中查找 char* 的子字符串时,使用单链表的程序会出现段错误

Program that uses singly-linked list is segfaulting when finding substring of a char* in C

我正在编写一个简单的 Lisp 解释器,并且正在编写一个 S-exp 解析器。我决定逐步解析 S 表达式,即。通过迭代 字符串 S 表达式并使用字符线索(例如字符是否为 '(', ')'、'"'、''' 等)来确定当前标记的类型及其相关值。我正在解析使用单个 linked 列表将 S 表达式存入内存,也就是说,在迭代 S 表达式时,我的程序将获取令牌,将其解析为结构,然后 link 使用下一个标记。

这是我目前的代码:

#include <stdio.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#define MAX_SNODE_LENGTH 256

enum styp {
    /*
    Enum that declares all possible types
    that a token could be in my lisp S-expression.
    In order, opening paranthesis, closing paranthesis,
    string, integer, decimal (float), charachter, nil
    */
    oparn,
    cparn,
    strn,
    intg,
    dec,
    chac,
    nil,
};


struct snode_t {
    /*
    When a S-expression is parsed every token is
    parsed into this. All data is stored in a
    string, but when operations need to be completed on it
    they are converted based on their type (defined by enum 
    styp typ).
    Each node points to the next one. When a S-exp is parsed,
    an snode_t is returned that is the first item. THe programmer
    can then iterate over the list and use the data stored in "val"
    to interpret code, etc.
    */
    char val[MAX_SNODE_LENGTH];
    enum styp typ;
    struct snode_t* nxt;
};

struct snode_t* node() {
    /*
    Function returns the pointer to a struct snode_t
    which has had memory allocated for it. This function
    is just for convenience.
    */
    struct snode_t* tmp = malloc(sizeof(struct snode_t*));
    tmp->nxt = malloc(sizeof(struct snode_t*));
    return tmp;
}

char* sstring(char* str, int s, int e) {
    /*
    Takes a string, a starting index and an 
    end index, and allocates the memory for a substring,
    creates a substring and returns it. Saves programmer
    time fiddling with memory when getting a substring.
    */
    size_t n = e - s + 2;
    // n is the number of bytes needed to be allocated
    char* sub = malloc(sizeof(char) * n);
    memcpy(sub, &str[s], n-1);
    sub[n] = '[=10=]';
    return sub;
}

struct snode_t* psexp(char* str) {
        struct snode_t* head = node();
        // head is the first item in the linked list.
        // head is the struct that is returned once 
        // the function completes.
        
        struct snode_t* aft = node();
        // aft contains the current struct.
        // once the type of a token is determined,
        // aft is set to the "nxt" value of the current
        // struct. aft is then assigned the according values.

        int p = 1;
        // indicates the position in the string. Starts 
        // at 1 because the first index of a properly formatted
        // S-exp will always be a closing bracket.
        // Note that this doesn't consider trailing whitespace,
        // I'm only writing functions that do the bare minimum
        // at this point so that I can make it more robust 
        // and safe later.

        if(str[0] == '(') {
            head->val[0] = '(';
            head->val[1] = '[=10=]';
            head->typ = oparn;
            aft = head->nxt;
        }

        while(str[p] != '[=10=]') {
            if(str[p] == '(') {
                head->val[0] = '(';
                head->val[1] = '[=10=]';
                aft->typ = oparn;
            } else if(str[p] == ')') {
                head->val[0] = ')';
                head->val[1] = '[=10=]';
                aft->typ = cparn;
            } else if(str[p] == '"') {
                // if the current charachter is a
                // ", all the following charachters
                // until the next " comprise a string
                // this next set of lines finds the 
                // position of the next " after the current one,
                // and uses that number to produce a substring
                // This substring is then stored in the current
                // struct.

                int s = p;
                int e;
                p++;
                
                while(str[p] != '"') {
                    p++;    
                }

                e = p++;
                
                char* tocp = sstring(str, s, e);
                strcpy(aft->val, tocp);
                aft->typ = strn;
            }

            aft->nxt = node();
            aft = aft->nxt;
            // the current struct is shifted to the next one.
        }
        return head;
}

int main() {
    psexp("(\"hello world!\")");
    return 0;
}

当我编译它时,我没有收到任何警告或错误,但是,当 运行 它时,我得到了一个段错误。我已经测试了 node 函数和 sstring 函数,所以我有理由相信 psexp 函数存在内存故障。

我不确定我分配内存的方式有什么问题,在 link 将一个结构转换为另一个结构之前,我确保分配了内存。是什么导致了我的段错误?

注意:目前 psexp 只关心括号和字符串,但是一旦我让字符串工作(我认为是导致段错误的区域),我打算添加整个 Enum 的数据类型.

这是一个错误:

struct snode_t* tmp = malloc(sizeof(struct snode_t*));
                                                  ^
                                                ups... pointer

所以当你真的想要一个“struct snode_t”时,你为“指向struct snode_t的指针”分配了内存。

更正为

struct snode_t* tmp = malloc(sizeof(struct snode_t));

或更好

struct snode_t* tmp = malloc(sizeof *tmp);

此外,这很奇怪:

tmp->nxt = malloc(sizeof(struct snode_t*));

创建新节点时你通常做的是:

tmp->nxt = NULL;

在链表中,你永远不会 malloc 指向下一个指针的任何东西。

node 中,您正在对节点 指针 的大小执行 malloc 而不是 [=15= 的完整大小].

修复后,程序无限循环

预分配 nxt 不好。 node 应该只分配一个新节点 而不是 两个。

sstring 中执行 malloc 会泄漏内存。最好将目标缓冲区作为参数传递。

对于 '('')' 的情况,p 从未增加,所以如果我们命中一个,我们将永远循环

链表代码需要修改。

这是重构后的版本:

#include <stdio.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>
#define MAX_SNODE_LENGTH 256

enum styp {
    /*
       Enum that declares all possible types that a token could be in my
        lisp S-expression. In order, opening paranthesis, closing paranthesis,
        string, integer, decimal (float), charachter, nil */
    oparn,
    cparn,
    strn,
    intg,
    dec,
    chac,
    nil,
};

struct snode_t {
    /*
       When a S-expression is parsed every token is parsed into this. All
        data is stored in a string, but when operations need to be completed
        on it they are converted based on their type (defined by enum styp
        typ). Each node points to the next one. When a S-exp is parsed, an
        snode_t is returned that is the first item. THe programmer can then
        iterate over the list and use the data stored in "val" to interpret
        code, etc. */
    char val[MAX_SNODE_LENGTH];
    enum styp typ;
    struct snode_t *nxt;
};

struct snode_t *
node()
{
    /*
       Function returns the pointer to a struct snode_t which has had memory
        allocated for it. This function is just for convenience. */
#if 0
    struct snode_t *tmp = malloc(sizeof(struct snode_t *));
    tmp->nxt = malloc(sizeof(struct snode_t *));
#endif

#if 0
    struct snode_t *tmp = malloc(sizeof(struct snode_t));
    tmp->nxt = malloc(sizeof(struct snode_t));
#endif

#if 1
    struct snode_t *tmp = calloc(1,sizeof(*tmp));
#endif

    return tmp;
}

char *
sstring(char *str, int s, int e, char *sub)
{
    /*
       Takes a string, a starting index and an end index, and allocates the
        memory for a substring, creates a substring and returns it. Saves
        programmer time fiddling with memory when getting a substring. */
    size_t n = e - s + 2;

    // n is the number of bytes needed to be allocated
#if 0
    char *sub = malloc(sizeof(char) * n);
#endif

    memcpy(sub, &str[s], n - 1);
    sub[n] = 0;
    return sub;
}

// append -- allocate a new node and append to tail of list
struct snode_t *
append(struct snode_t **head)
{
    struct snode_t *cur;
    struct snode_t *prev;

    // find the tail of the list
    prev = NULL;
    for (cur = *head;  cur != NULL;  cur = cur->nxt)
        prev = cur;

    cur = node();

    if (prev != NULL)
        prev->nxt = cur;
    else
        *head = cur;

    return cur;
}

struct snode_t *
psexp(char *str)
{
    struct snode_t *head = NULL;

    // head is the first item in the linked list.
    // head is the struct that is returned once
    // the function completes.

    struct snode_t *aft = NULL;

    // aft contains the current struct.
    // once the type of a token is determined,
    // aft is set to the "nxt" value of the current
    // struct. aft is then assigned the according values.

    int p = 1;

    int s;
    int e;

    // indicates the position in the string. Starts
    // at 1 because the first index of a properly formatted
    // S-exp will always be a closing bracket.
    // Note that this doesn't consider trailing whitespace,
    // I'm only writing functions that do the bare minimum
    // at this point so that I can make it more robust
    // and safe later.

    while (str[p] != 0) {
        aft = append(&head);

        switch (str[p]) {
        case '(':
            aft->val[0] = '(';
            aft->val[1] = 0;
            aft->typ = oparn;
            p++;
            break;

        case ')':
            aft->val[0] = ')';
            aft->val[1] = 0;
            aft->typ = cparn;
            p++;
            break;

        case '"':
            // if the current charachter is a
            // ", all the following charachters
            // until the next " comprise a string
            // this next set of lines finds the
            // position of the next " after the current one,
            // and uses that number to produce a substring
            // This substring is then stored in the current
            // struct.

            s = p;

            p++;
            while (str[p] != '"') {
                p++;
            }

            e = p++;

            sstring(str, s, e, aft->val);
            aft->typ = strn;
        }

#if 0
        aft->nxt = node();
        aft = aft->nxt;
#endif
        // the current struct is shifted to the next one.
    }

    return head;
}

int
main()
{
    psexp("(\"hello world!\")");
    return 0;
}

更新:

Just as a side note - what does the #if 0 within some of the functions actually do? I've only ever really seen preprocessor directives used to make code more robust or customisable, ie. to check whether a header is defined and if not define it, or for user customisation before compilation. – Bithov Vin

我使用 #if 0 作为一种通过将预处理器 (cpp) 获取到 include/elide 代码来“注释掉”代码的方法。它比将代码包装在 /**/ 对中更干净。我也用#if 1来标记我写的“experimental/new”代码。

这里我用它[作为教学工具]来展示old/original代码与new/improved代码:

#if 0
// old code
#else
// new code
#endif

或:

#if 1
// new code
#endif

对于我自己的代码,当我确定代码正确时,我将删除 #if 0 并保留 #else 代码 [删除 #if/else/endif 行]留下 clean/simple/correct 代码。

所以,当你理解了差异后,你也可以去掉指令,只留下正确的代码。

请注意,我会使用 psexp 中的 #if 0 技巧来展示如何修复它,但是更改是如此广泛以至于使用预处理器指令会使事情变得如此 more难懂。