How do I avoid printing " " in my tokenize function?

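I'm supposed to create a word-count program in Python that finds which words occur in a given text and how often each one occurs.

As part of the program, certain stop words should not be counted, and neither should whitespace or special characters (+-??:"; etc.).

The first part of the program is to write a tokenize function (I'll test it later, and it should pass the following tests):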

if hasattr(wordfreq, "tokenize"):
    fun_count = fun_count + 1
    test(wordfreq.tokenize, [], [])
    test(wordfreq.tokenize, [""], [])
    test(wordfreq.tokenize, ["   "], [])
    test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
    test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
    test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
    test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
    test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
else:
    print("tokenize is not implemented yet!")

However, my function passes only 7 of the 8 tests.

The output from the test run is:

Condition failed:
tokenize([' ']) == []
tokenize returned/printed:
['']
countWords is not implemented yet!
printTopMost is not implemented yet!
7 out of 8 passed.

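I suspect it has something to do with my else branch, perhaps with how I use end = start or something similar.

Can anyone tell me what I should change, and explain the difference between the correct solution and mine?

My code: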

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start
            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end
            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            else: 
                end = start
                end < len(line)
                end = end + 1
                words.append(line[start:end])
                start = end 
    return words

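Everything looks fine except the last branch, where I think you missed an if: the line end < len(line) on its own is just an expression whose result is discarded, so end = end + 1 always runs. I also added a line.strip() before any of the other logic.

The test tokenize(["   "]) == [] fails because, when the whitespace-only line is not stripped, the inner loop skips the spaces until start equals len(line), neither the if nor the elif matches, and the else branch still advances end and appends line[start:end]. That slice is the empty string, so the result is [''], and [] is not equal to [''].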
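The failing case can be traced by hand. The snippet below is only an illustration of the steps the original loop takes for a whitespace-only line, not part of the fix, and the variable `token` is a name introduced here for the demonstration.

```python
line = "   "
start = 0

# Same whitespace-skipping loop as in tokenize().
while start < len(line) and line[start].isspace():
    start += 1

# start is now 3, which equals len(line): nothing is left to tokenize.
end = start + 1          # what the original else branch does unconditionally
token = line[start:end]  # slicing past the end yields '' instead of raising IndexError

print(repr(token))  # ''
```

Because slicing never raises for out-of-range indices, the bug shows up as an extra empty token rather than as an exception.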

def tokenize(lines):
    words = []
    for line in lines:
        line = line.strip()  # drop leading/trailing whitespace so a blank line becomes ""
        start = 0
        while start < len(line):

            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start

            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end

            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            else:
                end = start
                if end < len(line):
                    end = end + 1
                words.append(line[start:end])
                start = end

    return words

If you don't want to use line.strip(), another way to do it is to add an extra if condition before appending to words, like this:

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):

            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start

            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1

            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
            else:
                end = start
                if end < len(line):
                    end = end + 1

            # Skip the empty slice left over when only whitespace remained.
            if start != end:
                words.append(line[start:end].lower())

            start = end

    return words
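For comparison, the same rules (a run of digits, a run of letters, or any other single non-whitespace character) can also be written as a regular expression. This is a different technique from the loop above and may not be what the assignment expects. `tokenize_re` and `TOKEN_RE` are hypothetical names, and the pattern is only checked here against the test inputs from the question; it may disagree with `isdigit()`/`isalpha()` on unusual Unicode characters.

```python
import re

# A run of digits, or a run of letters, or any single non-whitespace character.
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\S")


def tokenize_re(lines):
    """Hypothetical regex-based counterpart to tokenize()."""
    words = []
    for line in lines:
        # findall() returns no matches for an empty or whitespace-only line,
        # so no empty strings can end up in the result.
        words.extend(token.lower() for token in TOKEN_RE.findall(line))
    return words


print(tokenize_re(["   "]))               # []
print(tokenize_re(["15th anniversary"]))  # ['15', 'th', 'anniversary']
```

Since whitespace is never matched by any alternative, the empty-token problem cannot occur in this version.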

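Thanks! It worked! This is my second week of programming, so I really appreciate it.

Now, suppose I want to count the words, incrementing the count for each word in the text but not for the words in another file called "stopWords". Is this approach right, or completely wrong?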

def countWords(words, stopWords):
    counts = {}
    for w in words:
        counts[w] = counts.get(w,0) + 1
        if words in eng_stopwords == True
        frequencies = counts in words and words not in frequencies == True

    return counts

I'm not sure about the 5th and 6th lines.
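A few things stand out in that attempt: the check uses the whole list `words` rather than the current word `w`, it refers to `eng_stopwords` although the parameter is named `stopWords`, the `if` has no colon, and the count is incremented before the stop-word check runs. One possible shape for the function is sketched below. It assumes that `words` is the list returned by tokenize and that `stopWords` is a list or set of lowercase words already read from the file; punctuation is only skipped if it appears in `stopWords`.

```python
def countWords(words, stopWords):
    """Count each word in `words`, ignoring anything listed in `stopWords`."""
    stop = set(stopWords)  # assumption: stop words are plain lowercase strings
    counts = {}
    for w in words:
        if w in stop:
            continue  # check the current word, and do it before counting
        counts[w] = counts.get(w, 0) + 1
    return counts


print(countWords(["the", "cat", "sat", "on", "the", "mat", "cat"], ["the", "on"]))
# {'cat': 2, 'sat': 1, 'mat': 1}
```

Converting `stopWords` to a set is optional; it just makes each membership test cheap when the stop-word list is long.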