How do I avoid printing " " in my tokenize function?

Question

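I am supposed to create a word-count program in Python that finds which words occur in a given text and how often each one occurs.

As part of the program, certain stop words should not be counted, and neither should whitespace or special characters (+-??:"; etc.).

The first part of the program is to create a tokenize function (I will test my function later; it should pass the following tests):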

if hasattr(wordfreq, "tokenize"):
    fun_count = fun_count + 1
    test(wordfreq.tokenize, [], [])
    test(wordfreq.tokenize, [""], [])
    test(wordfreq.tokenize, ["   "], [])
    test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
    test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
    test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
    test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
    test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
else:
    print("tokenize is not implemented yet!")

However, my function passes only 7 of the 8 tests.

The output after testing is:

Condition failed:
tokenize([' ']) == []
tokenize returned/printed:
['']
countWords is not implemented yet!
printTopMost is not implemented yet!
7 out of 8 passed.

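I suspect it has something to do with my else statement, specifically with how I use end = start or something similar.

Can anyone help me work out what I should change, and explain the difference between the correct solution and mine?

My code: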

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start
            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end
            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            else: 
                end = start
                end < len(line)
                end = end + 1
                words.append(line[start:end])
                start = end 
    return words

Answer 1

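Everything looks fine except the final else branch, where I think you missed an if condition. I also added a line.strip() call before any of the other logic.

The case ["   "], [] fails because, if the blank line is not stripped, the whitespace-skipping loop runs to the end of the line and the else branch then appends an empty slice. The result is [''], and the test fails because [] is not equal to [''].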
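To see where the empty string comes from, the failing case can be reproduced in isolation. This is a minimal sketch that copies only the whitespace-skipping step and the unconditional increment from the else branch of the question's code; the variable token is introduced here purely for illustration.

```python
line = "   "  # the whitespace-only input from the failing test

# Same whitespace-skipping step as in the question's tokenize.
start = 0
while start < len(line) and line[start].isspace():
    start = start + 1

# start is now equal to len(line): nothing is left to tokenize.
# The original else branch still advances end and appends the slice.
end = start + 1
token = line[start:end]  # slicing past the end of a string gives ''

print(repr(token))
```

Python does not raise an error for an out-of-range slice; it returns an empty string, which is why '' ends up in the result. With that in mind, here is the version that calls line.strip():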

def tokenize(lines):
    words = []
    for line in lines:
        line = line.strip()
        start = 0
        while start < len(line):

            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start

            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end

            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            else:
                end = start
                if end < len(line):
                    end = end + 1
                words.append(line[start:end])
                start = end

    return words

If you would rather not use line.strip(), another approach is to add an extra if condition before appending to words, as shown below:

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):

            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start

            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1

            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
            else:
                end = start
                if end < len(line):
                    end = end + 1

            if start != end:
                words.append(line[start:end].lower())

            start = end

    return words
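
For comparison, the same three token classes (runs of digits, runs of letters, and any other single non-whitespace character) can also be described with a regular expression. This is only a sketch of an alternative technique, not the approach used above. The names TOKEN_PATTERN and tokenize_re are made up for this example, and the pattern may classify some unusual Unicode characters differently from str.isdigit() and str.isalpha().

```python
import re

# Alternatives are tried left to right at each position:
#   \d+        a run of digits
#   [^\W\d_]+  a run of letters (word characters minus digits and "_")
#   \S         any other single non-whitespace character
TOKEN_PATTERN = re.compile(r"\d+|[^\W\d_]+|\S")


def tokenize_re(lines):
    """Hypothetical regex-based counterpart of tokenize."""
    words = []
    for line in lines:
        # findall returns every non-overlapping match, in order.
        for tok in TOKEN_PATTERN.findall(line):
            words.append(tok.lower())
    return words


print(tokenize_re(["   "]))
print(tokenize_re(["15th anniversary"]))
print(tokenize_re(["I told you!"]))
```

Because whitespace never matches any of the alternatives, a blank line simply yields no tokens, so no separate empty-string check is needed.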

Answer 2

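Thank you! That worked! This is my second week of programming, so I really appreciate it.

Now, suppose I want to do the word count and only increment the count for words that are in the text but not in another file called "stopWords". Is this approach right, or completely wrong?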

def countWords(words, stopWords):
counts = {}
for w in words:
    counts[w] = counts.get(w,0) + 1
    if words in eng_stopwords == True
    frequencies = counts in words and words not in frequencies == True

return counts

I am not quite sure about the 5th and 6th lines.
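
One way to read the intent of those two lines is: skip a word if it is a stop word, otherwise increment its count. A minimal sketch under that assumption follows. It assumes stopWords has already been read from the file into a list of lowercase strings, and it skips punctuation tokens only if they appear in that list. The local name stop_set is introduced here for illustration.

```python
def countWords(words, stopWords):
    # Assumption: stopWords is a list of strings already loaded from the file.
    # Converting it to a set makes each membership test fast.
    stop_set = set(stopWords)
    counts = {}
    for w in words:
        if w in stop_set:
            continue  # stop words are not counted
        counts[w] = counts.get(w, 0) + 1
    return counts


print(countWords(["the", "cat", "and", "the", "hat"], ["the", "and"]))
# prints {'cat': 1, 'hat': 1}
```

Note that the membership check is made on the single word w rather than on the whole list words, and that it happens before the count is incremented. An if statement already tests truthiness, so comparing with == True is not needed.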

python

tokenize

word-count