How do I avoid printing " " in my tokenize function?
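I'm supposed to write a word-count program in Python that checks which words occur in a given text and how often each of them occurs.
As part of the program, certain stop words should not be counted, and neither should spaces or special characters (+-??:"; etc.).
The first part of the program is to write a tokenize function (I'll test my function later; it should pass the following tests):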
if hasattr(wordfreq, "tokenize"):
    fun_count = fun_count + 1
    test(wordfreq.tokenize, [], [])
    test(wordfreq.tokenize, [""], [])
    test(wordfreq.tokenize, [" "], [])
    test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
    test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
    test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
    test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
    test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
else:
    print("tokenize is not implemented yet!")
However, my function passes only 7 of the 8 tests.
The output after testing is:
Condition failed:
tokenize([' ']) == []
tokenize returned/printed:
['']
countWords is not implemented yet!
printTopMost is not implemented yet!
7 out of 8 passed.
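I suspect it has something to do with my else branch, and with how I use end = start or something similar.
Can anyone help me work out what I should change, and explain the difference between the correct solution and mine?
My code: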
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start
            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end
            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            else:
                end = start
                end < len(line)
                end = end + 1
                words.append(line[start:end])
                start = end
    return words
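Everything looks fine except the last branch, where I think you left out an if condition: end < len(line) on its own is just an expression that is evaluated and thrown away. I also added a line.strip() at the start, before any of the other logic.
The [" "], [] case fails because, without stripping the blank line, the whitespace loop moves start to the end of the line, the else branch still runs, and the empty slice gets appended. The final result is [''], and the test fails because [] is not equal to [''].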
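A small trace of the original else branch on the input " " shows where the empty string comes from. This is only a sketch that mirrors the relevant steps, not the full function:

```python
line = " "
start = 0

# Skip leading whitespace, exactly as the inner while loop does.
while start < len(line) and line[start].isspace():
    start = start + 1

# start now equals len(line), so neither the isdigit nor the isalpha
# branch can run and control falls through to the else branch.
end = start + 1

# Slicing past the end of a string does not raise; it yields ''.
token = line[start:end]
print(start, len(line), repr(token))
```

Stripping the line first means start can never reach the end of the line with nothing left to tokenize, which is the approach the version below takes.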
def tokenize(lines):
    words = []
    for line in lines:
        # Drop leading/trailing whitespace so a blank line becomes ''
        line = line.strip()
        start = 0
        while start < len(line):
            # Skip whitespace between tokens
            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start
            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end
            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            else:
                end = start
                # Only take a single special character if one is left
                if end < len(line):
                    end = end + 1
                    words.append(line[start:end])
                start = end
    return words
If you don't want to use line.strip(), another way to do it is to add an extra if condition before appending to words, like this:
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while start < len(line) and line[start].isspace():
                start = start + 1
            end = start
            if end < len(line) and line[end].isdigit():
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
            elif end < len(line) and line[end].isalpha():
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
            else:
                end = start
                if end < len(line):
                    end = end + 1
            if start != end:
                words.append(line[start:end].lower())
            start = end
    return words
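For comparison, the same splitting rule can probably be expressed more compactly with a regular expression. This is a different technique from the index-based loop above, not something proposed in the thread. The name tokenize_re and the pattern are assumptions, and the pattern may classify some unusual Unicode characters differently from isdigit() and isalpha().

```python
import re

# Assumed pattern: a run of digits, or a run of letters, or any single
# non-whitespace character that is neither of those.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\S")

def tokenize_re(lines):
    """Hypothetical regex-based variant of tokenize (not from the thread)."""
    words = []
    for line in lines:
        # findall skips whitespace because no alternative matches it,
        # so blank lines contribute no tokens at all.
        for token in TOKEN_RE.findall(line):
            words.append(token.lower())
    return words
```

Because whitespace never matches, there is no empty slice to guard against, which is the case that tripped up the original else branch.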
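Thanks! It worked!
This is my second week of programming, so I really appreciate it.
Now, suppose I want to do the word count and only count the words in the text that are not in another file called "stopWords". Is this approach on the right track, or completely wrong?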
def countWords(words, stopWords):
    counts = {}
    for w in words:
        counts[w] = counts.get(w,0) + 1
        if words in eng_stopwords == True
            frequencies = counts in words and words not in frequencies == True
    return counts
I'm not sure about the 5th and 6th lines.
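One possible way to write this is to skip a word before counting it, instead of testing the whole words list against the stop words afterwards. The sketch below assumes stopWords is a list or set of lower-case strings that has already been read from the file. The thread does not show how that file is loaded.

```python
def countWords(words, stopWords):
    """Count each word in words, ignoring anything found in stopWords.

    Assumption: stopWords is a collection (list or set) of lower-case
    strings, matching the lower-cased tokens produced by tokenize.
    """
    counts = {}
    for w in words:
        # Test the single word w, not the whole list.
        if w in stopWords:
            continue
        counts[w] = counts.get(w, 0) + 1
    return counts
```

Compared with the attempt above, the membership test uses w rather than words, the parameter stopWords rather than the undefined eng_stopwords, and no comparison with == True is needed because in already yields a boolean.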