How to separate a comma from a word (tokenization)
I'm having some trouble with tokenization. The assignment is to split a sentence into words.
This is what I have so far.
def tokenize(s):
    d = []
    start = 0
    while start < len(s):
        # skip any whitespace before the next token
        while start < len(s) and s[start].isspace():
            start = start + 1
        end = start
        # advance end to the first whitespace after the token
        while end < len(s) and not s[end].isspace():
            end = end + 1
        d = d + [s[start:end]]
        start = end
    print(d)
Running the program:
>>> tokenize("He was walking, it was fun")
['He', 'was', 'walking,', 'it', 'was', 'fun']
This works fine, except that, as you can see, my program keeps the comma attached to the word walking. I want the comma (and other "symbols") to be separated out as a "word" of their own, like this:
['He', 'was', 'walking', ',', 'it', 'was', 'fun']
How can I modify my code to fix this?
Thanks in advance!
Here is one possible modification. It works for your particular example, but it will certainly fail on input such as "How are you?!":
def tokenize(s):
    d = []
    start = 0
    while start < len(s):
        # skip any whitespace before the next token
        while start < len(s) and s[start].isspace():
            start = start + 1
        end = start
        # advance end to the first whitespace after the token
        while end < len(s) and not s[end].isspace():
            end = end + 1
        # if the token ends in a punctuation mark, emit the mark separately
        if s[end-1] in ["!", ",", ".", ";", ":"]:
            d = d + [s[start:(end-1)]]
            d = d + [s[end-1]]
        else:
            d = d + [s[start:end]]
        start = end
    print(d)
tokenize("He was walking, it was fun!")
# ['He', 'was', 'walking', ',', 'it', 'was', 'fun', '!']
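As noted above, the single-character check breaks down when several punctuation marks run together, as in "How are you?!". A more robust alternative, offered here only as a sketch, is a regular expression that grabs either a run of word characters or a single punctuation character:

import re

def tokenize(s):
    # \w+ matches a run of word characters (a word);
    # [^\w\s] matches one character that is neither a word
    # character nor whitespace, i.e. a punctuation mark
    return re.findall(r"\w+|[^\w\s]", s)

print(tokenize("How are you?!"))
# ['How', 'are', 'you', '?', '!']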
Another approach is to use the split function, as shown below:
def tokenize(s):
    # split on commas first, then split each piece on whitespace
    d1 = s.split(",")
    d3 = []
    for d2 in d1:
        for d in d2.split():
            d3.append(d)
        # re-insert the comma that split() consumed
        d3.append(",")
    # the loop adds one comma too many at the very end; drop it
    d3.pop(-1)
    print(d3)
tokenize("He was walking, it was fun")