How to break a sentence into words
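I'd like to ask how to break a sentence into words. In Python NLP (natural language processing), is this done with something called NLTK, or with a parser? I'm confused about the options in Python and about which approach I should use in my case.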
If you want to find all the words a sentence contains, i.e. tokenization, use NLTK:
import nltk
tokens = nltk.word_tokenize(sentence)
Note that a plain whitespace split, sentence.split(), gives worse results.
In particular, 'This quickly comes into problems when an abbreviation is processed. “etc.” would be interpreted as a sentence terminator, and “U.N.E.S.C.O.” would be interpreted as six individual sentences, when both should be treated as single word tokens. How should hyphens be interpreted? What about speech marks and apostrophes?'
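Or see another source: “You chop on whitespace and throw away punctuation characters. This is a starting point, but even for English there are a number of tricky cases. For example, what do you do about the various uses of the apostrophe for possession and contractions?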
Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
A simple strategy is to just split on all non-alphanumeric characters, but while "o neill" looks okay, "aren t" looks bad.”
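To make the quoted point concrete, here is a minimal sketch of the "split on all non-alphanumeric characters" strategy applied to the example sentence. The regular expression and the lowercasing step are my own assumptions for illustration, not something taken from the quoted source.

```python
import re

sentence = ("Mr. O'Neill thinks that the boys' stories about "
            "Chile's capital aren't amusing.")

# Split on every run of characters that is not an ASCII letter or digit,
# drop the empty pieces, and lowercase what remains (an illustrative choice).
naive_tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", sentence) if t]

print(naive_tokens)
```

The apostrophes vanish, so "O'Neill" falls apart into "o" and "neill", and "aren't" into "aren" and "t", which is exactly the problem described in the quote.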
If you don't want to use the Natural Language Toolkit (NLTK), you can use a plain Python command, as follows:
>>> line="a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
>>>
This is taken from the answers to "How to split a string into a list?".
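If your sentences contain punctuation, plain split() leaves it glued to the words. The snippet below is only a rough sketch of a standard-library middle ground: the pattern is my own assumption, it is not what NLTK does internally, and it does not handle abbreviations such as "Mr." or "etc.".

```python
import re

sentence = ("Mr. O'Neill thinks that the boys' stories about "
            "Chile's capital aren't amusing.")

# Plain whitespace split: punctuation stays attached to the neighbouring word.
whitespace_tokens = sentence.split()

# Hypothetical regex tokenizer: a word may contain internal apostrophes
# ("O'Neill", "aren't"); any other non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With split() the last token is "amusing." including the full stop, while the regex version gives "amusing" and "." separately. For anything beyond simple input, a real tokenizer such as the one in NLTK is the safer choice.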