nltk words corpus does not contain "okay"?
The NLTK words corpus does not contain "okay", "ok", or "Okay"?
> from nltk.corpus import words
> words.words().__contains__("check")
> True
> words.words().__contains__("okay")
> False
> len(words.words())
> 236736
Any idea why?
TL;DR
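from nltk.corpus import words
from nltk.corpus import wordnet
# wordnet.words() may return an iterator rather than a list, so wrap it in list() before concatenating
manywords = words.words() + list(wordnet.words())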
In long
From the docs, nltk.corpus.words is a list of words from http://en.wikipedia.org/wiki/Words_(Unix)
On Unix, you can do:
ls /usr/share/dict/
And read the README file:
$ cd /usr/share/dict/
/usr/share/dict$ cat README
# @(#)README 8.1 (Berkeley) 6/5/93
# $FreeBSD$
WEB ---- (introduction provided by jaw@riacs) -------------------------
Welcome to web2 (Webster's Second International) all 234,936 words worth.
The 1934 copyright has lapsed, according to the supplier. The
supplemental 'web2a' list contains hyphenated terms as well as assorted
noun and adverbial phrases. The wordlist makes a dandy 'grep' victim.
-- James A. Woods {ihnp4,hplabs}!ames!jaw (or jaw@riacs)
Country names are stored in the file /usr/share/misc/iso3166.
FreeBSD Maintenance Notes ---------------------------------------------
Note that FreeBSD is not maintaining a historical document, we're
maintaining a list of current [American] English spellings.
A few words have been removed because their spellings have depreciated.
This list of words includes:
    corelation (and its derivatives)    "correlation" is the preferred spelling
    freen                               typographical error in original file
    freend                              archaic spelling no longer in use;
                                        masks common typo in modern text
--
A list of technical terms has been added in the file 'freebsd'. This
word list contains FreeBSD/Unix lexicon that is used by the system
documentation. It makes a great ispell(1) personal dictionary to
supplement the standard English language dictionary.
Since it's a fixed list of 234,936 words, there are bound to be words that are not in it.
If you need to extend your word list, you can add the words from WordNet using nltk.corpus.wordnet.words().
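As a rough sketch of that merge, the snippet below combines several word sources into one lowercase set, so that lookups are fast and "Okay" matches "okay". The function name `build_vocabulary` and the two small sample lists are hypothetical stand-ins for `words.words()` and `wordnet.words()`; they are not actual corpus contents.

```python
from typing import Iterable, Set


def build_vocabulary(*sources: Iterable[str]) -> Set[str]:
    """Merge several word sources into one lowercase set (hypothetical helper)."""
    vocabulary: Set[str] = set()
    for source in sources:
        for word in source:
            # Lowercase so that 'Okay' and 'okay' are treated as the same entry.
            vocabulary.add(word.lower())
    return vocabulary


# Illustrative sample data standing in for words.words() and wordnet.words().
unix_words = ["check", "Fail", "failure"]
wordnet_lemmas = ["okay", "fail-safe", "renal_failure"]

vocabulary = build_vocabulary(unix_words, wordnet_lemmas)
print("okay" in vocabulary)
print("Okay".lower() in vocabulary)
print("failed" in vocabulary)
```

A set is used instead of a concatenated list because membership tests on a list of a few hundred thousand words scan it linearly on every lookup.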
Most probably, all you need is a large enough corpus of text, e.g. a Wikipedia dump, which you then tokenize to extract all the unique words.
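A minimal sketch of that idea, assuming the corpus can be read as an iterable of text lines. The regular expression is a simple stand-in tokenizer chosen for illustration (a real pipeline would likely use a proper tokenizer), and `count_words` and the two sample sentences are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Stand-in tokenizer: runs of ASCII letters, optionally joined by
# internal apostrophes or hyphens (e.g. "it's", "fail-safe").
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def count_words(lines: Iterable[str]) -> Counter:
    """Tokenize each line and count lowercase tokens (hypothetical helper)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(token.lower() for token in TOKEN_PATTERN.findall(line))
    return counts


# Sample text standing in for a large corpus such as a Wikipedia dump.
sample = ["Okay, the check failed.", "It's okay: the fail-safe worked."]
counts = count_words(sample)
unique_words = sorted(counts)
print(unique_words)
```

Keeping the counts, not just the unique words, makes it possible to drop very rare tokens, which in a large dump are often typos.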
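I can't comment because of low reputation, but I can offer something.
I posted a zip file in an nltk_data issue related to this; it contains a more comprehensive word set merged in from Ubuntu 18.04's /usr/share/dict/american-english.
The original /usr/share/dict file is missing some important words, such as 'failed' and 'failings'. Unfortunately, using wordnet doesn't really solve this; it adds 'fail-safe' and several kinds of failure, such as 'equipment_failure' and 'renal_failure', but it doesn't add the basic words. Hopefully the provided zip file will be of some use.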
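To check which words a given list is missing, something like the following may help. It is only a sketch: it assumes a plain one-word-per-line file such as /usr/share/dict/american-english, and the helper names `load_word_file` and `missing_words` are made up for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Set


def load_word_file(path: Path) -> Set[str]:
    """Read a one-word-per-line file into a set, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def missing_words(candidates: Iterable[str], vocabulary: Set[str]) -> List[str]:
    """Return the candidates that are absent from the vocabulary, in order."""
    return [word for word in candidates if word not in vocabulary]


if __name__ == "__main__":
    # Assumed location; the file may not exist on every system.
    dict_path = Path("/usr/share/dict/american-english")
    if dict_path.exists():
        vocab = load_word_file(dict_path)
        print(missing_words(["okay", "failed", "failings"], vocab))
```

The same `missing_words` check can be run against a set built from `words.words()` to compare the two lists.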