CoreNLP MaxentTagger Data Format Error
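I'm using Stanford CoreNLP for NLP processing, and I'm training the part-of-speech tagger on more domain-specific data. However, for some reason, when I run it with my properties file, the trainer throws a "Data format error". Here is the context:
Training file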
Please#UH let#VBP us#PRP know#VB if#IN you#PRP have#VBP any#DT other#JJ thoughts#NNS that#WDT...
(Basically one very long line of word+tag pairs.)
Training properties file
model = special_postagger.tagger
arch = words(-1,1),unicodeshapes(-1,1),order(2),suffix(4)
wordFunction =
trainFile = /path/to/POS_trainer1.csv
closedClassTags =
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
debug = false
debugPrefix =
tagSeparator = #
encoding = UTF-8
iterations = 100
lang =
learnClosedClassTags = false
minFeatureThresh = 5
openClassTags =
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = qn
sgml = false
sigmaSquared = 0.5
regL1 = 1.0
tagInside =
tokenize = true
tokenizerFactory =
tokenizerOptions =
verbose = false
verboseResults = true
veryCommonWordThresh = 250
xmlInput =
outputFile =
outputFormat = slashTags
outputFormatOptions =
nthreads = 1
Command run
java edu.stanford.nlp.tagger.maxent.MaxentTagger -prop myProps.props
But for some reason, I get this error message:
warning: no language set, no open-class tags specified, and no closed-class tags specified; assuming ALL tags are open class tags
TaggerExperiments: adding word/tags
Exception in thread "main" java.lang.IllegalArgumentException: Data format error: can't find delimiter "#" in word "as" (line 2 of /path/to/POS_Trainer1.csv)
at edu.stanford.nlp.tagger.io.TextTaggedFileReader.primeNext(TextTaggedFileReader.java:74)
at edu.stanford.nlp.tagger.io.TextTaggedFileReader.<init>(TextTaggedFileReader.java:34)
at edu.stanford.nlp.tagger.io.TaggedFileRecord.reader(TaggedFileRecord.java:111)
at edu.stanford.nlp.tagger.maxent.ReadDataTagged.<init>(ReadDataTagged.java:52)
at edu.stanford.nlp.tagger.maxent.TaggerExperiments.<init>(TaggerExperiments.java:86)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.trainAndSaveModel(MaxentTagger.java:1140)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.runTraining(MaxentTagger.java:1207)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.main(MaxentTagger.java:1839)
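Answering my own question here: every token in the training file must be perfectly formatted as [word][delimiter][tag], otherwise the trainer throws a fatal runtime error. You can use any delimiter you like, such as the hash symbol #, but if there is:
- whitespace
- a missing tag
anywhere inside the [word][delimiter][tag] pattern, training will fail.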
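To find the offending tokens before starting a training run, a small pre-check along these lines may help. It is only a sketch: it assumes the tagger splits each line on whitespace and looks for the separator inside each resulting token, which matches the error message above but is not taken from the CoreNLP source. The helper name find_bad_tokens and the sample text are made up for illustration.

```python
from typing import List, Tuple


def find_bad_tokens(text: str, separator: str = "#") -> List[Tuple[int, str, str]]:
    """Return (line_number, token, reason) for tokens not shaped like word<sep>tag.

    Assumption: tokens are whitespace-separated and the tag follows the
    last occurrence of the separator in each token.
    """
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            idx = token.rfind(separator)
            if idx < 0:
                # e.g. a bare "as", or the word half of "as #IN"
                problems.append((line_no, token, "missing delimiter"))
            elif idx == 0:
                # e.g. the tag half of "as #IN"
                problems.append((line_no, token, "empty word"))
            elif idx == len(token) - len(separator):
                # e.g. "if#" with nothing after the separator
                problems.append((line_no, token, "empty tag"))
    return problems


# Hypothetical sample: line 2 has a bare word and a token with no tag.
sample = "Please#UH let#VBP us#PRP\nknow#VB as soon#RB if# you#PRP"
for line_no, token, reason in find_bad_tokens(sample):
    print(f"line {line_no}: {token!r}: {reason}")
```

For a real file, read it with the same encoding as in the properties file (UTF-8 here) and pass its contents to find_bad_tokens with the configured tagSeparator. An empty result means every token has a word, a delimiter, and a tag, though it does not check that the tags themselves are valid.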