How to keep hashtags and their words as a single token
How can I change the default settings so that the hashtag symbol and its word stay together (i.e. #company rather than # and company)?
library(udpipe)
library(data.table)
# Load the model once, directly from the local .udpipe file
ud_model <- udpipe_load_model("D:/Users/asongara/Documents/english-ewt-ud-2.3-181115.udpipe")
anno_op3 <- udpipe_annotate(ud_model,
                            "This is a better #company than i thought @mr_jones!",
                            tokenizer = "tokenizer",
                            tagger = "default",
                            trace = TRUE)
# Convert the annotation to a data.table for inspection
anno_op3 <- as.data.table(as.data.frame(anno_op3))
View(anno_op3)
I get # and company as two separate tokens, but I want #company as a single token. Oddly, @mr_jones does come out as a single token.
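You can combine other tokenization tools with the udpipe R package. This is shown at https://bnosac.github.io/udpipe/docs/doc2.html.
For example, the code below uses a tokenizer designed for Twitter messages and then uses udpipe for part-of-speech tagging, morphological feature annotation and dependency parsing. With tokenizer = "vertical", udpipe expects one token per line, which is why the tokens are pasted together with "\n".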
library(tokenizers)
library(udpipe)
x <- tokenize_tweets(c("#rstats is a programming_language", "you can combine the #tokenizers package with @udpipe parsing"),
lowercase = FALSE, strip_punct = FALSE)
x <- sapply(x, FUN=function(x) paste(x, collapse="\n"))
x <- udpipe(x, "english-ewt", tokenizer = "vertical", trace = TRUE)
x
doc_id paragraph_id sentence_id sentence start end term_id token_id token lemma upos xpos feats head_token_id dep_rel deps misc
doc1 1 1 <NA> 1 7 1 1 #rstats #rstat PRON PRP$ Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs 4 nsubj <NA> <NA>
doc1 1 1 <NA> 9 10 2 2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop <NA> <NA>
doc1 1 1 <NA> 12 12 3 3 a a DET DT Definite=Ind|PronType=Art 4 det <NA> <NA>
doc1 1 1 <NA> 14 33 4 4 programming_language programming_language NOUN NN Number=Sing 0 root <NA> <NA>
doc2 1 1 <NA> 1 3 1 1 you you PRON PRP Case=Nom|Person=2|PronType=Prs 3 nsubj <NA> <NA>
doc2 1 1 <NA> 5 7 2 2 can can AUX MD VerbForm=Fin 3 aux <NA> <NA>
doc2 1 1 <NA> 9 15 3 3 combine combine VERB VB VerbForm=Inf 0 root <NA> <NA>
doc2 1 1 <NA> 17 19 4 4 the the DET DT Definite=Def|PronType=Art 6 det <NA> <NA>
doc2 1 1 <NA> 21 31 5 5 #tokenizers #tokenizer NOUN NNS Number=Plur 6 compound <NA> <NA>
doc2 1 1 <NA> 33 39 6 6 package package NOUN NN Number=Sing 3 obj <NA> <NA>
doc2 1 1 <NA> 41 44 7 7 with with ADP IN <NA> 9 case <NA> <NA>
doc2 1 1 <NA> 46 52 8 8 @udpipe @udpipe NOUN NN Number=Sing 9 compound <NA> <NA>
doc2 1 1 <NA> 54 60 9 9 parsing parsing NOUN NN Number=Sing 6 nmod <NA> <NA>
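If tokenize_tweets() is not available in your installed version of the tokenizers package (it may have been dropped from newer releases, which is worth checking), the same idea can be sketched in base R. The helper below, tokenize_keep_tags, is a hypothetical function written for illustration and is not part of udpipe or tokenizers. It assumes plain ASCII text and uses a deliberately simple rule: a # or @ followed by letters, digits or underscores is one token, any other run of such characters is one token, and every other non-space character is a token of its own.

```r
# Hypothetical helper (not from udpipe or tokenizers): a minimal regex tokenizer
# that keeps a leading '#' or '@' attached to the word that follows it.
tokenize_keep_tags <- function(text) {
  # Alternatives, tried left to right at each position:
  #   1. hashtag or mention:  '#' or '@' plus word characters
  #   2. ordinary word:       letters, digits, underscores
  #   3. anything else:       a single non-space character (punctuation)
  pattern <- "[#@][[:alnum:]_]+|[[:alnum:]_]+|[^[:space:]]"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

# The sentence from the question
tokens <- tokenize_keep_tags("This is a better #company than i thought @mr_jones!")

# Build the one-token-per-line format that tokenizer = "vertical" expects
vertical <- vapply(tokens, paste, character(1), collapse = "\n")

print(tokens[[1]])
```

The vertical string can then be passed to udpipe in the same way as in the answer above, for example udpipe(vertical, "english-ewt", tokenizer = "vertical"). This sketch splits contractions such as "don't" into three pieces, so a dedicated tokenizer is likely the better choice for real Twitter data.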