How to keep hashtags and their words as a single token

Question

How can I change the default settings if I want to keep the hashtag symbol and its word together (i.e. #company rather than # and company)?

library(udpipe)
library(data.table)

# Load the English model once from its local file
ud_model <- udpipe_load_model("D:/Users/asongara/Documents/english-ewt-ud-2.3-181115.udpipe")
anno_op3 <- udpipe_annotate(ud_model, 
                            "This is a better #company than i thought @mr_jones!", 
                            tokenizer = "tokenizer", 
                            tagger = "default", 
                            trace = TRUE)

anno_op3 <- as.data.table(as.data.frame(anno_op3))

View(anno_op3)

I get # and company as two separate tokens, but I want #company as a single token. I do, however, get @mr_jones as a single token.

Answer 1

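You can combine other tokenizers with the udpipe R package, as shown at https://bnosac.github.io/udpipe/docs/doc2.html. For example, the code below uses a tokenizer designed for Twitter messages and then uses udpipe for part-of-speech tagging, morphological feature annotation, and dependency parsing.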

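library(tokenizers)
library(udpipe)
# Tokenize with a Twitter-aware tokenizer that keeps #hashtags and @mentions intact
# (tokenize_tweets() was removed in tokenizers 0.3.0, so this needs an older version)
x <- tokenize_tweets(c("#rstats is a programming_language", "you can combine the #tokenizers package with @udpipe parsing"), 
                     lowercase = FALSE, strip_punct = FALSE)
# One token per line: the "vertical" format tells udpipe the text is already tokenized
x <- sapply(x, FUN=function(x) paste(x, collapse="\n"))
# Tag, lemmatize, and parse without re-tokenizing
x <- udpipe(x, "english-ewt", tokenizer = "vertical", trace = TRUE)
x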
 doc_id paragraph_id sentence_id sentence start end term_id token_id                token                lemma upos xpos                                                  feats head_token_id  dep_rel deps misc
   doc1            1           1     <NA>     1   7       1        1              #rstats               #rstat PRON PRP$ Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs             4    nsubj <NA> <NA>
   doc1            1           1     <NA>     9  10       2        2                   is                   be  AUX  VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4      cop <NA> <NA>
   doc1            1           1     <NA>    12  12       3        3                    a                    a  DET   DT                              Definite=Ind|PronType=Art             4      det <NA> <NA>
   doc1            1           1     <NA>    14  33       4        4 programming_language programming_language NOUN   NN                                            Number=Sing             0     root <NA> <NA>
   doc2            1           1     <NA>     1   3       1        1                  you                  you PRON  PRP                         Case=Nom|Person=2|PronType=Prs             3    nsubj <NA> <NA>
   doc2            1           1     <NA>     5   7       2        2                  can                  can  AUX   MD                                           VerbForm=Fin             3      aux <NA> <NA>
   doc2            1           1     <NA>     9  15       3        3              combine              combine VERB   VB                                           VerbForm=Inf             0     root <NA> <NA>
   doc2            1           1     <NA>    17  19       4        4                  the                  the  DET   DT                              Definite=Def|PronType=Art             6      det <NA> <NA>
   doc2            1           1     <NA>    21  31       5        5          #tokenizers           #tokenizer NOUN  NNS                                            Number=Plur             6 compound <NA> <NA>
   doc2            1           1     <NA>    33  39       6        6              package              package NOUN   NN                                            Number=Sing             3      obj <NA> <NA>
   doc2            1           1     <NA>    41  44       7        7                 with                 with  ADP   IN                                                   <NA>             9     case <NA> <NA>
   doc2            1           1     <NA>    46  52       8        8              @udpipe              @udpipe NOUN   NN                                            Number=Sing             9 compound <NA> <NA>
   doc2            1           1     <NA>    54  60       9        9              parsing              parsing NOUN   NN                                            Number=Sing             6     nmod <NA> <NA>
>
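If tokenize_tweets() is not available in your installed version of tokenizers, the same idea can be sketched with base R only. The helper below, tokenize_keep_hashtags(), is a hypothetical function written for this illustration and is not part of udpipe or tokenizers. Its regular expression is a deliberately simple assumption about what counts as a token: it keeps #hashtags and @mentions whole, but it will split contractions such as "don't" into three tokens.

```r
# Hypothetical helper (not part of udpipe or tokenizers): split text into
# tokens while keeping #hashtags and @mentions intact.
tokenize_keep_hashtags <- function(text) {
  # Alternatives are tried left to right:
  #   1. "#" or "@" followed by letters, digits, or underscores
  #   2. a plain run of letters, digits, or underscores
  #   3. any other single non-space character (punctuation)
  pattern <- "[#@][[:alnum:]_]+|[[:alnum:]_]+|[^[:space:]]"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

texts <- c("This is a better #company than i thought @mr_jones!")
tokens <- tokenize_keep_hashtags(texts)

# One token per line, i.e. the "vertical" input format used in the answer
vertical <- vapply(tokens, paste, character(1), collapse = "\n")
```

The resulting vertical string can then be passed to udpipe(vertical, "english-ewt", tokenizer = "vertical"), exactly as in the answer above, so that udpipe tags and parses the tokens without splitting them again.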

Tags: r, token, udpipe