Quanteda: applying Yoshikoder dictionary with multiple levels

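I am using quanteda for quantitative text analysis with a dictionary-based approach, and I am building my own dictionary with Lowe's Yoshikoder. I can use my Yoshikoder dictionary with quanteda (see below), but the function only takes the first level of the dictionary into account. I need to see the values for every category, including all subcategories (at least 4 levels). How can I do that?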

# load my Yoshikoder dictionary with multiple levels
mydict <- dictionary(file = "mydictionary.ykd",
                     format = "yoshikoder", concatenator = "_",
                     tolower = TRUE, encoding = "auto")

# apply dictionary
mydfm <- dfm(mycorpus, dictionary = mydict)
mydfm
# problem: shows only results for the first level of the dictionary

dfm_lookup (and tokens_lookup) have a levels argument whose default is 1:5. Try applying the lookup as a separate step:

mydfm <- dfm(mycorpus)
dfm_lookup(mydfm, dictionary = mydict)

or:

mytoks <- tokens(mycorpus)
mytoks <- tokens_lookup(mytoks, dictionary = mydict)
dfm(mytoks)
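
To see what the levels argument does, here is a minimal sketch on a toy nested dictionary. The dictionary keys, values, and example sentence are made up for illustration and are not taken from the question; only tokens_lookup, its levels argument, dfm, and featnames come from quanteda.

```r
library(quanteda)

# Toy nested dictionary (hypothetical keys and values, for illustration only)
toydict <- dictionary(list(
  economy = list(state = c("pension", "school"), market = c("profit", "trade")),
  values  = list(liberal = "rights", conservative = "heritage")
))

toks <- tokens("Pension and school funding, profit and trade, rights and heritage.")

# levels = 1 collapses every match onto the top-level keys
dfm_top <- dfm(tokens_lookup(toks, dictionary = toydict, levels = 1))

# levels = 1:2 keeps the nested keys as separate features
dfm_all <- dfm(tokens_lookup(toks, dictionary = toydict, levels = 1:2))

# Inspect how the keys are labelled at each setting
featnames(dfm_top)
featnames(dfm_all)
```

Comparing the two sets of feature names is a quick way to check whether the deeper levels of your own dictionary are actually being applied.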

Update:

This is now fixed in v0.9.9.55.

> library(quanteda)
# Loading required package: quanteda
# quanteda version 0.9.9.55
# Using 7 of 8 cores for parallel computing

> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd")
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE)
# Creating a dfm from a corpus ...
# ... tokenizing texts
# ... lowercasing
# ... found 14 documents, 5,058 features
# ... applying a dictionary consisting of 19 keys
# ... created a 14 x 19 sparse dfm
# ... complete. 
# Elapsed time: 0.422 seconds.

> mydict
# Dictionary object with 9 primary key entries and 2 nested levels.
# - Economy:
#     - +State+:
#     - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow*
#     - =State=:
#     - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work*
#     - -State-:
#     - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher*
# - Institutions:
#     - Radical:
#     - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice*
#     - Neutral:
#     - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote*
#     - Conservative:
#     - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike*
# - Values:
#     - Liberal:
#     - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex*
#     - Conservative:
#     - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition*
# - Law and Order:
#     - Liberal:
#     - harassment, non-custodial
#     - Conservative:
#     - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan*
# - Environment:
#     - Pro:
#     - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
# 
#     ...

Until that is fixed in quanteda, try this replacement function, which collapses the nested categories into flat keys:

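library(quanteda)
library(xml2)

# Read a Yoshikoder .ykd file into a flat quanteda dictionary whose keys
# are the full category paths (e.g. "Top>Sub>Leaf")
read_dict_yoshikoder <- function(path, sep = ">") {
  doc <- xml2::read_xml(path)
  # every <pnode> is one pattern; its "name" attribute holds the pattern text
  pats <- xml2::xml_find_all(doc, ".//pnode")
  pnode_names <- xml2::xml_attr(pats, "name")
  # build a pattern's category path from the names of its ancestor nodes
  get_pnode_path <- function(pn) {
    pars <- xml2::xml_attr(xml2::xml_parents(pn), "name")
    # reverse so the path reads top-down; unnamed nodes (NA) are dropped
    paste0(rev(na.omit(pars)), collapse = sep)
  }
  pnode_paths <- lapply(pats, get_pnode_path)
  # group the patterns by category path, one dictionary key per path
  lst <- split(pnode_names, unlist(pnode_paths))
  dictionary(lst)
}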

Usage:

read_dict_yoshikoder("laver-garry-ajps.ykd")

Dictionary object with 19 key entries.
 - Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre*
 - Laver and Garry>Culture>Popular: media
 - Laver and Garry>Culture>Sport: angler*
 - Laver and Garry>Environment>Con: produc*
 - Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
 - Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci*

...
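
To try the path-collapsing step without a real dictionary file, here is a small sketch using only xml2. The inline XML is a hypothetical miniature that I assume mimics the .ykd layout (category nodes wrapping pnode patterns, as the function above implies); the category names and patterns are invented for the example.

```r
library(xml2)

# Hypothetical miniature dictionary, assumed to mimic the .ykd layout
ykd <- '<dictionary>
  <cnode name="Demo">
    <cnode name="Culture">
      <cnode name="High"><pnode name="art"/><pnode name="opera*"/></cnode>
      <cnode name="Popular"><pnode name="media"/></cnode>
    </cnode>
  </cnode>
</dictionary>'

doc <- read_xml(ykd)
pats <- xml_find_all(doc, ".//pnode")

# For each pattern, collect the names of its ancestors, drop the unnamed
# root element, and reverse so the path reads top-down
paths <- vapply(seq_along(pats), function(i) {
  pars <- xml_attr(xml_parents(pats[[i]]), "name")
  paste(rev(pars[!is.na(pars)]), collapse = ">")
}, character(1))

# Group the patterns by their category path
flat <- split(xml_attr(pats, "name"), paths)
str(flat)
```

The resulting named list is the kind of object that is passed to dictionary() in the function above, so each full path becomes one flat key.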