Quanteda: applying a Yoshikoder dictionary with multiple levels
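I am using quanteda for quantitative text analysis with a dictionary-based approach. I am building my own dictionary with Lowe's Yoshikoder. I can use my Yoshikoder dictionary with quanteda (see below), but the function only takes the first level of the dictionary into account. I need to see the values for every category, including all subcategories (at least 4 levels). How can I do this?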
# load my Yoshikoder dictionary with multiple levels
mydict <- dictionary(file = "mydictionary.ykd",
                     format = "yoshikoder", concatenator = "_",
                     tolower = TRUE, encoding = "auto")
# apply dictionary
mydfm <- dfm(mycorpus, dictionary = mydict)
mydfm
# problem: shows only results for the first level of the dictionary
dfm_lookup() (and tokens_lookup()) has a levels argument whose default is 1:5. Try applying the lookup as a separate step:
mydfm <- dfm(mycorpus)
dfm_lookup(mydfm, dictionary = mydict)
Or:
mytoks <- tokens(mycorpus)
mytoks <- tokens_lookup(mytoks, dictionary = mydict)
dfm(mytoks)
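As a small illustration of the levels argument, here is a sketch with a toy nested dictionary. The dictionary keys, patterns, and example texts are made up for illustration and are not taken from the question. It assumes a reasonably recent quanteda, where the corpus is tokenized first and the lookup is applied to the tokens.

```r
library(quanteda)

# Toy two-level dictionary (hypothetical keys and patterns)
dict <- dictionary(list(
  economy = list(
    state  = c("pension*", "hospital*"),
    market = c("privat*", "compet*")
  ),
  environment = list(
    pro = c("recycl*", "green")
  )
))

toks <- tokens(c(
  d1 = "Pensions and hospitals need funding, not privatisation.",
  d2 = "Green recycling and more competition."
))

# levels = 1 collapses every match onto the top-level keys
by_top <- dfm(tokens_lookup(toks, dictionary = dict, levels = 1))

# levels = 1:2 keeps the second-level keys distinct
by_nested <- dfm(tokens_lookup(toks, dictionary = dict, levels = 1:2))

print(by_top)
print(by_nested)
```

For a dictionary with four levels, levels = 1:4 (or the default 1:5) should keep every nesting level, while a single value such as levels = 2 restricts the keys to that level only.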
Update:
This is now fixed in v0.9.9.55.
> library(quanteda)
# Loading required package: quanteda
# quanteda version 0.9.9.55
# Using 7 of 8 cores for parallel computing
> mydict <- dictionary(file = "~/Desktop/LaverGarryAJPS.ykd")
> mydfm <- dfm(data_corpus_irishbudget2010, dictionary = mydict, verbose = TRUE)
# Creating a dfm from a corpus ...
# ... tokenizing texts
# ... lowercasing
# ... found 14 documents, 5,058 features
# ... applying a dictionary consisting of 19 keys
# ... created a 14 x 19 sparse dfm
# ... complete.
# Elapsed time: 0.422 seconds.
> mydict
# Dictionary object with 9 primary key entries and 2 nested levels.
# - Economy:
# - +State+:
# - accommodation, age, ambulance, assist, benefit, care, class, classes, clinics, deprivation, disabilities, disadvantaged, elderly, establish, hardship, hunger, invest, investing, investment, patients, pension, poor, poorer, poorest, poverty, school, transport, vulnerable, carer*, child*, collective*, contribution*, cooperative*, co-operative*, educat*, equal*, fair*, guarantee*, health*, homeless*, hospital*, inequal*, means-test*, nurse*, rehouse*, re-house*, teach*, underfund*, unemploy*, widow*
# - =State=:
# - accountant, accounting, accounts, bargaining, electricity, fee, fees, import, imports, jobs, opportunity, performance, productivity, settlement, software, supply, trade, welfare, advert*, airline*, airport*, audit*, bank*, breadwinner*, budget*, buy*, cartel*, cash*, charge*, chemical*, commerce*, compensat*, consum*, cost*, credit*, customer*, debt*, deficit*, dwelling*, earn*, econ*, estate*, export*, financ*, hous*, industr*, lease*, loan*, manufactur*, mortgage*, negotiat*, partnership*, passenger*, pay*, port*, profession*, purchas*, railway*, rebate*, recession*, research*, revenue*, salar*, sell*, supplier*, telecom*, telephon*, tenan*, touris*, train*, wage*, work*
# - -State-:
# - assets, autonomy, bid, bidders, bidding, confidence, confiscatory, controlled, controlling, controls, corporate, deregulating, expensive, fund-holding, initiative, intrusive, monetary, money, private, privately, privatisations, privatised, privatising, profitable, risk, risks, savings, shares, sponsorship, taxable, taxes, tax-free, trading, value, barrier*, burden*, charit*, choice*, compet*, constrain*, contracting*, contractor*, corporation*, dismantl*, entrepreneur*, flexib*, franchise*, fundhold*, homestead*, investor*, liberali*, market*, own*, produce*, regulat*, retail*, sell*, simplif*, spend*, thrift*, volunt*, voucher*
# - Institutions:
# - Radical:
# - abolition, accountable, answerable, scrap, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap*, scrutin*, transform*, voice*
# - Neutral:
# - assembly, headquarters, office, offices, official, opposition, queen, voting, westminster, administr*, advis*, agenc*, amalgamat*, appoint*, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, legislat*, mechanism*, minister*, operat*, organisation*, parliament*, presiden*, procedur*, process*, regist*, scheme*, secretariat*, sovereign*, subcommittee*, tribunal*, vote*
# - Conservative:
# - authority, legitimate, moratorium, whitehall, continu*, disrupt*, inspect*, jurisdiction*, manag*, rul*, strike*
# - Values:
# - Liberal:
# - innocent, inter-racial, rights, cruel*, discriminat*, human*, injustice*, minorit*, repressi*, sex*
# - Conservative:
# - defend, defended, defending, discipline, glories, glorious, grammar, heritage, integrity, maintain, majesty, marriage, past, pride, probity, professionalism, proud, histor*, honour*, immigra*, inherit*, jubilee*, leader*, obscen*, pornograph*, preserv*, principl*, punctual*, recapture*, reliab*, threat*, tradition*
# - Law and Order:
# - Liberal:
# - harassment, non-custodial
# - Conservative:
# - assaults, bail, court, courts, dealing, delinquen*, deter, disorder, fine, fines, firmness, police, policemen, policing, probation, prosecution, re-offend, ruc, sentence*, shop-lifting, squatting, uniformed, unlawful, victim*, burglar*, constab*, convict*, custod*, deter*, drug*, force*, fraud*, guard*, hooligan*, illegal*, intimidat*, joy-ride*, lawless*, magistrat*, offence*, officer*, penal*, prison*, punish*, seiz*, terror*, theft*, thug*, tough*, trafficker*, vandal*, vigilan*
# - Environment:
# - Pro:
# - car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
#
# ...
While I fix this in quanteda, try this replacement function, which collapses the nested categories into a single level:
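library(xml2)
library(quanteda)  # provides dictionary()

read_dict_yoshikoder <- function(path, sep = ">") {
  doc <- xml2::read_xml(path)
  # every pattern in the file is a <pnode> element
  pats <- xml2::xml_find_all(doc, ".//pnode")
  pnode_names <- xml2::xml_attr(pats, "name")
  # build the full category path of a pattern from its ancestors' names
  get_pnode_path <- function(pn) {
    pars <- xml2::xml_attr(xml2::xml_parents(pn), "name")
    paste0(rev(na.omit(pars)), collapse = sep)
  }
  pnode_paths <- lapply(pats, get_pnode_path)
  # group the patterns by path, giving one flat key per category path
  lst <- split(pnode_names, unlist(pnode_paths))
  dictionary(lst)
}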
Usage:
read_dict_yoshikoder("laver-garry-ajps.ykd")
Dictionary object with 19 key entries.
- Laver and Garry>Culture>High: art, artistic, dance, galler*, museum*, music*, opera*, theatre*
- Laver and Garry>Culture>Popular: media
- Laver and Garry>Culture>Sport: angler*
- Laver and Garry>Environment>Con: produc*
- Laver and Garry>Environment>Pro: car, catalytic, congestion, energy-saving, fur, green, husbanded, opencast, ozone, planet, population, re-use, toxic, warming, chemical*, chimney*, clean*, cyclist*, deplet*, ecolog*, emission*, environment*, habitat*, hedgerow*, litter*, open-cast*, recycl*, re-cycl*
- Laver and Garry>Groups>Ethnic: race, asian*, buddhist*, ethnic*, raci*
...
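To see what the path-building step in the function above does, here is a minimal sketch that runs the same logic on a small XML string instead of a real .ykd file. The element names dictionary and cnode and all entries in the toy document are assumptions for illustration; only the pnode elements and the name attribute are taken from the function above.

```r
library(xml2)

# Toy document mimicking a nested dictionary (hypothetical structure)
xml_txt <- '<dictionary>
  <cnode name="Toy">
    <cnode name="Culture">
      <pnode name="art"/>
      <pnode name="music*"/>
    </cnode>
    <cnode name="Environment">
      <cnode name="Pro">
        <pnode name="green"/>
      </cnode>
    </cnode>
  </cnode>
</dictionary>'

doc  <- read_xml(xml_txt)
pats <- xml_find_all(doc, ".//pnode")
patterns <- xml_attr(pats, "name")

# For each pattern, join the names of its ancestors from the root down;
# ancestors without a name attribute yield NA and are dropped
paths <- unlist(lapply(pats, function(pn) {
  pars <- xml_attr(xml_parents(pn), "name")
  paste0(rev(na.omit(pars)), collapse = ">")
}))

# One flat entry per full category path
lst <- split(patterns, paths)
print(lst)
```

Because each key carries its whole path, the resulting list has a single level, so a lookup on it should report every subcategory separately regardless of how lookup levels are handled.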