Running a Python loop to iterate over and undo an inverted index
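I am trying to undo an inverted index to produce plain text. I rarely use Python, so I wrote the algorithm from what I remember from a few years ago. This is what I want to print: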
Despite growing interest in Open Access (OA) to scholarly literature,
there is an unmet need for large-scale, up-to-date, and reproducible
studies assessing the prevalence and characteristics of OA. We address
this need using oaDOI, an open online service that determines OA
status for 67 million articles. We use three samples, each of 100,000
articles, to investigate OA in three populations: (1) all journal
articles assigned a Crossref DOI, (2) recent journal articles indexed
in Web of Science, and (3) articles viewed by users of Unpaywall, an
open-source browser extension that lets users find OA articles using
oaDOI. We estimate that at least 28% of the scholarly literature is OA
(19M in total) and that this proportion is growing, driven
particularly by growth in Gold and Hybrid. The most recent year
analyzed (2015) also has the highest percentage of OA (45%). Because
of this growth, and the fact that readers disproportionately access
newer articles, we find that Unpaywall users encounter OA quite
frequently: 47% of articles they view are OA. Notably, the most common
mechanism for OA is not Gold, Green, or Hybrid OA, but rather an
under-discussed category we dub Bronze: articles made free-to-read on
the publisher website, without an explicit Open license. We also
examine the citation impact of OA articles, corroborating the
so-called open-access citation advantage: accounting for age and
discipline, OA articles receive 18% more citations than average, an
effect driven primarily by Green and Hybrid OA. We encourage further
research using the free oaDOI service, as a way to inform OA policy
and practice.
This is the data in the inverted index (found under "abstract_inverted_index" at https://api.openalex.org/W2741809807):
"abstract_inverted_index":{"Despite":[0],"growing":[1],"interest":[2],"in":[3,57,73,110,122],"Open":[4,201],"Access":[5],"(OA)":[6],"to":[7,54,252],"scholarly":[8,105],"literature,":[9],"there":[10],"is":[11,107,116,176],"an":[12,34,85,185,199,231],"unmet":[13],"need":[14,31],"for":[15,42,174,219],"large-scale,":[16],"up-to-date,":[17],"and":[18,24,77,112,124,144,221,237,256],"reproducible":[19],"studies":[20],"assessing":[21],"the":[22,104,134,145,170,195,206,213,245],"prevalence":[23],"characteristics":[25],"of":[26,51,75,83,103,137,141,163,209],"OA.":[27,168,239],"We":[28,46,97,203,240],"address":[29],"this":[30,114,142],"using":[32,95,244],"oaDOI,":[33],"open":[35],"online":[36],"service":[37],"that":[38,89,99,113,147,155],"determines":[39],"OA":[40,56,93,108,138,159,175,210,223,254],"status":[41],"67":[43],"million":[44],"articles.":[45],"use":[47],"three":[48,58],"samples,":[49],"each":[50],"100,000":[52],"articles,":[53,152,211],"investigate":[55],"populations:":[59],"(1)":[60],"all":[61],"journal":[62,70],"articles":[63,71,79,94,164,191,224],"assigned":[64],"a":[65,250],"Crossref":[66],"DOI,":[67],"(2)":[68],"recent":[69,128],"indexed":[72],"Web":[74],"Science,":[76],"(3)":[78],"viewed":[80],"by":[81,120,235],"users":[82,91,157],"Unpaywall,":[84],"open-source":[86],"browser":[87],"extension":[88],"lets":[90],"find":[92,154],"oaDOI.":[96],"estimate":[98],"at":[100],"least":[101],"28%":[102],"literature":[106],"(19M":[109],"total)":[111],"proportion":[115],"growing,":[117],"driven":[118,233],"particularly":[119],"growth":[121],"Gold":[123],"Hybrid.":[125],"The":[126],"most":[127,171],"year":[129],"analyzed":[130],"(2015)":[131],"also":[132,204],"has":[133],"highest":[135],"percentage":[136],"(45%).":[139],"Because":[140],"growth,":[143],"fact":[146],"readers":[148],"disproportionately":[149],"access":[150],"newer":[151],"we":[153,188],"Unpaywall":[156],"encounter":[158],"quite":[160],"frequently:":[161],"47%":[162],"they":[165],"view":[166],"are":[167],"Notably,":[169],"common":[172],"mechanism":[173],"not":[177],"Gold,":[178],"Green,":[179],"or":[180],"Hybrid":[181,238],"OA,":[182],"but":[183],"rather":[184],"under-discussed":[186],"category":[187],"dub":[189],"Bronze:":[190],"made":[192],"free-to-read":[193],"on":[194],"publisher":[196],"website,":[197],"without":[198],"explicit":[200],"license.":[202],"examine":[205],"citation":[207,216],"impact":[208],"corroborating":[212],"so-called":[214],"open-access":[215],"advantage:":[217],"accounting":[218],"age":[220],"discipline,":[222],"receive":[225],"18%":[226],"more":[227],"citations":[228],"than":[229],"average,":[230],"effect":[232],"primarily":[234],"Green":[236],"encourage":[241],"further":[242],"research":[243],"free":[246],"oaDOI":[247],"service,":[248],"as":[249],"way":[251],"inform":[253],"policy":[255],"practice.":[257]}
This is the code I currently use to decode the inverted index, but it returns only the truncated text shown after it:
import requests
abstractInvertedIndex = requests.get(
    'https://api.openalex.org/W2741809807'
).json()['abstract_inverted_index']
arrayAbstractIndex = [[k, abstractInvertedIndex[k]] for k in abstractInvertedIndex]
# Position of the word in the abstract
wordPos = 0
# The number position of the key value
wordNum = 0
abstract = ""
for x in arrayAbstractIndex:
    if wordPos in arrayAbstractIndex[wordNum][1]:
        abstract = abstract + str(arrayAbstractIndex[wordNum][0] + ' ')
        wordPos = wordPos + 1
        wordNum = wordNum + 1
print(abstract)
Despite growing interest in Open Access (OA) to scholarly literature, there is an unmet need for large-scale, up-to-date, and reproducible studies assessing the prevalence
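I understand this happens because the word 'and' has more than one position in the index, but I can't work out how to set up a Python for loop that goes through every array item under every dictionary key and value, so that the whole plain text is printed.
Any suggestions?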
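abstractInvertedIndex is a dictionary of the form word: [indices]. From this dictionary, first build a list of (word, index) pairs:
word_index = []
for k, v in abstractInvertedIndex.items():
    for index in v:
        word_index.append([k, index])
Now sort the word_index list by index to restore the original word order:
word_index = sorted(word_index, key=lambda x: x[1])
Finally, join only the words from the word_index list with spaces: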
Despite growing interest in Open Access (OA) ...... as a way to inform OA policy and practice.
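Put together, the three steps above could look roughly like the sketch below. The function name rebuild_abstract and the tiny sample_index dictionary are made up for illustration and are not part of the OpenAlex API. The sample only mimics the shape of abstract_inverted_index, so the snippet runs without a network request.

```python
def rebuild_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} inverted index."""
    # Flatten the dictionary into (position, word) pairs, one per occurrence.
    pairs = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    pairs.sort(key=lambda pair: pair[0])
    # Keep only the words and join them with single spaces.
    return " ".join(word for _, word in pairs)


# Hypothetical miniature index with the same shape as abstract_inverted_index.
sample_index = {"the": [0, 3], "cat": [1], "saw": [2], "dog": [4]}
print(rebuild_abstract(sample_index))  # prints: the cat saw the dog
```

With the real data, passing the abstractInvertedIndex dictionary from the question to the same function should give the full abstract, assuming every position appears exactly once in the index.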