Running a Python loop to iterate over and undo an inverted index

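I am trying to undo an inverted index to produce plain text. I rarely use Python, so I am relying on what I remember from a few years ago to write the algorithm. This is what I want to print: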

Despite growing interest in Open Access (OA) to scholarly literature, there is an unmet need for large-scale, up-to-date, and reproducible studies assessing the prevalence and characteristics of OA. We address this need using oaDOI, an open online service that determines OA status for 67 million articles. We use three samples, each of 100,000 articles, to investigate OA in three populations: (1) all journal articles assigned a Crossref DOI, (2) recent journal articles indexed in Web of Science, and (3) articles viewed by users of Unpaywall, an open-source browser extension that lets users find OA articles using oaDOI. We estimate that at least 28% of the scholarly literature is OA (19M in total) and that this proportion is growing, driven particularly by growth in Gold and Hybrid. The most recent year analyzed (2015) also has the highest percentage of OA (45%). Because of this growth, and the fact that readers disproportionately access newer articles, we find that Unpaywall users encounter OA quite frequently: 47% of articles they view are OA. Notably, the most common mechanism for OA is not Gold, Green, or Hybrid OA, but rather an under-discussed category we dub Bronze: articles made free-to-read on the publisher website, without an explicit Open license. We also examine the citation impact of OA articles, corroborating the so-called open-access citation advantage: accounting for age and discipline, OA articles receive 18% more citations than average, an effect driven primarily by Green and Hybrid OA. We encourage further research using the free oaDOI service, as a way to inform OA policy and practice.

This is the data in the inverted index (it can be found under "abstract_inverted_index" at https://api.openalex.org/W2741809807):

"abstract_inverted_index":{"Despite":[0],"growing":[1],"interest":[2],"in":[3,57,73,110,122],"Open":[4,201],"Access":[5],"(OA)":[6],"to":[7,54,252],"scholarly":[8,105],"literature,":[9],"there":[10],"is":[11,107,116,176],"an":[12,34,85,185,199,231],"unmet":[13],"need":[14,31],"for":[15,42,174,219],"large-scale,":[16],"up-to-date,":[17],"and":[18,24,77,112,124,144,221,237,256],"reproducible":[19],"studies":[20],"assessing":[21],"the":[22,104,134,145,170,195,206,213,245],"prevalence":[23],"characteristics":[25],"of":[26,51,75,83,103,137,141,163,209],"OA.":[27,168,239],"We":[28,46,97,203,240],"address":[29],"this":[30,114,142],"using":[32,95,244],"oaDOI,":[33],"open":[35],"online":[36],"service":[37],"that":[38,89,99,113,147,155],"determines":[39],"OA":[40,56,93,108,138,159,175,210,223,254],"status":[41],"67":[43],"million":[44],"articles.":[45],"use":[47],"three":[48,58],"samples,":[49],"each":[50],"100,000":[52],"articles,":[53,152,211],"investigate":[55],"populations:":[59],"(1)":[60],"all":[61],"journal":[62,70],"articles":[63,71,79,94,164,191,224],"assigned":[64],"a":[65,250],"Crossref":[66],"DOI,":[67],"(2)":[68],"recent":[69,128],"indexed":[72],"Web":[74],"Science,":[76],"(3)":[78],"viewed":[80],"by":[81,120,235],"users":[82,91,157],"Unpaywall,":[84],"open-source":[86],"browser":[87],"extension":[88],"lets":[90],"find":[92,154],"oaDOI.":[96],"estimate":[98],"at":[100],"least":[101],"28%":[102],"literature":[106],"(19M":[109],"total)":[111],"proportion":[115],"growing,":[117],"driven":[118,233],"particularly":[119],"growth":[121],"Gold":[123],"Hybrid.":[125],"The":[126],"most":[127,171],"year":[129],"analyzed":[130],"(2015)":[131],"also":[132,204],"has":[133],"highest":[135],"percentage":[136],"(45%).":[139],"Because":[140],"growth,":[143],"fact":[146],"readers":[148],"disproportionately":[149],"access":[150],"newer":[151],"we":[153,188],"Unpaywall":[156],"encounter":[158],"quite":[160],"frequently:":[161],"47%":[162],"they":[165],"view":[166],"are":[167],"Notably,":[169],"common":[172],"mechanism":[173],"not":[177],"Gold,":[178],"Green,":[179],"or":[180],"Hybrid":[181,238],"OA,":[182],"but":[183],"rather":[184],"under-discussed":[186],"category":[187],"dub":[189],"Bronze:":[190],"made":[192],"free-to-read":[193],"on":[194],"publisher":[196],"website,":[197],"without":[198],"explicit":[200],"license.":[202],"examine":[205],"citation":[207,216],"impact":[208],"corroborating":[212],"so-called":[214],"open-access":[215],"advantage:":[217],"accounting":[218],"age":[220],"discipline,":[222],"receive":[225],"18%":[226],"more":[227],"citations":[228],"than":[229],"average,":[230],"effect":[232],"primarily":[234],"Green":[236],"encourage":[241],"further":[242],"research":[243],"free":[246],"oaDOI":[247],"service,":[248],"as":[249],"way":[251],"inform":[253],"policy":[255],"practice.":[257]}

This is my current code for decoding the inverted index, but it only returns the truncated text shown after it:

    import requests

    abstractInvertedIndex = requests.get(
        'https://api.openalex.org/W2741809807'
    ).json()['abstract_inverted_index']

    arrayAbstractIndex = [[k, abstractInvertedIndex[k]] for k in abstractInvertedIndex]

    # Position of the word in the abstract
    wordPos = 0
    # The number position of the key value
    wordNum = 0
    abstract = ""

    for x in arrayAbstractIndex:
        if wordPos in arrayAbstractIndex[wordNum][1]:
            abstract = abstract + str(arrayAbstractIndex[wordNum][0] + ' ')
            wordPos = wordPos + 1
        wordNum = wordNum + 1

    print(abstract)

Despite growing interest in Open Access (OA) to scholarly literature, there is an unmet need for large-scale, up-to-date, and reproducible studies assessing the prevalence

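I know this happens because the word 'and' has multiple positions in the index, but I don't know how to set up a Python for loop that goes through every array item of every key and value in the dictionary so that the whole plain text is printed.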

Any suggestions?

  1. abstractInvertedIndex is a dictionary of word: [indices]. From this dictionary, first build a list of (word, index) pairs:

    word_index = []
    for k, v in abstractInvertedIndex.items():
        for index in v:
            word_index.append([k, index])

  2. Now sort the word_index list by index so the words are back in their original order:

    word_index = sorted(word_index, key=lambda x: x[1])

  3. Finally, join only the words from the word_index list with a space:

Despite growing interest in Open Access (OA) ...... as a way to inform OA policy and practice.
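
Putting the three steps together, a minimal sketch could look like the following. The function name rebuild_abstract and the small sample dictionary are made up for illustration and are not part of the OpenAlex API; the sketch also assumes every position appears exactly once in the index.

```python
def rebuild_abstract(inverted_index):
    """Turn a {word: [positions]} mapping back into plain text."""
    # Step 1: flatten into (position, word) pairs, one per occurrence.
    pairs = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    # Step 2: sort by position to restore the original word order.
    pairs.sort()
    # Step 3: keep only the words and join them with spaces.
    return " ".join(word for _, word in pairs)


# Hypothetical hand-made sample in the same shape as abstract_inverted_index.
sample = {"the": [0, 3], "cat": [1], "saw": [2], "dog": [4]}
print(rebuild_abstract(sample))  # the cat saw the dog
```

With the real data, the abstractInvertedIndex dictionary fetched in the question can be passed in directly: print(rebuild_abstract(abstractInvertedIndex)).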