Regular expression in Ruby - extracting from Gutenberg

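I'm new to Ruby and I'm struggling to use regular expressions to seed a database from this text file: http://www.gutenberg.org/cache/epub/673/pg673.txt.

I want the <h1> tags as the words of the dictionary database and the <def> tags as the definitions.

I could be way off base here (I've only ever seeded a db with copy and paste ;):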

require 'open-uri'  

Dictionary.delete_all  

g_text = open('http://www.gutenberg.org/cache/epub/673/pg673.txt')   

y = g_text.read(/<h1>(.*?)<\/h1>/)  
a = g_text.read(/<def>(.*?)<\/def>/)    

Dictionary.create!(:word => y, :definition => a)

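As you can see, there is often more than one <def> per <h1>, which is fine, since I can just add columns to my table for definition1, definition2, etc.

But what should this regular expression look like to make sure that each definition ends up in the same row as the <h1> tag immediately preceding it?

Thanks for any help!

Edit:

OK, so this is what I'm trying now: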

doc.scan(Regexp.union(/<h1>(.*?)<\/h1>/, /<def>(.*?)<\/def>/)).map do |m, n|
  p [m,n]
end

How do I get rid of all the nil entries?
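For what it's worth, here is a minimal sketch of where the nils come from and two ways of dealing with them. The sample text is a made-up stand-in for the Gutenberg file, and the variable names (`pairs`, `flat`, `grouped`, `current`) are only illustrative. Each match of the union fills the capture group of whichever alternative matched and leaves the other one nil, so the nil can either be dropped or used to tell a headword from a definition.

```ruby
# Hypothetical sample standing in for the Gutenberg text.
doc = <<~TEXT
  <h1>Aband</h1>
  <def>To abandon.</def>
  <def>To banish; to expel.</def>
  <h1>Abalone</h1>
  <def>A univalve mollusk.</def>
TEXT

pattern = Regexp.union(/<h1>(.*?)<\/h1>/, /<def>(.*?)<\/def>/)

# scan returns one [h1_capture, def_capture] pair per match; the group of
# the alternative that did not match is nil.
pairs = doc.scan(pattern)

# Option 1: drop the nils from each pair, then flatten into one list.
flat = pairs.map(&:compact).flatten

# Option 2: use the nil to see which tag matched and group as we go.
grouped = {}
current = nil
pairs.each do |word, definition|
  if word
    current = word            # a new headword starts a new group
    grouped[current] ||= []
  elsif current
    grouped[current] << definition # attach the def to the preceding h1
  end
end

p flat
p grouped
```

Option 2 also keeps each definition tied to the <h1> that precedes it, which Option 1 does not.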

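It seems like regex is the only way to get through the whole document without stopping partway when it hits an error... at least after a few attempts with other parsers.

What I came to (using a local extract for sandboxing):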

require 'pp' # For SO, to pretty-print the hash at the end

h1regex = /h1>(.+)<\/h1/i    # the h1 regex (.+ skips empty tags)
defregex = /def>(.+)<\/def/i # the def regex (.+ skips empty tags)
# Initialize vars
defhash = {}
key = nil
last = nil

File.open("./gut.txt") do |f|
  f.each_line do |l|
    newkey = l[h1regex, 1] # get the next key (nil if this line has no h1)
    if newkey && newkey != last # key changed (some h1 entries are repeated with other defs)
      key = last = newkey # update the current key
      defhash[key] = []   # init the new entry to an empty array
    end
    definition = l[defregex, 1] # nil if this line has no def
    # we matched a def: add it to the current key's array
    # (skipped if no h1 has been seen yet, to avoid indexing with a nil key)
    defhash[key] << definition if definition && key
  end
end

pp defhash # print the result

Which gives the following output:

{"A"=>
  [" The first letter of the English and of many other alphabets. The capital A of the alphabets of Middle and Western Europe, as also the small letter (a), besides the forms in Italic, black letter, etc., are all descended from the old Latin A, which was borrowed from the Greek <spn>Alpha</spn>, of the same form; and this was made from the first letter (<i>Aleph</i>, and itself from the Egyptian origin. The <i>Aleph</i> was a consonant letter, with a guttural breath sound that was not an element of Greek articulation; and the Greeks took it to represent their vowel <i>Alpha</i> with the \'84 sound, the Ph\'d2nician alphabet having no vowel symbols.",
   "The name of the sixth tone in the model major scale (that in C), or the first tone of the minor scale, which is named after it the scale in A minor. The second string of the violin is tuned to the A in the treble staff. -- A sharp (A#) is the name of a musical tone intermediate between A and B. -- A flat (A&flat;) is the name of a tone intermediate between A and G.",
   "In each; to or for each; <as>as, \"twenty leagues <ex>a</ex> day\", \"a hundred pounds <ex>a</ex> year\", \"a dollar <ex>a</ex> yard\", etc.</as>",
   "In; on; at; by.",
   "In process of; in the act of; into; to; -- used with verbal substantives in <i>-ing</i> which begin with a consonant. This is a shortened form of the preposition <i>an</i> (which was used before the vowel sound); as in <i>a</i> hunting, <i>a</i> building, <i>a</i> begging. \"Jacob, when he was <i>a</i> dying\" <i>Heb. xi. 21</i>.  \"We'll <i>a</i> birding together.\" \" It was <i>a</i> doing.\" <i>Shak.</i>  \"He burst out <i>a</i> laughing.\" <i>Macaulay</i>.  The hyphen may be used to connect <i>a</i> with the verbal substantive (as, <i>a</i>-hunting, <i>a</i>-building) or the words may be written separately. This form of expression is now for the most part obsolete, the <i>a</i> being omitted and the verbal substantive treated as a participle.",
   "Of.",
   " A barbarous corruption of <i>have</i>, of <i>he</i>, and sometimes of <i>it</i> and of <i>they</i>."],
 "Abalone"=>
  ["A univalve mollusk of the genus <spn>Haliotis</spn>. The shell is lined with mother-of-pearl, and used for ornamental purposes; the sea-ear. Several large species are found on the coast of California, clinging closely to the rocks."],
 "Aband"=>["To abandon.", "To banish; to expel."],
 "Abandon"=>
  ["To cast or drive out; to banish; to expel; to reject.",
   "To give up absolutely; to forsake entirely ; to renounce utterly; to relinquish all connection with or concern on; to desert, as a person to whom one owes allegiance or fidelity; to quit; to surrender.",
   "Reflexively : To give (one's self) up without attempt at self-control ; to yield (one's self) unrestrainedly ; -- often in a bad sense.",
   "To relinquish all claim to; -- used when an insured person gives up to underwriters all claim to the property covered by a policy, which may remain after loss or damage by a peril insured against."]}

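Hope this helps.

Late edit: there is probably a better way to do this, I'm not a Ruby expert. I was only giving the usual advice while reviewing, but since nobody has answered, this is how I would do it.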
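To get from this hash to database rows, one option is to flatten it into one row per (word, definition) pair instead of adding definition1, definition2, ... columns. The following is only a sketch under that assumption: the small `defhash` is a hand-copied excerpt of the output above, `rows` is an illustrative name, and the rows are printed rather than saved so the snippet runs without Rails.

```ruby
# Hypothetical excerpt of the hash built above.
defhash = {
  "Aband"   => ["To abandon.", "To banish; to expel."],
  "Abalone" => ["A univalve mollusk."]
}

# Build one attribute hash per (word, definition) pair.
rows = defhash.flat_map do |word, definitions|
  definitions.map { |definition| { word: word, definition: definition } }
end

rows.each { |row| puts "#{row[:word]}: #{row[:definition]}" }
# Assumption: with the Dictionary model from the question, each row could
# be passed to Dictionary.create!(row) instead of being printed.
```

One row per definition avoids having to guess the maximum number of definitions a word can have, at the cost of repeating the word in several rows.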