Regular expression to extract paragraphs from the page below
I have text that was extracted from a PDF using iText and stored in a String variable:
(1) A a, — al'-fah; of Hebrew origin; the first letter of the alphabet;
figurative only (from its use as a numeral) the first: — Alpha.
Often used (usually ajn an, before a vowel) also in composition
(as a contraction from (427) (a]neu,)) in the sense of privation;
so in many words beginning with this letter; occasionally in the
sense of union (as a contraction of (260) (a[ma)).
(2) ÆAarw>n, — ah-ar-ohn'; of Hebrew origin [Hebrew {175}
('Aharown)]; Aaron, the brother of Moses: — Aaron.
(3) ÆAbaddw>n, — ab-ad-dohn'; of Hebrew origin [Hebrew {11}
('abaddown)]; a destroying angel: — Abaddon.
(4) ajbarh>v, — ab-ar-ace'; from (1) (a) (as a negative particle) and (922)
(ba>rov); weightless, i.e. (figurative) not burdensome: — from
being burdensome.
(5) ÆAbba~, — ab-bah'; of Chaldee origin [Hebrew {2} ('ab (Chaldee))];
father (as a vocative): — Abba.
(6) &Abel, — ab'-el; of Hebrew origin [Hebrew {1893} (Hebel)]; Abel,
the son of Adam: — Abel.
(7) ÆAbia>, — ab-ee-ah'; of Hebrew origin [Hebrew {29} ('Abiyah)];
Abijah, the name of two Israelites: — Abia.
(8) ÆAbia>qar, — ab-ee-ath'-ar; of Hebrew origin [Hebrew {54}
('Ebyathar)]; Abiathar, an Israelite: — Abiathar.
(9) ÆAbilhnh>, — ab-ee-lay-nay'; of foreign origin [compare Hebrew {58}
('abel)]; Abilene, a region of Syria: — Abilene.
(10) ÆAbiou>d, — ab-ee-ood'; of Hebrew origin [Hebrew {31}
('Abiyhuwd)]; Abihud, an Israelite: — Abiud.
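The paragraphs in the string start with a number in parentheses, ([0-9]), such as (9) or (5). I want to use pagestring.split("regex") to extract each paragraph that starts with this character sequence. Can anyone help?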
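This avoids splitting on a "(999)" embedded in the text. It relies on the assumption that a line ending comes immediately before each parenthesised number that marks the start of a paragraph. Also note that the sample text produces an empty "paragraph", because there is no text before the first parenthesised number; hence the if statement.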
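String text = ...;
// Split on a parenthesised number that directly follows the start of the input or a newline.
// The backslashes are doubled because the pattern is written as a Java string literal.
String[] paras = text.split( "(?<=(^|\n))\\(\\d+\\)" );
for( String para: paras ){
    // Skip the empty string produced before the first marker.
    if( para.length() > 0 ){
        System.out.println( "Para: " + para );
    }
}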
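Note that the split above consumes the "(N)" marker, so the numbers are not part of the resulting paragraphs. If the markers should stay at the start of each paragraph, a zero-width lookahead is one option. The following is only a minimal sketch under the same assumption as above (a marker starts a paragraph only when it sits at the start of a line); the class name ParagraphSplitter, the method name splitParagraphs, and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ParagraphSplitter {
    // (?m) lets ^ match at the start of every line. The lookahead (?=\(\d+\))
    // requires a parenthesised number at that position without consuming it,
    // so the marker stays attached to the paragraph that follows.
    private static final Pattern PARAGRAPH_START =
            Pattern.compile("(?m)^(?=\\(\\d+\\))");

    public static List<String> splitParagraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String para : PARAGRAPH_START.split(text)) {
            String trimmed = para.trim();
            // Drop any empty piece (for example, text consisting only of whitespace).
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample input shaped like the extracted page.
        String text = "(1) first entry, see (427) (a]neu,)\n"
                + "continued line with (260) inside.\n"
                + "(2) second entry\n"
                + "(10) tenth entry\n";
        for (String para : splitParagraphs(text)) {
            System.out.println("Para: " + para);
        }
    }
}
```

With this approach the embedded references such as (427) and (260) are left alone as long as they do not happen to fall at the start of a wrapped line; if they can, a stricter rule (for example, checking that the numbers increase) would be needed.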
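You can use the regular expression [\n|.]\([0-9]{1,2}\) with the split method. It extracts all the paragraphs from your text, covering marker numbers from 0 to 99. In a Java string literal the backslashes must be doubled:
String[] parts = st.split("[\n|.]\\([0-9]{1,2}\\)");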
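[\n|.]
: restricts matches to markers that begin a new paragraph and ignores any (n) inside the paragraph text. Inside a character class, | and . are literal characters, so this matches a newline, a pipe, or a period.
\([0-9]{1,2}\)
: matches a group of one or two digits inside ().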
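To make the effect of this pattern concrete, here is a small sketch; the class name CharClassDemo, the method splitOnSeparator, and the sample input are invented for illustration. It shows that the split consumes both the preceding character and the marker, and that a marker at the very start of the input is not matched because no character precedes it.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Same pattern as in the answer: one of newline, '|' or '.', followed by
    // a one- or two-digit number in parentheses.
    private static final Pattern SEPARATOR =
            Pattern.compile("[\n|.]\\([0-9]{1,2}\\)");

    public static String[] splitOnSeparator(String text) {
        return SEPARATOR.split(text);
    }

    public static void main(String[] args) {
        // Hypothetical input: markers after a newline, a period, a pipe and a
        // space, plus a three-digit reference on its own line.
        String text = "(1) a\n(2) b.(3) c|(4) d (5) e\n(123) f";
        for (String part : splitOnSeparator(text)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

For this input the split yields four pieces: the leading "(1)" stays in the first piece, the markers (2), (3) and (4) are removed together with the character before them, and neither " (5)" nor the three-digit "(123)" acts as a separator.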
Here is the working DEMO, which produces an array containing all the paragraphs.
For more information on using regular expressions, see Java Regex Pattern.