Problems parsing XML/XLIFF with inline elements
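I'm trying to parse an XLIFF (XML) variant produced by the SDL Trados translation software. The "sdlxliff" file I'm parsing, which contains the translations, looks like this (somewhat simplified and "prettified").
The XML/XLIFF file being processed ("sample.sdlxliff"):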
<?xml version="1.0" encoding="utf-8"?><xliff xmlns:sdl="http://sdl.com/FileTypes/SdlXliff/1.0" xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2" sdl:version="1.0"><file original="\TRADOS_SERVER\Trados17\Doc_Helps\en-US\import\Test.xml" datatype="x-sdlfilterframework2" source-language="en-US" target-language="hr-HR"><header><sniff-info><detected-encoding detection-level="Certain" encoding="utf-8"/><detected-source-lang detection-level="Guess" lang="en-US"/><props><value key="xmlDeclaration">true</value><value key="standalone">yes</value><value key="HasUtf8Bom">false</value><value key="IsFragment">false</value></props></sniff-info></header>
<body>
<trans-unit id="a1f4768e-a026-46c2-b65d-599d2108d176">
<source>
<g id="461">Add or edit text: </g>Just begin typing. The blinking insertion point indicates where your text starts. To edit text, <g id="462">select the text</g>, then type. Use the controls in the Format <g id="463"> <g id="464"/></g> sidebar on the right.
</source>
<seg-source>
<g id="461">
<mrk mtype="seg" mid="182">Add or edit text:</mrk> </g>
<mrk mtype="seg" mid="183">Just begin typing.</mrk>
<mrk mtype="seg" mid="184">The blinking insertion point indicates where your text starts.</mrk>
<mrk mtype="seg" mid="185">To edit text, <g id="462">select the text</g>, then type.</mrk>
<mrk mtype="seg" mid="186">Use the controls in the Format <g id="463"><g id="464"/></g> sidebar on the right.</mrk>
</seg-source>
<target>
<g id="461">
<mrk mtype="seg" mid="182">Dodajte ili uredite tekst:</mrk> </g>
<mrk mtype="seg" mid="183">Samo počnite tipkati.</mrk>
<mrk mtype="seg" mid="184">Trepereća točka umetanja pokazuje gdje počinje vaš tekst.</mrk>
<mrk mtype="seg" mid="185">Za uređivanje teksta <g id="462">odaberite tekst</g>, zatim unesite tekst.</mrk>
<mrk mtype="seg" mid="186">Upotrijebite kontrole u rubnom stupcu Formatiraj <g id="463"><g id="464"/></g> s desne strane.</mrk>
</target>
<blahblahblah></blahblahblah>
</trans-unit>
<trans-unit id="7f7ede5e-75b9-403a-b1c6-43f654ea8245">
<source>
<g id="492"><g id="493">The toolbar with buttons.</g></g>
</source>
<seg-source>
<g id="492">
<g id="493">
<mrk mtype="seg" mid="199">The toolbar with buttons.</mrk></g></g>
</seg-source>
<target>
<g id="492">
<g id="493">
<mrk mtype="seg" mid="199">Alatna traka sa tipkama.</mrk></g></g>
</target>
<blahblahblah></blahblahblah>
</trans-unit>
</body>
</file></xliff>
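So the XML/XLIFF file has "seg-source" and "target" sections. Those are the parts I'm interested in: I want to extract them and later print them to a tab-separated plain TXT file, or something similar.
However, I'm having trouble with inline tags, as in this line:
<mrk mtype="seg" mid="185">To edit text, <g id="462">select the text</g>, then type.</mrk>
-> I only get the part of the string before the first inline '<g id="xxx">' tag :(
Instead of "To edit text, select the text, then type.", I only get "To edit text, ".
The Python code I tried: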
# parsesdlxliff-test.py:
from lxml import etree

tree = etree.parse("sample.sdlxliff")
root = tree.getroot()

for element in root:
    pass  # not important
    # now the children
    for all_tags in element.findall('.//'):
        if 'mrk' in all_tags.tag:
            attrs = all_tags.attrib
            numb = attrs.get("mid")
            # remove all internal tags within 'mrk', leave only clean string/text? - how?
            print(numb, all_tags.text)
The result I get with this code:
182 Add or edit text:
183 Just begin typing.
184 The blinking insertion point indicates where your text starts.
185 To edit text,
186 Use the controls in the Format
182 Dodajte ili uredite tekst:
183 Samo počnite tipkati.
184 Trepereća točka umetanja pokazuje gdje počinje vaš tekst.
185 Za uređivanje teksta
186 Upotrijebite kontrole u rubnom stupcu Formatiraj
199 The toolbar with buttons.
199 Alatna traka sa tipkama.
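As you can see from result lines 185 and 186 (the 'mid' numbers), the text after the first inline tag is missing (in both 'seg-source' and 'target').
What I ultimately want to get is something like this (for illustration only):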
Add or edit text: <TAB> Dodajte ili uredite tekst:
To edit text, select the text, then type. <TAB> Za uređivanje teksta odaberite tekst, zatim unesite tekst.
Use the controls in the Format sidebar on the right. <TAB> Upotrijebite kontrole u rubnom stupcu Formatiraj s desne strane.
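That is, tab-separated source-target sentence pairs.
I can pair them up later using the 'mid' numbers, but only once I manage to get the whole string (by getting rid of the inner tags somehow?)...
In short, how do I get/extract the whole string, including the parts after any '<gxxx>' or '</g>' inner tags?
If I understand you correctly, something like this should work: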
import lxml.html as lh  # while an xml parser would be more appropriate, in this case it's cleaner to use an html parser

diff = """[your xml above]"""
# pass bytes: the string starts with an XML declaration that names an encoding
doc = lh.fromstring(diff.encode('utf-8'))

engs = []
cros = []
eng = doc.xpath('//seg-source//mrk')
cro = doc.xpath('//target//mrk')

# text_content() returns the text of an element together with the text of all its children
for e in eng:
    engs.append(e.text_content())
for c in cro:
    cros.append(c.text_content())

for eng, cro in zip(engs, cros):
    print(eng, '<tab>', cro)
Output:
Add or edit text: <tab> Dodajte ili uredite tekst:
Just begin typing. <tab> Samo počnite tipkati.
The blinking insertion point indicates where your text starts. <tab> Trepereća točka umetanja pokazuje gdje počinje vaš tekst.
To edit text, select the text, then type. <tab> Za uređivanje teksta odaberite tekst, zatim unesite tekst.
Use the controls in the Format sidebar on the right. <tab> Upotrijebite kontrole u rubnom stupcu Formatiraj s desne strane.
The toolbar with buttons. <tab> Alatna traka sa tipkama.
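As for why the original code stops at "To edit text, ": in the ElementTree model (which lxml follows), `.text` only holds the text up to the first child element, and whatever comes after a child's closing tag is stored in that child's `.tail`. If you would rather stay with an XML parser, here is a minimal sketch of the same idea using `itertext()` and pairing segments by 'mid' within each trans-unit. It uses the standard library's `xml.etree.ElementTree`; the calls used here are also available in `lxml.etree`. The names `NS`, `SAMPLE`, `segments` and `tab_separated_pairs` are made up for this example, the sample is a trimmed-down stand-in for the real file, and collapsing runs of whitespace is my own assumption about what the output should look like.

```python
import xml.etree.ElementTree as ET

# XLIFF 1.2 default namespace, copied from the xmlns attribute of the sample file
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Trimmed-down stand-in for sample.sdlxliff (hypothetical, two segments only)
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
<file source-language="en-US" target-language="hr-HR">
<body>
<trans-unit id="a1f4768e-a026-46c2-b65d-599d2108d176">
<seg-source>
<mrk mtype="seg" mid="185">To edit text, <g id="462">select the text</g>, then type.</mrk>
<mrk mtype="seg" mid="186">Use the controls in the Format <g id="463"><g id="464"/></g> sidebar on the right.</mrk>
</seg-source>
<target>
<mrk mtype="seg" mid="185">Za uređivanje teksta <g id="462">odaberite tekst</g>, zatim unesite tekst.</mrk>
<mrk mtype="seg" mid="186">Upotrijebite kontrole u rubnom stupcu Formatiraj <g id="463"><g id="464"/></g> s desne strane.</mrk>
</target>
</trans-unit>
</body>
</file>
</xliff>"""


def segments(container):
    """Map each mrk 'mid' to its full text with inline tags dropped."""
    result = {}
    if container is None:  # trans-unit without this section
        return result
    for mrk in container.iter("{%s}mrk" % NS["x"]):
        # itertext() yields mrk.text plus the text and tail of every descendant
        text = "".join(mrk.itertext())
        # assumption: collapse the double space left behind by empty <g/> tags
        result[mrk.get("mid")] = " ".join(text.split())
    return result


def tab_separated_pairs(xml_bytes):
    """Return one 'source<TAB>target' line per segment, paired by 'mid'."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for unit in root.iter("{%s}trans-unit" % NS["x"]):
        source = segments(unit.find("x:seg-source", NS))
        target = segments(unit.find("x:target", NS))
        for mid, text in source.items():
            lines.append(text + "\t" + target.get(mid, ""))
    return lines


# Where the "missing" text lives: in the tail of the inline <g> element
root = ET.fromstring(SAMPLE.encode("utf-8"))
mrk = next(root.iter("{%s}mrk" % NS["x"]))
print(repr(mrk.text))     # 'To edit text, '
print(repr(mrk[0].text))  # 'select the text'
print(repr(mrk[0].tail))  # ', then type.'

for line in tab_separated_pairs(SAMPLE.encode("utf-8")):
    print(line)
```

Pairing inside each trans-unit rather than zipping two flat lists means a segment with no matching target simply gets an empty second column instead of shifting every pair after it. To write a file, something like `"\n".join(lines)` passed to a file opened with `encoding="utf-8"` should do.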