How do I parse a TMX file (XML file for translation data) in Logstash?
I am using a TMX file (an XML file of translation data) in Logstash as the source of the data I index into Elasticsearch.
A sample TMX file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="ModernMT - modernmt.eu" creationtoolversion="1.0" datatype="plaintext" o-tmf="ModernMT" segtype="sentence" adminlang="en-us" srclang="en-GB"/>
  <body>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20121019T114713Z">
      <tuv xml:lang="en-GB">
        <seg>The purpose of the standard is to establish and define the requirements for the provision of quality services by translation service providers.</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>L'obiettivo dello standard è stabilire e definire i requisiti affinché i fornitori di servizi di traduzione garantiscano servizi di qualità.</seg>
      </tuv>
    </tu>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z">
      <tuv xml:lang="en-GB">
        <seg>With 1,800 experienced and qualified resources translating regularly into over 200 language combinations, you can count on us for high quality professional translation services.</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>Abbiamo 1.800 professionisti esperti e qualificati che traducono regolarmente in oltre 200 combinazioni linguistiche; perciò, se cercate la qualità, potete contare su di noi.</seg>
      </tuv>
    </tu>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z">
      <tuv xml:lang="en-GB">
        <seg>Access our section of useful links</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>Da qui potrete accedere a una sezione che propone link a siti che possono essere di vostro interesse</seg>
      </tuv>
    </tu>
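What I need to do here is treat each <tu> block as one event, with its two <tuv> blocks used as data fields. The data stored in the first <tuv> block should be indexed in ES as the source-language field, and the data stored in the second <tuv> block as the target-language field.
A single TMX document can contain more than 10,000 <tuv> blocks.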
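To make the intended result concrete, here is a minimal Python sketch of the mapping described above: one record per <tu>, with the first <tuv> as the source and the second as the target. This is not Logstash code. The function name `tu_pairs`, the shortened sample, and the field names `source` and `target` (borrowed from the index template) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the TMX file in the question.
# The XML declaration is left out so the string can be parsed directly.
TMX = """<tmx version="1.4">
<body>
<tu srclang="en-GB" datatype="plaintext">
<tuv xml:lang="en-GB"><seg>Access our section of useful links</seg></tuv>
<tuv xml:lang="it"><seg>Da qui potrete accedere a una sezione</seg></tuv>
</tu>
<tu srclang="en-GB" datatype="plaintext">
<tuv xml:lang="en-GB"><seg>Hello</seg></tuv>
<tuv xml:lang="it"><seg>Ciao</seg></tuv>
</tu>
</body>
</tmx>"""


def tu_pairs(tmx_text):
    """Yield one dict per <tu>: first <tuv> is the source, second the target."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        # Collect the <seg> text of each <tuv>, in document order.
        segs = [tuv.findtext("seg", default="") for tuv in tu.findall("tuv")]
        if len(segs) >= 2:
            yield {"source": segs[0], "target": segs[1]}


pairs = list(tu_pairs(TMX))
for pair in pairs:
    print(pair)
```

Each dict corresponds to what one Logstash event would need to carry for the index template to be populated.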
I am having trouble with the xml filter. My configuration currently looks like this:
input {
  file {
    path => "/en-gb_pt-pt/81384/81384.xml"
    start_position => "beginning"
    codec => multiline {
      pattern => "<tu>"
      negate => "true"
      what => "previous"
    }
  }
}
filter {
  xml {
    source => "message"
    target => "xml_content"
    xpath => [ "//seg", "seg" ]
  }
}
output {
  stdout {
    #codec => json
    codec => rubydebug
  }
}
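For readers checking the multiline settings, the following plain-Python sketch (not Logstash code) imitates how `negate => true` with `what => "previous"` is documented to group lines: a line that does not match the pattern is appended to the previous event, so only a matching line starts a new event. It assumes the pattern is applied as an unanchored regular-expression search. The function name and the alternative pattern `<tu[ >]` are my own assumptions, not something taken from the question.

```python
import re


def group_negate_previous(lines, pattern):
    """Imitate multiline grouping with negate=true, what=previous.

    A line that matches `pattern` starts a new event; every other
    line is appended to the event currently being built.
    """
    regex = re.compile(pattern)
    events, current = [], []
    for line in lines:
        if regex.search(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events


# Shortened lines in the same shape as the TMX body in the question.
LINES = [
    "<body>",
    '<tu srclang="en-GB" datatype="plaintext">',
    '<tuv xml:lang="en-GB">',
    "<seg>first source</seg>",
    "</tuv>",
    "</tu>",
    '<tu srclang="en-GB" datatype="plaintext">',
    '<tuv xml:lang="en-GB">',
    "<seg>second source</seg>",
    "</tuv>",
    "</tu>",
]

# The literal text "<tu>" never occurs, because every opening tag carries attributes.
events_literal = group_negate_previous(LINES, "<tu>")
# Hypothetical alternative: "<tu" followed by a space or ">", which skips "<tuv".
events_alternative = group_negate_previous(LINES, "<tu[ >]")

print(len(events_literal))      # 1
print(len(events_alternative))  # 3
```

If the real codec behaves like this sketch, the pattern `"<tu>"` would leave the whole file in a single event, while a pattern that also matches `<tu srclang=...>` would produce one event per <tu> block (plus one for the lines before the first <tu>).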
This is part of my index template:
"segment": {
  "_parent": {
    "type": "tm"
  },
  "_routing": {
    "required": "true"
  },
  "properties": {
    "@timestamp": {
      "type": "date",
      "format": "strict_date_optional_time||epoch_millis"
    },
    "@version": {
      "type": "string"
    },
    "source": {
      "type": "string",
      "store": "true",
      "fields": {
        "length": {
          "type": "token_count",
          "analyzer": "standard"
        }
      }
    },
    "target": {
      "type": "string",
      "store": "true",
      "fields": {
        "length": {
          "type": "token_count",
          "analyzer": "standard"
        }
      }
    }
  }
}
I suggest a simple approach using the grok or dissect filter.
filter {
  dissect {
    mapping => { "message" => "%{}<seg>%{src}</seg>%{}<seg>%{trg}</seg>%{}" }
  }
  mutate {
    remove_field => ["message"]
  }
}
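To show what that mapping pulls out of one event, here is a rough Python equivalent (not Logstash code). The empty `%{}` fields skip whatever sits between the delimiters, and the literal `<seg>` and `</seg>` strings act as the delimiters; the regular expression below is my own approximation of that behaviour, and the name `extract_pair` is hypothetical.

```python
import re

# Approximation of "%{}<seg>%{src}</seg>%{}<seg>%{trg}</seg>%{}":
# take the text of the first two <seg> elements, skipping everything else.
SEG_PAIR = re.compile(
    r"<seg>(?P<src>.*?)</seg>.*?<seg>(?P<trg>.*?)</seg>",
    re.DOTALL,  # one event spans several lines
)


def extract_pair(message):
    """Return {'src': ..., 'trg': ...} for one <tu> event, or None if no pair is found."""
    match = SEG_PAIR.search(message)
    return match.groupdict() if match else None


# One multiline event, shaped like a single <tu> block from the question.
EVENT = """<tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z">
<tuv xml:lang="en-GB">
<seg>Access our section of useful links</seg>
</tuv>
<tuv xml:lang="it">
<seg>Da qui potrete accedere a una sezione che propone link a siti che possono essere di vostro interesse</seg>
</tuv>
</tu>"""

result = extract_pair(EVENT)
print(result)
```

Because this is plain text splitting rather than XML parsing, character entities such as `&amp;` inside a <seg> would come through undecoded, which is worth keeping in mind when choosing between this and the xml filter.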
You get:
{
          "path" => "/en-gb_pt-pt/81384/81384.xml",
    "@timestamp" => 2017-08-25T15:07:34.567Z,
           "src" => "The purpose of the standard is to establish and define the requirements for the provision of quality services by translation service providers.",
      "@version" => "1",
          "host" => "my_host",
           "trg" => "L'obiettivo dello standard è stabilire e definire i requisiti affinché i fornitori di servizi di traduzione garantiscano servizi di qualità.",
          "tags" => [
        [0] "multiline"
    ]
}