Convert .sgm to .txt

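I have some files in .sgm format that I need to evaluate (apply a language model and get the perplexity of the text).

The main problem is that I need these files as plain text, i.e. in .txt format. I have searched the internet for an online converter or some kind of script that does this, but I could not find one.

Besides that, one of my teachers sent me this perl command: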

perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<\/seg>/i;' < file.sgm > file.txt

I have never used perl and, to be honest, I know nothing about it. I think I have perl installed:

$ perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2013, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

By the way, I am using Mac OS X.

Sample .sgm file:

<srcset setid="newsdiscusstest2015" srclang="any">
<doc sysid="ref" docid="39-Guardian" genre="newsdiscuss" origlang="en">
<p>
<seg id="1">This is perfectly illustrated by the UKIP numbties banning people with HIV.</seg>
<seg id="2">You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome.</seg>
<seg id="3">You raise a straw man and then knock it down with thinly veiled homophobia.</seg>

Output .txt file:

This is perfectly illustrated by the UKIP numbties banning people with HIV. You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome. You raise a straw man and then knock it down with thinly veiled homophobia.
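The mapping from the sample input to the sample output above can be sketched in Python using only the standard library. This is a minimal sketch, not the teacher's command: it assumes each `<seg>` element sits on a single line, as in the sample, and it assumes that entities such as `&amp;` should be decoded, which the question does not state. The names `SEG_RE` and `extract_segments` are made up for illustration.

```python
import html
import re

# One <seg ...>text</seg> element per line; capture the text without
# leading or trailing whitespace.
SEG_RE = re.compile(r"<seg\b[^>]*>\s*(.*\S)\s*</seg>", re.IGNORECASE)


def extract_segments(lines):
    """Return the text of every <seg> element found in an iterable of lines."""
    segments = []
    for line in lines:
        match = SEG_RE.search(line)
        if match:
            # Assumption: decode entities such as &amp; into plain characters.
            segments.append(html.unescape(match.group(1)))
    return segments


sample = [
    '<doc sysid="ref" docid="39-Guardian">\n',
    "<p>\n",
    '<seg id="1">First sentence.</seg>\n',
    '<seg id="2">Second sentence.</seg>\n',
]
# Join with spaces to get one line of text, as in the desired output.
print(" ".join(extract_segments(sample)))  # prints: First sentence. Second sentence.
```

To process a real file, pass an open file object instead of the sample list, for example `extract_segments(open("file.sgm", encoding="utf-8"))`, and write the joined result to `file.txt`.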

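OK, I found a solution:

Rename the file from "file.sgm" to "file.html". Then open the HTML file in a text editor and add the line <meta charset="utf-8"> at the very top, so that all characters are displayed correctly. Finally, open the file in a web browser and copy its contents into a new text file.

You can try this script to strip the SGML tags from a file: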

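#!/usr/bin/env perl
use strict;
use warnings;

use HTML::Parser;

# The input file is the first command-line argument.
my $file = $ARGV[0] or die "Usage: $0 file.sgm\n";

# Ignore every event by default and print only the text between tags.
HTML::Parser->new(default_h => [""],
    text_h => [ sub { print shift }, 'text' ]
  )->parse_file($file) or die "Failed to parse $file: $!";

Use it like this: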

./strip_sgml.pl file.sgm > file.txt
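For readers who would rather avoid Perl, roughly the same idea (ignore every tag, keep only the text) can be sketched with Python's standard `html.parser` module. This is an illustrative counterpart, not a port of the script above: the names `TextExtractor` and `strip_tags` are hypothetical, and unlike the Perl version it also decodes character references such as `&amp;`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the character data between tags and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns references like &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of an SGML/HTML-like string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = '<p>\n<seg id="1">Fish &amp; chips.</seg>\n</p>\n'
print(strip_tags(sample).strip())  # prints: Fish & chips.
```

Line breaks between tags are kept as they are, so the output may contain blank lines that need to be collapsed afterwards.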

For a Python solution, user Hugo's answer to "Python/BeautifulSoup - how to remove all tags from an element?" removes all tags from a document.

TL;DR: use the get_text() method from Beautiful Soup.