How to make an SRT file into a dataset?

Is it possible to convert an SRT file (the kind used for video subtitles) into a dataset?

When imported into Excel, the SRT file looks like this:

1
00:00:03,000 --> 00:00:04,000
OVERLAPS PURE COINCIDENCE THAT

...

This pattern continues for the rest of the "video"/transcript. I would like to format the SRT file like this:

number ; start ; end ; text

1 ; 00:00:03,000 ; 00:00:04,000 ; OVERLAPS PURE COINCIDENCE THAT

The VBA procedure below loads a standard .srt (SubRip movie subtitle file) from a local file and splits it into rows/columns on the active Excel worksheet.

Import SRT subtitles from a local file:

Sub importSRTfromFile(fName As String)
'Loads SRT from local file and converts to columns in Active Worksheet

    Dim sIn As String, sOut As String, sArr() As String, x As Long

    'load file
    Open fName For Input As #1
        While Not EOF(1)
            Line Input #1, sIn
            sOut = sOut & sIn & vbLf
        Wend
    Close #1

    'convert LFs to delimiters & split into array
    sOut = Replace(sOut, vbLf & vbLf, vbCr)
    sOut = Replace(Replace(sOut, vbLf, "|"), " --> ", "|")
    sArr = Split(sOut, vbCr)

    'check if activesheet is blank
    If ActiveSheet.UsedRange.Cells.Count > 1 Then
        If MsgBox(UBound(sArr) & " rows found." & vbLf & vbLf & _
            "Okay to clear worksheet '" & ActiveSheet.Name & "'?", _
            vbOKCancel, "Delete Existing Data?") <> vbOK Then Exit Sub
        ActiveSheet.Cells.ClearContents
    End If

    'breakout into rows
    For x = 1 To UBound(sArr)
        Range("A" & x) = sArr(x)
    Next x

    'split into columns
    Columns("A:A").TextToColumns Destination:=Range("A1"), _
        DataType:=xlDelimited, Other:=True, OtherChar:="|"

    MsgBox "Imported " & UBound(sArr) & " rows from:" & vbLf & fName

End Sub

Example usage:

Sub test_FileImport()
    importSRTfromFile "c:\yourPath\yourFilename.srt"
End Sub
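For readers working outside Excel, the same block-splitting idea can be sketched in Python. This is only an illustration, not part of the VBA answer: the function name `srt_to_rows` is made up here, the second sample entry is invented test data, and the sketch assumes well-formed blocks separated by blank lines. Multi-line subtitle text is joined with a space, which is an assumption rather than something the question specifies.

```python
import re


def srt_to_rows(srt_text):
    """Split SRT text into (number, start, end, text) tuples."""
    # Normalise Windows/Mac line endings so blank-line detection works.
    normalised = srt_text.replace("\r\n", "\n").replace("\r", "\n")
    rows = []
    # Subtitle blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalised.strip()):
        lines = block.split("\n")
        if len(lines) < 2 or " --> " not in lines[1]:
            continue  # skip anything that is not a complete subtitle block
        start, end = lines[1].split(" --> ", 1)
        # Assumption: multi-line subtitle text is joined with a space.
        text = " ".join(line.strip() for line in lines[2:])
        rows.append((lines[0].strip(), start.strip(), end.strip(), text))
    return rows


# First entry is the sample from the question; the second is invented.
sample = (
    "1\n00:00:03,000 --> 00:00:04,000\nOVERLAPS PURE COINCIDENCE THAT\n\n"
    "2\n00:00:04,000 --> 00:00:06,500\nSECOND LINE\nOF TEXT\n"
)
for row in srt_to_rows(sample):
    print(" ; ".join(row))
```

Each printed line follows the "number ; start ; end ; text" layout asked for in the question.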

Import SRT subtitles from a website URL:

Alternatively, you can import an .srt (or another similar text file) from a website URL, such as https://subtitle-index.org/ :

Sub importSRTfromWeb(url As String)
'Loads SRT from URL and converts to columns in Active Worksheet

    Dim sIn As String, sOut As String, sArr() As String, rw As Long
    Dim httpData() As Byte, XMLHTTP As Object

    'load file from URL
    Set XMLHTTP = CreateObject("MSXML2.XMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.send
    httpData = XMLHTTP.responseBody
    Set XMLHTTP = Nothing
    sOut = StrConv(httpData, vbUnicode)

    'normalize CRLF line endings, convert LFs to delimiters & split into array
    sOut = Replace(sOut, vbCrLf, vbLf)
    sOut = Replace(sOut, vbLf & vbLf, vbCr)
    sOut = Replace(Replace(sOut, vbLf, "|"), " --> ", "|")
    sArr = Split(sOut, vbCr)

    'check if activesheet is blank
    If ActiveSheet.UsedRange.Cells.Count > 1 Then
        If MsgBox(UBound(sArr) & " rows found." & vbLf & vbLf & _
            "Okay to clear worksheet '" & ActiveSheet.Name & "'?", _
            vbOKCancel, "Delete Existing Data?") <> vbOK Then Exit Sub
        ActiveSheet.Cells.ClearContents
    End If

    'breakout into rows
    For rw = 1 To UBound(sArr)
        Range("A" & rw) = sArr(rw)
    Next rw

    'split into columns
    Columns("A:A").TextToColumns Destination:=Range("A1"), _
        DataType:=xlDelimited, Other:=True, OtherChar:="|"
    MsgBox "Imported " & UBound(sArr) & " rows from:" & vbLf & url

End Sub

Example usage:

Sub testImport()
    importSRTfromWeb _
        "https://subtitle-index.org/download/4670541854528212663953859964/SRT/Pulp+Fiction"
End Sub

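Many websites offer .srt files for free; you may need to right-click the download button to copy the link (it may have an .srt extension, or it may be a pointer, as in the example above). The procedure will not work on .zip files.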
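Downloaded subtitle bytes often carry a byte-order mark and Windows line endings, both of which can upset blank-line splitting. The Python sketch below shows one way to clean such bytes before parsing. It is an illustration only: `decode_srt_bytes` is a hypothetical helper, and the fallback to Windows-1252 for non-UTF-8 files is an assumption, since the real encoding depends on where the file came from.

```python
def decode_srt_bytes(raw):
    """Decode downloaded subtitle bytes and normalise line endings."""
    try:
        # "utf-8-sig" also strips a leading byte-order mark if present.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: files that are not valid UTF-8 are Windows-1252.
        text = raw.decode("cp1252", errors="replace")
    # Convert CRLF and lone CR to LF so entries are separated by "\n\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")


raw = (
    b"\xef\xbb\xbf1\r\n00:00:03,000 --> 00:00:04,000\r\n"
    b"OVERLAPS PURE COINCIDENCE THAT\r\n\r\n"
)
print(repr(decode_srt_bytes(raw)))
```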


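More information:

On the "breakout into rows" loop in the code above: Split returns a zero-based array, so a loop written as

    'breakout into rows
    For rw = 1 To UBound(sArr)
        Range("A" & rw) = sArr(rw)
    Next rw

should be written as

    'breakout into rows
    For rw = 0 To UBound(sArr)
        Range("A" & rw + 1) = sArr(rw)
    Next rw

Otherwise sArr(0), the first subtitle, is skipped and the output starts from the second one.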

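I use Vim and wrote a quick set of regexes to convert an .srt to a .csv file for a translator friend who needed this kind of conversion. The csv file can then be opened in Excel / LibreOffice and saved as .xls, .ods or another format. My friend did not need the subtitle number in the first column, so the regex code looks like this: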

set fileencoding=utf-8
%s/"/""/g
g/^\d\+$/d
%s@^\(.*\) --> \(.*\)\n@"\1","\2","@g
%s/\n^$/"/g

A variant that keeps the subtitle number:

set fileencoding=utf-8
%s/"/""/g
%s@\(^\d\+\)$\n^\(.*\) --> \(.*\)\n@"\1","\2","\3","@g
%s/\n^$/"/g

Save this code to a text file with a .vim extension, then source that file while editing the .srt in Vim / Gvim. Save the result as a .csv. Enjoy the magic of regex!

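Note: my code uses commas as the field separator. Change the commas in the code above to semicolons if you want semicolons instead. I also added double quotes as string delimiters, in case double quotes or commas appear in the subtitle text. More error-proof!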
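The quoting rule described in the note (wrap every field in double quotes, double any embedded quote) is the same one Python's standard `csv` module applies with `QUOTE_ALL`. The sketch below is an illustration rather than a replacement for the Vim script: `rows_to_csv` is a made-up name and the sample subtitle text is invented to show a comma and quotes inside a field.

```python
import csv
import io


def rows_to_csv(rows, delimiter=","):
    """Render rows as CSV text with every field quoted."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_ALL,  # quote every field, like the Vim script
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample row containing both a comma and double quotes.
rows = [("00:00:03,000", "00:00:04,000", 'HE SAID "HELLO", THEN LEFT')]
print(rows_to_csv(rows), end="")
print(rows_to_csv(rows, delimiter=";"), end="")
```

Passing `delimiter=";"` gives the semicolon-separated variant without touching the quoting.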