What is a safe way to extract python code blocks from docx files and run them in a sandbox?

I have around 6000~6500 Microsoft Word .docx files containing assorted, variously formatted answer scripts, in this sequence:

Python Programming Question in Bold

Answer in form of complete, correctly-indented, single-spaced, self-sufficient code

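Unfortunately, there seems to be no fixed pattern that separates the code blocks from the normal text. Some examples from the first 50 or so files:

  1. The entire question is in bold, after which the code begins abruptly, in bold/italics

  2. The question is put in comments, after which the code continues

  3. The question is missing entirely, and only a numbered list indicates where the code starts

  4. The question is missing entirely, and C/Python style comments indicate where the code starts

and so on.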

Currently, I am extracting the entire unformatted text through python-docx like this:

from docx import Document

doc = Document(infil)

# paragraph.text is already a Unicode str in Python 3, so it is not encoded by hand;
# the file is written as UTF-8 below.
# convert() is a clean-up helper that is not shown in this post; it is assumed
# here to take a str and return a str.
new_paragraphs = [convert(paragraph.text) for paragraph in doc.paragraphs]

with open(outfil, 'w', encoding='utf-8') as f:
    print('\n'.join(new_paragraphs), file=f)

Once extracted, I will run them using the PyPy Sandboxing feature, which I understand is safe, and then assign points as if in a contest.
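To make the contest-style scoring concrete, here is a minimal sketch of what I have in mind. It is only an illustration: `score_program` and its `interpreter` parameter are hypothetical names, and a plain subprocess with a timeout is NOT a sandbox. The `interpreter` argument merely marks the place where the sandboxed interpreter command would be substituted.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def score_program(source, expected_stdout, timeout_s=5.0,
                  interpreter=(sys.executable,)):
    """Return 1 if the program prints the expected output, else 0.

    WARNING: a plain subprocess is not a sandbox. `interpreter` is a
    hypothetical hook where a sandboxed interpreter command would go.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "answer.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [*interpreter, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and overly slow answers score nothing.
            return 0
    if result.returncode != 0:
        # The program crashed or exited with an error status.
        return 0
    # Compare output while ignoring leading/trailing whitespace.
    return int(result.stdout.strip() == expected_stdout.strip())
```

For example, `score_program("print(6 * 7)", "42")` awards one point, while a program that crashes or loops forever gets zero.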

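What I am completely stuck on is how to detect the start and end of the code programmatically. Most language-detection APIs are of no use, since I already know the language. This question: How to detect source code in a text? suggests using linters and syntax highlighters like the Google Code Prettifier, but these do not solve the problem of detecting separate programs.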

A suitable solution, from this programmers.se question, seems to be training Markov chains, but I wanted some second opinions before starting on such a vast project.

This extraction code will also be provided to all students after the evaluation.

I apologize if the question is too broad or the answer is too obvious.

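Hmm, are you looking for some kind of formatting pattern? That sounds a little odd to me. Is there any kind of text or string pattern that you can take advantage of? I'm not sure whether this will help, but the VBA script below, run from Excel, searches through all Word documents in a folder and puts "Yes" in every column whose search term, as you specify it in Row 1, is found in the document. It also puts a hyperlink in ColA, so you can click the link and open the file rather than hunting around for it. Here is a screenshot.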

The script:

Sub OpenAndReadWordDoc()

    Rows("2:1000000").Select
    Range(Selection, Selection.End(xlDown)).Select
    Selection.ClearContents
    Range("A1").Select

    ' assumes that the previous procedure has been executed
    Dim oWordApp As Word.Application
    Dim oWordDoc As Word.Document
    Dim blnStart As Boolean
    Dim r As Long
    Dim sFolder As String
    Dim strFilePattern As String
    Dim strFileName As String
    Dim sFileName As String
    Dim ws As Worksheet
    Dim c As Long
    Dim n As Long

    '~~> Establish an Word application object
    On Error Resume Next
    Set oWordApp = GetObject(, "Word.Application")
    If Err() Then
        Set oWordApp = CreateObject("Word.Application")
        ' We started Word for this macro
        blnStart = True
    End If
    On Error GoTo ErrHandler

    Set ws = ActiveSheet
    r = 1 ' startrow for the copied text from the Word document
    ' Last column
    n = ws.Range("A1").End(xlToRight).Column

    sFolder = "C:\Users\your_path_here\"

    '~~> This is the extension you want to go in for
    strFilePattern = "*.doc*"
    '~~> Loop through the folder to get the word files
    strFileName = Dir(sFolder & strFilePattern)
    Do Until strFileName = ""
        sFileName = sFolder & strFileName

        '~~> Open the word doc
        Set oWordDoc = oWordApp.Documents.Open(sFileName)
        ' Increase row number
        r = r + 1
        ' Enter file name in column A
        ws.Cells(r, 1).Value = sFileName

        ActiveCell.Offset(1, 0).Select
        ActiveSheet.Hyperlinks.Add Anchor:=Sheets("Sheet1").Range("A" & r), Address:=sFileName, _
            SubAddress:="A" & r, TextToDisplay:=sFileName

        ' Loop through the columns
        For c = 2 To n
            If oWordDoc.Content.Find.Execute(FindText:=Trim(ws.Cells(1, c).Value), _
                    MatchWholeWord:=True, MatchCase:=False) Then
                ' If text found, enter Yes in column number c
                ws.Cells(r, c).Value = "Yes"
            End If
        Next c
        oWordDoc.Close SaveChanges:=False

        '~~> Find next file
        strFileName = Dir()
    Loop

ExitHandler:
    On Error Resume Next
    ' close the Word application
    Set oWordDoc = Nothing
    If blnStart Then
        ' We started Word, so we close it
        oWordApp.Quit
    End If
    Set oWordApp = Nothing
    Exit Sub

ErrHandler:
    MsgBox Err.Description, vbExclamation
    Resume ExitHandler
End Sub

Function GetDirectory(path)
    GetDirectory = Left(path, InStrRev(path, "\"))
End Function
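
If you would rather stay in Python than VBA, roughly the same keyword check can be sketched with python-docx. This is only an approximation of the macro above, not a drop-in replacement: `keyword_hits`, `read_docx_text` and `scan_folder` are hypothetical names, the whole-word matching is done with a regular expression rather than Word's Find, and python-docx reads only .docx files, not the older .doc format that the `*.doc*` pattern also picks up.

```python
import re
from pathlib import Path


def keyword_hits(text, keywords):
    """Map each keyword to True if it occurs in text as a whole word, ignoring case."""
    return {
        kw: re.search(
            r"(?<!\w)" + re.escape(kw.strip()) + r"(?!\w)", text, re.IGNORECASE
        ) is not None
        for kw in keywords
    }


def read_docx_text(path):
    """Join the paragraph texts of one .docx file (requires python-docx)."""
    # Imported lazily so keyword_hits can be used without python-docx installed.
    from docx import Document
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def scan_folder(folder, keywords):
    """Return {file name: {keyword: bool}} for every .docx file in folder."""
    return {
        path.name: keyword_hits(read_docx_text(path), keywords)
        for path in sorted(Path(folder).glob("*.docx"))
    }
```

Calling `scan_folder` with your folder path and a list such as `["def", "import", "print"]` gives one row of True/False flags per file, which plays the same role as the "Yes" cells in the spreadsheet.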