What is a safe way to extract Python code blocks from docx files and run them in a sandbox?
I have approximately 6000~6500 Microsoft Word .docx files containing variously formatted answer scripts, in this order:
Python Programming Question in Bold
Answer in form of complete, correctly-indented, single-spaced, self-sufficient code
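Unfortunately, there seems to be no fixed pattern that separates the code blocks from normal text. Some examples from the first 50 or so files:
The entire question is in bold, after which the code starts abruptly, in bold/italics
The question is placed in a comment, after which the code continues
The question is missing entirely, and only a numbered list indicates where the code starts
The question is missing entirely, and C/Python-style comments indicate where the code starts
and so on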
Currently, I am extracting the entire unformatted text via python-docx like this:
from docx import Document

doc = Document(infil)
# For Unicode handling.
new_paragraphs = []
for paragraph in doc.paragraphs:
    new_paragraphs.append(paragraph.text.encode("utf-8"))
# convert() is a helper defined elsewhere (not shown); it must return str for the join below.
new_paragraphs = [convert(p) for p in new_paragraphs]
with open(outfil, 'w', encoding='utf-8') as f:
    print('\n'.join(new_paragraphs), file=f)
After extraction, I will run the programs using the PyPy Sandboxing feature, which I believe is safe, and then assign points as in a contest.
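For the run-and-score step, a minimal harness might look like the sketch below. It is not a sandbox by itself: the timeout only guards against hangs, and the isolation has to come from the interpreter command that is passed in. `run_with_timeout`, its `interpreter` parameter, and the result dictionary are hypothetical names, and the default (the regular CPython interpreter in isolated mode, `-I`) is only a stand-in for the sandboxed interpreter command, which the question does not show.

```python
import subprocess
import sys


def run_with_timeout(source_path, interpreter=(sys.executable, "-I"),
                     timeout=5.0, stdin_text=""):
    """Run one extracted script and capture its outcome.

    NOTE: this is NOT a security boundary. Pass the command line of a real
    sandboxed interpreter as `interpreter` (assumption: it accepts the
    script path as its last argument).
    """
    try:
        proc = subprocess.run(
            [*interpreter, source_path],
            input=stdin_text,        # test input fed to the script's stdin
            capture_output=True,
            text=True,
            timeout=timeout,         # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}
```

Scoring can then compare `result["stdout"]` against the expected output for each test input, treating "timeout" and "error" as failed runs.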
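What I am completely stuck on is how to programmatically detect the start and end of the code. Most language detection APIs are not needed, since I already know the language. This question: How to detect source code in a text? suggests using linters and syntax highlighters like the Google Code Prettifier, but they do not solve the problem of detecting separate programs.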
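Because the language is already known, one option worth trying before anything statistical is to let Python's own parser find the boundaries: look for the longest run of extracted lines that parses as a module. The sketch below is only an illustration of that idea, not a method taken from the linked questions; `is_valid_python` and `find_code_span` are hypothetical names, and it assumes the indentation survived extraction.

```python
import ast


def is_valid_python(lines):
    """Return True if the joined lines parse as a non-empty Python module."""
    try:
        tree = ast.parse("\n".join(lines))
    except (SyntaxError, ValueError):
        return False
    # Blank or comment-only spans parse fine but have an empty body.
    return bool(tree.body)


def find_code_span(lines):
    """Return (start, end) of the longest parseable run of lines, or None.

    `end` is exclusive. Brute force, longest spans first: O(n^2) parses,
    which is acceptable for short answer scripts but not for large files.
    """
    n = len(lines)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            end = start + length
            # Skip spans that begin or end on a blank line.
            if not lines[start].strip() or not lines[end - 1].strip():
                continue
            if is_valid_python(lines[start:end]):
                return start, end
    return None


paragraphs = [
    "Write a program to add two numbers.",
    "a = 1",
    "b = 2",
    "def add(x, y):",
    "    return x + y",
    "print(add(a, b))",
    "End of answer 1.",
]
print(find_code_span(paragraphs))  # → (1, 6)
```

A known weakness: a single-word prose line such as `Answer` or a list marker such as `1.` is itself valid Python, so it can be absorbed into the span. A file containing several separate programs would also need the search repeated on the remaining lines.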
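A suitable solution, from this programmers.se question, seems to be training Markov chains, but I would like some second opinions before embarking on such a large project.
This extraction code will also be made available to all students after evaluation.
I apologize if the question is too broad or the answer too obvious.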
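Hmm, are you looking for some kind of formatting pattern? That sounds a bit odd to me. Is there any kind of text or string pattern you can exploit? I'm not sure whether this helps, but the VBA script below searches all Word documents in a folder and puts "Yes" in any cell that matches the search criteria you specify in Row 1. It also puts a hyperlink in column A, so you can click the link and open the file rather than hunting for it.
Script: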
' Requires a reference to the Microsoft Word Object Library (Tools > References).
Sub OpenAndReadWordDoc()
    ' Clear previous results below the header row
    Rows("2:1000000").Select
    Range(Selection, Selection.End(xlDown)).Select
    Selection.ClearContents
    Range("A1").Select
    ' assumes that the previous procedure has been executed
    Dim oWordApp As Word.Application
    Dim oWordDoc As Word.Document
    Dim blnStart As Boolean
    Dim r As Long
    Dim sFolder As String
    Dim strFilePattern As String
    Dim strFileName As String
    Dim sFileName As String
    Dim ws As Worksheet
    Dim c As Long
    Dim n As Long
    '~~> Establish a Word application object
    On Error Resume Next
    Set oWordApp = GetObject(, "Word.Application")
    If Err() Then
        Set oWordApp = CreateObject("Word.Application")
        ' We started Word for this macro
        blnStart = True
    End If
    On Error GoTo ErrHandler
    Set ws = ActiveSheet
    r = 1 ' startrow for the copied text from the Word document
    ' Last column
    n = ws.Range("A1").End(xlToRight).Column
    sFolder = "C:\Users\your_path_here\"
    '~~> This is the extension you want to go in for
    strFilePattern = "*.doc*"
    '~~> Loop through the folder to get the word files
    strFileName = Dir(sFolder & strFilePattern)
    Do Until strFileName = ""
        sFileName = sFolder & strFileName
        '~~> Open the word doc
        Set oWordDoc = oWordApp.Documents.Open(sFileName)
        ' Increase row number
        r = r + 1
        ' Enter file name in column A
        ws.Cells(r, 1).Value = sFileName
        ActiveCell.Offset(1, 0).Select
        ActiveSheet.Hyperlinks.Add Anchor:=Sheets("Sheet1").Range("A" & r), Address:=sFileName, _
            SubAddress:="A" & r, TextToDisplay:=sFileName
        ' Loop through the columns
        For c = 2 To n
            If oWordDoc.Content.Find.Execute(FindText:=Trim(ws.Cells(1, c).Value), _
                    MatchWholeWord:=True, MatchCase:=False) Then
                ' If text found, enter Yes in column number c
                ws.Cells(r, c).Value = "Yes"
            End If
        Next c
        oWordDoc.Close SaveChanges:=False
        '~~> Find next file
        strFileName = Dir()
    Loop
ExitHandler:
    On Error Resume Next
    ' close the Word application
    Set oWordDoc = Nothing
    If blnStart Then
        ' We started Word, so we close it
        oWordApp.Quit
    End If
    Set oWordApp = Nothing
    Exit Sub
ErrHandler:
    MsgBox Err.Description, vbExclamation
    Resume ExitHandler
End Sub

Function GetDirectory(path)
    GetDirectory = Left(path, InStrRev(path, "\"))
End Function
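Since the rest of the pipeline is in Python, the same folder search can be approximated without Word or Excel. The sketch below is a rough equivalent under stated assumptions, not a port of the macro: it handles only .docx (not legacy .doc), reads the main document part directly with the standard library, and returns a dictionary instead of filling a worksheet. `docx_text` and `search_folder` are hypothetical names.

```python
import re
import zipfile
from pathlib import Path
from xml.etree import ElementTree

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as zf:
        root = ElementTree.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def search_folder(folder, terms):
    """Map each .docx file name to {term: True/False}.

    Matching is whole-word and case-insensitive, similar in spirit to
    MatchWholeWord:=True, MatchCase:=False in the macro. The \\b anchors
    assume each term starts and ends with a letter, digit, or underscore.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.docx")):
        if path.name.startswith("~$"):
            continue  # Word lock files are not real documents
        text = docx_text(path)
        results[path.name] = {
            term: re.search(r"\b" + re.escape(term.strip()) + r"\b",
                            text, re.IGNORECASE) is not None
            for term in terms
        }
    return results
```

Headers, footers, and footnotes live in other parts of the archive and are not searched here.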