Call a Python script from PowerShell, passing a PSObject and returning the parsed data
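Some background: currently I am querying 4 million rows (with 50 columns) from an MS SQL Server with dbatools into a PSObject (in batches of 10,000 rows per query), processing the data with PowerShell (a lot of regex stuff), and writing it back into a MariaDB with SimplySql.
On average I get approx. 150 rows/sec. IMHO, I had to use a lot of tricks (.NET's StringBuilder etc.) to reach this performance.
As new requirements, I want to detect the language of some text cells, and I have to remove personal data (names and addresses). I found some good Python libraries (spaCy and pycld2) for that purpose.
I ran tests with pycld2: pretty good detection.
Simplified code for clarification (hint: I am a Python noob):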
#get data from MS SQL
$data = Invoke-DbaQuery -SqlInstance $Connection -Query $Query -As PSObject -QueryTimeout 1800
for ($i = 0; $i -lt $data.Length; $i++) {
    # do a lot of other stuff here
    # ...
    # finally make lang detection
    if ($LangDetect.IsPresent) {
        $strLang = $tCaseDescription -replace "([^\p{L}\p{N}_\.\s]|`t|`n|`r)+",""
        $arg = "import pycld2 as cld2; isReliable, textBytesFound, details = cld2.detect('" + $strLang + "', isPlainText = True, bestEffort = True);print(details[0][1])"
        $tCaseLang = & $Env:Programfiles\Python39\python.exe -c $arg
    } else {
        $tCaseLang = ''
    }
}
#write to MariaDB
Invoke-SqlUpdate -ConnectionName $ConnectionName -Query $Query
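This Python call works every time, but it ruins the performance (12 rows/sec), because Python is launched and the pycld2 library is imported on every loop iteration. So this is a lame solution :)
Also, as mentioned above, I want to use spaCy, where more columns have to be parsed to remove the personal data.
I am not sure I am in the mood to convert the whole PS parser to Python :|
I believe a better solution could be to pass the whole PSObject from PowerShell to Python (before the PS loop starts) and return it as a PSObject as well, after it has been processed in Python, but I don't know how to implement this with Python / a Python function.
What would be your approach, or do you have any other suggestions or ideas?
Thanks :)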
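The following simplified example shows you how to pass multiple [pscustomobject] ([psobject]) instances from PowerShell to a Python script (passed as a string via -c in this case):
by using JSON as the serialization format, via ConvertTo-Json ...
... and passing that JSON via the pipeline, which Python can read via stdin (standard input).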
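Important:
Character encoding:
PowerShell uses the encoding specified in the $OutputEncoding preference variable when sending data to external programs (such as Python); it defaults to BOM-less UTF-8 in PowerShell [Core] v6+, but regrettably to ASCII(!) in Windows PowerShell.
Just as PowerShell limits you to sending text to an external program, it also invariably interprets what it receives as text, namely based on the encoding stored in [Console]::OutputEncoding; regrettably, both PowerShell editions default to the system's OEM code page as of this writing.
To both send and receive (BOM-less) UTF-8 in both PowerShell editions, (temporarily) set $OutputEncoding and [Console]::OutputEncoding as follows: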
$OutputEncoding = [Console]::OutputEncoding = [System.Text.Utf8Encoding]::new($false)
If you want your Python script to also output objects, again consider using JSON, which you can parse into objects on the PowerShell side with ConvertFrom-Json.
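As a side note on the Python-to-PowerShell direction, here is a minimal sketch (not part of the original answer, and the sample record is made up): by default, Python's json.dumps escapes every non-ASCII character as \uXXXX (ensure_ascii=True), so the JSON text written to stdout is plain ASCII and should come through unchanged whatever [Console]::OutputEncoding happens to be. The JSON parser on the receiving side turns the escapes back into the real characters. This does not help with the sending direction, which still depends on $OutputEncoding.

```python
import json

# Made-up sample record containing non-ASCII characters.
record = {"Id": 1, "CaseSubject": "Grüße aus Köln"}

# Default ensure_ascii=True: non-ASCII characters become \uXXXX escapes.
ascii_json = json.dumps(record)

# ensure_ascii=False keeps the characters as they are; the consumer must
# then decode the output with the right encoding (e.g. UTF-8).
utf8_json = json.dumps(record, ensure_ascii=False)

print(ascii_json)
print(ascii_json.isascii())              # the escaped form contains only ASCII
print(json.loads(ascii_json) == record)  # parsing it back loses nothing
```

The trade-off is size: the escaped form is somewhat longer for text with many non-ASCII characters.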
# Sample input objects.
$data = [pscustomobject] @{ one = 1; two = 2 }, [pscustomobject] @{ one = 10; two = 20 }
# Convert to JSON and pipe to Python.
ConvertTo-Json $data | python -c @'
import sys, json
# Parse the JSON passed via stdin into a list of dictionaries.
dicts = json.load(sys.stdin)
# Sample processing: print the 'one' entry of each dict.
for d in dicts:
    print(d['one'])
'@
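To also send objects back, the Python side could write JSON to stdout. The following is a rough sketch under assumptions, not a definitive implementation: the names process and run and the added 'sum' field are purely illustrative. It also normalizes the input, because ConvertTo-Json emits a bare object rather than an array when it is given a single object.

```python
import io
import json
import sys


def process(records):
    """Add a hypothetical 'sum' field to every record (illustrative only)."""
    for record in records:
        record["sum"] = record["one"] + record["two"]
    return records


def run(stdin, stdout):
    """Read JSON from stdin, process the records, write JSON to stdout."""
    records = json.load(stdin)
    # A single input object arrives as a bare JSON object, not as an array,
    # so normalize to a list before processing.
    if isinstance(records, dict):
        records = [records]
    json.dump(process(records), stdout)


if __name__ == "__main__":
    # Demo with an in-memory stream; in a real script this would be
    # run(sys.stdin, sys.stdout) instead.
    demo_in = io.StringIO('[{"one": 1, "two": 2}, {"one": 10, "two": 20}]')
    run(demo_in, sys.stdout)
```

On the PowerShell side, the result could then be captured with something like ConvertTo-Json $data | python script.py | ConvertFrom-Json, where script.py is a hypothetical file name.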
If the data to pass is a collection of single-line strings, you don't need JSON:
$data = 'foo', 'bar', 'baz'
$data | python -c @'
import sys
# Sample processing: print each stdin input line enclosed in [...]
for line in sys.stdin:
    print('[' + line.rstrip('\r\n') + ']')
'@
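Based on @mklement0's answer, I want to share the completed and tested solution, which returns the JSON from Python to PowerShell and takes the correct character encoding into account.
I have already tried it with a batch of 100k rows: no issues, it runs flawlessly and is super fast :)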
#get data from MS SQL
$query = -join@(
'SELECT `Id`, `CaseSubject`, `CaseDescription`,
`AccountCountry`, `CaseLang` '
'FROM `db`.`table_global` '
'ORDER BY `Id` DESC, `Id` ASC '
'LIMIT 10000;'
)
$data = Invoke-DbaQuery -SqlInstance $Connection -Query $Query -As PSObject -QueryTimeout 1800
$arg = @'
import pycld2 as cld2
import simplejson as json
import sys, re, logging

def main():
    # Log to stderr, so that stdout carries only the JSON result
    # -> https://docs.python.org/3/library/logging.html#logging.debug
    logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
    logging.info('->Encoding Python: ' + str(sys.stdin.encoding))
    # Parse the JSON passed via stdin into a list of dictionaries,
    # taking the character encoding into account
    cases = json.load(sys.stdin, encoding='utf-8')
    # Remove every character that is neither a word character nor whitespace
    # (this includes quotes and backslashes), plus line breaks
    # https://regex101.com/r/bymIQS/1
    regex = re.compile(r'[^\w\s]|[\r\n]')
    # Hash table with country vs. language for 'boosting' the language detection, if pycld2 is not sure
    lang_country = {
        'Albania': 'ALBANIAN', 'Algeria': 'ARABIC', 'Argentina': 'SPANISH',
        'Armenia': 'ARMENIAN', 'Austria': 'GERMAN', 'Azerbaijan': 'AZERBAIJANI',
        'Bangladesh': 'BENGALI', 'Belgium': 'DUTCH', 'Benin': 'FRENCH',
        'Bolivia, Plurinational State of': 'SPANISH',
        'Bosnia and Herzegovina': 'BOSNIAN', 'Brazil': 'PORTUGUESE',
        'Bulgaria': 'BULGARIAN', 'Chile': 'SPANISH', 'China': 'Chinese',
        'Colombia': 'SPANISH', 'Costa Rica': 'SPANISH', 'Croatia': 'CROATIAN',
        'Czech Republic': 'CZECH', 'Denmark': 'DANISH', 'Ecuador': 'SPANISH',
        'Egypt': 'ARABIC', 'El Salvador': 'SPANISH', 'Finland': 'FINNISH',
        'France': 'FRENCH', 'Germany': 'GERMAN', 'Greece': 'GREEK',
        'Greenland': 'GREENLANDIC', 'Hungary': 'HUNGARIAN', 'Iceland': 'ICELANDIC',
        'India': 'HINDI', 'Iran': 'PERSIAN', 'Iraq': 'ARABIC',
        'Ireland': 'ENGLISH', 'Israel': 'HEBREW', 'Italy': 'ITALIAN',
        'Japan': 'Japanese', 'Kosovo': 'ALBANIAN', 'Kuwait': 'ARABIC',
        'Mexico': 'SPANISH', 'Monaco': 'FRENCH', 'Morocco': 'ARABIC',
        'Netherlands': 'DUTCH', 'New Zealand': 'ENGLISH', 'Norway': 'NORWEGIAN',
        'Panama': 'SPANISH', 'Paraguay': 'SPANISH', 'Peru': 'SPANISH',
        'Poland': 'POLISH', 'Portugal': 'PORTUGUESE', 'Qatar': 'ARABIC',
        'Romania': 'ROMANIAN', 'Russia': 'RUSSIAN', 'San Marino': 'ITALIAN',
        'Saudi Arabia': 'ARABIC', 'Serbia': 'SERBIAN', 'Slovakia': 'SLOVAK',
        'Slovenia': 'SLOVENIAN', 'South Africa': 'AFRIKAANS', 'South Korea': 'Korean',
        'Spain': 'SPANISH', 'Sweden': 'SWEDISH', 'Switzerland': 'GERMAN',
        'Thailand': 'THAI', 'Tunisia': 'ARABIC', 'Turkey': 'TURKISH',
        'Ukraine': 'UKRAINIAN', 'United Arab Emirates': 'ARABIC',
        'United Kingdom': 'ENGLISH', 'United States': 'ENGLISH',
        'Uruguay': 'SPANISH', 'Uzbekistan': 'UZBEK', 'Venezuela': 'SPANISH',
    }
    for case in cases:
        # Concatenate the two fields and clean them up a bit, so that we
        # don't get any faults due to line breaks etc.
        tCaseDescription = regex.sub('', (case['CaseSubject'] + ' ' + case['CaseDescription']))
        tCaseAccCountry = case['AccountCountry']
        if tCaseAccCountry in lang_country:
            language = lang_country[tCaseAccCountry]
            isReliable, textBytesFound, details = cld2.detect(tCaseDescription,
                                                              isPlainText = True,
                                                              bestEffort = True,
                                                              hintLanguage = language)
        else:
            isReliable, textBytesFound, details = cld2.detect(tCaseDescription,
                                                              isPlainText = True,
                                                              bestEffort = True)
        # Take the name of the top-ranked language
        case['CaseLang'] = details[0][0]
        #logging.info('->Python processing CaseID: ' + str(case['Id']) + ' / Detected Language: ' + str(case['CaseLang']))
    # Encode to JSON (a second positional argument would be taken as
    # skipkeys, not as an encoding, so none is passed)
    retVal = json.dumps(cases)
    return retVal

if __name__ == '__main__':
    retVal = main()
    sys.stdout.write(str(retVal))
'@
$dataJson = ConvertTo-Json $data
$data = ($dataJson | python -X utf8 -c $arg) | ConvertFrom-Json
foreach ($case in $data) {
    # Escape backslashes and single quotes for the SQL string literals
    $tCaseSubject = $case.CaseSubject -replace "\\", "\\" -replace "'", "\'"
    $tCaseDescription = $case.CaseDescription -replace "\\", "\\" -replace "'", "\'"
    $tCaseLang = $case.CaseLang.Substring(0,1).ToUpper() + $case.CaseLang.Substring(1).ToLower()
    $tCaseId = $case.Id
    $qUpdate = -join @(
        "UPDATE db.table_global SET CaseSubject=`'$tCaseSubject`', "
        "CaseDescription=`'$tCaseDescription`', "
        "CaseLang=`'$tCaseLang`' "
        "WHERE Id=$tCaseId;"
    )
    try {
        $result = Invoke-SqlUpdate -ConnectionName 'maria' -Query $qUpdate
    } catch {
        Write-Host -Foreground Red -Background Black ("Update failed for Id ${tCaseId}: $_")
        #break
    }
}
Close-SqlConnection -ConnectionName 'maria'
Please excuse the unfortunate syntax highlighting; the script block contains SQL, PowerShell and Python.
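A possible simplification, offered as a sketch only: the two cld2.detect calls in the loop differ only in the hintLanguage argument, so the keyword arguments could be built once and the hint added when the country is known. To keep the sketch runnable without pycld2, the detector is passed in as a parameter; fake_detect is a hypothetical stand-in that merely echoes the hint, and detect_language is an illustrative name, not part of any library.

```python
def detect_language(text, country, detect, lang_country):
    """Return the top-ranked language name, hinting with the country's language.

    `detect` is expected to behave like pycld2.detect: it returns
    (is_reliable, bytes_found, details), where details[0][0] is the name
    of the top-ranked language.
    """
    kwargs = {"isPlainText": True, "bestEffort": True}
    hint = lang_country.get(country)
    if hint is not None:
        # Only pass a hint when the country is in the lookup table.
        kwargs["hintLanguage"] = hint
    _is_reliable, _bytes_found, details = detect(text, **kwargs)
    return details[0][0]


def fake_detect(text, **kwargs):
    """Hypothetical stand-in for pycld2.detect; it just echoes the hint."""
    name = kwargs.get("hintLanguage", "Unknown")
    return True, len(text), ((name, "xx", 100, 0.0),)


if __name__ == "__main__":
    countries = {"Austria": "GERMAN"}
    print(detect_language("Guten Tag", "Austria", fake_detect, countries))
    print(detect_language("Guten Tag", "Atlantis", fake_detect, countries))
```

In the real script, cld2.detect would be passed in place of fake_detect, and lang_country would be the full table from the solution above.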