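How to get the location of the scanned file when using Google Cloud DLP API?
I am scanning nested directories in a Cloud Storage bucket. Even though I turned on include_quote, the results do not contain the matched values (quotes). Also, how do I get the names of the files that contain matches? I am using Python. This is what I have so far. As you can see, the API found matches, but I get no details about which words (and files) were flagged.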
inspect_job = {
    'inspect_config': {
        'info_types': info_types,
        'min_likelihood': MIN_LIKELIHOOD,
        'include_quote': True,
        'limits': {
            'max_findings_per_request': MAX_FINDINGS
        },
    },
    'storage_config': {
        'cloud_storage_options': {
            'file_set': {
                'url':
                    'gs://{bucket_name}/{dir_name}/**'.format(
                        bucket_name=STAGING_BUCKET, dir_name=DIR_NAME)
            }
        }
    }
}
operation = dlp.create_dlp_job(parent, inspect_job)
dlp.get_dlp_job(operation.name)
The result looks like this:
result {
  processed_bytes: 64
  total_estimated_bytes: 64
  info_type_stats {
    info_type {
      name: "EMAIL_ADDRESS"
    }
    count: 1
  }
  info_type_stats {
    info_type {
      name: "PHONE_NUMBER"
    }
    count: 1
  }
  info_type_stats {
    info_type {
      name: "FIRST_NAME"
    }
    count: 2
  }
}
You need to follow the "Retrieving inspection results" section of https://cloud.google.com/dlp/docs/inspecting-storage and specify a save findings action: https://cloud.google.com/dlp/docs/reference/rest/v2/InspectJobConfig#SaveFindings
I think you are not getting the quote values because your inspectConfig is not quite right:
According to the documentation at https://cloud.google.com/dlp/docs/reference/rest/v2/InspectConfig, you should set
"includeQuote": true
Edit: adding information about getting the file:
Following this example: https://cloud.google.com/solutions/automating-classification-of-data-uploaded-to-cloud-storage
The code of the Cloud Function resolve_DLP gets the file name from the job details like this:
def resolve_DLP(data, context):
    ...
    job = dlp.get_dlp_job(job_name)
    ...
    file_path = (
        job.inspect_details.requested_options.job_config.storage_config
        .cloud_storage_options.file_set.url)
    file_name = os.path.basename(file_path)
    ...
Edit 2: I now see that the latest Python API client does use 'include_quote' as the dict key, so that is not the problem.
Edit 3: from the Python API code:
message Finding {
  // The content that was found. Even if the content is not textual, it
  // may be converted to a textual representation here.
  // Provided if `include_quote` is true and the finding is
  // less than or equal to 4096 bytes long. If the finding exceeds 4096 bytes
  // in length, the quote may be omitted.
  string quote = 1;
So maybe smaller files will yield quotes.
Lando, thanks for your input. I believe the Cloud Storage example you mentioned scans only one file per job. It does not use the SaveFindings object.
Josh, you are right. It seems the output needs to be directed to BigQuery or Pub/Sub to see the full results.
From https://cloud.google.com/dlp/docs/inspecting-storage#retrieving-inspection-results:
For complete inspection job results, you have two options. Depending on the Action you've chosen, inspection jobs are:
Saved to BigQuery (the SaveFindings object) in the table specified. Before viewing or analyzing the results, first ensure that the job has completed by using the projects.dlpJobs.get method, which is described below. Note that you can specify a schema for storing findings using the OutputSchema object.
Published to a Cloud Pub/Sub topic (the PublishToPubSub object). The topic must have given publishing access rights to Cloud DLP service account that runs the DlpJob sending the notifications.
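Since the quoted docs say to make sure the job has completed before reading the results, here is a rough sketch of polling for that. It is not from the docs: wait_for_job, terminal_state, and the poll/timeout parameters are names I made up, and the numeric state values (DONE = 3, CANCELED = 4, FAILED = 5) are my assumption about the DlpJob.JobState enum, so check them against your client version.

```python
import time

# Assumed numeric values of DlpJob.JobState in the DLP v2 API
# (DONE = 3, CANCELED = 4, FAILED = 5); some client versions return plain ints.
_TERMINAL_BY_NUMBER = {3: "DONE", 4: "CANCELED", 5: "FAILED"}
_TERMINAL_NAMES = set(_TERMINAL_BY_NUMBER.values())


def terminal_state(state):
    """Return 'DONE', 'CANCELED' or 'FAILED' if state is terminal, else None."""
    name = getattr(state, "name", None)  # enum member, if the client returns one
    if name is None:
        name = _TERMINAL_BY_NUMBER.get(state, state)  # plain int or string
    return name if name in _TERMINAL_NAMES else None


def wait_for_job(get_job, job_name, poll_seconds=10.0, timeout_seconds=600.0,
                 sleep=time.sleep):
    """Call get_job(job_name) repeatedly until the job reaches a terminal state.

    get_job is any callable returning an object with a .state attribute,
    for example dlp.get_dlp_job from the question.
    """
    waited = 0.0
    while True:
        job = get_job(job_name)
        if terminal_state(job.state) is not None:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(
                "job {} did not finish within {} seconds".format(
                    job_name, timeout_seconds))
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client from the question this would be called as wait_for_job(dlp.get_dlp_job, operation.name); afterwards check terminal_state(job.state) == 'DONE' before querying the output table.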
I got it working by modifying the solution.
Here is my final working script:
import google.cloud.dlp

# STAGING_BUCKET, DIR_NAME, GCP_PROJECT_ID, DATASET_ID, TABLE_ID and
# include_quote are defined elsewhere in the script.
dlp = google.cloud.dlp.DlpServiceClient()

inspect_job_data = {
    # Scan every file under the directory, including nested directories.
    'storage_config': {
        'cloud_storage_options': {
            'file_set': {
                'url':
                    'gs://{bucket_name}/{dir_name}/**'.format(
                        bucket_name=STAGING_BUCKET, dir_name=DIR_NAME)
            }
        }
    },
    'inspect_config': {
        'include_quote': include_quote,
        'info_types': [
            {'name': 'ALL_BASIC'},
        ],
    },
    # Save the individual findings to a BigQuery table.
    'actions': [
        {
            'save_findings': {
                'output_config': {
                    'table': {
                        'project_id': GCP_PROJECT_ID,
                        'dataset_id': DATASET_ID,
                        'table_id': '{}_DLP'.format(TABLE_ID)
                    }
                }
            },
        },
    ]
}

operation = dlp.create_dlp_job(parent=dlp.project_path(GCP_PROJECT_ID),
                               inspect_job=inspect_job_data)
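Once the job is done, the file names and quotes can be read back from the BigQuery table. Below is a minimal sketch of how that might look; build_findings_query and fetch_findings are hypothetical helpers of mine, and I am assuming the findings table uses the default schema, where each finding has a quote, an info_type.name, and a repeated location.content_locations field whose container_name holds the gs:// path of the file. Verify the column names against your own table before relying on this.

```python
def build_findings_query(project_id, dataset_id, table_id):
    """Build a query listing file path, info type and quote for each finding.

    Assumes the default DLP findings schema; the identifiers are inserted
    as-is, so only pass trusted values.
    """
    table = "`{}.{}.{}`".format(project_id, dataset_id, table_id)
    return (
        "SELECT\n"
        "  content_location.container_name AS file_path,\n"
        "  info_type.name AS info_type_name,\n"
        "  quote\n"
        "FROM {table},\n"
        "  UNNEST(location.content_locations) AS content_location\n"
        "ORDER BY file_path"
    ).format(table=table)


def fetch_findings(project_id, dataset_id, table_id):
    """Run the query and return (file_path, info_type_name, quote) tuples."""
    # Imported here so the query builder works without the BigQuery client.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    sql = build_findings_query(project_id, dataset_id, table_id)
    rows = client.query(sql).result()  # blocks until the query finishes
    return [(row["file_path"], row["info_type_name"], row["quote"])
            for row in rows]
```

With the script above, the call would be fetch_findings(GCP_PROJECT_ID, DATASET_ID, '{}_DLP'.format(TABLE_ID)). The quote column is only populated when include_quote is true.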