Video Intelligence API - Label Segment time

I am following this label detection tutorial.

The code below does the following (after the response is received):

Our response will contain result within an AnnotateVideoResponse, which consists of a list of annotationResults, one for each video sent in the request. Because we sent only one video in the request, we take the first segmentLabelAnnotations of the results. We then loop through all the labels in segmentLabelAnnotations. For the purpose of this tutorial, we only display video-level annotations. To identify video-level annotations, we pull segment_label_annotations data from the results. Each segment label annotation includes a description (segment_label.description), a list of entity categories (category_entity.description) and where they occur in segments by start and end time offsets from the beginning of the video.

segment_labels = result.annotation_results[0].segment_label_annotations
for i, segment_label in enumerate(segment_labels):
    print('Video label description: {}'.format(
        segment_label.entity.description))
    for category_entity in segment_label.category_entities:
        print('\tLabel category description: {}'.format(
            category_entity.description))

    for i, segment in enumerate(segment_label.segments):
        start_time = (segment.segment.start_time_offset.seconds +
                      segment.segment.start_time_offset.nanos / 1e9)
        end_time = (segment.segment.end_time_offset.seconds +
                    segment.segment.end_time_offset.nanos / 1e9)
        positions = '{}s to {}s'.format(start_time, end_time)
        confidence = segment.confidence
        print('\tSegment {}: {}'.format(i, positions))
        print('\tConfidence: {}'.format(confidence))
    print('\n')

So, it says "Each segment label annotation includes a description (segment_label.description), a list of entity categories (category_entity.description) and where they occur in segments by start and end time offsets from the beginning of the video."

However, in the output, all the labels (urban area, traffic, vehicle, ...) have the same start and end time offsets, which are basically the start and the end of the video.

$ python label_det.py gs://cloud-ml-sandbox/video/chicago.mp4
Operation us-west1.4757250774497581229 started: 2017-01-30T01:46:30.158989Z
Operation processing ...
The video has been successfully processed.

Video label description: urban area
        Label category description: city
        Segment 0: 0.0s to 38.752016s  
        Confidence: 0.946980476379

Video label description: traffic
        Segment 0: 0.0s to 38.752016s
        Confidence: 0.94105899334

Video label description: vehicle
        Segment 0: 0.0s to 38.752016s
        Confidence: 0.919958174229
...

I see that the part of the tutorial you are following uses the simplest examples available, while the list of samples provides a more complete example, which uses more of the features offered by the Video Intelligence API.

In order to achieve the objective you want (more detailed information about the points in time at which each annotation is identified), you can explore two possibilities:

  • Option 1

The key point here is that video-level annotations only work with segments. As explained in the documentation page I linked, if no segments are specified for a video, the API treats the whole video as a single segment. Therefore, if you want the API to return more "specific" results about when each annotation is identified, you should split the video into segments yourself (segments can overlap and do not need to cover the complete video) and pass them as part of the videoContext field in the annotate request.

If you do this through raw API requests, you can make a request like the one below, defining as many segments as you want by specifying their start and end TimeOffsets:

{
 "inputUri": "gs://cloud-ml-sandbox/video/chicago.mp4",
 "features": [
  "LABEL_DETECTION"
 ],
 "videoContext": {
  "segments": [
   {
    "startTimeOffset": "TO_DO",
    "endTimeOffset": "TO_DO"
   },
   {
    "startTimeOffset": "TO_DO",
    "endTimeOffset": "TO_DO"
   }
  ]
 }
}

If you prefer to use the Python Client Library, you can instead use the video_context parameter, as in the code below:

from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.LABEL_DETECTION]

# The video_context parameter carries per-request configuration; here it
# holds a LabelDetectionConfig that selects the label detection mode.
mode = videointelligence.enums.LabelDetectionMode.SHOT_AND_FRAME_MODE
config = videointelligence.types.LabelDetectionConfig(label_detection_mode=mode)
context = videointelligence.types.VideoContext(label_detection_config=config)

operation = video_client.annotate_video(
    "gs://cloud-ml-sandbox/video/chicago.mp4",
    features=features,
    video_context=context)
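
That snippet shows how a video_context is built and attached to the request, although in that case it carries a LabelDetectionConfig rather than segments. If you specifically want to define segments from Python, the sketch below (untested, following the same client library style, with arbitrary 0-10s and 10-20s placeholder offsets) shows how the segments field of the VideoContext could be populated:

from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.LABEL_DETECTION]

# Hypothetical segments: 0s-10s and 10s-20s. Replace them with the
# boundaries you actually care about; segments may overlap and do not
# need to cover the whole video.
segment_1 = videointelligence.types.VideoSegment()
segment_1.start_time_offset.seconds = 0
segment_1.end_time_offset.seconds = 10

segment_2 = videointelligence.types.VideoSegment()
segment_2.start_time_offset.seconds = 10
segment_2.end_time_offset.seconds = 20

# Passing the segments through the video_context makes the API report
# label offsets per segment instead of over the whole video.
context = videointelligence.types.VideoContext(segments=[segment_1, segment_2])

operation = video_client.annotate_video(
    "gs://cloud-ml-sandbox/video/chicago.mp4",
    features=features,
    video_context=context)
result = operation.result(timeout=300)
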
  • Option 2

The second option I would suggest for your use case is to use a different Label Detection Mode. The list of available label detection modes is in this documentation link. By default, SHOT_MODE is used, and it only provides video-level and shot-level annotations, which require that you work with segments as explained in Option 1. If, instead, you use FRAME_MODE, frame-level annotations will be processed. This is a costly option, as it analyzes all the frames in the video and annotates each of them, but it may be a suitable option depending on your specific use case. This mode (well, actually SHOT_AND_FRAME_MODE, which is a combination of the previous two) is used in the more complete example I mentioned at the beginning of my answer. The analyze_labels() function in that code provides a really complete example of how to perform video-, shot- and frame-level annotations, and specifically for frame-level annotations it shows how to obtain information about the frames in which each annotation occurs (see the short sketch after the sample response below).

Note that this option is really expensive, as I explained before. For example, I have run it over the "chicago.mp4" video provided as a sample in the tutorial, and it took around 30 minutes to complete. However, the level of detail achieved is really high (again, every frame is analyzed, and the annotations are then grouped by element). This is the kind of response you can expect to obtain:

"frameLabelAnnotations": [
     {
      "entity": {
       "entityId": "/m/088l6h",
       "description": "family car",
       "languageCode": "en-US"
      },
      "categoryEntities": [
       {
        "entityId": "/m/0k4j",
        "description": "car",
        "languageCode": "en-US"
       }
      ],
      "frames": [
       {
        "timeOffset": "0.570808s",
        "confidence": 0.76606256
       },
       {
        "timeOffset": "1.381775s",
        "confidence": 0.74966145
       },
       {
        "timeOffset": "2.468091s",
        "confidence": 0.85502887
       },
       {
        "timeOffset": "3.426006s",
        "confidence": 0.78749716
       },
      ]
     },
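
For reference, below is a condensed, untested sketch of how this could look from Python, modeled on the analyze_labels() sample mentioned above and reusing the client style of the earlier snippets; it requests FRAME_MODE and then walks the frame-level annotations:

from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.LABEL_DETECTION]

# Request frame-level annotations instead of the default SHOT_MODE.
mode = videointelligence.enums.LabelDetectionMode.FRAME_MODE
config = videointelligence.types.LabelDetectionConfig(label_detection_mode=mode)
context = videointelligence.types.VideoContext(label_detection_config=config)

operation = video_client.annotate_video(
    "gs://cloud-ml-sandbox/video/chicago.mp4",
    features=features,
    video_context=context)
result = operation.result(timeout=3600)  # frame mode can take a long time

# Frame-level results are in frame_label_annotations; each frame carries
# its own time offset and confidence.
frame_labels = result.annotation_results[0].frame_label_annotations
for frame_label in frame_labels:
    print('Frame label description: {}'.format(frame_label.entity.description))
    for category_entity in frame_label.category_entities:
        print('\tLabel category description: {}'.format(category_entity.description))
    for frame in frame_label.frames:
        time_offset = frame.time_offset.seconds + frame.time_offset.nanos / 1e9
        print('\t{:.4f}s, confidence: {}'.format(time_offset, frame.confidence))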

TL;DR:

The kind of call you are making, following the simple example in the tutorial, returns the expected results. Without any specific configuration, a video is treated as a single segment, which is why the response you receive identifies the annotations over the whole video.

If you want more detailed information about when each element is identified, you will need to follow one of these two approaches: (1) define segments in your video (this requires that you manually specify the segments into which you want to split it), or (2) use FRAME_MODE (more expensive and more precise).