"The worker lost contact with the service" 超过 6 小时后数据流作业失败?

Dataflow job failed after more than 6 hours with "The worker lost contact with the service"?

I'm using Dataflow to read data from BigQuery and then do NLP preprocessing in Python. I'm using the Python 3 SDK 2.16.0, with 100 workers in europe-west6 (IPs provided, private access, and Cloud NAT) and the endpoint in europe-west1. The BigQuery table is in the US. Test jobs run without any problem, but when processing the full table (32 GB) the job fails after 6 h 40 min, and it is hard to fully understand what the underlying error is.
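For context, this is roughly how the job is launched (a sketch only; the script, project, and bucket names are placeholders, not my exact command):

    python run_pipeline.py \
        --runner=DataflowRunner \
        --project=my-project \
        --region=europe-west6 \
        --num_workers=100 \
        --no_use_public_ips \
        --temp_location=gs://my-bucket/beam/temp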

First, Dataflow reported the following, which is somewhat confusing: in one case a work item failed, two other workers lost contact with the service, and one worker was reported dead!

Now looking at the logs of the BigQuery read step: the first suspicious thing is the message "Refreshing due to a 401 (attempt 1/2)", which appears every 3 seconds during the whole Dataflow job. I don't think it is related to the crash, but it is strange. Also, the timestamps of the BigQuery issues (16:28:07 and 16:28:15) come after the problems reported for the workers (16:27:44).

An exception was raised when trying to execute the workitem 7962803802081012962 : Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.6/site-packages/dataflow_worker/executor.py", line 176, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
  File "/usr/local/lib/python3.6/site-packages/dataflow_worker/nativefileio.py", line 204, in __iter__
    for record in self.read_next_block():
  File "/usr/local/lib/python3.6/site-packages/dataflow_worker/nativeavroio.py", line 198, in read_next_block
    fastavro_block = next(self._block_iterator)
  File "fastavro/_read.pyx", line 738, in fastavro._read.file_reader.next
  File "fastavro/_read.pyx", line 662, in _iter_avro_blocks
  File "fastavro/_read.pyx", line 595, in fastavro._read.null_read_block
  File "fastavro/_read.pyx", line 597, in fastavro._read.null_read_block
  File "fastavro/_read.pyx", line 304, in fastavro._read.read_bytes
  File "/usr/local/lib/python3.6/site-packages/apache_beam/io/filesystemio.py", line 113, in readinto
    data = self._downloader.get_range(start, end)
  File "/usr/local/lib/python3.6/site-packages/apache_beam/io/gcp/gcsio.py", line 522, in get_range
    self._downloader.GetRange(start, end - 1)
  File "/usr/local/lib/python3.6/site-packages/apitools/base/py/transfer.py", line 486, in GetRange
    response = self.__ProcessResponse(response)
  File "/usr/local/lib/python3.6/site-packages/apitools/base/py/transfer.py", line 424, in __ProcessResponse
    raise exceptions.HttpError.FromResponse(response)
apitools.base.py.exceptions.HttpNotFoundError: HttpError accessing <https://www.googleapis.com/storage/v1/b/xxx/o/beam%2Ftemp%2FWhosebug-raphael-191119-084402.1574153042.687677%2F11710707918635668555%2F000000000009.avro?alt=media&generation=1574154204169350>: response: <{'x-guploader-uploadid': 'AEnB2UpgIuanY0AawrT7fRC_VW3aRfWSdrrTwT_TqQx1fPAAAUohVoL-8Z8Zw_aYUQcSMNqKIh5R2TulvgHHsoxLWo2gl6wUEA', 'content-type': 'text/html; charset=UTF-8', 'date': 'Tue, 19 Nov 2019 15:28:07 GMT', 'vary': 'Origin, X-Origin', 'expires': 'Tue, 19 Nov 2019 15:28:07 GMT', 'cache-control': 'private, max-age=0', 'content-length': '142', 'server': 'UploadServer', 'status': '404'}>, content <No such object: nlp-text-classification/beam/temp/Whosebug-xxxx-191119-084402.1574153042.687677/11710707918635668555/000000000009.avro>

The metadata of that log entry:

timestamp: 2019-11-19T15:28:07.770312309Z
logger: root:batchworker.py:do_work
severity: ERROR
worker: Whosebug-xxxx-191-11190044-7wyy-harness-2k89
step: Read Posts from BigQuery
thread: 73:140029564072960

The workers seem unable to find some Avro files on Cloud Storage. This could be related to the message "The workers lost contact with the service". (As far as I understand, the BigQuery read in this SDK first exports the table to temporary Avro files under beam/temp/ in GCS, which the workers then read, so a missing temp file fails the read step.)

如果我查看 "ERROR",我会看到很多这样的问题,所以看起来工人本身有问题:

Looking at the stack traces does not give any more hints.

My questions are the following:

  1. How can we confirm that the problem comes from the workers?
  2. What could be the cause: memory? disk? or a transient issue?
  3. If a worker dies, is there any option to recover? And why does the whole job stop when only 3 out of 98 workers are dead or lost? Is there a parameter for this?

Our setup:

We are monitoring several metrics with Stackdriver, and nothing looks wrong to me:

The default disk size for batch jobs that do not use Dataflow Shuffle is 250 GB, so the 50 GB you have set leaves very little room for any shuffle data that needs to be stored on the workers.

It would be good to see the shape of your pipeline (what steps are involved), but based on the log screenshots you have four steps (read from BigQuery, preprocess, write to BigQuery, and also write to GCS). I also see some GroupBy operations. A GroupBy requires a shuffle, and your 50 GB disks may be limiting its storage.
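For illustration, a pipeline with roughly that shape might look like the following (the step names, query, and output table are assumptions based on the screenshots, not your actual code):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class PreprocessFn(beam.DoFn):
        # Placeholder for the NLP preprocessing; emits (key, text) pairs.
        def process(self, row):
            yield (row['topic'], row['text'])

    def run():
        options = PipelineOptions()  # Dataflow flags come from the command line
        with beam.Pipeline(options=options) as p:
            (p
             | 'Read Posts from BigQuery' >> beam.io.Read(
                   beam.io.BigQuerySource(query='SELECT ...'))  # placeholder query
             | 'Preprocess' >> beam.ParDo(PreprocessFn())
             # The GroupByKey below forces a shuffle; without Dataflow Shuffle,
             # its intermediate data is spilled to the workers' persistent disks.
             | 'Group by topic' >> beam.GroupByKey()
             | 'Format' >> beam.Map(lambda kv: {'topic': kv[0],
                                                'n_posts': len(list(kv[1]))})
             | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                   'project:dataset.table'))  # placeholder table, assumed to exist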

There are a couple of things you should try:
- Do not limit the workers to 50 GB (remove the diskGB setting so that Dataflow can use the defaults).
- Try Dataflow Shuffle (--experiments=shuffle_mode=service), see https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-shuffle (a sketch of the change follows this list).
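On the command line the change could look something like this (--disk_size_gb is the Beam Python flag corresponding to the diskGB setting; the script name and elided flags are placeholders):

    # Before: fixed 50 GB disks, shuffle data kept on the workers
    python run_pipeline.py --runner=DataflowRunner ... --disk_size_gb=50

    # After: drop the disk flag and move the shuffle to the service
    python run_pipeline.py --runner=DataflowRunner ... \
        --experiments=shuffle_mode=service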

When you use Dataflow Shuffle, the default for the diskGB parameter is 30 GB. You can then use small disks (I would still recommend not setting diskGBSize yourself).

After some tests and some monitoring charts, it became clear that even though the texts were about the same length, the processing time was starting to increase rapidly (bottom-right chart).
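For anyone who wants to produce this kind of chart, per-element processing time can be tracked with a Beam metrics distribution (a sketch; preprocess() stands in for the actual spaCy step):

    import time
    import apache_beam as beam
    from apache_beam.metrics import Metrics

    class TimedPreprocessFn(beam.DoFn):
        def __init__(self):
            # Distribution of per-element processing times in milliseconds;
            # custom metrics like this should show up in the Dataflow monitoring UI.
            self.proc_time_ms = Metrics.distribution('preprocess', 'proc_time_ms')

        def process(self, element):
            start = time.time()
            result = preprocess(element)  # placeholder for the spaCy call
            self.proc_time_ms.update(int((time.time() - start) * 1000))
            yield result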

It then became clear that the problem was a memory leak in spaCy 2.1.8.

Upgrading to spaCy 2.2.3 solved the problem. The 32 GB of data are now processed in 4 h 30 min without any issue.
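If it helps anyone: one way to make sure the workers get the fixed version is to pin it in the dependency file staged to Dataflow (assuming dependencies are shipped with --requirements_file, the standard Beam Python option for installing pip packages on the workers):

    # requirements.txt, staged with --requirements_file=requirements.txt
    spacy==2.2.3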