Can't Successfully Run AWS Glue Job That Reads From DynamoDB
I successfully ran crawlers that read my tables in DynamoDB and AWS Redshift. The tables are now in the Glue Data Catalog.
My problem is running the Glue job that reads data from DynamoDB into Redshift. It does not seem to be able to read from DynamoDB.
The error log contains this:
2022-02-01 10:16:55,821 WARN [task-result-getter-0] scheduler.TaskSetManager (Logging.scala:logWarning(69)): Lost task 0.0 in stage 0.0 (TID 0) (172.31.74.37 executor 1): java.lang.RuntimeException: Could not lookup table <TABLE-NAME> in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:143)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:58)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:152)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:136)
at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:610)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:132)
... 23 more
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access0(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:6164)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:6131)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:2228)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:2193)
at org.apache.hadoop.dynamodb.DynamoDBClient.call(DynamoDBClient.java:136)
at org.apache.hadoop.dynamodb.DynamoDBClient.call(DynamoDBClient.java:133)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
... 24 more
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to dynamodb.us-east-1.amazonaws.com:443 [dynamodb.us-east-1.amazonaws.com/3.218.180.106] failed: connect timed out
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy20.connect(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1331)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
... 38 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368)
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
... 53 more
The full log contains the following:
22/02/01 10:06:07 INFO GlueContext: Glue secret manager integration: secretId is not provided.
The role granted to Glue has administrator access.
Below is the script code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node S3 bucket
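# (Despite the generated node name, this is presumably the DynamoDB-backed
# catalog table, given the DynamoDB read error above.)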
S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="db_s3_table",
    transformation_ctx="S3bucket_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("column1.s", "string", "column1", "string"),
        ("column2.n", "string", "column2", "long"),
        ("column3.s", "string", "column3", "string"),
        ("partition_0", "string", "partition0", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)
# Script generated for node Redshift Cluster
RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=ApplyMapping_node2,
    database="db",
    table_name="db_redshift_db_schema_table",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="RedshiftCluster_node3",
)
job.commit()
It looks like you are missing a VPC endpoint for DynamoDB, since your Glue job runs inside a private VPC when it writes to Redshift. The connect timeout to dynamodb.us-east-1.amazonaws.com in the trace points to networking, not IAM, which is why the role's administrator access does not help.
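For reference, here is a minimal sketch of creating such a Gateway endpoint with boto3. The region matches the endpoint in the error log; the VPC ID and route table ID are placeholders you would replace with the values from your Glue connection's VPC:

import boto3

# Create a Gateway VPC endpoint so traffic to DynamoDB is routed inside the
# VPC instead of going out to the public endpoint (which times out here).
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder: the Glue connection's VPC
    ServiceName="com.amazonaws.us-east-1.dynamodb",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder: route table(s) of the job's subnet
)
print(response["VpcEndpoint"]["VpcEndpointId"])

You can create the same endpoint from the VPC console instead (Endpoints, Create endpoint, service com.amazonaws.us-east-1.dynamodb, type Gateway); Gateway endpoints for DynamoDB carry no extra charge. Once the endpoint's route is in place, the DescribeTable call that currently times out should succeed.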