How to reduce log size for sqoop export
Is there a way to control the size of the logs created by sqoop export? I'm trying to export a series of parquet files from a hadoop cluster to microsoft sql server, and I've found that after a certain point in the mapper jobs, progress becomes very slow or freezes outright. Looking at the hadoop Resourcemanager, my current theory is that the logs from the sqoop job are filling up to a size that causes the process to freeze. I'm new to hadoop, so any suggestions would be much appreciated. Thanks.
Update
Looking at the syslog output in the resourcemanager web interface for one of the frozen map task jobs of the sqoop jar application, the log output looks like this:
2017-11-14 16:26:52,243 DEBUG [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 8758 8840 ]
2017-11-14 16:26:52,243 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: reading next wrapped RPC packet
2017-11-14 16:26:52,243 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 sending #280
2017-11-14 16:26:52,243 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.security.SaslRpcClient: wrapping token of length:751
2017-11-14 16:26:52,246 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: unwrapping token of length:62
2017-11-14 16:26:52,246 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 got value #280
2017-11-14 16:26:52,246 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
2017-11-14 16:26:55,252 DEBUG [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 8758 8840 ]
2017-11-14 16:26:55,252 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: reading next wrapped RPC packet
2017-11-14 16:26:55,252 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 sending #281
2017-11-14 16:26:55,252 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.security.SaslRpcClient: wrapping token of length:751
2017-11-14 16:26:55,254 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: unwrapping token of length:62
2017-11-14 16:26:55,255 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 got value #281
2017-11-14 16:26:55,255 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
Also, after letting the process run for a full day, it looks like the sqoop job does eventually complete, but it takes an extremely long time (around 4 hours for ~500MB of .tsv data).
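For a sense of scale, that works out to roughly 500 MB / (4 × 3600 s) ≈ 0.035 MB/s, or about 35 KB/s, which is very slow for an export running with 24 mappers.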
To address the title of the posted question: you can control the log output of the sqoop command either by editing the log4j.properties file in the $HADOOP_HOME/etc/hadoop directory (since sqoop apparently uses this to inherit its log properties, though from what I can tell this may not be the case with sqoop2), or by passing generic arguments with the -D prefix in the sqoop invocation, for example:
sqoop export \
-Dyarn.app.mapreduce.am.log.level=WARN \
-Dmapreduce.map.log.level=WARN \
-Dmapreduce.reduce.log.level=WARN \
--connect "$connectionstring" \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--table $tablename \
--export-dir /tmp/${tablename^^}_export \
--num-mappers 24 \
--direct \
--batch \
--input-fields-terminated-by '\t'
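For the log4j.properties route, here is a minimal sketch of the edit, assuming the stock Hadoop 2.x layout of that file (the exact keys may differ by distribution):

# $HADOOP_HOME/etc/hadoop/log4j.properties
# Raise the default threshold from INFO to WARN so DEBUG/INFO chatter is dropped
hadoop.root.logger=WARN,console
log4j.rootLogger=${hadoop.root.logger}, EventCounter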
However, my initial theory from the body of the post (that the sqoop job's logs were filling up to a size that caused the process to freeze) did not seem to hold up. With the log level lowered, the log sizes for the map tasks dropped to 0 bytes in the resourcemanager ui, but the system still progressed at the same rate, reaching a certain percentage and then dropping to very slow speeds.
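As an aside, one way to sanity-check the actual aggregated log volume outside the web UI (assuming YARN log aggregation is enabled; the application id here is inferred from the job id in the syslog above) is:

# Count the bytes of aggregated logs for the finished application
yarn logs -applicationId application_1502069985038_3490 | wc -c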