
Spark Standalone Cluster Mode: Installation, Configuration, and High Availability

Published: 2017-12-06 16:21   Category: Hadoop/Spark

1. Manually starting a Spark cluster

Start the standalone master:

./sbin/start-master.sh

Once started, the master prints out a Spark URL of the form spark://HOST:PORT. This URL can be passed to SparkContext as the master parameter. You can also visit http://HOST:8080 to view the master's web UI.
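
For example, assuming the master runs on a host named node1 (a placeholder name) with the default port, a Spark shell can be pointed at it and will then show up under "Running Applications" on the web UI:

./bin/spark-shell --master spark://node1:7077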

Similarly, start one or more workers and connect them to the master:

./sbin/start-slave.sh <master-spark-URL>

Once a worker has started, it shows up on the master's web UI.
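
For instance, on each worker machine, again assuming the master above runs on node1:

./sbin/start-slave.sh spark://node1:7077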

The following arguments can be passed to both the master and the worker:

  • -h HOST, --host HOST: hostname to listen on
  • -i HOST, --ip HOST: deprecated, use -h instead
  • -p PORT, --port PORT: port to listen on (default: 7077 for the master, random for a worker)
  • --webui-port PORT: port for the web UI (default: 8080 for the master, 8081 for a worker)
  • -c CORES, --cores CORES: number of CPU cores Spark may use (default: all cores); worker only
  • -m MEM, --memory MEM: amount of memory Spark may use, e.g. 1000M or 2G (default: the machine's physical memory minus 1 GB); worker only
  • -d DIR, --work-dir DIR: directory for scratch space and job logs (default: SPARK_HOME/work); worker only
  • --properties-file FILE: path to a custom Spark properties file to load (default: conf/spark-defaults.conf)
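
As an illustration (node1 and the work directory are placeholders), a worker could be started with an explicit core and memory cap, a non-default web UI port, and a custom work directory:

./sbin/start-slave.sh spark://node1:7077 -c 4 -m 8G --webui-port 8082 -d /data/spark-work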

2. Cluster launch scripts

To launch a standalone cluster with the launch scripts, first create a file conf/slaves that lists the hostnames of all worker machines, one per line. If conf/slaves does not exist, only the local node is started. The master accesses each worker over SSH, so passwordless (key-based) SSH login must be set up between the cluster machines. If passwordless SSH is not available, you can set the SPARK_SSH_FOREGROUND environment variable and enter a password for each worker in turn.
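
A minimal sketch, assuming three workers named node2, node3 and node4 (placeholder hostnames) and the same Unix user on every machine:

# conf/slaves -- one worker hostname per line
node2
node3
node4

# on the master, set up passwordless SSH to each worker
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id node2
ssh-copy-id node3
ssh-copy-id node4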

Once conf/slaves is in place, you can start or stop the Spark cluster with the following scripts:

  • sbin/start-master.sh: starts a master instance
  • sbin/start-slaves.sh: starts a slave (worker) instance on every machine listed in conf/slaves
  • sbin/start-slave.sh: starts a slave instance on the current node
  • sbin/start-all.sh: starts the master and all slave instances
  • sbin/stop-master.sh: stops the master instance
  • sbin/stop-slaves.sh: stops all slave instances
  • sbin/stop-all.sh: stops the master and all slave instances
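
A typical bring-up from the master machine then looks like this (a sketch; the path is a placeholder for your Spark installation):

cd /path/to/spark          # your SPARK_HOME
./sbin/start-all.sh        # master on this machine, one worker per line in conf/slaves
jps                        # should show a Master process here and a Worker on each slave
./sbin/stop-all.sh         # later: shut the whole cluster down again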

You can optionally configure the cluster further through conf/spark-env.sh, created by copying conf/spark-env.sh.template. It supports the environment variables below; a minimal example follows the table.

  • SPARK_MASTER_HOST: bind the master to a specific hostname or IP address, e.g. a public one
  • SPARK_MASTER_PORT: port for the master (default: 7077)
  • SPARK_MASTER_WEBUI_PORT: port for the master's web UI (default: 8080)
  • SPARK_MASTER_OPTS: configuration properties that apply only to the master, see the table further below
  • SPARK_LOCAL_DIRS: directory Spark uses for scratch space (map output files, RDDs stored on disk)
  • SPARK_WORKER_CORES: number of CPU cores Spark may use on the machine (default: all cores)
  • SPARK_WORKER_MEMORY: amount of memory Spark may use, e.g. 1000M or 2G (default: physical memory minus 1 GB)
  • SPARK_WORKER_PORT: port for the worker (default: random)
  • SPARK_WORKER_WEBUI_PORT: port for the worker's web UI (default: 8081)
  • SPARK_WORKER_DIR: working directory of the worker (default: SPARK_HOME/work)
  • SPARK_WORKER_OPTS: configuration properties that apply only to the worker, see the table further below
  • SPARK_DAEMON_MEMORY: memory allocated to the master and worker daemon processes themselves
  • SPARK_DAEMON_JAVA_OPTS: JVM options for the master and worker daemons
  • SPARK_PUBLIC_DNS: public DNS name of the master and workers
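
A minimal conf/spark-env.sh sketch, assuming the master runs on node1 and each worker should hand 4 cores and 8 GB to Spark (all values are placeholders; copy the same file to every machine in the cluster):

# conf/spark-env.sh -- created from conf/spark-env.sh.template
export SPARK_MASTER_HOST=node1
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_DIR=/data/spark-work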

SPARK_MASTER_OPTS supports the following system properties:

  • spark.deploy.retainedApplications (default: 200): The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit.
  • spark.deploy.retainedDrivers (default: 200): The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit.
  • spark.deploy.spreadOut (default: true): Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
  • spark.deploy.defaultCores (default: infinite): Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default.
  • spark.deploy.maxExecutorRetries (default: 10): Limit on the maximum number of back-to-back executor failures that can occur before the standalone cluster manager removes a faulty application. An application will never be removed if it has any running executors. If an application experiences more than spark.deploy.maxExecutorRetries failures in a row, no executors successfully start running in between those failures, and the application has no running executors, then the standalone cluster manager will remove the application and mark it as failed. To disable this automatic removal, set spark.deploy.maxExecutorRetries to -1.
  • spark.worker.timeout (default: 60): Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats.
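
For example, to cap the default per-application core allocation and keep fewer finished applications in the UI, one might add to conf/spark-env.sh (illustrative values):

export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 -Dspark.deploy.retainedApplications=50"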

SPARK_WORKER_OPTS supports the following system properties:

  • spark.worker.cleanup.enabled (default: false): Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
  • spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
  • spark.worker.cleanup.appDataTtl (default: 604800, i.e. 7 days): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
  • spark.worker.ui.compressedLogFileLengthCacheSize (default: 100): For compressed log files, the uncompressed file size can only be computed by uncompressing the files. Spark caches the uncompressed file size of compressed log files. This property controls the cache size.
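
For example, to have each worker purge work directories of stopped applications older than three days, checking once per hour (illustrative values):

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=3600 -Dspark.worker.cleanup.appDataTtl=259200"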

3. Using ZooKeeper to manage multiple masters

To run multiple masters for high availability, set SPARK_DAEMON_JAVA_OPTS on every master, pointing spark.deploy.zookeeper.url at a comma-separated list of ZooKeeper host:port pairs (zk1, zk2 and zk3 below are placeholder hostnames):

export SPARK_DAEMON_JAVA_OPTS='-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark'

The ZooKeeper settings must be configured correctly (and identically on every master); otherwise all the masters may end up acting as leader at the same time, each running its own applications.

When connecting a SparkContext, you can list several masters at once, e.g. spark://host1:port1,host2:port2.
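
For example, with masters on node1 and node2 (placeholder hostnames), an application started like this registers with both and fails over automatically if the current leader dies:

./bin/spark-shell --master spark://node1:7077,node2:7077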
