Start a standalone master node:
./sbin/start-master.sh
Once started, the master prints out a spark://HOST:PORT URL for itself. You can pass this URL to SparkContext as the master parameter, and you can visit http://HOST:8080 to view the master's web UI.
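For example, once the master is up you can point an interactive shell at it; the URL below is a placeholder for whatever address your master actually printed:

./bin/spark-shell --master spark://HOST:7077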
Similarly, start one or more workers and connect them to the master:
./sbin/start-slave.sh <master-spark-URL>
Once a worker has started, it shows up on the master's web UI.
The following arguments can be passed to both the master and the workers (an example invocation follows the table):
Argument | Meaning |
---|---|
-h HOST, --host HOST | Hostname to listen on |
-i HOST, --ip HOST | Hostname to listen on (deprecated, use -h) |
-p PORT, --port PORT | Port to listen on (default: 7077 for the master, random for a worker) |
--webui-port PORT | Port for the web UI (default: 8080 for the master, 8081 for a worker) |
-c CORES, --cores CORES | Number of CPU cores Spark applications may use on the machine (default: all cores); worker only |
-m MEM, --memory MEM | Amount of memory Spark applications may use on the machine, e.g. 1000M or 1G (default: the machine's total RAM minus 1 GB); worker only |
-d DIR, --work-dir DIR | Directory to use for scratch space and job output logs (default: SPARK_HOME/work); worker only |
--properties-file FILE | Path to a custom Spark properties file to load (default: conf/spark-defaults.conf) |
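For instance, assuming a (hypothetical) master host named master-host, the arguments above can be combined when starting the daemons, roughly like this:

./sbin/start-master.sh --host master-host --webui-port 8080
./sbin/start-slave.sh spark://master-host:7077 --cores 4 --memory 8G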
To launch a standalone cluster with the provided launch scripts, create a file called conf/slaves and list the hostnames of all worker machines in it, one per line. If conf/slaves does not exist, only the local node is started. The master accesses each worker over ssh, so password-less ssh between the cluster machines is required. If you do not have password-less ssh set up, you can set the SPARK_SSH_FOREGROUND environment variable and provide the password for each worker serially.
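A minimal conf/slaves might look like this; the hostnames are placeholders for your own worker machines:

worker1.example.com
worker2.example.com
worker3.example.com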
Once conf/slaves is in place, you can start or stop the cluster with the following scripts:
sbin/start-master.sh - Starts a master instance on the current machine
sbin/start-slaves.sh - Starts all slave instances
sbin/start-slave.sh - Starts a slave instance on the current machine
sbin/start-all.sh - Starts the master and all slave instances
sbin/stop-master.sh - Stops the master instance
sbin/stop-slaves.sh - Stops all slave instances
sbin/stop-all.sh - Stops the master and all slave instances
You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh. Create this file by copying conf/spark-env.sh.template (a sample follows the table below).
Environment Variable | Meaning |
---|---|
SPARK_MASTER_HOST | Bind the master to a specific hostname or IP address, e.g. a public one |
SPARK_MASTER_PORT | Port for the master (default: 7077) |
SPARK_MASTER_WEBUI_PORT | Port for the master's web UI (default: 8080) |
SPARK_MASTER_OPTS | Configuration properties that apply only to the master; see the table below for details |
SPARK_LOCAL_DIRS | Directory to use for scratch space in Spark, including map output files and RDDs that get stored on disk |
SPARK_WORKER_CORES | Number of CPU cores Spark applications may use (default: all cores) |
SPARK_WORKER_MEMORY | Amount of memory Spark applications may use, e.g. 1000M or 2G (default: total memory minus 1 GB) |
SPARK_WORKER_PORT | Port for the worker (default: random) |
SPARK_WORKER_WEBUI_PORT | Port for the worker's web UI (default: 8081) |
SPARK_WORKER_DIR | Directory to run applications in, including logs and scratch space (default: SPARK_HOME/work) |
SPARK_WORKER_OPTS | Configuration properties that apply only to the worker; see the table below for details |
SPARK_DAEMON_MEMORY | Memory to allocate to the Spark master and worker daemons themselves |
SPARK_DAEMON_JAVA_OPTS | JVM options for the Spark master and worker daemons themselves |
SPARK_PUBLIC_DNS | Public DNS name of the Spark master and workers |
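As a sketch, a conf/spark-env.sh using a few of these variables could look like the following; all values are illustrative and should be adjusted to your hardware:

SPARK_MASTER_HOST=192.168.1.100
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=16G
SPARK_WORKER_DIR=/data/spark/work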
SPARK_MASTER_OPTS supports the following system properties (an example of setting it appears after the table):
Property | Default | Meaning |
---|---|---|
spark.deploy.retainedApplications | 200 | The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit. |
spark.deploy.retainedDrivers | 200 | The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit. |
spark.deploy.spreadOut | true | Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads. |
spark.deploy.defaultCores | (infinite) | Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default. |
spark.deploy.maxExecutorRetries | 10 | Limit on the maximum number of back-to-back executor failures that can occur before the standalone cluster manager removes a faulty application. An application will never be removed if it has any running executors. If an application experiences more than spark.deploy.maxExecutorRetries failures in a row, no executors successfully start running in between those failures, and the application has no running executors then the standalone cluster manager will remove the application and mark it as failed. To disable this automatic removal, set spark.deploy.maxExecutorRetries to -1. |
spark.worker.timeout | 60 | Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats. |
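These properties are passed to the master as -D options, typically via SPARK_MASTER_OPTS in conf/spark-env.sh; for example (values are illustrative):

export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 -Dspark.worker.timeout=120"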
SPARK_WORKER_OPTS supports the following system properties (an example appears after the table):
Property | Default | Meaning |
---|---|---|
spark.worker.cleanup.enabled | false | Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up. |
spark.worker.cleanup.interval | 1800 (30 minutes) | Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine. |
spark.worker.cleanup.appDataTtl | 604800 (7 days, 7 * 24 * 3600) | The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently. |
spark.worker.ui.compressedLogFileLengthCacheSize | 100 | For compressed log files, the uncompressed file can only be computed by uncompressing the files. Spark caches the uncompressed file size of compressed log files. This property controls the cache size. |
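For example, to enable periodic cleanup and shorten the retention of application work directories to 3 days (259200 seconds), you could add something like this to conf/spark-env.sh:

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=259200"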
To run multiple masters for high availability, set SPARK_DAEMON_JAVA_OPTS so the masters coordinate through ZooKeeper, for example:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=host1:2181,host2:2181 -Dspark.deploy.zookeeper.dir=/spark"
The ZooKeeper settings must be configured correctly and identically on every master; otherwise the masters may each consider themselves the leader and run their own jobs independently.
When connecting a SparkContext, you can list multiple masters, e.g. spark://host1:port1,host2:port2.
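For example, an interactive shell run against a ZooKeeper-backed pair of masters (hostnames and ports are placeholders):

./bin/spark-shell --master spark://host1:7077,host2:7077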