Start a standalone master node:
./sbin/start-master.sh
Once started, the master prints out a spark://HOST:PORT URL for itself. You can pass this URL to SparkContext as the master parameter, and you can visit http://HOST:8080 to view the master's web UI.
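For example, once the master is up you can point an interactive shell at it; the URL below is a placeholder for whatever address your master actually printed:

./bin/spark-shell --master spark://HOST:7077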
Similarly, start one or more workers and connect them to the master:
./sbin/start-slave.sh <master-spark-URL>
Once a worker has started, it shows up on the master's web UI.
The following arguments can be passed to both the master and the workers (an example invocation follows the table):
Argument | Meaning |
---|---|
-h HOST, --host HOST | Hostname to listen on |
-i HOST, --ip HOST | Hostname to listen on (deprecated, use -h) |
-p PORT, --port PORT | Port to listen on (default: 7077 for the master, random for a worker) |
--webui-port PORT | Port for the web UI (default: 8080 for the master, 8081 for a worker) |
-c CORES, --cores CORES | Number of CPU cores Spark applications may use on the machine (default: all cores); worker only |
-m MEM, --memory MEM | Amount of memory Spark applications may use on the machine, e.g. 1000M or 1G (default: the machine's total RAM minus 1 GB); worker only |
-d DIR, --work-dir DIR | Directory to use for scratch space and job output logs (default: SPARK_HOME/work); worker only |
--properties-file FILE | Path to a custom Spark properties file to load (default: conf/spark-defaults.conf) |
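For instance, assuming a (hypothetical) master host named master-host, the arguments above can be combined when starting the daemons, roughly like this:

./sbin/start-master.sh --host master-host --webui-port 8080
./sbin/start-slave.sh spark://master-host:7077 --cores 4 --memory 8G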
To launch a standalone cluster with the provided launch scripts, create a file called conf/slaves and list the hostnames of all worker machines in it, one per line. If conf/slaves does not exist, only the local node is started. The master accesses each worker over ssh, so password-less ssh between the cluster machines is required. If you do not have password-less ssh set up, you can set the SPARK_SSH_FOREGROUND environment variable and provide the password for each worker serially.
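A minimal conf/slaves might look like this; the hostnames are placeholders for your own worker machines:

worker1.example.com
worker2.example.com
worker3.example.com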
Once conf/slaves is in place, you can start or stop the cluster with the following scripts:
sbin/start-master.sh - Starts a master instance on the current machine
sbin/start-slaves.sh - Starts all slave instances
sbin/start-slave.sh - Starts a slave instance on the current machine
sbin/start-all.sh - Starts the master and all slave instances
sbin/stop-master.sh - Stops the master instance
sbin/stop-slaves.sh - Stops all slave instances
sbin/stop-all.sh - Stops the master and all slave instances
You can optionally configure the cluster further by setting environment variables in conf/spark-env.sh. Create this file by copying conf/spark-env.sh.template (a sample follows the table below).
Environment Variable | Meaning |
---|---|
SPARK_MASTER_HOST | Bind the master to a specific hostname or IP address, e.g. a public one |
SPARK_MASTER_PORT | Port for the master (default: 7077) |
SPARK_MASTER_WEBUI_PORT | Port for the master's web UI (default: 8080) |
SPARK_MASTER_OPTS | Configuration properties that apply only to the master; see the table below for details |
SPARK_LOCAL_DIRS | Directory to use for scratch space in Spark, including map output files and RDDs that get stored on disk |
SPARK_WORKER_CORES | Number of CPU cores Spark applications may use (default: all cores) |
SPARK_WORKER_MEMORY | Amount of memory Spark applications may use, e.g. 1000M or 2G (default: total memory minus 1 GB) |
SPARK_WORKER_PORT | Port for the worker (default: random) |
SPARK_WORKER_WEBUI_PORT | Port for the worker's web UI (default: 8081) |
SPARK_WORKER_DIR | Directory to run applications in, including logs and scratch space (default: SPARK_HOME/work) |
SPARK_WORKER_OPTS | Configuration properties that apply only to the worker; see the table below for details |
SPARK_DAEMON_MEMORY | Memory to allocate to the Spark master and worker daemons themselves |
SPARK_DAEMON_JAVA_OPTS | JVM options for the Spark master and worker daemons themselves |
SPARK_PUBLIC_DNS | Public DNS name of the Spark master and workers |
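As a sketch, a conf/spark-env.sh using a few of these variables could look like the following; all values are illustrative and should be adjusted to your hardware:

SPARK_MASTER_HOST=192.168.1.100
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=16G
SPARK_WORKER_DIR=/data/spark/work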
SPARK_MASTER_OPTS supports the following system properties (an example of setting it appears after the table):
Property | Default | Meaning |
---|---|---|
spark.deploy.retainedApplications | 200 | The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit. |
spark.deploy.retainedDrivers | 200 | The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit. |
spark.deploy.spreadOut | true | Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads. |
spark.deploy.defaultCores | (infinite) | Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default. |
spark.deploy.maxExecutorRetries | 10 | Limit on the maximum number of back-to-back executor failures that can occur before the standalone cluster manager removes a faulty application. An application will never be removed if it has any running executors. If an application experiences more than spark.deploy.maxExecutorRetries failures in a row, no executors successfully start running in between those failures, and the application has no running executors then the standalone cluster manager will remove the application and mark it as failed. To disable this automatic removal, set spark.deploy.maxExecutorRetries to -1. |
spark.worker.timeout | 60 | Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats. |
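These properties are passed to the master as -D options, typically via SPARK_MASTER_OPTS in conf/spark-env.sh; for example (values are illustrative):

export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 -Dspark.worker.timeout=120"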
SPARK_WORKER_OPTS supports the following system properties (an example appears after the table):
Property | Default | Meaning |
---|---|---|
spark.worker.cleanup.enabled | false | Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up. |
spark.worker.cleanup.interval | 1800 (30 minutes) | Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine. |
spark.worker.cleanup.appDataTtl | 604800 (7 days, 7 * 24 * 3600) | The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently. |
spark.worker.ui.compressedLogFileLengthCacheSize | 100 | For compressed log files, the uncompressed file can only be computed by uncompressing the files. Spark caches the uncompressed file size of compressed log files. This property controls the cache size. |
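For example, to enable periodic cleanup and shorten the retention of application work directories to 3 days (259200 seconds), you could add something like this to conf/spark-env.sh:

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=259200"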
To run multiple masters for high availability, set SPARK_DAEMON_JAVA_OPTS so the masters coordinate through ZooKeeper, for example:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=host1:2181,host2:2181 -Dspark.deploy.zookeeper.dir=/spark"
The ZooKeeper settings must be configured correctly and identically on every master; otherwise the masters may each consider themselves the leader and run their own jobs independently.
When connecting a SparkContext, you can list multiple masters, e.g. spark://host1:port1,host2:port2.
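For example, an interactive shell run against a ZooKeeper-backed pair of masters (hostnames and ports are placeholders):

./bin/spark-shell --master spark://host1:7077,host2:7077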