Introduction

Compose is Docker's official orchestration tool. By writing a simple template file, users can quickly build and manage an application cluster based on Docker containers. Its positioning is "defining and running multi-container Docker applications": a single YAML-format template file defines a group of related application containers as one project.

Docker Hub page for the image used in this article: https://hub.docker.com/r/sequenceiq/spark/


Installation

My environment: CentOS 7.3, Docker version 1.12.6

Installing with pip

sudo pip install -U docker-compose 
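
If pip is not convenient on your machine, Compose can also be installed as a standalone binary from the GitHub releases page. A minimal sketch, assuming you want the same 1.17.1 release used below:

# download the docker-compose binary for this platform (release 1.17.1 assumed)
sudo curl -L "https://github.com/docker/compose/releases/download/1.17.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose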

Verification

# docker-compose version

docker-compose version 1.17.1, build 6d101fb
docker-py version: 2.6.1
CPython version: 2.7.5
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013

Add bash command completion:

curl -L https://raw.githubusercontent.com/docker/compose/1.17.1/contrib/completion/bash/docker-compose > /etc/bash_completion.d/docker-compose

Note that the URL must be entered on a single line. If it is split across lines or contains a stray space, curl fails and only a short error page is saved, with output like the following; in that case, re-run the command with the URL intact:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    15  100    15    0     0     12      0  0:00:01  0:00:01 --:--:--    12
curl: (3) <url> malformed
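
After a successful download, new login shells pick up the completion automatically; to enable it in the current shell right away, the file can simply be sourced:

source /etc/bash_completion.d/docker-compose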

Downloading the image

The sequenceiq/spark image ships with the complete set of Spark dependencies pre-installed; pull it to the local machine first:

docker pull sequenceiq/spark:1.6.0

Create a docker-compose.yml file with the following content:

# http://github.com/yeasy/docker-compose-files
# This compose file will start spark master node and the worker node.
# All nodes will become a cluster automatically.
# You can run: docker-compose scale worker=2
# After startup, try submit a pi calculation application.
#  /usr/local/spark/bin/spark-submit --master spark://master:7077 --class org.apache.spark.examples.SparkPi /usr/local/spark/lib/spark-examples-1.4.0-hadoop2.6.0.jar 1000

master:
  image: sequenceiq/spark:1.6.0
  hostname: master
  ports:
  - "4040:4040"
  - "8042:8042"
  - "7077:7077"
  - "8088:8088"
  - "8080:8080"
  restart: always
  #mem_limit: 1024m
  command: bash /usr/local/spark/sbin/start-master.sh && ping localhost > /dev/null

worker:
  image: sequenceiq/spark:1.6.0
  links:
  - master:master
  expose:
  - "8081"
  restart: always
  command: bash /usr/local/spark/sbin/start-slave.sh spark://master:7077 && ping localhost >/dev/null
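
Before starting anything, it is worth validating the file. docker-compose config parses the template and prints the resolved configuration, or reports a syntax error:

docker-compose config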

Walkthrough of the file:

The master service

First, the master service maps several ports to the host:

* 4040: the web UI that Spark provides while a job is running, showing its detailed execution status, including which stage it has reached and on which executor it is running;
* 8042: the Hadoop NodeManager web UI;
* 7077: the port the Spark master listens on; users submit applications to this port, and worker nodes also connect to it to form the cluster;
* 8080: the Spark monitoring UI, showing all workers and overall application information;
* 8088: the overall monitoring UI for the Hadoop cluster.
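
Once the cluster described below is running, the published ports can be spot-checked from the host; a quick sketch:

curl -s http://localhost:8080 | head   # Spark master web UI
curl -s http://localhost:8088 | head   # Hadoop cluster UI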

After the master service starts, it runs bash /usr/local/spark/sbin/start-master.sh to configure itself as the master node, and then uses ping to keep the container from exiting.
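
The ping is purely a keep-alive, so that the container's main process does not exit once the startup script returns. A hypothetical alternative with the same effect (not required by the image) would be:

# hypothetical alternative value for the master's command: field, same keep-alive idea
command: bash -c "/usr/local/spark/sbin/start-master.sh && tail -f /dev/null"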


The worker service

Similar to the master node: after starting, it runs /usr/local/spark/sbin/start-slave.sh spark://master:7077 to configure itself as a worker node, and then uses ping to keep the container from exiting.

Note that the startup script must be followed by the spark://master:7077 argument, which specifies the address of the master node.

The web UI served on port 8081 shows the detailed execution status of tasks on that worker node.
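
Port 8081 is only exposed to linked containers, not published to the host, so it is not reachable as localhost:8081. One way to look at it (a sketch; the container name is an assumption, check docker ps) is to find the worker container's IP and browse to it directly:

docker inspect -f '{{ .NetworkSettings.IPAddress }}' sparkcompose_worker_1
# then open http://<container-ip>:8081 from the host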


For details on the Compose template file format, see: http://www.jianshu.com/p/2217cfed29d7


Starting the cluster

After creating the file, run the following in the directory that contains it:

docker-compose up

You should then see output like the following:

Creating sparkcompose_master_1...
Creating sparkcompose_slave_1...
Attaching to sparkcompose_master_1, sparkcompose_slave_1
master_1 | /
master_1 | Starting sshd: [  OK  ]
slave_1  | /
slave_1  | Starting sshd: [  OK  ]
master_1 | Starting namenodes on [master]
slave_1  | Starting namenodes on [5d0ea02da185]
master_1 | master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-master.out
slave_1  | 5d0ea02da185: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-5d0ea02da185.out
master_1 | localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-master.out
slave_1  | localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-5d0ea02da185.out
master_1 | Starting secondary namenodes [0.0.0.0]
slave_1  | Starting secondary namenodes [0.0.0.0]
master_1 | 0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-master.out
master_1 | starting yarn daemons
master_1 | starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-master.out
master_1 | localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-master.out
master_1 | starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-1.4.0-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-master.out
slave_1  | 0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-5d0ea02da185.out
slave_1  | starting yarn daemons
slave_1  | starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-5d0ea02da185.out
slave_1  | localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-5d0ea02da185.out
slave_1  | starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-1.4.0-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.worker.Worker-1-5d0ea02da185.out

Then open another terminal. Once the docker-compose services are up, you can also use the scale command to dynamically grow the number of Spark worker nodes, for example:

$ docker-compose scale worker=2
Creating and starting 2... done
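
You can confirm which containers are running, and how their ports are mapped, with the standard Compose status commands:

docker-compose ps            # list service containers and their port mappings
docker-compose logs worker   # show output from the worker service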

Testing

Enter the master node's container (the container name is derived from the Compose project/directory name, so it may differ on your machine):

docker exec -it root_master_1 /bin/bash
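
If that name does not exist on your machine, list the running containers first and substitute the actual master container name (the placeholder below is hypothetical):

docker ps --format '{{.Names}}'
docker exec -it <master-container-name> /bin/bash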

Start the Spark shell:

bash-4.1# spark-shell

The following is displayed:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
scala> 

Enter a Scala command to check that Spark is working:

scala> sc.parallelize(1 to 1000).count()
res0: Long = 1000                                                               

scala> 

Running an application

Spark recommends using the spark-submit command to submit applications; the basic syntax is:

spark-submit \
--class your-class-name \
--master master_url \
your-jar-file \
app_params

For example, we can run the Pi-calculation example that ships with Spark.

On the master node, run:

/usr/local/spark/bin/spark-submit --master spark://master:7077 --class org.apache.spark.examples.SparkPi /usr/local/spark/lib/spark-examples-1.4.0-hadoop2.6.0.jar 1000
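
The examples jar filename depends on the Spark version baked into the image; if the path above does not exist, list the lib directory inside the master container first and adjust the jar name accordingly (a small sketch):

ls /usr/local/spark/lib/ | grep -i examples

On success, the spark-submit output includes a line beginning with "Pi is roughly".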

With that, a Spark cluster built quickly from a Docker image is up and running. However, if you really want to learn Spark well, you should still build the environment component by component yourself; when setting up an environment, it is important to understand why each step is needed.

Here is a complete step-by-step setup guide: https://www.cnblogs.com/jasonfreak/p/5391190.html
