Running Spark on Docker
I have recently been reading a fair amount of Docker documentation and running related experiments, and I wondered whether Spark could be run on Docker. A quick Google search turned up two options:
1. https://registry.hub.docker.com/u/amplab/spark-master/ and https://github.com/amplab/docker-scripts (these two appear to be the same project, though the former provides no guide)
2. The official Spark source tree ships a docker subproject
This article uses Spark's own docker subproject to build a Spark cluster.
Installing Docker itself is not covered here; please refer to the official documentation.
Let's first look at the directory layout of the docker subproject:
docker
    build
    readme.md
    spark-test
        build
        readme.md
        base
            dockerfile
        master
            dockerfile
            default_cmd
        worker
            dockerfile
            default_cmd
From this layout you can see that three images will be built: base, master, and worker.
docker/build: this script simply calls the build script under spark-test; the only thing it does itself is check that docker commands can be run without sudo:
docker images > /dev/null || { echo Please install docker in non-sudo mode. ; exit; }
./spark-test/build
If this check fails (i.e. docker still requires sudo on your host), the fix is:
1. Create a docker group if one does not exist yet:
sudo groupadd docker
2. Add your user to that group, then log out and back in for the change to take effect:
sudo gpasswd -a ${USER} docker
3. Restart docker:
sudo service docker restart
Done!
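A quick way to confirm the change took effect; both commands below are standard utilities, nothing project-specific:
# "docker" should now appear among your groups
id -nG
# should print the image list without a permission error and without sudo
docker images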
docker/spark-test/build: this script invokes the Dockerfiles underneath it and builds each image in turn:
docker build -t spark-test-base spark-test/base/
docker build -t spark-test-master spark-test/master/
docker build -t spark-test-worker spark-test/worker/
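Once the build finishes, docker images should list all three images under the names given in the build commands above:
docker images
# expect spark-test-base, spark-test-master and spark-test-worker in the listing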
docker/spark-test/readme.md
Spark Docker files usable for testing and development purposes.
These images are intended to be run like so:
docker run -v $SPARK_HOME:/opt/spark spark-test-master
docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://<master_ip>:7077
Using this configuration, the containers will have their Spark directories
mounted to your actual `SPARK_HOME`, allowing you to modify and recompile
your Spark source and have them immediately usable in the docker images
(without rebuilding them).
A few notes are in order here.
On the host machine:
Install Scala 2.10.x.
Install Spark 1.1.0.
Edit /etc/profile (sudo vi /etc/profile) to add the $SPARK_HOME, $SCALA_HOME, $SPARK_BIN and $SCALA_BIN environment variables, then run source /etc/profile to apply them; a sketch follows below.
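A minimal sketch of what those /etc/profile additions might look like; the install paths here are placeholders, adjust them to wherever you actually unpacked Scala and Spark:
# hypothetical install locations, adjust to your own layout
export SCALA_HOME=/opt/scala-2.10.4
export SPARK_HOME=/opt/spark-1.1.0
export SCALA_BIN=$SCALA_HOME/bin
export SPARK_BIN=$SPARK_HOME/bin
export PATH=$PATH:$SCALA_BIN:$SPARK_BIN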
Open another shell and start the master. The master's IP address is printed on its console; you will need this IP when starting the worker.
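Concretely, starting the master is just the first docker run command from the readme above; the CONTAINER_IP line in the output comes from the echo in default_cmd:
docker run -v $SPARK_HOME:/opt/spark spark-test-master
# the console prints a line of the form: CONTAINER_IP=<master_ip>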
Likewise, open another shell and start the worker. Both master and worker print some log output to their consoles, and neither shell is interactive. If you need interactive access, you can either modify the default_cmd shown below so the process starts in the background, or log into the master and worker over ssh.
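Starting the worker is the second command from the readme, with the master IP from the previous step filled in; the 172.17.0.2 below is only an example value:
docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://172.17.0.2:7077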
Once master and worker are up, you can start spark-shell on the host for a basic test, e.g. MASTER=spark://<master_ip>:7077 $SPARK_HOME/bin/spark-shell, which gives you a shell connected to the cluster.
Note that the containers reuse the Spark binaries built on the host (through the volume mount), which is why Spark has to be installed on the host in the first place.
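A minimal smoke test, again assuming the master printed CONTAINER_IP=172.17.0.2 (substitute your own IP); piping a one-liner into spark-shell runs a tiny job against the cluster and exits:
# the job should report a count of 100 if the cluster is healthy
echo 'sc.parallelize(1 to 100).count()' | MASTER=spark://172.17.0.2:7077 $SPARK_HOME/bin/spark-shell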
docker/spark-test/base/dockerfile
FROM ubuntu:precise
RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
RUN echo "deb http://cz.archive.ubuntu.com/ubuntu precise main" >> /etc/apt/sources.list
RUN echo "deb http://security.ubuntu.com/ubuntu precise-security main universe" >> /etc/apt/sources.list
# export proxy
ENV http_proxy http://www-proxy.xxxx.se:8080
ENV https_proxy http://www-proxy.xxxxx.se:8080
# Upgrade package index
RUN apt-get update
# install a few other useful packages plus Open Jdk 7
RUN apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server
ENV SCALA_VERSION 2.10.4
ENV CDH_VERSION cdh4
ENV SCALA_HOME /opt/scala-$SCALA_VERSION
ENV SPARK_HOME /opt/spark
ENV PATH $SPARK_HOME:$SCALA_HOME/bin:$PATH
# Install Scala
ADD http://www.scala-lang.org/files/archive/scala-$SCALA_VERSION.tgz /
RUN (cd / && gunzip < scala-$SCALA_VERSION.tgz)|(cd /opt && tar -xvf -)
RUN rm /scala-$SCALA_VERSION.tgz
If you are behind an HTTP proxy, set it via ENV as shown above.
While building the image I found that some packages could not be installed from the default source, so I added two extra apt sources; the last two deb lines above are my additions.
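If you want to confirm the base image came out right before building master and worker, a throwaway container is enough; scala is on the PATH thanks to the ENV PATH line in the Dockerfile above:
# run scala inside a disposable container built from the base image
docker run spark-test-base scala -version
# the output should mention Scala 2.10.4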
docker/spark-test/master/dockerfile
FROM spark-test-base
ADD default_cmd /root/
CMD ["/root/default_cmd"]
default_cmd
IP=$(ip -o -4 addr list eth0 | perl -n -e 'if (m{inet\s([\d\.]+)\/\d+\s}xms) { print $1 }')
echo "CONTAINER_IP=$IP"
export SPARK_LOCAL_IP=$IP
export SPARK_PUBLIC_DNS=$IP
# Avoid the default Docker behavior of mapping our IP address to an unreachable host name
umount /etc/hosts
/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master -i $IP
At this point it becomes clear that once base is built, master and worker differ essentially only in the command they run at startup; as images they are almost identical to base.
docker/spark-test/worker/dockerfile
FROM spark-test-base
ENV SPARK_WORKER_PORT 8888
ADD default_cmd /root/
ENTRYPOINT ["/root/default_cmd"]
default_cmd (note that the worker Dockerfile uses ENTRYPOINT, so the spark://<master_ip>:7077 argument given to docker run is passed through to this script as $1)
IP=$(ip -o -4 addr list eth0 | perl -n -e 'if (m{inet\s([\d\.]+)\/\d+\s}xms) { print $1 }')
echo "CONTAINER_IP=$IP"
export SPARK_LOCAL_IP=$IP
export SPARK_PUBLIC_DNS=$IP
# Avoid the default Docker behavior of mapping our IP address to an unreachable host name
umount /etc/hosts
/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker $1