Docker, a lightweight Linux container engine, has grown steadily more popular alongside the continuing development of big data and cloud computing, largely because it makes building, shipping, and running applications convenient. This article introduces Docker and the steps to dockerize Apache HAWQ.


Docker Introduction

Figure 1 shows the high-level architecture of Docker. A Linux container is a lightweight, operating-system-level virtualization method for running multiple isolated containers on a single host. It uses namespaces for process isolation and cgroups for resource isolation. A layered file system combines multiple layers into a single Docker image, which may contain user data and applications; building an image is like applying patches one by one on top of a base image to form a new one. A Docker container is started from a Docker image. Note that a Docker container is not a VM. The most important difference is that each VM needs its own guest OS on the host machine, which costs more startup time and resources than a Docker container does. Besides images and containers, the Dockerfile is another basic element of Docker. It contains the instructions that would otherwise be typed on the command line to assemble a Docker image, so that builds can be automated and kept under version control.


Figure 1: Docker high-level architecture


Docker Installation

We install Docker on CentOS Linux release 7.0.1406 (Core) with kernel version 3.10.0-123.el7.x86_64. First, run yum -y update to upgrade the kernel and installed software where possible. Then run curl -sSL https://get.docker.com/ | sh to install the Docker engine. Ideally Docker could then be started with service docker start, but in our case this reported an error; after installing device-mapper-libs and device-mapper-event-libs with yum, Docker started successfully. By default the Docker daemon binds to a Unix socket owned by root, so docker commands have to be run with sudo. To avoid this, run usermod -aG docker $USER and log in again so that the new group membership takes effect. Docker is now ready to go.
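The installation steps above can be collected into one shell function. This is a minimal sketch, assuming a regular user with sudo rights on CentOS 7; the function name install_docker is hypothetical, and the device-mapper packages are installed before the first start only because that was the fix needed on our machine.

```shell
#!/bin/sh
# Sketch of the installation steps on CentOS 7 (assumes sudo rights).
# Nothing runs until install_docker is called explicitly.
install_docker() {
    # Upgrade the kernel and installed packages where possible.
    sudo yum -y update
    # Install the Docker engine with the official convenience script.
    curl -sSL https://get.docker.com/ | sudo sh
    # Libraries that were missing on our host when Docker first started.
    sudo yum -y install device-mapper-libs device-mapper-event-libs
    sudo service docker start
    # Allow the current user to run docker without sudo (after re-login).
    sudo usermod -aG docker "$USER"
}
```

After sourcing the script, call install_docker once, then log out and back in before running docker without sudo.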


Docker Image Build

To build the Apache HAWQ Docker image, we use a Dockerfile. The following keywords are widely used in Dockerfiles:

  • FROM: to specify base image from which you are building

  • RUN: to execute shell command

  • ENV: to set environment variable in docker container

  • ADD/COPY: to copy files into the image, usually from the host

  • EXPOSE: to declare the port on which a service in the container listens

  • CMD/ENTRYPOINT: to specify the command executed when the container starts

Though both ENTRYPOINT and CMD allow you to specify the startup command for an image, there are subtle differences between them:

  1. CMD is overridden by any arguments given after the image name when starting the container, while ENTRYPOINT can only be overridden with the --entrypoint flag.

  2. When ENTRYPOINT and CMD are combined, the CMD strings are appended to ENTRYPOINT as its arguments.

  3. When using ENTRYPOINT and CMD together, always use the exec form, such as ENTRYPOINT ["/bin/ping","localhost"], not the shell form ENTRYPOINT /bin/ping localhost. The shell form wraps the command in /bin/sh -c, so CMD and command-line arguments are not passed to it.
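The three points above can be tried out with a throwaway image. The sketch below writes a minimal Dockerfile; the directory and tag name entrydemo are hypothetical, and /bin/echo is used in place of /bin/ping only so that the container exits immediately. The commented docker run lines show what each override does.

```shell
#!/bin/sh
# Write a minimal Dockerfile that combines ENTRYPOINT and CMD (exec form).
# The directory and tag name "entrydemo" are hypothetical.
set -e
demo_dir="${TMPDIR:-/tmp}/entrydemo"
mkdir -p "$demo_dir"
cat > "$demo_dir/Dockerfile" <<'EOF'
FROM centos:7
# Fixed part of the startup command.
ENTRYPOINT ["/bin/echo", "hello"]
# Default argument, appended to ENTRYPOINT unless overridden.
CMD ["world"]
EOF

# After `docker build -t entrydemo "$demo_dir"`:
#   docker run --rm entrydemo                          runs /bin/echo hello world
#   docker run --rm entrydemo HAWQ                     runs /bin/echo hello HAWQ
#   docker run --rm --entrypoint /bin/ls entrydemo /   runs /bin/ls /
```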


With this basic knowledge of the Dockerfile, let's build an image for Apache HAWQ. We choose centos:7 from DockerHub as the base image. First we use yum to install the software whose packaged versions are compatible with Apache HAWQ, such as jdk1.7, krb5, libxml2, libcurl, snappy, etc. Libraries whose packaged versions are not compatible, or which are not found in the yum repo, are installed from specific sources, such as json-c 0.9, flex 2.5.35, libhdfs3, libyarn, etc. Once all these dependencies are installed, the Apache HAWQ development environment is in place, which is enough for devel mode. For production mode, we still need to add an entrypoint that contains the logic for building and running Apache HAWQ. To build the image, run docker build -t hawq:devel <path to the directory containing the Dockerfile>. A pre-built Apache HAWQ Docker image has been pushed to DockerHub; see https://hub.docker.com/r/mayjojo/hawq-devel/.
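As a rough illustration of the devel-mode image, the sketch below generates a skeleton Dockerfile and defines a build function. It is not the Dockerfile behind the published image: the yum package names are assumptions about the CentOS 7 repositories, the source-built dependencies are left as a placeholder comment, and the names build_dir and build_hawq_image are hypothetical.

```shell
#!/bin/sh
# Sketch: generate a skeleton Dockerfile for a HAWQ development image.
# Package names are assumptions; source-built dependencies are omitted.
set -e
build_dir="${TMPDIR:-/tmp}/hawq-devel"
mkdir -p "$build_dir"
cat > "$build_dir/Dockerfile" <<'EOF'
FROM centos:7

# Dependencies available from yum (assumed package names).
RUN yum -y install java-1.7.0-openjdk-devel krb5-devel libxml2-devel \
        libcurl-devel snappy-devel && \
    yum clean all

# json-c 0.9, flex 2.5.35, libhdfs3, libyarn, etc. would be built
# from source in further RUN steps here.
EOF

# Build and tag the image; call this on a host with Docker installed.
build_hawq_image() {
    docker build -t hawq:devel "$build_dir"
}
```

Passing the directory rather than the file itself matters: docker build takes a build context, and by default looks for a file named Dockerfile inside it.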


Docker Image Run

Run docker images to check that the hawq:devel image has been built locally. We still need a Hadoop image, which can be found on DockerHub with docker search hadoop and fetched with docker pull <image>. To start a HAWQ container, run docker run -d --name=hawq hawq:devel tail -f /dev/null. Run docker ps to check that a container named hawq is running in the background. To log in to the environment, run docker exec -it hawq /bin/bash. Now you can build your HAWQ code, run HAWQ, and do whatever you want. If you happen to break the environment, just remove the container with docker rm -f hawq and start a new one. To persist data or share it between containers, you can either mount a data volume from the host with docker run -v, or create a data container with docker create -v /data --name=data <image> and run the HAWQ/Hadoop containers with docker run --volumes-from data. The latter is recommended.
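The run workflow can be sketched as a few shell functions. The function names are hypothetical, and reusing hawq:devel for the data container is an assumption made only because docker create needs some image; any small image would serve.

```shell
#!/bin/sh
# Sketch of the container workflow; nothing runs until a function is called.

start_hawq() {
    # Data-only container that owns the /data volume (never started).
    docker create -v /data --name=data hawq:devel /bin/true
    # Keep the HAWQ container alive in the background with a no-op command.
    docker run -d --name=hawq --volumes-from data hawq:devel tail -f /dev/null
    docker ps --filter name=hawq
}

enter_hawq() {
    # Open an interactive shell inside the running container.
    docker exec -it hawq /bin/bash
}

reset_hawq() {
    # Remove the broken container; the data container and /data remain.
    docker rm -f hawq
    docker run -d --name=hawq --volumes-from data hawq:devel tail -f /dev/null
}
```

Because /data belongs to the separate data container, reset_hawq replaces the HAWQ container without losing what was written to /data.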


Docker is still iterating quickly, and more exciting features that can be applied to Apache HAWQ are waiting to be explored.


