Troubleshooting: HAWQ Fails to Start Segments in k8s
Symptoms
HAWQ can create tables, but write operations (INSERT, UPDATE) fail.
Root Cause
No segments are available.
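A quick way to confirm this, assuming psql can already reach the master: HAWQ (like Greenplum) records registered segments in the gp_segment_configuration catalog table, so an empty or all-down segment list there means writes have nowhere to go. Exact columns vary by version:
# List registered segments and their status on the master
psql -d postgres -c "SELECT * FROM gp_segment_configuration;"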
Solution
echo "kernel.sem = 250 512000 100 2048" >> /etc/sysctl.conf
sysctl -p
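For reference, the four values of kernel.sem are, in order, SEMMSL (max semaphores per set), SEMMNS (max semaphores system-wide), SEMOPM (max operations per semop() call), and SEMMNI (max number of semaphore sets); Postgres needs many small semaphore sets for its backends, which is why the defaults can run out. To read the limits back:
# Both forms should show 250 512000 100 2048
sysctl kernel.sem
cat /proc/sys/kernel/sem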
Troubleshooting Process
Approach 1: Recompile HAWQ
Install dependencies
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# For CentOS 7 the link is https://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-9.noarch.rpm
rpm -ivh epel-release-latest-7.noarch.rpm
yum makecache
# On RHEL 7, make sure the rhel-7-server-extras-rpms and rhel-7-server-optional-rpms channels are enabled in /etc/yum.repos.d/redhat.repo
# Otherwise yum will report that some packages (e.g. gperf) cannot be found
yum install -y man passwd sudo tar which git mlocate links make bzip2 net-tools \
autoconf automake libtool m4 gcc gcc-c++ gdb bison flex gperf maven indent \
libuuid-devel krb5-devel libgsasl-devel expat-devel libxml2-devel \
perl-ExtUtils-Embed pam-devel python-devel libcurl-devel snappy-devel \
thrift-devel libyaml-devel libevent-devel bzip2-devel openssl-devel \
openldap-devel protobuf-devel readline-devel net-snmp-devel apr-devel \
libesmtp-devel python-pip json-c-devel \
java-1.7.0-openjdk-devel lcov cmake3 \
openssh-clients openssh-server perl-JSON perl-Env
# tomcat6 is needed if building with --enable-rps
# download from http://archive.apache.org/dist/tomcat/tomcat-6/v6.0.44/
ln -s /usr/bin/cmake3 /usr/bin/cmake
pip --retries=50 --timeout=300 install pycrypto
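Before moving on, a quick sanity check (a sketch) that the key build tools actually landed on PATH:
# Report any missing toolchain component before running configure
for tool in gcc g++ make cmake mvn bison flex gperf; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done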
Dependency image
registry.cn-chengdu.aliyuncs.com/sunwu/hadoop-with-hawqpackage
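If you would rather not install everything by hand, the image above bundles Hadoop together with the HAWQ build dependencies; it can be pulled directly (tag assumed to default to latest):
docker pull registry.cn-chengdu.aliyuncs.com/sunwu/hadoop-with-hawqpackage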
Build HAWQ
With the necessary dependencies installed and Hadoop ready, the next step is to get the code and build HAWQ.
# The Apache HAWQ source code can be obtained from the following link:
# Apache Repo: https://git-wip-us.apache.org/repos/asf/hawq.git or
# GitHub Mirror: https://github.com/apache/hawq.
git clone https://git-wip-us.apache.org/repos/asf/hawq.git
# The code directory is hawq.
CODE_BASE=`pwd`/hawq
cd $CODE_BASE
# Run command to generate makefile.
./configure
# Or you could use --prefix=/hawq/install/path to change the Apache HAWQ install path,
# and you can also add some optional components using options (--with-python --with-perl)
# For El Capitan (Mac OS 10.11), you may need to export CPPFLAGS="-I/usr/local/include -L/usr/local/lib" if configure cannot find some components
./configure --prefix=/hawq/install/path --with-python --with-perl
# If you need to enable RPS for Ranger integration
export CATALINA_HOME=/tomcat/install/path
./configure --prefix=/hawq/install/path --enable-rps
# You can also run the command with --help for more configuration.
./configure --help
# Run the commands to build and install
# To build concurrently, run make with the -j option, e.g. make -j8
# On a Linux system without much memory, you may hit errors like
# "Error occurred during initialization of VM", "Could not reserve enough space for object heap",
# or "out of memory"; try setting vm.overcommit_memory = 1 temporarily, avoid the "-j" build,
# or add more memory, then rebuild.
# On macOS, you may see the error "'openssl/ssl.h' file not found".
# "brew link openssl --force" should solve the issue.
make -j8
# Install HAWQ
make install
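If make install succeeds, a minimal smoke test (a sketch, assuming the --prefix used above):
# greenplum_path.sh in the install prefix sets GPHOME, PATH, and library paths
source /hawq/install/path/greenplum_path.sh
which hawq && ls "$GPHOME/bin" | head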
In my case the build did not succeed, so I abandoned this approach and went back to the root cause.
Approach 2: Trace the Cause from the Symptoms
Troubleshoot using hawq init master and hawq init segment.
hawq init master
Error 1:
ERROR: failed to list directory hdfs://localhost:8020/hawq_default or it is not empty
Cause:
Hadoop has not been started, or the directory already exists.
Fix:
- Start Hadoop:
start-dfs.sh
- Remove the directory:
hdfs dfs -rm -r /hawq_default
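Before re-running hawq init master, it can help to confirm that HDFS is really up and the directory is gone; a quick check:
# The NameNode should respond and not be stuck in safe mode
hdfs dfsadmin -safemode get
# Listing / should succeed and /hawq_default should no longer appear
hdfs dfs -ls /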
Error 2:
[ERROR]:-Data directory /home/bigdata/hawq-data-directory/masterdd is not empty on 58acfd266eba
Cause:
The data directory already exists.
Fix:
rm -rf /home/bigdata/hawq-data-directory/masterdd
At this point hawq init master runs cleanly.
hawq init segment
Error 1:
[ERROR]:-Data directory /home/bigdata/hawq-data-directory/segmentdd is not empty on 58acfd266eba
Fix:
rm -rf /home/bigdata/hawq-data-directory/segmentdd
Error 2:
[ERROR]:-Postgres initdb failed
Cause:
Suspected to be a Postgres problem, so the investigation turned to Postgres itself.
PG
After further investigation, it looked like the database had never been initialized.
Database initialization
Running the initdb command reports:
Failed system call was semget(48, 17, 03600)
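semget() is the System V call Postgres uses to allocate semaphore sets during initdb, so this failure points at the kernel's semaphore limits. They can be inspected directly:
# Show the System V semaphore limits currently in force
ipcs -ls
# Raw view; the four fields are SEMMSL SEMMNS SEMOPM SEMMNI
cat /proc/sys/kernel/sem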
So the next step was the Linux kernel's sysctl settings.
sysctl
Increase the kernel semaphore limits:
sysctl -w kernel.sem="250 512000 100 2048"
Make it take effect:
sysctl -p
Verify:
sysctl -a|grep sem
Output like the following means it worked:
[root@48ce8e4e3f79 /]# sysctl -a|grep sem
kernel.sem = 250 512000 100 2048
kernel.sem_next_id = -1
Dockerfile
Since the sysctl command cannot be used during the Docker build, it is converted into the following Dockerfile instructions (the setting is applied at container startup):
# Set the kernel parameter
RUN echo "kernel.sem = 250 512000 100 2048" >> /etc/sysctl.conf
# Create the startup script
RUN echo "sysctl -p " >> /entrypoint.sh
RUN echo "/usr/sbin/sshd -D " >> /entrypoint.sh
RUN chmod +x /entrypoint.sh
# Set the entrypoint
ENTRYPOINT /entrypoint.sh
Error in k8s
Kernel parameters cannot be modified inside the container; sysctl -p fails with a permission error.
Fix:
Run the container in privileged mode by adding the following to the ReplicaSet:
securityContext:
privileged: true
The complete ReplicaSet manifest:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: {{ .Values.name }}
  labels:
    app: {{ .Values.name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Values.name }}
  template:
    metadata:
      labels:
        app: {{ .Values.name }}
    spec:
      containers:
        - name: {{ .Values.name }}
          image: {{ .Values.image.hub }}/{{ .Values.image.namespace }}/{{ .Values.image.repository }}:{{ .Values.image.tag }}
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          securityContext:
            privileged: true
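Once the pod is running in privileged mode, it is worth confirming from outside that the parameter is actually in effect (the pod name below is a placeholder):
# Expect: kernel.sem = 250 512000 100 2048
kubectl exec <pod-name> -- sysctl kernel.sem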
The fixed Dockerfile
FROM akshays/hawq_hadoop:latest
# Install and configure sshd so HAWQ utilities can reach the node over SSH
RUN yum makecache fast && yum -y install openssh-server openssh-clients passwd
RUN sed -i "s/#PermitRootLogin no/PermitRootLogin yes/g" /etc/ssh/sshd_config
RUN sed -i "s/#Port 22/Port 22/g" /etc/ssh/sshd_config
RUN sed -i "s/PermitRootLogin without-password/PermitRootLogin yes/g" /etc/ssh/sshd_config
RUN echo root|passwd --stdin root
RUN mkdir -p /var/run/sshd
RUN /usr/bin/ssh-keygen -A
# Raise the semaphore limits Postgres/HAWQ needs
RUN echo "kernel.sem = 250 512000 100 2048" >> /etc/sysctl.conf
# Apply the sysctl setting at startup, then run sshd in the foreground
RUN echo "sysctl -p " >> /entrypoint.sh
RUN echo "/usr/sbin/sshd -D " >> /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT /entrypoint.sh
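The image can be tested locally before deploying to k8s; a sketch with hypothetical image/container names (note that --privileged is needed here for the same reason as in the ReplicaSet):
docker build -t hawq-fixed:test .
docker run -d --privileged --name hawq-test hawq-fixed:test
# entrypoint.sh should have applied the sysctl setting at container start
docker exec hawq-test sysctl kernel.sem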
Test
Enter the system
su bigdata
start-dfs.sh
Running start-dfs.sh may prompt Are you sure you want to continue connecting (yes/no)? Answer yes.
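To avoid the interactive host-key prompt in scripted runs, one option is to pre-populate known_hosts (a sketch, assuming the daemons run on localhost):
# Record localhost's SSH host key ahead of time so start-dfs.sh does not prompt
ssh-keyscan -H localhost >> ~/.ssh/known_hosts 2>/dev/null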
Verify
jps
Output like the following indicates success:
# confirm the daemons are up
[bigdata@5fd15f58ab9b /]$ jps
481 SecondaryNameNode
268 DataNode
620 Jps
159 NameNode
Configure environment variables
source /home/bigdata/hawq/hawq-2.0.0/greenplum_path.sh
Start the cluster
hawq start cluster
If the command completes without errors, the cluster has started.
Verify the cluster
hawq state
Output like the following indicates success:
[bigdata@2116fb98c099 /]$ hawq state
--- lots of log output ---
... Total segment valid (at master) = 1
... Total segment failures (at master) = 0
--- lots of log output ---
Run queries to verify correctness
psql -d postgres
create table t ( i int );
insert into t values(5555555);
insert into t select generate_series(1,9);
select count(*) from t;
select * from t limit 10;
If count(*) returns 10 (one row from the literal INSERT plus nine from generate_series) and the SELECT returns rows, writes work and the segment is healthy.