I. Introduction

ClickHouse is an analytical database open-sourced by Yandex on June 15, 2016. Adoption in China includes:

Toutiao uses ClickHouse internally for user-behavior analytics, with several thousand ClickHouse nodes in total, the largest single cluster at 1,200 nodes, tens of PB of data overall, and roughly 300 TB of new raw data per day; most queries respond within a few seconds.
Tencent uses ClickHouse internally for game data analytics and has built a complete monitoring and operations system around it.
Ctrip began trialing it in July 2018; today about 80% of its business runs on ClickHouse, with over a billion new rows and nearly a million query requests per day.
Kuaishou also uses ClickHouse internally, storing about 10 PB in total and adding 200 TB per day, with 90% of queries finishing in under 3 s.

ClickHouse is fast, but that speed comes at a cost. Its main limitations:
No transaction support;
No support for delete & update;
Poor support for high concurrency: even a single query may occupy half of the server's CPU cores, and the official recommendation is around 100 QPS;
Its SQL covers 80%+ of everyday syntax, but the JOIN syntax is somewhat unusual;
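To illustrate the point about JOINs: older ClickHouse releases expected a strictness modifier (ANY/ALL) before the JOIN and matched columns via USING rather than an arbitrary ON condition, which is what makes the syntax feel unusual coming from Oracle. A sketch against two hypothetical tables:

```sql
-- fact_orders and dim_customer are hypothetical tables for illustration.
-- ALL keeps every matching row; ANY would keep at most one match per left row.
SELECT f.order_id, f.amount, d.region
FROM fact_orders AS f
ALL INNER JOIN dim_customer AS d USING (customer_id)
```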

For the official documentation, see: ClickHouse documentation

II. Installing ClickHouse

Installation is straightforward and well documented online, so it is not covered here.

III. Data Migration Options

From the available material, there are roughly three ways to migrate data in bulk:
1. CSV import
This is very fast: on an ordinary server, importing ten million rows takes about 60 seconds, and SSDs are faster still. The problem is quickly exporting a large Oracle table to CSV; an Oracle stored procedure took over an hour on the server to export ten million rows (a CSV of about 6 GB).
2. Third-party tools, such as Flume and DataX
Leaving Flume aside, the official DataX release does not support ClickHouse, but a helpful contributor on GitHub provides the source, which you have to build yourself; see: DataX source code
3. Writing your own code
This is time-consuming and labor-intensive, with no guarantee of efficiency; above all, there just isn't that much time ^_^.

IV. DataX in Practice

1. Prepare the database environments
Oracle:

create table TDBA_TEST01
(
  TID          number primary key,
  TSN          varchar2(40),
  TNO          number,
  TAMT         number(15,2),
  CREATE_DATE  date not null,
  UPDATE_DATE  date,
  UPDATE_TIME  date
)
INSERT INTO TDBA_TEST01 VALUES(1,'SN001',1,1.01,TO_DATE('2020-02-01','YYYY-MM-DD'),TO_DATE('2020-02-02','YYYY-MM-DD'),TO_DATE('2020-02-02 01:01:01','YYYY-MM-DD HH24:MI:SS'));
INSERT INTO TDBA_TEST01 VALUES(2,'SN002',2,2.12,TO_DATE('2020-02-02','YYYY-MM-DD'),TO_DATE('2020-02-03','YYYY-MM-DD'),TO_DATE('2020-02-03 01:01:01','YYYY-MM-DD HH24:MI:SS'));
INSERT INTO TDBA_TEST01 VALUES(3,'SN003',3,3.23,TO_DATE('2020-02-03','YYYY-MM-DD'),TO_DATE('2020-02-04','YYYY-MM-DD'),TO_DATE('2020-02-04 01:01:01','YYYY-MM-DD HH24:MI:SS'));
INSERT INTO TDBA_TEST01 VALUES(4,NULL,4,4.34,TO_DATE('2020-02-04','YYYY-MM-DD'),TO_DATE('2020-02-05','YYYY-MM-DD'),TO_DATE('2020-02-05 01:01:01','YYYY-MM-DD HH24:MI:SS'));
INSERT INTO TDBA_TEST01 VALUES(5,'SN005',NULL,5.45,TO_DATE('2020-02-05','YYYY-MM-DD'),TO_DATE('2020-02-06','YYYY-MM-DD'),TO_DATE('2020-02-06 01:01:01','YYYY-MM-DD HH24:MI:SS'));
INSERT INTO TDBA_TEST01 VALUES(6,'SN006',6,NULL,TO_DATE('2020-02-06','YYYY-MM-DD'),TO_DATE('2020-02-07','YYYY-MM-DD'),TO_DATE('2020-02-07 01:01:01','YYYY-MM-DD HH24:MI:SS'));
INSERT INTO TDBA_TEST01 VALUES(7,'SN007',7,7.67,TO_DATE('2020-02-07','YYYY-MM-DD'),NULL,TO_DATE('2020-02-08 01:01:01','YYYY-MM-DD HH24:MI:SS'));
INSERT INTO TDBA_TEST01 VALUES(8,'SN008',8,8.78,TO_DATE('2020-02-08','YYYY-MM-DD'),TO_DATE('2020-02-09','YYYY-MM-DD'),NULL);
INSERT INTO TDBA_TEST01 VALUES(9,NULL,NULL,NULL,TO_DATE('2020-02-09','YYYY-MM-DD'),NULL,NULL);

ClickHouse:

create table TEST01
(
  TID          UInt32,
  TSN          String,
  TNO          UInt16,
  TAMT         Decimal(15,2),
  CREATE_DATE  Date,
  UPDATE_DATE  Date,
  UPDATE_TIME  DateTime
) ENGINE = MergeTree(CREATE_DATE, (TID), 8192)
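Note that MergeTree(CREATE_DATE, (TID), 8192) is the legacy engine declaration: it partitions by month of CREATE_DATE, orders by TID, and uses an index granularity of 8192. On newer ClickHouse releases the same table would normally be declared with the explicit form; a sketch of the equivalent:

```sql
create table TEST01
(
  TID          UInt32,
  TSN          String,
  TNO          UInt16,
  TAMT         Decimal(15,2),
  CREATE_DATE  Date,
  UPDATE_DATE  Date,
  UPDATE_TIME  DateTime
) ENGINE = MergeTree()
  PARTITION BY toYYYYMM(CREATE_DATE)  -- legacy syntax implicitly partitioned by month
  ORDER BY TID
  SETTINGS index_granularity = 8192
```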

2. Build the source; once packaged, upload Datax.tar.gz and extract it. Source: https://github.com/kuangye098/DataX
3. Configure the job JSON file; a sample follows:
[root@ipshis bin]# more test.json

{
  "job": {
    "content": [
      {
        "reader": {                    
          "name": "oraclereader",                    
          "parameter": {                        
            "connection": [ {                                
              "jdbcUrl": ["jdbc:oracle:thin:@192.168.xxx.xxx:port:sid"],          
              "querySql": ["select * from TDBA_TEST01"]}
            ],                       
          "username": "OracleUser",
          "password": "OracleUserPwd"                        
          }                
        },
        "writer": {
          "name": "clickhousewriter",
          "parameter": {
            "username": "default",
            "password": "test#1234",
            "column":["*"],
            "connection": [
              {
                "jdbcUrl": "jdbc:clickhouse://127.0.0.1:8123/default",
                "table":["TEST01"]
              }
            ]
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel":1 
      }
    }
  }
}

4. Run the job
[root@ipshis bin]# pwd
/datax/bin
[root@ipshis bin]# ./datax.py test.json

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright © 2010-2017, Alibaba Group. All Rights Reserved.

2020-03-02 10:17:54.326 [main] INFO VMInfo - VMInfo# operatingSystem class => com.sun.management.UnixOperatingSystem
2020-03-02 10:17:54.332 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.7 24.45-b08
jvmInfo: Linux amd64 2.6.32-431.el6.x86_64
cpu num: 40
totalPhysicalMemory: 125.93G
freePhysicalMemory: 13.56G
maxFileDescriptorCount: 4096
currentOpenFileDescriptorCount: 43
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
PS Eden Space | 256.50MB | 256.50MB
Code Cache | 48.00MB | 2.44MB
PS Perm Gen | 166.00MB | 21.00MB

2020-03-02 10:17:54.353 [main] INFO Engine -
{
  "content":[
    {
      "reader":{
        "name":"oraclereader",
        "parameter":{
          "connection":[
            {
              "jdbcUrl":[
                "jdbc:oracle:thin:@192.168.xxx.xxx:port:sid"
              ],
              "querySql":[
                "select * from TDBA_TEST01"
              ]
            }
          ],
          "password":"*****",
          "username":"ips2"
        }
      },
      "writer":{
        "name":"clickhousewriter",
        "parameter":{
          "column":[
            "*"
          ],
          "connection":[
            {
              "jdbcUrl":"jdbc:clickhouse://127.0.0.1:8123/default",
              "table":[
                "TEST01"
              ]
            }
          ],
          "password":"*****",
          "username":"default"
        }
      }
    }
  ],
  "setting":{
    "speed":{
      "channel":1
    }
  }
}

2020-03-02 10:17:54.375 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-03-02 10:17:54.377 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-03-02 10:17:54.377 [main] INFO JobContainer - DataX jobContainer starts job.
2020-03-02 10:17:54.380 [main] INFO JobContainer - Set jobId = 0
2020-03-02 10:17:54.675 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:oracle:thin:@192.168.xxx.xxx:port:sid.
2020-03-02 10:17:54.711 [job-0] INFO ClickHouseDriver - Driver registered
2020-03-02 10:17:54.920 [job-0] INFO OriginalConfPretreatmentUtil - table:[TEST01] all columns:[
TID,TSN,TNO,TAMT,CREATE_DATE,UPDATE_DATE,UPDATE_TIME
].
2020-03-02 10:17:54.921 [job-0] WARN OriginalConfPretreatmentUtil - 您的配置文件中的列配置信息存在风险. 因为您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改.
2020-03-02 10:17:54.922 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (TID,TSN,TNO,TAMT,CREATE_DATE,UPDATE_DATE,UPDATE_TIME) VALUES(?,?,?,?,?,?,?)
], which jdbcUrl like:[jdbc:clickhouse://127.0.0.1:8123/default]
2020-03-02 10:17:54.922 [job-0] INFO JobContainer - jobContainer starts to do prepare …
2020-03-02 10:17:54.923 [job-0] INFO JobContainer - DataX Reader.Job [oraclereader] do prepare work .
2020-03-02 10:17:54.923 [job-0] INFO JobContainer - DataX Writer.Job [clickhousewriter] do prepare work .
2020-03-02 10:17:54.930 [job-0] INFO JobContainer - jobContainer starts to do split …
2020-03-02 10:17:54.930 [job-0] INFO JobContainer - Job set Channel-Number to 1 channels.
2020-03-02 10:17:54.933 [job-0] INFO JobContainer - DataX Reader.Job [oraclereader] splits to [1] tasks.
2020-03-02 10:17:54.933 [job-0] INFO JobContainer - DataX Writer.Job [clickhousewriter] splits to [1] tasks.
2020-03-02 10:17:54.951 [job-0] INFO JobContainer - jobContainer starts to do schedule …
2020-03-02 10:17:54.958 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2020-03-02 10:17:54.960 [job-0] INFO JobContainer - Running by standalone Mode.
2020-03-02 10:17:54.967 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2020-03-02 10:17:54.972 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-03-02 10:17:54.972 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2020-03-02 10:17:54.982 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-03-02 10:17:54.986 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select * from TDBA_TEST01
] jdbcUrl:[jdbc:oracle:thin:@192.168.xxx.xxx:port:sid].
2020-03-02 10:17:55.073 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select * from TDBA_TEST01
] jdbcUrl:[jdbc:oracle:thin:@192.168.xxx.xxx:port:sid].
2020-03-02 10:17:55.286 [0-0-0-writer] ERROR WriterRunner - Writer Runner Received Exceptions:
com.alibaba.datax.common.exception.DataXException: Code:[DBUtilErrorCode-05], Description:[往您配置的写入表中写入数据时失败.]. - java.lang.NullPointerException
at java.util.Calendar.setTime(Calendar.java:1106)
at java.text.SimpleDateFormat.format(SimpleDateFormat.java:955)
at java.text.SimpleDateFormat.format(SimpleDateFormat.java:948)
at java.text.DateFormat.format(DateFormat.java:336)
at ru.yandex.clickhouse.ClickHousePreparedStatementImpl.setDate(ClickHousePreparedStatementImpl.java:237)
at com.alibaba.datax.plugin.writer.clickhousewriter.ClickHouseWriter$Task.fillPreparedStatementColumnType(ClickHouseWriter.java:324)
at com.alibaba.datax.plugin.writer.clickhousewriter.ClickHouseWriter$Task.fillPreparedStatement(ClickHouseWriter.java:266)
at com.alibaba.datax.plugin.writer.clickhousewriter.ClickHouseWriter$Task.doBatchInsert(ClickHouseWriter.java:248)
at com.alibaba.datax.plugin.writer.clickhousewriter.ClickHouseWriter$Task.doBatchExecute(ClickHouseWriter.java:236)
at com.alibaba.datax.plugin.writer.clickhousewriter.ClickHouseWriter$Task.startWriteWithConnection(ClickHouseWriter.java:194)
at com.alibaba.datax.plugin.writer.clickhousewriter.ClickHouseWriter$Task.startWrite(ClickHouseWriter.java:214)
at com.alibaba.datax.core.taskgroup.runner.WriterRunner.run(WriterRunner.java:56)
at java.lang.Thread.run(Thread.java:744)


5. The problem
The "…java.lang.NullPointerException…" error occurs because the ClickHouse target columns do not accept NULL values (none of them is declared Nullable, and here all but TSN are non-String types). There are several ways to fix this:
1. Change the Oracle side, filling every NULL column with a default value, or exclude the NULL-able columns in the JSON config; neither is realistic in practice.
2. Patch the DataX source to substitute defaults for NULLs; this is quite feasible with a little basic Java.
Following the ClickHouse documentation, NULLs are mapped as follows:
numeric types: NULL defaults to 0
date/time types: NULL defaults to 0000-00-00 00:00:00
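A third option, if the ClickHouse version in use is recent enough to support it, is to declare the affected columns as Nullable(T) in the target table so that NULLs can be stored as-is, at some cost in storage and query speed. A sketch of the test table rewritten this way:

```sql
create table TEST01
(
  TID          UInt32,                   -- sort key: must stay non-Nullable
  TSN          Nullable(String),
  TNO          Nullable(UInt16),
  TAMT         Nullable(Decimal(15,2)),
  CREATE_DATE  Date,                     -- partition column: must stay non-Nullable
  UPDATE_DATE  Nullable(Date),
  UPDATE_TIME  Nullable(DateTime)
) ENGINE = MergeTree(CREATE_DATE, (TID), 8192)
```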

6. Patch the source, rebuild, upload, extract, and run again

Job start time            : 2020-03-02 11:00:47
Job end time              : 2020-03-02 11:00:58
Total elapsed time        : 10s
Average throughput        : 26B/s
Record write speed        : 0rec/s
Total records read        : 9
Total read/write failures : 0
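To sanity-check the migrated data, one might run queries like the following against ClickHouse (the expected values assume the nine test rows above and the NULL-mapping rules from step 5):

```sql
SELECT count(*) FROM TEST01;           -- should match the 9 records read
SELECT * FROM TEST01 WHERE TID = 9;    -- formerly-NULL columns should now hold the 0 / 0000-00-00 defaults
```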

V. Conclusion

CSV import is very fast, but it comes with many restrictions and suits only limited scenarios.
DataX is more general purpose, and hopefully official ClickHouse support arrives soon; its throughput is quite respectable. Below are the measured stats for migrating ten million rows:
Job start time            : 2020-03-02 20:03:42
Job end time              : 2020-03-02 20:17:02
Total elapsed time        : 800s
Average throughput        : 3.88MB/s
Record write speed        : 13511rec/s
Total records read        : 10808915
Total read/write failures : 0

The datax.tar.gz attached to this article is the build with the NULL-value patch applied; it is fairly small, containing only the Oracle, MySQL, SQL Server, and ClickHouse plugins, and can be used directly.
