I went through the Cassandra Sink doc but I don't see how to specify the partition and clustering keys.

The doc says this:

You can configure this connector to manage the schema on the Cassandra cluster. When altering an existing table the key is ignored. This is to avoid the potential issues around changing a primary key on an existing table. The key schema is used to generate a primary key for the table when it is created.

If it is a new table, the connector will use the key schema (from the KStream, I suppose) to create the primary key. That might be OK for the partition key, but not for the clustering key.

So are we forced to create all the tables with the right keys before running the streaming app, or is there a way to adjust things?

Answers

Confluent's connector requires that all columns in the primary key be present in the key of the topic (as a struct, if I remember correctly). This is one of its limitations, as it may not match your application's output. In that case you'll need to transform the topic to meet this requirement.
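One way to do that transformation without changing the application is Kafka Connect's built-in `ValueToKey` single message transform, which copies fields from the record value into the record key. A minimal sketch in the sink connector's config (the field names here are made up for illustration):

```properties
# Hypothetical example: promote value fields into the record key
# so they can form the struct the connector expects for the primary key.
transforms=createKey
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=customer_id,order_ts
```

This only restructures the record key; it does not let you control which of those columns become partition vs. clustering keys, so the table itself still needs to be created with the right key layout.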

Instead of Confluent's connector, I recommend the DataStax Kafka Connector, which is carefully designed for efficient loading of data into Cassandra/DSE. It has the following features (more information is in the following blog post):

  • Stores data from one topic into one or multiple Cassandra tables (to support data denormalization);
  • Mapping of topic data to Cassandra columns is defined in the configuration, so you can take any piece of the message key or value and map it to a column;
  • Very efficient, using unlogged batches where possible, and lightweight;
  • Supports the different security features of Cassandra/DSE;

The connector is free to use with DSE starting from version 4.8, and with Cassandra starting from 2.1.
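As a sketch of the mapping feature, a DataStax connector configuration looks roughly like this (the topic, keyspace, table, and column names are hypothetical, and the exact `connector.class` value depends on your connector version, so check the connector's documentation):

```properties
name=cassandra-sink
connector.class=com.datastax.oss.kafka.sink.CassandraSinkConnector
topics=orders
# Map pieces of the record key/value to table columns.
# The target table, with its partition and clustering keys,
# must already exist in Cassandra.
topic.orders.my_ks.orders_by_customer.mapping=customer_id=key.customer_id, order_ts=value.order_ts, amount=value.amount
```

Because the mapping addresses `key.*` and `value.*` paths explicitly, the record key does not have to mirror the table's primary key, which works around the limitation described above.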
