Sometimes a SELECT statement contains multiple window functions whose window definitions (OVER clauses) may be the same or different.
When the windows are identical, there is no need to partition and sort the data again: the window functions can be merged into a single Window operator.
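The merging step can be pictured as grouping window expressions by their window specification. The sketch below is a simplified model, not Spark's actual optimizer code; the function name and the tuple representation are assumptions made for illustration.

```python
# Minimal sketch (not Spark's real implementation) of merging window
# functions that share an OVER clause: group every window expression by its
# (partition-by, order-by) spec and emit one Window operator per distinct spec.
from collections import OrderedDict

def merge_windows(window_exprs):
    """window_exprs: list of (function_name, partition_cols, order_cols).
    Returns an ordered mapping: one "Window operator" per distinct spec."""
    operators = OrderedDict()
    for func, partition_by, order_by in window_exprs:
        spec = (tuple(partition_by), tuple(order_by))  # hashable window spec
        operators.setdefault(spec, []).append(func)
    return operators

# The three window functions from the first example query:
exprs = [
    ("row_number", ["id"], ["rank"]),
    ("rank",       ["id"], ["rank"]),       # same spec: merged with row_number
    ("dense_rank", ["cell_type"], ["id"]),  # different spec: its own operator
]
for (partition, order), funcs in merge_windows(exprs).items():
    print(partition, order, funcs)
```

With the three functions above, the grouping produces two operators: one evaluating row_number() and rank() together, and one for dense_rank().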
Take, for example, the case from "The implementation principle of window functions in Spark and Hive":
select id, sq, cell_type, rank,
       row_number() over(partition by id order by rank) naturl_rank,
       rank() over(partition by id order by rank) as r,
       dense_rank() over(partition by cell_type order by id) as dr
from window_test_table
group by id, sq, cell_type, rank;
The windows of row_number() and rank() are identical, so both can be evaluated with a single partitioning and sort. In this case the performance of Hive SQL is consistent with that of Spark SQL.
But in another case:
select id, rank,
       row_number() over(partition by id order by rank) naturl_rank,
       sum(rank) over(partition by id) as snum
from window_test_table
Although the two windows are not exactly the same, sum(rank) does not depend on the order of rows within the partition, so it can reuse the window of row_number().
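To see why one partitioning and one sort are enough for both functions, consider this sketch: the unbounded-frame sum is order-insensitive, so it can be computed over the same sorted partitions that row_number() requires. The function name and the sample rows are made up for illustration; the column names follow the example query.

```python
# Sketch: computing row_number() over(partition by id order by rank) and
# sum(rank) over(partition by id) from a single sort by (id, rank).
# The sum's frame is the whole partition, so the sort order does not affect it.
from itertools import groupby
from operator import itemgetter

def one_pass_windows(rows):
    """rows: list of dicts with 'id' and 'rank'. Returns rows extended with
    naturl_rank (row number within the id partition, ordered by rank) and
    snum (partition-wide sum of rank)."""
    out = []
    rows = sorted(rows, key=itemgetter("id", "rank"))  # single partition+sort
    for _id, group in groupby(rows, key=itemgetter("id")):
        part = list(group)
        snum = sum(r["rank"] for r in part)  # order-insensitive aggregate
        for n, r in enumerate(part, start=1):  # row_number needs the order
            out.append({**r, "naturl_rank": n, "snum": snum})
    return out

sample = [{"id": 1, "rank": 3}, {"id": 1, "rank": 1}, {"id": 2, "rank": 5}]
for row in one_pass_windows(sample):
    print(row)
```

Because sum(rank) ignores the extra `order by rank`, attaching it to row_number()'s window changes nothing in its result, which is exactly the reuse Spark's optimizer exploits.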
As can be seen from the execution plans below, Spark SQL evaluates sum(rank) and row_number() over the same partitioning, with a single shuffle (one Exchange), while Hive SQL does not: it launches a separate map-reduce stage, and thus a second shuffle, for each window.
The execution plan of Spark SQL:
spark-sql> explain select id,rank,row_number() over(partition by id order by rank ) naturl_rank,sum(rank) over(partition by id) as snum from window_test_table;
== Physical Plan ==
*(3) Project [id#13, rank#16, naturl_rank#8, snum#9L]
+- Window [row_number() windowspecdefinition(id#13, rank#16 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS naturl_rank#8], [id#13], [rank#16 ASC NULLS FIRST]
   +- *(2) Sort [id#13 ASC NULLS FIRST, rank#16 ASC NULLS FIRST], false, 0
      +- Window [sum(cast(rank#16 as bigint)) windowspecdefinition(id#13, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS snum#9L], [id#13]
         +- *(1) Sort [id#13 ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(id#13, 200)
               +- Scan hive tmp.window_test_table [id#13, rank#16], HiveTableRelation `tmp`.`window_test_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#13, sq#14, cell_type#15, rank#16]
Time taken: 0.278 seconds, Fetched 1 row(s)
The execution plan of Hive SQL:
hive> explain select id,rank,row_number() over(partition by id order by rank ) naturl_rank,sum(rank) over(partition by id) as snum from window_test_table;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: window_test_table
            Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int), rank (type: int)
              sort order: ++
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: int), KEY.reducesinkkey1 (type: int)
          outputColumnNames: _col0, _col3
          Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
          PTF Operator
            Function definitions:
                Input definition
                  input alias: ptf_0
                  output shape: _col0: int, _col3: int
                  type: WINDOWING
                Windowing table definition
                  input alias: ptf_1
                  name: windowingtablefunction
                  order by: _col3 ASC NULLS FIRST
                  partition by: _col0
                  raw input shape:
                  window functions:
                      window function definition
                        alias: row_number_window_0
                        name: row_number
                        window function: GenericUDAFRowNumberEvaluator
                        window frame: PRECEDING(MAX)~FOLLOWING(MAX)
                        isPivotResult: true
            Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col3 (type: int), row_number_window_0 (type: int)
              outputColumnNames: _col0, _col3, row_number_window_0
              Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: int)
              sort order: +
              Map-reduce partition columns: _col0 (type: int)
              Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
              value expressions: row_number_window_0 (type: int), _col3 (type: int)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 (type: int), VALUE._col3 (type: int)
          outputColumnNames: _col0, _col1, _col4
          Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
          PTF Operator
            Function definitions:
                Input definition
                  input alias: ptf_0
                  output shape: _col0: int, _col1: int, _col4: int
                  type: WINDOWING
                Windowing table definition
                  input alias: ptf_1
                  name: windowingtablefunction
                  order by: _col1 ASC NULLS FIRST
                  partition by: _col1
                  raw input shape:
                  window functions:
                      window function definition
                        alias: sum_window_1
                        arguments: _col4
                        name: sum
                        window function: GenericUDAFSumLong
                        window frame: PRECEDING(MAX)~FOLLOWING(MAX)
            Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col1 (type: int), _col4 (type: int), _col0 (type: int), sum_window_1 (type: bigint)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 13 Data size: 104 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.244 seconds, Fetched: 106 row(s)

