倫大淘寶技術, July 8, 2024, 18:03, Zhejiang

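This article collects and organizes introductory and advanced knowledge of ODPS development. It aims to cover most ODPS development questions and to serve as a mini encyclopedia, and it will be updated over time. The author hopes this summary helps newcomers to ODPS development get started quickly.

This is the first article in the series: the introductory part.

The author's knowledge is limited; corrections for any errors or omissions are welcome.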

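Basic features overview

Feature categories

In general, data development includes the following types:

MaxCompute features

Here we focus on the features of the MaxCompute module (MaxCompute is an enterprise-grade, SaaS-mode cloud data warehouse for data analytics scenarios):

Basic SQL

DDL

Statement 1: create a table

--Create a new table.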
 create [external] table [if not exists] <table_name>
 [primary key (<pk_col_name>, <pk_col_name2>),(<col_name> <data_type> 
                                               [not null] [default <default_value>] [comment <col_comment>], ...)]
 [comment <table_comment>]
 [partitioned by (<col_name> <data_type> [comment <col_comment>], ...)]


--Sets the Shuffle and Sort properties when creating a clustered table.
 [clustered by | range clustered by (<col_name> [, <col_name>, ...]) 
  [sorted by (<col_name> [asc | desc] [, <col_name> [asc | desc] ...])] into <number_of_buckets> buckets] 
--External tables only.
 [stored by StorageHandler] 
--External tables only.
 [with serdeproperties (options)] 
--External tables only.
 [location <osslocation>] 
--Makes the table a Transactional 1.0 table, so that its data can later be updated or deleted. Transactional tables have some usage restrictions, so create them only as needed.
 [tblproperties("transactional"="true")]
--Makes the table a Transactional 2.0 table, which supports upsert, incremental queries, time travel, and similar operations.
 [tblproperties ("transactional"="true" [, "write.bucket.num"="N", "acid.data.retain.hours"="hours"...])] [lifecycle <days>]
;


-------------------------------------------------------------------
--Example:
CREATE TABLE IF NOT EXISTS xxx.xxxx_xxxx_xxxx_hh
(
  xxxxx             STRING COMMENT 'product'
  ,xxxxx           STRING COMMENT 'name'
)
COMMENT 'xxx table'
PARTITIONED BY 
(
  ds                  STRING COMMENT 'yyyymmddhh'
)
LIFECYCLE 7
;

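Parameters:

external: optional. Creates the table as an external table.

if not exists: optional. If it is omitted and a table with the same name already exists, an error is returned.

table_name: required. The table name.

primary key (pk): optional. The primary key of the table.

col_name: optional. A column name.

col_comment: optional. The column comment.

data_type: optional. The column data type.

not null: optional. Forbids NULL values in the column.

default_value: optional. The default value of the column.

table_comment: optional. The table comment.

lifecycle: optional. The lifecycle of the table, in days.

partitioned by (<col_name> <data_type> [comment <col_comment>], ...): optional. The partition columns of a partitioned table.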

Statement 2: change the table owner

alter table <table_name> changeowner to <new_owner>;


--------------------------------------------------------
--Example
--Change the owner of table test1 to ALIYUN$xxx@aliyun.com
alter table test1 changeowner to 'ALIYUN$xxx@aliyun.com';


--Change the owner of table test1 to the RAM user named ram_test
alter table test1 changeowner to 'RAM$13xxxxxxxxxxx:ram_test';

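Parameters:

table_name: required. The name of the table whose owner is to be changed.

new_owner: required. The account of the new owner. To make a RAM user the owner, use the format RAM$<UID>:<ram_name>, where UID is the account ID of the Alibaba Cloud account and ram_name is the display name of the RAM user.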

Statement 3: change the table comment

alter table <table_name> set comment '<new_comment>';
--------------------------------------------------------
--Example
alter table sale_detail set comment 'new comments for table sale_detail';

Parameters:

table_name: required. The name of the table whose comment is to be changed.

new_comment: required. The new comment text.

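Statement 4: update the table's last-modified time

alter table <table_name> touch;
--------------------------------------------------------
--Example
alter table sale_detail touch;

Parameters:

table_name: required. The name of the table whose last-modified time is to be updated.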

Statement 5: rename a table

alter table <table_name> rename to <new_table_name>;
--------------------------------------------------------
--Example
alter table sale_detail rename to sale_detail_rename;

Parameters:

table_name: required. The table to rename.

new_table_name: required. The new table name. If a table named new_table_name already exists, an error is returned.

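Statement 6: drop a table

drop table [if exists] <table_name>;
--------------------------------------------------------
--Example
drop table if exists sale_detail;

Parameters:

if exists: optional. If it is omitted and the table does not exist, an exception is returned. If it is specified, the statement succeeds whether or not the table exists.

table_name: required. The name of the table to drop.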
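The effect of if exists can be tried out locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for MaxCompute, which is an assumption made only so the example runs anywhere; the error type and message on MaxCompute will differ.

```python
import sqlite3

# SQLite stands in for MaxCompute; only the IF EXISTS behavior is illustrated.
conn = sqlite3.connect(":memory:")

# With IF EXISTS, dropping a table that does not exist succeeds silently.
conn.execute("DROP TABLE IF EXISTS sale_detail")

# Without IF EXISTS, dropping a table that does not exist raises an error.
try:
    conn.execute("DROP TABLE sale_detail")
    drop_failed = False
except sqlite3.OperationalError:
    drop_failed = True

print(drop_failed)
```

In scheduled jobs the if exists form is usually the safer choice, because a rerun does not fail just because an earlier run already removed the table.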

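Statement 7: view table or view information

--View information about a table or view.
desc <table_name|view_name> [partition (<pt_spec>)]; 
--View information about an external table, a clustered table, or a Transactional table. It can also show extended information for an internal table.
desc extended <table_name>;
--------------------------------------------------------
--Example
desc test1;

Parameters:

table_name: required. The name of the table to inspect.

view_name: required. The name of the view to inspect.

pt_spec: optional. The partition of a partitioned table to inspect.

extended: required if the table is an external table, a clustered table, or a Transactional table.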

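Statement 8: view partition information

desc <table_name> partition (<pt_spec>);
--------------------------------------------------------
--Example
--Query the partition information of the partitioned table sale_detail.
desc sale_detail partition (xxxx_date='201310',region='beijing');

Parameters:

table_name: required. The name of the partitioned table whose partition information is to be viewed.

pt_spec: required. The partition to inspect.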

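Statement 9: view the create-table statement

show create table <table_name>;
--------------------------------------------------------
--Example
--View the create-table statement of table sale_detail.
show create table sale_detail;

Parameters:

table_name: required. The name of the table whose create-table statement is to be viewed.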

Statement 10: list all partitions

show partitions <table_name>;
--------------------------------------------------------
--Example
--List all partitions of sale_detail.
show partitions sale_detail;

Parameters:

table_name: required. The name of the partitioned table whose partitions are to be listed.

Statement 11: clear column data

ALTER TABLE <table_name> 
           [partition ( <pt_spec>[, <pt_spec>....] )] 
           CLEAR COLUMN column1[, column2, column3, ...]
                               [without touch];

Parameters:

table_name: the name of the table whose column data is to be cleared.

column1, column2...: the names of the columns whose data is to be cleared.

partition: specifies the partition.

pt_spec: the partition description.

without touch: do not update LastDataModifiedTime.

Statement 12: clone a table

clone table <[<src_project_name>.]<src_table_name>> [partition(<pt_spec>), ...]
 to <[<dest_project_name>.]<dest_table_name>> [if exists [overwrite | ignore]] ;


----------------------------------------------------------------------------
--Example
--Clone table data.
clone table xxxx_detail partition (xxxx_date='2013', region='china') to xxxx_detail_clone if exists overwrite;

Parameters:

src_project_name: optional. The name of the MaxCompute project that the source table belongs to.

src_table_name: required. The name of the source table.

pt_spec: optional. The partition information of the source table.

dest_project_name: optional.

dest_table_name: required. The name of the destination table.

DML

Statement 1: insert or overwrite data

--Insert: writes data directly into a table or a static partition. The partition value can be given in the insert statement so that the data goes into that partition. To insert a small amount of test data, combine it with VALUES.
--Overwrite: first clears the existing data in the table or static partition, then inserts the new data.


insert {into|overwrite} table <table_name> [partition (<pt_spec>)] [(<col_name> [,<col_name> ...])]
<select_statement>
from <from_statement>
[zorder by <zcol_name> [, <zcol_name> ...]];




----------------------------------------------------------------------------
--Example
--Append data to the source table. insert into table table_name can be shortened to insert into table_name, but the table keyword cannot be omitted from insert overwrite table table_name.
insert into xxxx_detail partition (xxxx_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);


--Run insert overwrite to overwrite the data in table xxxx_detail_insert, adjusting the order of the columns in the select clause.
insert overwrite table xxxx_detail_insert partition (xxxx_date='2013', region='china')
    select xxxx_id, xxxx_name, xxxx_price from xxxx_detail;

Parameters:

table_name: required. The name of the target table to insert into.

pt_spec: optional. The partition to insert into.

col_name: optional. The names of the target table columns to insert into.

select_statement: required. The select clause that queries the source table for the data to insert into the target table.

from_statement: required. The from clause, which identifies the data source.

zorder by <zcol_name> [, <zcol_name> ...]: optional. When data is written to a table or partition, rows whose values in the specified column or columns are close to each other are stored together. This improves filtering performance at query time and reduces storage cost to some extent.
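The difference between insert into and insert overwrite can be sketched locally. SQLite has no insert overwrite, so the example below emulates it as "delete everything, then insert" inside one transaction; the table t and its columns are hypothetical, and this only models the visible result, not how MaxCompute implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, price REAL)")
conn.execute("INSERT INTO t VALUES ('s1', 100.1)")

# insert into: the new row is appended to the existing rows.
conn.execute("INSERT INTO t VALUES ('s2', 100.2)")
rows_after_into = conn.execute("SELECT id FROM t ORDER BY id").fetchall()

# insert overwrite (emulated): clear the old rows, then insert the new ones.
with conn:  # commits on success, rolls back on error
    conn.execute("DELETE FROM t")
    conn.execute("INSERT INTO t VALUES ('s3', 100.3)")
rows_after_overwrite = conn.execute("SELECT id FROM t ORDER BY id").fetchall()

print(rows_after_into)       # [('s1',), ('s2',)]
print(rows_after_overwrite)  # [('s3',)]
```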

Statement 2: insert or overwrite data with dynamic partitions

--When processing data with MaxCompute SQL, the partition column values are supplied in the select clause, and the system automatically routes each row to the matching partition.


insert {into|overwrite} table <table_name> partition (<ptcol_name>[, <ptcol_name> ...]) 
<select_statement> from <from_statement>;


----------------------------------------------------------------------------
--Example
--Specify the first-level partition statically and insert the data into the target table.
insert overwrite table sale_detail_dypart partition (sale_date='2013', region)
select shop_name,customer_id,total_price,region from sale_detail;


--Insert the data of the source table sale_detail into the target table sale_detail_dypart.
insert overwrite table sale_detail_dypart partition (sale_date, region)
select shop_name,customer_id,total_price,sale_date,region from sale_detail;

Parameters:

table_name: required. The name of the target table to insert into.

ptcol_name: required. The names of the partition columns of the target table.

select_statement: required. The select clause that queries the source table for the data to insert into the target table.

from_statement: required. The from clause, which identifies the data source, for example the source table name.
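A small pure-Python model may make the routing easier to picture. It assumes that the trailing columns of each selected row supply the partition values, in the order the partition columns are listed (as in the second example above); route_to_partitions is a hypothetical helper, not a MaxCompute API.

```python
from collections import defaultdict


def route_to_partitions(rows, num_partition_cols):
    """Group rows by their trailing partition-column values.

    num_partition_cols must be at least 1.
    """
    partitions = defaultdict(list)
    for row in rows:
        data = row[:-num_partition_cols]  # ordinary columns
        key = row[-num_partition_cols:]   # dynamic partition values
        partitions[key].append(data)
    return dict(partitions)


# Columns: shop_name, customer_id, total_price, sale_date, region
rows = [
    ("s1", "c1", 100.1, "2013", "china"),
    ("s2", "c2", 100.2, "2013", "china"),
    ("s3", "c3", 100.3, "2014", "shanghai"),
]
partitions = route_to_partitions(rows, num_partition_cols=2)
for key, data in partitions.items():
    print(key, data)
```

Each distinct combination of partition values becomes its own partition, which is why a query with many distinct values can create many partitions in one statement.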

Statement 3: update or delete data

--Delete: removes one or more rows that satisfy the given condition from a Transactional or Delta Table table.
delete from <table_name> [where <where_condition>];


--Clear column data: removes the data of columns that are no longer used from disk and sets the values to NULL, which reduces storage cost.
ALTER TABLE <table_name> 
           [partition ( <pt_spec>[, <pt_spec>....] )] 
           CLEAR COLUMN column1[, column2, column3, ...]
                               [without touch];


--Update: sets one or more columns of the matching rows in a Transactional or Delta Table table to new values.
--Form 1
update <table_name> set <col1_name>=<value1> [, <col2_name>=<value2> ...] [WHERE <where_condition>];
--Form 2
update <table_name> set (<col1_name> [, <col2_name> ...])=(<value1> [, <value2> ...])[WHERE <where_condition>];
--Form 3
UPDATE <table_name>
       SET <col1_name>=<value1> [ , <col2_name>=<value2> , ... ]
        [ FROM <additional_tables> ]
        [ WHERE <where_condition> ]

Parameters:

table_name: required.

where_condition: optional. The WHERE clause that filters the rows to operate on.

partition: specifies the partition. If it is omitted, all partitions are affected.

pt_spec: the partition description.

without touch: do not update LastDataModifiedTime.

col1_name, col2_name: the names of the columns to modify in the matching rows.

value1, value2: the new values. At least one column value must be updated.

where_condition: optional. The WHERE clause that filters the rows to operate on.

additional_tables: optional. The from clause.

Statement 4: merge into

merge into <target_table> as <alias_name_t> using <source expression|table_name> as <alias_name_s>
--The on clause joins the source and target tables and decides which rows match.
on <boolean expression1>
--when matched…then specifies the action for rows where the on result is True. The rows covered by multiple when matched…then clauses must not overlap.
when matched [and <boolean expression2>] then update set <set_clause_list>
when matched [and <boolean expression3>] then delete 
--when not matched…then specifies the action for rows where the on result is False.
when not matched [and <boolean expression4>] then insert values <value_list>


----------------------------------------------------------------------------
--Example
--Run merge into: rows that satisfy the on condition are updated in the target table with the source data, and rows that do not satisfy the on condition and whose _event_type_ in the source table is I are inserted into the target table.
merge into acid_address_book_base1 as t using tmp_table1 as s 
on s.id=t.id and t.year='2020' and t.month='08' and t.day='20' and t.hour='16' 
when matched then update set t.first_name=s.first_name, t.last_name=s.last_name, t.phone=s.phone 
when not matched and (s._event_type_='I') then insert values(s.id, s.first_name, s.last_name,s.phone,'2020','08','20','16');

Parameters:

target_table: required. The name of the target table, which must already exist.

alias_name_t: required. The alias of the target table.

source expression|table_name: required. The name of the source table, view, or subquery to join with.

alias_name_s: required. The alias of the source table, view, or subquery.

boolean expression1: required. A BOOLEAN condition whose result must be True or False.

boolean expression2, boolean expression3, boolean expression4: optional. BOOLEAN conditions for the update, delete, and insert actions respectively.

set_clause_list: required when an update action is present.

value_list: required when an insert action is present.
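The matched / not matched branching can be modeled in a few lines of plain Python. This is only a sketch of the semantics for the update and insert branches under simplifying assumptions (rows keyed by a single id, one updated column); merge_into and the event field are hypothetical names, not part of MaxCompute.

```python
def merge_into(target, source, insert_condition):
    """Apply MERGE-like semantics to `target`, a dict keyed by id."""
    for row in source:
        key = row["id"]
        if key in target:
            # when matched then update
            target[key]["phone"] = row["phone"]
        elif insert_condition(row):
            # when not matched and <condition> then insert
            target[key] = {"phone": row["phone"]}
        # otherwise the source row is ignored
    return target


target = {1: {"phone": "111"}}
source = [
    {"id": 1, "phone": "999", "event": "U"},  # matched: updated
    {"id": 2, "phone": "222", "event": "I"},  # not matched, event I: inserted
    {"id": 3, "phone": "333", "event": "D"},  # not matched, other event: skipped
]
merged = merge_into(target, source, lambda r: r["event"] == "I")
print(merged)
```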

Statement 5: Values

--insert … values
insert into table <table_name>
[partition (<pt_spec>)][(<col1_name> ,<col2_name>,...)] 
values (<col1_value>,<col2_value>,...),(<col1_value>,<col2_value>,...),...


--values table
values (<col1_value>,<col2_value>,...),(<col1_value>,<col2_value>,...),<table_name> (<col1_name> ,<col2_name>,...)...

Parameters:

table_name: required. The name of the table to insert into.

pt_spec: optional. The target partition to insert into.

col_name: optional. The names of the target columns to insert into.

col_value: optional. The values for the corresponding columns of the target table.

Statement 6: Load

--Imports CSV or other open-source-format data from external storage (Hologres, OSS, Amazon Redshift, BigQuery) into a MaxCompute table or table partition.
{load overwrite|into} table <table_name> [partition (<pt_spec>)]
from location <external_location>
stored by <StorageHandler>
[with serdeproperties (<Options>)];


----------------------------------------------------------------------------
--Example
load overwrite table xxxx_data_csv_load
from
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-test/data_location/'
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties (
  'odps.properties.rolearn'='acs:ram::xxxxx:role/aliyunodpsdefaultrole',   --ARN of AliyunODPSDefaultRole, available on the RAM role management page.
  'odps.text.option.delimiter'=','
);

Parameters:

table_name: required. The name of the target table to insert into.

pt_spec: optional. The partition of the target table to insert into.

external_location: required. The OSS directory from which the external data is read.

StorageHandler: required. The name of the built-in StorageHandler.

Options: optional. Parameters related to the external table.

Statement 7: Unload

--Exports MaxCompute data to external storage (OSS, Hologres). On OSS the data can be stored as CSV or in other open-source formats.
unload from {<select_statement>|<table_name> [partition (<pt_spec>)]} 
into 
location <external_location>
stored by <StorageHandler>
[with serdeproperties ('<property_name>'='<property_value>',...)];


----------------------------------------------------------------------------
--Example
--Control the number of exported files: set how much MaxCompute table data a single Worker reads, in MB. Because MaxCompute tables are compressed, data exported to OSS typically grows to about 4 times its size.
set odps.stage.mapper.split.size=256;
--Export the data.
unload from sale_detail partition (sale_date='2013',region='china')
into
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
stored by 'com.aliyun.odps.TsvStorageHandler'
with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/AliyunODPSDefaultRole', 'odps.text.option.gzip.output.enabled'='true');

Parameters:

select_statement: a select query clause.

table_name, pt_spec: specify the data to export by table name, or by table name plus partition name.

external_location: required.

StorageHandler: required. The name of the built-in StorageHandler.

'<property_name>'='<property_value>': optional. property_name is the property name and property_value is the property value.

Statement 8: Explain

--Analyzes a query statement or table structure to locate performance bottlenecks.
explain <dml query>;
----------------------------------------------------------------------------
--Example
explain 
select a.customer_id as ashop, sum(a.total_price) as ap,count(b.total_price) as bp 
from (select * from sale_detail_jt where sale_date='2013' and region='china') a 
inner join (select * from sale_detail where sale_date='2013' and region='china') b 
on a.customer_id=b.customer_id 
group by a.customer_id 
order by a.customer_id 
limit 10;

Parameters:

dml query: required. A select statement.

Statement 9: common table expressions

--A temporary named result set that simplifies SQL and improves the readability and execution efficiency of SQL statements.
with 
     <cte_name> as
    (
      <cte_query>
    )
    [,<cte_name2>  as 
     (
       <cte_query2>
     )
     , ...]


----------------------------------------------------------------------------
--Example
with 
  a as (select * from src where key is not null),
  b as (select  * from src2 where value > 0),
  c as (select * from src3 where value > 0),
  d as (select a.key, b.value from a join b on a.key=b.key),
  e as (select a.key,c.value from a left outer join c on a.key=c.key and c.key is not null)
insert overwrite table srcp partition (p='abc')
select * from d union all select * from e;

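Parameters:

cte_name: required. The name of the CTE. It must differ from the names of the other CTEs in the same with clause. Wherever the identifier cte_name appears in the query, it refers to the CTE.

cte_query: required. A select statement whose result set populates the CTE.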

DQL

SELECT statement

1. SELECT syntax

[with <cte>[, ...] ]
SELECT [all | distinct] <SELECT_expr>[, <except_expr>][, <replace_expr>] ...
       from <table_reference>
       [where <where_condition>]
       [group by {<col_list>|rollup(<col_list>)}]
       [having <having_condition>]
       [window <window_clause>]
       [order by <order_condition>]
       [distribute by <distribute_condition> [sort by <sort_condition>]|[ cluster by <cluster_condition>] ]
       [limit <number>]

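The following sections describe the SELECT command format and how to perform nested queries, grouped queries, sorting, and other operations.

2. SELECT clause order

--Logical execution order (it differs from the written syntax order shown above)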
from <table_reference>
[where <where_condition>]
[group by <col_list>]
[having <having_condition>]
[window <window_name> AS (<window_definition>)]
[qualify <expression>]
select [all | distinct] <select_expr>, <select_expr>, ...
[order by <order_condition>]
[distribute by <distribute_condition> [sort by <sort_condition>] ]
[limit <number>]

Scenario 1: from -> where -> group by -> having -> select -> order by -> limit

Scenario 2: from -> where -> select -> distribute by -> sort by

3. WITH clause

with 
A as (SELECT 1 as C),
B as (SELECT * from A) 
SELECT * from B;

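CTEs in the same WITH clause must have unique names.

A CTE defined in a WITH clause is visible only to the other CTEs in the same WITH clause and to the statement that the WITH clause prefixes.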
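The chaining of CTEs can be checked with a small runnable sketch. It uses Python's sqlite3 module as a stand-in for MaxCompute (an assumption for portability); the CTE names a and b are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# b reads from a, and the final SELECT reads from b.
rows = conn.execute(
    """
    WITH
      a AS (SELECT 1 AS c),
      b AS (SELECT c + 1 AS c FROM a)
    SELECT c FROM b
    """
).fetchall()

print(rows)  # [(2,)]
```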

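4. Column expressions

----------------------------------------------------------------------------
--Example
--Read the column xxxx_name of table xxxx_detail.
SELECT xxxx_name from xxxx_detail;
--Query the region column of table xxxx_detail; duplicate values are shown only once.
SELECT distinct region from xxxx_detail;
--Select all columns of table xxxx_detail whose name is not xxxx_name.
SELECT `(xxxx_name)?+.+` from xxxx_detail;
--When deduplicating on several columns, distinct applies to the whole set of SELECT columns, not to a single column.
SELECT distinct region, xxxx_date from xxxx_detail;

Specify the columns to read by column name.

Use an asterisk (*) to query all columns.

Regular expressions can be used.

Put distinct before the selected columns to remove duplicates and return only the deduplicated values.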

5. Excluding columns

--Read the data of table xxxx_detail, excluding the region column.
----------------------------------------------------------------------------
--Example
SELECT * except(region) from xxxx_detail;

Use this when you want to read most of the columns in a table while excluding a few of them.

except(col1, col2, ...) means that the specified columns (col1, col2) are excluded when the table data is read.

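6. WHERE

--Used with relational operators to filter the rows that satisfy a condition. The relational operators are:
  >, <, =, >=, <=, <>
  like, rlike
  in, not in
  between…and
----------------------------------------------------------------------------
--Example
SELECT * 
from xxxx_detail
where xxxx_date >='2008' and xxxx_date <='2014';
--Equivalent to the following statement.
SELECT * 
from xxxx_detail 
where xxxx_date between '2008' and '2014';

The where clause is the filter condition. If the table is partitioned, a filter on the partition columns enables partition pruning.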
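The equivalence of the two filters above can be verified locally. The sketch uses sqlite3 as a stand-in for MaxCompute, with a hypothetical detail table and made-up rows; between is inclusive at both ends, which is what makes the two forms return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detail (name TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO detail VALUES (?, ?)",
    [("a", "2007"), ("b", "2008"), ("c", "2013"), ("d", "2014"), ("e", "2015")],
)

by_range = conn.execute(
    "SELECT name FROM detail WHERE dt >= '2008' AND dt <= '2014' ORDER BY name"
).fetchall()
by_between = conn.execute(
    "SELECT name FROM detail WHERE dt BETWEEN '2008' AND '2014' ORDER BY name"
).fetchall()

print(by_range == by_between)  # True
```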

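7. GROUP BY

----------------------------------------------------------------------------
--Example


--Use the input table column region directly as the group by column, that is, group by the value of region.
SELECT region from xxxx_detail group by region;


--Group by the value of region and return the total sales of each group.
SELECT sum(xxxx_price) from xxxx_detail group by region;


--Group by the value of region and return each group's region value (unique within the group) and total sales.
SELECT region, sum (xxxx_price) from xxxx_detail group by region;

group by is evaluated before SELECT, so the group by value must be a column name of the SELECT input table or an expression built from the input table's columns. Note the following:

When the group by value is a regular expression, the complete column expression must be used.

Every column in the SELECT statement that is not wrapped in an aggregate function must appear in GROUP BY.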
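A runnable sketch of the third query above, again with sqlite3 standing in for MaxCompute and with a hypothetical detail table and sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detail (region TEXT, price REAL)")
conn.executemany(
    "INSERT INTO detail VALUES (?, ?)",
    [("china", 100.0), ("china", 50.0), ("shanghai", 25.0)],
)

# region is not aggregated, so it appears in GROUP BY; price is aggregated.
totals = conn.execute(
    "SELECT region, SUM(price) FROM detail GROUP BY region ORDER BY region"
).fetchall()

print(totals)  # [('china', 150.0), ('shanghai', 25.0)]
```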

8. HAVING

----------------------------------------------------------------------------
--Example
--To make the effect visible, append data to the sale_detail table.
insert into sale_detail partition (sale_date='2014', region='shanghai') 
  values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
--Use the having clause with an aggregate function to filter.
SELECT region,sum(total_price) from sale_detail 
group by region 
having sum(total_price)<305;

The HAVING clause is usually used together with aggregate functions to filter groups.
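The sketch below runs a query of the same shape on sqlite3 (a stand-in for MaxCompute) with made-up round numbers. It also shows that sum skips NULL values, which is why the NULL price in the sample data does not turn the group total into NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale_detail (region TEXT, total_price REAL)")
conn.executemany(
    "INSERT INTO sale_detail VALUES (?, ?)",
    [
        ("china", 100.0), ("china", 100.0), ("china", 100.0),
        ("shanghai", None), ("shanghai", 100.0), ("shanghai", 100.0),
        ("beijing", 400.0),
    ],
)

# HAVING filters whole groups after aggregation; beijing (400) is dropped.
groups = conn.execute(
    """
    SELECT region, SUM(total_price) FROM sale_detail
    GROUP BY region
    HAVING SUM(total_price) < 305
    ORDER BY region
    """
).fetchall()

print(groups)  # [('china', 300.0), ('shanghai', 200.0)]
```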

9. ORDER BY

----------------------------------------------------------------------------
--Example


--Query table xxxx_detail and return the first 2 rows in ascending order of xxxx_price.
SELECT * from xxxx_detail order by xxxx_price limit 2;


--Sort table xxxx_detail in ascending order of xxxx_price and output 3 rows starting from the 3rd row.
SELECT xxxx_id,xxxx_price from xxxx_detail order by xxxx_price limit 3 offset 2;

Data is sorted in ascending order by default; use the desc keyword for descending order.

By default, order by must be used together with limit; without limit, an error is returned.
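The limit 3 offset 2 example can be reproduced locally. sqlite3 stands in for MaxCompute here, and the detail table and its rows are hypothetical; offset 2 skips the first two sorted rows, so the output starts at the third.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detail (id INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO detail VALUES (?, ?)",
    [(6, 60), (1, 10), (4, 40), (2, 20), (5, 50), (3, 30)],
)

page = conn.execute(
    "SELECT id, price FROM detail ORDER BY price LIMIT 3 OFFSET 2"
).fetchall()

print(page)  # [(3, 30), (4, 40), (5, 50)]
```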

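10. DISTRIBUTE BY: hash partitioning

----------------------------------------------------------------------------
--Example


--Query the region column of table xxxx_detail and hash-partition the rows by the value of region.
SELECT region from xxxx_detail distribute by region;
--Equivalent to the following statements.
SELECT region as r from xxxx_detail distribute by region;
SELECT region as r from xxxx_detail distribute by r;

distribute by controls how the output of the Map stage (which reads the data) is divided among the Reducers. If you do not want the contents of the Reducers to overlap, or need rows of the same group to be processed together, use distribute by to guarantee that rows of the same group are sent to the same Reducer.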

11. SORT BY: local sorting

----------------------------------------------------------------------------
--Example


--Query the region and xxxx_price columns of table xxxx_detail, hash-partition the rows by region, then sort each hash partition locally by xxxx_price in ascending order.
SELECT region,xxxx_price from xxxx_detail distribute by region sort by xxxx_price;


--Query the region and xxxx_price columns of table xxxx_detail, hash-partition the rows by region, then sort each hash partition locally by xxxx_price in descending order.
SELECT region,xxxx_price from xxxx_detail distribute by region sort by xxxx_price desc;


--If sort by is not preceded by distribute by, sort by sorts the data within each Reduce locally.
SELECT region,xxxx_price from xxxx_detail sort by xxxx_price desc;

sort by sorts in ascending order by default; use the desc keyword for descending order.

If sort by is preceded by distribute by, sort by sorts the result of distribute by on the specified columns.
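The combination of distribute by and sort by can be pictured with a small pure-Python model. It is a sketch under stated assumptions: distribute_and_sort is a hypothetical helper, two reducers are used, and CRC32 is just a deterministic stand-in for whatever hash function the engine actually applies.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, num_reducers, dist_key, sort_key, descending=False):
    """Hash rows into reducers by dist_key, then sort within each reducer."""
    reducers = defaultdict(list)
    for row in rows:
        # distribute by: equal keys always hash to the same reducer.
        bucket = zlib.crc32(str(row[dist_key]).encode()) % num_reducers
        reducers[bucket].append(row)
    # sort by: each reducer is sorted on its own; there is no global order.
    return {
        bucket: sorted(part, key=lambda r: r[sort_key], reverse=descending)
        for bucket, part in reducers.items()
    }


# Columns: region, price
rows = [
    ("china", 30), ("shanghai", 20), ("china", 10),
    ("beijing", 50), ("shanghai", 5), ("beijing", 40),
]
buckets = distribute_and_sort(rows, num_reducers=2, dist_key=0, sort_key=1)
for bucket, part in sorted(buckets.items()):
    print(bucket, part)
```

The output is ordered only inside each bucket, which is the difference from order by: a global order would need all rows in one place.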

12. LIMIT: limiting the number of output rows

SELECT * FROM xxxxx.xxxx_xxxx_xxxx
WHERE ds=20240520
LIMIT 100;

In limit <number>, number is a constant that limits the number of output rows. Its value must be within the range of a 32-bit int.

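Subqueries

1. Basic subqueries

--Format 1
select <select_expr> from (<select_statement>) [<sq_alias_name>];


--Format 2
select (<select_statement>) from <table_name>;

An ordinary query operates on a target table, but the object of a query can also be another select statement; such a query is a subquery. In the from clause, a subquery can be treated as a table and joined with other tables or subqueries.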

2. IN SUBQUERY

--in subquery is used in a way similar to left semi join
--Format 1
select <select_expr1> from <table_name1> where <select_expr2>
  in (select <select_expr3> from <table_name2>);
--Equivalent to the following left semi join statement.
select <select_expr1> from <table_name1> <alias_name1> left semi join <table_name2> <alias_name2>
  on <alias_name1>.<select_expr2> = <alias_name2>.<select_expr3>;


--Format 2
select <select_expr1> from <table_name1> where <select_expr2>
  in (select <select_expr3> from <table_name2> where
     <table_name1>.<col_name> = <table_name2>.<col_name>);


----------------------------------------------------------------------------
--Example
set odps.sql.allow.fullscan=true;
select * from xxxx_detail where xxxx_price in (select xxxx_price from shop);


set odps.sql.allow.fullscan=true;
select * from xxxx_detail where xxxx_price 
  in (select xxxx_price from shop where xxxx_id=shop.xxxx_id);

select_expr1: required. It takes the form col1_name, col2_name, regular expression, ... and denotes the ordinary columns, partition columns, or regular expressions to query.
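The semi-join behavior of in subquery can be checked with a local sketch. SQLite (used here as a stand-in for MaxCompute) has no left semi join syntax, so the semi join is written as a correlated EXISTS, which is an assumption made for the sake of a runnable example; the tables and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detail (id INTEGER, price INTEGER)")
conn.execute("CREATE TABLE shop (id INTEGER, price INTEGER)")
conn.executemany("INSERT INTO detail VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
# price 10 appears twice in shop to show that matches are not multiplied.
conn.executemany("INSERT INTO shop VALUES (?, ?)", [(1, 10), (1, 10), (9, 30)])

in_rows = conn.execute(
    "SELECT id, price FROM detail "
    "WHERE price IN (SELECT price FROM shop) ORDER BY id"
).fetchall()

# Semi join emulated with EXISTS: keep a detail row if any shop row matches.
semi_rows = conn.execute(
    "SELECT d.id, d.price FROM detail d "
    "WHERE EXISTS (SELECT 1 FROM shop s WHERE s.price = d.price) ORDER BY d.id"
).fetchall()

print(in_rows)  # [(1, 10), (3, 30)]
```

Unlike an inner join, each row of the left table is returned at most once, no matter how many rows match on the right.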