博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
教程:使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集
阅读量:6330 次
发布时间:2019-06-22

本文共 18251 字,大约阅读时间需要 60 分钟。

0. Data Lake Analytics(DLA)简介

关于Data Lake的概念,更多阅读可以参考:

以及AWS和Azure关于Data Lake的解读:

终于,阿里云现在也有了自己的数据湖分析产品:

可以点击申请使用(目前公测阶段还属于邀测模式,我们会尽快审批申请),体验本教程的TPC-H CSV数据格式的数据分析之旅。

产品文档:

1. 开通Data Lake Analytics与OSS服务

如果您已经开通,可以跳过该步骤。如果没有开通,可以参考:

进行产品开通服务申请。

2. 下载TPC-H测试数据集

可以从这下载TPC-H 100MB的数据集:

3. 上传数据文件到OSS

登录阿里云官网的OSS控制台:

规划您要使用的OSS bucket,创建或选择好后,点击“文件管理”,因为有8个数据文件,为每个数据文件创建对应的文件目录:

image.png | left

创建好8个目录如下:

image.png | left

点击进入目录,上传相应的数据文件,例如,customer目录,则上传customer.tbl文件。

image.png | left

上传好后,如下图。然后,依次把其他7个数据文件也上传到对应的目录下。

image.png | left

至此,8个数据文件都上传到了您的OSS bucket中:

oss://xxx/tpch_100m/customer/customer.tbloss://xxx/tpch_100m/lineitem/lineitem.tbloss://xxx/tpch_100m/nation/nation.tbloss://xxx/tpch_100m/orders/orders.tbloss://xxx/tpch_100m/part/part.tbloss://xxx/tpch_100m/partsupp/partsupp.tbloss://xxx/tpch_100m/region/region.tbloss://xxx/tpch_100m/supplier/supplier.tbl

4. 登录Data Lake Analytics控制台

点击“登录数据库”,输入开通服务时分配的用户名和密码,登录Data Lake Analytics控制台。

5. 创建Schema和Table

输入创建SCHEMA的语句,点击“同步执行”。

CREATE SCHEMA tpch_100m with DBPROPERTIES(  LOCATION = 'oss://test-bucket-julian-1/tpch_100m/',  catalog='oss');

(注意:目前在同一个阿里云region,Data Lake Analytics的schema名全局唯一,建议schema名尽量根据业务定义,已有重名schema,在创建时会提示报错,则请换一个schema名字。)

Schema创建好后,在“数据库”的下拉框中,选择刚刚创建的schema。然后在SQL文本框中输入建表语句,点击同步执行。

建表语句语法参考:

image.png | left

TPC-H对应的8个表的建表语句如下,分别贴入文档框中执行(LOCATION子句中的数据文件位置请根据您的实际OSS bucket目录相应修改)。(注意:目前控制台中还不支持多个SQL语句执行,请单条语句执行。)

CREATE EXTERNAL TABLE nation (    N_NATIONKEY INT,     N_NAME STRING,       N_ID STRING,    N_REGIONKEY INT,     N_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';CREATE EXTERNAL TABLE lineitem (    L_ORDERKEY INT,     L_PARTKEY INT,     L_SUPPKEY INT,     L_LINENUMBER INT,     L_QUANTITY DOUBLE,     L_EXTENDEDPRICE DOUBLE,     L_DISCOUNT DOUBLE,     L_TAX DOUBLE,     L_RETURNFLAG STRING,     L_LINESTATUS STRING,     L_SHIPDATE DATE,     L_COMMITDATE DATE,     L_RECEIPTDATE DATE,     L_SHIPINSTRUCT STRING,     L_SHIPMODE STRING,     L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/lineitem';CREATE EXTERNAL TABLE orders (    O_ORDERKEY INT,     O_CUSTKEY INT,     O_ORDERSTATUS STRING,     O_TOTALPRICE DOUBLE,     O_ORDERDATE DATE,     O_ORDERPRIORITY STRING,     O_CLERK STRING,     O_SHIPPRIORITY INT,     O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/orders';CREATE EXTERNAL TABLE supplier (    S_SUPPKEY INT,     S_NAME STRING,     S_ADDRESS STRING,     S_NATIONKEY INT,     S_PHONE STRING,     S_ACCTBAL DOUBLE,     S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/supplier';CREATE EXTERNAL TABLE partsupp (    PS_PARTKEY INT,     PS_SUPPKEY INT,     PS_AVAILQTY INT,     PS_SUPPLYCOST DOUBLE,     PS_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/partsupp';CREATE EXTERNAL TABLE customer (    C_CUSTKEY INT,     C_NAME STRING,     C_ADDRESS STRING,     C_NATIONKEY INT,     C_PHONE STRING,     C_ACCTBAL DOUBLE,     C_MKTSEGMENT STRING,     C_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/customer';CREATE EXTERNAL TABLE part (    P_PARTKEY INT,     P_NAME STRING,     P_MFGR STRING,     P_BRAND STRING,     P_TYPE STRING,     P_SIZE INT,     P_CONTAINER STRING,     P_RETAILPRICE DOUBLE,     P_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/part';CREATE EXTERNAL TABLE region (    R_REGIONKEY INT,     R_NAME STRING,     R_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/region';

建表完毕后,刷新页面,在左边导航条中能看到schema下的8张表。

image.png | left

6. 执行TPC-H查询

TPC-H总共22条查询,如下:

Q1:

SELECT   l_returnflag,         l_linestatus,         Sum(l_quantity)                                       AS sum_qty,         Sum(l_extendedprice)                                  AS sum_base_price,         Sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,         Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,         Avg(l_quantity)                                       AS avg_qty,         Avg(l_extendedprice)                                  AS avg_price,         Avg(l_discount)                                       AS avg_disc,         Count(*)                                              AS count_orderFROM     lineitemWHERE    l_shipdate <= date '1998-12-01' - INTERVAL '93' dayGROUP BY l_returnflag,         l_linestatusORDER BY l_returnflag,         l_linestatusLIMIT    1;

Q2:

SELECT   s_acctbal,         s_name,         n_name,         p_partkey,         p_mfgr,         s_address,         s_phone,         s_commentFROM     part,         supplier,         partsupp,         nation,         regionWHERE    p_partkey = ps_partkeyAND      s_suppkey = ps_suppkeyAND      p_size = 35AND      p_type LIKE '%NICKEL'AND      s_nationkey = n_nationkeyAND      n_regionkey = r_regionkeyAND      r_name = 'MIDDLE EAST'

Q3:

SELECT   l_orderkey,         Sum(l_extendedprice * (1 - l_discount)) AS revenue,         o_orderdate,         o_shippriorityFROM     customer,         orders,         lineitemWHERE    c_mktsegment = 'AUTOMOBILE'AND      c_custkey = o_custkeyAND      l_orderkey = o_orderkeyAND      o_orderdate < date '1995-03-31'AND      l_shipdate >  date '1995-03-31'GROUP BY l_orderkey,         o_orderdate,         o_shippriorityORDER BY revenue DESC,         o_orderdateLIMIT    10;

Q4:

SELECT   o_orderpriority,         Count(*) AS order_countFROM     orders,         lineitemWHERE    o_orderdate >= date '1997-10-01'AND      o_orderdate <  date '1997-10-01' + INTERVAL '3' monthAND      l_orderkey = o_orderkeyAND      l_commitdate < l_receiptdateGROUP BY o_orderpriorityORDER BY o_orderpriorityLIMIT    1;

Q5:

SELECT   n_name,         Sum(l_extendedprice * (1 - l_discount)) AS revenueFROM     customer,         orders,         lineitem,         supplier,         nation,         regionWHERE    c_custkey = o_custkeyAND      l_orderkey = o_orderkeyAND      l_suppkey = s_suppkeyAND      c_nationkey = s_nationkeyAND      s_nationkey = n_nationkeyAND      n_regionkey = r_regionkeyAND      r_name = 'ASIA'AND      o_orderdate >= date '1995-01-01'AND      o_orderdate <  date '1995-01-01' + INTERVAL '1' yearGROUP BY n_nameORDER BY revenue DESCLIMIT    1;

Q6:

SELECT sum(l_extendedprice * l_discount) AS revenueFROM lineitemWHERE l_shipdate >= date '1995-01-01'AND l_shipdate < date '1995-01-01' + interval '1' yearAND l_discount between 0.04 - 0.01 AND 0.04 + 0.01AND l_quantity < 24LIMIT 1;

Q7:

SELECT   supp_nation,         cust_nation,         l_year,         Sum(volume) AS revenueFROM     (                SELECT n1.n_name                          AS supp_nation,                       n2.n_name                          AS cust_nation,                       Extract(year FROM l_shipdate)      AS l_year,                       l_extendedprice * (1 - l_discount) AS volume                FROM   supplier,                       lineitem,                       orders,                       customer,                       nation n1,                       nation n2                WHERE  s_suppkey = l_suppkey                AND    o_orderkey = l_orderkey                AND    c_custkey = o_custkey                AND    s_nationkey = n1.n_nationkey                AND    c_nationkey = n2.n_nationkey                AND    ( (                                     n1.n_name = 'GERMANY'                              AND    n2.n_name = 'INDIA')                       OR     (                                     n1.n_name = 'INDIA'                              AND    n2.n_name = 'GERMANY') )                AND    l_shipdate BETWEEN date '1995-01-01' AND    date '1996-12-31' ) AS shippingGROUP BY supp_nation,         cust_nation,         l_yearORDER BY supp_nation,         cust_nation,         l_yearLIMIT    1;

Q8:

SELECT   o_year,         Sum(         CASE                  WHEN nation = 'INDIA' THEN volume                  ELSE 0         end) / Sum(volume) AS mkt_shareFROM     (                SELECT Extract(year FROM o_orderdate)     AS o_year,                       l_extendedprice * (1 - l_discount) AS volume,                       n2.n_name                          AS nation                FROM   part,                       supplier,                       lineitem,                       orders,                       customer,                       nation n1,                       nation n2,                       region                WHERE  p_partkey = l_partkey                AND    s_suppkey = l_suppkey                AND    l_orderkey = o_orderkey                AND    o_custkey = c_custkey                AND    c_nationkey = n1.n_nationkey                AND    n1.n_regionkey = r_regionkey                AND    r_name = 'ASIA'                AND    s_nationkey = n2.n_nationkey                AND    o_orderdate BETWEEN date '1995-01-01' AND    date '1996-12-31'                AND    p_type = 'STANDARD ANODIZED STEEL' ) AS all_nationsGROUP BY o_yearORDER BY o_yearLIMIT    1;

Q9:

SELECT   nation,         o_year,         Sum(amount) AS sum_profitFROM     (                SELECT n_name                                                          AS nation,                       Extract(year FROM o_orderdate) AS o_year,                       l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount                FROM   part,                       supplier,                       lineitem,                       partsupp,                       orders,                       nation                WHERE  s_suppkey = l_suppkey                AND    ps_suppkey = l_suppkey                AND    ps_partkey = l_partkey                AND    p_partkey = l_partkey                AND    o_orderkey = l_orderkey                AND    s_nationkey = n_nationkey                AND    p_name LIKE '%aquamarine%' ) AS profitGROUP BY nation,         o_yearORDER BY nation,         o_year DESCLIMIT    1;

Q10:

SELECT   c_custkey,         c_name,         Sum(l_extendedprice * (1 - l_discount)) AS revenue,         c_acctbal,         n_name,         c_address,         c_phone,         c_commentFROM     customer,         orders,         lineitem,         nationWHERE    c_custkey = o_custkeyAND      l_orderkey = o_orderkeyAND      o_orderdate >= date '1994-08-01'AND      o_orderdate <  date '1994-08-01' + INTERVAL '3' monthAND      l_returnflag = 'R'AND      c_nationkey = n_nationkeyGROUP BY c_custkey,         c_name,         c_acctbal,         c_phone,         n_name,         c_address,         c_commentORDER BY revenue DESCLIMIT    20;

Q11:

SELECT   ps_partkey,         Sum(ps_supplycost * ps_availqty) AS valueFROM     partsupp,         supplier,         nationWHERE    ps_suppkey = s_suppkeyAND      s_nationkey = n_nationkeyAND      n_name = 'PERU'GROUP BY ps_partkeyHAVING   Sum(ps_supplycost * ps_availqty) >(SELECT Sum(ps_supplycost * ps_availqty) * 0.0001000000 as sum_value  FROM partsupp,       supplier,       nation  WHERE  ps_suppkey = s_suppkey  AND    s_nationkey = n_nationkey  AND    n_name = 'PERU')ORDER BY value DESCLIMIT    1;

Q12:

SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1    else 0end) AS high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1else 0end) AS low_line_countFROM orders,     lineitemWHERE o_orderkey = l_orderkeyAND l_shipmode in ('MAIL', 'TRUCK')AND l_commitdate < l_receiptdateAND l_shipdate < l_commitdateAND l_receiptdate >= date '1996-01-01'AND l_receiptdate < date '1996-01-01' + interval '1' yearGROUP BY l_shipmodeORDER BY l_shipmodeLIMIT 1;

Q13:

SELECT c_count, count(*) AS custdistFROM (    SELECT c_custkey, count(o_orderkey) AS c_count    FROM customer,         orders    WHERE c_custkey = o_custkey    AND o_comment NOT LIKE '%pending%accounts%'    GROUP BY c_custkey ) AS c_ordersGROUP BY c_countORDER BY custdist DESC, c_count DESCLIMIT 1;

Q14:

SELECT 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount)    else 0end) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenueFROM lineitem,     partWHERE l_partkey = p_partkeyAND l_shipdate >= date '1996-01-01'AND l_shipdate < date '1996-01-01' + interval '1' monthLIMIT 1;

Q15:

WITH revenue0 AS(SELECT l_suppkey AS supplier_no, sum(l_extendedprice * (1 - l_discount)) AS total_revenueFROM lineitemWHERE l_shipdate >= date '1993-01-01'AND l_shipdate < date '1993-01-01' + interval '3' monthGROUP BY l_suppkey)SELECT s_suppkey, s_name, s_address, s_phone, total_revenueFROM supplier, revenue0WHERE s_suppkey = supplier_noAND total_revenue IN (    SELECT max(total_revenue)    FROM revenue0 )ORDER BY s_suppkey;

Q16:

SELECT p_brand, p_type, p_size, count(distinct ps_suppkey) AS supplier_cntFROM partsupp,     partWHERE p_partkey = ps_partkeyAND p_brand <> 'Brand#23'AND p_type NOT LIKE 'PROMO BURNISHED%'AND p_size IN (1, 13, 10, 28, 21, 35, 31, 11)AND ps_suppkey NOT IN (    SELECT s_suppkey    FROM supplier    WHERE s_comment LIKE '%Customer%Complaints%' )GROUP BY p_brand, p_type, p_sizeORDER BY supplier_cnt DESC, p_brand, p_type, p_sizeLIMIT 1;

Q17:

SELECT    sum(l_extendedprice) / 7.0 AS avg_yearlyFROM    lineitem,    partWHERE p_partkey = l_partkey    AND p_brand = 'Brand#44'    AND p_container = 'WRAP PKG'    AND l_quantity < (        SELECT            0.2 * avg(l_quantity)        FROM            lineitem, part        WHERE            l_partkey = p_partkey    );

Q18:

SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)FROM customer,     orders,     lineitemWHERE o_orderkey IN (      SELECT l_orderkey      FROM lineitem      GROUP BY l_orderkey      HAVING sum(l_quantity) > 315 )AND c_custkey = o_custkeyAND o_orderkey = l_orderkeyGROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalpriceORDER BY o_totalprice DESC, o_orderdateLIMIT 100;

Q19:

SELECT sum(l_extendedprice* (1 - l_discount)) AS revenueFROM lineitem,     partWHERE ( p_partkey = l_partkey and p_brand = 'Brand#12'        and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')        and l_quantity >= 6 and l_quantity <= 6 + 10        and p_size between 1 and 5        and l_shipmode in ('AIR', 'AIR REG')        and l_shipinstruct = 'DELIVER IN PERSON' )    or ( p_partkey = l_partkey and p_brand = 'Brand#13'        and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')        and l_quantity >= 10 and l_quantity <= 10 + 10        and p_size between 1 and 10        and l_shipmode in ('AIR', 'AIR REG')        and l_shipinstruct = 'DELIVER IN PERSON' )    or ( p_partkey = l_partkey and p_brand = 'Brand#24'        and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')        and l_quantity >= 21 and l_quantity <= 21 + 10        and p_size between 1 and 15        and l_shipmode in ('AIR', 'AIR REG')        and l_shipinstruct = 'DELIVER IN PERSON' )LIMIT 1;

Q20:

with temp_table as( select 0.5 * sum(l_quantity) as col1 from lineitem,      partsupp where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date '1993-01-01' and l_shipdate < date '1993-01-01' + interval '1' year)select s_name, s_addressfrom supplier,     nationwhere s_suppkey in (    select ps_suppkey    from partsupp,         temp_table    where ps_partkey in (        select p_partkey        from part        where p_name like 'dark%' )        and ps_availqty > temp_table.col1 )    and s_nationkey = n_nationkey and n_name = 'JORDAN'order by s_namelimit 1;

Q21:

select    s_name,    count(*) as numwaitfrom    supplier,    lineitem l1,    orders,    nationwhere    s_suppkey = l1.l_suppkey    and o_orderkey = l1.l_orderkey    and o_orderstatus = 'F'    and l1.l_receiptdate > l1.l_commitdate    and exists (        select            *        from            lineitem l2        where            l2.l_orderkey = l1.l_orderkey            and l2.l_suppkey <> l1.l_suppkey    )    and not exists (        select            *        from            lineitem l3        where            l3.l_orderkey = l1.l_orderkey            and l3.l_suppkey <> l1.l_suppkey            and l3.l_receiptdate > l3.l_commitdate    )    and s_nationkey = n_nationkey    and n_name = 'SAUDI ARABIA'group by    s_nameorder by    numwait desc,    s_namelimit 100;

Q22:

with temp_table_1 as(  select avg(c_acctbal) as avg_value  from customer  where c_acctbal > 0.00 and substring(c_phone from 1 for 2)  in ('33', '29', '37', '35', '25', '27', '43')),temp_table_2 as(  select count(*) as count1  from orders, customer  where o_custkey = c_custkey)select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbalfrom (    select substring(c_phone from 1 for 2) as cntrycode, c_acctbal    from customer, temp_table_1, temp_table_2    where substring(c_phone        from 1        for 2) in ('33', '29', '37', '35', '25', '27', '43')        and c_acctbal > temp_table_1.avg_value        and temp_table_2.count1 = 0) as custsalegroup by cntrycodeorder by cntrycodelimit 1;

7. 异步执行查询

Data Lake Analytics支持“同步执行”模式和“异步执行”模式。“同步执行”模式下,控制台界面等待执行结果返回;“异步执行”模式下,立刻返回查询任务的ID。

image.png | left

点击“执行状态”,可以看到该异步查询任务的执行状态,主要分为:“RUNNING”,“SUCCESS”,“FAILURE”。

image.png | left

点击“刷新”,当STATUS变为“SUCCESS”时,表示查询成功,同时可查看查询耗时“ELAPSE_TIME”和查询扫描的数据字节数“SCANNED_DATA_BYTES”。

image.png | left

8. 查看查询历史

点击“执行历史”,可以看到您执行的查询的历史详细信息,包括:

1)查询语句;
2)查询耗时与执行具体时间;
3)查询结果返回行数;
4)查询状态;
5)查询扫描的字节数;
6)结果集回写到的目标OSS文件(Data Lake Analytics会将查询结果集保存用户的bucket中)。

image.png | left

查询结果文件自动上传到用户同region的OSS bucket中,其中包括结果数据文件和结果集元数据描述文件。

{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv.metadata

其中QueryLocation为:

aliyun-oa-query-results-
-

image.png | left

9. 后续

至此,本教程一步一步教您如何利用Data Lake Analytics云产品分析您OSS上的CSV格式的数据文件。除了CSV文件外,Data Lake Analytics还支持Parquet、ORC、json、RCFile、AVRO等多种格式文件的数据分析能力。特别是Parquet、ORC,相比CSV文件,有极大的性能和成本优势(同样内容的数据集,拥有更小的存储空间、更快的查询性能,这也意味着更低的分析成本)。

后续陆续会有更多教程和文章,手把手教您轻松使用Data Lake Analytics进行数据湖上数据分析和探索,开启您的云上低成本、即存即用的数据分析和探索之旅。
G

转载地址:http://usboa.baihongyu.com/

你可能感兴趣的文章
前端项目课程3 jquery1.8.3到1.11.1有了哪些新改变
查看>>
UOJ#179. 线性规划(线性规划)
查看>>
整合spring cloud云架构 - SSO单点登录之OAuth2.0登录认证(1)
查看>>
Isolation Forest原理总结
查看>>
windows的服务中的登录身份本地系统账户、本地服务账户和网络服务账户修改
查看>>
JAVA中循环删除list中元素的方法总结
查看>>
redis 安装
查看>>
SQL some any all
查看>>
电子书下载:Programming Windows Identity Foundation
查看>>
有理想的程序员必须知道的15件事
查看>>
用于测试的字符串
查看>>
财付通和支付宝资料收集
查看>>
PHPCMS V9数据库表结构分析
查看>>
『原创』+『参考』基于PPC的图像对比程序——使用直方图度量
查看>>
理解 IEnumerable 与 IEnumerator
查看>>
NHibernate 2.0 Beta 1 Released和一些工具
查看>>
【每天一个Linux命令】12. Linux中which命令的用法
查看>>
软件接口数据一致性机制
查看>>
微服务架构介绍和RPC框架对比
查看>>
Debian下使用OpenLDAP 管理端
查看>>