问题定位及性能优化

日志管理

OpenTenBase 日志格式说明

执行正确的语句产生的日志

2017-10-11 16:23:55.178 CST,"pgxz","postgres",11499,"127.0.0.1:2329",59ddd50c.2ceb,1,"idle",2017-10-11 16:23:40 CST,3/26053,0,LOG,00000,"statement: select * from pg_class limit 1;",,,,,,,,,"psql"

执行时间                                                            | 2017-10-11 16:23:55.178
用户名                                                              | pgxz
数据库                                                              | postgres
进程id                                                              | 11499
客户端id                                                            | 127.0.0.1:2329
会话 ID                                                             | 59ddd50c.2ceb
每个会话的行号                                                        | 1
命令标签                                                             | idle
登录时间                                                             | 2017-10-11 16:23:40
虚拟事务 ID                                                          | 3/26053
普通事务 ID                                                          | 0
级别                                                                | LOG
SQLSTATE 代码                                                       | 00000
执行信息                                                             | statement: select * from pg_class limit 1;
详情                                                                | 
提示                                                                | 
导致错误的内部查询                                                    | 
错误位置所在的字符计数                                                 | 
错误上下文                                                           | 
导致错误的用户查询（如果有且被log_min_error_statement启用）              | 
错误位置所在的字符计数                                                 | 
在 PostgreSQL 源代码中错误的位置(如果log_error_verbosity被设置为verbose) | 
应用名                                                               | psql

错误日志解释

2017-10-11 16:24:10.233 CST,"pgxz","postgres",11499,"127.0.0.1:2329",59ddd50c.2ceb,2,"idle",2017-10-11 16:23:40 CST,3/26054,0,LOG,00000,"statement: select * from pgxc_nodes limit 1;",,,,,,,,,"psql"
2017-10-11 16:24:10.233 CST,"pgxz","postgres",11499,"127.0.0.1:2329",59ddd50c.2ceb,3,"SELECT",2017-10-11 16:23:40 CST,3/26054,0,ERROR,42P01,"relation ""pgxc_nodes"" does not exist",,,,,,,,,"psql"

向 OpenTenBase 执行错误的语句会产生两条日志，一条是执行语句，一条提示出错的原因

对日志进行分析

创建日志表

CREATE table opentenbase_log
(
  log_time timestamp without time zone,
  user_name text,
  database_name text,
  process_id integer,
  connection_from text,
  session_id text,
  session_line_num bigint,
  command_tag text,
  session_start_time timestamp without time zone,
  virtual_transaction_id text,
  transaction_id bigint,
  error_severity text,
  sql_state_code text,
  message text,
  detail text,
  hint text,
  internal_query text,
  internal_query_pos integer,
  context text,
  query text,
  query_pos integer,
  location text,
  application_name text
);

导入日志数据

OpenTenBase 日志文件默认存储在“数据目录/pg_log”目录下面

postgres=# COPY opentenbase_log FROM '/data/pgxz/data/pgxz/dn001/pg_log/postgresql-Tuesday-16.csv' WITH csv; 
COPY 10790

统计日志数据

--按照session连接及操作时间排序

postgres=# select * from opentenbase_log order by process_id,log_time;

--查询错误日志

SELECT * FROM  opentenbase_log WHERE error_severity='ERROR' limit 1;

--统计session操作数统计

postgres=#  select count(1),process_id,user_name,database_name from opentenbase_log group by process_id,user_name,database_name order by count(1) desc limit 10;
 count | process_id | user_name | database_name 
-------+------------+-----------+---------------
  2770 |      48067 | pgxz      | postgres
    10 |      22143 | pgxz      | postgres
    10 |      28778 | pgxz      | postgres
     9 |      28367 | pgxz      | postgres
     9 |      44280 | pgxz      | postgres
     8 |      32442 | pgxz      | postgres
     7 |      17911 | pgxz      | postgres
     7 |      21865 | pgxz      | postgres
     7 |      26159 | pgxz      | postgres
     7 |      45471 | pgxz      | postgres
(10 rows)

--用户操作统计

postgres=#  select count(1),user_name from opentenbase_log group by user_name order by count(1) desc limit 10;                                                           
 count | user_name 
-------+-----------
 10790 | pgxz

--数据库访问次数统计

postgres=#  select count(1),database_name from opentenbase_log group by database_name order by count(1) desc limit 10;             
 count | database_name 
-------+---------------
 10790 | postgres
(1 row)

--错误信息统计

postgres=# select count(1),user_name,database_name from opentenbase_log where error_severity='ERROR' group by user_name,database_name order by count(1) desc limit 10;           
 count | user_name | database_name 
-------+-----------+---------------
  1390 | pgxz      | postgres
(1 row)

配置只收集慢的sql语句

#用户访问日志格式
log_destination = 'csvlog'

#启用用户访问日志收集器
logging_collector = on

#配置sql语句执行超过多少毫秒数时，语句将被记录,值为-1时禁用，0时记录所有语句
#下面配置只收集运行超过1秒的语句
log_min_duration_statement = 1000

#默认不记录任何日志
log_statement = 'none'

收集到的日志文件内容如下所示

2017-10-15 10:25:54.106 CST,"postgres","postgres",43799,"127.0.0.1:17899",59e2c65b.ab17,4,"SELECT",2017-10-15 10:22:19 CST,2/0,0,LOG,00000,"duration: 1338.366 ms  statement: select * from t where id=20000 or id=2000000;",,,,,,,,,"psql"

系统记录运行的语句及运行时间

如何查询数据是否倾斜

连接上不同的dn节点，查询表的容量大小，如果大小偏差较大就可以判断存在数据倾斜
连接dn001:

[pgxz@VM_0_29_centos pgxz]$ psql -p 15432 -h 172.16.0.47
psql (PostgreSQL 10 (opentenbase 2.01))
Type "help" for help.

postgres=# select pg_size_pretty(pg_table_size('opentenbase_1'));
 pg_size_pretty 
----------------
 2408 kB
(1 row)

postgres=# select pg_size_pretty(pg_table_size('opentenbase_2'));
 pg_size_pretty 
----------------
 896 kB
(1 row)

postgres=# \q

连接dn002:

[pgxz@VM_0_29_centos pgxz]$ psql -p 15431 -h 172.16.0.47 
psql (PostgreSQL 10 (opentenbase 2.01))
Type "help" for help.

postgres=# select pg_size_pretty(pg_table_size('opentenbase_1'));
 pg_size_pretty 
----------------
 2408 kB
(1 row)

postgres=#  select pg_size_pretty(pg_table_size('opentenbase_2'));
 pg_size_pretty 
----------------
 464 kB
(1 row)

上面数据表“opentenbase_2”容量相差为一半，基本可以判定存在数据倾斜。

如何优化有问题的Sql语句

查看是否为分布键查询

postgres=# explain select * from opentenbase_1 where f1=1;        
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001, dn002
   ->  Gather  (cost=1000.00..7827.20 rows=1 width=14)
         Workers Planned: 2
         ->  Parallel Seq Scan on opentenbase_1  (cost=0.00..6827.10 rows=1 width=14)
               Filter: (f1 = 1)
(6 rows)

postgres=# explain select * from opentenbase_1 where f2=1;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001
   ->  Gather  (cost=1000.00..7827.20 rows=1 width=14)
         Workers Planned: 2
         ->  Parallel Seq Scan on opentenbase_1  (cost=0.00..6827.10 rows=1 width=14)
               Filter: (f2 = 1)
(6 rows)

上面第一个查询为非分布键查询，需要发往所有节点，这样最慢的节点决定了整个业务的速度，需要保持所有节点的响应性能一致，业务设计查询时尽可能带上分布键

查看是否使用上索引

postgres=# create index opentenbase_2_f2_idx on opentenbase_2(f2); 
CREATE INDEX
postgres=# explain select * from opentenbase_2 where f2=1;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001, dn002
   ->  Index Scan using opentenbase_2_f2_idx on opentenbase_2  (cost=0.42..4.44 rows=1 width=14)
         Index Cond: (f2 = 1)
(4 rows)

postgres=# explain select * from opentenbase_2 where f3='1';
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001, dn002
   ->  Gather  (cost=1000.00..7827.20 rows=1 width=14)
         Workers Planned: 2
         ->  Parallel Seq Scan on opentenbase_2  (cost=0.00..6827.10 rows=1 width=14)
               Filter: (f3 = '1'::text)
(6 rows)

第一个查询使用了索引，第二个没有使用索引，通常情况下，使用索引可以加速查询速度，但要记住索引也会增加更新的开销

查看是否为分布key join

postgres=# explain  select opentenbase_1.* from opentenbase_1,opentenbase_2 where opentenbase_1.f1=opentenbase_2.f1 ;           
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Remote Subquery Scan on all (dn001,dn002)  (cost=29.80..186.32 rows=3872 width=40)
   ->  Hash Join  (cost=29.80..186.32 rows=3872 width=40)
         Hash Cond: (opentenbase_1.f1 = opentenbase_2.f1)
         ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..158.40 rows=880 width=40)
               Distribute results by S: f1
               ->  Seq Scan on opentenbase_1  (cost=0.00..18.80 rows=880 width=40)
         ->  Hash  (cost=18.80..18.80 rows=880 width=4)
               ->  Seq Scan on opentenbase_2  (cost=0.00..18.80 rows=880 width=4)
(8 rows)

postgres=# explain  select opentenbase_1.* from opentenbase_1,opentenbase_2 where opentenbase_1.f2=opentenbase_2.f1 ;   
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn001, dn002
   ->  Hash Join  (cost=18904.69..46257.08 rows=500564 width=14)
         Hash Cond: (opentenbase_1.f2 = opentenbase_2.f1)
         ->  Seq Scan on opentenbase_1  (cost=0.00..9225.64 rows=500564 width=14)
         ->  Hash  (cost=9225.64..9225.64 rows=500564 width=4)
               ->  Seq Scan on opentenbase_2  (cost=0.00..9225.64 rows=500564 width=4)
(7 rows)

第一查询需要数据重分布，而第二个是不需要，分布键join查询性能会更高。

查看join发生的节点

postgres=#  explain  select opentenbase_1.* from opentenbase_1,opentenbase_2 where opentenbase_1.f1=opentenbase_2.f1 ;    
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Hash Join  (cost=29.80..186.32 rows=3872 width=40)
   Hash Cond: (opentenbase_1.f1 = opentenbase_2.f1)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..158.40 rows=880 width=40)
         ->  Seq Scan on opentenbase_1  (cost=0.00..18.80 rows=880 width=40)
   ->  Hash  (cost=126.72..126.72 rows=880 width=4)
         ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..126.72 rows=880 width=4)
               ->  Seq Scan on opentenbase_2  (cost=0.00..18.80 rows=880 width=4)
(7 rows)

postgres=# set prefer_olap to on;
SET
postgres=# explain  select opentenbase_1.* from opentenbase_1,opentenbase_2 where opentenbase_1.f1=opentenbase_2.f1 ;  
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Remote Subquery Scan on all (dn001,dn002)  (cost=29.80..186.32 rows=3872 width=40)
   ->  Hash Join  (cost=29.80..186.32 rows=3872 width=40)
         Hash Cond: (opentenbase_1.f1 = opentenbase_2.f1)
         ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..158.40 rows=880 width=40)
               Distribute results by S: f1
               ->  Seq Scan on opentenbase_1  (cost=0.00..18.80 rows=880 width=40)
         ->  Hash  (cost=18.80..18.80 rows=880 width=4)
               ->  Seq Scan on opentenbase_2  (cost=0.00..18.80 rows=880 width=4)
(8 rows)

上面join在cn节点执行，下面的在dn上重分布后再join，业务上设计一般oltp类业务在cn上进行少数据量join性能会更好

查看并行的worker数

postgres=# explain select count(1) from opentenbase_1;
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=118.81..118.83 rows=1 width=8)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=118.80..118.81 rows=1 width=0)
         ->  Partial Aggregate  (cost=18.80..18.81 rows=1 width=8)
               ->  Seq Scan on opentenbase_1  (cost=0.00..18.80 rows=880 width=0)
(4 rows)

postgres=# analyze opentenbase_1;
ANALYZE
postgres=# explain select count(1) from opentenbase_1;
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Parallel Finalize Aggregate  (cost=14728.45..14728.46 rows=1 width=8)
   ->  Parallel Remote Subquery Scan on all (dn001,dn002)  (cost=14728.33..14728.45 rows=1 width=0)
         ->  Gather  (cost=14628.33..14628.44 rows=1 width=8)
               Workers Planned: 2
               ->  Partial Aggregate  (cost=13628.33..13628.34 rows=1 width=8)
                     ->  Parallel Seq Scan on opentenbase_1  (cost=0.00..12586.67 rows=416667 width=0)
(6 rows)

上面第一个查询没走并行，analyze后走并行才是正确的，建议大数据量更新再执行analyze。

检查各个节点的执行计划是否一致

./opentenbase_run_sql_dn_master.sh "explain select * from opentenbase_2 where f2=1"  
dn006 --- psql -h 172.16.0.13 -p 11227 -d postgres -U opentenbase -c "explain select * from opentenbase_2 where f2=1"
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Bitmap Heap Scan on opentenbase_2  (cost=2.18..7.70 rows=4 width=40)
   Recheck Cond: (f2 = 1)
   ->  Bitmap Index Scan on opentenbase_2_f2_idx  (cost=0.00..2.18 rows=4 width=0)
         Index Cond: (f2 = 1)
(4 rows)

dn002 --- psql -h 172.16.0.42 -p 11012 -d postgres -U opentenbase -c "explain select * from opentenbase_2 where f2=1"
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Index Scan using opentenbase_2_f2_idx on opentenbase_2  (cost=0.42..4.44 rows=1 width=14)
   Index Cond: (f2 = 1)
(2 rows)

这两个dn的执行计划不一致，最大可能可以是数据倾斜或者是执行计划给禁用导致的。

如果有可能的话，DBA可以配置在系统空闲时执行全库analyze和vacuum。

优化实例

count(distinct xx)优化

postgres=# CREATE TABLE t1(f1 serial not null unique,f2 text,f3 text,f4 text,f5 text,f6 text,f7 text,f8 text,f9 text,f10 text,f11 text,f12 text) distribute by shard(f1);
NOTICE:  Replica identity is needed for shard table, please add to this table through "alter table" command.
CREATE TABLE
Time: 89.938 ms

postgres=# insert into t1 select t,md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text) from generate_series(1,1000000) as t;
INSERT 0 1000000
Time: 14849.045 ms (00:14.849)

postgres=# analyze t1;
ANALYZE
Time: 1340.387 ms (00:01.340)

postgres=# explain (verbose)  select count(distinct f2) from t1;  
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=103320.00..103320.01 rows=1 width=8)
   Output: count(DISTINCT f2)
   ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.00..100820.00 rows=1000000 width=33)
         Output: f2
         ->  Seq Scan on public.t1  (cost=0.00..62720.00 rows=1000000 width=33)
               Output: f2
(6 rows)

Time: 0.748 ms
postgres=# select count(distinct f2) from t1;  
  count  
---------
 1000000
(1 row)

Time: 6274.684 ms (00:06.275)

postgres=# select count(distinct f2) from t1 where f1 <10;       
 count 
-------
     9
(1 row)

Time: 19.261 ms

上面发现count(distinct f2)是发生在cn节点，对于TP类业务，需要操作的数据量少的情况下，性能开销是没有问题的，而且往往比下推执行的性能开销还要小。但如果一次要操作的数据量比较大的ap类业务，则网络传输就会成功瓶颈，下面看看改写后的执行计划

postgres=# explain (verbose)  select count(1) from (select f2 from t1 group by f2) as t ; 
                                                                           QUERY PLAN                                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=355600.70..355600.71 rows=1 width=8)
   Output: count(1)
   ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=355600.69..355600.70 rows=1 width=0)
         Output: PARTIAL count(1)
         ->  Partial Aggregate  (cost=355500.69..355500.70 rows=1 width=8)
               Output: PARTIAL count(1)
               ->  Group  (cost=340500.69..345500.69 rows=1000000 width=33)
                     Output: t1.f2
                     Group Key: t1.f2
                     ->  Sort  (cost=340500.69..343000.69 rows=1000000 width=0)
                           Output: t1.f2
                           Sort Key: t1.f2
                           ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=216192.84..226192.84 rows=1000000 width=0)
                                 Output: t1.f2
                                 Distribute results by S: f2
                                 ->  Group  (cost=216092.84..221092.84 rows=1000000 width=33)
                                       Output: t1.f2
                                       Group Key: t1.f2
                                       ->  Sort  (cost=216092.84..218592.84 rows=1000000 width=33)
                                             Output: t1.f2
                                             Sort Key: t1.f2
                                             ->  Seq Scan on public.t1  (cost=0.00..62720.00 rows=1000000 width=33)
                                                   Output: t1.f2
(23 rows)

改写后，并行推到dn去执行，现在看看执行的效果

postgres=# select count(1) from (select f2 from t1 group by f2) as t ; 
  count  
---------
 1000000
(1 row)

Time: 1328.431 ms (00:01.328)
postgres=# select count(1) from (select f2 from t1 where f1<10 group by f2) as t ; 
 count 
-------
     9
(1 row)

Time: 24.991 ms
postgres=#

我们可以看到对于大量数据计算的AP类业务，性能提高了5倍

增大work_mem减少io访问

postgres=# CREATE TABLE t1(f1 serial not null unique,f2 text,f3 text,f4 text,f5 text,f6 text,f7 text,f8 text,f9 text,f10 text,f11 text,f12 text) distribute by shard(f1);
NOTICE:  Replica identity is needed for shard table, please add to this table through "alter table" command.
CREATE TABLE
Time: 70.545 ms

postgres=# CREATE TABLE t2(f1 serial not null unique,f2 text,f3 text,f4 text,f5 text,f6 text,f7 text,f8 text,f9 text,f10 text,f11 text,f12 text) distribute by shard(f1); 
NOTICE:  Replica identity is needed for shard table, please add to this table through "alter table" command.
CREATE TABLE
Time: 61.913 ms

postgres=# insert into t1 select t,md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text) from generate_series(1,1000) as t;   
INSERT 0 1000
Time: 48.866 ms

postgres=# insert into t2 select t,md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text) from generate_series(1,50000) as t;     
INSERT 0 50000
Time: 792.858 ms

postgres=# analyze t1;
ANALYZE
Time: 175.946 ms

postgres=# analyze t2;
ANALYZE
Time: 318.802 ms
postgres=# 

postgres=# explain  select * from t1 where f2 not in (select f2 from t2);                
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=0.00..2076712.50 rows=500 width=367)
   ->  Seq Scan on t1  (cost=0.00..2076712.50 rows=500 width=367)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..4028.00 rows=50000 width=33)
                 ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=0.00..3240.00 rows=50000 width=33)
                       ->  Seq Scan on t2  (cost=0.00..3240.00 rows=50000 width=33)
(7 rows)

Time: 0.916 ms
postgres=# select * from t1 where f2 not in (select f2 from t2);                
 f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 
----+----+----+----+----+----+----+----+----+-----+-----+-----
(0 rows)

Time: 4226.825 ms (00:04.227)

postgres=# set work_mem to '8MB';
SET
Time: 0.289 ms
postgres=# explain  select * from t1 where f2 not in (select f2 from t2);                
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=3365.00..3577.50 rows=500 width=367)
   ->  Seq Scan on t1  (cost=3365.00..3577.50 rows=500 width=367)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=0.00..3240.00 rows=50000 width=33)
                 ->  Seq Scan on t2  (cost=0.00..3240.00 rows=50000 width=33)
(6 rows)

Time: 0.890 ms
postgres=# select * from t1 where f2 not in (select f2 from t2);           
 f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 
----+----+----+----+----+----+----+----+----+-----+-----+-----
(0 rows)

Time: 105.249 ms

增大work_mem后，性能提高了40倍，因为work_mem足够放下filter的数据，不需要再做 Materialize物化，filter由原来的subplan变成了hash subplan，直接在内存hash表中filter，性能就上去了

注意，work_mem默认不宜过大，建议在某个具体的查询语句中再根据需要进行调整即可。

not in改写为anti join

上面通过增大计算内存达到提高性能，但内存不可能无限扩大，下面通过改写语句也可以达到提高查询的性能。

postgres=#  explain select * from t1 left outer join t2 on t1.f2 = t2.f2 where t2.f2 is null; 
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=6405.00..9260.75 rows=1 width=734)
   ->  Hash Anti Join  (cost=6405.00..9260.75 rows=1 width=734)
         Hash Cond: (t1.f2 = t2.f2)
         ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.00..682.00 rows=1000 width=367)
               Distribute results by S: f2
               ->  Seq Scan on t1  (cost=0.00..210.00 rows=1000 width=367)
         ->  Hash  (cost=21940.00..21940.00 rows=50000 width=367)
               ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.00..21940.00 rows=50000 width=367)
                     Distribute results by S: f2
                     ->  Seq Scan on t2  (cost=0.00..3240.00 rows=50000 width=367)
(10 rows)

Time: 1.047 ms
postgres=# select * from t1 left outer join t2 on t1.f2 = t2.f2 where t2.f2 is null; 
 f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 
----+----+----+----+----+----+----+----+----+-----+-----+-----+----+----+----+----+----+----+----+----+----+-----+-----+-----
(0 rows)

Time: 107.233 ms
postgres=#

也可以修改not exists

postgres=# explain select * from t1 where not exists( select 1 from t2 where t1.f2=t2.f2);
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=3865.00..4078.75 rows=1 width=367)
   ->  Hash Anti Join  (cost=3865.00..4078.75 rows=1 width=367)
         Hash Cond: (t1.f2 = t2.f2)
         ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.00..682.00 rows=1000 width=367)
               Distribute results by S: f2
               ->  Seq Scan on t1  (cost=0.00..210.00 rows=1000 width=367)
         ->  Hash  (cost=5240.00..5240.00 rows=50000 width=33)
               ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.00..5240.00 rows=50000 width=33)
                     Distribute results by S: f2
                     ->  Seq Scan on t2  (cost=0.00..3240.00 rows=50000 width=33)
(10 rows)

Time: 0.974 ms
postgres=# select * from t1 where not exists( select 1 from t2 where t1.f2=t2.f2);        
 f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 
----+----+----+----+----+----+----+----+----+-----+-----+-----
(0 rows)

Time: 42.944 ms

分布key jon+limit优化

--数据准备  
postgres=# CREATE TABLE t1(f1 serial not null unique,f2 text,f3 text,f4 text,f5 text,f6 text,f7 text,f8 text,f9 text,f10 text,f11 text,f12 text) distribute by shard(f1); 
NOTICE:  Replica identity is needed for shard table, please add to this table through "alter table" command.
CREATE TABLE
postgres=# CREATE TABLE t2(f1 serial not null unique,f2 text,f3 text,f4 text,f5 text,f6 text,f7 text,f8 text,f9 text,f10 text,f11 text,f12 text) distribute by shard(f1); 
NOTICE:  Replica identity is needed for shard table, please add to this table through "alter table" command.
CREATE TABLE
postgres=# insert into t1 select t,md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text) from generate_series(1,1000000) as t;
INSERT 0 1000000
postgres=# insert into t2 select t,md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text) from generate_series(1,1000000) as t; 
INSERT 0 1000000
postgres=# analyze t1;
ANALYZE
postgres=# analyze t2;
ANALYZE
postgres=# 

postgres=# \timing 
Timing is on.
postgres=# explain  select t1.* from t1,t2 where t1.f1=t2.f1 limit 10;
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.25..1.65 rows=10 width=367)
   ->  Merge Join  (cost=0.25..140446.26 rows=1000000 width=367)
         Merge Cond: (t1.f1 = t2.f1)
         ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.12..434823.13 rows=1000000 width=367)
               ->  Index Scan using t1_f1_key on t1  (cost=0.12..62723.13 rows=1000000 width=367)
         ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.12..71823.13 rows=1000000 width=4)
               ->  Index Only Scan using t2_f1_key on t2  (cost=0.12..62723.13 rows=1000000 width=4)
(7 rows)

Time: 1.372 ms

postgres=# explain analyze  select t1.* from t1,t2 where t1.f1=t2.f1 limit 10;       
                                                                                         QUERY PLAN                                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.25..1.65 rows=10 width=367) (actual time=2675.437..2948.199 rows=10 loops=1)
   ->  Merge Join  (cost=0.25..140446.26 rows=1000000 width=367) (actual time=2675.431..2675.508 rows=10 loops=1)
         Merge Cond: (t1.f1 = t2.f1)
         ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.12..434823.13 rows=1000000 width=367) (actual time=1.661..1.704 rows=10 loops=1)
         ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.12..71823.13 rows=1000000 width=4) (actual time=2673.761..2673.783 rows=10 loops=1)
 Planning time: 0.358 ms
 Execution time: 2973.948 ms
(7 rows)

Time: 2976.008 ms (00:02.976)
postgres=#

看执行计划是在cn上面执行，merge join需要把要join的数据拉回cn再排序，然后再join，这里主切的开销在于网络，优化的话方法就是让语句其推下去计算

postgres=# set prefer_olap to on;
SET
Time: 0.291 ms
postgres=#  explain  select t1.* from t1,t2 where t1.f1=t2.f1 limit 10;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=100.25..101.70 rows=10 width=367)
   ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.25..101.70 rows=10 width=367)
         ->  Limit  (cost=0.25..1.65 rows=10 width=367)
               ->  Merge Join  (cost=0.25..140446.26 rows=1000000 width=367)
                     Merge Cond: (t1.f1 = t2.f1)
                     ->  Index Scan using t1_f1_key on t1  (cost=0.12..62723.13 rows=1000000 width=367)
                     ->  Index Only Scan using t2_f1_key on t2  (cost=0.12..62723.13 rows=1000000 width=4)
(7 rows)

Time: 1.061 ms

postgres=# explain analyze  select t1.* from t1,t2 where t1.f1=t2.f1 limit 10;   
                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=100.25..101.70 rows=10 width=367) (actual time=1.527..3.899 rows=10 loops=1)
   ->  Remote Subquery Scan on all (dn01,dn02,dn03,dn04,dn05,dn06,dn07,dn08,dn09,dn10)  (cost=100.25..101.70 rows=10 width=367) (actual time=1.525..1.529 rows=10 loops=1)
 Planning time: 0.360 ms
 Execution time: 18.193 ms
(4 rows)

Time: 19.921 ms

相差150倍的性能，一般情况下，如果需要拉大量的数据回cn计算，则下推执行的效率会更好

非分布key join使用hash join性能一般最好

为了提高tp类业务查询的性能，我们经常需要对一些字段建立索引，使用有索引字段join时系统往往也会使用Merge Cond和nestloop

--准备数据
mydb=# CREATE TABLE t1(f1 serial not null,f2 text,f3 text,f4 text,f5 text,f6 text,f7 text,f8 text,f9 text,f10 text,f11 text,f12 text) distribute by shard(f1); 
NOTICE:  Replica identity is needed for shard table, please add to this table through "alter table" command.
CREATE TABLE
Time: 481.042 ms

mydb=# create index t1_f1_idx on t1(f2); 
CREATE INDEX
Time: 85.521 ms

mydb=# CREATE TABLE t2(f1 serial not null,f2 text,f3 text,f4 text,f5 text,f6 text,f7 text,f8 text,f9 text,f10 text,f11 text,f12 text) distribute by shard(f1); 
NOTICE:  Replica identity is needed for shard table, please add to this table through "alter table" command.
CREATE TABLE
Time: 75.973 ms

mydb=# create index t2_f1_idx on t2(f2);  
CREATE INDEX
Time: 29.890 ms

mydb=# insert into t1 select t,md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text) from generate_series(1,1000000) as t;
INSERT 0 1000000
Time: 16450.623 ms (00:16.451)

mydb=# insert into t2 select t,md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text),md5(t::text) from generate_series(1,1000000) as t; 
INSERT 0 1000000
Time: 17218.738 ms (00:17.219)

mydb=# analyze t1;
ANALYZE

Time: 2219.341 ms (00:02.219)
mydb=# analyze t2;
ANALYZE

Time: 1649.506 ms (00:01.650)
mydb=#

--merge join

mydb=# explain select t1.* from t1,t2 where t1.f2=t2.f2 limit 10;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=100.25..102.78 rows=10 width=367)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.25..102.78 rows=10 width=367)
         ->  Limit  (cost=0.25..2.73 rows=10 width=367)
               ->  Merge Join  (cost=0.25..248056.80 rows=1000000 width=367)
                     Merge Cond: (t1.f2 = t2.f2)
                     ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.12..487380.85 rows=1000000 width=367)
                           Distribute results by S: f2
                           ->  Index Scan using t1_f1_idx on t1  (cost=0.12..115280.85 rows=1000000 width=367)
                     ->  Materialize  (cost=100.12..155875.95 rows=1000000 width=33)
                           ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.12..153375.95 rows=1000000 width=33)
                                 Distribute results by S: f2
                                 ->  Index Only Scan using t2_f1_idx on t2  (cost=0.12..115275.95 rows=1000000 width=33)
(12 rows)

Time: 4.183 ms

mydb=# explain analyze select t1.* from t1,t2 where t1.f2=t2.f2 limit 10;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=100.25..102.78 rows=10 width=367) (actual time=6555.346..6556.296 rows=10 loops=1)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.25..102.78 rows=10 width=367) (actual time=6555.343..6555.349 rows=10 loops=1)
 Planning time: 0.473 ms
 Execution time: 6569.828 ms
(4 rows)

Time: 6614.439 ms (00:06.614)

--nested loop  

mydb=# set enable_mergejoin to off;
SET
Time: 0.422 ms
mydb=# explain select t1.* from t1,t2 where t1.f2=t2.f2 limit 10;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=100.12..103.57 rows=10 width=367)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.12..103.57 rows=10 width=367)
         ->  Limit  (cost=0.12..3.52 rows=10 width=367)
               ->  Nested Loop  (cost=0.12..339232.00 rows=1000000 width=367)
                     ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..434740.00 rows=1000000 width=367)
                           Distribute results by S: f2
                           ->  Seq Scan on t1  (cost=0.00..62640.00 rows=1000000 width=367)
                     ->  Materialize  (cost=100.12..100.31 rows=1 width=33)
                           ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.12..100.30 rows=1 width=33)
                                 Distribute results by S: f2
                                 ->  Index Only Scan using t2_f1_idx on t2  (cost=0.12..0.27 rows=1 width=33)
                                       Index Cond: (f2 = t1.f2)
(12 rows)

Time: 1.033 ms

mydb=# explain analyze select t1.* from t1,t2 where t1.f2=t2.f2 limit 10;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=100.12..103.57 rows=10 width=367) (actual time=5608.326..5609.571 rows=10 loops=1)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.12..103.57 rows=10 width=367) (actual time=5608.323..5608.349 rows=10 loops=1)
 Planning time: 0.347 ms
 Execution time: 5669.901 ms
(4 rows)

Time: 5672.584 ms (00:05.673)

--hash join

mydb=# set enable_nestloop to off;
SET
Time: 0.436 ms
mydb=#  explain select t1.* from t1,t2 where t1.f2=t2.f2 limit 10;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=85983.00..85984.94 rows=10 width=367)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=85983.00..85984.94 rows=10 width=367)
         ->  Limit  (cost=85883.00..85884.89 rows=10 width=367)
               ->  Hash Join  (cost=85883.00..274580.00 rows=1000000 width=367)
                     Hash Cond: (t1.f2 = t2.f2)
                     ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..434740.00 rows=1000000 width=367)
                           Distribute results by S: f2
                           ->  Seq Scan on t1  (cost=0.00..62640.00 rows=1000000 width=367)
                     ->  Hash  (cost=100740.00..100740.00 rows=1000000 width=33)
                           ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..100740.00 rows=1000000 width=33)
                                 Distribute results by S: f2
                                 ->  Seq Scan on t2  (cost=0.00..62640.00 rows=1000000 width=33)
(12 rows)

Time: 1.141 ms

mydb=#  explain analyze select t1.* from t1,t2 where t1.f2=t2.f2 limit 10;
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=85983.00..85984.94 rows=10 width=367) (actual time=1083.691..1085.962 rows=10 loops=1)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=85983.00..85984.94 rows=10 width=367) (actual time=1083.688..1083.699 rows=10 loops=1)
 Planning time: 0.530 ms
 Execution time: 1108.830 ms
(4 rows)

Time: 1117.713 ms (00:01.118)

exists的优化

exists在数据量比较大情况下，一般使用的是Semi Join ，在work_mem足够大的情况下走的是hash join，性能会更好

postgres=# show work_mem;
 work_mem 
----------
 4MB
(1 row)

Time: 0.298 ms
postgres=# explain  select count(1) from t1 where exists(select 1 from t2 where t2.t1_f1=t1.f1);
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=242218.32..242218.33 rows=1 width=8)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=242218.30..242218.32 rows=1 width=0)
         ->  Partial Aggregate  (cost=242118.30..242118.31 rows=1 width=8)
               ->  Hash Semi Join  (cost=110248.00..242118.30 rows=505421 width=0)
                     Hash Cond: (t1.f1 = t2.t1_f1)
                     ->  Seq Scan on t1  (cost=0.00..17420.00 rows=1000000 width=4)
                     ->  Hash  (cost=79340.00..79340.00 rows=3000000 width=4)
                           ->  Remote Subquery Scan on all (dn001,dn002)  (cost=100.00..79340.00 rows=3000000 width=4)
                                 Distribute results by S: t1_f1
                                 ->  Seq Scan on t2  (cost=0.00..52240.00 rows=3000000 width=4)
(10 rows)

Time: 1.091 ms
postgres=# select count(1) from t1 where exists(select 1 from t2 where t2.t1_f1=t1.f1);         
 count  
--------
 500000
(1 row)

Time: 3779.401 ms (00:03.779)

postgres=# set work_mem to '128MB';
SET
Time: 0.368 ms
postgres=# explain  select count(1) from t1 where exists(select 1 from t2 where t2.t1_f1=t1.f1);
                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=101763.76..101763.77 rows=1 width=8)
   ->  Remote Subquery Scan on all (dn001,dn002)  (cost=101763.75..101763.76 rows=1 width=0)
         ->  Partial Aggregate  (cost=101663.75..101663.76 rows=1 width=8)
               ->  Hash Join  (cost=89660.00..101663.75 rows=505421 width=0)
                     Hash Cond: (t2.t1_f1 = t1.f1)
                     ->  Remote Subquery Scan on all (dn001,dn002)  (cost=59840.00..69443.00 rows=505421 width=4)
                           Distribute results by S: t1_f1
                           ->  HashAggregate  (cost=59740.00..64794.21 rows=505421 width=4)
                                 Group Key: t2.t1_f1
                                 ->  Seq Scan on t2  (cost=0.00..52240.00 rows=3000000 width=4)
                     ->  Hash  (cost=17420.00..17420.00 rows=1000000 width=4)
                           ->  Seq Scan on t1  (cost=0.00..17420.00 rows=1000000 width=4)
(12 rows)

Time: 4.739 ms
postgres=# select count(1) from t1 where exists(select 1 from t2 where t2.t1_f1=t1.f1);         
 count  
--------
 500000
(1 row)

Time: 1942.037 ms (00:01.942)
postgres=#

大约有一倍性能的提升