Netezza – Data Analytics – My Point of View

Netezza Zone Maps in a normalized Data Model

Introduction

Netezza is a data warehousing appliance that uses an Asymmetric Massive Parallel Processing Architecture. The Netezza architecture is driven by two fundamental principles- process close to the data source and do not move data sets unless absolutely necessary. Among the two principles, the latter is implemented primarily by commodity hardware Field Programmable Gate Arrays (FPGAs). FPGA plays the pivotal role in filtering out data as soon as possible, removing I/O bottlenecks, freeing up valuable downstream components such as memory and processor. Zone Maps are internal data structures of Netezza that enables FPGAs to filter out data. In simplistic terms, a Zone Map is a persistent table maintained in Netezza that contains information about tables created by users. Zone Maps store the maximum and minimum values of columns corresponding to pages that store data corresponding to a table. Unlike traditional data structures such as index that helps determine where exactly you should look for data, zone maps help determine where to not look for data. I find this concept very fascinating and analogous with a strategy to answer a popular question that all of us have faced at some point of time – ‘What do you want to be in this world when you grow up?’ I believe the easiest way to start answering this question is to answer the question ‘What do you not want to be in this world’?

This article addresses two questions that I was curious about

How can I know if Zone Maps are used in my query?
Do I need a completely de-normalized data model for effectively using Zone Maps?

Utilization of Zone Maps in a normalized data model

In order to answer these questions, I considered a simple data model consisting of a fact table (FACT_AR_LEDGER) and a dimension table (INVOICE). The primary key of the dimension table (INVOICE_KEY) is present in the fact table. BILLING_CUSTOMER_KEY and INVOICE_STATUS are columns present in the dimension table. These columns have not been propagated to the fact table.

Though the effective utilization of Zone Maps is dependent on the optimal ordering of data, data was not ordered in any fashion for the test cases run below:

*Query 1- Retrieve 1000 rows from fact table with where clause on column present in fact table* select INVOICE_STATUS from FACT_AR_LEDGER a
join INVOICE b
on a.INVOICE_KEY=b.INVOICE_KEY
where a.INVOICE_KEY=151854
limit 1000 *Query 2- Retrieve 1000 rows from fact table with where clause on column not present in fact table* select INVOICE_STATUS from FACT_AR_LEDGER a join INVOICE b on a.INVOICE_KEY=b.INVOICE_KEY where b.BILLING_CUSTOMER_KEY=6364340 limit 1000; Note: In Query 2, Billing Customer (Billing_Customer_Key=6364340) is only linked to one Invoice (INVOICE_KEY=151854). Hence, the target record counts for both queries are same. Both queries have also been limited to the first 1000 records.

A portion of the Explain Plan for Query 1 corresponding to the sequential scan of the fact table is shown below. Node 2.
[SPU Sequential Scan table “FACT_AR_LEDGER” as “A” {}]
— Estimated Rows = 446058, Width = 59, Cost = 0.0 .. 41673.0, Conf = 90.0 [BT: MaxPages=32557 TotalPages=714725] (JIT-Stats)
Restrictions:
(A.INVOICE_KEY = 151854)
Projections:
1:A.INVOICE_STATUS
2:A.INVOICE_KEY
Cardinality:
A.INVOICE_KEY 1 (JIT)

As per the Explain plan, you will notice the optimizer utilizes Zone Map information in JIT stats which is reflected in the MaxPages count (32557) which is significantly lower than the TotalPages (714725).

A portion of the Explain Plan for Query 2 corresponding to the sequential scan of the fact table is shown below

Node 2.

[SPU Sequential Scan table “FACT_AR_LEDGER” as “A” {}]

— Estimated Rows = 766644778, Width = 59, Cost = 0.0 .. 174918.1, Conf = 90.0 (FACT)

Projections:

1:A.INVOICE_STATUS

2:A.INVOICE_KEY

Cardinality:

A.INVOICE_KEY 170.3K (JIT)

As per the Explain Plan of Query 2, it was observed that there was no restriction to the pages to be scanned which may suggest that Zone Maps are not utilized for this query. Our analysis aligns well with basic logic that when a fact table is scanned, only Zone Maps corresponding to columns present in the table will be utilized. In addition, the cost comparison for sequential scan for FACT_AR_LEDGER is lower for Query 1 in comparison with Query 2 due to the use of Zone Maps in Explain Plan of Query 1.

Before concluding that Zone Maps were not utilized in Query 2, let us take a moment to compare the Execution Plan Files of both these queries.

Plan File – Query 1

— Object “/nz/data.1.0/plans/3388915/s3388915_1.o” from cache “/nz/data.1.0/cache/287/3o”

— Object “/nz/data.1.0/plans/3388915/h3388915_501.o” from cache “/nz/data.1.0/cache/143/0o”

— Snippet 3388915_1 exec 0.006 @ 0% res 0.007 channels 1 hmem 1 mem 16 exp h/s 0%/0%

— Detail 3388915_1 table 33757/33766.09/33776 zmread 2 zmhits 2 #22 hcpu 0.002 cpu 0.003/0.020 dread 0.003 mem 3

/nz/kit.7.2.0.4-P1/sys/cc/bin/i686-mcpnps-linux-gnu-g++ -march=prescott -mfpmath=sse -include inclall_1.h -fPIC -o /nz/data.1.0/plans/3388915/s3388915_2.o -shared -O2 -fno-strict-aliasing -fexceptions -fsigned-char -Wno-invalid-offsetof -DNZDEBUG=0 -DGENCODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -DFOR_SPU -I/nz/kit.7.2.0.4-P1/sys/include -I/nz/data.1.0/plans.include/genpch /nz/data.1.0/plans/3388915/s3388915_2.cpp

— Object “/nz/data.1.0/plans/3388915/s3388915_2.o” compiled dur 0.188 complexity 3733

/nz/kit.7.2.0.4-P1/sys/cc/bin/i686-rhel5-linux-gnu-g++ -march=prescott -mfpmath=sse -include inclall_4.h -O -fno-strict-aliasing -fPIC -shared -Wa,–32 -o /nz/data.1.0/plans/3388915/h3388915_502.o -DNZDEBUG=0 -DGENCODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -I/nz/kit.7.2.0.4-P1/sys/include -I/nz/data.1.0/plans.include/genpch /nz/data.1.0/plans/3388915/h3388915_502.cpp

— Object “/nz/data.1.0/plans/3388915/h3388915_502.o” compiled dur 0.155 complexity 1798

— Snippet 3388915_2 exec 0.958 @ 100% res 1.588 channels 1 hmem 295 mem 17 estrows 2 exp h/s 6%/33% priority

— Detail 3388915_2 table 33757/33766.09/33776 scan 32402/32487.50/32557 hits 6/6.95/8 zmread 3 zmhits 3 #22 hcpu 0.350 cpu 0.018/0.040 dread 1.220/1.333 fpga 0.593/0.885 mem 6 temp 1

—————-

As shown in the plan file, there is a reduction in the scan counts (32402/32487.50/32557) in comparison with the table counts (33757/33766.09/33776) which supports our preliminary analysis that zone maps were used during the execution of this query.

Plan File – Query 2

— Object “/nz/data.1.0/plans/3580088/s3580088_1.o” from cache “/nz/data.1.0/cache/287/3o”

— Object “/nz/data.1.0/plans/3580088/h3580088_501.o” from cache “/nz/data.1.0/cache/143/0o”

— Snippet 3580088_1 exec 0.003 @ 0% res 0.001 channels 1 hmem 1 mem 16 exp h/s 0%/0%

— Detail 3580088_1 table 33757/33766.09/33776 #22 hcpu 0.001 mem 3

— Object “/nz/data.1.0/plans/3580088/s3580088_2.o” from cache “/nz/data.1.0/cache/137/6bo”

— Object “/nz/data.1.0/plans/3580088/h3580088_502.o” from cache “/nz/data.1.0/cache/52/4ao”

— Snippet 3580088_2 exec 0.950 @ 100% res 1.271 channels 1 hmem 295 mem 17 estrows 1000 exp h/s 6%/33% priority

— Detail 3580088_2 table 33757/33766.09/33776 scan 32402/32487.50/32557 zmread 1 zmhits 1 #22 hcpu 0.002 cpu 0.057/0.080 dread 1.212/1.323 fpga 0.668/0.992 mem 6 temp 1

It was noticed as per the plan file, there was a reduction in the scan counts (32402/32487.50/32557) in comparison with the table counts (33757/33766.09/33776) which negates our preliminary analysis that Zone Maps were not used during execution of Query 2. Netezza was smart enough to identify INVOICE_KEY corresponding to the filter (BILLING_CUSTOMER_KEY=6364340) on the dimension table (INVOICE) and utilize this information via Zone Maps on the fact table (FACT_AR_LEDGER).

Summary

Explain Plan information is not enough to conclude whether Zone Maps are used in the execution of a query. The statistics obtained from the corresponding Plan Files of queries give us a better understanding about the use of Zone Maps in a query.
Utilization of Zone Maps does not require columns applied as restrictions to be present in fact tables. In other words, Netezza does not require all keys to be propagated as Foreign Keys to fact tables for zone maps to be utilized.
Netezza is a cost based optimizer. The cost for sequential scan of a fact table when the query is restricted based on a column present in the fact table is significantly lower in comparison with a query restriction based on a dimension column not present in the fact table. Cost of queries determines resources required for query execution, thereby playing an important role in Workload Management. Hence, if there are dimension keys that do not change over a period of time and are candidate columns for restrictions/filters, add these keys as columns in fact tables.