Backdating Scenarios in Type 2 Dimensions

Recently, I had an interesting conversation with my project’s data architect regarding possible back-dated changes for a primary dimension — Employee — in the data warehouse. In our existing data model, Employee was maintained as a Type 2 slowly changing dimension.

Six months into deployment, it was confirmed by business management that employee changes could be back-dated. This conversation reminded me of a project that I was part of three years earlier where such back-dated scenarios happened frequently.

Many of us are familiar with design sessions where business leaders confirm that the chances of a certain scenario occurring are slim. The architect then designs a solution without considering this supposedly rare scenario at all. A month after deployment, the scenario occurs and breaks the existing design. Back-dated changes are an example of such “rare scenarios.”

Beyond the challenge of designing a solution that handles back-dated changes, such scenarios bring additional complexities: records may need to be inserted within historical date ranges, dimension surrogate keys can fall out of sequence, historical fact data may need to be re-keyed with new surrogate keys, and summarized data in fact tables and aggregate fact tables may need to be restated.

After the code was deployed to the production environment, the team realized that this “rare scenario” happened more often than anticipated. For the first few occurrences, manual updates were performed to clean the data. Soon it was decided to redesign the architecture to handle back-dated scenarios.

With rapid technological innovation in business intelligence, business management expects solutions to all of their issues in return for their investment. Often overlooked are the complexities that scenarios like backdating present. I am aware that Oracle BI handles such back-dated changes as part of its solution, but not all systems do. As the BI field matures, vendor systems will come to treat back-dated scenarios as a normal part of any data warehousing design. Everyone’s goal is a data warehousing solution committed to keeping the data as clean as possible.

In the remainder of this article, I will present a logical approach that you can use to tackle back-dated scenarios.

A conventional Type 2 slowly changing dimension is modeled on the educated assumption that all changes coming from the source occur going forward. The following diagram depicts the conventional design of a Type 2 dimension; the example below illustrates the scenario.

Employee XYZ resides in California, and was hired on 1/1/2010.

Exhibit 1

On 5/1/2013, we were notified by the source system that this employee had moved to Texas. The source table src_Employee then looks as follows:

Exhibit 2

In a conventional Type 2 slowly changing dimension, Dim_Employee, the following records are created.

Exhibit 3

On 7/1/2013, we are informed by the user that this employee had relocated to Texas on 3/1/2013. In a scenario in which ‘location’ is the only column that triggers historical employee records, the start and end dates in ‘DIM_Employee’ could simply be updated for the affected records (a hedged sketch follows Exhibit 4).

Exhibit 4
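For this single-trigger-column case, a hedged sketch of the date adjustment could look like the following. It assumes Dim_Employee carries Emp_No and Location columns as in the exhibits and that the end-date convention closes a version one day before the next one starts:

-- Close out the California version one day before the true move date
UPDATE Dim_Employee
   SET eff_end_date = DATE '2013-02-28'
 WHERE Emp_No = 'XYZ'
   AND Location = 'California';

-- Pull the Texas version back to the true move date
UPDATE Dim_Employee
   SET eff_start_date = DATE '2013-03-01'
 WHERE Emp_No = 'XYZ'
   AND Location = 'Texas';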

However, in a real-life scenario, there will be multiple columns that trigger historical records for a dimension. Continuing the example, let us consider Position to be another critical attribute of Employee.

Employee was hired on 1/1/2010 as a Software Engineer. He resides in California.

Exhibit 5

On 4/1/2013, he is promoted to the position of Senior Software Engineer.

Exhibit 6

On 5/1/2013, we get a trigger from the source system that this employee has relocated to Texas.

Exhibit 7

In the conventional type 2 slowly changing dimension, the following records are created for the employee.

Exhibit 8

On 7/1/2013, we are informed by the user that this employee had relocated to Texas on 3/1/2013. In this scenario, we cannot simply update the effective dates of each record, because the position change on 4/1/2013 has already split the employee’s history into separate versions.

Assumption: There is a source table ‘src_Employee_location’ that maintains the start date, end date, and corresponding location for each employee. The approach then consists of the following steps:

Exhibit 9

  • Design a typical Type 2 dimension for Employee. The image of Dim_Employee after this step will be as seen in Exhibit 3.
  • Identify all records in src_Employee_location that have changed (based on Update_Date).
  • Join the changed src_Employee_location records to DIM_Employee on the condition: Emp_No = Emp_No and (Start Date between eff_start_date and eff_end_date or End Date between eff_start_date and eff_end_date). The condition should also catch dimension records whose entire date range falls inside the source record’s date range (Scenario 1 below). A hedged SQL sketch of this step follows.
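The sketch below illustrates the identification step. The column names Emp_Key (dimension surrogate key), Start_Date, End_Date and Update_Date, as well as the :last_etl_run_date parameter, are assumptions based on the exhibits rather than actual object names:

SELECT s.Emp_No,
       s.Location   AS src_location,
       s.Start_Date AS src_start_date,
       s.End_Date   AS src_end_date,
       d.Emp_Key,
       d.eff_start_date,
       d.eff_end_date
  FROM src_Employee_location s
  JOIN DIM_Employee d
    ON s.Emp_No = d.Emp_No
   AND (   s.Start_Date BETWEEN d.eff_start_date AND d.eff_end_date        -- source starts inside a version
        OR s.End_Date   BETWEEN d.eff_start_date AND d.eff_end_date        -- source ends inside a version
        OR (s.Start_Date < d.eff_start_date AND s.End_Date > d.eff_end_date))  -- source encloses a version (Scenario 1)
 WHERE s.Update_Date > :last_etl_run_date;  -- only records changed since the last load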

After this join operation, the output looks as follows:

Exhibit 10
Scenario 1

Start Date (source) < eff_start_date (target) and End Date (source) > eff_end_date (target)

Description- The date range of the target record lies entirely within the date range of the source record.

Resolution- Update the existing record with the source location.

Scenario 2

Start Date (source) between eff_start_date (target) and eff_end_date (target) and End Date (source) between eff_start_date (target) and eff_end_date (target)

Description- The date range of the source record lies entirely within the date range of the target record.

Resolution- Insert a new record that takes the location from the source record and the remaining data from the target record; record start date = Start Date (source), record end date = End Date (source).

Scenario 3

End Date (source) >= eff_end_date (target) and Start Date (source) > eff_start_date (target)

Description- Overlap of source and target records in which the source start date falls between the target start and end dates.

Resolution- Insert a new record that takes the location from the source record and the remaining data from the target record; record start date = Start Date (source), record end date = eff_end_date (target).

Scenario 4

End Date (source) < eff_end_date (target) and Start Date (source) <= eff_start_date (target)

Description- Overlap of source and target records in which the source end date falls between the target start and end dates.

Resolution- Update the existing record, keeping all data from the target record; the record start date remains the same and the record end date = End Date (source).
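Taken together, the four resolutions can be sketched as the pseudo-SQL below. This is only an illustration of the rules stated above: it assumes the join output has been staged in a hypothetical work table stg_backdated_changes (one row per source change and dimension version pair), that Emp_Key is the dimension surrogate key, and that the platform supports UPDATE ... FROM. Surrogate key assignment, adjustment of adjacent versions, and the fact re-keying discussed earlier still need to be layered on top.

-- Scenario 1: target version lies entirely inside the source range
UPDATE DIM_Employee d
   SET Location = s.src_location
  FROM stg_backdated_changes s
 WHERE d.Emp_Key = s.Emp_Key
   AND s.src_start_date < d.eff_start_date
   AND s.src_end_date   > d.eff_end_date;

-- Scenario 2: source range lies entirely inside the target version
INSERT INTO DIM_Employee (Emp_No, Location, Position, eff_start_date, eff_end_date)
SELECT d.Emp_No, s.src_location, d.Position, s.src_start_date, s.src_end_date
  FROM stg_backdated_changes s
  JOIN DIM_Employee d ON d.Emp_Key = s.Emp_Key
 WHERE s.src_start_date BETWEEN d.eff_start_date AND d.eff_end_date
   AND s.src_end_date   BETWEEN d.eff_start_date AND d.eff_end_date;

-- Scenario 3: source starts inside the target version and ends on or after it
INSERT INTO DIM_Employee (Emp_No, Location, Position, eff_start_date, eff_end_date)
SELECT d.Emp_No, s.src_location, d.Position, s.src_start_date, d.eff_end_date
  FROM stg_backdated_changes s
  JOIN DIM_Employee d ON d.Emp_Key = s.Emp_Key
 WHERE s.src_end_date   >= d.eff_end_date
   AND s.src_start_date >  d.eff_start_date;

-- Scenario 4: source ends inside the target version and starts on or before it
UPDATE DIM_Employee d
   SET eff_end_date = s.src_end_date
  FROM stg_backdated_changes s
 WHERE d.Emp_Key = s.Emp_Key
   AND s.src_end_date   <  d.eff_end_date
   AND s.src_start_date <= d.eff_start_date;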

The data in Dim_Employee after the above pseudo code has been applied looks as follows:

Exhibit 11


Enabling execution of Cartesian Joins in MicroStrategy

Recent years have seen the emergence of self-service and data visualization vendors such as Tableau, Power BI and Qlik Sense, which has forced enterprise BI players such as MicroStrategy to improve their tools to compete. I have been a solutions architect in the BI field using MicroStrategy for more than 10 years. From my perspective as an architect, I appreciate the power of the underlying SQL engine in MicroStrategy and do not foresee self-service vendors replacing MicroStrategy as an enterprise BI tool in the coming years. What thrills me most as a Solutions Architect in MicroStrategy is implementing workaround solutions to meet complex reporting requirements; in other words, building additional logic to change the default behavior of the MicroStrategy SQL engine. One example that comes to mind is the effective use of Cartesian joins of dimension tables in certain advanced reporting scenarios. Conventional best practice recommends against using Cartesian joins in MicroStrategy; however, there are certain business cases where their execution is required. In this blog, I illustrate the use of Cartesian joins of dimension tables using two scenarios.

Scenario 1- Accurate analysis across the date dimension without the need for fictitious/dummy data

In this scenario, I used a public data set of construction spending data across the educational and religious sectors for the year 2016. I massaged this data set further by deleting the construction spending data for the religious sector for the month of May 2016. Shown below are simple bar chart visualizations of this scenario.

[Bar chart: Construction Spending - Educational]

[Bar chart: Construction Spending - Religious]

As seen in this graph, since there is no data in the religious sector for May 2016, the bars are more spread apart in comparison with the educational sector data. Moreover, reference line calculations such as mean and median ignore the fact that there was no construction spending for the religious sector in May 2016, thereby misleading the audience. This issue becomes more evident if the bar chart is converted into a line graph.

[Line graph: Construction Spending - Religious]

Using this graph, we may wrongly conclude that construction spending was consistently high during the middle of the year and slowed down at the start and end of the year.

This issue can be resolved by entering dummy data in the fact table to ensure all months appear across any other dimension during time series analysis. Nevertheless, not all organizations provide the flexibility to manually enter fictitious data in the data warehouse to satisfy certain reporting scenarios. In such cases, we may have to implement workaround solutions within the reporting layer to handle these complex issues.

In order to resolve this issue in MicroStrategy, I set the VLDB property ‘Cartesian Join Warning’ to ‘Execute’. I also created an additional dummy metric that serves the same purpose as introducing dummy data within the data warehouse. The SQL pass generated for this metric is shown below.

create temporary table ZZMD01 as
select      a11.month_key  month_key,
            a12.sector_key  cat_key,
            max(a11.month_key)  WJXBFS1
from        d_month  a11
cross join  d_sector  a12
where       a11.month_key between 201601 and 201612
group by    a11.month_key,
            a12.sector_key

This SQL pass is basically a Cartesian join between the Month and Sector dimensions, with an additional filter restricting the data to 2016. It thereby ensures there is a row for every month of 2016 for every sector, irrespective of the actual construction spending data.
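To make the effect of this pass concrete, here is a simplified, hand-written illustration rather than the SQL the MicroStrategy engine actually generates for the final pass. It assumes a fact table named f_construction_spend with a spend_amt measure, and shows how the month-by-sector scaffold can be outer-joined to the aggregated fact so that a month with no spending still appears:

-- Month x Sector scaffold left-joined to the aggregated fact; months with
-- no spending (e.g. May 2016 for the religious sector) still show up.
SELECT scaffold.month_key,
       scaffold.sector_key,
       f.spend_amt
  FROM (SELECT m.month_key, s.sector_key
          FROM d_month m
         CROSS JOIN d_sector s
         WHERE m.month_key BETWEEN 201601 AND 201612) scaffold
  LEFT JOIN (SELECT month_key, sector_key, SUM(spend_amt) AS spend_amt
               FROM f_construction_spend
              GROUP BY month_key, sector_key) f
    ON f.month_key  = scaffold.month_key
   AND f.sector_key = scaffold.sector_key
 ORDER BY scaffold.sector_key, scaffold.month_key;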

The revised bar chart and line chart with the introduction of this metric are shown below.

[Revised bar chart: Construction Spending - Religious]

[Revised line graph: Construction Spending - Religious]

As seen in the graphs, the revised average line considers the month of May 2016, dropping the average from 311 to 285.

Scenario 2- Month End Reporting

For this scenario, let us consider a hypothetical example of sales commissions data. A salesman ‘TOM’ belongs to the ‘CENTRAL’ region. On Jan 15, 2017, he moved from the ‘CENTRAL’ region to the ‘SOUTH EAST’ region. The dimension data corresponding to Tom is shown below.

Sales Rep Key | Sales Rep ID | Sales Rep | Region     | Effective Start Date | Effective End Date
100           | 1            | TOM       | CENTRAL    | 1/1/2016             | 1/15/2017
200           | 1            | TOM       | SOUTH EAST | 1/16/2017            | 12/31/9999

 

As seen above, Type 2 strategy has been used to maintain history for the ‘Sales Rep’ dimension.

The Commissions fact data for this sales rep is shown below.

Sales Rep Key | Month  | Commission Amt
100           | Jan-17 | 2000
200           | Jan-17 | 1000

 

A simple report showing the distribution of commissions across regions splits Tom’s January 2017 commissions between the two regions: 2000 to CENTRAL and 1000 to SOUTH EAST.

Let us make this scenario a bit more complicated. What if the requirement for the month-end report is to allocate all commissions for each month to the region where the sales rep was assigned at the end of the month? To satisfy this requirement, I had to implement a workaround solution in which execution of Cartesian joins was enabled. In this solution, the month-end region had to be created as a metric, and an alias table for the Type 2 dimension was also created. The generated SQL is shown below.

create temporary table ZZMD01 as
select      a11.sls_rep_id  sls_rep_id,
            a12.month_key  month_key,
            max(a11.sls_rep_nm)  WJXBFS1,
            max(a11.sls_rep_region)  WJXBFS2
from        d_salesrep  a11
cross join  d_month  a12
where       (((a11.sls_rep_key)
in          (select      r12.sls_rep_key
             from        d_month  r11
             cross join  d_salesrep  r12
             where       r11.month_end_date between r12.effective_start_dttm and r12.effective_end_dttm
             group by    r12.sls_rep_key))
and         a12.month_key = 201701)
group by    sls_rep_id,
            a12.month_key

 

This SQL pass is basically a Cartesian join between the Month and Sales Rep dimensions, identifying the sales rep versions that are active as of a month-end date, with the result restricted here to the month 201701.
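As a simplified illustration (again hand-written, not the engine-generated SQL), the pass above can be combined with the commissions fact as follows. A fact table named f_commissions with sls_rep_key, month_key and commission_amt columns is assumed; WJXBFS2 is the month-end region produced by ZZMD01:

-- Aggregate commissions per sales rep ID and month, then pick up the
-- month-end region from ZZMD01 so Tom's full 3000 for Jan 2017 is
-- allocated to SOUTH EAST.
SELECT me.WJXBFS2            AS month_end_region,
       SUM(c.commission_amt) AS commission_amt
  FROM (SELECT d.sls_rep_id,
               f.month_key,
               SUM(f.commission_amt) AS commission_amt
          FROM f_commissions f
          JOIN d_salesrep d
            ON f.sls_rep_key = d.sls_rep_key
         GROUP BY d.sls_rep_id, f.month_key) c
  JOIN ZZMD01 me
    ON me.sls_rep_id = c.sls_rep_id
   AND me.month_key  = c.month_key
 GROUP BY me.WJXBFS2;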

With this metric in place, the revised report allocates Tom’s entire January 2017 commission of 3000 to the SOUTH EAST region.

As seen in both scenarios, executing Cartesian joins can prove very useful for meeting complex reporting requirements. What was used in both scenarios was essentially a non-equi join with additional where clauses. It is recommended, though, that execution of Cartesian joins be enabled only for certain complex reports. Enabling this setting at the overall project level can lead to incorrect results and performance degradation, especially when Cartesian joins are executed without any filters.

Self-Service BI – A Reality Check

Business Intelligence as a field has matured rapidly over time. We live in an era when we constantly hear buzzwords like Big Data, Prescriptive Analytics and Data Science. In the midst of these catchy phrases, the one that still stands out for me is Self-Service BI. Self-Service BI is an approach that enables business users to access and work with corporate data even if they do not have a background in Business Intelligence. It is primarily intended to reduce the dependency of business users on IT for creating their reports. In the conventional reporting paradigm, a business user provided report requirements to IT, IT created the reports, and the business user then used them for analysis. The average turnaround time for this process is at least 2-3 weeks. With self-service, the previously frustrated user can answer the same business question within a day. Though Self-Service BI has been relevant over the past two decades, it has gained more prominence in this era with the availability of powerful visualization tools such as Tableau and MicroStrategy that empower business users to explore, visualize and analyze data by themselves.

Self-Service caught my attention during a recent project with a reputed software firm, and it transformed my thought process as a BI Consultant. Prior to this project, my understanding of Self-Service BI was limited to a narrow scope in which users would use independent data sources and silos for deriving analytics. The project’s primary requirement was to create an enterprise-level ad-hoc reporting environment for users to create their own reports and visualizations. During the course of this project, I realized the value of enabling self-service in increasing user adoption. Earlier, I considered BI to be a niche skillset and assumed that business users would always be dependent on me. Now, I look for ways to make the reporting environment more user friendly and ask myself: can users answer these business questions themselves? Though there is no one-size-fits-all solution for self-service, here are a few steps that I would recommend to improve self-service in an organization.

  • Formulate our own requirements: It is naïve to expect business users to provide detailed Self-Service reporting requirements. We must be smart enough to formulate our own requirements and seek clarification from the business. This is very significant in an enterprise reporting environment, where a simple business question can have multiple interpretations. Consider a very simple example: an employee ‘Tom’ works for a product company ‘ABC’. During the year 2015, he sold products worth 30,000 dollars and worked across two departments. When a user creates a report consisting of Employee, Department and Sales for the year 2015, does he want to see one row for Tom with total sales of 30,000 dollars and his current department, or multiple rows for Tom showing the split of sales across departments? Addressing such nuances by seeking input from users during the initial stages of a project generates interest among the user community and drives them to provide additional feedback for other interesting scenarios.
  • Wear the hat of the Business User: As BI Consultants, we should resist the temptation to use the more SQL-friendly interfaces and interact with the same reporting interfaces used by user groups to answer business questions. This helps us understand better how users interact with reporting interfaces and what the limitations/flaws of these interfaces are.
  • User Friendly Navigation: As all of us know, the user traffic for a website depends on how easy it is to navigate within the website. Similarly, organizing the metadata of an Enterprise Reporting Solution into a user-friendly folder structure is critical towards the success of Self-Service BI. The naming convention of objects should be intuitive enough for users to navigate. It is also important to align metadata more with business terms that users are familiar with. Users should be able to navigate to objects with minimal supervision. The reporting environment also needs to have adequate search capabilities for users to navigate to objects they are looking for.
  • Microsoft Excel is the tool to beat: Microsoft Excel still happens to be a business user’s favorite tool. No matter which advanced reporting technology is used in the organization, the majority of users still end up downloading the data into Excel for further analysis. We should challenge ourselves to remove the need for users to download data into Excel for further analysis.
  • Performance: Performance is critical for enabling self-service in an organization. The data model may be perfect and optimized for self-service, but if reports take a long time to execute, users are going to revert to their standalone data silos for analysis. It is very important for the BI team to optimize the system for better performance.
  • 80-20 rule: Let us accept the fact that self-service makes an ordinary business user feel very powerful. He may start asking several interesting questions about the data. Any BI solution will have limitations and will not answer every interesting question from users without IT intervention. As BI Consultants, we need to come to terms with this reality and ensure the reporting environment can meet 80% of all possible business requirements. For the remaining 20% of complex BI requirements, it is imperative to provide pre-defined, canned solutions.
  • BI Enablement Team: It is important for an organization to have a BI Enablement Team to increase user adoption. This team helps users become more familiar with reporting technologies by conducting user training sessions and workshops. From my personal experience, the critical phase for any project is the post go-live phase. During this phase, enthusiastic business users will be exploring their new playground and will have a lot of questions. Not having a dedicated team to work with users during this phase is a recipe for failure. In addition, this team can send out communications and notifications to business users about new functionality and dashboards available in the reporting environment.
  • Collaboration & IT Governance: The purpose of Self-Service BI is to improve the efficiency and productivity of business users by empowering them with a platform to answer questions by themselves. However, an organization’s overall productivity and efficiency gets impacted if different user groups spend time in answering the same business questions. Collaboration is essential to encourage multiple user groups to work together when they are trying to answer the same questions. Collaboration within an organization can be improved by IT Governance. IT needs to constantly monitor usage in a reporting environment, identifying candidate reports created by user groups that can be deployed to a shared environment accessible to multiple user groups. Prior to the deployment of these reports, IT can also conduct testing to assess the quality of these reports. On the other hand, business users should be trained to search for objects within a shared environment before they start creating them.

As part of IT Governance, it is also important to identify long-running ad-hoc reporting requests and opportunities to improve their performance. Though performance is crucial for enabling Self-Service BI, it is recommended to assign ad-hoc reports a lower priority than standard, canned reports.

Often, BI consultants who design enterprise reporting solutions have the right intentions to enable self-service; however, this takes a backseat amid the constant pressure and stringent deadlines they face in completing deliverables. Enabling self-service in an organization is therefore a gradual process that requires constant monitoring by IT to identify scope for improvement. Nevertheless, Self-Service BI will be a futile attempt without dedicated participation from a user community interested in interacting with the reporting environment. Business users should resist the temptation to fall back on Microsoft Excel and legacy data sources, be more open to change, and train themselves in using enterprise reporting solutions. As the saying goes, it takes two hands to clap: IT and business need to work together to overcome the lack-of-trust barrier and make Self-Service BI a success in an organization.

 

Netezza Zone Maps in a normalized data model

Introduction

Netezza is a data warehousing appliance that uses an Asymmetric Massively Parallel Processing (AMPP) architecture. The Netezza architecture is driven by two fundamental principles: process close to the data source, and do not move data sets unless absolutely necessary. The latter principle is implemented primarily through commodity Field Programmable Gate Arrays (FPGAs). The FPGA plays the pivotal role of filtering out data as early as possible, removing I/O bottlenecks and freeing up valuable downstream components such as memory and processors. Zone Maps are internal data structures in Netezza that enable FPGAs to filter out data. In simple terms, a Zone Map is a persistent table maintained by Netezza that contains information about user tables: it stores the minimum and maximum values of columns for the pages that hold a table’s data. Unlike traditional data structures such as indexes, which help determine where exactly to look for data, Zone Maps help determine where not to look for data. I find this concept fascinating and analogous to a strategy for answering a popular question all of us have faced at some point: ‘What do you want to be when you grow up?’ I believe the easiest way to start answering that question is to answer ‘What do you not want to be?’

This article addresses two questions that I was curious about:

  • How can I know if Zone Maps are used in my query?
  • Do I need a completely de-normalized data model for effectively using Zone Maps?

Utilization of Zone Maps in a normalized data model

In order to answer these questions, I considered a simple data model consisting of a fact table (FACT_AR_LEDGER) and a dimension table (INVOICE). The primary key of the dimension table (INVOICE_KEY) is present in the fact table. BILLING_CUSTOMER_KEY and INVOICE_STATUS are columns present in the dimension table. These columns have not been propagated to the fact table.
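For reference, a minimal sketch of the two test tables is shown below. The column lists, data types and distribution keys are assumptions for illustration, not the actual production DDL:

-- Dimension: one row per invoice
CREATE TABLE INVOICE
(
    INVOICE_KEY          BIGINT NOT NULL,
    BILLING_CUSTOMER_KEY BIGINT,
    INVOICE_STATUS       VARCHAR(20)
)
DISTRIBUTE ON (INVOICE_KEY);

-- Fact: ledger entries carrying only the INVOICE_KEY from the dimension
CREATE TABLE FACT_AR_LEDGER
(
    INVOICE_KEY BIGINT NOT NULL,
    LEDGER_DATE DATE,
    LEDGER_AMT  NUMERIC(18,2)
)
DISTRIBUTE ON (INVOICE_KEY);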

Though the effective utilization of Zone Maps is dependent on the optimal ordering of data, data was not ordered in any fashion for the test cases run below:

 Query 1- Retrieve 1000 rows from fact table with where clause on column present in fact table

select INVOICE_STATUS
from   FACT_AR_LEDGER a
join   INVOICE b
on     a.INVOICE_KEY = b.INVOICE_KEY
where  a.INVOICE_KEY = 151854
limit  1000;

 

Query 2- Retrieve 1000 rows from fact table with where clause on column not present in fact table

select INVOICE_STATUS
from   FACT_AR_LEDGER a
join   INVOICE b
on     a.INVOICE_KEY = b.INVOICE_KEY
where  b.BILLING_CUSTOMER_KEY = 6364340
limit  1000;

 

Note: In Query 2, the Billing Customer (BILLING_CUSTOMER_KEY=6364340) is linked to only one Invoice (INVOICE_KEY=151854). Hence, the target record counts for both queries are the same. Both queries have also been limited to the first 1000 records.

A portion of the Explain Plan for Query 1 corresponding to the sequential scan of the fact table is shown below.

 

Node 2.

[SPU Sequential Scan table “FACT_AR_LEDGER” as “A” {}]
— Estimated Rows = 446058, Width = 59, Cost = 0.0 .. 41673.0, Conf = 90.0 [BT: MaxPages=32557 TotalPages=714725] (JIT-Stats)
Restrictions:
(A.INVOICE_KEY = 151854)
Projections:
1:A.INVOICE_STATUS
2:A.INVOICE_KEY
Cardinality:
A.INVOICE_KEY 1 (JIT)

 

As per the Explain Plan, you will notice that the optimizer utilizes Zone Map information in the JIT stats, reflected in a MaxPages count (32557) that is significantly lower than the TotalPages count (714725).

 

A portion of the Explain Plan for Query 2 corresponding to the sequential scan of the fact table is shown below.

Node 2.
[SPU Sequential Scan table “FACT_AR_LEDGER” as “A” {}]
— Estimated Rows = 766644778, Width = 59, Cost = 0.0 .. 174918.1, Conf = 90.0 (FACT)
Projections:
1:A.INVOICE_STATUS
2:A.INVOICE_KEY
Cardinality:
A.INVOICE_KEY 170.3K (JIT)

 

As per the Explain Plan of Query 2, there is no restriction on the pages to be scanned, which may suggest that Zone Maps are not utilized for this query. This aligns with the intuitive logic that when a fact table is scanned, only Zone Maps corresponding to columns present in that table can be used. In addition, the cost of the sequential scan of FACT_AR_LEDGER is lower for Query 1 than for Query 2 because Zone Maps appear in the Explain Plan of Query 1.

Before concluding that Zone Maps were not utilized in Query 2, let us take a moment to compare the Execution Plan Files of both these queries.

 

Plan File – Query 1

— Object “/nz/data.1.0/plans/3388915/s3388915_1.o” from cache “/nz/data.1.0/cache/287/3o”

— Object “/nz/data.1.0/plans/3388915/h3388915_501.o” from cache “/nz/data.1.0/cache/143/0o”

— Snippet 3388915_1 exec 0.006 @ 0% res 0.007 channels 1 hmem 1 mem 16 exp h/s 0%/0%

— Detail  3388915_1 table 33757/33766.09/33776 zmread 2 zmhits 2 #22 hcpu 0.002 cpu 0.003/0.020 dread 0.003 mem 3

/nz/kit.7.2.0.4-P1/sys/cc/bin/i686-mcpnps-linux-gnu-g++ -march=prescott -mfpmath=sse -include inclall_1.h -fPIC -o /nz/data.1.0/plans/3388915/s3388915_2.o -shared -O2 -fno-strict-aliasing -fexceptions -fsigned-char -Wno-invalid-offsetof -DNZDEBUG=0 -DGENCODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -DFOR_SPU -I/nz/kit.7.2.0.4-P1/sys/include -I/nz/data.1.0/plans.include/genpch /nz/data.1.0/plans/3388915/s3388915_2.cpp

— Object “/nz/data.1.0/plans/3388915/s3388915_2.o” compiled dur 0.188 complexity 3733

/nz/kit.7.2.0.4-P1/sys/cc/bin/i686-rhel5-linux-gnu-g++ -march=prescott -mfpmath=sse -include inclall_4.h -O -fno-strict-aliasing -fPIC -shared -Wa,–32 -o /nz/data.1.0/plans/3388915/h3388915_502.o -DNZDEBUG=0 -DGENCODE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -I/nz/kit.7.2.0.4-P1/sys/include -I/nz/data.1.0/plans.include/genpch /nz/data.1.0/plans/3388915/h3388915_502.cpp

— Object “/nz/data.1.0/plans/3388915/h3388915_502.o” compiled dur 0.155 complexity 1798

— Snippet 3388915_2 exec 0.958 @ 100% res 1.588 channels 1 hmem 295 mem 17 estrows 2 exp h/s 6%/33% priority

— Detail  3388915_2 table 33757/33766.09/33776 scan 32402/32487.50/32557 hits 6/6.95/8 zmread 3 zmhits 3 #22 hcpu 0.350 cpu 0.018/0.040 dread 1.220/1.333 fpga 0.593/0.885 mem 6 temp 1

—————-

As shown in the plan file, there is a reduction in the scan counts (32402/32487.50/32557) in comparison with the table counts (33757/33766.09/33776), which supports our preliminary analysis that Zone Maps were used during the execution of this query.

Plan File – Query 2

— Object “/nz/data.1.0/plans/3580088/s3580088_1.o” from cache “/nz/data.1.0/cache/287/3o”

— Object “/nz/data.1.0/plans/3580088/h3580088_501.o” from cache “/nz/data.1.0/cache/143/0o”

— Snippet 3580088_1 exec 0.003 @ 0% res 0.001 channels 1 hmem 1 mem 16 exp h/s 0%/0%

— Detail  3580088_1 table 33757/33766.09/33776 #22 hcpu 0.001 mem 3

— Object “/nz/data.1.0/plans/3580088/s3580088_2.o” from cache “/nz/data.1.0/cache/137/6bo”

— Object “/nz/data.1.0/plans/3580088/h3580088_502.o” from cache “/nz/data.1.0/cache/52/4ao”

— Snippet 3580088_2 exec 0.950 @ 100% res 1.271 channels 1 hmem 295 mem 17 estrows 1000 exp h/s 6%/33% priority

— Detail  3580088_2 table 33757/33766.09/33776 scan 32402/32487.50/32557 zmread 1 zmhits 1 #22 hcpu 0.002 cpu 0.057/0.080 dread 1.212/1.323 fpga 0.668/0.992 mem 6 temp 1

 

As per the plan file, there is again a reduction in the scan counts (32402/32487.50/32557) in comparison with the table counts (33757/33766.09/33776), which negates our preliminary conclusion that Zone Maps were not used during the execution of Query 2. Netezza was smart enough to identify the INVOICE_KEY values corresponding to the filter (BILLING_CUSTOMER_KEY=6364340) on the dimension table (INVOICE) and to utilize this information via Zone Maps on the fact table (FACT_AR_LEDGER).

Summary

  • Explain Plan information alone is not enough to conclude whether Zone Maps are used in the execution of a query. The statistics obtained from the corresponding plan files give us a better understanding of whether Zone Maps were actually used.
  • Utilization of Zone Maps does not require the columns applied as restrictions to be present in fact tables. In other words, Netezza does not require all keys to be propagated as foreign keys to fact tables for zone maps to be utilized.
  • Netezza uses a cost-based optimizer. The cost of a sequential scan of a fact table is significantly lower when the query is restricted on a column present in the fact table than when it is restricted on a dimension column not present in the fact table. Query cost determines the resources required for query execution and therefore plays an important role in Workload Management. Hence, if there are dimension keys that do not change over time and are candidate columns for restrictions/filters, consider adding these keys as columns in fact tables (a hedged sketch follows this list).
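As a hedged sketch of that last recommendation, the stable dimension key could be propagated into the fact table so that filters on it can use the fact table’s own zone maps. The syntax below is illustrative; in practice the column would normally be added and populated by the ETL load rather than an ad hoc ALTER and UPDATE:

-- Propagate the stable dimension key into the fact table
ALTER TABLE FACT_AR_LEDGER ADD COLUMN BILLING_CUSTOMER_KEY BIGINT;

UPDATE FACT_AR_LEDGER
   SET BILLING_CUSTOMER_KEY = i.BILLING_CUSTOMER_KEY
  FROM INVOICE i
 WHERE FACT_AR_LEDGER.INVOICE_KEY = i.INVOICE_KEY;

-- The restriction can now be applied directly to the fact table,
-- allowing its zone maps on BILLING_CUSTOMER_KEY to prune pages.
SELECT i.INVOICE_STATUS
  FROM FACT_AR_LEDGER f
  JOIN INVOICE i
    ON f.INVOICE_KEY = i.INVOICE_KEY
 WHERE f.BILLING_CUSTOMER_KEY = 6364340
 LIMIT 1000;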