May 22, 2011

Types of Dimensions

Conformed dimension - Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other. Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension

Junk dimension - A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating an abstract dimension, these flags and indicators are removed from the fact table while placing them into a useful dimensional framework.
 A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. The nature of these attributes is usually text or various flags, e.g. non-generic comments or just simple yes/no or true/false indicators. These kinds of attributes are typically remaining when all the obvious dimensions in the business process have been identified and thus the designer is faced with the challenge of where to put these attributes that do not belong in the other dimensions.

Degenerate dimension - A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent. Degenerate dimensions often play an integral role in the fact table's primary

Role-playing dimensions - Dimensions are often recycled for multiple applications within the same database. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Date of Hire". This is often referred to as a "role-playing dimension".

Types of Fact tables

There are basically three fundamental measurement events, which characterizes all fact tables.

Transactional - A transactional table is the most basic and fundamental. The grain associated with a transactional fact table is usually specified as "one row per line in a transaction", e.g., every line on a receipt. Typically a transactional fact table holds data of the most detailed level, causing it to have a great number of dimensions associated with it.

Periodic snapshots - The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.

Accumulating snapshots - This type of fact table is used to show the activity of a process that has a well defined beginning and end, e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it's important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.

Data Integrity Problems

·        Same person, different spellings
Agarwal, Agrawal, Aggarwal etc...
·        Multiple ways to denote company name
Persistent Systems, PSPL, Persistent Pvt. LTD.
·        Use of different names
mumbai, bombay
·        Different account numbers generated by different applications for the same customer
·        Required fields left blank
·        Invalid product codes collected at point of sale
manual entry leads to mistakes
“in case of a problem use 9999999”

Why Separate Data Warehouse?

Performance
·        Operational database designed & tuned for known transaction workloads.
·        Complex OLAP queries would degrade performance of transaction.
·        Special data organization, access & implementation methods needed for multidimensional views & queries.
Function
·        Missing data:  Decision support requires historical data, which Operational database do not typically maintain.
·        Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources:  Operational database, external sources.
·        Data quality:  Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

Reasons for creating a Data mart

        To give users access to the data they need to analyze most often
        To provide data in a form that matches the collective view of the data by a group of users in a department or business function
        To improve end-user response time due to the reduction in the volume of data to be accessed
        To provide appropriately structured data as ditated by the requirements of end-user access tools
        Normally use less data so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse
        The cost of implementing data marts is normally less than that required to establish a data warehouse
        The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project

OLTP vs Data warehouseing


OLTP systems 
Data warehousing systems 
Hold current data
Stores detailed data
Data is dynamic
Repetitive processing
High level of transaction throughput
Predictable pattern of usage
Transaction-driven
Application-orented
Supports day-to-day decisions
Serves large number of clerical/operation users
Holds historical data
Stores detailed and highly summarized data
Data is largely static
Ad hoc, unstructured, and heuristic processing
Medium to how level of transaction throughput
Unpredictable pattern of usage
Analysis driven
Subject-oriented
supports strategic decisions
Serves relatively how number of managerial users

What is a Universe

What is a Universe
q  A universe is a semantic layer that maps the data structure found in databases (tables, columns, etc) in business terms.
q  A universe can represent any specific business area, process, or department. For example, a universe can relate to a department in a company such as marketing or accounting.
Universe Layout...

Class
q  A class is a logical grouping of objects within a universe. In general, the name of a class reflects a business concept that conveys the category or type of objects.
q  A class can be further divided into subclasses.
q  As a designer, you are free to define hierarchies of classes and subclasses into a model that best reflects the business concepts of your company.
Objects
q  An object is the most refined component in a universe. It maps to data or a derivation of data in the database.
q  Types of objects -
Ø  Dimension Objects -
        Focus of analysis in a query.
        Maps to one or more columns in a database that are key to a query.
Ø  Measure Objects -
        Aggregate functions that map to statistics in a database.
        Map to “fact attributes” of a star schema.
Ø  Detail Objects -
        Descriptive data about the dimensions.
        Always attached to a dimension.
Universe development process
q  User Requirements
Ø  Detailed analysis of the information needs of the users.
q  Data Analysis
Ø  Detailed data analysis of the database breaking it down into functional areas that users can relate to.
q  Design
Ø  Information from first 2 phases must be amalgamated to create the conceptual design of the Universe, its objects and other components.
q  Development and distribution
Ø  Create using the Designer module.
Ø  Test using the User module.
Ø  Distribute to the users via the repository.
q  Maintenance
Ø  Designer is responsible for updating and maintaining Universe and keeping it up to date with any changes in the User Requirements.  

Universe Creation:

Normalization


Normalization

The following are the characteristic features of Normalized tables.
·        No data should be duplicated in different rows unnecessarily.
·        The intersection of every row and every column should contain some entry.
·        If a row is added to a table, then other tables in the database must be unaffected.
·        If a row is deleted from a table, then important information should not be lost.
·        Any row of a table can be updated independent of other rows in the table.


First Normal Form (1 NF)

·        A table is said to be in First normal form provided the intersection of every row and column contain only atomic values.


Issues with 1 NF

·        Insertion:A subset of a tuple can’t be inserted.
·        Deletion: information may be lost in the process.
·        Updation:May have to be repeated for the same piece of information.


Second Normal Form (2 NF)

·        A Table is said to be in second normal form provided it is in first normal form and every non-key column is fully functionally dependent on candidate key.
·        In order to reduce a table to 2 NF create a separate table containing the column that is partially dependent on the candidate key and that subset of candidate key on which it depends.


Problem with 2 NF

·        Since a table in 2 NF may have transitive dependencies, it may contain two or more non-key columns that are functionally dependent. This will result in unnecessary duplication of data.


Third Normal Form (3 NF)

·        A Table is said to be in third normal form provided it is in second normal form and every non-key column is non-transitively dependent on candidate key.
·        In order to reduce a table to 3 NF create a separate table containing the columns that are functionally dependent.


Limitations of Normalization

·        Full normalization may not be always desirable.
·        Use your Experience.

May 14, 2011

What is Datawarehouse ?

        Data warehouse is subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.
        a data warehouse is data management and data analysis
        goal: is to integrate enterprise wide corporate data into a single reository from which users can easily run queries
Subject-orientedàDWH is organized around the major subjects of the enterprise..rather than the major application areas.. This is reflected in the need to store decision-support data rather than application-oriented data
IntegratedàBecause the source data come together from different enterprise-wide applications systems. The source data is often inconsistent using..The integrated data source must be made consistent to present a unified view of the data to the users
Time-variantàthe source data in the DWH is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots
Non-volatileàdata is not update in real time but is refresh from operational system on a regular basis.

Some Overlapping Terminology in Data warehouse


 

May 7, 2011

How do I eliminate the duplicate rows ?

SQL> delete from table_name where rowid not in (select max(rowid) from table group by duplicate_values_field_name);

Example.

Table: Emp

Empno Ename
101     Ram
102     Tiger
103     Scott
104     Bikram
105     Scott

SQL:delete ename from emp a where rowid < ( select max(rowid) from emp b where a.ename = b.ename);

The output like,

Empno Ename
101     Ram
102     Tiger
103     Scott
104     Bikram

Note: Please read the Co-related sub query Concepts.