May 22, 2011

Types of Dimensions

Conformed dimension - Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other; that is, conformed dimensions are either identical to, or strict subsets of, the most granular, detailed dimension.

Junk dimension - A junk dimension is a convenient grouping of typically low-cardinality flags and indicators. By creating this abstract dimension, the flags and indicators are removed from the fact table and placed into a useful dimensional framework. A junk dimension is thus a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. These attributes are usually text or various flags, e.g. non-generic comments or simple yes/no or true/false indicators. Such attributes are typically what remains after all the obvious dimensions in the business process have been identified, leaving the designer with the challenge of where to put attributes that do not belong in the other dimensions.
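As a sketch (the table and column names here are illustrative, not from any particular schema), a junk dimension might collapse a handful of order-level flags into one small table:

    -- Hypothetical junk dimension: one row per observed combination of flags
    CREATE TABLE order_flags_dim (
        order_flags_key   NUMBER PRIMARY KEY,   -- surrogate key referenced by the fact table
        is_gift_wrapped   CHAR(1),              -- 'Y'/'N'
        is_expedited      CHAR(1),              -- 'Y'/'N'
        payment_indicator VARCHAR2(10)          -- e.g. 'CASH', 'CARD'
    );

The fact table then carries a single order_flags_key instead of three separate low-cardinality columns.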

Degenerate dimension - A dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction item or line item, because the degenerate dimension represents the unique identifier of the parent. Degenerate dimensions often play an integral role in the fact table's primary key.
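For instance (a hypothetical schema), a line-item fact might keep the invoice number directly on the fact row, with no invoice dimension table to join to:

    -- invoice_number is a degenerate dimension: stored on the fact itself
    CREATE TABLE sales_fact (
        date_key       NUMBER NOT NULL,         -- joins to date_dim
        product_key    NUMBER NOT NULL,         -- joins to product_dim
        invoice_number VARCHAR2(20) NOT NULL,   -- identifier of the parent transaction
        line_number    NUMBER NOT NULL,
        quantity       NUMBER,
        amount         NUMBER(12,2),
        CONSTRAINT sales_fact_pk PRIMARY KEY (invoice_number, line_number)
    );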

Role-playing dimensions - Dimensions are often recycled for multiple applications within the same database. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Date of Hire". This is often referred to as a "role-playing dimension".
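In SQL this is usually done with table aliases: the one physical date_dim is joined twice, once per role (a sketch; sales_fact and its column names are assumptions):

    -- The same date_dim plays two roles via aliases
    SELECT sale_dt.calendar_month     AS sale_month,
           delivery_dt.calendar_month AS delivery_month,
           SUM(f.amount)              AS total_amount
    FROM   sales_fact f
           JOIN date_dim sale_dt     ON f.sale_date_key     = sale_dt.date_key
           JOIN date_dim delivery_dt ON f.delivery_date_key = delivery_dt.date_key
    GROUP BY sale_dt.calendar_month, delivery_dt.calendar_month;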

Types of Fact tables

There are three fundamental measurement events that characterize all fact tables.

Transactional - A transactional fact table is the most basic and fundamental. The grain associated with a transactional fact table is usually specified as "one row per line in a transaction", e.g., every line on a receipt. Typically a transactional fact table holds data at the most detailed level, causing it to have a great number of dimensions associated with it.

Periodic snapshots - The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.
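As an illustration (table and column names are hypothetical), such a snapshot can be built by aggregating the transactional fact for the chosen period:

    -- Build one month's rows of a periodic snapshot from the detailed fact
    INSERT INTO monthly_sales_snapshot (month_key, salesman_key, total_quantity, total_amount)
    SELECT d.month_key,
           f.salesman_key,
           SUM(f.quantity),
           SUM(f.amount)
    FROM   sales_fact f
           JOIN date_dim d ON f.date_key = d.date_key
    WHERE  d.calendar_month = '2011-04'        -- the period being "photographed"
    GROUP BY d.month_key, f.salesman_key;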

Accumulating snapshots - This type of fact table is used to show the activity of a process that has a well defined beginning and end, e.g., the processing of an order. An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it's important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.
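A sketch of that update step (hypothetical table and key values; 0 stands for the "unknown date" member here):

    -- One row per order; each milestone has its own date key.
    -- Milestone keys start out pointing at the unknown-date row and are
    -- overwritten as the order moves through the process.
    UPDATE order_fulfillment_fact
    SET    shipped_date_key = 20110520,   -- milestone just reached
           days_to_ship     = 3
    WHERE  order_number     = 'ORD-1001'
    AND    shipped_date_key = 0;          -- still at the unknown-date member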

Data Integrity Problems

·        Same person, different spellings
         Agarwal, Agrawal, Aggarwal etc.
·        Multiple ways to denote a company name
         Persistent Systems, PSPL, Persistent Pvt. LTD.
·        Use of different names for the same city
         Mumbai, Bombay
·        Different account numbers generated by different applications for the same customer
·        Required fields left blank
·        Invalid product codes collected at point of sale
         Manual entry leads to mistakes, e.g. "in case of a problem use 9999999"

Why Separate Data Warehouse?

Performance
·        Operational databases are designed and tuned for known transaction workloads.
·        Complex OLAP queries would degrade the performance of transactions.
·        Special data organization, access, and implementation methods are needed for multidimensional views and queries.
Function
·        Missing data: decision support requires historical data, which operational databases do not typically maintain.
·        Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
·        Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

Reasons for creating a Data mart

        To give users access to the data they need to analyze most often
        To provide data in a form that matches the collective view of the data by a group of users in a department or business function
        To improve end-user response time due to the reduction in the volume of data to be accessed
        To provide appropriately structured data as dictated by the requirements of end-user access tools
        Normally use less data so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse
        The cost of implementing data marts is normally less than that required to establish a data warehouse
        The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project

OLTP vs Data warehousing


OLTP systems                                        Data warehousing systems
------------                                        ------------------------
Hold current data                                   Hold historical data
Store detailed data                                 Store detailed and highly summarized data
Data is dynamic                                     Data is largely static
Repetitive processing                               Ad hoc, unstructured, and heuristic processing
High level of transaction throughput                Medium to low level of transaction throughput
Predictable pattern of usage                        Unpredictable pattern of usage
Transaction-driven                                  Analysis-driven
Application-oriented                                Subject-oriented
Support day-to-day decisions                        Support strategic decisions
Serve large numbers of clerical/operational users   Serve relatively low numbers of managerial users

What is a Universe

·        A universe is a semantic layer that maps the data structures found in databases (tables, columns, etc.) into business terms.
·        A universe can represent any specific business area, process, or department. For example, a universe can relate to a department in a company such as marketing or accounting.

Class
·        A class is a logical grouping of objects within a universe. In general, the name of a class reflects a business concept that conveys the category or type of its objects.
·        A class can be further divided into subclasses.
·        As a designer, you are free to define hierarchies of classes and subclasses into a model that best reflects the business concepts of your company.
Objects
·        An object is the most refined component in a universe. It maps to data, or a derivation of data, in the database.
·        Types of objects -
         Dimension objects -
                 Focus of analysis in a query.
                 Map to one or more columns in a database that are key to a query.
         Measure objects -
                 Aggregate functions that map to statistics in a database.
                 Map to "fact attributes" of a star schema.
         Detail objects -
                 Descriptive data about the dimensions.
                 Always attached to a dimension.
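To make the three object types concrete, here is the kind of SQL a query built from them might generate (a sketch; the table and column names are assumptions):

    SELECT c.region,          -- dimension object: focus of analysis
           c.customer_name,   -- detail object: descriptive, attached to the customer dimension
           SUM(f.revenue)     -- measure object: an aggregate over a fact column
    FROM   customer_dim c
           JOIN sales_fact f ON f.customer_key = c.customer_key
    GROUP BY c.region, c.customer_name;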
Universe development process
·        User Requirements
         Detailed analysis of the information needs of the users.
·        Data Analysis
         Detailed data analysis of the database, breaking it down into functional areas that users can relate to.
·        Design
         Information from the first 2 phases must be amalgamated to create the conceptual design of the universe, its objects, and other components.
·        Development and distribution
         Create using the Designer module.
         Test using the User module.
         Distribute to the users via the repository.
·        Maintenance
         The designer is responsible for maintaining the universe and keeping it up to date with any changes in the user requirements.

Normalization

The following are the characteristic features of Normalized tables.
·        No data should be duplicated in different rows unnecessarily.
·        The intersection of every row and every column should contain some entry.
·        If a row is added to a table, then other tables in the database must be unaffected.
·        If a row is deleted from a table, then important information should not be lost.
·        Any row of a table can be updated independent of other rows in the table.


First Normal Form (1 NF)

·        A table is said to be in first normal form provided the intersection of every row and column contains only atomic values, as in the sketch below.
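For instance (an illustrative schema), a table holding several phone numbers in one column violates 1 NF; the repeating values move into their own rows:

    -- Not in 1 NF: phone_numbers holds a comma-separated list
    --   emp(empno, ename, phone_numbers)   e.g. (101, 'Ram', '98220,98330')
    -- In 1 NF: one atomic value per row/column intersection
    CREATE TABLE emp_phone (
        empno NUMBER,
        phone VARCHAR2(15),
        CONSTRAINT emp_phone_pk PRIMARY KEY (empno, phone)
    );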


Issues with 1 NF

·        Insertion: a subset of a tuple can't be inserted.
·        Deletion: information may be lost in the process.
·        Update: may have to be repeated for the same piece of information.


Second Normal Form (2 NF)

·        A table is said to be in second normal form provided it is in first normal form and every non-key column is fully functionally dependent on the candidate key.
·        To reduce a table to 2 NF, create a separate table containing the columns that are partially dependent on the candidate key, together with the subset of the candidate key on which they depend, as sketched below.
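For example (a hypothetical order-line table with candidate key (order_no, product_no)), product_name depends on product_no alone, so it moves out:

    -- Partial dependency: product_name depends on only part of the key
    --   order_line(order_no, product_no, product_name, quantity)
    -- 2 NF decomposition:
    CREATE TABLE order_line (
        order_no   NUMBER,
        product_no NUMBER,
        quantity   NUMBER,
        CONSTRAINT order_line_pk PRIMARY KEY (order_no, product_no)
    );
    CREATE TABLE product (
        product_no   NUMBER PRIMARY KEY,
        product_name VARCHAR2(50)
    );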


Problem with 2 NF

·        Since a table in 2 NF may still have transitive dependencies, it may contain two or more non-key columns where one is functionally dependent on the other. This results in unnecessary duplication of data.


Third Normal Form (3 NF)

·        A table is said to be in third normal form provided it is in second normal form and every non-key column is non-transitively dependent on the candidate key.
·        To reduce a table to 3 NF, create a separate table containing the columns that are transitively dependent, as sketched below.
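For example (an illustrative schema), if empno determines zip and zip determines city and state, then city and state are transitively dependent on empno and move to their own table:

    -- Transitive dependency: empno -> zip -> (city, state)
    --   emp(empno, ename, zip, city, state)
    -- 3 NF decomposition:
    CREATE TABLE emp_3nf (
        empno NUMBER PRIMARY KEY,
        ename VARCHAR2(30),
        zip   VARCHAR2(10)
    );
    CREATE TABLE zip_area (
        zip   VARCHAR2(10) PRIMARY KEY,
        city  VARCHAR2(30),
        state VARCHAR2(30)
    );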


Limitations of Normalization

·        Full normalization may not always be desirable.
·        Use your experience.

May 14, 2011

What is a Data Warehouse?

        A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
        A data warehouse is about data management and data analysis.
        Goal: to integrate enterprise-wide corporate data into a single repository from which users can easily run queries.
Subject-oriented - The DWH is organized around the major subjects of the enterprise, rather than the major application areas. This is reflected in the need to store decision-support data rather than application-oriented data.
Integrated - The source data comes together from different enterprise-wide application systems. The source data is often inconsistent, and the integrated data source must be made consistent to present a unified view of the data to the users.
Time-variant - The source data in the DWH is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
Non-volatile - Data is not updated in real time but is refreshed from operational systems on a regular basis.

Some Overlapping Terminology in Data Warehousing

May 7, 2011

How do I eliminate duplicate rows?

SQL> delete from table_name where rowid not in (select max(rowid) from table_name group by duplicate_values_field_name);

Example.

Table: Emp

Empno Ename
101     Ram
102     Tiger
103     Scott
104     Bikram
105     Scott

SQL> delete from emp a where rowid > (select min(rowid) from emp b where a.ename = b.ename);

The output looks like:

Empno Ename
101     Ram
102     Tiger
103     Scott
104     Bikram

Note: Please read up on correlated subquery concepts.
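On Oracle 8i and later you can do the same with the ROW_NUMBER() analytic function (a sketch; this variant also keeps the row with the lowest rowid per ename):

    DELETE FROM emp
    WHERE rowid IN (
        SELECT rid
        FROM (SELECT rowid AS rid,
                     ROW_NUMBER() OVER (PARTITION BY ename ORDER BY rowid) AS rn
              FROM emp)
        WHERE rn > 1
    );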

January 16, 2011

RecycleBin in Oracle 10g

One of the many new features that Oracle 10g introduced is the recycle bin. When enabled, this feature works a little like the familiar Windows Recycle Bin or the Mac Trash: dropped tables go into the recycle bin and can be restored from it.

    There are two recycle bin views: USER_RECYCLEBIN and DBA_RECYCLEBIN. For convenience, the synonym RECYCLEBIN points to USER_RECYCLEBIN. The recycle bin is enabled by default in 10g, but you can turn it on or off with the RECYCLEBIN initialization parameter, at the system or session level.

When the recycle bin is enabled, any table that you drop does not actually get deleted. Instead, when you drop a table, Oracle just renames the table and all its associated objects (indexes, triggers, LOB segments, etc.) to a system-generated name that begins with BIN$.

It's important to know that after you've dropped a table, it has only been renamed; the table segments are still sitting there in your tablespace, unchanged, taking up space. This space still counts against your user tablespace quotas, as well as filling up the tablespace. It will not be reclaimed until you get the table out of the recycle bin. You can remove an object from the recycle bin by restoring it, or by purging it from the recycle bin.

Ex: SQL> DROP TABLE test;
      SQL> SELECT * FROM RECYCLEBIN;
      SQL> FLASHBACK TABLE test TO BEFORE DROP;

      SQL> PURGE RECYCLEBIN;
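Two related commands are worth knowing (standard 10g syntax): restoring under a new name, useful when the original name has since been reused, and purging a single table instead of the whole recycle bin:

      SQL> FLASHBACK TABLE test TO BEFORE DROP RENAME TO test_restored;
      SQL> PURGE TABLE test;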

2PC (two-phase distributed commit)

Q: What is 2PC (two-phase distributed commit)?

Ans: Performing a distributed transaction is similar to performing a local transaction:

    Update local_table set x=10;
    Update remote_table@another_database set y=20;
    commit;

    Oracle will either commit in both databases or in neither; it uses a two-phase distributed commit (2PC) protocol to do this. Two-phase commit is a transaction protocol designed for the complications that arise with distributed resource managers. With a two-phase commit protocol, the distributed transaction manager employs a coordinator to manage the individual resource managers.

The commit process proceeds as follows:

  Phase 1:
        Each participating resource manager coordinates local operations and forces all log records out:
        If successful, responds "ok"
        If unsuccessful, either allows a time-out or responds "oops"

  Phase 2:
        If all participants respond "ok":
              Coordinator instructs participating resource managers to "COMMIT"
              Participants complete the operation, writing the log record for the commit
        Otherwise:
              Coordinator instructs participating resource managers to "ROLLBACK"
              Participants complete their respective local undos