Business Intelligence
Data, Business Intelligence and analytics
Organizations are drowning in data
- but lack information / knowledge
- our ability to collect and store data seems to have surpassed our ability to make sense of it
- Important trends:
- Storage capacity continues to rise rapidly
- The cost of storage continues to drop
What is Business Intelligence?
The applications and technologies for transforming business data into action.
Fact-based decision making typically involves a subset of the following skills or tools:
- data querying / SQL
- database design / data warehousing / data mining
- decision support systems / simulations
- data visualization / dashboards
Business intelligence (BI) is a technology-driven process for analyzing data and presenting actionable information to help executives, managers, and other corporate end users make informed business decisions.
The analytics give insight into the most successful sourcing channels and provide metrics to increase effectiveness, ultimately lowering the cost of the spend.
THE BI GOAL
BI is about using data to help enterprise users make better business decisions.
Right information + right time + right format -> better strategic, tactical, and operational decisions ===> a better company
Data, Information and Knowledge
Data is a collection of raw values, elements, or facts used for calculating, reasoning, or measuring.
Information is the result of collecting and organizing data in a way that establishes relationships between data items, thereby providing context and meaning.
Knowledge is the concept of understanding information based on recognized patterns, in a way that provides insight into the information.
Famous "Beer and Diapers" BI example
Consider the following convenience store data (register sales receipts).
Goal: convert data to knowledge
- Hard to do in practice
- Requires commitment / effort.
What is OLAP?
OLAP (Online Analytical Processing) is a process of performing multidimensional analysis at high speed on large volumes of data from a data warehouse, data mart, or some other unified, centralized data store.
An Online Analytical Processing server is based on the multidimensional data model. It allows managers and analysts to gain insight into the information through fast, consistent, and interactive access to it.
Online Transaction Processing, or OLTP, refers to data processing methods and software focused on transaction-oriented data and applications.
The main difference between OLAP and OLTP is in the name itself.
OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing) |
---|---|
OLAP is analytical in nature | OLTP is transactional in nature |
OLAP is designed for multidimensional analysis of data in a data warehouse, which contains both transactional and historical data. The OLAP server is typically the middle, analytical tier of a data warehousing solution | OLTP is designed to support transaction-oriented applications, processing transactions as quickly and accurately as possible |
Uses of OLAP include data mining and other business intelligence applications, complex analytical calculations, and predictive scenarios, as well as business reporting functions like financial analysis, budgeting, and forecast planning | Common uses of OLTP include ATMs, e-commerce software, credit card payment processing, online bookings, reservation systems, and record-keeping tools |
Data warehousing - Schemas
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates.
Types of Data Warehouse schema:
- Star schema
- Snowflake Schema
- Starflake Schema
A star schema is a data warehouse schema in which the center of the star has one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema data model is the simplest type of data warehouse schema.
A snowflake schema is a data warehouse schema in which there is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake. A snowflake schema is an extension of the star schema and adds additional dimension tables by normalizing the dimensions.
A starflake schema is a combination of the star schema and the snowflake schema. It is a snowflake schema in which some of the dimension tables have been denormalized. It allows more efficient access to the data by various tools and the ability to adapt to changes in user requirements. It is also extensible, as it supports adding new dimensions.
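A minimal T-SQL sketch of what a star schema might look like; the table and column names (FactSales, DimProduct, DimDate) are hypothetical, used only for illustration:

-- Dimension tables hold descriptive attributes
CREATE TABLE DimProduct (
    ProductKey   INT IDENTITY PRIMARY KEY,
    ProductCode  VARCHAR(20),
    ProductName  VARCHAR(100),
    Category     VARCHAR(50)
);
CREATE TABLE DimDate (
    DateKey      INT PRIMARY KEY,   -- e.g. 20240131
    FullDate     DATE,
    [Month]      TINYINT,
    [Year]       SMALLINT
);
-- The fact table holds numeric measures plus foreign keys to the dimensions
CREATE TABLE FactSales (
    ProductKey   INT REFERENCES DimProduct(ProductKey),
    DateKey      INT REFERENCES DimDate(DateKey),
    Quantity     INT,
    SalesAmount  DECIMAL(12,2)
);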
What are Facts and Fact tables?
Fact tables contain the data corresponding to a particular business process.
Each row represents a single event associated with that process and contains the measurement data associated with the event.
Fact tables contain the numeric, quantitative data (aka measures).
Typically there is one fact table per distinct business process.
Exception: "consolidated" facts (aka "merged" facts), such as actual vs. forecast, which require the same granularity and are frequently analyzed together.
Dimensions
- Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects.
- Dimension tables provide the descriptive context: attributes covering the who, what, when, why, or how. They should always include friendly names and descriptions. A typical query against such a schema is sketched below.
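For example, a minimal query sketch against the hypothetical star schema above (FactSales, DimProduct, DimDate are illustrative names):

select d.[Year], p.Category, sum(f.SalesAmount) as TotalSales
from FactSales f
inner join DimProduct p on f.ProductKey = p.ProductKey
inner join DimDate d on f.DateKey = d.DateKey
group by d.[Year], p.Category;
-- measures (SalesAmount) come from the fact table; context (Year, Category) comes from the dimensions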
ETL
ETL is a process used for extracting data from a source (database, XML file, text files, etc.), transforming the data, and then loading the data into the data warehouse.
ETL stands for Extract, Transform, Load. ETL extracts the data from different sources (an Oracle database, an XML file, a text file, etc.), then transforms the data (by applying aggregate functions, keys, joins, etc.).
Finally the data is loaded into the data warehouse for analytics.
ETL has three main processes:
- Extraction
- Transform
- Load
Extraction
Extraction generally starts with extracting data from various heterogeneous sources; these might be flat files (CSV, TXT, etc.), Excel spreadsheets, API-based systems, or various other databases, possibly belonging to an operational system (OLTP).
Flat files or spreadsheets can be made available through FTP or an S3 bucket, along with their control total file.
A control total file contains the header information of the data, with row counts and other basic information that can be used to check the consistency of the data after extracting it from the source to the destination.
There are different strategies for data extraction:
- Full extraction (refresh)
- Partial extraction (append)
If the extraction is full, the last loaded data is first taken as a backup, the new set of data is loaded into staging, and the delta is calculated between the new data load and the existing data.
If the extraction is partial, the new data is added to the existing data, duplicates are filtered out, and existing records are updated if they arrive with new values.
After data is extracted from the source and loaded onto the staging server, data load verification is done using the control total file that comes along with the data files, as sketched below.
The last step of extraction is data profiling.
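A minimal T-SQL sketch of that verification step, assuming the control totals have been loaded into a hypothetical ControlTotals table and the extracted rows into a hypothetical StagingSales table:

-- Compare the row count loaded into staging with the expected count from the control total file
DECLARE @expected INT, @actual INT;
SELECT @expected = ExpectedRowCount FROM ControlTotals WHERE FileName = 'sales_20240131.csv';
SELECT @actual = COUNT(*) FROM StagingSales;
IF @expected <> @actual
    RAISERROR('Row count mismatch: expected %d, loaded %d', 16, 1, @expected, @actual);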
Transformation
The transformation step consists of data integration, data cleansing, mapping, and data transformation.
Data extracted from source systems is raw and cannot be used as-is; it must be standardized as needed. Data transformations may be done per the business requirements, definitions may be added, and data integration is performed.
Transformation does not mean adding or deleting data; it is transformation from one form to another without losing its meaning.
Some transformation tasks:
- Data integration: in this step data from different sources are integrated
- Selection: the process of selecting only those columns which are actually needed or required by the output system
- Matching: in this step we match the raw data against the lookup or master data it needs to be transformed to
- Data cleansing: this process standardizes various data across the company based on specific rules and adjusts it so it can be used easily by new services and applications. It must especially be done when data is migrated from a mainframe and the system is changed from a legacy one to a new one
Major transformation types (see the sketch after this list):
- Data standardizing: as data are fetched from different sources they may not be in the same format, so data standardizing is done
- Calculated and derived data: in this step different new columns may be derived from source columns as per the business requirements
- Splitting and merging: as per the business requirements, we may have to split one column into different columns or merge several columns into a single column
- Conversion of units of measurement: in this step source data are converted to a uniform unit of measurement
- De-duplication: in this step duplicate data received from the source system are removed
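A minimal T-SQL sketch of a few of these transformations (unit conversion, a derived column, and de-duplication) against a hypothetical StagingSales table; all column names are illustrative:

-- Standardize units, derive a total, and keep only one row per order line
;WITH cleaned AS (
    SELECT
        OrderId,
        ProductCode,
        Quantity,
        WeightInGrams / 1000.0 AS WeightKg,          -- conversion of unit of measurement
        Quantity * UnitPrice   AS LineTotal,         -- calculated / derived data
        ROW_NUMBER() OVER (PARTITION BY OrderId, ProductCode
                           ORDER BY LoadTimestamp DESC) AS rn
    FROM StagingSales
)
SELECT OrderId, ProductCode, Quantity, WeightKg, LineTotal
FROM cleaned
WHERE rn = 1;   -- de-duplication: keep only the latest row per key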
Loading
Data loading is the process of loading into the warehouse or database the data that has been prepared by applying the transformation logic.
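A minimal sketch of the load step, assuming the transformed rows sit in a hypothetical staging.CleanSales table and the target is a warehouse fact table loosely following the star-schema sketch above:

-- Load the transformed rows from staging into the warehouse fact table
INSERT INTO dw.FactSales (ProductKey, DateKey, Quantity, SalesAmount)
SELECT p.ProductKey, d.DateKey, s.Quantity, s.LineTotal
FROM staging.CleanSales AS s
JOIN dw.DimProduct AS p ON p.ProductCode = s.ProductCode
JOIN dw.DimDate    AS d ON d.FullDate    = s.OrderDate;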
Need for ETL
- ETL extracts, transforms, and loads raw data into user-ready data
- ETL helps firms examine their business data to make critical business decisions
- It provides a technique for transferring data from multiple sources into a data warehouse
- Transactional databases cannot answer complicated business questions, but ETL makes it possible to answer them
- ETL is a preset process for accessing and refining data sources into useful data
- When a data source changes, the data warehouse is updated
- Properly designed and validated ETL software is essential for successful data warehouse management
- It helps improve productivity because it is simplified and can be reused without the need for deep technical skills
ETL tools
- Informatica PowerCenter
- Oracle Data Integrator (ODI)
- Pentaho Data Integration (PDI)
- SQL Server Integration Services (SSIS)
- Cognos Data Manager
- QlikView Expressor
- SAP Data Services
- Talend Open Studio & Integration Suite
- IBM Information Server (DataStage)
ETL pipelining tools
- AWS Glue
- Microsoft Flow
- Apache Airflow
Database concept
ACID
Atomicity
- The database follows the all-or-nothing rule, i.e. it considers all of a transaction's operations as one whole unit, or atom
- When the database processes a transaction, it is either fully completed or not executed at all
Consistency
- Ensures that only valid data, following all rules and constraints, is written to the database
- When a transaction results in invalid data, the database reverts to its previous state, which abides by all customary rules and constraints
Isolation
- Ensures that transactions are securely and independently processed at the same time without interference, but it doesn't ensure the order of transactions
- e.g. user A withdraws $100 and user B withdraws $250 from user Z's account, which has a balance of $1000
- Since both A and B withdraw from Z's account, one of the users is required to wait until the other transaction is completed
- If B is required to wait, then B must wait until A's transaction is completed and Z's account balance changes to $900. Now B can withdraw $250 from this $900 balance
Durability
- User B may withdraw $250 only after A's transaction is completed and updated in the database. If the system fails before A's transaction is logged in the database, A cannot withdraw any money and Z's account returns to its previous consistent state
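A minimal T-SQL sketch of how atomicity and durability show up in practice: the two updates below either commit together or roll back together (the Accounts table and the amounts are hypothetical):

BEGIN TRANSACTION;
BEGIN TRY
    -- withdraw $100 from Z and credit A's account as one atomic unit
    UPDATE Accounts SET Balance = Balance - 100 WHERE AccountId = 'Z';
    UPDATE Accounts SET Balance = Balance + 100 WHERE AccountId = 'A';
    COMMIT TRANSACTION;      -- durability: once committed, the change survives failures
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;    -- atomicity: on any error, neither update is applied
END CATCH;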
Important Terminology and Concepts
- Null
- Entity Integrity
- Relational Integrity
- Referential Integrity
- General Constraints
Redundancy
Dependencies between attributes within a relation cause redundancy
e.g. all addresses in the same town have the same zip code:
ssn | name | town | zip |
---|---|---|---|
1234 | Joe | Huntingdon | 1665 |
2345 | Mary | Huntingdon | 1665 |
3456 | Tom | Huntigdon | 1665 |
5948 | Harry | Alexandra | 16603 |
Consistency and integrity are harder to maintain even in this simple example:
ensuring that a zip code always refers to the same city, and that the city is spelled consistently.
Note that we don't have a zip-code-to-city fact stored unless there is a person from that zip code.
Anomalies
An anomaly is an inconsistent, incomplete, or contradictory state of the database.
- Insertion anomaly - a user is unable to insert a new record when it should be possible to do so, because not all the information is available
- Deletion anomaly - when a record is deleted, other information that is tied to it is also deleted
- Update anomaly - a record is updated, but other appearances of the same items are not updated
Studentid | Activity | Cost |
---|---|---|
1 | swimming | $23 |
2 | swimming | $23 |
3 | dancing | $23 |
4 | Fencing | $20 |
studentid | Activity |
---|---|
21 | dancing |
21 | swimming |
34 | dancing |
55 | fencing |
Activity | Cost |
---|---|
Dancing | $23 |
Swimming | $50 |
Fencing | $20 |
Splitting the relations
Keys in RDBMS
- Super key
- Candidate key
- composite key
- Primary key
- Foreign key
- Unique key
Functional Dependency
A functional dependency is a relationship that exists when one attribute uniquely determines another attribute.
If R is a relation with attributes X and Y, a functional dependency between the attributes is represented as
X -> Y
which specifies that Y is functionally dependent on X. Here X is the determinant set and Y is the dependent attribute.
Each value of X is associated with precisely one Y value.
A functional dependency in a database serves as a constraint between two sets of attributes.
Defining functional dependencies is an important part of relational database design and contributes to normalization.
For a table with the following fields:
empnum
empemail
empfname
emplname
if empnum is the key, then the functional dependencies
empnum -> empemail
empnum -> empfname
empnum -> emplname
must exist.
FUNCTIONAL DEPENDENCY
empnum -> empemail
The attribute on the LHS is known as the determinant:
empnum is the determinant of empemail
Transitive dependency
Consider attributes A, B, and C where
A -> B and B -> C.
Functional dependencies are transitive, which means that we also have the functional dependency A -> C.
We say that C is transitively dependent on A through B.
A partial dependency exists when an attribute B is functionally dependent on an attribute A, and A is a component of a multipart candidate key.
Normalization
- Normalization is the process that improves a database design by generating relations that are in higher normal forms.
The objective of normalization is to create relations where every dependency is on the key, the whole key, and nothing but the key.
First Normal Form
basic rule
- It contains only atomic values
- There are no repeating groups
A database table is said to be in 1NF if it contains no repeating fields / columns.
The process of converting a UNF (unnormalized) table into 1NF is as follows:
- Separate the repeating fields into a new database table, along with the key from the unnormalized table
- The primary key of the new database table may be a composite key
Empnum | Empphone | Empdegree |
---|---|---|
123 | 233-9876 | |
333 | 233-1231 | BA, BSc, PhD |
679 | 244-1231 | BSc, MSc |
We can normalize the above table by splitting it into 2 tables:
employee
Empnum | Empphone |
---|---|
123 | 233-9876 |
333 | 233-1231 |
679 | 244-1231 |
employeeDegree
Empnum | Empdegree |
---|---|
333 | BA |
333 | BSc |
333 | PhD |
679 | BSc |
679 | MSc |
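A minimal T-SQL sketch of the resulting 1NF tables; the data types are assumed for illustration:

CREATE TABLE employee (
    empnum    INT PRIMARY KEY,
    empphone  VARCHAR(20)
);
CREATE TABLE employeeDegree (
    empnum    INT REFERENCES employee(empnum),
    empdegree VARCHAR(10),
    PRIMARY KEY (empnum, empdegree)   -- composite key, as noted above
);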
Second normal form
basic rule
- It is in first normal form
- All non-key attributes are fully functionally dependent on the primary key
Process
- Remove the partial dependencies
- A partial dependency is a type of functional dependency where a field is functionally dependent on only part of the primary key
- If field B depends on field A and vice versa (for a given value of B there is only one possible value of A, and vice versa), then we put field B into a new database table, where B will be the primary key and also marked as a foreign key in the parent table
Third Normal Form
basic rule
- it is in second normal form
- there is no transitive functional dependency
A database table is said to be in 3NF if it is in 2NF and all non-key fields depend on the primary key; equivalently, a table is in 3NF if it is in 2NF and no field of the table is transitively functionally dependent on the primary key.
Process
- Remove the transitive dependencies
- Make a separate table for the transitively dependent fields
T-SQL
Transact-SQL (T-SQL) is Microsoft's and Sybase's proprietary extension to conventional SQL.
This enhancement brought procedural programming language functionality such as local variables and control-of-flow constructs (IF ... ELSE; BEGIN ... END; the CONTINUE and BREAK commands; looping; etc.),
plus additional functions for various needs, as well as the ability to create our own functions and procedures.
Hence we can regard T-SQL as a language.
Its way of working is a bit different from Oracle's,
but almost everything is the same.
Stored procedures can also return results directly, which is not the case in PL/SQL.
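A minimal sketch of those procedural features (local variables, a WHILE loop, and IF ... ELSE); the logic is purely illustrative:

-- local variable declaration and assignment
DECLARE @counter INT = 1, @total INT = 0;

-- control-of-flow: WHILE loop with a BEGIN ... END block
WHILE @counter <= 5
BEGIN
    SET @total = @total + @counter;
    SET @counter = @counter + 1;
END;

-- control-of-flow: IF ... ELSE
IF @total > 10
    PRINT 'total is ' + CAST(@total AS VARCHAR(10));
ELSE
    PRINT 'total is small';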
Logical Query Processing
- FROM
- ON
- JOIN
- WHERE
- GROUP BY
- WITH CUBE/ROLLUP
- HAVING
- SELECT
- DISTINCT
- ORDER BY
- TOP
- OFFSET/FETCH
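A minimal query annotated with that logical processing order; the orders table is hypothetical:

select top (5) customer_id, sum(amount) as total   -- 5: SELECT, 7: TOP
from orders                                         -- 1: FROM
where order_date >= '2024-01-01'                    -- 2: WHERE
group by customer_id                                -- 3: GROUP BY
having sum(amount) > 1000                           -- 4: HAVING
order by total desc;                                -- 6: ORDER BY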
SQL anatomy
select <select list> defines which columns to return
from <table source> defines the tables to query
where <search condition> filters the returned rows using a predicate
group by <group by list> arranges rows into groups
having <search condition> filters groups using a predicate
order by <order by list> sorts the results
select branchID, count(staff_id)
from workers
where branchType = 'main'   -- hypothetical value; the original example was truncated here
group by branchID;
Aggregate functions
count - returns the number of values in a column
sum - returns the sum of the values in a column
avg - returns the mean (average) of the values in a column
min - returns the lowest value in a column
max - returns the highest value in a column
min and max can also be applied to dates
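A minimal example combining these aggregates; the orders table and its columns are hypothetical:

select count(order_id)  as order_count,
       sum(amount)      as total_amount,
       avg(amount)      as average_amount,
       min(order_date)  as first_order,   -- min on a date
       max(order_date)  as latest_order   -- max on a date
from orders;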
Group by
The group by clause divides the rows returned from the select statement into groups.
For each group you can apply an aggregate function.
syntax
select col1, aggregate_function(col2)
from tbl_name
group by col1;
If group by is used without an aggregate function, it acts like a DISTINCT clause, as shown below.
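For instance, assuming a hypothetical customers table, the two queries below return the same set of cities:

-- group by without an aggregate behaves like distinct
select city from customers group by city;
select distinct city from customers;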
HAVING
We often use the having clause in conjunction with the group by clause to filter grouped rows that do not satisfy a specified condition.
syntax
select
col1,
aggregate_function(col2)
from tbl_name
group by
col1
having
condition
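A concrete example, reusing the hypothetical orders table from above: keep only customers whose orders total more than 1000.

select customer_id, sum(amount) as total_amount
from orders
group by customer_id
having sum(amount) > 1000;   -- filters groups, not individual rows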
Joins
Joins are used to retrieve data from multiple tables.
A JOIN is performed whenever two or more tables are joined in a SQL statement.
4 types of joins:
- Inner join(simple join)
- outer join
- left
- right
- full
- self join
- cross join
inner join
the most common type of join
returns all rows from multiple tables where the join condition is met
select col
from tbl1 inner join tbl2
on tbl1.col = tbl2.col;
left outer join
returns all rows from the left-hand table specified in the ON condition, and only those rows from the other table where the joined fields are equal
select col
from tbl1
left join tbl2
on tbl1.col=tbl2.col;
Right outer join
returns all rows from the right-hand table specified in the ON condition, and only those rows from the other table where the joined fields are equal
select col
from tbl1
right join tbl2
on tbl1.col = tbl2.col;
full outer join
returns all rows from the left and right tables, with NULL in place where the join condition is not met
select col
from tbl1
full join tbl2
on tbl1.col = tbl2.col;
Self Join
is a query in which a table is joined to itself.
Self joins are useful for comparing values in a column across rows within the same table.
select col
from A a
inner join A b on join_predicate;
cross join
the cross join clause allows you to produce the Cartesian product of rows from 2 or more tables
select * from t1 cross join t2;
select * from t1,t2;
select * from t1 inner join t2 on true;
Union
Union concatenates the results of 2 queries into a single result set.
Whether the result includes duplicate rows can be controlled:
UNION ALL - includes duplicates
UNION - excludes duplicates
A union operation is different from a join:
union concatenates the result sets of 2 queries
union doesn't create individual rows from columns gathered from 2 tables
a join compares columns from 2 tables to create result rows composed from both tables
eg
select firstname, lastname from table1
union
select firstname, lastname from table2
Intersect / Except
Intersect returns the distinct rows that are output by both the left and right input queries.
Except returns the distinct rows from the left input query that aren't output by the right input query.
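A small illustration using two hypothetical tables of customer names:

-- rows appearing in both tables
select firstname, lastname from customers_2023
intersect
select firstname, lastname from customers_2024;

-- rows only in the first table
select firstname, lastname from customers_2023
except
select firstname, lastname from customers_2024;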
Pattern matching
The SQL Server LIKE operator is a logical operator that determines whether a character string matches a specified pattern.
A pattern may include regular characters and wildcard characters. The LIKE operator is used in the where clause of select, update, and delete statements to filter rows based on pattern matching.
Pattern
The pattern is a sequence of characters to search for in the column or expression. It can include the following valid wildcard characters:
the percent wildcard (%): any string of zero or more characters
the underscore (_) wildcard: any single character
the [list of characters] wildcard: any single character within the specified set or range
the [^] wildcard: any single character not within the list or range
LIKE
column | expression LIKE pattern [ESCAPE escape_character]
eg
select first_name, last_name from customer
where first_name like 'jen%'
select cid, fn, ln from sales.customers
where ln like '_u%'
order by fn
select cid, fn, ln from tbl
where ln like '[a-c]%' order by fn
select [description] from Production.ProductDescription
where [description] like '[^a-t]%'
COALESCE/NULLIF
coalesce
It evaluates the arguments in order and returns the current value of the first expression that does not evaluate to NULL.
example
select coalesce(NULL, NULL, 'SOMEVALUE', 'OTHERVALUE');
the above query returns 'SOMEVALUE'
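NULLIF, the other function named in this section's heading, returns NULL if its two arguments are equal and the first argument otherwise; a small illustrative sketch (the sales table is hypothetical):

select nullif(10, 10);   -- returns NULL because the arguments are equal
select nullif(10, 20);   -- returns 10
-- a common guard against divide-by-zero
select total / nullif(quantity, 0) from sales;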