SSIS

 Business Intelligence


Data, Business Intelligence and analytics


Organizations are drowning in data

  • but they lack information / knowledge
  • our ability to collect and store data seems to have surpassed our ability to make sense of it
  • Important trends:
    • storage capacity continues to rise rapidly
    • the cost of storage continues to drop



What is Business Intelligence?

The applications and technologies for transforming business data into action.

Fact-based decision making typically involves a subset of the following skills or tools:



  • data querying / SQL
  • database design / data warehousing / data mining
  • decision support systems / simulations
  • data visualization / dashboards
Business intelligence (BI) is a technology-driven process for analyzing data and presenting actionable information to help executives, managers, and other corporate end users make informed business decisions.

The analytics give insight into the most successful sourcing channels and provide metrics to increase effectiveness, ultimately lowering the cost of the spend.

THE BI GOAL

BI is about using data to help enterprise users make better business decisions


Right information + Right time + Right format
                        |
                        |
                        V

strategic, tactical, and operational decisions


===> a better company


Data / Information and knowledge

Data is a collection of raw value elements or facts used for calculating, reasoning, or measuring.

Information is the result of collecting and organizing data in a way that establishes relationships between data items, thereby providing context and meaning.

Knowledge is the understanding of information based on recognized patterns, in a way that provides insight into that information.


Famous "Beer and Diapers" BI examples

consider the following convenience store data (register sales receipts)



Goal : Convert data to knowledge

- Hard to do in practice 
- required commitment/effort.



What is OLAP?

OLAP (Online Analytical Processing) is a process of performing multidimensional analysis at high speed on large volumes of data from a data warehouse, data mart, or some other unified, centralized data store.

An online analytical processing server is based on the multidimensional data model. It allows managers and analysts to get insight into the information through fast, consistent, and interactive access to it.


Online transaction processing, or OLTP, refers to data processing methods and software focused on transaction-oriented data and applications.

The main difference between OLAP and OLTP is in the name itself.

OLAP (Online Analytical Processing) vs OLTP (Online Transaction Processing)
  • OLAP is analytical in nature; OLTP is transactional in nature.
  • OLAP servers are designed for multidimensional analysis of data in a data warehouse, which contains both transactional and historical data; the OLAP server is typically the middle, analytical tier of a data warehousing solution. OLTP, by contrast, is designed to support transaction-oriented applications and to record transactions as quickly and accurately as possible.
  • Uses of OLAP include data mining and other business intelligence applications, complex analytical calculations, and predictive scenarios, as well as business reporting functions such as financial analysis, budgeting, and forecast planning. Common uses of OLTP include ATMs, e-commerce software, credit card payment processing, online bookings, reservation systems, and record-keeping tools.


Data warehousing - Schemas

A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates.

Types of Data Warehouse schema:
  • Star schema
  • Snowflake Schema
  • Starflake Schema
A star schema is a data warehouse schema in which the center of the star has one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema data model is the simplest type of data warehouse schema.


A snowflake schema is a data warehouse schema in which there is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of the star schema, and it adds additional dimension tables.

A starflake schema is a combination of the star schema and the snowflake schema. It is a snowflake schema in which some of the dimension tables have been denormalized. It allows more efficient access to the data by various tools and the ability to adapt to changes in user requirements. It is also extensible, as it supports adding new dimensions.

What are Facts and Fact tables?

A fact table contains the data corresponding to a particular business process.
Each row represents a single event associated with that process and contains the measurement data associated with that event.
It contains the numeric, quantitative data (aka measures).

Typically there is one fact table per distinct business process.
Exception: "consolidated" facts (aka "merged" facts), such as actual vs. forecast, which require the same granularity and are frequently analyzed together.


Dimensions
  • Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects.
  • Dimension tables provide the descriptive context - attributes covering the who, what, when, why, or how. They should always include friendly names and descriptions.
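
As a minimal sketch of how a fact table relates to its dimension tables in a star schema, the T-SQL below defines a hypothetical retail example; the names FactSales, DimDate, DimProduct, DimStore and all of their columns are illustrative assumptions, not from these notes:

-- dimension tables: descriptive context (who / what / when)
create table DimDate (
    DateKey      int primary key,        -- surrogate key, e.g. 20240131
    FullDate     date not null,
    CalendarYear int  not null
);

create table DimProduct (
    ProductKey  int identity primary key,  -- surrogate key
    ProductName varchar(100) not null,
    Category    varchar(50)  not null
);

create table DimStore (
    StoreKey  int identity primary key,
    StoreName varchar(100) not null,
    Region    varchar(50)  not null
);

-- fact table: one row per sales event, numeric measures plus foreign keys
create table FactSales (
    DateKey     int not null references DimDate(DateKey),
    ProductKey  int not null references DimProduct(ProductKey),
    StoreKey    int not null references DimStore(StoreKey),
    Quantity    int           not null,   -- measure
    SalesAmount decimal(12,2) not null    -- measure
);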


ETL

ETL is a process used to extract data from sources (databases, XML files, text files, etc.), transform the data, and then load the data into the data warehouse.

ETL stands for Extract, Transform, Load. ETL extracts the data from different sources (an Oracle database, an XML file, a text file, etc.), then transforms the data (by applying aggregate functions, keys, joins, etc.).
Finally the data is loaded into the data warehouse for analytics.


ETL has three main processes:
  • Extraction
  • Transform
  • Load






Extraction
Extraction generally starts with pulling data from various heterogeneous sources: flat files (CSV, TXT, etc.), Excel spreadsheets, API-based systems, or various other databases, which may belong to an operational system (OLTP).

Flat files or spreadsheets can be made available through FTP or an S3 bucket, along with their control total file.
A control total file is a file that contains header information about the data, with row counts and other basic information that can be used to check the consistency of the data after extracting it from the source to the destination.



There can be different strategies for data extraction:
  • Full extraction (refresh)
  • Partial extraction (append)
If the extraction is full, the last loaded data is first taken as a backup, the new set of data is loaded into staging, and the delta is calculated between the new data load and the existing data.

If the extraction is partial, the new data is added to the existing data; duplicates are filtered out, and existing records are updated if they arrive with new values.


After data is extracted from the source and loaded onto the staging server, data load verification is done using the control total file that comes along with the data files.

The last step of extraction is data profiling.
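
A minimal T-SQL sketch of the partial (append) strategy with a control-total-style row count check; the names src.Orders, stg.Orders, etl.Watermark and the LastModified column are hypothetical placeholders, not part of these notes:

-- 1. read the watermark: the point up to which data was already extracted
declare @LastExtracted datetime2 =
    (select WatermarkValue from etl.Watermark where TableName = 'Orders');

-- 2. partial (append) extraction: pull only rows changed since the watermark
insert into stg.Orders (OrderId, CustomerId, OrderTotal, LastModified)
select OrderId, CustomerId, OrderTotal, LastModified
from src.Orders
where LastModified > @LastExtracted;

-- 3. control-total-style verification: compare the staged row count
--    against the count reported by the source for the same window
declare @StagedRows int =
    (select count(*) from stg.Orders where LastModified > @LastExtracted);
declare @SourceRows int =
    (select count(*) from src.Orders where LastModified > @LastExtracted);

if @StagedRows <> @SourceRows
    raiserror('Extraction row counts do not match the control total.', 16, 1);
else
    update etl.Watermark
    set WatermarkValue = (select max(LastModified) from stg.Orders)
    where TableName = 'Orders';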





Transformation

The transformation step consists of data integration, data cleansing, mapping, and data transformation.
Data extracted from source systems is raw and cannot be used as-is; it has to be standardized as needed. Data transformations may be applied as per the business requirements or definitions, and data integration is performed.

Transformation doesn't mean adding or deleting data; it is transformation from one form to another without losing its meaning.

Some transformation tasks
  1. Data integration: in this step data from different sources is integrated.
  2. Selection: the process of selecting only those columns which are actually needed or required by the output system.
  3. Matching: in this step we match the raw data with the lookup or master data into which it needs to be transformed.
  4. Data cleansing: this process standardizes various data across the company based on specific rules and adjusts it so it can easily be used for new services and applications. It especially must be done when data is migrated from a mainframe and the system is changed from a legacy one to a new one.
Major transformation types (see the sketch after this list)
  • Data standardizing: data fetched from different sources may not be in the same format, so it is standardized.
  • Calculated and derived data: new columns may be derived from source columns as per the business requirements.
  • Splitting and merging: as per the business requirements, a column may have to be split into several columns, or several columns merged into one.
  • Conversion of units of measurement: source data is converted into a uniform unit of measurement.
  • Deduplication: duplicate data received from the source system is removed.
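
A minimal T-SQL sketch combining a few of the transformation types above (standardizing, a derived column / unit conversion, splitting, and deduplication); the tables stg.Orders and wrk.Orders, the columns, and the conversion rate are assumptions for illustration only:

-- standardize, derive, split, and deduplicate in one pass from staging to a working table
with Deduplicated as (
    select
        OrderId,
        upper(ltrim(rtrim(CountryCode)))  as CountryCode,    -- standardize text
        OrderTotal * 0.92                 as OrderTotalEUR,   -- derived column / unit conversion (assumed rate)
        left(CustomerName, charindex(' ', CustomerName + ' ') - 1) as FirstName,  -- split a column
        row_number() over (partition by OrderId order by LastModified desc) as rn
    from stg.Orders
)
insert into wrk.Orders (OrderId, CountryCode, OrderTotalEUR, FirstName)
select OrderId, CountryCode, OrderTotalEUR, FirstName
from Deduplicated
where rn = 1;   -- deduplication: keep only the latest row per OrderId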



Loading
Data loading is the process of loading into the warehouse or database the data that was prepared by applying the transformation logic.
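
A minimal sketch of the load step, moving transformed rows from a working area into a warehouse table; dw.FactOrders and wrk.Orders are hypothetical names continuing the sketch above:

-- load: append the transformed rows into the warehouse table
insert into dw.FactOrders (OrderId, CountryCode, OrderTotalEUR, FirstName, LoadDate)
select OrderId, CountryCode, OrderTotalEUR, FirstName, sysdatetime()
from wrk.Orders w
where not exists (                      -- avoid reloading rows already in the warehouse
    select 1 from dw.FactOrders f where f.OrderId = w.OrderId
);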




Need for ETL
  • ETL is a tool that extracts, transforms, and loads raw data into user data
  • ETL helps firms examine their business data to make critical business decisions
  • it provides a technique for transferring data from multiple sources into a data warehouse
  • transactional databases cannot answer complicated business questions, but ETL makes it possible to answer them
  • ETL is a preset process for accessing and refining data sources into a piece of useful data
  • when a data source changes, the data warehouse is updated accordingly
  • properly designed and validated ETL software is essential for successful data warehouse management
  • it helps to improve productivity because it is simplified and can be used without the need for technical skills
ETL tools
  • Informatica PowerCenter
  • Oracle Data Integrator (ODI)
  • Pentaho Data Integration (PDI)
  • SQL Server Integration Services (SSIS)
  • Cognos Data Manager
  • QlikView Expressor
  • SAP Data Services
  • Talend Open Studio & Integration Suite
  • IBM Information Server (DataStage)

ETL pipelining tools
  • AWS Glue
  • Microsoft Flow
  • Apache Airflow
Database concepts
ACID

Atomicity
  • the database follows the all-or-nothing rule, i.e. the database considers all transaction operations as one whole unit or atom
  • when the database processes a transaction, it is either fully completed or not executed at all

Consistency
  • ensures that only valid data, following all rules and constraints, is written to the database
  • when a transaction results in invalid data, the database reverts to its previous state, which abides by all customary rules and constraints

Isolation
  • ensures that transactions are securely and independently processed at the same time without interference, but it does not ensure the order of transactions
  • e.g. user A withdraws $100 and user B withdraws $250 from user Z's account, which has a balance of $1000
  • since both A and B withdraw from Z's account, one of the users is required to wait until the other transaction is completed
  • if B is required to wait, then B must wait until A's transaction is completed and Z's account balance changes to $900. Now B can withdraw $250 from this $900 balance

Durability
  • user B may withdraw $100 only after A's transaction is completed and is updated in the database. If the system fails before A's transaction is logged in the database, A cannot withdraw any money and Z's account returns to its previous consistent state
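
A minimal T-SQL sketch of atomicity: the two statements below either both commit or both roll back as a single unit. The Accounts and Withdrawals tables and their columns are assumed for illustration:

begin try
    begin transaction;

    -- debit Z's account and record the withdrawal as one atomic unit
    update Accounts set Balance = Balance - 100 where AccountId = 'Z';
    insert into Withdrawals (AccountId, Amount, WithdrawnAt)
    values ('Z', 100, sysdatetime());

    commit transaction;   -- both changes become durable together
end try
begin catch
    rollback transaction; -- any failure undoes both changes (all or nothing)
    throw;
end catch;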

Important Terminology and Concepts
  • Null
  • Entity Integrity
  • Relational Integrity
  • Referential Integrity
  • General Constraints
Redundancy
Dependencies between attributes within a relation cause redundancy,
e.g. all addresses in the same town have the same zip code.


ssn  | name  | town       | zip
1234 | Joe   | Huntingdon | 1665
2345 | Mary  | Huntingdon | 1665
3456 | Tom   | Huntingdon | 1665
5948 | Harry | Alexandra  | 16603

There is clearly redundant information stored here.

Consistency and integrity are harder to maintain, even in this simple example:
ensuring that a zip code always refers to the same city and that the city name is spelled consistently.

Note we don't have a zip-code-to-city fact stored unless there is a person from that zip code.


Anomalies

An anomaly is an inconsistent, incomplete, or contradictory state of the database.

  • Insertion anomaly - the user is unable to insert a new record when it should be possible to do so, because not all information is available
  • Deletion anomaly - when a record is deleted, other information that is tied to it is also deleted
  • Update anomaly - a record is updated, but other appearances of the same items are not updated
StudentId | Activity | Cost
1         | Swimming | $23
2         | Swimming | $23
3         | Dancing  | $23
4         | Fencing  | $20

The table can be split:

StudentId | Activity
21        | Dancing
21        | Swimming
34        | Dancing
55        | Fencing

Activity  | Cost
Dancing   | $23
Swimming  | $50
Fencing   | $20


Splitting the relations removes the redundancy and avoids the anomalies above.


Keys in RDBMS
  • Super key
  • Candidate key
  • composite key
  • Primary key
  • Foreign key
  • Unique key

Functional Dependency

Functional dependency is a relationship that exists when one attribute uniquely determines another attribute.

If R is a relation with attributes X and Y, a functional dependency between the attributes is represented as

X -> Y

which specifies that Y is functionally dependent on X. Here X is the determinant set and Y is the dependent attribute.

Each value of X is associated with precisely one Y value.

A functional dependency in a database serves as a constraint between two sets of attributes.

Defining functional dependencies is an important part of relational database design and contributes to aspects of normalization.



For a table with the following fields:
empnum
empemail
empfname
emplname

if empnum is the key, then the FDs

empnum -> empemail
empnum -> empfname
empnum -> emplname

must exist.
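
As a minimal sketch, these are exactly the dependencies a primary key enforces; the hypothetical employee table below makes empnum the determinant of every other column:

create table employee (
    empnum   int primary key,       -- determinant: each empnum appears only once
    empemail varchar(100) not null,
    empfname varchar(50)  not null,
    emplname varchar(50)  not null
    -- empnum -> empemail, empnum -> empfname, empnum -> emplname all hold,
    -- because a given empnum can only ever map to one row of values
);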







 








FUNCTIONAL DEPENDENCY

empnum -> empemail

the attribute on the LHS is known as the determinant;
empnum is the determinant of empemail



Transitive dependency

Consider attributes A, B, and C where

A -> B and B -> C

Functional dependencies are transitive, which means that we also have the functional dependency A -> C.


We say that C is transitively dependent on A (via B).

A partial dependency exists when an attribute B is functionally dependent on attribute A, and A is a component of a multipart candidate key.



Normalization
  • Normalization is the process that improves a database design by generating relations that are in higher normal forms

The objective of normalization is to create relations where every dependency is on the key, the whole key, and nothing but the key.


First Normal Form

basic rules
  • contains only atomic values
  • there are no repeating groups
A database table is said to be in 1NF if it contains no repeating fields/columns.

The process of converting an unnormalized (UNF) table into 1NF is as follows:
  • separate the repeating fields into a new database table, along with the key from the unnormalized table
  • the primary key of the new database table may be a composite key

Empnum | Empphone | Empdegree
123    | 233-9876 |
333    | 233-1231 | BA, BSc, PhD
679    | 244-1231 | BSc, MSc

We can normalize the above table by splitting it into 2 tables:

employee
Empnum | Empphone
123    | 233-9876
333    | 233-1231
679    | 244-1231


employeeDegree

Empnum | Empdegree
333    | BA
333    | BSc
333    | PhD
679    | BSc
679    | MSc
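
A minimal T-SQL sketch of the same 1NF split; the table and column names follow the example above, and the composite primary key on employeeDegree reflects the note that the new table's key may be composite:

create table employee (
    Empnum   int primary key,
    Empphone varchar(20) not null
);

-- the repeating group (degrees) moves to its own table, keyed by Empnum + Empdegree
create table employeeDegree (
    Empnum    int         not null references employee(Empnum),
    Empdegree varchar(20) not null,
    primary key (Empnum, Empdegree)   -- composite key
);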


Second normal form

basic rules
  • it is in first normal form
  • all non-key attributes are fully functionally dependent on the primary key

A database table is said to be in 2NF if it is in 1NF and contains only those fields/columns that are fully functionally dependent on the primary key.


process
  • remove the partial dependencies
  • a partial dependency is a type of functional dependency where a field is functionally dependent on only part of the primary key
  • if field B depends on field A and vice versa (for a given value of B we have only one possible value of A, and vice versa), then we put field B into a new database table where B will be the primary key and will also be marked as a foreign key in the parent table

Third Normal Form
basic rules
  • it is in second normal form
  • there is no transitive functional dependency
A database table is said to be in 3NF if it is in 2NF and all non-key fields depend only on the primary key; equivalently, a table is in 3NF if it is in 2NF and no field of the table is transitively functionally dependent on the primary key.


Process (a sketch follows below)
  • remove the transitive dependencies
  • make a separate table for the transitively dependent fields
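
A minimal sketch of removing a transitive dependency, using a hypothetical Student table where StudentId -> ZipCode and ZipCode -> City (so City depends on the key only transitively):

-- before 3NF: City depends on ZipCode, not directly on the key StudentId
-- create table Student (StudentId int primary key, Name varchar(50), ZipCode varchar(10), City varchar(50));

-- after 3NF: the transitively dependent field moves to its own table
create table ZipCodes (
    ZipCode varchar(10) primary key,
    City    varchar(50) not null
);

create table Student (
    StudentId int primary key,
    Name      varchar(50) not null,
    ZipCode   varchar(10) not null references ZipCodes(ZipCode)
);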



https://fdocuments.in/document/the-process-of-normalisation.html



T-SQL

Transact-SQL (T-SQL) is Microsoft's and Sybase's proprietary extension to conventional SQL.
This enhancement brought procedural programming language functionality such as local variables and control-of-flow constructs (IF...ELSE; BEGIN...END; the CONTINUE and BREAK commands; looping, etc.),
additional functions for various needs, as well as the ability to create our own functions and procedures.
Hence we can regard T-SQL as a language.

The way of working is a bit different from Oracle,
but almost everything is the same.


Stored procedures can also return results directly, which is not the case in PL/SQL.
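
A small illustrative sketch of those procedural additions (local variables, IF...ELSE, a WHILE loop with BREAK); the logic itself is arbitrary:

declare @counter int = 1;          -- local variable
declare @total   int = 0;

while @counter <= 10               -- looping construct
begin
    if @counter % 2 = 0            -- control of flow: if ... else
        set @total = @total + @counter;
    else
        print concat('skipping odd value ', @counter);

    if @total > 20
        break;                     -- break command

    set @counter = @counter + 1;
end;

select @total as SumOfEvens;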

Logical Query Processing

  1. FROM
  2. ON
  3. JOIN
  4. WHERE
  5. GROUP BY
  6. WITH CUBE/ROLLUP
  7. HAVING
  8. SELECT
  9. DISTINCT
  10. ORDER BY
  11. TOP
  12. OFFSET/FETCH
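
To make the ordering concrete, here is a small query with each clause annotated with the step at which it is logically evaluated (the Workers table and its columns are hypothetical):

select top (5)                          -- 8 SELECT, 11 TOP
       BranchId,
       count(*) as StaffCount
from   Workers                          -- 1 FROM
where  BranchType = 'retail'            -- 4 WHERE
group by BranchId                       -- 5 GROUP BY
having count(*) > 10                    -- 7 HAVING
order by StaffCount desc;               -- 10 ORDER BY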


SQL anatomy

select <select list>        defines which columns to return
from <table source>         defines the tables to query
where <search condition>    filters the returned data using a predicate
group by <group by list>    arranges rows into groups
having <search condition>   filters groups by a predicate


SQL Anatomy

order by <order by list> sorts the result
select branchID, count(staff_id)
from workers
where branchType = 'retail'     -- placeholder predicate value
group by branchID               -- grouping is required with the count() aggregate


aggregate functions

count - returns the number of values in a column
sum - returns the sum total of the values in a column
avg - returns the mean (average) value of a column
min - returns the lowest value in a column
max - returns the highest value in a column

min and max can also operate on dates

Group by
the group by clause divides the rows returned from the select statement into groups

for each group you can apply an aggregate function


syntax

select col1, aggregate_function(col2)
from tbl_name
group by col1;

if group by is used without an aggregate function it acts like the DISTINCT clause


HAVING

we often use the having clause in conjunction with the group by clause to filter groups of rows that do not satisfy a specified condition
  
syntax

select
col1,
aggregate_function(col2)
from tbl_name
group by
col1
having
conditions
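
A concrete example of the pattern above, reusing the workers table from the earlier query: return only the branches that have more than 10 staff.

select branchID, count(staff_id) as staff_count
from workers
group by branchID
having count(staff_id) > 10;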


Joins

joins are used to retrieve data from multiple tables.
A JOIN is performed whenever two or more tables are joined in a SQL statement.

4 types of join
  1. Inner join(simple join)
  2. outer join
    1. left
    2. right
    3. full
  3. self join
  4. cross join

inner join
the most common type of join;
returns all rows from the joined tables where the join condition is met


select col
from tbl1 inner join tbl2
on tbl1.col = tbl2.col;





left outer join
returns all rows from the left-hand table specified in the ON condition and only those rows from the other table where the joined fields are equal


select col
from tbl1
left join tbl2
on tbl1.col=tbl2.col;




Right outer join
returns all rows from the right-hand table specified in the ON condition and only those rows from the other table where the joined fields are equal


select col
from tbl1
right join tbl2
on tbl1.col = tbl2.col;


full outer join
returns all rows from the left and right tables, with NULL in place where the join condition is not met

select col
from tbl1
full join tbl2
on tbl1.col = tbl2.col;


Self Join
A self join is a query in which a table is joined to itself.
Self joins are useful for comparing values in a column across rows within the same table.

select a.col, b.col
from A a
inner join A b on <join_predicate>;


cross join

the cross join clause allows you to produce the Cartesian product of the rows in 2 or more tables

select * from t1 cross join t2;

select  * from t1,t2;

select * from t1 inner join t2 on true;

Union
UNION concatenates the results of 2 queries into a single result set.
Whether the result includes duplicate rows can be controlled:

UNION ALL - includes duplicates
UNION - excludes duplicates

a UNION operation is different from a join:
UNION concatenates the result sets of 2 queries;
UNION doesn't create individual rows from columns gathered from 2 tables.

A join compares columns from 2 tables to create result rows composed from both tables.


eg
select firstname, lastname from table1
union
select firstname, lastname from table2


intersection/ except

INTERSECT returns distinct rows that are output by both the left and right input queries.


EXCEPT returns distinct rows from the left input query that aren't output by the right input query.
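
A short illustration of both operators, assuming two hypothetical customer tables:

-- customers present in BOTH tables
select CustomerId from OnlineCustomers
intersect
select CustomerId from StoreCustomers;

-- customers who bought online but never in a store
select CustomerId from OnlineCustomers
except
select CustomerId from StoreCustomers;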


Pattern matching
the SQL Server LIKE is a logical operator that determines whether a character string matches a specified pattern.
A pattern may include regular characters and wildcard characters. The LIKE operator is used in the WHERE clause of SELECT, UPDATE, and DELETE statements to filter rows based on pattern matching.

Pattern
the pattern is a sequence of characters to search for in the column or expression. It can include the following valid wildcard characters:
the percent wildcard (%): any string of zero or more characters
the underscore (_) wildcard: any single character
the [list of characters] wildcard: any single character within the specified set or range
the [^] wildcard: any single character not within the list or range

LIKE
column | expression LIKE pattern [ESCAPE escape_character]

eg
select first_name, last_name from customer
where first_name like 'jen%'


select cid, fn, ln from sales.customers
where ln like '_u%'
order by fn

select cid, fn, ln from tbl
where ln like '[a-c]%' order by fn


select [description] from Production.ProductDescription
where [description] like '[^a-t]%'

  
COALESCE/NULLIF

coalesce
it evaluates the arguments in order and returns the current value of the first expression that initially doesn't evaluate to NULL

example
select coalesce(NULL, NULL, 'SOMEVALUE', 'OTHERVALUE');

the above query returns 'SOMEVALUE'
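
NULLIF, the other function named in this heading, returns NULL when its two arguments are equal and the first argument otherwise; a brief sketch:

select nullif(10, 10);           -- returns NULL (arguments are equal)
select nullif(10, 20);           -- returns 10 (arguments differ)

-- common use: avoid divide-by-zero by turning a zero divisor into NULL
declare @divisor int = 0;
select 100.0 / nullif(@divisor, 0);   -- yields NULL instead of a divide-by-zero error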







Comments