Business Intelligence
Data, Business Intelligence and analytics
Organizations are drowning in data
- but lack information / knowledge
- our ability to collect and store data seems to have surpassed our ability to make sense of it
- Important trends:
- Storage capacity continues to rise rapidly
- The cost of storage continues to drop
What is Business Intelligence?
The applications and technologies for transforming business data into action.
Fact-based decision making typically involves a subset of the following skills or tools:
- data querying / SQL
- database design / data warehousing / data mining
- decision support systems / simulations
- data visualization / dashboards
Business intelligence (BI) is a technology-driven process for analyzing data and presenting actionable information to help executives, managers, and other corporate end users make informed business decisions.
The analytics give insight into the most successful sourcing channels and provide metrics to increase effectiveness, ultimately lowering the cost of the spend.
THE BI GOAL
BI is about using data to help enterprise users make better business decisions.
Right information + right time + right format -> better strategic, tactical, and operational decisions ===> a better company
Data, Information and Knowledge
Data is a collection of raw values, elements, or facts used for calculating, reasoning, or measuring.
Information is the result of collecting and organizing data in a way that establishes relationships between data items, thereby providing context and meaning.
Knowledge is the concept of understanding information based on recognized patterns, in a way that provides insight into the information.
Famous "Beer and Diapers" BI example
Consider the following convenience store data (register sales receipts).
Goal: convert data to knowledge
- Hard to do in practice
- Requires commitment / effort.
What is OLAP?
OLAP (Online Analytical Processing) is a process of performing multidimensional analysis at high speed on large volumes of data from a data warehouse, data mart, or some other unified, centralized data store.
An Online Analytical Processing server is based on the multidimensional data model. It allows managers and analysts to gain insight into the information through fast, consistent, and interactive access to it.
Online Transaction Processing, or OLTP, refers to data processing methods and software focused on transaction-oriented data and applications.
The main difference between OLAP and OLTP is in the name itself.
OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing) |
---|---|
OLAP is analytical in nature | OLTP is transactional in nature |
OLAP is designed for multidimensional analysis of data in a data warehouse, which contains both transactional and historical data. The OLAP server is typically the middle, analytical tier of a data warehousing solution | OLTP is designed to support transaction-oriented applications, processing transactions as quickly and accurately as possible |
Uses of OLAP include data mining and other business intelligence applications, complex analytical calculations, and predictive scenarios, as well as business reporting functions like financial analysis, budgeting, and forecast planning | Common uses of OLTP include ATMs, e-commerce software, credit card payment processing, online bookings, reservation systems, and record-keeping tools |
Data warehousing - Schemas
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates.
Types of Data Warehouse schema:
- Star schema
- Snowflake Schema
- Starflake Schema
A star schema is a data warehouse schema in which the center of the star has one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The star schema data model is the simplest type of data warehouse schema.
A snowflake schema is a data warehouse schema in which there is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake. A snowflake schema is an extension of the star schema and adds additional dimension tables by normalizing the dimensions.
A starflake schema is a combination of the star schema and the snowflake schema. It is a snowflake schema in which some of the dimension tables have been denormalized. It allows more efficient access to the data by various tools and the ability to adapt to changes in user requirements. It is also extensible, as it supports adding new dimensions.
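A minimal T-SQL sketch of what a star schema might look like; the table and column names (FactSales, DimProduct, DimDate) are hypothetical, used only for illustration:

-- Dimension tables hold descriptive attributes
CREATE TABLE DimProduct (
    ProductKey   INT IDENTITY PRIMARY KEY,
    ProductCode  VARCHAR(20),
    ProductName  VARCHAR(100),
    Category     VARCHAR(50)
);
CREATE TABLE DimDate (
    DateKey      INT PRIMARY KEY,   -- e.g. 20240131
    FullDate     DATE,
    [Month]      TINYINT,
    [Year]       SMALLINT
);
-- The fact table holds numeric measures plus foreign keys to the dimensions
CREATE TABLE FactSales (
    ProductKey   INT REFERENCES DimProduct(ProductKey),
    DateKey      INT REFERENCES DimDate(DateKey),
    Quantity     INT,
    SalesAmount  DECIMAL(12,2)
);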
What are Facts and Fact tables?
Fact tables contain the data corresponding to a particular business process.
Each row represents a single event associated with that process and contains the measurement data associated with the event.
Fact tables contain the numeric, quantitative data (aka measures).
Typically there is one fact table per distinct business process.
Exception: "consolidated" facts (aka "merged" facts), such as actual vs. forecast, which require the same granularity and are frequently analyzed together.
Dimensions
- Dimensions describe the objects involved in a business intelligence effort. While facts correspond to events, dimensions correspond to people, items, or other objects.
- Dimension tables provide the descriptive context: attributes covering the who, what, when, why, or how. They should always include friendly names and descriptions. A typical query against such a schema is sketched below.
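For example, a minimal query sketch against the hypothetical star schema above (FactSales, DimProduct, DimDate are illustrative names):

select d.[Year], p.Category, sum(f.SalesAmount) as TotalSales
from FactSales f
inner join DimProduct p on f.ProductKey = p.ProductKey
inner join DimDate d on f.DateKey = d.DateKey
group by d.[Year], p.Category;
-- measures (SalesAmount) come from the fact table; context (Year, Category) comes from the dimensions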
ETL
ETL is a process used for extracting data from a source (database, XML file, text files, etc.), transforming the data, and then loading the data into the data warehouse.
ETL stands for Extract, Transform, Load. ETL extracts the data from different sources (an Oracle database, an XML file, a text file, etc.), then transforms the data (by applying aggregate functions, keys, joins, etc.).
Finally the data is loaded into the data warehouse for analytics.
ETL has three main processes:
- Extraction
- Transform
- Load
Extraction
Extraction generally starts with extracting data from various heterogeneous sources; these might be flat files (CSV, TXT, etc.), Excel spreadsheets, API-based systems, or various other databases, possibly belonging to an operational system (OLTP).
Flat files or spreadsheets can be made available through FTP or an S3 bucket, along with their control total file.
A control total file contains the header information of the data, with row counts and other basic information that can be used to check the consistency of the data after extracting it from the source to the destination.
There are different strategies for data extraction:
- Full extraction (refresh)
- Partial extraction (append)
If the extraction is full, the last loaded data is first taken as a backup, the new set of data is loaded into staging, and the delta is calculated between the new data load and the existing data.
If the extraction is partial, the new data is added to the existing data, duplicates are filtered out, and existing records are updated if they arrive with new values.
After data is extracted from the source and loaded onto the staging server, data load verification is done using the control total file that comes along with the data files, as sketched below.
The last step of extraction is data profiling.
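A minimal T-SQL sketch of that verification step, assuming the control totals have been loaded into a hypothetical ControlTotals table and the extracted rows into a hypothetical StagingSales table:

-- Compare the row count loaded into staging with the expected count from the control total file
DECLARE @expected INT, @actual INT;
SELECT @expected = ExpectedRowCount FROM ControlTotals WHERE FileName = 'sales_20240131.csv';
SELECT @actual = COUNT(*) FROM StagingSales;
IF @expected <> @actual
    RAISERROR('Row count mismatch: expected %d, loaded %d', 16, 1, @expected, @actual);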
Transformation
The transformation step consists of data integration, data cleansing, mapping, and data transformation.
Data extracted from source systems is raw and cannot be used as-is; it must be standardized as needed. Data transformations may be done per the business requirements, definitions may be added, and data integration is performed.
Transformation does not mean adding or deleting data; it is transformation from one form to another without losing its meaning.
Some transformation tasks:
- Data integration: in this step data from different sources are integrated
- Selection: the process of selecting only those columns which are actually needed or required by the output system
- Matching: in this step we match the raw data against the lookup or master data it needs to be transformed to
- Data cleansing: this process standardizes various data across the company based on specific rules and adjusts it so it can be used easily by new services and applications. It must especially be done when data is migrated from a mainframe and the system is changed from a legacy one to a new one
Major transformation types (see the sketch after this list):
- Data standardizing: as data are fetched from different sources they may not be in the same format, so data standardizing is done
- Calculated and derived data: in this step different new columns may be derived from source columns as per the business requirements
- Splitting and merging: as per the business requirements, we may have to split one column into different columns or merge several columns into a single column
- Conversion of units of measurement: in this step source data are converted to a uniform unit of measurement
- De-duplication: in this step duplicate data received from the source system are removed
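A minimal T-SQL sketch of a few of these transformations (unit conversion, a derived column, and de-duplication) against a hypothetical StagingSales table; all column names are illustrative:

-- Standardize units, derive a total, and keep only one row per order line
;WITH cleaned AS (
    SELECT
        OrderId,
        ProductCode,
        Quantity,
        WeightInGrams / 1000.0 AS WeightKg,          -- conversion of unit of measurement
        Quantity * UnitPrice   AS LineTotal,         -- calculated / derived data
        ROW_NUMBER() OVER (PARTITION BY OrderId, ProductCode
                           ORDER BY LoadTimestamp DESC) AS rn
    FROM StagingSales
)
SELECT OrderId, ProductCode, Quantity, WeightKg, LineTotal
FROM cleaned
WHERE rn = 1;   -- de-duplication: keep only the latest row per key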
Loading
Data loading is the process of loading into the warehouse or database the data that has been prepared by applying the transformation logic.
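A minimal sketch of the load step, assuming the transformed rows sit in a hypothetical staging.CleanSales table and the target is a warehouse fact table loosely following the star-schema sketch above:

-- Load the transformed rows from staging into the warehouse fact table
INSERT INTO dw.FactSales (ProductKey, DateKey, Quantity, SalesAmount)
SELECT p.ProductKey, d.DateKey, s.Quantity, s.LineTotal
FROM staging.CleanSales AS s
JOIN dw.DimProduct AS p ON p.ProductCode = s.ProductCode
JOIN dw.DimDate    AS d ON d.FullDate    = s.OrderDate;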
Need for ETL
- ETL extracts, transforms, and loads raw data into user-ready data
- ETL helps firms examine their business data to make critical business decisions
- It provides a technique for transferring data from multiple sources into a data warehouse
- Transactional databases cannot answer complicated business questions, but ETL makes it possible to answer them
- ETL is a preset process for accessing and refining data sources into useful data
- When a data source changes, the data warehouse is updated
- Properly designed and validated ETL software is essential for successful data warehouse management
- It helps improve productivity because it is simplified and can be reused without the need for deep technical skills
ETL tools
- Informatica PowerCenter
- Oracle Data Integrator (ODI)
- Pentaho Data Integration (PDI)
- SQL Server Integration Services (SSIS)
- Cognos Data Manager
- QlikView Expressor
- SAP Data Services
- Talend Open Studio & Integration Suite
- IBM Information Server (DataStage)
ETL pipelining tools
- AWS Glue
- Microsoft Flow
- Apache Airflow
Database concept
ACID
Atomicity
- The database follows the all-or-nothing rule, i.e. it considers all of a transaction's operations as one whole unit, or atom
- When the database processes a transaction, it is either fully completed or not executed at all
Consistency
- Ensures that only valid data, following all rules and constraints, is written to the database
- When a transaction results in invalid data, the database reverts to its previous state, which abides by all customary rules and constraints
Isolation
- Ensures that transactions are securely and independently processed at the same time without interference, but it doesn't ensure the order of transactions
- e.g. user A withdraws $100 and user B withdraws $250 from user Z's account, which has a balance of $1000
- Since both A and B withdraw from Z's account, one of the users is required to wait until the other transaction is completed
- If B is required to wait, then B must wait until A's transaction is completed and Z's account balance changes to $900. Now B can withdraw $250 from this $900 balance
Durability
- User B may withdraw $250 only after A's transaction is completed and updated in the database. If the system fails before A's transaction is logged in the database, A cannot withdraw any money and Z's account returns to its previous consistent state
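A minimal T-SQL sketch of how atomicity and durability show up in practice: the two updates below either commit together or roll back together (the Accounts table and the amounts are hypothetical):

BEGIN TRANSACTION;
BEGIN TRY
    -- withdraw $100 from Z and credit A's account as one atomic unit
    UPDATE Accounts SET Balance = Balance - 100 WHERE AccountId = 'Z';
    UPDATE Accounts SET Balance = Balance + 100 WHERE AccountId = 'A';
    COMMIT TRANSACTION;      -- durability: once committed, the change survives failures
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;    -- atomicity: on any error, neither update is applied
END CATCH;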
Important Terminology and Concepts
- Null
- Entity Integrity
- Relational Integrity
- Referential Integrity
- General Constraints
Redundancy
Dependencies between attributes within a relation cause redundancy
e.g. all addresses in the same town have the same zip code:
ssn | name | town | zip |
---|---|---|---|
1234 | Joe | Huntingdon | 1665 |
2345 | Mary | Huntingdon | 1665 |
3456 | Tom | Huntigdon | 1665 |
5948 | Harry | Alexandra | 16603 |
Consistency and integrity are harder to maintain even in this simple example:
ensuring that a zip code always refers to the same city, and that the city is spelled consistently.
Note that we don't have a zip-code-to-city fact stored unless there is a person from that zip code.
Anomalies
An anomaly is an inconsistent, incomplete, or contradictory state of the database.
- Insertion anomaly - a user is unable to insert a new record when it should be possible to do so, because not all the information is available
- Deletion anomaly - when a record is deleted, other information that is tied to it is also deleted
- Update anomaly - a record is updated, but other appearances of the same items are not updated
Studentid | Activity | Cost |
---|---|---|
1 | swimming | $23 |
2 | swimming | $23 |
3 | dancing | $23 |
4 | Fencing | $20 |
studentid | Activity |
---|---|
21 | dancing |
21 | swimming |
34 | dancing |
55 | fencing |
Activity | Cost |
---|---|
Dancing | $23 |
Swimming | $50 |
Fencing | $20 |
Splitting the relations
Keys in RDBMS
- Super key
- Candidate key
- composite key
- Primary key
- Foreign key
- Unique key
Functional Dependency
A functional dependency is a relationship that exists when one attribute uniquely determines another attribute.
If R is a relation with attributes X and Y, a functional dependency between the attributes is represented as
X -> Y
which specifies that Y is functionally dependent on X. Here X is the determinant set and Y is the dependent attribute.
Each value of X is associated with precisely one Y value.
A functional dependency in a database serves as a constraint between two sets of attributes.
Defining functional dependencies is an important part of relational database design and contributes to normalization.
For a table with the following fields:
empnum
empemail
empfname
emplname
if empnum is the key, then the functional dependencies
empnum -> empemail
empnum -> empfname
empnum -> emplname
must exist.
FUNCTIONAL DEPENDENCY
empnum -> empemail
The attribute on the LHS is known as the determinant:
empnum is the determinant of empemail
Transitive dependency
Consider attributes A, B, and C where
A -> B and B -> C.
Functional dependencies are transitive, which means that we also have the functional dependency A -> C.
We say that C is transitively dependent on A through B.
A partial dependency exists when an attribute B is functionally dependent on an attribute A, and A is a component of a multipart candidate key.
Normalization
- Normalization is the process that improves a database design by generating relations that are in higher normal forms.
The objective of normalization is to create relations where every dependency is on the key, the whole key, and nothing but the key.
First Normal Form
basic rule
- It contains only atomic values
- There are no repeating groups
A database table is said to be in 1NF if it contains no repeating fields / columns.
The process of converting a UNF (unnormalized) table into 1NF is as follows:
- Separate the repeating fields into a new database table, along with the key from the unnormalized table
- The primary key of the new database table may be a composite key
Empnum | Empphone | Empdegree |
---|---|---|
123 | 233-9876 | |
333 | 233-1231 | BA, BSc, PhD |
679 | 244-1231 | BSc, MSc |
We can normalize the above table by splitting it into 2 tables:
employee
Empnum | Empphone |
---|---|
123 | 233-9876 |
333 | 233-1231 |
679 | 244-1231 |
employeeDegree
Empnum | Empdegree |
---|---|
333 | BA |
333 | BSc |
333 | PhD |
679 | BSc |
679 | MSc |
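A minimal T-SQL sketch of the resulting 1NF tables; the data types are assumed for illustration:

CREATE TABLE employee (
    empnum    INT PRIMARY KEY,
    empphone  VARCHAR(20)
);
CREATE TABLE employeeDegree (
    empnum    INT REFERENCES employee(empnum),
    empdegree VARCHAR(10),
    PRIMARY KEY (empnum, empdegree)   -- composite key, as noted above
);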
Second normal form
basic rule
- It is in first normal form
- All non-key attributes are fully functionally dependent on the primary key
Process
- Remove the partial dependencies
- A partial dependency is a type of functional dependency where a field is functionally dependent on only part of the primary key
- If field B depends on field A and vice versa (for a given value of B there is only one possible value of A, and vice versa), then we put field B into a new database table, where B will be the primary key and also marked as a foreign key in the parent table
Third Normal Form
basic rule
- it is in second normal form
- there is no transitive functional dependency
A database table is said to be in 3NF if it is in 2NF and all non-key fields depend on the primary key; equivalently, a table is in 3NF if it is in 2NF and no field of the table is transitively functionally dependent on the primary key.
Process
- Remove the transitive dependencies
- Make a separate table for the transitively dependent fields
T-SQL
Transact-SQL (T-SQL) is Microsoft's and Sybase's proprietary extension to conventional SQL.
This enhancement brought procedural programming language functionality such as local variables and control-of-flow constructs (IF ... ELSE; BEGIN ... END; the CONTINUE and BREAK commands; looping; etc.),
plus additional functions for various needs, as well as the ability to create our own functions and procedures.
Hence we can regard T-SQL as a language.
Its way of working is a bit different from Oracle's,
but almost everything is the same.
Stored procedures can also return results directly, which is not the case in PL/SQL.
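A minimal sketch of those procedural features (local variables, a WHILE loop, and IF ... ELSE); the logic is purely illustrative:

-- local variable declaration and assignment
DECLARE @counter INT = 1, @total INT = 0;

-- control-of-flow: WHILE loop with a BEGIN ... END block
WHILE @counter <= 5
BEGIN
    SET @total = @total + @counter;
    SET @counter = @counter + 1;
END;

-- control-of-flow: IF ... ELSE
IF @total > 10
    PRINT 'total is ' + CAST(@total AS VARCHAR(10));
ELSE
    PRINT 'total is small';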
Logical Query Processing
- FROM
- ON
- JOIN
- WHERE
- GROUP BY
- WITH CUBE/ROLLUP
- HAVING
- SELECT
- DISTINCT
- ORDER BY
- TOP
- OFFSET/FETCH
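A minimal query annotated with that logical processing order; the orders table is hypothetical:

select top (5) customer_id, sum(amount) as total   -- 5: SELECT, 7: TOP
from orders                                         -- 1: FROM
where order_date >= '2024-01-01'                    -- 2: WHERE
group by customer_id                                -- 3: GROUP BY
having sum(amount) > 1000                           -- 4: HAVING
order by total desc;                                -- 6: ORDER BY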
SQL anatomy
select <select list> defines which columns to return
from <table source> defines the tables to query
where <search condition> filters the returned rows using a predicate
group by <group by list> arranges rows into groups
having <search condition> filters groups using a predicate
order by <order by list> sorts the results
select branchID, count(staff_id)
from workers
where branchType = 'main'   -- hypothetical value; the original example was truncated here
group by branchID;
Aggregate functions
count - returns the number of values in a column
sum - returns the sum of the values in a column
avg - returns the mean (average) of the values in a column
min - returns the lowest value in a column
max - returns the highest value in a column
min and max can also be applied to dates
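A minimal example combining these aggregates; the orders table and its columns are hypothetical:

select count(order_id)  as order_count,
       sum(amount)      as total_amount,
       avg(amount)      as average_amount,
       min(order_date)  as first_order,   -- min on a date
       max(order_date)  as latest_order   -- max on a date
from orders;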
Group by
The group by clause divides the rows returned from the select statement into groups.
For each group you can apply an aggregate function.
syntax
select col1, aggregate_function(col2)
from tbl_name
group by col1;
If group by is used without an aggregate function, it acts like a DISTINCT clause, as shown below.
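For instance, assuming a hypothetical customers table, the two queries below return the same set of cities:

-- group by without an aggregate behaves like distinct
select city from customers group by city;
select distinct city from customers;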
HAVING
We often use the having clause in conjunction with the group by clause to filter grouped rows that do not satisfy a specified condition.
syntax
select
col1,
aggregate_function(col2)
from tbl_name
group by
col1
having
condition
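A concrete example, reusing the hypothetical orders table from above: keep only customers whose orders total more than 1000.

select customer_id, sum(amount) as total_amount
from orders
group by customer_id
having sum(amount) > 1000;   -- filters groups, not individual rows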
Joins
Joins are used to retrieve data from multiple tables.
A JOIN is performed whenever two or more tables are joined in a SQL statement.
4 types of joins:
- Inner join(simple join)
- outer join
- left
- right
- full
- self join
- cross join
inner join
the most common type of join
returns all rows from multiple tables where the join condition is met
select col
from tbl1 inner join tbl2
on tbl1.col = tbl2.col;
left outer join
returns all rows from the left-hand table specified in the ON condition, and only those rows from the other table where the joined fields are equal
select col
from tbl1
left join tbl2
on tbl1.col=tbl2.col;
Right outer join
returns all rows from the right-hand table specified in the ON condition, and only those rows from the other table where the joined fields are equal
select col
from tbl1
right join tbl2
on tbl1.col = tbl2.col;
full outer join
returns all rows from the left and right tables, with NULL in place where the join condition is not met
select col
from tbl1
full join tbl2
on tbl1.col = tbl2.col;
Self Join
is a query in which a table is joined to itself.
Self joins are useful for comparing values in a column across rows within the same table.
select col
from A a
inner join A b on join_predicate;
cross join
the cross join clause allows you to produce the Cartesian product of rows from 2 or more tables
select * from t1 cross join t2;
select * from t1,t2;
select * from t1 inner join t2 on true;
Union
Union concatenates the results of 2 queries into a single result set.
Whether the result includes duplicate rows can be controlled:
UNION ALL - includes duplicates
UNION - excludes duplicates
A union operation is different from a join:
union concatenates the result sets of 2 queries
union doesn't create individual rows from columns gathered from 2 tables
a join compares columns from 2 tables to create result rows composed from both tables
eg
select firstname, lastname from table1
union
select firstname, lastname from table2
Intersect / Except
Intersect returns the distinct rows that are output by both the left and right input queries.
Except returns the distinct rows from the left input query that aren't output by the right input query.
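A small illustration using two hypothetical tables of customer names:

-- rows appearing in both tables
select firstname, lastname from customers_2023
intersect
select firstname, lastname from customers_2024;

-- rows only in the first table
select firstname, lastname from customers_2023
except
select firstname, lastname from customers_2024;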
Pattern matching
The SQL Server LIKE operator is a logical operator that determines whether a character string matches a specified pattern.
A pattern may include regular characters and wildcard characters. The LIKE operator is used in the where clause of select, update, and delete statements to filter rows based on pattern matching.
Pattern
The pattern is a sequence of characters to search for in the column or expression. It can include the following valid wildcard characters:
the percent wildcard (%): any string of zero or more characters
the underscore (_) wildcard: any single character
the [list of characters] wildcard: any single character within the specified set or range
the [^] wildcard: any single character not within the list or range
LIKE
column | expression LIKE pattern [ESCAPE escape_character]
eg
select first_name, last_name from customer
where first_name like 'jen%'
select cid, fn, ln from sales.customers
where ln like '_u%'
order by fn
select cid, fn, ln from tbl
where ln like '[a-c]%' order by fn
select [description] from Production.ProductDescription
where [description] like '[^a-t]%'
COALESCE/NULLIF
coalesce
It evaluates the arguments in order and returns the current value of the first expression that does not evaluate to NULL.
example
select coalesce(NULL, NULL, 'SOMEVALUE', 'OTHERVALUE');
the above query returns 'SOMEVALUE'
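NULLIF, the other function named in this section's heading, returns NULL if its two arguments are equal and the first argument otherwise; a small illustrative sketch (the sales table is hypothetical):

select nullif(10, 10);   -- returns NULL because the arguments are equal
select nullif(10, 20);   -- returns 10
-- a common guard against divide-by-zero
select total / nullif(quantity, 0) from sales;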