Thursday, March 31, 2016

Presentation and Visualization Methods

Data visualization

Data visualization is the practice of consolidating data into a single illustrative graphic. Traditionally, it has been used for quantitative work (infographics are a popular example), but ways of representing qualitative work have proven equally powerful.

Below is a beautiful explanation of what data visualization is.


Visualization methods are very important to users because they provide mental models of the information. Visualization techniques make large and complex information intelligible. Information visualization is a visual user interface that gives the user insight into the information. The basic purpose of visualization is to create interactive visual representations that exploit human perceptual and cognitive capabilities for problem solving. The goal is for the user to easily understand and interpret large and complex sets of information.
With so many visualization techniques available, it can be confusing to know which technique is appropriate, and when, in order to convey the maximum possible understanding. The basic purpose of a visual representation is to make insights as easy to interpret as possible. Different visualization techniques suit different situations and convey different levels of understanding. This document is a guide for young researchers who want to start working in visualization.

Below is a beautiful explanation of the value of data visualization.



Types Of Data Visualization

Basic Charts

The most recognizable and utilized form of data visualization is the basic chart. Line, bar, area and pie charts represent the most common types of this form.

Status Indicators

In addition to basic charts that visualize a set or sets of data, status indicators are a commonly used visualization for showing the business condition of a particular measure or unit of data. These indicators can take many forms, including gauges, traffic lights or symbols. Status indicators become even more effective when they incorporate contextual metrics, such as targets and thresholds, because they can provide quick feedback as to whether a specific measure is good or bad, high or low, below or above target.
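As a rough illustration of how a target and a threshold turn a raw measure into a status indicator, here is a minimal Python sketch. The function name `traffic_light`, the `warning_threshold` parameter and the three colour names are assumptions made for this example; they do not come from any particular dashboard product, and the sketch assumes a measure where higher values are better.

```python
def traffic_light(value, target, warning_threshold):
    """Classify a measure where higher values are better (hypothetical helper).

    green  : value has reached the target
    yellow : value is below target but at or above the warning threshold
    red    : value is below the warning threshold
    """
    if warning_threshold > target:
        raise ValueError("warning_threshold must not exceed target")
    if value >= target:
        return "green"
    if value >= warning_threshold:
        return "yellow"
    return "red"


if __name__ == "__main__":
    # Illustrative numbers only: monthly sales against a target of 100 units.
    for sales in (105, 95, 80):
        print(sales, traffic_light(sales, target=100, warning_threshold=90))
```

A gauge or symbol-based indicator would use the same comparison and only change how the resulting status is drawn.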

Advanced Data Visualizations

More advanced examples of data visualization include scatter plots, bubble charts, sparklines, geographical maps, tree maps, Pareto charts, and many others. These more sophisticated visualizations are designed to display data in ways tailored to a specific function or industry.


The three business domains I am considering to demonstrate the use of data visualization are:


Healthcare Industry

Healthcare data tends to reside in multiple places. It comes from all over the organization: from different source systems, like EMRs or HR software, and from different departments, like radiology or pharmacy. Aggregating this data into a single, central system, such as an enterprise data warehouse (EDW), makes it accessible and actionable. Healthcare data also occurs in different formats (e.g., text, numeric, paper, digital, pictures, videos, multimedia). Radiology uses images, old medical records exist in paper format, and today's EMRs can hold hundreds of rows of textual and numerical data. Sometimes the same data exists in different systems and in different formats, as is the case with claims data versus clinical data.
Anyone who has been a patient in a hospital will likely agree that the experience has room for improvement. Instrumentation and the best possible use of information can make a real difference in improving patient care.


Recommendation:

Using a dashboard that combines a wide assortment of charts, gauges and graphs, healthcare administrators can make informed short-term tactical decisions while gaining insight into how their decisions will affect various outcomes, staff groups, and finances.

Recommended Visualizations:

Pie Charts: Top insurance payers can be quickly analyzed using pie charts.
Bar Charts: Department utilization, including the individual utilization levels of doctors and nurses, can be compared using bar charts.
Gauges: Patient wait time and lag by date and hour can be analyzed using gauges.


Electronic Commerce (E-commerce) Industry

Data visualization is equally important in e-commerce. An online retail store generally collects data about its customers and where they are coming from, i.e. platforms and websites. It is a good idea for a business to analyze its current customer base and its current competitors, and to compare its own performance with that of others. All of this is made possible through data visualization tools. Data visualization and analytics tools can help online business owners make better business decisions and develop strategies to succeed and survive in the industry.


Recommendation:

Dashboards and e-commerce analytics give each division visibility into the data that is significant to it. Distributors can use these tools to improve decision-making because they paint a big picture of the data. Seeing this information as a geo-location map, a bar graph or a line graph makes it easier to read and more meaningful.

Recommended Visualizations:

Line Chart: To display trends over a period of time and to provide an easy way to compare online retailers in a particular year.
Bar Chart: To depict the top-performing stores by region within a state.
Waterfall Chart: Helps in understanding the cumulative effect of sequentially introduced positive or negative values
Geo-location Map: For a visually appealing overview of sales by region.
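To make the waterfall chart's "cumulative effect of sequentially introduced positive or negative values" concrete, the following Python sketch computes the running total that each bar of such a chart would start and end at. The function name `waterfall_steps` and the sample figures are hypothetical and used only for illustration; the sketch covers the arithmetic behind the chart, not the drawing itself.

```python
from itertools import accumulate


def waterfall_steps(opening, changes):
    """Return (label, start, end) for each bar of a waterfall chart.

    `opening` is the starting value and `changes` is a list of
    (label, delta) pairs, where delta may be positive or negative.
    """
    deltas = [delta for _, delta in changes]
    # Running totals, beginning with the opening value (Python 3.8+).
    running = list(accumulate(deltas, initial=opening))
    return [
        (label, start, end)
        for (label, _), start, end in zip(changes, running, running[1:])
    ]


if __name__ == "__main__":
    # Illustrative numbers only.
    steps = waterfall_steps(
        1000, [("New orders", 400), ("Returns", -150), ("Discounts", -50)]
    )
    for label, start, end in steps:
        print(f"{label}: {start} -> {end}")
```

Each bar floats between its start and end value, so the last end value is the net result of all the changes.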


Finance Industry

For the financial industry, I am mostly concentrating on visualization in banking. The finance industry deals with large amounts of data every day, and the processes involved are very complex. The financial services industry includes a wide variety of businesses such as credit bureaus, credit card companies, brokerage firms, and mortgage providers. Each has its own way of presenting information to its different users.

Recommendation:

A dashboard that displays financial and sales metrics such as Margin by Month, Sales Distribution, Monthly Support Expenses and Monthly Revenue.
Column charts, just like bar graphs, help dashboard readers visualize categorical data and compare it side by side. Column and line charts keep their individual purposes even when they are combined: columns are best used to represent categorical data, while lines display how the data changes over time (the trend).


Recommended Visualizations:

Gauges: To visually depict the range of expenses
Maps, Area charts: To visually depict the sales distribution across locations
Line charts: To analyze the Margin, Revenue and Expenses.


Conclusion

It is much easier to understand data when it is presented in a visual format rather than in a table with columns and rows. But it is all the more important to choose the right visualization for a given analysis. To ensure this, always understand your data first. Then ask several questions: What do the variables depict? What do the business users want to analyze? Why is this analysis required? How will this analysis help them make informed decisions? What are the different ways this data can be visualized? What is the best way to present this data? I am sure that once you have answers to all of the above questions, you can work wonders in creating visualizations.



References:
http://bridgeable.com/the-importance-of-data-visualization/
http://www.ijcaonline.org/archives/volume34/number1/4061-5722
http://www.dashboardinsight.com/Article.aspx?id=4148
https://www.healthcatalyst.com/a-new-way-to-look-at-healthcare-data-models
http://ibmresearchnews.blogspot.com/2013/08/cultivating-healthier-hospitals.html
http://www.cmswire.com/cms/customer-experience/birst-to-aggressively-market-cloudbased-business-intelligence-offering-with-us-38m-funding-022188.php
http://www.dashboardinsight.com/dashboards/strategic/dundas-data-visualization-sonatica-dashboard.aspx

Thursday, March 3, 2016

Big Unstructured Data v/s Structured Relational Data

Structured Data


Structured data refers to data stored in an organized manner. It has a uniform format and is stored primarily in tabular form, using relational databases or Excel spreadsheets. This uniformity makes the data simple to understand and query. SQL (Structured Query Language) is a language for programming and analyzing databases. It helps in structuring data as well as filtering it according to user requirements. It offers a variety of operations on structured databases, namely search, insert, delete, modify and create.
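The create, insert, modify, delete and search operations mentioned above can be sketched with Python's built-in `sqlite3` module. The `customers` table, its columns and the sample rows are made up for this illustration; any relational database would accept equivalent SQL statements.

```python
import sqlite3


def demo_sql_operations():
    """Run the five basic SQL operations on an in-memory database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Create: define a table with a fixed, uniform structure.
    cur.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, zip_code TEXT, age INTEGER)"
    )

    # Insert: add rows that follow that structure (hypothetical data).
    cur.executemany(
        "INSERT INTO customers (name, zip_code, age) VALUES (?, ?, ?)",
        [("Asha", "85281", 34), ("Ben", "85004", 51), ("Chen", "85281", 27)],
    )

    # Modify: update an existing row.
    cur.execute("UPDATE customers SET age = ? WHERE name = ?", (35, "Asha"))

    # Delete: remove a row.
    cur.execute("DELETE FROM customers WHERE name = ?", ("Ben",))

    # Search: filter rows according to a user requirement.
    rows = cur.execute(
        "SELECT name, age FROM customers WHERE zip_code = ? ORDER BY name",
        ("85281",),
    ).fetchall()

    conn.close()
    return rows


if __name__ == "__main__":
    print(demo_sql_operations())
```

Because every row shares the same columns, the filter in the final query can be expressed in a single declarative statement, which is the simplicity the paragraph above refers to.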

Some examples of structured data:

Machine Generated

  • Sensory Data - GPS data, manufacturing sensors, medical devices
  • Point-of-Sale Data - Credit card information, location of sale, product information
  • Call Detail Records - Time of call, caller and recipient information
  • Web Server Logs - Page requests, other server activity
Human Generated

  • Input Data - Any data inputted into a computer: age, zip code, gender, etc.

Unstructured Data





Unstructured data refers to data that is not organized in a predefined manner. 90% of the data present today is found in unstructured form. Typical examples are webpages, emails, blogs and documents, which contain a wealth of data in scattered form. Unstructured data does not fit into relational databases and hence requires various data mining techniques to convert it into useful datasets.
Social media plays a heavy role in unstructured data. In addition to social media, there are many other common forms of unstructured data:

  • Word Docs, PDFs and Other Text Files - Books, letters, other written documents, audio and video transcripts
  • Audio Files - Customer service recordings, voicemails, 911 phone calls
  • Presentations - PowerPoints, SlideShares
  • Videos - Police dash cam, personal video, YouTube uploads
  • Images - Pictures, illustrations, memes
  • Messaging - Instant messages, text messages



Below is a beautiful explanation of the difference between structured and unstructured data.



Volume of Data


The incessant growth in the amount of data generated in today's world gave rise to the term "Big Data". Big Data refers to the large amount of unstructured data that cannot be processed using conventional database systems. It is said to be too fast, too volatile and too difficult to hold.
As mentioned already, 90% of data sources are unstructured. In today's digital world, organizations generate a large amount of unstructured data while carrying out their business operations. This unstructured data, stored in emails, documents, webpages and external sources, poses a challenge for organizations. The challenge lies in interpreting these unstructured sources to produce meaningful insights. These insights are a major source of assistance in carrying out strategic as well as decision-making activities. The graph below projects the increase in the amount of unstructured data around the world, while structured data sources have been consistent since the early 20th century.




To tackle Big Data, organizations have increased the resources allocated to processing and cleaning it and to drawing out insights. A few of the technologies used today to convert unstructured data into structured data are Hadoop, MongoDB, RapidMiner and other Business Intelligence technologies.



Type of Data


Spatial Data - Spatial data has several dimensions and provides information about the location or position of a given entity. Examples include the coordinates of a place or images taken from various locations.

Historical Data - Data which goes through the ETL (Extract, Transform and Load) process in order to build data warehouses. These data warehouses store static, transformed data. This data is primarily used for analysis purposes.

Redundant Data - Duplicate data which resides in the same data sources. This type of data leads to data inconsistency and integrity conflicts.

Operational Data - Real-time data generated in transactional systems. This type of data provides the lowest level of granularity but is difficult to work with before it is transferred to analytical systems.

Created Data - Data that is purposely created by businesses, primarily for market research. This data comes from focus groups, customer surveys, etc.

Transactional Data - Data generated after a transaction process is completed.

Metadata - Metadata is data that describes the data and schema objects, and is used by applications to fetch and compute the data correctly.





Data Warehouse's role in analyzing unstructured data

  • Introduces simplicity and performance to statistical analysis
  • Aggregates data from multiple sources
  • Offers features to build predictive models which help in analyzing trends and patterns
  • Adaptable to changing data demands and analytic requirement escalations
  • Separation of data as per user's view by creating data marts

Despite the substantial advantages of data warehouses, there are many challenges and limitations in implementing them.



Limitations of Data Warehouses


  • Data stored in data warehouses is static, so they cannot analyze real-time data sources
  • Data security becomes a concern, because data aggregated from multiple sources risks becoming available to unauthorized user groups
  • Integrating data from disparate sources is complex. The different formats and architectures of the data sources add to the complexity
  • The cost overhead of implementing a data warehouse is huge. It requires the effort of numerous IT and business professionals to achieve the target objective of building a data warehouse. Data warehouses are known to have disappointing ROIs
  • No Drill-Down Capability - Data warehouses are incapable of reaching the lowest level of granularity required by some users, leading to an unclear picture of the data when making business decisions
  • Data Ownership Concerns - Data warehouses are often, but not always, Software as a Service implementations or cloud services applications. Your data security in this environment is only as good as your cloud vendor. Even if implemented locally, there are concerns about data access throughout the company.
  • Adding new data sources requires a long initial implementation time, and the associated costs are high
  • Limited flexibility of use and types of users - requires multiple separate data marts for multiple uses and types of users


Future of Data Warehouses in the age of Big Unstructured Data


Technologies like Hadoop, Informatica PowerCenter and Teradata are emerging as the biggest players in data warehouse technology. The future is bright. Data warehouses have never been more valuable than they are today. Decision makers are banking on data to make crucial decisions amid the incessant growth in data generated through social media, free text, sensors and meters.

Web 2.0 significantly grew business-related data generated from e-commerce, web logs, search marketing, and other sources. These sources remained business-generated and business-owned. Enterprises expanded ETL operations to compensate for the new data sources.

Due to the continuous growth of unstructured data, organizations are looking to integrate with robust platforms like Hadoop to handle unstructured data and perform analysis on it. Hadoop excels at processing big data sets with its MapReduce model and distributed file system (HDFS). These features make Hadoop a great addition to "standard" data warehouses.
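The MapReduce model mentioned above can be illustrated with a word count over a few text snippets. This is a plain-Python sketch that runs in a single process; it does not use Hadoop or any Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are chosen only for this example. In a real cluster the map and reduce steps would run in parallel on many machines, with the input stored in HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over unstructured text."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big unstructured data"]))
```

The output of the reduce step is a small structured result (word, count) derived from free text, which could then be loaded into a warehouse table.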

Data warehouses update their data on a periodic basis. In the future, however, the emphasis will be more on real-time analysis of real-time data. Such data warehouses will enable real-time decisions in dynamic environments.
More and more firms will be moving to faster cloud-based databases. Cloud computing is the trend in the market right now. Although it is well known for operational applications at present, the cloud is not yet used extensively for data warehouse platforms. Clouds can provide dynamic allocation, which helps when the data volume of a particular warehouse varies unpredictably. Also, in the cloud, applications can scale up significantly based on requirements.


Cloud-based data warehousing solutions will become a basic requirement in the future.
According to Jon Bock, Vice President of Snowflake, "Cloud-based solutions will be critical to helping organizations expand access to data and analytics as well as increase their agility with their data. Taking advantage of the flexibility and cost model of the cloud, these solutions will offer performance on demand and native understanding of diverse data to support a wide range of analytics, without the management overhead and cost of traditional on-premises offerings."


References:

http://www.ibm.com/analytics/us/en/technology/data-warehousing/
http://www.whamtech.com/adv_disadv_dw.htm
http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy
http://datastorageasean.com/daily-news/sas-contextual-analysis-accelerates-putting-structure-unstructured-data
https://www.betterbuys.com/bi/future-of-data-warehousing/
http://smallbusiness.chron.com/disadvantages-data-warehouse-73584.html

Thursday, February 18, 2016

Dimensional Modelling for Insurance Industry


The insurance industry is an important and growing sector of the business intelligence market. Insurance companies generate many complicated transactions that must be analyzed in many different ways. Insurance companies like GEICO help customers manage their risks by charging a fixed premium and paying out a sum of money in case of unexpected events such as a car accident or medical emergency.

Insurance companies base their business models on assuming and diversifying risk. Most insurance companies earn revenue in two ways: charging premiums in exchange for insurance coverage, and subsequently investing those premiums in the markets to earn returns.


Within the insurance industry, I have chosen GEICO and will apply dimensional modelling concepts to the performance metrics used for evaluating an insurance company. GEICO, which stands for Government Employees Insurance Company, is an automobile insurance company based in Chevy Chase, Maryland. It is the second largest auto insurer in the United States of America and a wholly owned subsidiary of Berkshire Hathaway. Today, more than 22 million automobiles are insured by GEICO, owned by more than 13 million policyholders. GEICO follows a direct-to-customer sales model.

From the perspective of GEICO's CEO, there are certain KPIs (Key Performance Indicators). A few of them are listed below:

  • Average Insurance Policy Size - Average size of the insurance policies closed within the measurement period. This KPI is most used for: Operational Excellence
  • Loss Ratio % - The ratio of claims to premiums. It may be calculated in several different ways, using paid premiums or earned premiums, and using paid claims with or without changes in claim reserves and with or without changes in active life reserves.
  • Claims Solvency % - An insurance company's ability to pay the claims of its policyholders.
  • % of Overdue Claims - Percentage of claims that are overdue.
  • Percentage of Sales Growth - Measures the number of policy renewals and new policy sales over a set period of time
  • Net Income Ratio - Measures how effective the organization is at generating profit on each dollar of earned premium.
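A few of the ratio-style KPIs listed above reduce to simple arithmetic. The Python sketch below shows one possible reading of each. Since the loss ratio can be calculated in several ways, this example assumes the paid-claims-over-earned-premiums variant without reserve changes; the function names and all figures are hypothetical and are not GEICO data.

```python
def loss_ratio_pct(claims_paid, earned_premiums):
    """Loss Ratio %: paid claims as a percentage of earned premiums
    (one of several possible definitions, assumed here)."""
    if earned_premiums <= 0:
        raise ValueError("earned_premiums must be positive")
    return 100.0 * claims_paid / earned_premiums


def overdue_claims_pct(overdue_claims, total_claims):
    """% of Overdue Claims: overdue claims as a share of all open claims."""
    if total_claims <= 0:
        raise ValueError("total_claims must be positive")
    return 100.0 * overdue_claims / total_claims


def net_income_ratio(net_income, earned_premiums):
    """Net Income Ratio: profit generated per dollar of earned premium."""
    if earned_premiums <= 0:
        raise ValueError("earned_premiums must be positive")
    return net_income / earned_premiums


if __name__ == "__main__":
    # Illustrative figures only.
    print(loss_ratio_pct(650_000, 1_000_000))    # percent of premiums paid out
    print(overdue_claims_pct(30, 400))           # percent of claims overdue
    print(net_income_ratio(120_000, 1_000_000))  # profit per premium dollar
```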


From the above KPIs, I have picked Claims Solvency % as the one on which we will proceed with our case study. Claims Solvency % refers to an insurance company's capacity to compensate its customers in case of unexpected calamities, which in GEICO's case could be a car collision.

Performance metrics that are important for evaluating GEICO's performance include:

  • Total Revenues - Total amount collected from customers in exchange for insuring their automobiles
  • Total No. of Customers - Total number of customers insured by GEICO
  • Total Settlements - Sum of the amounts paid to customers under their insurance policies in case of car collisions
  • Total Insured Amount - Sum of the amounts which GEICO would be liable to pay if all customers needed to be paid


A dimensional model can play a crucial role in helping GEICO's senior management get answers to specific questions, such as how many customers were insured on a daily, monthly or yearly basis. The level of granularity to be queried can be handled by using dimensional models. Similarly, dimensional models can help them query and evaluate the performance of specific insurance products or a group of similar insurance products.

Given the large amount of data collected during the operations of a large insurance company like GEICO, a dimensional model can provide a simple and optimized query platform for business users like the CEO. A dimensional model denormalizes the highly complex transactional database tables and groups them into dimensions, ensuring simpler and faster querying of data.
Dimensional modelling maximizes flexibility and scalability as user activities change. Given this, GEICO's CEO can track changes in customer data as they occur and base evaluations on them.

A transaction-level star join schema would provide an extremely powerful way for GEICO to analyze insurance claims. The number of claimants, the timing of claims, the timing of payments made, etc. can be easily derived from this view of the data.
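To show how a transaction-level star schema answers questions such as the number of claimants and the payments made per month, here is a small sketch using Python's built-in `sqlite3` module. The table names (`fact_claim_payment`, `dim_date`, `dim_policy`), their columns and the sample rows are all assumptions made for illustration; they are not GEICO's actual model.

```python
import sqlite3


def build_claims_star_schema():
    """Create a tiny star schema: one fact table and two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY, full_date TEXT,
            month INTEGER, year INTEGER);
        CREATE TABLE dim_policy (
            policy_key INTEGER PRIMARY KEY, product TEXT);
        -- One row per claim payment transaction (the grain of the fact table).
        CREATE TABLE fact_claim_payment (
            date_key INTEGER REFERENCES dim_date(date_key),
            policy_key INTEGER REFERENCES dim_policy(policy_key),
            claimant_id INTEGER,
            amount_paid REAL);
        """
    )
    # Hypothetical sample data.
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?, ?)",
        [(20160115, "2016-01-15", 1, 2016),
         (20160120, "2016-01-20", 1, 2016),
         (20160203, "2016-02-03", 2, 2016)],
    )
    conn.executemany(
        "INSERT INTO dim_policy VALUES (?, ?)",
        [(1, "Collision"), (2, "Comprehensive")],
    )
    conn.executemany(
        "INSERT INTO fact_claim_payment VALUES (?, ?, ?, ?)",
        [(20160115, 1, 101, 2500.0),
         (20160120, 2, 102, 800.0),
         (20160203, 1, 101, 1200.0)],
    )
    return conn


def claims_by_month(conn):
    """Number of distinct claimants and total paid, rolled up by month."""
    return conn.execute(
        """
        SELECT d.year, d.month,
               COUNT(DISTINCT f.claimant_id), SUM(f.amount_paid)
        FROM fact_claim_payment AS f
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY d.year, d.month
        ORDER BY d.year, d.month
        """
    ).fetchall()


if __name__ == "__main__":
    print(claims_by_month(build_claims_star_schema()))
```

Changing the `GROUP BY` to `d.year` or to `d.full_date` moves the same query to a yearly or daily level, which is the kind of granularity control discussed above.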

Below is a sample dimensional model which GEICO can adopt in order to evaluate their performance.

Thursday, February 4, 2016

Business Intelligence & Analysis Products Scan & Evaluation

What is Business Intelligence?

Business intelligence (BI) is described as "the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes". BI is an umbrella term that refers to a variety of software applications used to analyze an organization's raw data.
Simply put, BI allows easy interpretation of large volumes of data so that intelligent business decisions can be taken based on that interpretation.

There are several Business Intelligence and Analytics tools and platforms available in the market, which make taking an intelligent decision much easier.
However, one needs to comprehensively evaluate a BI tool against various criteria before choosing it for an organization and investing in it. Criteria such as data visualization capability, scalability and so forth must be considered before choosing the best fit for the organization.
This blog discusses the five most popular Business Intelligence & Analysis products in use today, evaluates them against certain criteria, and rates them based on the weight of each criterion.

Criteria used for comparing BI tools
  • Data visualization
    One of the most important features of a BI tool is data visualization. This criterion is the capability of the BI tool to represent data to its audience in the most interactive way possible. The tool should use vibrant visual objects, images, charts and interactive dashboards with informative visualizations. The visual representation of data should be easy to understand even for non-technical users
  • Scalability
    This criterion is the ability of the BI tool to handle a growing amount of data and a growing number of users, and its potential to be extended. A BI tool should be scalable enough to add functionality based on business requirements. Business settings are susceptible to change, and a good BI tool should perform well even when they change rapidly
  • Cost effectiveness
    Cost is another important factor to consider when choosing a BI tool. This criterion is based upon return on investment or value for money, in other words, the number of features provided by the BI tool for the price that is charged.
    The cost of a BI tool includes the following factors:
    • License costs
    • Cost of BI developer
    • Training costs
    • Support/Maintenance fees
  • Data Integration
    This criterion is the capability of the BI tool to handle multiple data sets originating from multiple data sources. These sources include XML data, MS Excel data, MySQL or Oracle databases, Siebel, SAP databases, flat files, etc. Because the variety of sources from which data is collected is ever increasing, it is vital for a BI tool to be regularly updated to support multiple data sources and formats.
  • Customer Experience/ Ease of Use
    This is a measure of how easy it is for users to get used to the product and produce some meaningful output. Users should require minimal training to use the BI tool. This criterion also includes the help and support documentation provided by the vendor



Business Intelligence Tools


1. Tableau



Strengths:
  • Supports data integration from various sources 
  • Wide range of dashboard capabilities
  • Easy-to-use drag-and-drop interface
  • Row level data security
  • Cost is comparatively less than other tools
  • High quality customer support to quickly resolve any issues
Weaknesses:
  • Lacks a robust security system
  • Lacks support of APIs
  • Less support for custom modification and 3rd party plugins
  • Limited functionality for data mining

Product Analysis:

Data visualization
All features are dynamic, interactive and highly customizable. One of the best tools available.

Scalability
Has issues with large volumes of data.

Cost effectiveness
Tableau is free for students, and the commercial version costs around $2000. Training is provided free of cost.

Data Integration
Can handle a variety of data sets easily.



2. MicroStrategy


Strengths:
  • Ability to generate enterprise reporting, dashboards and notifications
  • Ad-Hoc reporting is supported
  • Mobile application to allow user to create reports on mobile phones
  • Enterprise grade cyber security to provide object level security
  • Multiple data-sources are supported 

Weaknesses:

  • Visualization effects are not as good as those of other tools
  • Not easy for business and non-technical users
  • Product cost is higher than that of other tools offering the same feature set
  • A structured data warehouse is needed for integration

Product Analysis:

Data visualization
Reports lack visual appeal and also take longer to generate.

Scalability
Highly scalable; can handle up to 17 TB of data easily.

Cost effectiveness
The cost of the tool is reasonable, at around $3000.

Data Integration
Can integrate with Excel, Hadoop and other big data sets. Excellent data integration.



3. QlikView



Strengths:

  • Quickest implementation time and exceptional performance in vital BI projects
  • Ability to easily handle complex ETL
  • Excellent tool for data discovery, can dive deep into data for analysis
  • Excellent visualization features and easy to use user interface


Weaknesses:

  • Not very scalable
  • A few tasks involve a steep learning curve; not suited to business users
  • Server deployment can be expensive
  • Not good for real-time reporting

Product Analysis:

Data visualization
One of the best BI tools for data visualization. Closely follows Tableau.

Scalability
QLIK is highly scalable and supports large data sets and thousands of users.

Cost effectiveness
QLIK costs around $4000, plus $1000 per additional user.

Data Integration
QLIK can work with multiple data sets with ease and is better than Tableau on this criterion.



4. Oracle BI


Strengths:

  • Supports Big Data
  • Good detailed user documentation
  • User friendly interface
  • Excellent customer support and training
  • Ability to analyze large data-sets in short amount of time 

Weaknesses:

  • Not easily customizable; custom requirements take a significant amount of time
  • Lacks support for multiple data sources

Product Analysis:

Data visualization
Oracle is good at handling large sets of data, as it was primarily developed to process large-scale system-of-record requirements. It is still not good at data visualization.

Scalability
Oracle has a scalable architecture. It is an enterprise-level tool and handles large amounts of data.

Cost effectiveness
A major drawback of Oracle BI is its high cost.

Data Integration
Oracle BI primarily uses SQL databases and is limited in processing multiple databases.


5. SAS BI


Strengths:

  • Easy and fast reporting
  • Supports advanced techniques like prediction modelling and data mining 
  • Data integration is excellent
  • Huge market share
  • Support of other programming languages like R


Weaknesses:

  • Cost is higher than that of other BI tools
  • User interface is not very user friendly
  • No hardware architecture support

Product Analysis:

Data visualization
SAS has a good number of features for data visualization.

Scalability
SAS is highly scalable and can operate on multiple platforms such as Linux/Unix and Windows.

Cost effectiveness
The cost of using SAS is high: servers cost around $8000, and $1700 is charged for each additional user.

Data Integration
SAS can handle multiple data sources such as MySQL, Hadoop and flat files.



Weight score analysis of BI tools


Criteria              Weight    Tableau   SAS    QLIK   MicroStrategy   Oracle
Data visualization    30.00%    10        8      10     6               6
Scalability           15.00%    8         10     8      8               10
Cost effectiveness    20.00%    8         8      7      7               7
Data Integration      20.00%    8         8      8      8               6
Customer Experience   15.00%    10        8      10     9               8
Points                100.00%   8.9       8.3    8.7    7.35            7.1
Rank                            1         3      2      4               5
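The points in the table are weighted averages of the per-criterion scores. The Python sketch below reproduces that calculation from the weights and scores shown in the table; the dictionary layout and the function names `weighted_score` and `rank_tools` are assumptions made for this example. Weights are kept as whole percentages so the arithmetic stays exact.

```python
# Criterion weights in percent, taken from the table.
WEIGHTS = {
    "Data visualization": 30,
    "Scalability": 15,
    "Cost effectiveness": 20,
    "Data Integration": 20,
    "Customer Experience": 15,
}

# Scores out of 10 per tool, in the same criterion order as WEIGHTS.
SCORES = {
    "Tableau": [10, 8, 8, 8, 10],
    "SAS": [8, 10, 8, 8, 8],
    "QLIK": [10, 8, 7, 8, 10],
    "MicroStrategy": [6, 8, 7, 8, 9],
    "Oracle": [6, 10, 7, 6, 8],
}


def weighted_score(scores, weights=WEIGHTS):
    """Sum of weight * score over all criteria, divided by 100."""
    total = sum(w * s for w, s in zip(weights.values(), scores))
    return total / 100


def rank_tools(all_scores=SCORES):
    """Return tool names ordered from highest to lowest weighted score."""
    return sorted(all_scores, key=lambda t: weighted_score(all_scores[t]),
                  reverse=True)


if __name__ == "__main__":
    for tool in rank_tools():
        print(f"{tool}: {weighted_score(SCORES[tool])}")
```

Adjusting the entries in `WEIGHTS` (keeping their sum at 100) shows how the ranking shifts when an organization values, say, scalability more than visualization.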


After carefully analyzing each BI tool and comparing them on the basis of data visualization, scalability, cost effectiveness, data integration and customer experience, we see that Tableau scores highest among the five tools. However, all five tools are close enough to compete with each other, and each has its own unique set of features, so the choice of a BI tool will ultimately depend on the specific features the user is looking for.


References:
https://en.wikipedia.org/wiki/Business_intelligence
http://www.qlik.com/
http://www.microstrategy.com/us/
http://www.tableau.com
http://www.sas.com/en_us/home.html
http://www.oracle.com/us/solutions/business-analytics/business-intelligence/foundation-suite/overview/index.html
http://www.cio.com/article/2439504/business-intelligence/business-intelligence-definition-and-solutions.html
http://www.gartner.com/technology/reprints.do?id=1-2ADAAYM&ct=150223&st=sb