Thursday, March 31, 2016

Presentation and Visualization Methods

Data visualization

Data visualization is the method of consolidating data into one collective, illustrative graphic. Traditionally, data visualization has been used for quantitative work (infographics are a popular example), but ways of representing qualitative work have proven to be equally powerful.

[Embedded media: an explanation of what data visualization is]


Visualization methods are important to users because they provide mental models of the information. Visualization techniques make large and complex information intelligible. Information visualization is a visual user interface that gives the user insight into the information. The basic purpose of visualization is to create interactive visual representations that exploit humans' perceptual and cognitive problem-solving capabilities. The goal is for the user to easily understand and interpret large and complex sets of information.
With so many visualization techniques available, it can be confusing to know which technique is appropriate, and when, to convey the greatest possible understanding. The basic purpose of a visual representation is to communicate insight as efficiently and easily as possible. Different visualization techniques suit different situations and convey different levels of understanding. This document is a guide for young researchers who want to start working in visualization.

[Embedded media: an explanation of the value of data visualization]



Types Of Data Visualization

Basic Charts

The most recognizable and utilized form of data visualization is the basic chart. Line, bar, area and pie charts represent the most common types of this form.

Status Indicators

In addition to basic charts that visualize a set or sets of data, status indicators are a commonly used visualization for showing the business condition of a particular measure or unit of data. These indicators can take many forms, including gauges, traffic lights, or symbols. Status indicators become even more effective when they incorporate contextual metrics, such as targets and thresholds, because they give quick feedback on whether a specific measure is good or bad, high or low, below or above target.
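
As a rough illustration of how a target and a threshold turn a raw number into a traffic-light status, here is a minimal Python sketch. The names (Thresholds, traffic_light), the sample values, and the assumption that a higher value is better are all mine for the sake of the example; a real dashboard tool would have its own configuration for this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical context for one measure, assuming higher values are better."""
    target: float   # at or above this value the measure is on target
    warning: float  # at or above this value (but below target) it needs attention


def traffic_light(value: float, t: Thresholds) -> str:
    """Map a measured value to a traffic-light status."""
    if value >= t.target:
        return "green"
    if value >= t.warning:
        return "amber"
    return "red"


# Illustrative numbers only: a monthly sales measure with a target of 100.
sales = Thresholds(target=100.0, warning=80.0)
statuses = [traffic_light(v, sales) for v in (120.0, 85.0, 40.0)]
print(statuses)
```

For a measure where lower is better (patient wait time, for example), the comparisons would simply be reversed.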

Advanced Data Visualizations

More advanced examples of data visualization include scatter plots, bubble charts, sparklines, geographical maps, tree maps, Pareto charts, and many others. These more sophisticated visualizations are designed to display data in ways tailored to a specific function or industry.


The three business domains I am considering to demonstrate the use of data visualization are:


Healthcare Industry

Healthcare data tends to reside in multiple places. It comes from all over the organization: from different source systems, like EMRs or HR software, and from different departments, like radiology or pharmacy. Aggregating this data into a single, central system, such as an enterprise data warehouse (EDW), makes it accessible and actionable. Healthcare data also occurs in different formats (e.g., text, numeric, paper, digital, pictures, videos, multimedia). Radiology uses images, old medical records exist on paper, and today's EMRs can hold hundreds of rows of textual and numerical data. Sometimes the same data exists in different systems and in different formats, as is the case with claims data versus clinical data.
Anyone who has been a patient in a hospital will likely agree that the experience has room for improvement. Instrumentation and the best possible use of information and knowledge can make a real difference in improving patient care.


Recommendation:

Using a dashboard that combines a wide assortment of charts, gauges, and graphs, healthcare administrators can make informed short-term tactical decisions while gaining insight into how those decisions will affect various outcomes, staff groups, and finances.

Recommended Visualizations:

Pie Charts: Top insurance payers can be quickly analyzed using pie charts.
Bar Charts: Department utilization, including the individual utilization levels of doctors and nurses, can be compared using bar charts.
Gauges: Patient wait time and lag by date and hour can be analyzed using gauges.


Electronic Commerce (E-commerce) Industry

Data visualization is equally important in e-commerce. An online retail store generally collects data about its customers and where they are coming from, i.e., which platforms and websites. It is a good idea for a business to analyze its current customer base and its competitors, and to compare its own performance with theirs. All of this is made possible through data visualization tools. Data visualization and analytics tools can help online business owners make better decisions and build strategies to succeed and survive in the industry.


Recommendation:

Dashboards and e-commerce analytics give each division visibility into the data that is significant to it. Distributors can use these tools to improve decision-making because they paint a big picture of the data. Seeing this information as a geo-location map, a bar graph, or a line graph makes it easier to read and more meaningful.

Recommended Visualizations:

Line Chart: To display trends over a period of time and to provide an easy way to compare online retailers in a particular year.
Bar Chart: To depict the top-performing stores in a state, region by region.
Waterfall Chart: To help understand the cumulative effect of sequentially introduced positive or negative values.
Geo-location Map: For a visually appealing overview of sales by region.
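
To make the waterfall chart's "cumulative effect" concrete, here is a small Python sketch that computes where each bar of a waterfall chart would start and end. The function name waterfall_steps and the sample figures are made up for illustration; it only computes the running totals and leaves the actual drawing to whichever charting tool you use.

```python
from itertools import accumulate


def waterfall_steps(start, changes):
    """Return (label, value_before, value_after) for each sequential change.

    `changes` is a list of (label, delta) pairs; a delta may be positive or negative.
    """
    labels = [label for label, _ in changes]
    deltas = [delta for _, delta in changes]
    # Running totals: the starting value, then the total after each change.
    totals = list(accumulate(deltas, initial=start))
    return list(zip(labels, totals, totals[1:]))


# Illustrative figures only: revenue starts at 1000 and is adjusted step by step.
steps = waterfall_steps(1000, [("New orders", 400), ("Returns", -150), ("Discounts", -50)])
for label, before, after in steps:
    print(f"{label}: {before} -> {after}")
```

Each tuple gives the two ends of one floating bar, so the last "after" value is the final total.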


Finance Industry

For the financial industry, I am mostly concentrating on visualization in banking. The finance industry deals with large amounts of data every day, and the processes involved are very complex. The financial services industry includes a wide variety of businesses, such as credit bureaus, credit card companies, brokerage firms, and mortgage providers. Each has its own way of presenting information to its different users.

Recommendation:

I recommend a dashboard that displays financial and sales metrics such as Margin by Month, Sales Distribution, Monthly Support Expenses, and Monthly Revenue.
Column charts, just like bar graphs, help dashboard readers visualize categorical data and compare it side by side. The purpose of the column chart and of the line chart stays the same even when the two are combined: columns are best used to represent categorical data, while lines display how the data changes over time (the trend).


Recommended Visualizations:

Gauges: To visually depict the range of expenses
Maps, Area charts: To visually depict the sales distribution across locations
Line charts: To analyze the Margin, Revenue and Expenses.


Conclusion

It is much easier to understand data when it is presented in a visual format rather than in a table with columns and rows. But it is all the more important to choose the right visualization for any kind of analysis. To ensure this, always understand your data first. Then ask several questions: What are the variables depicting? What do the business users want to analyze? Why is this analysis required? How will this analysis help them make informed decisions? What are the different ways this data can be visualized? What is the best way to present this data? I am sure that once you have answers to all of these questions, you can work wonders in creating visualizations.



References:
http://bridgeable.com/the-importance-of-data-visualization/
http://www.ijcaonline.org/archives/volume34/number1/4061-5722
http://www.dashboardinsight.com/Article.aspx?id=4148
https://www.healthcatalyst.com/a-new-way-to-look-at-healthcare-data-models
http://ibmresearchnews.blogspot.com/2013/08/cultivating-healthier-hospitals.html
http://www.cmswire.com/cms/customer-experience/birst-to-aggressively-market-cloudbased-business-intelligence-offering-with-us-38m-funding-022188.php
http://www.dashboardinsight.com/dashboards/strategic/dundas-data-visualization-sonatica-dashboard.aspx

Thursday, March 3, 2016

Big Unstructured Data v/s Structured Relational Data

Structured Data


Structured data refers to data stored in an organized manner. It has a uniform format and is stored primarily in tabular form, using relational databases or Excel spreadsheets. This uniformity makes the data simple to understand and query. SQL (Structured Query Language) is the standard language for defining and querying relational databases. It helps in structuring data as well as filtering it according to user requirements. It offers a variety of operations on structured databases, namely search, insert, delete, modify, and create.
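
The five operations named above can be tried out in a few lines with Python's built-in sqlite3 module. This is only a sketch: the customers table, its columns, and the sample rows are invented for the example, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: define a table with a fixed, uniform structure.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, zip_code TEXT)")

# Insert: add rows that follow that structure.
cur.executemany(
    "INSERT INTO customers (name, zip_code) VALUES (?, ?)",
    [("Ana", "10001"), ("Ben", "94105"), ("Cy", "10001")],
)

# Modify: change a value in an existing row.
cur.execute("UPDATE customers SET zip_code = ? WHERE name = ?", ("60601", "Ben"))

# Delete: remove a row.
cur.execute("DELETE FROM customers WHERE name = ?", ("Cy",))

# Search: query the remaining rows.
rows = cur.execute("SELECT name, zip_code FROM customers ORDER BY name").fetchall()
print(rows)

conn.close()
```

Because every row has the same columns, the final query can filter and sort without any preprocessing, which is the simplicity the paragraph above refers to.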

Some examples of structured data:

Machine Generated

  • Sensor Data - GPS data, manufacturing sensors, medical devices
  • Point-of-Sale Data - Credit card information, location of sale, product information
  • Call Detail Records - Time of call, caller and recipient information
  • Web Server Logs - Page requests, other server activity

Human Generated

  • Input Data - Any data inputted into a computer: age, zip code, gender, etc.

Unstructured Data





Unstructured data refers to data that is not stored in an organized manner. By commonly cited estimates, about 90% of the data that exists today is unstructured. Typical examples are webpages, emails, blogs, and documents, which contain a wealth of data in scattered form. Unstructured data does not fit neatly into relational databases and hence requires various data mining techniques to convert it into useful datasets.
Social media is a major source of unstructured data. In addition to social media, there are many other common forms of unstructured data:

  • Word Documents, PDFs and Other Text Files - Books, letters, other written documents, audio and video transcripts
  • Audio Files - Customer service recordings, voicemails, 911 phone calls
  • Presentations - PowerPoints, SlideShares
  • Videos - Police dash cam, personal video, YouTube uploads
  • Images - Pictures, illustrations, memes
  • Messaging - Instant messages, text messages



[Embedded media: an explanation of the difference between structured and unstructured data]



Volume of Data


The incessant growth in the amount of data generated in today's world gave rise to the term "Big Data". Big Data refers to large amounts of data, much of it unstructured, that cannot be processed using conventional database systems. It is said to be too fast, too volatile, and too difficult to contain.
As mentioned already, about 90% of data sources are unstructured. In today's digital world, organizations generate a large amount of unstructured data while carrying out their business operations. This unstructured data, stored in emails, documents, webpages, and external sources, poses a challenge for organizations. The challenge lies in turning these unstructured sources into meaningful insights. These insights are a major source of support for strategic and decision-making activities. The graph below projects the increase in the amount of unstructured data around the world, while the amount of structured data has been consistent since the early 20th century.




To tackle Big Data, organizations have increased the resources allocated to processing and cleaning it and to bringing out insights. A few of the technologies used today to convert unstructured data into structured data are Hadoop, MongoDB, RapidMiner, and other business intelligence technologies.



Types of Data


Spatial Data - Data that has several dimensions and provides information about the location or position of a given entity. Examples include the coordinates of a place or images taken from various locations.

Historical Data - Data that goes through the ETL (Extract, Transform, and Load) process in order to build data warehouses. These data warehouses store static, transformed data. This data is primarily used for analysis purposes.
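
To show what the three ETL steps look like in practice, here is a toy Python sketch using only the standard library. Everything in it is an assumption made for illustration: the CSV text stands in for a source system, the sales table stands in for a warehouse, and the cleaning rules (trim and lowercase the region, skip rows with no amount) are just examples of a transform step.

```python
import csv
import io
import sqlite3

# Stand-in for a source system export; note the messy spacing, casing, and a missing amount.
RAW_CSV = """order_id,region,amount
1, east ,19.99
2,WEST,5.00
3,east,
"""


def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values, convert types, and drop incomplete records."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue  # skip records with no amount
        cleaned.append((int(row["order_id"]), row["region"].strip().lower(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the transformed records into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # in-memory stand-in for a data warehouse
load(transform(extract(RAW_CSV)), warehouse)

# Once loaded, the data is ready for analysis queries.
totals = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Real ETL tools add scheduling, error handling, and many more transformations, but the extract, transform, load order is the same.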

Redundant Data - Duplicate data that resides in the same data source. This type of data leads to data inconsistency and integrity conflicts.

Operational Data - Real-time data generated in transactional systems. This type of data provides the lowest level of granularity but is difficult to work with before it is transferred to analytical systems.

Created Data - Data that are purposely created by businesses primarily for market research. This data consists of focus groups, customer surveys, etc.

Transactional Data - Data generated once a transaction has been completed.

Metadata - Data that describes other data and schema objects, and is used by applications to fetch and compute the data correctly.





Data Warehouse's role in analyzing unstructured data

  • Introduces simplicity and performance to statistical analysis
  • Aggregates data from multiple sources
  • Offers features to build predictive models which help in analyzing trends and patterns
  • Adaptable to changing data demands and analytic requirement escalations
  • Separation of data as per user's view by creating data marts

Despite the substantial advantages of data warehouses, there are many challenges and limitations in implementing them.



Limitations of Data Warehouses


  • Static Data - Data stored in data warehouses is static, so they cannot analyze real-time data sources
  • Data Security - Because the data is aggregated from multiple sources, there is a risk of it becoming available to unauthorized user groups
  • Integration Complexity - Integrating data from disparate sources is complex, and the different formats and architectures of those sources add to the complexity
  • Cost Overhead - Implementing a data warehouse is expensive. It requires the effort of numerous IT and business professionals to achieve the intended objective, and data warehouses are known to have disappointing ROIs
  • No Drill-Down Capability - Data warehouses cannot reach the lowest level of granularity required by some users, which leads to an unclear picture of the data when making business decisions
  • Data Ownership Concerns - Data warehouses are often, but not always, Software as a Service implementations or cloud services applications. Your data security in this environment is only as good as your cloud vendor's. Even if implemented locally, there are concerns about data access throughout the company
  • Slow Onboarding of New Sources - Adding new data sources takes a long initial implementation time, and the associated costs are high
  • Limited Flexibility of Use and Types of Users - Multiple separate data marts are required for multiple uses and types of users


Future of Data Warehouses in the age of Big Unstructured Data


Technologies like Hadoop, Informatica PowerCenter, and Teradata are emerging as the biggest players in data warehouse technology. The future is bright. Data warehouses have never been more valuable than they are today. Decision makers are banking on data to make crucial decisions, all the more so with the incessant growth in data generated through social media, free text, sensors, and meters.

Web 2.0 significantly grew business-related data generated from e-commerce, web logs, search marketing, and other sources. These sources remained business-generated and business-owned. Enterprises expanded ETL operations to compensate for the new data sources.

Due to the continuous growth of unstructured data, organizations are looking to integrate with robust platforms like Hadoop to handle unstructured data and perform analysis on it. Hadoop excels at processing big data sets with its MapReduce model and its distributed file system (HDFS). These features make Hadoop a great addition to "standard" data warehouses.
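
To give a feel for the MapReduce model mentioned above, here is a single-process word count written in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase) and the sample documents are my own, and the sketch only mimics the map, shuffle, and reduce steps that Hadoop would distribute across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Two tiny "documents" standing in for a large collection of unstructured text.
documents = ["big data big insights", "data warehouse"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both steps can run in parallel, which is what lets the model scale to very large data sets.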

Data warehouses update their data on a periodic basis. In the future, however, the emphasis will be on real-time analysis of real-time data. Such data warehouses will enable real-time decisions in dynamic environments.
More and more firms will move to faster, cloud-based databases. Cloud computing is the trend in the market right now. Although the cloud is well established for operational applications, it is not yet widely used for data warehouse platforms. Clouds can allocate resources dynamically, which is helpful when the data volume of a particular warehouse varies unpredictably. Through the cloud, applications can also scale up significantly as requirements change.


Cloud-based data warehousing solutions will become a basic requirement in the future.
According to Jon Bock, Vice President of Snowflake, "Cloud-based solutions will be critical to helping organizations expand access to data and analytics as well as increase their agility with their data. Taking advantage of the flexibility and cost model of the cloud, these solutions will offer performance on demand and native understanding of diverse data to support a wide range of analytics, without the management overhead and cost of traditional on-premises offerings."


References:

http://www.ibm.com/analytics/us/en/technology/data-warehousing/
http://www.whamtech.com/adv_disadv_dw.htm
http://www.datasciencecentral.com/profiles/blogs/structured-vs-unstructured-data-the-rise-of-data-anarchy
http://datastorageasean.com/daily-news/sas-contextual-analysis-accelerates-putting-structure-unstructured-data
https://www.betterbuys.com/bi/future-of-data-warehousing/
http://smallbusiness.chron.com/disadvantages-data-warehouse-73584.html