Thursday, March 3, 2016

Big Unstructured Data vs. Structured Relational Data

Structured Data

Structured data is data stored in an organized manner. It has a uniform format and is stored primarily in tabular form, in relational databases or Excel spreadsheets. This uniformity makes the data simple to understand and query. SQL (Structured Query Language) is a language for defining, querying and manipulating data in relational databases. It helps in structuring data as well as filtering it as per user requirements. It offers a variety of operations that can be performed on structured databases, namely search, insert, delete, modify and create.
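The five operations named above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the customers table, its columns and the sample rows are made up for the example and do not come from any real system.

```python
import sqlite3

# In-memory database; table, columns and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: define the structure (schema) the data must follow.
cur.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, zip_code TEXT)"
)

# Insert: add rows that conform to the schema.
cur.executemany(
    "INSERT INTO customers (name, zip_code) VALUES (?, ?)",
    [("Alice", "10001"), ("Bob", "94105"), ("Carol", "10001")],
)

# Modify: update an existing row.
cur.execute("UPDATE customers SET zip_code = ? WHERE name = ?", ("60601", "Bob"))

# Delete: remove a row.
cur.execute("DELETE FROM customers WHERE name = ?", ("Carol",))

# Search: read back the remaining rows in a fixed order.
rows = cur.execute(
    "SELECT name, zip_code FROM customers ORDER BY name"
).fetchall()
print(rows)

conn.close()
```

Because every row follows the same schema, each operation is a single short statement, which is the simplicity that uniform data provides.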

Some examples of structured data:

Machine Generated

  • Sensor Data - GPS data, manufacturing sensors, medical devices
  • Point-of-Sale Data - Credit card information, location of sale, product information
  • Call Detail Records - Time of call, caller and recipient information
  • Web Server Logs - Page requests, other server activity

Human Generated

  • Input Data - Any data entered into a computer: age, zip code, gender, etc.

Unstructured Data

Unstructured data is data that is not stored in an organized manner. By a commonly cited estimate, about 90% of the data present today is in unstructured form. Typical examples of unstructured data are webpages, emails, blogs and documents, which contain a wealth of data in a scattered form. Unstructured data does not fit neatly into relational databases and hence requires various data mining techniques to convert it into useful datasets.
Social media is a major source of unstructured data. In addition to social media, there are many other common forms of unstructured data:

  • Word Documents, PDFs and Other Text Files - Books, letters, other written documents, audio and video transcripts
  • Audio Files - Customer service recordings, voicemails, 911 phone calls
  • Presentations - PowerPoints, SlideShares
  • Videos - Police dash cam, personal video, YouTube uploads
  • Images - Pictures, illustrations, memes
  • Messaging - Instant messages, text messages
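As a small illustration of converting unstructured text into a structured record, the sketch below pulls a few fields out of a made-up email using Python's standard email and re modules. The message, the field names and the patterns are all assumptions chosen for the example; real data mining pipelines are far more involved.

```python
import re
from email import message_from_string

# A made-up email; addresses, order number and date are hypothetical.
raw_email = """\
From: jane@example.com
To: support@example.com
Subject: Order 48213 arrived damaged

Hi, my order 48213 arrived on 2016-02-27 with a cracked screen.
Please call me at 555-0142.
"""


def email_to_record(raw):
    """Turn free-form email text into a flat, table-friendly dict."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message
    order = re.search(r"\border (\d+)\b", body)
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", body)
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "order_id": order.group(1) if order else None,
        "date": date.group(1) if date else None,
    }


record = email_to_record(raw_email)
print(record)
```

The resulting dictionary has fixed keys, so many such records could be loaded as rows of a relational table.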

Below is an illustration of the difference between structured and unstructured data.

Volume of Data

Incessant growth in the amount of data generated in today’s world gave rise to the term “Big Data”. Big Data refers to large amounts of unstructured data that cannot be processed using conventional database systems. It is said to be too fast, too volatile and too difficult to hold.
As mentioned already, 90% of the data is unstructured. In today’s digital world, organizations generate a large amount of unstructured data while carrying out their business operations. This unstructured data, stored in emails, documents, webpages and external sources, poses a challenge for organizations. The challenge lies in turning these unstructured sources into meaningful insights. These insights are a major aid in carrying out strategic as well as decision-making activities. The graph below projects the increase in the amount of unstructured data around the world, while the amount of structured data has remained consistent since the early 20th century.

To tackle Big Data, organizations have increased the resources allocated to processing and cleaning big data and bringing out insights. A few of the technologies used today to convert unstructured data into structured data are Hadoop, MongoDB, RapidMiner and other business intelligence technologies.

Types of Data

Spatial Data - Spatial data has several dimensions and provides information about the location or position of a given entity. Examples include the coordinates of a place or images taken from various locations.

Historical Data - Data that goes through the ETL (Extract, Transform and Load) process in order to build data warehouses. These data warehouses store static, transformed data. This data is primarily used for analysis purposes.
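The ETL steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the CSV extract, the sales_fact table and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical extract from an operational system (note the messy values).
SOURCE_CSV = """sale_id,store,amount,sold_on
1, north ,19.99,2016-03-01
2,South,5.00,2016-03-01
3,north,,2016-03-02
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, normalize names, cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append(
            (
                int(row["sale_id"]),
                row["store"].strip().lower(),
                float(row["amount"]),
                row["sold_on"],
            )
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(sale_id INTEGER PRIMARY KEY, store TEXT, amount REAL, sold_on TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(SOURCE_CSV)), warehouse)

# Analysis query over the static, transformed data.
totals = warehouse.execute(
    "SELECT store, SUM(amount) FROM sales_fact GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```

Once loaded, the data is only read for analysis, which matches the static nature of historical data described above.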

Redundant Data - Duplicate data that resides in the same data sources. This type of data leads to data inconsistency and integrity conflicts.
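A minimal sketch of how duplicates might be detected is shown below. The customer records and the choice of email as the matching key are assumptions made for the example. Note that the two duplicate records disagree on the zip code, which is exactly the kind of inconsistency redundant data causes.

```python
def deduplicate(records, key_fields):
    """Keep the first record seen for each normalized key."""
    seen = set()
    unique = []
    for rec in records:
        # Normalize so that case and stray spaces do not hide duplicates.
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical customer records; the first two describe the same person.
customers = [
    {"name": "Alice Smith", "email": "alice@example.com", "zip": "10001"},
    {"name": "alice smith ", "email": "Alice@Example.com", "zip": "10002"},
    {"name": "Bob Jones", "email": "bob@example.com", "zip": "94105"},
]

unique_customers = deduplicate(customers, ["email"])
print(len(unique_customers))
```

Keeping the first occurrence is just one possible policy; a real system would need a rule for deciding which conflicting value to trust.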

Operational Data - Real-time data generated in transactional systems. This type of data provides the lowest level of granularity but is difficult to work with before it is transferred to analytical systems.

Created Data - Data that is purposely created by businesses, primarily for market research. This data comes from focus groups, customer surveys, etc.

Transactional Data - Data generated after a transaction process is completed.

Metadata - Metadata is data that describes the data and schema objects, and is used by applications to fetch and compute the data correctly.
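As one concrete example of metadata, SQLite exposes a table's schema through the PRAGMA table_info statement. The sketch below uses a hypothetical calls table; an application could read this description to learn column names and types before fetching the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, loosely modeled on call detail records.
conn.execute(
    "CREATE TABLE calls "
    "(call_id INTEGER PRIMARY KEY, caller TEXT NOT NULL, duration_sec REAL)"
)

# Each returned row describes one column:
# (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(calls)").fetchall()

# Build a simple name -> declared type mapping from the metadata.
schema = {
    name: col_type
    for _cid, name, col_type, _notnull, _default, _pk in columns
}
print(schema)

conn.close()
```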

Data Warehouse's role in analyzing unstructured data

  • Introduces simplicity and performance to statistical analysis
  • Aggregates data from multiple sources
  • Offers features to build predictive models which help in analyzing trends and patterns
  • Adaptable to changing data demands and analytic requirement escalations
  • Separates data according to each user's view by creating data marts

Despite the substantial advantages of using data warehouses, there are many challenges and limitations in implementing them.

Limitations of Data Warehouses

  • Data stored in data warehouses is static, so they cannot analyse real-time data sources
  • Data security becomes a concern, as data aggregated from multiple sources risks becoming available to unauthorized user groups
  • Integrating data from disparate sources is complex. The differing formats and architectures of the data sources add to the complexity
  • The cost overhead of implementing a data warehouse is huge. It requires the effort of numerous IT and business professionals to achieve the target objective of building a data warehouse. Data warehouses are known to have disappointing ROIs
  • No Drill-Down Capability - Data warehouses are incapable of reaching the lowest level of granularity required by some users, leading to an unclear picture of the data when making business decisions
  • Data Ownership Concerns - Data warehouses are often, but not always, Software as a Service implementations or cloud service applications. Your data security in this environment is only as good as your cloud vendor's. Even if implemented locally, there are concerns about data access throughout the company.
  • Adding new data sources takes a long initial implementation time, and the associated costs are high
  • Limited flexibility of use and types of users - requires multiple separate data marts for multiple uses and types of users

Future of Data Warehouses in the age of Big Unstructured Data

Technologies like Hadoop, Informatica PowerCenter and Teradata are emerging as the biggest players in data warehouse technology. The future is bright. Data warehouses have never been more valuable than they are today. With the incessant growth in data generated through social media, free text, sensors and meters, decision makers are banking on data to make crucial decisions.

Web 2.0 significantly grew business-related data generated from e-commerce, web logs, search marketing, and other sources. These sources remained business-generated and business-owned. Enterprises expanded ETL operations to compensate for the new data sources.

Due to the continuous growth of unstructured data, organizations are looking to integrate with robust platforms like Hadoop to handle unstructured data and perform analysis on it. Hadoop excels at processing big data sets with its MapReduce model and distributed file system (HDFS). These features make Hadoop a great addition to “standard” data warehouses.
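The idea behind the MapReduce model can be sketched in plain Python with a word count. This is only a single-process illustration of the map, shuffle and reduce phases; it does not use Hadoop or its API, and the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a cluster each document could sit on a different node.
documents = ["big data is big", "data warehouses store data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be spread across many machines, which is what a distributed framework automates.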

Data warehouses update their data on a periodic basis. However, in the future the emphasis will be more on real-time analysis of real-time data. Such data warehouses will enable real-time decisions in dynamic environments.
More and more firms will move to faster cloud-based databases. Cloud computing is the trend in the market right now. Although it is well known for operational applications at present, the cloud is not yet widely used for data warehouse platforms. Clouds can allocate resources dynamically, which is helpful when the data volume of a particular warehouse varies unpredictably. Also, in the cloud, applications can scale up significantly based on requirements.

Cloud-based data warehousing solutions will become a basic requirement in the future.
According to Jon Bock, Vice President of Snowflake, "Cloud-based solutions will be critical to helping organizations expand access to data and analytics as well as increase their agility with their data. Taking advantage of the flexibility and cost model of the cloud, these solutions will offer performance on demand and native understanding of diverse data to support a wide range of analytics, without the management overhead and cost of traditional on-premises offerings."

