Implementing Data Preparation in Distributed Multimedia System

IS DATA PREPARATION THE NEXT BIG DATA DISRUPTION?
The 22nd International Conference on Distributed Multimedia Systems
DMS 2016
Grand Hotel Salerno, Salerno, Italy
November 25 - 26, 2016

Agenda
• SCENARIO
• BIG DATA IN THE DATA DRIVEN ENTERPRISE
• WHAT DATA PREPARATION SHOULD COVER
• CREATING READY DATA USING FRACTALS
• CASE STUDY
Source Forrester 2016

1. DOES THE BUSINESS ANALYST UNDERSTAND THE DATA SCIENTIST?
2. WHY ARE DATA DRIVEN COMPANIES HIRING DATA JOURNALISTS?
3. WHY DOES DARK DATA EXTERNAL TO DATA LAKES CONTINUE TO GROW?
4. WHY DOES MAKING DATA TAKE SO LONG?
5. DATA PLAY AND NARRATIVES?
HOW MUCH TIME IS LEFT TO EXPLOIT THE DATA PROCESSING OUTPUT?
77% Data Processing
23% Data Analysis
Source Bloor 2016

90% IS DARK
12% AVAILABLE FOR BUSINESS INSIGHTS
88% IS JUST STORED
80% RECORDINGS, PDFS AND TEXTS
Source IDC 2016
+4300% ANNUAL DATA GENERATION

Data preparation is an iterative process for exploring and transforming raw data into forms
suitable for data science, data discovery, and analytics.
Self-service data preparation (SSDP) tools are user-oriented tools that enable data preparation
capabilities such as data cataloging and inventorying, data discovery, data exploration, data
transformation, data structuring, surfacing of sensitive attributes, and anomaly detection.
These tools are aimed at reducing the time and complexity of preparing data and improving
analyst productivity.
[Diagram: stages Pre-process, Prepare, Discover, Exploit; data states Raw, Technically correct, Ready Data, Patterns; labels Formatted, Multimedia domain, Missing, Multimedia]

Depending on how you count them, there are
anywhere from 20 to 50 providers of self-service
data preparation tools. However, they’re not all
equal, and users should carefully examine each
offering to make sure they’re getting what they
expect.
Many BI and advanced analytics vendors (Tableau, Qlik, SAS, etc.)
have jumped onto SSDP, even though their capabilities aren’t separate
from their core offerings and show limitations in terms of
performance, neutrality, and custom processing.
The key reason why self-service data prep will survive as its own
category is the growing realization that data preparation
needs to be kept separate from analysis and discovery.
The volumes and the number of data sources will not be
decreasing, and neither will the number of BI tools.
To that end, it’s likely that self-service data prep will remain a
product category unto itself for the foreseeable future.
Source Bloor 2016

Where we are
BIG DATA IN THE DATA DRIVEN ENTERPRISE

WE ARE ALL AWARE THAT
THE I.T. DIVISION
IS GOING TO BUILD
PLANETS OF DATA

WHICH ARE WORLDS MADE OF
DATABASES, DATA LAKES,
DATA WAREHOUSES,
STRUCTURES, AND SCHEMAS

IT SEEMS THAT
THESE WORLDS ARE CALLED
“BIG DATA”

BUT WE’RE AFRAID THAT, TO CREATE THEM,
THE LORDS ARE TAKING LONGER THAN 7 DAYS
AND, UNFORTUNATELY, WORSE…
IT SEEMS THAT
HUMANS DON’T HAVE
ACCESS TO THOSE
WORLDS

Bottom line:
Is data preparation the bridge between
planets of data and the user?
Big data is not just technology; responsibility
should be allocated on the basis of the
following critical factors:
1. Will raw data be transferred to the preparation unit
(push), or
2. will the preparation unit have to read data from the data
lake (pull)?
3. Has the data lake been designed to stage or to store
raw data?
4. How variable are the context and the data?
[Diagram: responsibility matrix (IT vs. END USER) by data communication mode (PUSH / PULL), data lake purpose (STORE / STAGE), and variability (low / high)]

Backgrounds
WHAT DATA PREPARATION SHOULD COVER

Raw data are cold,
analytics are hot

Reality
Understanding Comics, 1993
How to connect
analytics and
details?

A database is
required to
contextualize
languages and
realities

Bottom Line:
Using data should be faster and cheaper, with minimal data
movement
• Materialize reality and language in a
consistent database
• Couple language and reality using
keyback features
• Bind external algorithms using Open
(Standard?) User Exits
• Foster holistic views of data through
Grid Data Unification

Blending context, languages and facts
CREATING READY DATA USING FRACTAL ADC

Data source
name  city   DateBirth
Aldo  Miami  11/1/90
Sara  NYC    12/2/89
Anna  Rome   1.1.68
Sara  NYC    31-1-61

Fractal conversion (data complex, storage group)

Map
rowId  Nname  Ncity
1      1      1
2      2      2
3      3      3
4      2      2

Dictionary
Key   Value  NValue
Name  Aldo   1
Name  Sara   2
Name  Anna   3
City  Miami  1
…     …      …

Luggage (Transform DateBirth)
DateBirth  UDateB   Age
11/1/90    1/11/90  26
12/2/89    2/12/89  26
1.1.68     1/1/68   48
31-1-61    1/31/61  56

Hierarchy (Add Geo classification)
Ncity  city   state
1      Miami  Fl
2      NYC    NY
3      Rome   Italy
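The "Transform DateBirth" step can be illustrated with a minimal Python sketch. It assumes that every source date is written day-first, that separators are '/', '.' or '-', and that two-digit years belong to the 1900s; none of these rules is stated on the slide, and the function names are hypothetical. The Age column is left out because the slide does not say which reference date was used.

```python
import re
from datetime import date


def normalize_birth_date(raw: str, century: int = 1900) -> date:
    """Parse a day-first date with '/', '.' or '-' separators and a two-digit year.

    Assumption: all years fall in the given century (1900s by default).
    """
    day, month, year = (int(part) for part in re.split(r"[/.\-]", raw.strip()))
    return date(century + year, month, day)


def to_unified_format(d: date) -> str:
    """Render a date as month/day/two-digit-year, the layout of the UDateB column."""
    return f"{d.month}/{d.day}/{d.year % 100:02d}"


# DateBirth values copied from the data source table.
raw_dates = ["11/1/90", "12/2/89", "1.1.68", "31-1-61"]
unified = [to_unified_format(normalize_birth_date(r)) for r in raw_dates]
print(unified)  # ['1/11/90', '2/12/89', '1/1/68', '1/31/61']
```

Under these assumptions the output matches the UDateB values listed for the four source rows.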
ADC is a fractal-like algorithm that converts raw input data and the related data processing into a set of
chained binary blocks, formulas and long pointers.
"We show that ADC represents an important set of computations… The advantages of ADC are that:
it is described by a small number of parameters and has a priori known sizes of the views; the views can be generated
independently; the overhead of combining the generated views is predictable; the data set can be partitioned into a
number of independently generated subsets; the elements of the data set are pseudo random.
These properties make ADC a strong candidate for a data intensive grid benchmark." (M. Frumkin, NASA NAS Division)
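The relationship between the data source, the Map and the Dictionary can be sketched with plain dictionary encoding. This is not the ADC algorithm itself: chained binary blocks, formulas and long pointers are not modeled, and the function name `dictionary_encode` is hypothetical. The sketch only shows how each distinct value receives a numeric code and how each row is rewritten in terms of those codes.

```python
from collections import defaultdict


def dictionary_encode(rows, columns):
    """Replace each value by a per-column integer code, assigned in order of first appearance.

    Returns (row_map, dictionary):
      row_map    - one dict per row with 'rowId' and an 'N<column>' code per column
      dictionary - column -> {value: code}
    """
    dictionary = defaultdict(dict)
    row_map = []
    for row_id, row in enumerate(rows, start=1):
        encoded = {"rowId": row_id}
        for col in columns:
            codes = dictionary[col]
            value = row[col]
            if value not in codes:
                codes[value] = len(codes) + 1  # next free code, starting at 1
            encoded["N" + col] = codes[value]
        row_map.append(encoded)
    return row_map, {col: dict(codes) for col, codes in dictionary.items()}


# Name and city values copied from the data source table.
source = [
    {"name": "Aldo", "city": "Miami"},
    {"name": "Sara", "city": "NYC"},
    {"name": "Anna", "city": "Rome"},
    {"name": "Sara", "city": "NYC"},
]
row_map, dictionary = dictionary_encode(source, ["name", "city"])
for entry in row_map:
    print(entry)
print(dictionary)
```

The fourth row reuses the codes of the second one (Nname 2, Ncity 2), which is the same pattern shown in the Map table: repeated values are stored once in the dictionary and referenced by code.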

Using the fractal engine,
performance is extreme

MATERIAL TESTING
• Complex JSON, Oracle, CSV and WMV data
• Manual data processing executed using
MATLAB
• Hours of scientist work to detect outliers
• Impossible to replicate tests with the same
results
• Little capitalization of know-how
• Data blending happens at narrative
writing time

Terabyte-level staging
Rigid batch processing
No history
[Diagram: Digital reality, Language, Fractal database]

Bottom Line:
Every day we hear from entrepreneurs doing their best to turn their big ideas into a consistent and
successful online business. Here IT is the enabler but, unfortunately, sometimes the T part has a negative
influence on the development of the core idea.
The ideal toolkit is made for those who wish to exploit the I part of IT, so that entrepreneurs with great
ideas can craft their business themselves. And they should!
