1. FLYING BLIND ON A ROCKET CYCLE
PIONEERING EXPERIENCE-CENTERED PRODUCT
STRATEGY FOR EMERGING SPACES
2. JOE LAMANTIA
Currently: VP Design & Development @ Bottomline Technologies
Previous 20 years: end-to-end customer experience, all stages of product and
service development, and digital / business transformation, focusing on
emerging business and technology.
Archetype(s): Sometime Entrepreneur / Proto-academic / Arm-chair Pro Cyclist
https://www.linkedin.com/in/digitaljoelamantia/
@mojoe
JoeLamantia.com [joelamantia.net]
3. Businesses around the world, including some of the world's largest banks and private and publicly traded companies, depend on Bottomline Technologies (NASDAQ: EPAY) solutions to help them make complex business payments simple, smart, and secure.
4. This case study describes building a learning-
driven strategy capability to guide an
adventurous product development group
focused on the new domains of big data
analytics and machine intelligence.
I’ll share the outcomes of our efforts to launch
new products chartered directly around
customer experience value; outline the
methods, tools, and perspectives that powered
product discovery and strategic planning; share
a framework and patterns for identifying and
understanding emerging domains; and review
the application of this toolkit to new situations.
12. BUSINESS STRATEGY IS ABOUT
IDENTIFYING YOUR BUSINESS
OBJECTIVES AND DECIDING WHERE TO
INVEST TO BEST ACHIEVE THOSE
OBJECTIVES.
Marty Cagan
http://svpg.com/business-strategy-vs-product-strategy/
13. THE PRODUCT STRATEGY SPEAKS TO
HOW YOU HOPE TO DELIVER ON THE
BUSINESS STRATEGY.
Marty Cagan
http://svpg.com/business-strategy-vs-product-strategy/
18. OPPORTUNITY ASSESSMENT
“I ASK PRODUCT MANAGERS TO ANSWER TEN FUNDAMENTAL QUESTIONS”
1. Exactly what problem will this solve? (value proposition)
2. For whom do we solve that problem? (target market)
3. How big is the opportunity? (market size)
4. What alternatives are out there? (competitive landscape)
5. Why are we best suited to pursue this? (our differentiator)
6. Why now? (market window)
7. How will we get this product to market? (go-to-market strategy)
8. How will we measure success/make money from this product? (metrics/revenue strategy)
9. What factors are critical to success? (solution requirements)
10. Given the above, what's the recommendation? (go or no-go)
http://svpg.com/assessing-product-opportunities/
"Assessing Product Opportunities," Marty Cagan, Dec 13, 2006
19. PRODUCT DISCOVERY
MODERN PRODUCT DISCOVERY
• Introduction [:26]
• Modern Product Discovery [:54]
• The Evolution of Modern Product Discovery [4:15]
• The Agile Manifesto [7:06]
• The Rise of User Experience Design [8:47]
• The Lean Startup: Eric Ries [9:49]
• The Jobs-To-Be-Done Framework: Clayton Christensen and Anthony Ulwick [10:42]
• OKRs and Design Sprints [12:12]
• The Goal of Modern Product Discovery [14:27]
• Putting Discovery Practices Into Context: The Opportunity Solution Tree [21:32]
• The Future of Product Discovery [29:42]
https://www.producttalk.org/2017/02/evolution-product-discovery/
"The Evolution of Modern Product Discovery," Teresa Torres, February 8, 2017
23. PRODUCT STRATEGY CHARTS A DESIRED
SET OF COURSES THROUGH THE SPACE
OF POSSIBLE PRODUCTS FOR A DOMAIN
Joe Lamantia
32. DEEP STRUCTURE
ENTERPRISE / B2B
• Business process
• Activity
• Social structure: Organizational model
• Boundaries
• Regulation
• IT / Systems architecture
• Lifecycle
• Flows: capital, information, people
• Frame: shareholder value, social enterprise
CONSUMER / B2C
• Value scheme: wealth, love,
knowledge, safety
• Demographics
• Boundaries
• Mores
• Culture
• Social structure: community / group
• Frame: active lifestyle, sustainability
34. Information Visibility through Endeca Discovery Applications
[Architecture diagram: the MDEX Engine ingests rapidly changing data and content, large volumes of highly attributed records, and structured and unstructured information.]
• Discovery Applications: an intuitive user experience guides untrained users to discover relationships in data
• Specialized Database: a high-performance database purpose-built for data-driven search, navigation, and analytics
• Flexible Data Integration: consolidates structured and unstructured data to bridge whitespace between enterprise systems
50. EXPLORING HYPOTHESES ABOUT VALUE:
"Automation of reconciliation activities will enable accounts payable groups in mid-market companies to handle 30% more transactions."
51. PRODUCT DEVELOPMENT IMPACT
INNOVATION OPPORTUNITIES
PRODUCT HYPOTHESES FOR VALIDATION
PRODUCT CONCEPTS FOR PROTOTYPING
PLANNING GUIDANCE (ROADMAP > EPIC > QA)
DELIVERY GUIDANCE: FEATURES AND FUNCTIONS
56. Data Scientist
Square - San Francisco Bay Area
Job Description
Square is hiring a Data Scientist on our Risk team. The Risk team at Square is responsible for enabling growth while mitigating financial loss associated with transactions. We work
closely with our Product and Growth teams to craft a fantastic experience for our buyers and sellers.
Desired Skills & Experience
As a Data Scientist on our Risk team, you will use machine learning and data mining techniques to assess and mitigate the risk of every entity and event in our network. You will
sift through a growing stream of payments, settlements, and customer activities to identify suspicious behavior with high precision and recall. You will explore and understand our
customer base deeply, become an expert in Risk, and contribute to a world-class underwriting system that helps Square provide delightful service to both buyers and sellers.
To accomplish this, you are comfortable writing production code in Java and conducting exploratory data analysis in R and Python. You can take statistical and engineering ideas
from prototype to production. You excel in a small team setting and you apply expert knowledge in engineering and statistics.
Responsibilities
1. Investigate, prototype and productionize features and machine learning models to identify good and bad behavior.
2. Design, build, and maintain robust production machine learning systems.
3. Create visualizations that enable rapid detection of suspicious activity in our user base.
4. Become a domain expert in Risk.
5. Participate in the engineering life-cycle.
6. Work closely with analysts and engineers.
Requirements
1. Ability to find a needle in the haystack. With data.
2. Extensive programming experience in Java and Python or R.
3. Knowledge of one or more of the following: classification techniques in machine learning, data mining, applied statistics, data visualization.
4. Concise verbal and written articulation of complex ideas.
Even Better
1. Contagious passion for Square’s mission.
2. Data mining or machine learning competition experience.
Company Description
Square is a revolutionary service that enables anyone to accept credit cards anywhere. Square offers an easy to use, free credit card reader that plugs into a phone or iPad. It's
simple to sign up. There is no extra equipment, complicated contracts, monthly fees or merchant account required.
Co-founded by Jim McKelvey and Jack Dorsey in 2009, the company is headquartered in San Francisco.
59. WHAT SORT OF PERSON?
▸ They seem different than analysts:
▸ problem set
▸ relationship to discovery tools
▸ skills and professional profile
▸ discovery / analytical methods
▸ perspective
▸ workflow and collaboration
▸ Are they? How?
60. AREAS OF INVESTIGATION
▸ Workflow
▸ Environment
▸ Organizational model
▸ Pain points
▸ Tools
▸ Data landscape
▸ Analytical practices
▸ Project structure
▸ Unmet needs
62. DISCUSSION GUIDE
Can you please walk me through a recent or current project?
a. How was the project initiated?
b. How defined was the business problem in the beginning? Did the problem change?
c. Where/who did you obtain data sets from? How did you make the decision?
d. Describe the data you used: What did the data sets look like? How big were they? Were they structured or unstructured?
e. What tools or techniques did you use to do the analyses? Did they map to the specific steps you mentioned just now?
f. How did you decide these were the tools/techniques to use? To what extent were these decisions made by yourself, and to what extent were they standardized by your group/team?
g. How did you present the results of your analyses? What tools did you use? What do you like and dislike about your current tool set?
h. Which stage of this project was the most challenging? To what extent did the tools satisfy what you intended to do? What features were lacking?
i. How much collaboration was there during each stage of the project?
i. Background and role of collaborators
ii. Collaboration modes
iii. Types of information shared
Thinking about the projects you have worked on, is there a common approach you take to address these problems?
How did you decide on this approach/tools?
63. NEEDS
What are the most common and useful statistical techniques you use during discovery and analysis efforts?
"(1) The most commonly used statistical techniques used to date (in our strategic planning work) are: dimensionality reduction (partition clustering, multiple correspondence analysis), factor analysis, partition clustering (k-means, k-medoids, fuzzy clustering), cluster validation techniques (silhouette, Dunn's index, connectivity), multivariate outlier detection, linear regression, and logistic regression."
What statistical capabilities or functions would be very useful if provided within Endeca discovery applications, and where would they be useful?
"(2) Techniques that would assist with identifying outliers or invalid data. Much of this work seems to be done by hand. I believe that we are also getting to the point where we could start using linear regression and splines (for showing trends)."
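The techniques this respondent names are standard library calls today. As a hedged illustration (not from the deck), here is a minimal scikit-learn sketch of two of them, k-means clustering validated with the silhouette coefficient, run on synthetic data:

```python
# A minimal sketch (not from the deck) of two techniques the respondent
# names: k-means clustering validated with the silhouette coefficient.
# Data is synthetic; scikit-learn supplies both routines.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic "highly attributed records": three blobs in five dimensions.
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 5))
                  for c in (0.0, 3.0, 6.0)])

# Validate candidate cluster counts; higher silhouette = cleaner partition.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    print(f"k={k}: silhouette={silhouette_score(data, labels):.3f}")
```

Run as written, the silhouette score peaks at k=3, matching the three planted blobs, which is the kind of cluster validation the quote describes.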
65. NEEDS
For example, would system-generated descriptive statistical visualizations be useful for whole data sets, or for smaller user-selected groups of attributes?
"With regards to your last question on visualization, we have put in significant effort to use visualization in our Endeca installation. We have built visualizations such as tree maps, flow diagrams, sunburst diagrams, scatter plots showing clusters, and hierarchical edge bundling diagrams to explore our data sets."
Would it be useful for the application to analyze and suggest possible distribution models it sees in the data, for the values of individual attributes and/or for larger sets of data?
"Our data tends to be qualitative rather than quantitative, so this drives much of our visualizations. So yes, interactive descriptive statistical visualization would be helpful – on the complete data set and individual attributes."
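To make "system-generated descriptive statistical visualization" concrete, here is a minimal, hypothetical pandas sketch: summary statistics plus a histogram per attribute, for a whole data set or a user-selected subset of columns. The column names and values are invented:

```python
# A hypothetical sketch of per-attribute descriptive statistics and
# histograms with pandas/matplotlib; column names and values are invented.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "transaction_amount": [120.5, 98.0, 430.2, 87.9, 15000.0, 110.4],
    "days_to_settle": [2, 3, 2, 5, 30, 3],
})

# Summary statistics (count, mean, std, quartiles) for every attribute.
print(df.describe())

# One histogram per attribute; pass a column list to restrict the view
# to a user-selected group of attributes, e.g. df[["days_to_settle"]].
df.hist(bins=10)
plt.tight_layout()
plt.show()
```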
66. Analyst Profile: Scott – Operations Analyst
Education: BA Information Systems (Connecticut State College); MBA Organizational Leadership (Johnson & Wales)
Summary: Scott is a mid-level analyst with a background in Business Information Systems and an MBA in Organizational Leadership. He works on a 6-person team at Cox-New England (Telecommunications). His current role involves conducting data mining analysis to support operations research and organizational decision making/strategic planning.
Scott's work supports both sides of the profit equation: operations research/analysis to support internal cost-cutting and process innovation, and formative/summative evaluation to help drive effective sales/marketing efforts to increase revenue. His group is also given target cost savings goals that they need to help individual departments achieve to fulfill a cost reduction organizational mandate. His group accomplishes this by discovering inefficiencies in process through data mining, predictive modeling, and retrospective data analysis.
Cox has highly attributed enterprise data on customers, marketing campaigns, pricing variants and special offers, demographics, geography of the area, building and home types, school schedules, weather events, etc. that describe customer usage patterns, consumption of media bandwidth, and so on. Each of their products (data, cable, phone, wireless) has different usage profiles that vary along many of the dimensions and variables listed above. His group is focused on residential customers; business customers are handled by a separate unit.
Discovery/Information Needs
Support longer-term strategic planning:
• How can we decrease the time-to-install service for new customers?
• How can we decrease the time it takes to restore service after a storm causes widespread outages?
• How can we decrease operational cost for each department/line of business?
• How many call center representatives do I need in my call center?
• How much offsite technician headcount do we need based on historical/seasonal trends balanced against current customer install base and ongoing sales/marketing efforts?
Evaluate success:
• How effective was a particular marketing campaign?
• How effective is a new training program for call center representatives?
• How effective is a self-install approach?
Understand variables that impact KPIs. KPIs include:
• Call center volume
• % successful resolution by support staff
• Time-to-install
• Sales volume
• Sales revenue
Understand and explain variance using retrospective analyses:
• Why does Connecticut have a shorter time-to-install than Rhode Island?
• Why did two identical marketing campaigns in two different markets have vastly different impact on sales?
• Is the variance significant, or does it represent random deviation? (see the significance-test sketch after this profile)
Ad-hoc reporting:
• How many calls to the call center needed to be escalated to tier 2 support last month?
• How many new customers complained that a technician was late or didn't show up for the install appointment?
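Scott's "significant variance or random deviation?" question maps onto a routine significance test. A hedged sketch, with invented time-to-install samples for the two states (a real analysis would first check the test's assumptions):

```python
# Hedged sketch: is the CT vs RI time-to-install difference significant,
# or random deviation? Samples are synthetic stand-ins; a real analysis
# would check normality/variance assumptions before trusting the test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
ct_days = rng.normal(loc=4.2, scale=1.0, size=200)  # Connecticut installs
ri_days = rng.normal(loc=4.9, scale=1.1, size=200)  # Rhode Island installs

# Welch's t-test: does not assume equal variances between the states.
t_stat, p_value = stats.ttest_ind(ct_days, ri_days, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Unlikely to be random deviation at the 5% level.")
```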
68. 'FIVE THINGS ANALYSTS DO WITH DATA'
▸ Clustering (structure of data)
▸ Dimension Reduction (structure of data)
▸ Anomaly Detection (profile of data)
▸ Characterization (profile of data)
▸ Testing probability model & validation (validity of data)
Source: Frontiers in Massive Data Analysis
http://www.nap.edu/openbook.php?record_id=18374
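As a rough illustration of two of the five activities (not code from the cited report), the sketch below pairs dimension reduction (structure of data) via PCA with a simple distance-based anomaly check (profile of data), on synthetic records:

```python
# Illustrative only: dimension reduction (structure of data) via PCA plus
# a simple distance-based anomaly check (profile of data), on synthetic
# records. Not code from the cited report.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
records = rng.normal(size=(500, 20))   # 500 records, 20 attributes
records[0] += 15                       # plant one gross outlier

# Project onto the two strongest components to expose structure.
projected = PCA(n_components=2).fit_transform(records)

# Flag records unusually far from the center of the projection.
dist = np.linalg.norm(projected - projected.mean(axis=0), axis=1)
outliers = np.where(dist > dist.mean() + 3 * dist.std())[0]
print("Flagged record indices:", outliers)
```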
73. Sense Maker Segment
Sense makers need to create and/or employ insights to accomplish
their business goals and satisfy their responsibilities.
These insights emerge from independent and collaborative discovery
efforts that involve direct interaction with discovery applications, and
participation in discovery environments.
Insight Consumer
Analyst
Casual Analyst
Data Scientist
Analytics Manager
Problem Solver
74. Dana Data Scientist
Creates data-driven insights, offerings, and resources to transform the organization
Job Title: Senior Data Scientist
Company: LinkedIn
Work Experience: 10 years
Education: Ph.D. Statistics, MS Bioinformatics
"I'll do whatever it takes – wrangle, extract, manipulate, analyze, experiment, prototype – to use data to drive value & innovate"
Background
Dana is a Senior Data Scientist who has worked at LinkedIn for 5 years. Dana's education includes a Ph.D. in Statistics and an MS in Bioinformatics. Dana's previous work includes positions in academic research groups as a doctoral candidate and post-doc, as well as software engineering roles in the Internet & technology industries.
Work Context
• Dana works with several other data scientists and her Analytics Manager on a centralized team
• Dana and her colleagues aim to create data-driven insights, features, resources, and offerings that deliver strategic value to LinkedIn
• Dana works with Analysts on other teams to define and create discovery tools, data sets, and methods for use by their groups at LinkedIn
• Dana & team are visible & well established within LinkedIn, and have a voice in product strategy and operational context; they have a high degree of autonomy in defining data science projects
• Dana works with Insight Consumers to suggest and determine potential new data-driven offerings to prototype and evaluate
Typical Discovery Scenarios & Problems
• How can we leverage data to increase online engagement with LinkedIn?
• How should we measure engagement & what factors drive it?
• What aspects of a personal profile are most likely to encourage / discourage new connections between people?
• How can we increase people's activity and contributions to topical discussion groups?
• What factors drive the effectiveness of our marketing campaigns?
• Why did one of our marketing campaigns work exceptionally well?
• How can we leverage data to help recruiters identify and communicate effectively with qualified and potentially available candidates?
Activities
• Mines, analyzes, & experiments with data to identify patterns, trends, outliers, causal factors, predictive models, & opportunities
• Defines and explains newly devised measurements, predictive models, & insights
• Compares effectiveness of operations at achieving company goals for engagement, growth, data quality
• Produces & explores new data sets
• Collaborates with other data scientists to capture new data streams
• Prototypes new data-driven site features/offerings
• Runs data-based experiments to test/evaluate models, hypotheses & prototypes
• Communicates & explains analyses to colleagues & Insight Consumers
Key Goals
• Leverage data to support the org mission
• Enhance products & services with data-driven insights and features
• Use data to identify new opportunities and prototype/drive new customer offerings
• Create useful data sets/streams, measures, & resources (e.g., data models, algorithms, etc.)
Tools
• Open source data manipulation, mining & analysis tools including R, Pig, Hadoop, Python, etc.
• Statistical packages such as SAS, SPSS, etc.
• Custom analytical tools built using open source components and languages
Pain Points
• Defining and capturing useful measures of online attention
• Getting all the data analytic tools to work together properly
• No current workflow support or tools for data wrangling, analysis, experimentation, and prototyping
Wish List
• Effective tools to help experiment with and evaluate value / utility of features and activities for users
• Ability to rapidly prototype data-driven features w/out risk of online service disruptions
Sample Workflow
1. Analyze & identify causal/predictive factors: Who are the best candidates to contact for a job based on recruiter needs and profile content?
2. Prototype & experiment with data-driven feature: How can we prototype/evaluate this w/out disrupting the site?
3. Gather data & analyze results: Use descriptive, inferential, and predictive statistics to evaluate results
4. Summarize & communicate: Review findings with colleagues; summarize, visualize, and communicate key findings to Insight Consumers/decision makers
75. Perspectives
Analytical
The analytical perspective is the center of definition for all analytical roles. Contrast with engineers, who "make stuff". Analytical roles figure things out for some purpose: whether building a model to inform a product prototype or providing insight.
Empirical
The empirical perspective is distinct from the analytical perspective, and marks 'true' data scientists. It revolves around framing and testing hypotheses formally and informally, often requires validation and interrogation of experimental methods and results by others, and expects a significant degree of transparency at all stages of the analytical effort.
76. Empirical Method
[Concept map of the empirical method: an Insight Consumer articulates questions or beliefs that motivate hypotheses; the Data Scientist creates & refines hypotheses, whose predictions are tested by experiments; experiments use analytical methods (implemented as analytical tools: algorithms, scripts, tests) applied to data sets drawn from data sources (a development corpus mirroring the production corpus, plus external sources); results lead to conclusions that generate insights, which inform the domain and the Insight Consumer; models (reference, initial, interim, new) are created, refined, trained, and validated across exploratory, investigative, model-building, training, and validation phases; a Data Engineer implements production models and manages data sets and data products.]
The questions the method cycles through:
• What is the question?
• How will we answer the question?
• What data will we use?
• What analytical method will we use?
• What tools will we use?
• What are the results?
• What do the results mean?
• What did we learn / discover?
• Who should we inform?
• What is the next question?
EMPIRICAL DISCOVERY
"a hybrid, purposeful, applied, augmented, iterative and serendipitous method for realizing novel insights for business, through analysis of large and diverse data sets."
Data Science and Empirical Discovery: A New Discipline Pioneering a New Analytical Method
https://blogs.oracle.com/serendipity/entry/data_science_and_empirical_discovery
78. Analysis Workflow & Activities
• Empirical analysis of subsets of data
– Understand topology of data and its boundaries (sets / subsets, complete corpus, totality of data)
• Outlier identification and profiling
– How significant are outliers to the overall topology?
» Comparative exclusion and profiling of resulting data subsets to understand their role; discover principal components (see the sketch after this list)
• Find and analyze patterns and areas of interestingness / deserving attention
• Find and analyze central actors / factors (in the existing model that produced the source data, in the topology of the working data, in patterns, etc.)
– Identify and understand their impact on local and global data topology and primary metrics, possibly along several axes at the same time
• Discover and analyze relationships amongst central actors
– Understand cycles, trends, and changes (dynamic characteristics) for core actors, topology, patterns, and structure
– Understand causal factors
• Codify / create a new model reflecting insights & outcomes from experiments
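A minimal sketch of the comparative-exclusion step referenced above: profile a metric with and without extreme records to gauge how much outliers shape the overall picture. The column name, values, and z-score threshold are all hypothetical:

```python
# Hypothetical comparative-exclusion sketch: profile a metric with and
# without extreme records to see how much outliers shape the topology.
# Column name, values, and the z-score cutoff are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
amounts = np.append(rng.normal(110.0, 10.0, size=50), 9800.0)  # one extreme
df = pd.DataFrame({"invoice_amount": amounts})

z = (df["invoice_amount"] - df["invoice_amount"].mean()) / df["invoice_amount"].std()
core = df[z.abs() < 2]  # exclude records more than 2 standard deviations out

print(f"With outliers:    mean={df['invoice_amount'].mean():.1f}  std={df['invoice_amount'].std():.1f}")
print(f"Without outliers: mean={core['invoice_amount'].mean():.1f}  std={core['invoice_amount'].std():.1f}")
```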
79. Data Science Workflow
• Frame problem / goal of effort
• Identify and extract data to be used in the effort from the whole corpus / totality of available data
– Exploratory identification and selection of working data for use in experiments
• Define experiment(s): hypothesis / null hypothesis, methods, success criteria
– Derive insight(s)
– Wrangle, process, visualize, interpret
• Codify / create a new model reflecting insights & outcomes from experiments
• Validate new model(s)
• Provision training data
• Train the new model
• Validate the outcomes of model training
• Hand off for implementation on production systems / as production code
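A compressed, hypothetical rendering of the tail of this workflow in scikit-learn: provision training data, train, validate against a success criterion, and persist the artifact for hand-off. Every name in it is invented for illustration:

```python
# A compressed, hypothetical rendering of the tail of this workflow in
# scikit-learn: provision training data, train, validate, and persist the
# model for hand-off. Data and the success criterion are invented.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import joblib

# Stand-in for the working data selected during exploration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Provision training data; hold out a validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train the new model, then validate it against the experiment's criterion.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = accuracy_score(y_valid, model.predict(X_valid))
print(f"Validation accuracy: {score:.3f}")  # compare to the success criterion

# Hand-off artifact for implementation on production systems.
joblib.dump(model, "validated_model.joblib")
```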
80. THE ESSENCE
▸Empirical perspective
▸Business imperatives drive activities
▸Analytical approach
▸Recipe is always the same
▸Engineering always present
▸Data challenges are paramount
▸consume 60% - 80% of time and effort
▸Data volumes range from huge to moderate (PB > MB)
▸Domain often drives analysis
▸Data scientists already have self-service
▸Some new problems, many the same
▸Use ‘advanced’ analytics, not conventional BA
▸Innovate by applying known analyses to new data
▸Current workflow fragmented across tools and data stores
▸Success can be a model, product, insight, infrastructure, tool
81. Model of Analytical Workflow
Articulates common analytical activities
“realistic” - represents wrangling, some iterative dynamics
bounded - does not represent business perspective
Originated by Ben Lorica - O’Reilly
*consistent with our research*
85. THE ESSENCE
▸Empirical perspective
▸Business imperatives drive activities
▸Analytical approach
▸Recipe is always the same
▸Engineering always present
▸Data challenges are paramount
▸consume 60% - 80% of time and effort
▸Data volumes range from huge to moderate (PB > MB)
▸Domain often drives analysis
▸Data scientists already have self-service
▸Some new problems, many the same
▸Use ‘advanced’ analytics, not conventional BA
▸Innovate by applying known analyses to new data
▸Current workflow fragmented across tools and data stores
▸Success can be a model, product, insight, infrastructure, tool
87. John is tasked with analyzing 30 years of crime data collected by three different authorities. Accordingly, the data arrive in three different formats: one source is a relational database, another is a comma-separated values (CSV)
file, and the third file contains data copied from various tables within a portable document format (PDF) report. Knowing the structure required for his visualization tool, John first reviews the different data sets to identify potential
problems (step 1 in Figure 1).
The relational database allows him to specify a query and generate a file in an acceptable format. For the comma delimited data, the column headings associated with the data were unclear. Using spreadsheet software he adds a
row of header information at the top to fit the format required by the visualization tool. While updating the header, John notices that the location of a given crime is encoded in one column (as ‘City, State’) in the CSV file and
encoded in two columns (one ‘City’ column and one ‘State’ column) in the relational database.
He decides to split the column in the CSV file into two separate columns. John then opens the text file in the spreadsheet but the spreadsheet does not parse the data as desired. After manually moving data fields to appropriate
columns and some other manipulation (step 2), John finally has consistent columns and now combines the three files into one, but then notices that some columns have inconsistently formatted cells.
The ‘Date’ column is formatted as ‘dd/mm/yy’ in some cells and as ‘mm/dd/yyyy’ in others. John returns to the original files, transforms all the dates to the same format, and recombines the files. John loads the merged data file in a
visualization tool (step 3). The tool immediately gives the error message ‘Empty cells in column 3’; it cannot cope with missing data. John returns to the spreadsheet to fill in missing values using a few spreadsheet formulas (back
to step 2). He edits the data by hand; sometimes he transforms the data (e.g. one state reports data only every other year so he uses an average for the missing years). At other times there is nothing he can do after diagnosing a
new problem (i.e. return to step 1). For example, he finds out that survey question 24 did not exist before 2000, and the most recent year of data from Ohio has not been delivered yet, so he tries to pick the best possible value (e.g. −1) to indicate missing values. John detects other, more nuanced, problems; for example, some cells have a blank space instead of being empty. It took hours to notice that difference. John tries to follow a systematic approach
when evaluating the data, but it is difficult to keep track of what he has inspected and how he has modified the data, especially because he discovers different issues across different files. Even after all of this work, he is not sure if
he has examined all of the variables or overlooked any outliers. After a while, the data file seems good enough and he decides to move on.
It took a few days so it is with a great sense of accomplishment that John finally loads the data for the second time into the visualization tool he wants to use (step 3 again). He constructs several views of the data, including a
geospatial representation of the crimes and a scatterplot of age against crime. As soon as he sees the visualized data he realizes that, unfortunately, data quality issues still persist. Extreme outliers appear in the visualization.
Some outliers seem to be valid data (e.g. data from the District of Columbia are very different from data from every other state).
Others seem suspicious (criminals may vary in age from teenagers to older adults, but apparently babies are also committing crimes in certain states). John iteratively removes those outliers he believes to be dirty data (e.g. criminals under 7 and over 120 years old). Time series visualizations indicate that, in 1995, some causes of death disappear abruptly while new ones appear. Two days later, an email exchange with colleagues reveals that the classification of causes of death was changed that year. John writes a transformation script to merge the data so he can analyze distinct terms referring to the same (or at least similar) cause of death.
Although the ‘real’ analysis is just about to start (step 4), John has made dozens of transformations, repeated the process several times, made important discoveries relating to the quality of the data, and made
many decisions impacting the quality of the final ‘clean’ data. He also used visualization repeatedly while walking through the process, but still does not have results to show to his boss. Finally, he is able to work
with the usable data, and useful insights come to the surface, but updated data sets arrive (step 5). Without proper documentation (step 6) of his transformations, John might be forced to repeat many of the
tedious tasks.
“Research directions in data wrangling: Visualizations and transformations for usable and credible data”
“a process of iterative data exploration and transformation that enables analysis.”
WRANGLING SCENARIO
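John's steps translate directly into a few lines of pandas. A hedged sketch (file contents, column names, and per-source date formats are invented; John's real files mixed formats within a single file):

```python
# Hedged pandas sketch of John's wrangling steps; file contents, column
# names, and per-source date formats are invented (his real files mixed
# formats within a single file).
import pandas as pd

db = pd.DataFrame({"City": ["Hartford"], "State": ["CT"],
                   "Date": ["03/01/1995"], "Crimes": [12]})
csv = pd.DataFrame({"Location": ["Providence, RI"],
                    "Date": ["1/3/95"], "Crimes": [8]})

# Split 'City, State' into two columns to match the database schema.
csv[["City", "State"]] = csv["Location"].str.split(", ", expand=True)
csv = csv.drop(columns="Location")

# Normalize the inconsistent date formats ('mm/dd/yyyy' vs 'dd/mm/yy').
for frame, fmt in ((db, "%m/%d/%Y"), (csv, "%d/%m/%y")):
    frame["Date"] = pd.to_datetime(frame["Date"], format=fmt)

# Combine sources and surface empty cells before the visualization tool does.
merged = pd.concat([db, csv], ignore_index=True)
print(merged[merged["Crimes"].isna()])
```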
88. Although the ‘real’ analysis is just about to start (step 4), John has made
dozens of transformations, repeated the process several times, made
important discoveries relating to the quality of the data, and made many
decisions impacting the quality of the final ‘clean’ data.
He also used visualization repeatedly while walking through the process, but
still does not have results to show to his boss.
Finally, he is able to work with the usable data, and useful insights come to the
surface, but updated data sets arrive (step 5).
Without proper documentation (step 6) of his transformations, John might be
forced to repeat many of the tedious tasks.
“Research directions in data wrangling: Visualizations and transformations for usable and credible data”
“a process of iterative data exploration and transformation that enables analysis.”
WRANGLING SCENARIO
89. One or more initial data sets may be used and new versions may
come later. The wrangling and analysis phases overlap.
While wrangling tools tend to be separated from the visual
analysis tools, the ideal system would provide integrated tools
(light yellow). The purple line illustrates a typical iterative process
with multiple back and forth steps.
Much wrangling may need to take place before the data can be loaded
within visualization and analysis tools, which typically
immediately reveals new problems with the data.
Wrangling might take place at all the stages of analysis as users
sort out interesting insights from dirty data, or new data become
available or needed.
At the bottom we illustrate how the data evolves from raw data to
usable data that leads to new insights.
“a process of iterative data exploration and transformation that enables analysis.”
WRANGLING IN THE ANALYTICAL WORKFLOW
91. Discovery in the Analytical Workflow
• Commonly recognizable cycle and focus for discovery activities (subset)
• Explicitly iterative, ad-hoc, dynamic
• Goal = incremental / directional advance in understanding
• Core modes of engagement with data = Explore, Analyze
• Modeling phase does not involve exploration
96. The Language of Discovery:
A concrete descriptive language for human discovery activity in diverse contexts.
A simple and consistent vocabulary that is independent of domain, role, information type, etc.
103. Locate
To find a specific (possibly known) thing
e.g. I need to find a new part with particular technical attributes and then source it from the most qualified supplier - Engineering
Verify
To confirm or substantiate that an item or set of items meets some specific criterion
e.g. How can I determine if I am looking at the latest information for a part or supplier? - Supply Chain Specialist
Monitor
To maintain awareness of the status of an item or data set for purposes of management or control
e.g. I need to monitor at-risk/failing customers/dealers so I can prompt my Account Reps to fix the problems - Sales Manager
104. Compare
To examine two or more things to identify similarities & differences
e.g. I need to compare our module set teardowns with competitive teardown information to see if we’re staying competitive for cost, quality and functionality - Engineering
Comprehend
To generate insight by understanding the nature or meaning of something
e.g. I need to analyze and understand consumer-customer-market trends to inform brand strategy & communications plan – Director, Brand Image
Explore
To proactively investigate or examine something for the purpose of knowledge discovery
e.g. I need to understand the cost drivers for this commodity so I can negotiate better terms with my suppliers and forecast business risk based on market indices - Procurement
105. Analyze
To critically examine the detail of something to identify patterns & relationships
e.g. I need to know the cost drivers for a part such as materials that impact cost. Is the relationship a correlation or step function for a part cost driver? - Engineering
Evaluate
To use judgement to determine the significance or value of something with respect to a specific benchmark or model
e.g. I need to determine my current state in my prints so I can evaluate if I have price variation to negotiate a better price - Procurement
Synthesize
To generate or communicate insight by integrating diverse inputs to create a novel artifact or composite view
e.g. I need to prepare a weekly report for my boss (sales mgr) of how things are going - Account Rep
111. Discovery Modes and Activity
[Diagram: four discovery modes cycling from Begin to Conclude, with two kinds of activity, Sensemaking and Transformation; transformation work runs from data quality to computed / enriched data, new data triggers new cycles, and cumulative change builds direction & momentum.]
• Explore - Goal: understand the nature and usefulness of data for analysis
• Wrangle - Goal: make data useful for analysis
• Analyze - Goal: achieve insights by analyzing data
• Augment - Goal: accumulate insight through iterative analysis
112. Working with Data to Effect Outcomes
[The same mode diagram, annotated with apparent mode and activity affinities: advancing insight ("can't do this…") depends on the underlying data-handling capabilities ("…without these capabilities").]
113. Actual Discovery Modes and Activity Affinities
[The mode diagram revised to match the research: the data runs from source data to source & enriched data; during real wrangling the focus of attention is the organization of the data and its quality issues, while during real analysis it is actual & potential insights; progress is cumulative and incremental, and new data triggers new cycles.]
114. CAPABILITIES FOR VISUAL DISCOVERY & ANALYSIS TOOLS
▸ Explore data corpus
  ▸ via effectively characterized catalog
▸ Explore individual data sets
  ▸ effective preview / sample / subset
▸ Analyze data
  ▸ within ad-hoc data sets, across ad-hoc data sets
▸ Wrangle data
  ▸ within ad-hoc data sets, across ad-hoc data sets
▸ Verify outcomes: insights, models, data products
▸ Synthesize outcomes
  ▸ distinct types = insights, model, data product (project)
▸ Publish outcomes
  ▸ distinct types = insight, data product, model (project)
▸ Integrate specialized / external analytical tools {augment}
  ▸ analysis tools (R, Python), reference models, validation tools
▸ Integrate external workflow tools {enhancing}
  ▸ e.g. figshare, model management, projects
▸ Support analytical workflow {enhancing}
128. Tools on the Market Now
[The mode diagram annotated with current tools: Paxata and Trifacta, the wave 1 wrangling tools now in market; Beyond Core (?), OSS / hand-rolled tooling, and EID 3.x elsewhere on the map; no good exploration tool in market.]
129. Tools on the Market Now
[The same diagram annotated: Alteryx and Datameer, offering modest exploration capabilities.]
130. Tools on the Market Now
[The same diagram annotated: Alteryx and Qlik, offering modest exploration capabilities.]
131. Tools on the Market Now
[The same diagram annotated: Tableau and Platfora, the wave 1 visual analysis tools now in market, offering modest wrangling capabilities.]
135. VISUAL DISCOVERY AND ANALYSIS TOOLS: WAVE 1
Definition: traditional discovery & analysis made possible on Hadoop stores
Value prop = easy access to Hadoop stores for analysts without a data engineer
In / coming to market now: Platfora, Datameer, ClearStory, SiSense, etc.
The segment is viable (people understand the need & have the problem)
Tool maturity will increase incrementally, and in customary ways:
• alignment to workflow particulars
• nuanced and compelling UX
• broader footprint of supporting capabilities: provenance, publishing, collaboration
• integration with the ecosystem of related tools for the activity
This class of tools competes with & may replace / displace existing non-Hadoop-native tools that are still rising with the general analytics wave: Qlik, Tableau, MicroStrategy.
Firms making new investments (for new stacks) will try / buy this new generation; firms extending existing investments are less likely to buy new.
Long view = tools in this segment could 'eat' BI market share by adding reporting and other structured analytical capabilities that capture customers who do not have large BI stacks now, begin investing here, and subsequently need BI capability.
154. DEEP STRUCTURE <> ANALYTICAL WORKFLOW
CHANGE VECTORS <> BIG DATA TECHNOLOGIES
EARLY SIGNALS <> RISE OF DATA SCIENCE
INFLECTION POINTS <> DATA SCIENCE MOMENT
EMERGING SPACES <> EMPIRICAL DISCOVERY
HOLISTIC EXPERIENCES <> VISUAL DISCOVERY TOOL
156. VISUAL DISCOVERY & ANALYSIS TOOLS: WAVE 2
Definition: augmented discovery & analysis across the full business data corpus
Value prop = deeper insights from more diverse data, and faster insights, effected via a mixed toolkit of (semi-)automated analytical techniques (clustering, machine learning, regression / correlation, etc.) that enhances and directs analyst attention
Vectors of augmentation: data types, degree of automation
• data = text / lingual, location / spatial, native graph, native stream
• automation = which specific activities are augmented, and to what degree
Wave 2 is at the 'pioneer' stage: the specifics of capability, value, and implementation are unknown.
Limiting factors:
• Domain specificity: the value of general discovery analytics drops once domain boundaries are reached; tools need to align specifically to the domain's view of the world. Expect verticalization of all analytics.
• Low / no tolerance for black boxes: deeper insights require transparency
• Analytical literacy: levels are increasing, but orgs can't benefit from advanced analytical techniques they don't understand & trust
198. WORKING THE ECOSYSTEM
• Oracle = an ecosystem
• ML = commoditizing
• Someone will ‘generate the electricity’ = provide
ML capability within the Oracle ecosystem
• Everyone’s going to need it…
222. The Language of Discovery
Category: Primary Research, Design Systems
Outcomes: Building on already-published original applied research into information retrieval and usage, the Language of Discovery posits a domain-independent framework describing the activity primitives of discovery in terms of 'modes'. Succeeding professional and industry publications outline the application of this descriptive vocabulary in settings including product design and development, product strategy, and information management.
Reference:
• Russell-Rose, T., Lamantia, J. and Burrell, M. 2011. A Taxonomy of Enterprise Search and Discovery. Proceedings of EuroHCIR 2011, London, UK. http://ceur-ws.org/Vol-763/paper4.pdf
• Russell-Rose, T., Lamantia, J. and Burrell, M. 2011. A Taxonomy of Enterprise Search and Discovery. Proceedings of HCIR 2011, California, USA. https://docs.google.com/a/kent.edu/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxoY2lyd29ya3Nob3B8Z3g6NzdmYjc3OWY2ZjQ2Zjg4MQ
• Russell-Rose, T. and Makri, S. 2012. A Model of Consumer Search Behavior. Proceedings of EuroHCIR 2012, Nijmegen, NL.
• Designing the Search Experience: http://www.amazon.com/Designing-Search-Experience-Information-Architecture/dp/0123969816
• Presentation - Strata: http://conferences.oreilly.com/strata/stratany2012/public/schedule/detail/25411
• Presentation - UX Lisbon conference: http://www.joelamantia.com/user-experience-ux/slides-for-uxlx-talk-the-language-of-discovery-a-grammar-for-designing-big-data-interactions
223. Domain & Market Study: Data Science
Outcomes: Comprehensive portrait of all major facets of a new analytical discipline, including its practices, roles, methodology, tools and technologies, workflows, organizational models, skillsets, alignment with business, areas of innovation, and relation to the landscape of business analytics.
Research outcomes and synthesized insights guided product design, management, and strategy efforts including: opportunity identification and profiling, landscape / competitive modeling, technology lifecycle and evolution models, product discovery, concept creation and evaluation, and prototyping.
Notable aspects: Consistently delivered insights twelve or more months ahead of leading industry analysts pursuing similar agendas.
Artifacts & Synthesis
• Data Science Highlights: http://www.joelamantia.com/user-research/data-science-highlights-an-investigation-of-the-discipline
• Empirical Discovery Concept and Workflow Model: https://blogs.oracle.com/serendipity/entry/empirical_discovery_concept_and_workflow
• Empirical Discovery: A New Discipline: https://blogs.oracle.com/serendipity/entry/data_science_and_empirical_discovery
• Defining Discovery: Core Concepts: https://blogs.oracle.com/serendipity/entry/defining_discovery_core_concepts
• Discovery and the Age of Insight: http://www.joelamantia.com/language-of-discovery/discovery-and-the-age-of-insight
• Big Data Is Not Enough: http://www.joelamantia.com/user-experience-ux/big-data-is-not-the-insight-slides-from-enterprise-search-europe
226. DEEP STRUCTURES
ENTERPRISE / B2B
• Business process
• Activity
• Social structure: Organizational model
• Boundaries
• Regulation
• IT / Systems architecture
• Lifecycle
• Flows: capital, information, people
• Frame: shareholder value, social enterprise
CONSUMER / B2C
• Value scheme: wealth, love,
knowledge, safety
• Demographics
• Boundaries
• Mores
• Culture
• Social structure: community / group
• Frame: active lifestyle, sustainability