Data Ethics in Data Science Education
(plus: Science Data, Responsibly)
Bill Howe
University of Washington
Plan
• context: eScience Institute (1 min)
• context: Data Science MOOC (3 min)
• Vignette on Teaching Data Ethics (5 min)
• Science Data, Responsibly (6 min)
– Automated Curation
– Viziometrics
9/25/2016 Data, Responsibly @ Dagstuhl 2
• People
• Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists)
• Postdocs (~12 at steady state)
• Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates)
• Administrative Staff (Program Managers, Finance, Admin)
• Programs
– Short- and long-term research, education programs (undergrad/master's/PhD),
software, research consulting
– Leadership on all things data science around campus
• Funding
• $700k / yr permanent appropriation from the state of WA
• $32.8M for 5 years jointly with NYU and UC Berkeley from the Gordon and
Betty Moore Foundation and the Alfred P Sloan Foundation to build a “Data
Science Environment”
• $9M for 5 years from the Washington Research Foundation
• $500k / yr from the Provost for half-lines for recruiting in relevant fields
9/25/2016 Bill Howe, UW 4
Data Science Education
Audiences:
• Students
– CS/Informatics: undergrads, grads
– Non-Major: undergrads, grads
• Non-Students: professionals, researchers
(2011) Data Science Certificate
(2013) Data Science MOOC
(2013) NSF IGERT Big Data PhD
(2013) New CS Courses
(2016) Data Science Masters
(2015) Data Sci. for Social Good
Data Ethics being incorporated in all programs
Introduction to Data Science MOOC on Coursera
• Session 1, Spring 2013: 119,504 students
• Session 2, Summer 2014: 121,215 students
Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9000 (typical for a MOOC)
• “Passed”: 7022
• Forum threads: 4661
• Forum posts: 22,900
Fairly consistent with Coursera data across “hard” courses
Define success however you want
– Many love it in parts, start late, don’t turn in homework, etc.
– Learning rather than watching television
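The participation counts are easier to compare as rates. A minimal sketch that uses only the numbers from the slide; the stage labels and the `retention` helper are illustrative names, and "~9000" is treated as exactly 9000.

```python
# Participation counts copied from the slide; "~9000" is treated as 9000.
funnel = [
    ("registered", 119_517),
    ("clicked play in first 2 weeks", 78_589),
    ("turned in 1st homework", 10_663),
    ("completed all assignments", 9_000),
    ("passed", 7_022),
]

def retention(stages):
    """Return (stage, share of first stage, share of previous stage) tuples."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count / first, count / previous))
        previous = count
    return rows

for name, of_registered, of_previous in retention(funnel):
    print(f"{name:32s} {of_registered:6.1%} of registered, "
          f"{of_previous:6.1%} of previous stage")
```

Which denominator counts as "success" is a choice, which is the point of the slide: the share of registrants and the share of students who submitted the first homework tell very different stories.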
Syllabus
• Data Science Landscape (~1 week)
• Data Manipulation at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Analytics
– Statistics Topics (~1 week)
– Machine Learning Topics (~2 weeks)
• Visualization (~1 week)
• Graph Analytics (~1 week)
2015: MOOC Recast as a 4-course “Specialization”
Data Manipulation at Scale
Databases, Systems, Algorithms
Practical Predictive Analytics
Stats (resampling methods, multiple hypothesis testing, more)
ML (rules/trees/forests, ensembles/boosting/bagging, SVMs, GD, eval…)
Communicating Data Science
Visualization, ethics and privacy
Capstone
VIGNETTE ON TEACHING
DATA ETHICS
Alcohol Study, Barrow Alaska, 1979
Native leaders and city officials,
worried about drinking and associated
violence in their community invited a
group of sociology researchers to
assess the problem and work with
them to devise solutions.
Methods
• 10% representative sample (N=88)
of everyone over the age of 15 using
a 1972 demographic survey
• Interviewed on attitudes and values
about use of alcohol
• Obtained psychological histories
including drinking behavior
• Given the Michigan Alcoholism
Screening Test (Seltzer, 1971)
• Asked to draw a picture of a person
– Used to determine cultural identity
Results announced unilaterally and publicly
At the conclusion of the study, the researchers wrote a report entitled “The
Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released
simultaneously in a press release and to the Barrow community. The press
release was picked up by the New York Times, which ran a front-page story
entitled “Alcohol Plagues Eskimos”.
The results of the Barrow Alcohol Study in Alaska were revealed in the context of a
press conference that was held far from the Native village, and without the
presence, much less the knowledge or consent, of any community member who
might have been able to present any context concerning the socioeconomic
conditions of the village. Study results suggested that nearly all adults in the
community were alcoholics. In addition to the shame felt by community members,
the town’s Standard and Poor bond rating suffered as a result, which in turn
decreased the tribe’s ability to secure funding for much needed projects.
Backlash
Methodological Problems
“The authors once again met with the Barrow Technical Advisory
Group, who stated their concern that only Natives were studied,
and that outsiders in town had not been included.”
“The estimates of the frequency of intoxication based on
association with the probability of being detained were termed
‘ludicrous, both logically and statistically.’”
Edward F. Foulks, M.D., Misalliances In The Barrow Alcohol Study
Ethical Problems
• Participants were not in control of their data nor
the context in which they were presented.
• Easy to demonstrate specific, significant harms:
– Social: Stigmatization
– Financial: Bond rating lowered
• Important: Nothing to do with individual privacy
– No PII revealed at any point, to anyone
– No violations of best practices in data handling
– But even those who did not participate in the study
incurred harm
Two Topics
• Social Component: Codes of Conduct
• Technical Component: Managing Sensitive Data
Ethical principles vs. ethical rules
• In the Barrow example, ethical rules were
generally followed
• But ethical principles were violated: The
researchers appear to have placed their own
interests ahead of those of the research
subjects, the client, and society
Principles: Codes of Conduct
• American Statistical Association
– http://www.amstat.org/committees/ethics/
• Certified Analytics Professional
– https://www.certifiedanalytics.org/ethics.php
• Data Science Association
– http://www.datascienceassn.org/code-of-conduct.html
SCIENCE DATA, RESPONSIBLY
Science is a complete mess
• Reproducibility
– Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
– Only about half of 100 psychology studies had effect sizes that approximated
the original result (Science, 2015)
– Ioannidis 2005: Why Most Published Research Findings Are False
– Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups
[Figure from Science, 2015]
Retractions are increasing…
• Fraud
– Diederik Stapel: 38 articles with fictitious data
– Bharat Aggarwal: a huge number of images with evidence of manipulation
Bharat Aggarwal
alleged data manipulation
• Public Trust
– Churn: Chocolate, egg yolks, red meat, red wine, etc.
– Climate change, vaccines
Vision: Validate scientific claims automatically
– Check for manipulation (manipulated images, Benford’s Law)
– Extract claims from papers
– Check claims against the authors’ data
– Check claims against related data sets
– Automatic meta-analysis across the literature + public datasets
• First steps
– Automatic curation: Validate and attach metadata to public datasets
– Longitudinal analysis of the visual literature
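One of the manipulation checks named above, Benford's Law, can be sketched in a few lines. This is an illustrative screen under stated assumptions, not the speaker's implementation: the function names are hypothetical, the comparison uses a plain chi-square statistic, and 15.507 is the standard 5% critical value for 8 degrees of freedom.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def benford_expected(d):
    """Benford's Law: probability that the leading digit is d (1..9)."""
    return math.log10(1 + 1 / d)

def benford_chi_square(values):
    """Chi-square statistic of observed leading digits against Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one nonzero value")
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Critical value of chi-square with 8 degrees of freedom at the 5% level.
CRITICAL_5_PERCENT = 15.507

# Powers of 2 are a classic Benford-like sequence; evenly spread
# three-digit numbers are not.
benford_like = [2 ** k for k in range(1, 201)]
uniform_like = list(range(100, 1000))
print(benford_chi_square(benford_like) < CRITICAL_5_PERCENT)
print(benford_chi_square(uniform_like) < CRITICAL_5_PERCENT)
```

A large statistic only flags a dataset for a closer look. Many legitimate datasets (bounded measurements, assigned identifiers) do not follow Benford's Law at all, so a check like this is a screen rather than evidence of manipulation.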
“DEEP” CURATION
Science Data, Responsibly
Microarray experiments
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the
bottleneck to data sharing
Maxim Gretchkin, Hoifung Poon
[Figure: color = labels supplied as metadata; clusters = first two PCA dimensions of the gene expression data itself]
Can we use the expression data
directly to curate algorithmically?
The expression data
and the text labels
appear to disagree
Deep Curation
[Diagram: expression data, free-text metadata, and domain knowledge (ontology) feed 2 deep networks (text, expr) and an SVM to produce better tissue type labels]
Distant supervision and co-learning between a text-based classifier and an
expression-based classifier: both models improve by training on each other's
results.
Free-text classifier
Expression classifier
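The co-learning idea can be illustrated with a generic co-training loop. This is a toy sketch, not the Deep Curation system: the slide describes two deep networks, while the code below swaps in nearest-centroid classifiers, made-up 2-D features, and hypothetical names (`co_train`, `min_margin`). Seed labels stand in for whatever distant supervision provides.

```python
import math

def fit_centroids(points, labels):
    """Nearest-centroid model: one mean vector per label."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, point):
    """Return (label, margin); margin is the distance gap between the
    closest and second-closest centroid, used as a crude confidence."""
    ranked = sorted((math.dist(point, c), label) for label, c in model.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin

def fit_view(view, labels):
    """Fit a model on the samples of one view that currently have labels."""
    idx = list(labels)
    return fit_centroids([view[i] for i in idx], [labels[i] for i in idx])

def co_train(text_view, expr_view, seed_labels, rounds=5, min_margin=0.5):
    """Each model labels samples it is confident about for the other model."""
    text_labels = dict(seed_labels)  # training labels for the text model
    expr_labels = dict(seed_labels)  # training labels for the expression model
    for _ in range(rounds):
        text_model = fit_view(text_view, text_labels)
        expr_model = fit_view(expr_view, expr_labels)
        added = 0
        for i in range(len(text_view)):
            if i not in expr_labels:
                label, margin = predict(text_model, text_view[i])
                if margin >= min_margin:
                    expr_labels[i] = label
                    added += 1
            if i not in text_labels:
                label, margin = predict(expr_model, expr_view[i])
                if margin >= min_margin:
                    text_labels[i] = label
                    added += 1
        if added == 0:
            break
    return (fit_view(text_view, text_labels), fit_view(expr_view, expr_labels),
            text_labels, expr_labels)

# Toy data: samples 2-3 are clear in the text view but ambiguous in the
# expression view; samples 4-5 are the other way around.
text_view = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9), (0.55, 0.45), (0.45, 0.55)]
expr_view = [(5.0, 1.0), (1.0, 5.0), (3.1, 2.9), (2.9, 3.1), (4.8, 1.2), (1.2, 4.8)]
truth = ["liver", "brain", "liver", "brain", "liver", "brain"]
seeds = {0: "liver", 1: "brain"}

text_model, expr_model, text_labels, expr_labels = co_train(text_view, expr_view, seeds)
print(sorted(text_labels), sorted(expr_labels))
```

The design choice to note is that each model only receives labels the other model is confident about, so each view contributes where it is strong and stays silent where it is ambiguous.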
Deep Curation:
Our stuff wins, with no training data
[Figure legend: state of the art; our reimplementation of the state of the art; our dueling pianos NN. Axis: amount of training data used]
VIZIOMETRICS:
COMPREHENDING VISUAL INFORMATION
IN THE SCIENTIFIC LITERATURE
Human-Data Interaction
Step 1: Dismantling Composite Figures
Poshen Lee, ICPRAM 2015
Do high-impact papers have fewer
equations, as indicated by Fawcett and
Higginson? (Yes)
Poshen Lee, Jevin West
[Figure: high-impact papers vs. low-impact papers]
Do high-impact papers have more
diagrams? (Yes)
Poshen Lee, Jevin West
TEACHING
DATA ETHICS IN DATA SCIENCE
Lectures
• Data Science Context and Case Studies (~1 week)
• Data Management at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Topics in Analytics
– Permutation Methods, Bayesian Methods (~1 week)
– Machine Learning Algorithms and Evaluation (~1 week)
• Visualization (~1 week)
• Graph Analytics (~1 week)
• Guest Lectures
Who took the course?
What programming language do you typically use?
[Figure: Attrition, video lectures. Number of students watching videos by segment, ordered by time]
[Figure: Attrition, assignments. Number of students completing assignments by part (Twitter 1-6, Database 1-9, MapReduce 1-6, Kaggle, Tableau)]
Who took the course?
In a directory with 1000 text files, you are asked to
create a list of files that contain the word Drosophila
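For reference, one way a respondent might answer this question. A minimal sketch, assuming the files end in `.txt` and that a case-sensitive substring match is what is wanted; the function name is illustrative. On a Unix shell, `grep -l Drosophila *.txt` does much the same thing.

```python
from pathlib import Path

def files_containing(directory, word):
    """Return the sorted names of .txt files in `directory` that contain `word`."""
    matches = []
    for path in sorted(Path(directory).glob("*.txt")):
        # Ignore undecodable bytes so one odd file does not stop the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if word in text:
            matches.append(path.name)
    return matches

if __name__ == "__main__":
    for name in files_containing(".", "Drosophila"):
        print(name)
```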
What if you were given a billion documents spread across many
computers and asked to count the occurrences of a given phrase?
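The scaled-up version of the question is the MapReduce pattern from the syllabus: count locally on each machine, then add up the partial counts. The sketch below simulates this on one machine with a list of partitions; the function names are illustrative, and in a real cluster the per-partition step would run in parallel on the machines that hold the documents.

```python
def map_partition(documents, phrase):
    """Map step: count non-overlapping occurrences of `phrase` in the
    documents stored on one machine."""
    return sum(doc.count(phrase) for doc in documents)

def count_phrase(partitions, phrase):
    """Reduce step: add up the partial counts from every partition."""
    partial_counts = [map_partition(docs, phrase) for docs in partitions]
    return sum(partial_counts)

# Each inner list stands in for the documents held by one computer.
partitions = [
    ["data science is fun", "no match here"],
    ["data science and more data science"],
    [],
]
print(count_phrase(partitions, "data science"))
```

Only one integer per partition crosses the network, which is why the approach scales to a billion documents: the documents themselves never move.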
“I left the company I co-founded in 2005 to do data
analytics with Wibidata, with whom I was introduced
as a result of their guest lecture in your course.

Big Data Curricula at the UW eScience Institute, JSM 2013
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
 
Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce
 
Visual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceVisual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory Science
 
A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScience
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and Analysis
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
 

Dernier

Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 

Dernier (20)

Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Science Data, Responsibly

  • 1. Data Ethics in Data Science Education (plus: Science Data, Responsibly) Bill Howe University of Washington
  • 2. Plan • context: eScience Institute (1 min) • context: Data Science MOOC (3 min) • Vignette on Teaching Data Ethics (5 min) • Science Data, Responsibly (6 min) – Automated Curation – Viziometrics 9/25/2016 Data, Responsibly @ Dagstuhl 2
  • 3. • People • Research Staff (~4 100% Data Scientists, ~4 50% Research Scientists) • Postdocs (~12 at steady state) • Faculty (~9 Exec Committee, ~20 Steering Committee, ~100 Affiliates) • Administrative Staff (Program Managers, Finance, Admin) • Programs – Short and long-term research, education programs ugrad/masters/PhD, software, research consulting – Leadership on all things data science around campus • Funding • $700k / yr permanent appropriation from the state of WA • $32.8M for 5 years jointly with NYU and UC Berkeley from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to build a “Data Science Environment” • $9M for 5 years from the Washington Research Foundation • $500k / yr from the Provost for half-lines for recruiting in relevant fields
  • 5. Data Science Education • Audiences: Students (CS/Informatics, Non-Major; undergrads, grads) and Non-Students (professionals, researchers) • Programs: (2011) Data Science Certificate; (2013) Data Science MOOC; (2013) NSF IGERT Big Data PhD; (2013) New CS Courses; (2016) Data Science Masters; (2015) Data Sci. for Social Good • Data Ethics being incorporated in all programs
  • 6. Session 2 Summer 2014 121,215 students Session 1 Spring 2013 119,504 students Introduction to Data Science MOOC on Coursera
  • 7. Participation numbers • “Registered”: 119,517 (totally irrelevant) • Clicked play in first 2 weeks: 78,589 • Turned in 1st homework: 10,663 • Completed all assignments: ~9000 (typical for a MOOC) • “Passed”: 7022 • Forum threads: 4661 • Forum posts: 22,900 • Fairly consistent with Coursera data across “hard” courses • Define success however you want – Many love it in parts, start late, don’t turn in homework, etc. – Learning rather than watching television
  • 8. Syllabus • Data Science Landscape (~1 week) • Data Manipulation at Scale – Relational Databases (~1 week) – MapReduce (~1 week) – NoSQL (~1 week) • Analytics – Statistics Topics (~1 week) – Machine Learning Topics (~2 weeks) • Visualization (~1 week) • Graph Analytics (~1 week)
  • 9. 2015: MOOC Recast as a 4-course “Specialization” Data Manipulation at Scale Databases, Systems, Algorithms Practical Predictive Analytics Stats (resampling methods, multiple hypothesis testing, more) ML (rules/trees/forests, ensembles/boosting/bagging, SVMs, GD, eval…) Communicating Data Science Visualization, ethics and privacy Capstone
  • 10. VIGNETTE ON TEACHING DATA ETHICS
  • 11. Alcohol Study, Barrow Alaska, 1979 Native leaders and city officials, worried about drinking and associated violence in their community invited a group of sociology researchers to assess the problem and work with them to devise solutions.
  • 12. Methods • 10% representative sample (N=88) of everyone over the age of 15 using a 1972 demographic survey • Interviewed on attitudes and values about use of alcohol • Obtained psychological histories including drinking behavior • Given the Michigan Alcoholism Screening Test (Seltzer, 1971) • Asked to draw a picture of a person – Used to determine cultural identity
  • 13. Results announced unilaterally and publicly. At the conclusion of the study, the researchers formulated a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released simultaneously in a press release and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled “Alcohol Plagues Eskimos”.
  • 14. Backlash: The results of the Barrow Alcohol Study in Alaska were revealed in the context of a press conference that was held far from the Native village, and without the presence, much less the knowledge or consent, of any community member who might have been able to present any context concerning the socioeconomic conditions of the village. Study results suggested that nearly all adults in the community were alcoholics. In addition to the shame felt by community members, the town’s Standard & Poor’s bond rating suffered as a result, which in turn decreased the tribe’s ability to secure funding for much-needed projects.
  • 15. Methodological Problems “The authors once again met with the Barrow Technical Advisory Group, who stated their concern that only Natives were studied, and that outsiders in town had not been included.” “The estimates of the frequency of intoxication based on association with the probability of being detained were termed ‘ludicrous, both logically and statistically.’” Edward F. Foulks, M.D., “Misalliances in the Barrow Alcohol Study”
  • 16. Ethical Problems • Participants were not in control of their data nor the context in which they were presented. • Easy to demonstrate specific, significant harms: – Social: Stigmatization – Financial: Bond rating lowered • Important: Nothing to do with individual privacy – No PII revealed at any point, to anyone – No violations of best practices in data handling – But even those who did not participate in the study incurred harm
  • 17. Two Topics • Social Component: Codes of Conduct • Technical Component: Managing Sensitive Data
  • 18. Ethical principles vs. ethical rules • In the Barrow example, ethical rules were generally followed • But ethical principles were violated: The researchers appear to have placed their own interests ahead of those of the research subjects, the client, and society
  • 19. Principles: Codes of Conduct • American Statistical Association – http://www.amstat.org/committees/ethics/ • Certified Analytics Professional – https://www.certifiedanalytics.org/ethics.php • Data Science Association – http://www.datascienceassn.org/code-of-conduct.html
  • 21. Science is a complete mess • Reproducibility – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015) – Ioannidis 2005: Why Most Published Research Findings Are False – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups
  • 23. Retractions are increasing…
  • 24. Science is a complete mess • Reproducibility – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015) – Ioannidis 2005: Why Most Published Research Findings Are False – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups • Fraud – Diederik Stapel: 38 articles with fictitious data – Bharat Aggarwal: a huge number of images with evidence of manipulation
  • 26. Science is a complete mess • Reproducibility – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015) – Ioannidis 2005: Why Most Published Research Findings Are False – Reinhart & Rogoff: global economic policy based on spreadsheet fuck ups • Fraud – Diederik Stapel: 38 articles with fictitious data – Bharat Aggarwal: a huge number of images with evidence of manipulation • Public Trust – Churn: Chocolate, egg yolks, red meat, red wine, etc. – Climate change, vaccines
  • 29. Vision: Validate scientific claims automatically – Check for manipulation (manipulated images, Benford’s Law) – Extract claims from papers – Check claims against the authors’ data – Check claims against related data sets – Automatic meta-analysis across the literature + public datasets • First steps – Automatic curation: Validate and attach metadata to public datasets – Longitudinal analysis of the visual literature
  • 32. Microarray samples submitted to the Gene Expression Omnibus. Curation is fast becoming the bottleneck to data sharing. (Maxim Grechkin, Hoifung Poon)
  • 33. color = labels supplied as metadata; clusters = 1st two PCA dimensions on the gene expression data itself. The expression data and the text labels appear to disagree. Can we use the expression data directly to curate algorithmically? (Maxim Grechkin, Hoifung Poon)
  • 34. Better Tissue Type Labels: Domain knowledge (Ontology); Expression data; Free-text Metadata; 2 Deep Networks (text, expr); SVM. (Maxim Grechkin, Hoifung Poon)
  • 35. Deep Curation: Distant supervision and co-learning between a text-based classifier and an expression-based classifier. Both models improve by training on each other’s results. Free-text classifier; Expression classifier. (Maxim Grechkin, Hoifung Poon)
  • 36. Deep Curation: Our stuff wins, with no training data. [Figure legend: state of the art; our reimplementation of the state of the art; our dueling pianos NN. Axis label: amount of training data used] (Maxim Grechkin, Hoifung Poon)
  • 37. VIZIOMETRICS: COMPREHENDING VISUAL INFORMATION IN THE SCIENTIFIC LITERATURE Human-Data Interaction
  • 38. Step 1: Dismantling Composite Figures Poshen Lee ICPRAM 2015
  • 39. Do high-impact papers have fewer equations, as indicated by Fawcett and Higginson? (Yes) [Figure: high impact papers vs. low impact papers] (Poshen Lee, Jevin West)
  • 40. Do high-impact papers have more diagrams? (Yes) (Poshen Lee, Jevin West)
  • 43. TEACHING DATA ETHICS IN DATA SCIENCE
  • 44. Session 2 Summer 2014 121,215 students Session 1 Spring 2013 119,504 students
  • 45. Participation numbers • “Registered”: 119,517 (totally irrelevant) • Clicked play in first 2 weeks: 78,589 • Turned in 1st homework: 10,663 • Completed all assignments: ~9000 (typical for a MOOC) • “Passed”: 7022 • Forum threads: 4661 • Forum posts: 22,900 • Fairly consistent with Coursera data across “hard” courses • Define success however you want – Many love it in parts, start late, don’t turn in homework, etc. – Learning rather than watching television
  • 46. Lectures • Data Science Context and Case Studies (~1 week) • Data Management at Scale – Relational Databases (~1 week) – MapReduce (~1 week) – NoSQL (~1 week) • Topics in Analytics – Permutation Methods, Bayesian Methods (~1 week) – Machine Learning Algorithms and Evaluation (~1 week) • Visualization (~1 week) • Graph Analytics (~1 week) • Guest Lectures
  • 47. Who took the course?
  • 48. Who took the course?
  • 49. Who took the course? What programming language do you typically use?
  • 52. [Figure: Attrition, video lectures. Number of students watching videos by segment, ordered by time; y-axis 0 to 50,000, x-axis segments 1 to 96.]
  • 53. [Figure: Attrition, assignments. Number of students completing assignments by part (Twitter 1-6, Database 1-9, MapReduce 1-6, Kaggle, Tableau); y-axis 0 to 18,000.]
  • 55. Who took the course? In a directory with 1000 text files, you are asked to create a list of files that contain the word Drosophila
  • 56. Who took the course? What if you were given a billion documents spread across many computers and asked to count the occurrences of a given phrase?
  • 57. “I left the company I co-founded in 2005 to do data analytics with Wibidata, to whom I was introduced as a result of their guest lecture in your course.”
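
Slide 29 lists Benford’s Law among the checks for manipulation. The talk does not say how such a check would be implemented, so the following is only a minimal Python sketch under stated assumptions: the function names are hypothetical, and the chi-square statistic against the expected first-digit frequencies is one common way to quantify the deviation, not necessarily the one the author has in mind.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading nonzero digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):.15e}"[0])


def benford_chi_square(values):
    """Chi-square statistic of observed first digits against Benford's Law.

    Illustrative helper (not from the talk). Zeros are skipped because they
    have no leading nonzero digit.
    """
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )


if __name__ == "__main__":
    # Powers of two are a standard example of a Benford-like sequence;
    # the integers 100..999 have uniformly distributed first digits.
    print(benford_chi_square([2 ** k for k in range(1, 1001)]))
    print(benford_chi_square(range(100, 1000)))
```

A large statistic only flags a dataset for a closer look. Many legitimate datasets (bounded measurements, assigned identifiers) do not follow Benford’s Law, so a deviation on its own is not evidence of manipulation.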

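The question on slide 56 (counting occurrences of a phrase across documents spread over many computers) is the kind of task the course’s MapReduce unit addresses. As a rough, single-process Python sketch of the map and reduce steps: the partitioning into lists, the function names, and the case-insensitive matching are all assumptions made for illustration, and a real deployment would run the per-partition step on separate machines.

```python
from functools import reduce


def count_in_document(document, phrase):
    """Map step: non-overlapping, case-insensitive occurrences in one document."""
    return document.lower().count(phrase.lower())


def count_in_partition(documents, phrase):
    """Local aggregation: total for the documents held by one (hypothetical) machine."""
    return sum(count_in_document(doc, phrase) for doc in documents)


def count_phrase(partitions, phrase):
    """Reduce step: combine the per-partition totals into one global count."""
    partial_counts = [count_in_partition(p, phrase) for p in partitions]
    return reduce(lambda a, b: a + b, partial_counts, 0)


if __name__ == "__main__":
    partitions = [
        ["Drosophila melanogaster is a fruit fly.", "No flies here."],
        ["drosophila, Drosophila!"],
    ]
    print(count_phrase(partitions, "Drosophila"))
```

Because addition is associative and commutative, the partial counts can be combined in any order, which is what lets the work be spread across machines.
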
Editor's notes

  1. We use this device to talk about this idea: the pi-shaped researcher.
  2. Native leaders and city officials in Barrow, Alaska, worried about drinking and the associated violence and accidental deaths in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions. At the conclusion of the study, the researchers formulated a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released simultaneously in a press release and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled Alcohol Plagues
  3. Responsibility to which parties? Society; Employers and Clients; Colleagues; Research Subjects. ASA: Professionalism; Responsibilities to Funders, Clients, Employers; Responsibilities in Publications and Testimony; Responsibilities to Research Subjects; Responsibilities to Research Team Colleagues; Responsibilities to Other Statisticians or Statistical Practitioners; Responsibilities Regarding Allegations of Misconduct; Responsibilities of Employers. Code of Conduct: Rules: Competence; Do what your client asks, unless it violates the law; Communication with clients; Confidential information; Conflicts of interest; Rule 7: More on conflicts of interest and confidentiality; Rule 8: Scientific integrity. Interesting: If a data scientist reasonably believes a client is misusing data science to communicate a false reality or promote an illusion of understanding, the data scientist shall take reasonable remedial measures, including disclosure to the client, and including, if necessary, disclosure to the proper authorities. The data scientist shall take reasonable measures to persuade the client to use data science appropriately. Rule 9: Misconduct (follow the rules)
  4. This week we’re going to talk about estimation and prediction. I want to begin with a non-research article from 2010 by Jonah Lehrer. In this article, the author describes cases where once-promising research results become weaker over time: they become harder to replicate, or the effect size becomes smaller. He quotes John Davis speaking about the efficacy of antidepressants, saying… He talks about Anders Moller, a biologist who made an important discovery based on precise measurements of symmetry in the plumage of barn swallows, only to find the effect size shrank by 80 percent in the studies following the initial paper. Jonathan Schooler made a discovery he called verbal overshadowing, which showed, counter-intuitively, that talking about someone’s face made it harder to recognize later rather than easier. But this effect too became weaker over time. Back in the 1930s, Joseph Rhine, a researcher at Duke University who coined the terms parapsychology and extrasensory perception, reported data showing that some individuals could correctly guess the symbols on special cards without seeing them in remarkably long streaks. But the same individuals’ performance would decline over time. He called it the decline effect. What’s going on? The article offers some sensible and some not-so-sensible ideas about the root cause.
  5. One culprit is publication bias (Joober et al., 2012). You can’t roll the dice a bunch of times then yell “Yahtzee!”
  6. Here’s a simulation of what Rhine in the 1930s referred to as the decline effect. As the study size increases, the effect size diminishes. Other metrics on the x and y axes are possible: x-axis might be improvements in experimental design, y-axis might be statistical significance. The units of effect size will be application specific – number of smokers who quit, number of T-cells in the blood, amount of ad revenue generated, etc. Something that measures how “good” the result is.
  7. You can’t roll the dice a bunch of times then yell “Yahtzee!”
  8. Google knowledge graph Specialized Ontologies
  9. "HeLa", "K562", "MCF-7" and "brain tumor"; PCA on expression values
  10. Google knowledge graph: common knowledge, high redundancy, possibly crowdsourcing (visual: question answering via Google). Text features: presence of ontology terms, sibling of ontology term. Expression features.
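
Notes 5 and 6 describe publication bias and a simulation in which the effect size diminishes as the study size increases. The original simulation is not included in the notes, so the following Python sketch is an assumption-laden stand-in rather than the author’s code: it assumes a true effect of zero, unit-variance Gaussian noise, and a “publish only if significantly positive” filter with a z threshold of 1.96. All names and parameter values are hypothetical.

```python
import random
import statistics


def simulate_published_effects(n_per_study, n_studies=2000, true_effect=0.0,
                               z_crit=1.96, seed=0):
    """Return the observed effect sizes of the studies that get 'published'.

    Each study draws n_per_study noisy measurements around true_effect and is
    published only if its mean is significantly above zero (assumed filter).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    published = []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        mean = statistics.fmean(sample)
        std_err = statistics.stdev(sample) / n_per_study ** 0.5
        if mean / std_err > z_crit:
            published.append(mean)
    return published


if __name__ == "__main__":
    # With no real effect, small studies only clear the significance bar when
    # they overshoot by a lot, so their published effects look largest.
    for n in (10, 40, 160):
        effects = simulate_published_effects(n)
        print(n, len(effects), round(statistics.fmean(effects), 3))
```

Under these assumptions the average published effect shrinks as the study size grows, even though the true effect never changes, which is one way selective reporting can produce an apparent decline effect.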