ML practitioners and advocates are increasingly finding themselves becoming gatekeepers of the modern world. The models you create have the power to get people arrested or vindicated, to get loans approved or rejected, to determine the interest rate charged on those loans, who shows up in your Tinder queue, what news you read, who gets called for a job phone screen, or even who is admitted to college... the list goes on. My goal in this talk is to summarize the kinds of disparate outcomes caused by cargo-cult machine learning, and recent academic efforts to address some of them.
22. Discrimination Based on Redundant Encoding
Even if an explicit feature like Feature4231:Race='Black' is dropped, training on Features = {'loc', 'income', ...} with a polynomial kernel of degree 2 produces cross terms such as Feature6578:Loc='EastOakland'^Income='<10k' that redundantly encode race.
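The mechanism above can be made concrete; a minimal sketch (with hypothetical feature names) showing how a degree-2 expansion implicitly creates every pairwise interaction of the raw features:

```python
from itertools import combinations_with_replacement

def poly2_features(features):
    """Return all degree-2 interaction terms of a categorical feature dict."""
    items = sorted(features.items())
    return {f"{k1}={v1}^{k2}={v2}": 1
            for (k1, v1), (k2, v2) in combinations_with_replacement(items, 2)}

# Race is never an input feature...
applicant = {"loc": "EastOakland", "income": "<10k"}
expanded = poly2_features(applicant)
# ...yet the interaction 'income=<10k^loc=EastOakland' can act as a
# proxy for race all the same.
```

The same effect occurs implicitly inside any kernel machine with a polynomial kernel: the cross terms exist in the induced feature space even though no one wrote them down.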
33. How can we characterize fairness?
What does fairness even mean?
Group Fairness vs. Individual Fairness
35. How can we characterize fairness?
One way to characterize group fairness is to ensure that both the majority and the protected population have similar outcomes:
P(FavorableOutcome | S) : P(FavorableOutcome | Sc) = 1 : 1
This is often hard to achieve.
For example, for jobs, the EEOC specifies this ratio should be no less than 0.8 : 1 (aka the 80% rule).
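The ratio above is straightforward to compute from labeled outcomes; a minimal sketch with hypothetical hiring data:

```python
def disparate_impact(outcomes, protected):
    """Ratio of favorable-outcome rates: protected class vs. the rest."""
    fav_s = [o for o, p in zip(outcomes, protected) if p]
    fav_sc = [o for o, p in zip(outcomes, protected) if not p]
    return (sum(fav_s) / len(fav_s)) / (sum(fav_sc) / len(fav_sc))

# Hypothetical hiring decisions: 1 = favorable outcome
outcomes  = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1]
protected = [True, True, True, True,
             False, False, False, False, False, False]

ratio = disparate_impact(outcomes, protected)  # 0.5 / (5/6) = 0.6
passes_80_rule = ratio >= 0.8                  # fails the EEOC 80% rule
```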
36. Characterizing Fairness of a black-box classifier
One way: is the classifier outcome correlated with membership in S?
37. Fairness as a constraint
Is the classifier outcome correlated with membership in S?
Sensitive attributes: z
Decision function: d_θ(x)
Want: z and d_θ(x) (approximately) uncorrelated
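In the spirit of the fairness-constraints work of Zafar et al. (see the reading list), the lack of correlation can be checked empirically; a rough sketch with toy data:

```python
def decision_covariance(z, scores):
    """Empirical covariance between a 0/1 sensitive attribute z and the
    classifier's signed decision scores."""
    mz = sum(z) / len(z)
    ms = sum(scores) / len(scores)
    return sum((zi - mz) * (si - ms) for zi, si in zip(z, scores)) / len(z)

z      = [0, 0, 1, 1]                 # hypothetical sensitive attribute
scores = [1.0, 2.0, -1.0, -2.0]       # hypothetical decision scores
cov = decision_covariance(z, scores)  # -0.75: strongly anti-correlated
# A fairness constraint would push |cov| toward 0 during training.
```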
40. "If we allowed a model to be used for college admissions in 1870, we'd still have 0.7% of women going to college."
Recommended Reading
41. Reading List
There’s much material on fairness in data-driven decision/policy making from
literature in
- law
- sociology
- political science
- computer science/machine learning
- economics
(the machine learning literature is nascent, dating only from around 2009 onwards)
42. Reading List (Fairness in ML)
Pedreschi, Dino, Salvatore Ruggieri, and Franco Turini. "Measuring Discrimination in Socially-Sensitive Decision Records." SDM, 2009.
Kamiran, Faisal, and Toon Calders. "Classifying without Discriminating." 2nd International Conference on Computer, Control and Communication (IC4), IEEE, 2009.
Dwork, Cynthia, et al. "Fairness through Awareness." Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ACM, 2012.
Romei, Andrea, and Salvatore Ruggieri. "A Multidisciplinary Survey on Discrimination Analysis." The Knowledge Engineering Review 29.05 (2014).
43. Reading List (Fairness in ML)
Friedler, Sorelle, Carlos Scheidegger, and Suresh Venkatasubramanian. "Certifying and Removing Disparate Impact." CoRR (2014).
Barocas, Solon, and Andrew D. Selbst. "Big Data's Disparate Impact" (August 14, 2015). California Law Review, Vol. 104.
Zafar, Muhammad Bilal, et al. "Fairness Constraints: A Mechanism for Fair Classification." arXiv preprint arXiv:1507.05259 (2015).
Zliobaite, Indre. "On the Relation between Accuracy and Fairness in Binary Classification." arXiv preprint arXiv:1505.05723 (2015).
44. Other resources
NSF’s “Big Data Innovation Hubs” were created in part to address these
challenges
http://www.nsf.gov/news/news_summ.jsp?cntn_id=136784
Stanford Law Review touches upon this topic regularly
http://www.stanfordlawreview.org/online/privacy-and-big-data
Fairness blog
http://fairness.haverford.edu
Academic: FATML workshops (NIPS 2014, ICML 2015)
www.fatml.org
45. Lessons
- Discrimination is an emergent property of any learning algorithm
- Watch out for discrimination (implicitly) encoded in features
- Big Data can cause Big Problems
- Watch out for the proportion of the protected classes
- Always do error analysis with the protected classes in mind
- Notions of fairness are nascent at best. Involve as many people as possible to improve understanding.
- There is no one best notion of fairness
Say you are building a dating app like Tinder. This involves solving a recommendation problem: recommend a set of profiles to be rated thumbs-up/thumbs-down. This is a fairly straightforward collaborative filtering problem, and we can build very performant models because, in addition to the historical rating data, we also have very rich profile data.
Note: I’m using Tinder as a convenient example. This work has nothing to do with the Match Group or their dating app Tinder.
Let's say our goal is to improve the following two business/engagement metrics:
- % of right swipes
- % of matches (a match happens when two people right-swipe on each other)
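Both metrics can be computed directly from a swipe log; a toy sketch with a hypothetical schema:

```python
# Toy swipe log (hypothetical schema): (swiper, swipee, right_swipe)
swipes = [
    ("a", "b", True), ("b", "a", True),   # mutual right swipes -> a match
    ("a", "c", True), ("c", "a", False),
    ("b", "c", False),
]

right_swipe_rate = sum(r for _, _, r in swipes) / len(swipes)  # 3/5

rights = {(s, t) for s, t, r in swipes if r}
matches = {frozenset(p) for p in rights if (p[1], p[0]) in rights}
# Fraction of right swipes that ended in a match:
match_rate = 2 * len(matches) / len(rights)  # 2/3
```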
A good question to ask yourself: if this were a newspaper personals ad, would mentioning "looking for white guys/girls only" be tolerated? Would a newspaper publish such an ad? If not, how can we build apps that essentially enable that?
“But that is what the people want! I am race blind, but I have to give my users what they want”
People wanted segregated bathrooms at some point. Perhaps some people still do. But as a society, as a collective conscience, we agreed that it is not in the interest of minorities and vulnerable classes, and that it promotes discrimination.
So why are we okay with building the online equivalent of segregated bathrooms?
This disparity can arise not just from machine learning, but in any kind of data-driven policy making, which is becoming the norm today. Consider a city fixing potholes.
Imagine if there was an app to report the potholes, and the city could send somebody to fix them. Crowdsourcing for efficient governance. Sounds like a good idea, right?
What if most of the complaints came from well-to-do neighborhoods, because they complain about every little thing, and most of the limited city road-repair resources were diverted to these well-to-do neighborhoods?
The disparity caused by the Street Bump app was actually noted in this report commissioned by the White House in early 2014.
From the report abstract
Among the many calls to action, expanding technical expertise was identified as a major one.
Speaking of expanding technical expertise, consider this piece on predictive policing by Gillian Tett, which appeared three months after the White House report came out.
Gillian Tett is an experienced journalist who railed against Wall Street quants for thoughtlessly deploying models.
The same Gillian Tett now writes this about Chicago's predictive policing: "The program has nothing to do with race but multi-variable equations."
Today’s ML landscape is changing as we speak. Things are scaling in many dimensions.
Scaling dimension 1: Data collection and processing
It's super easy to collect troves of data of all kinds, and very cheap and fast to store and process it.
Scaling dimension 2: Tools
Today, pretty much anyone with basic programming skills (not even algorithmic skills) can build a predictive application. With startups offering ML as a service, all you need is a REST endpoint and basic JS coding to build a model and serve predictions.
Scaling dimension 3: Data Scientist Factories
These are filling a need, but I am not sure this is the right way to fill it.
Statistics has long been a rigorous discipline. That hasn't stopped cargo-cult statisticians from popping up and doing bad statistics. Similarly, ML used to be a very academic discipline, with practitioners having insight into the models they were training. Today that's no longer the case.
The curricula in most of these "data scientist factories" are questionable at best (eight weeks to learn ML/NLP/DataViz in one of them), and none of them make time for the critical aspects of model thinking. What is worrisome is that graduates from these places go on to work at places that build applications affecting people's lives. At the very least, an awareness of the ethical issues of machine learning should be incorporated into every ML curriculum.
Let's say there's a sensitive attribute or set of attributes (like gender, race, etc.) that partitions a population into a protected class (or minority class) and a majority class.
Sometimes information about the sensitive attributes can "leak in" from other data sources, even if it is not explicitly encoded. For example, someone's last name could divulge ethnicity, or an individual's location could correlate with race. Or, in this example, knowing who your Facebook friends are, and their sexual orientation, tells a lot about your sexual orientation!
http://firstmonday.org/article/view/2611/2302
Often sensitive attributes like race and gender are redundantly encoded in other variables. For example, the text in tweets can reveal many demographic variables
http://www.fastcompany.com/1769217/you-cant-keep-your-secrets-twitter
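One way to audit for this kind of redundant encoding (a sketch with made-up data): check how well a single non-sensitive feature predicts the sensitive attribute. Here we use per-value majority voting; accuracy well above the base rate signals leakage.

```python
from collections import Counter, defaultdict

def proxy_accuracy(feature_vals, sensitive_vals):
    """Accuracy of predicting the sensitive attribute from one feature
    via per-value majority vote."""
    by_value = defaultdict(list)
    for f, s in zip(feature_vals, sensitive_vals):
        by_value[f].append(s)
    correct = sum(Counter(g).most_common(1)[0][1] for g in by_value.values())
    return correct / len(sensitive_vals)

# Hypothetical data: neighborhood strongly predicts race
neighborhood = ["East", "East", "East", "West", "West", "West"]
race         = ["B",    "B",    "W",    "W",    "W",    "W"]
acc = proxy_accuracy(neighborhood, race)             # 5/6, well above
base_rate = max(Counter(race).values()) / len(race)  # the 4/6 base rate
```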
Big data can cause big problems
The very thing that makes big data helpful in driving down error rates and producing better models also explains why models perform poorly on minority populations: the number of training samples from the minority is dwarfed by the majority. One could try balancing the number of training samples against the majority, but that in turn creates an accuracy-fairness tradeoff.
https://en.wikipedia.org/wiki/Nymwars
If your classifier has 95% accuracy and you deploy it in production, the 5% of errors might be concentrated on a big chunk of the minority population.
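This is why per-group error analysis matters; a minimal sketch with hypothetical labels where headline accuracy hides a coin-flip minority error rate:

```python
def per_group_accuracy(y_true, y_pred, group):
    """Classifier accuracy broken down by group membership."""
    accs = {}
    for g in set(group):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, group) if gg == g]
        accs[g] = sum(t == p for t, p in pairs) / len(pairs)
    return accs

# Hypothetical labels: 18 majority samples (all correct), 2 minority (1 wrong)
y_true = [1] * 20
y_pred = [1] * 19 + [0]
group  = ["majority"] * 18 + ["minority"] * 2

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.95
by_group = per_group_accuracy(y_true, y_pred, group)  # minority: only 0.5
```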
Not exhaustive, but this should provide a good seed set. The ones in bold are must-reads.