# Incorporating Interpretable Output Constraints in Bayesian Neural Networks

NeurIPS 2020

Abstract

Domains where supervised models are deployed often come with task-specific constraints, such as prior expert knowledge on the ground-truth function, or desiderata like safety and fairness. We introduce a novel probabilistic framework for reasoning with such constraints and formulate a prior that enables us to effectively incorporate the…

Introduction

- In domains where predictive errors are prohibitively costly, the authors desire models that can both capture predictive uncertainty as well as enforce prior human expertise or knowledge.
- Recent work has addressed the challenge of incorporating richer functional knowledge into BNNs, such as preventing miscalibrated model predictions out-of-distribution [9], enforcing smoothness constraints [2] or specifying priors induced by covariance structures in the dataset [25, 19].
- Unlike other types of functional beliefs, output constraints are intuitive and interpretable to specify.

Highlights

- In domains where predictive errors are prohibitively costly, we desire models that can both capture predictive uncertainty as well as enforce prior human expertise or knowledge
- Our contributions are: (a) we present a formal framework that lays out what it means to learn from output constraints in the probabilistic setting that Bayesian neural networks (BNNs) operate in, (b) we formulate a prior that enforces output constraint satisfaction on the resulting posterior predictive, including a variant that can be amortized across multiple tasks, (c) we demonstrate proof-of-concepts on toy simulations and apply Output-Constrained BNN (OC-BNN) to three real-world, high-dimensional datasets: (i) enforcing physiologically feasible interventions on a clinical action prediction task, (ii) enforcing a racial fairness constraint on a recidivism prediction task where the training data is biased, and (iii) enforcing recourse on a credit scoring task where a subpopulation is poorly represented by data
- This is because the constraints are intentionally specified in input regions out-of-distribution, and incorporating this knowledge augments what the OC-BNN learns from Dtr alone
- We propose OC-BNNs, which allow us to incorporate interpretable and intuitive prior knowledge, in the form of output constraints, into BNNs
- Through a series of low-dimensional simulations as well as real-world applications with realistic constraints, we show that OC-BNNs generally maintain the desirable properties of ordinary BNNs while satisfying specified constraints
- Our work shows promise in various high-stakes domains, such as healthcare and criminal justice, where both uncertainty quantification and prior expert constraints are necessary for safe and desirable model behavior

Methods

**Experiments with Real-World Data**

- To demonstrate the efficacy of OC-BNNs, the authors apply meaningful and interpretable output constraints on real-life datasets.
- The authors construct a dataset (N = 405K) of 8 relevant features and consider a binary classification task of whether clinical interventions for hypotension management — namely, vasopressors or IV fluids — should be taken for any patient.
- The authors specify two physiologically feasible, positive constraints: (1) if the patient has high creatinine, high BUN and low urine, action should be taken (Cy = {1}); (2) if the patient has high lactate and low bicarbonate, action should be taken.
- In addition to accuracy and F1 score on the test set, the authors measure the fraction of constrained predictions that violate the specified constraints (the violation fraction reported in Table 1).
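The positive constraints above can be expressed as simple predicates over input features, with the violation fraction measured on the inputs where a constraint applies. A minimal sketch follows; the feature indices and thresholds here are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

def constraint_active(x):
    """Return True if either positive constraint applies to input x.

    Hypothetical feature layout: 0=creatinine, 1=BUN, 2=urine,
    3=lactate, 4=bicarbonate. Thresholds are illustrative only.
    """
    c1 = x[0] > 4.0 and x[1] > 30.0 and x[2] < 0.5  # constraint (1)
    c2 = x[3] > 4.0 and x[4] < 15.0                 # constraint (2)
    return c1 or c2

def violation_fraction(model_predict, X):
    """Fraction of constrained inputs where the model fails to predict action=1."""
    active = np.array([constraint_active(x) for x in X])
    X_c = X[active]
    if len(X_c) == 0:
        return 0.0
    preds = model_predict(X_c)  # predicted labels in {0, 1}
    return float(np.mean(preds != 1))
```

A constraint-satisfying model drives this fraction toward zero; Table 1 reports roughly a six-fold reduction for the OC-BNN versus the baseline.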

Results

- By constraining recidivism prediction to the defendant’s actual criminal history, OC-BNNs strictly enforce a fairness constraint
- On both versions of Dtr, the baseline BNN predicts unequal risk for the two groups since the output labels (COMPAS decisions) are themselves biased.
- This inequality is more stark when the race feature is included, as the model learns the explicit, positive correlation between race and the output label.
- When an actionability constraint is enforced, the OC-BNN reduces the effort of recourse without sacrificing predictive accuracy on the test set, reaching the closest to the ground-truth recourse.
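The fairness result above can be quantified as the ratio of high-risk prediction rates between the two groups. A minimal sketch, using hypothetical prediction and group arrays (the paper's exact metric definition may differ):

```python
import numpy as np

def high_risk_rate_by_group(preds, group):
    """Rate of high-risk (label 1) predictions for each group value."""
    preds, group = np.asarray(preds), np.asarray(group)
    return {g: float(np.mean(preds[group == g])) for g in np.unique(group)}

def disparity_ratio(preds, group):
    """Ratio of the larger group rate to the smaller; 1.0 means parity."""
    rates = list(high_risk_rate_by_group(preds, group).values())
    lo, hi = min(rates), max(rates)
    return float('inf') if lo == 0 else hi / lo
```

On the biased baseline this ratio is roughly five (Table 2); the fairness-constrained OC-BNN drives it close to 1.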

Conclusion

- The usage of OC-BNNs depends on how the authors view constraints in relation to data. The clinical action prediction and credit scoring tasks are cases where the constraint is a complementary source of information, being defined in input regions where Dtr is sparse.
- In contrast with [7, 19, 25], OC-BNNs take a sampling-based approach to bridge functional and parametric objectives
- The simplicity of this can be advantageous — output constraints are a common currency of knowledge specified by domain experts, in contrast to more technical forms such as stochastic process priors.
- OC-BNNs allow practitioners to manipulate an interpretable form of knowledge: even domain experts without technical machine learning expertise can specify such constraints on model behavior.
- The authors intentionally showcase applications of high societal relevance, such as recidivism prediction and credit scoring, where the ability to specify and satisfy constraints can lead to fairer and more ethical model behavior
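The sampling-based bridge between functional and parametric objectives mentioned above can be sketched as a log-prior over weights that combines a standard Gaussian term with a Monte Carlo penalty for outputs landing in a forbidden region at sampled constrained inputs. This is a schematic reconstruction of the idea, not the paper's exact COCP formula; all function names and the penalty scale `gamma` are assumptions:

```python
import numpy as np

def log_prior(w, forward, sample_constrained_inputs, in_forbidden_region,
              sigma=1.0, gamma=10.0, n_samples=32):
    """Schematic output-constrained log-prior.

    forward(w, X) -> model outputs; in_forbidden_region(x, y) -> bool.
    gamma scales the constraint-violation penalty (negative-exponential style).
    """
    # Standard isotropic Gaussian log-prior on the weights.
    log_p = -0.5 * np.sum(w ** 2) / sigma ** 2
    # Monte Carlo penalty over inputs sampled from the constrained region.
    X = sample_constrained_inputs(n_samples)
    Y = forward(w, X)
    violations = np.array([in_forbidden_region(x, y) for x, y in zip(X, Y)])
    log_p -= gamma * float(np.mean(violations))
    return log_p
```

Because the penalty is evaluated on sampled inputs, the functional belief is folded into an ordinary parameter-space prior, so any standard BNN inference method (e.g. SVGD or HMC) can be run unchanged.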

- Table 1: Compared to the baseline, the OC-BNN maintains equally high accuracy and F1 score on both train and test sets. The violation fraction decreased about six-fold when using OC-BNNs.
- Table 2: The OC-BNN predicts almost equal rates of high-risk recidivism for both racial groups, compared to a 5-fold difference on the baseline. However, accuracy metrics decrease (expectedly).
- Table 3: All three models have comparable accuracy on the test set. However, the OC-BNN has the lowest recourse effort (closest to ground truth).

Related work

**Noise Contrastive Priors** Hafner et al. [9] propose a generative “data prior” in function space, modeled as zero-mean Gaussians if the input is out-of-distribution. Noise contrastive priors (NCPs) are similar to OC-BNNs in that both methods place a prior on function space while performing inference in parameter space. However, OC-BNNs model output constraints, which encode a richer class of functional beliefs than the simpler Gaussian assumptions encoded by NCPs.

**Global functional properties** Previous works have enforced various functional properties such as Lipschitz smoothness [2] or monotonicity [29]. The constraints they consider differ from output constraints, which can be defined for local regions of the input space. Furthermore, these works focus on classical NNs rather than BNNs.

Funding

- HL acknowledges support from Google
- WY and FDV acknowledge support from the Sloan Foundation

Study subjects and analysis

defendants: 6172

A study by ProPublica in 2016 found it to be racially biased against African American defendants [1, 16]. We use the same dataset as this study, containing 9 features on N = 6172 defendants related to their criminal history and demographic attributes. We consider the same binary classification task as in Slack et al [24] — predicting whether a defendant is profiled by COMPAS as being high-risk

cases: 3

Motivated by Ustun et al [26]’s work on recourse (defined as the extent that input features must be altered to change the model’s outcome), we consider the feature RevolvingUtilizationOfUnsecuredLines (RUUL), which has a ground-truth positive correlation with financial distress. We analyze how much a young adult under 35 has to reduce RUUL to flip their prediction to negative in three cases: (i) a BNN trained on the full dataset, (ii) a BNN trained on a blind dataset (age ≥ 35), (iii) an OC-BNN with an actionability constraint: for young adults, predict “no financial distress” even if RUUL is large. The positive Dirichlet COCP (5) is used

In addition to accuracy and F1 score on the entire test set (N = 10K), the authors measure the effort of recourse as the mean difference of RUUL between the two outcomes (Y = 0 or 1) on the subset of individuals where age < 35 (N = 1.5K). As Table 3 shows, the ground-truth positive correlation between RUUL and the output is weak, and the effort of recourse is consequently low.
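The recourse-effort metric described above (mean difference of RUUL between the two predicted outcomes among individuals under 35) can be sketched as follows; the exact definition in the paper may differ in sign or normalization:

```python
import numpy as np

def recourse_effort(ruul, preds, age, age_cutoff=35):
    """Mean RUUL gap between predicted outcomes among individuals under the cutoff.

    ruul: RevolvingUtilizationOfUnsecuredLines values; preds: labels in {0, 1}.
    Returns 0.0 when either predicted-outcome group is empty.
    """
    ruul, preds, age = map(np.asarray, (ruul, preds, age))
    young = age < age_cutoff
    pos = ruul[young & (preds == 1)]   # predicted financial distress
    neg = ruul[young & (preds == 0)]   # predicted no financial distress
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    return float(pos.mean() - neg.mean())
```

A smaller gap means a young adult needs to change RUUL less to flip the prediction, i.e. recourse is easier.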

COCP samples: 5

Figure 2. (a) 1D regression with the positive constraint Cx+ = R and Cy+(x) = {y | x · y ≥ 0} (green), using the AOCP. (b) 1D regression with the negative constraint Cx− = [−1, 1] and Cy− = [1, 2.5] (red), using the negative exponential COCP (6). The 50 SVGD particles represent functions passing above and below the constrained region, capturing two distinct predictive modes. (c) Fraction of rejected SVGD particles (out of 100) for the OC-BNN (blue, plotted against the log-number of COCP samples) and the baseline (black). All baseline particles were rejected, whereas only 4% of OC-BNN particles were rejected using just 5 COCP samples.

References

- Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine Bias: Risk Assessments in Criminal Sentencing. ProPublica, 2016.
- Cem Anil, James Lucas, and Roger Grosse. Sorting Out Lipschitz Function Approximation. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight Uncertainty in Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
- Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
- John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, S.M. Ali Eslami, and Yee Whye Teh. Neural Processes. In 35th ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
- Alex Graves. Practical Variational Inference for Neural Networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
- Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, and James Davidson. Noise Contrastive Priors for Functional Uncertainty. arXiv:1807.09289, 2018.
- Geoffrey E Hinton and Drew Van Camp. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the 6th Annual Conference on Computational Learning Theory, pages 5–13, 1993.
- Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5):359–366, 1989.
- Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, A Freely Accessible Critical Care Database. Scientific Data, 3:160035, 2016.
- Kaggle. Give Me Some Credit. http://www.kaggle.com/c/GiveMeSomeCredit/, 2011.
- Nathan Kallus and Angela Zhou. Residual Unfairness in Fair Machine Learning from Prejudiced Data. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
- Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How We Analyzed the COMPAS Recidivism Algorithm. ProPublica, 2016.
- Qiang Liu and Dilin Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386, 2016.
- Marco Lorenzi and Maurizio Filippone. Constraining the Dynamics of Deep Probabilistic Models. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Christos Louizos, Xiahan Shi, Klamer Schutte, and Max Welling. The Functional Neural Process. In Advances in Neural Information Processing Systems, pages 8743–8754, 2019.
- David J C MacKay. Probable Networks and Plausible Predictions — a Review of Practical Bayesian Methods for Supervised Neural Networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.
- Radford M Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
- Radford M Neal. MCMC Using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
- Bernt Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 2003.
- Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. In Proceedings of the 3rd AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, pages 180–186, 2020.
- Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional Variational Bayesian Neural Networks. In Proceedings of the 7th International Conference on Learning Representations, 2019.
- Berk Ustun, Alexander Spangher, and Yang Liu. Actionable Recourse in Linear Classification. In Proceedings of the ACM Conference on Fairness, Accountability and Transparency, pages 10–19, 2019.
- Andrew Gordon Wilson. The Case for Bayesian Deep Learning. arXiv:2001.10995, 2020.
- Wanqian Yang, Lars Lorch, Moritz A Graule, Srivatsan Srinivasan, Anirudh Suresh, Jiayu Yao, Melanie F Pradier, and Finale Doshi-Velez. Output-Constrained Bayesian Neural Networks. In 36th ICML Workshop on Uncertainty and Robustness in Deep Learning, 2019.
- Seungil You, David Ding, Kevin Canini, Jan Pfeifer, and Maya Gupta. Deep Lattice Networks and Partial Monotonic Functions. In Advances in Neural Information Processing Systems, pages 2981–2989, 2017.
