Underspecification Presents Challenges for Credibility in Modern Machine Learning

Abstract

ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines […]

Introduction
  • In many applications of machine learning (ML), a trained model is required to predict well in the training domain, and encode some essential structure of the underlying system.
  • In some domains, such as medical diagnostics, the required structure corresponds to causal phenomena that remain invariant under intervention.
  • In other domains, such as natural language processing, the required structure is determined by the details of the application.
  • These requirements for encoded structure have practical consequences: they determine whether the model will generalize as expected in deployment scenarios.
  • The evaluations in this pipeline are agnostic to the particular inductive biases encoded by the trained model.
  • Concerns regarding “spurious correlations” and “shortcut learning” in trained models are widespread (e.g., Geirhos et al, 2020; Arjovsky et al, 2019)
Highlights
  • Our work here complements these efforts in two ways: first, our goal is to understand how underspecification relates to inductive biases that could enable generalization beyond the training distribution P; second, the primary object we study is practical ML pipelines rather than the loss landscape itself
  • These results indicate that the inductive biases that are relevant to making predictions in the presence of these corruptions are so weakly differentiated by the iid prediction task that changing random seeds in training can cause the pipeline to return predictors with substantially different stress test performance
  • Our results here suggest that underspecification remains an issue even for these models; potentially, as models are scaled up, underspecified dimensions may account for a larger proportion of the “headroom” available for improving out-of-distribution model performance
  • The first model classifies images of patient retinas, while the second classifies clinical images of patient skin. We show that both models are underspecified along dimensions that are practically important for deployment
  • In the Supplement, we show one such preliminary result, where a model trained with the timestamp feature completely ablated was able to achieve identical iid predictive performance
Results
  • Performance on the British test set is only weakly associated with performance on the “non-British” set (Spearman ρ = 0.135; 95% CI 0.070-0.20; Figure 3, right).
  • On the ImageNet validation set, the ResNet-50 predictors achieve a top-1 accuracy of 75.9% ± 0.11, while the BiT models achieve 86.2% ± 0.09.
  • The variability of predictor accuracies on ObjectNet is large compared to this baseline (p = 0.002 for ResNet-50 and p < 0.001 for BiT).
  • Our ensemble of predictors achieves accuracy ranging from 0.960 to 0.965.
  • Tomašev et al (2019a) achieve state-of-the-art performance, detecting the onset of AKI up to 48 hours in advance with an accuracy of 55.8% across all episodes and 90.2% for episodes associated with dialysis administration
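The variability comparison reported above (the observed spread of ensemble accuracies versus a test-set sampling-noise baseline) can be sketched as a Monte-Carlo test. The function name, the accuracy numbers, and the test-set sizes below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def spread_p_value(accs, n_test, n_sim=10000, seed=0):
    """One-sided Monte-Carlo p-value for the spread (std dev) of predictor
    accuracies, under the null that every predictor shares one true accuracy
    and differences come only from finite-test-set sampling noise."""
    rng = np.random.default_rng(seed)
    accs = np.asarray(accs)
    p0 = accs.mean()                     # pooled accuracy under the null
    observed = accs.std()
    # Simulate ensembles of accuracies from pure binomial sampling noise.
    null = rng.binomial(n_test, p0, size=(n_sim, len(accs))) / n_test
    return (null.std(axis=1) >= observed).mean()

# Illustrative numbers only: a tight iid spread vs. a wide stress-test spread.
iid_accs    = [0.758, 0.760, 0.759, 0.761, 0.758]
stress_accs = [0.250, 0.270, 0.231, 0.291, 0.212]
p_iid = spread_p_value(iid_accs, n_test=50000)
p_stress = spread_p_value(stress_accs, n_test=18000)
print(p_iid, p_stress)  # iid spread consistent with noise; stress spread is not
```

A small p-value on the stress test says the seed-to-seed variation is far larger than test-set noise alone can explain, which is the pattern the Results describe for ObjectNet.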
Conclusion
  • The vignettes demonstrate that underspecification can introduce complications for deploying ML, even in application areas where it has the potential to be highly beneficial
  • These results suggest that one cannot expect ML models to automatically generalize to new clinical settings or populations, because the inductive biases that would enable such generalization are underspecified.
  • The results replicate prior findings that highly parametrized NLP models do learn spurious correlations and shortcuts
  • This reliance is underspecified by the model architecture, learning algorithm, and training data: merely changing the random seed can induce large variation in the extent to which spurious correlations are learned.
Tables
  • Table1: Accuracies of ensemble members on stress tests. Ensemble mean (standard deviations) of accuracy proportions on ResNet-50 and BiT models
  • Table2: Ensemble disagreement proportions for ImageNet vs ObjectNet models. Average disagreement between pairs of predictors in the ResNet and BiT ensembles. The “subset” test set only includes classes that also appear in the ObjectNet test set. Models show substantially more disagreement on the ObjectNet test set
  • Table3: Summary statistics for structure of variation on gendered shortcut stress tests. For each dataset, we measure the accuracy of 100 predictors, corresponding to 20 randomly initialized fine-tunings from 5 randomly initialized pretrained BERT checkpoints. Models are fine-tuned on the STS-B and OntoNotes training sets, respectively. The F statistic quantifies how systematic the differences between pretrainings are, using the ratio of between-pretraining variance to within-pretraining variance in the accuracy statistics. p-values are reported to give a sense of scale, but not for inferential purposes; it is unlikely that assumptions for a valid F-test are met. F-values of this magnitude are consistent with systematic between-group variation. The Spearman ρ statistic quantifies how ranked performance on the fine-tuning task correlates with the stress test metric of gender correlation
  • Table4: Summary statistics for structure of variation in predictor accuracy across NLI stress tests. For each dataset, we measure the accuracy of 100 predictors, corresponding to 20 randomly initialized fine-tunings from 5 randomly initialized pretrained BERT checkpoints. All models are fine-tuned on the MNLI training set, and validated on the MNLI matched test set (Williams et al., 2018). The F statistic quantifies how systematic the differences between pretrainings are. Specifically, it is the ratio of between-pretraining variance to within-pretraining variance in the accuracy statistics. p-values are reported to give a sense of scale, but not for inferential purposes; it is unlikely that assumptions for a valid F-test are met. The Spearman ρ statistic quantifies how ranked performance on the MNLI matched test set correlates with ranked performance on each stress test. For most stress tests, there is only a weak relationship, such that choosing models based on test performance alone would not yield the best models on stress test performance
  • Table5: Patterns of creatinine sampling can induce a spurious relationship between time of day and AKI. Prevalence of AKI is stable across times of day in the test set (test set prevalence is 2.269%), but creatinine samples are taken more frequently in the first two time buckets. As a result, conditional on a sample being taken, AKI prevalence is higher in the latter two time buckets
  • Table6: Model sensitivity per time of day: Normalized PRAUC on the test set, per time of day, and per time of day for creatinine samples for each model instance when the data is not perturbed (‘Test’), when time of day is shifted (‘Shift’) and when time of day is shifted and only CHEM-7 labs are considered for creatinine samples (‘Shift+Labs’). ‘Diff.’ refers to the maximum difference in value between instances
  • Table7: Flipped decisions under time-shift and lab order composition interventions depend on random seed. Each cell is number of patient-timepoints at which decisions changed when the time range feature and lab order composition were changed, for patient timepoints with creatinine measured. “+ to -” indicates a change from the “at risk of AKI in next 48 hrs" to “not at risk”; “- to +” indicates the opposite change. Model 1 and model 2 are LSTM models that differ only in random seed. Overlap indicates the number of patient-timepoint flips shared between the two models. The number of flips in each direction changes as a function of random seed, and the patient-timepoints that flip are largely disjoint between random seeds
  • Table8: Distribution of IOP associated variants. 129 variant clusters are distributed over 16 chromosomes
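The between- vs. within-pretraining F statistic described in Tables 3 and 4 can be sketched as follows. The 5 × 20 grid of accuracies is simulated here, and the effect sizes are illustrative assumptions:

```python
import numpy as np

def f_statistic(acc_grid):
    """One-way-ANOVA-style F: ratio of between-group to within-group variance
    for accuracies arranged as (n_pretrainings, n_finetunings). A large value
    means the pretraining seed explains systematic differences."""
    k, m = acc_grid.shape
    group_means = acc_grid.mean(axis=1)
    grand_mean = acc_grid.mean()
    between = m * ((group_means - grand_mean) ** 2).sum() / (k - 1)
    within = ((acc_grid - group_means[:, None]) ** 2).sum() / (k * (m - 1))
    return between / within

# Synthetic grid (illustrative): each pretraining seed shifts the mean
# stress-test accuracy, with extra fine-tuning noise on top.
rng = np.random.default_rng(0)
pretrain_effect = rng.normal(scale=0.03, size=(5, 1))
acc_grid = 0.6 + pretrain_effect + rng.normal(scale=0.01, size=(5, 20))
print(f_statistic(acc_grid))  # much greater than 1: systematic between-pretraining variation
```

When the pretraining effect is removed, the statistic falls back toward 1, which is the contrast the tables use to argue that pretraining seeds induce systematic differences.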
Related work
  • We consider a supervised learning setting, where the goal is to obtain a predictor f : X → Y that maps inputs x (e.g., images, text) to labels y. We say a model is specified by a function class F from which a predictor f(x) will be chosen. An ML pipeline takes in training data D drawn from a training distribution P and produces a trained model, or predictor, f(x) from F. Usually, the pipeline selects f ∈ F by approximately minimizing the predictive risk on the training distribution, RP(f) := E(X,Y)∼P[ℓ(f(X), Y)], where ℓ is a loss function. Regardless of the method used to obtain a predictor f, we assume that the pipeline validates that f achieves low expected risk on the training distribution P by evaluating its predictions on an independent and identically distributed test set D′, e.g., a hold-out set selected completely at random. This validation translates to a behavioral guarantee, or contract (Jacovi et al, 2020), about the model’s aggregate performance on future data drawn from P.
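A minimal sketch of how such a pipeline can be underspecified; the data-generating process, model, and names below are illustrative assumptions, not the paper's. Two nearly duplicate features support many predictors with equivalent held-out risk, and the random initialization selects among them:

```python
import numpy as np

def train_logreg(X, y, seed, steps=2000, lr=0.1):
    """Fit logistic regression by gradient descent from a random init.
    With near-duplicate features the loss is nearly flat along w1 - w2,
    so the init largely decides how credit is split between them."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=2.0, size=X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(0)
n = 4000
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)            # near-duplicate feature
y = (x1 + rng.normal(scale=0.5, size=n) > 0).astype(float)
X = np.column_stack([x1, x2])
Xtr, ytr, Xte, yte = X[:3000], y[:3000], X[3000:], y[3000:]

accs, weights = [], []
for seed in range(5):
    w = train_logreg(Xtr, ytr, seed)
    accs.append((((Xte @ w) > 0) == yte).mean())
    weights.append(w)

print(accs)     # near-identical held-out accuracy across seeds...
print(weights)  # ...from predictors that split weight between x1 and x2 very differently
```

All five predictors satisfy the iid "contract" equally well, yet they encode different functions of the inputs; only a stress test that breaks the x1/x2 correlation would tell them apart.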
Additional results
  • We compute one-sided p-values with respect to this null distribution and interpret them as exploratory descriptive statistics, finding that the variation between models in skin types II, III, and V is relatively typical (p = 0.29, n = 437; p = 0.54, n = 2619; p = 0.42, n = 109), while the variation in skin type IV is far less typical (p = 0.03, n = 798)
Study subjects and analysis
British individuals: 91971
Individuals whose z-scored distance from the coordinate-wise median is no greater than 4 in this PC space are considered British. We then randomly partitioned the 91,971 British individuals defined this way into a British training set (82,309 individuals) and a British evaluation set (9,662 individuals). The remaining “non-British” individuals (14,898) were used for evaluation
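One reading of the cohort filter described above can be sketched in numpy. The PC coordinates are synthetic, and the exact normalization used in the study may differ; the threshold of 4 is the one stated in the text:

```python
import numpy as np

def cohort_mask(pcs, z_max=4.0):
    """Keep individuals whose z-scored distance from the coordinate-wise
    median is at most z_max in every principal-component coordinate.
    (One interpretation of the filter; the study's exact scaling may differ.)"""
    med = np.median(pcs, axis=0)
    z = np.abs(pcs - med) / pcs.std(axis=0)
    return (z <= z_max).all(axis=1)

rng = np.random.default_rng(0)
pcs = rng.normal(size=(100000, 6))   # synthetic PC coordinates for 100k individuals
pcs[:500] += 8.0                     # a small, genetically distant cluster
mask = cohort_mask(pcs)
print(mask.sum(), (~mask).sum())     # individuals kept in the cohort vs. excluded
```

The distant cluster is excluded in every coordinate, while almost all individuals near the median survive the filter.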

individuals: 14898
The remaining “non-British” individuals (14,898) were used for evaluation. We trained linear regression models for predicting IOP with (a) demographics and a set of 129 genomic features (one of the 1,000 sets created above) and (b) demographic features only, using the British training set

deep learning case studies: 4
We present a set of examples of underspecification in simple, analytically tractable models as a warm-up in Section 3. We then present a set of four deep learning case studies in Sections 5–8. We close with a discussion in Section 9

data: 437
The results are shown at the bottom of Figure 6. Compared to overall test accuracy, there is larger variation in test accuracy within skin type strata across models, particularly in skin types II and IV, which form substantial portions (n = 437, or 10.7%, and n = 798, or 19.6%, respectively) of the test data. Based on this test set, some models in this ensemble would be judged to have higher discrepancies across skin types than others, even though they were all produced by an identical training pipeline

patients: 703782
Specifically, we apply our experimental protocol to the Tomašev et al (2019a) AKI model, which predicts the continuous risk (every 6 hours) of AKI in a 48h lookahead time window (see Supplement for details). 8.1 Data, Predictor Ensemble, and Metrics The pipeline and data used in this study are described in detail in Tomašev et al (2019a). Briefly, the data consists of de-identified EHRs from 703,782 patients across multiple sites in the United States, collected at the US Department of Veterans Affairs between 2011 and 2015

References
  • Adewole S Adamson and Avery Smith. Machine learning and health care disparities in dermatology. JAMA dermatology, 154(11):1247–1248, 2018.
  • Ademide Adelekun, Ginikanwa Onyekaba, and Jules B Lipoff. Skin color in dermatology textbooks: An updated evaluation and analysis. Journal of the American Academy of Dermatology, 2020.
  • R Ambrosino, B G Buchanan, G F Cooper, and M J Fine. The use of misclassification costs to learn rule-based decision support models for cost-effective hospital admission strategies. Proceedings. Symposium on Computer Applications in Medical Care, pages 304–8, 1995. ISSN 01954210. URL http://www.ncbi.nlm.nih.gov/pubmed/8563290; http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2579104.
  • Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Susan Athey. Beyond prediction: Using big data for policy problems. Science, 355(6324):483–485, 2017.
  • Marzieh Babaeianjelodar, Stephen Lorenz, Josh Gordon, Jeanna Matthews, and Evan Freitag. Quantifying gender bias in different corpora. In Companion Proceedings of the Web Conference 2020, pages 752–759, 2020.
  • Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems, pages 9448–9458, 2019.
  • Emma Beede, Elizabeth Baylor, Fred Hersch, Anna Iurchenko, Lauren Wilcox, Paisan Ruamviboonsuk, and Laura M Vardoulakis. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2020.
  • Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
  • Emily M. Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.463. URL https://www.aclweb.org/anthology/2020.acl-main.463.
  • Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017.
  • Jeremy J Berg, Arbel Harpak, Nasa Sinnott-Armstrong, Anja Moltke Joergensen, Hakhamanesh Mostafavi, Yair Field, Evan August Boyle, Xinjun Zhang, Fernando Racimo, Jonathan K Pritchard, and Graham Coop. Reduced signal for polygenic adaptation of height in UK biobank. Elife, 8, March 2019.
  • Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems, pages 4349–4357, 2016.
  • Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1075. URL https://www.aclweb.org/anthology/D15-1075.
  • Kendrick Boyd, Vítor Santos Costa, Jesse Davis, and C. David Page. Unachievable region in precision-recall space and its effect on empirical evaluation. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, volume 1, pages 639–646, 2012. ISBN 9781450312851.
  • Theodora S. Brisimi, Tingting Xu, Taiyao Wang, Wuyang Dai, and Ioannis Ch Paschalidis. Predicting diabetes-related hospitalizations based on electronic health records. Statistical Methods in Medical Research, 28(12):3667–3682, dec 2019. ISSN 14770334. doi: 10.1177/0962280218810911.
  • Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91, 2018.
  • Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
  • Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible Models for HealthCare. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’15, pages 1721–1730, 2015. ISBN 9781450336642. doi: 10.1145/2783258.2788613. URL http://dl.acm.org/citation.cfm?doid=2783258.2788613.
  • Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2001. URL https://www.aclweb.org/anthology/S17-2001.
  • Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.
  • Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. 2019.
  • Edward Choi, Mohammad Taha Bahadori, Le Song, Walter F Stewart, and Jimeng Sun. GRAM: Graph-based Attention Model for Healthcare Representation Learning. 2017. doi: 10.1145/3097983.3098126. URL http://dx.doi.org/10.1145/3097983.3098126.
  • Gary S Collins, Johannes B Reitsma, Douglas G Altman, and Karel GM Moons. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (tripod): the tripod statement. British Journal of Surgery, 102(3):148–158, 2015.
  • Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capacity and trainability in recurrent neural networks. In 5th International Conference on Learning Representations, ICLR 2017 Conference Track Proceedings, 2017.
  • Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128, 2019.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
  • Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romijnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Matthias Minderer, Alexander D’Amour, Dan Moldovan, et al. On robustness and transferability of convolutional neural networks. arXiv preprint arXiv:2007.08558, 2020.
  • Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020.
  • L Duncan, H Shen, B Gelaye, J Meijsen, K Ressler, M Feldman, R Peterson, and B Domingue. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun., 10(1):3328, July 2019.
  • Michael W Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M Dai. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 204–213, 2020.
  • Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
  • Chenchen Feng, David Le, and Allison B. McCoy. Using Electronic Health Records to Identify Adverse Drug Events in Ambulatory Care: A Systematic Review. Applied Clinical Informatics, 10 (1):123–128, 2019. ISSN 18690327. doi: 10.1055/s-0039-1677738.
  • Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research, 20(177):1–81, 2019.
  • TB Fitzpatrick. Sun and skin. Journal de Medecine Esthetique, 2:33–34, 1975.
  • Seth Flaxman, Swapnil Mishra, Axel Gandy, H Juliette T Unwin, Thomas A Mellan, Helen Coupland, Charles Whittaker, Harrison Zhu, Tresnia Berah, Jeffrey W Eaton, et al. Estimating the effects of non-pharmaceutical interventions on covid-19 in europe. Nature, 584(7820):257–261, 2020.
  • Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
  • Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • Joseph Futoma, Morgan Simons, Trishan Panch, Finale Doshi-Velez, and Leo Anthony Celi. The myth of generalisability in clinical research and machine learning in health care. The Lancet Digital Health, 2(9):e489 – e492, 2020. ISSN 2589-7500. doi: https://doi.org/10.1016/S2589-7500(20)30186-2. URL http://www.sciencedirect.com/science/article/pii/S2589750020301862.
  • Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H Chi, and Alex Beutel. Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 219–226, 2019.
  • Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, pages 8789–8798, 2018.
  • Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bygh9j09KX.
  • Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. arXiv preprint arXiv:2004.07780, 2020.
  • Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
  • Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama, 316(22):2402–2410, 2016.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2), 2018.
  • Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJz6tiCqYm.
  • Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. arXiv preprint arXiv:1907.07174, 2019.
  • Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241, 2020.
  • Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. URL http://www7.informatik.tu-muenchen.de/~hochreit; http://www.idsia.ch/~juergen.
  • Wolfgang Hoffmann, Ute Latza, Sebastian E Baumeister, Martin Brünger, Nina Buttmann-Schweiger, Juliane Hardt, Verena Hoffmann, André Karch, Adrian Richter, Carsten Oliver Schmidt, et al. Guidelines and recommendations for ensuring good epidemiological practice (gep): a guideline developed by the german society for epidemiology. European journal of epidemiology, 34(3):301–317, 2019.
  • Sara Hooker. The hardware lottery. arXiv preprint arXiv:2009.06489, 2020.
  • Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60, New York City, USA, June 2006. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/N06-2015.
  • Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, 2018.
  • Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pages 125–136, 2019.
  • International Schizophrenia Consortium, Shaun M Purcell, Naomi R Wray, Jennifer L Stone, Peter M Visscher, Michael C O’Donovan, Patrick F Sullivan, and Pamela Sklar. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460(7256):748–752, August 2009.
  • Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
  • Alon Jacovi, Ana Marasović, Tim Miller, and Yoav Goldberg. Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in ai. arXiv preprint arXiv:2010.07487, 2020.
  • Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Sklgs0NFvr.
  • John A Kellum and Azra Bihorac. Artificial intelligence to predict aki: is it a breakthrough? Nature Reviews Nephrology, pages 1–2, 2019.
  • Christopher J Kelly, Alan Karthikesalingam, Mustafa Suleyman, Greg Corrado, and Dominic King. Key challenges for delivering clinical impact with artificial intelligence. BMC medicine, 17(1):195, 2019.
  • Amit V Khera, Mark Chaffin, Krishna G Aragam, Mary E Haas, Carolina Roselli, Seung Hoan Choi, Pradeep Natarajan, Eric S Lander, Steven A Lubitz, Patrick T Ellinor, and Sekar Kathiresan. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet., 50(9):1219–1224, September 2018.
  • Arif Khwaja. KDIGO clinical practice guidelines for acute kidney injury. Nephron - Clinical Practice, 120(4), oct 2012. ISSN 16602110. doi: 10.1159/000339789.
  • Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. Prediction policy problems. American Economic Review, 105(5):491–95, 2015.
  • Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Large scale learning of general visual representations for transfer. arXiv preprint arXiv:1912.11370, 2019.
  • Jonathan Krause, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S Corrado, Lily Peng, and Dale R Webster. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology, 125(8):1264–1272, 2018.
  • Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in neural information processing systems, pages 4066–4076, 2017.
  • Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7219-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles.pdf.
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
  • Olivier Ledoit and Sandrine Péché. Eigenvectors of some large sample covariance matrix ensembles. Probability Theory and Related Fields, 151(1-2):233–264, 2011.
  • Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pages 4470–4481. Association for Computational Linguistics, sep 2018. ISBN 9781948087841. doi: 10.18653/v1/d18-1477. URL http://arxiv.org/abs/1709.02755.
  • Tal Linzen. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.465. URL https://www.aclweb.org/anthology/2020.acl-main.465.
  • Xiaoxuan Liu, Samantha Cruz Rivera, David Moher, Melanie J Calvert, and Alastair K Denniston. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. BMJ, 370, 2020a.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Yuan Liu, Ayush Jain, Clara Eng, David H Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, et al. A deep learning system for differential diagnosis of skin diseases. Nature Medicine, pages 1–9, 2020b.
  • Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31 (NeurIPS2018), pages 10869–10879. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8282-domain-adaptation-by-using-causal-inference-to-predict-invariant-conditional-distributions.pdf.
  • Alicia R Martin, Christopher R Gignoux, Raymond K Walters, Genevieve L Wojcik, Benjamin M Neale, Simon Gravel, Mark J Daly, Carlos D Bustamante, and Eimear E Kenny. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet., 100(4):635–649, April 2017.
  • Alicia R Martin, Masahiro Kanai, Yoichiro Kamatani, Yukinori Okada, Benjamin M Neale, and Mark J Daly. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet., 51(4):584–591, April 2019.
  • Charles T Marx, Flavio du Pin Calmon, and Berk Ustun. Predictive multiplicity in classification. arXiv preprint arXiv:1909.06677, 2019.
  • R Thomas McCoy, Junghyun Min, and Tal Linzen. Berts of a feather do not generalize together: Large variability in generalization across models with similar test set performance. arXiv preprint arXiv:1911.02969, 2019a.
  • R Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007, 2019b.
  • Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv:1908.05355, 2019.
  • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Joannella Morales, Danielle Welter, Emily H Bowler, Maria Cerezo, Laura W Harris, Aoife C McMahon, Peggy Hall, Heather A Junkins, Annalisa Milano, Emma Hastings, Cinzia Malangone, Annalisa Buniello, Tony Burdett, Paul Flicek, Helen Parkinson, Fiona Cunningham, Lucia A Hindorff, and Jacqueline A L MacArthur. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS catalog. Genome Biol., 19(1):21, February 2018.
  • Sendhil Mullainathan and Jann Spiess. Machine learning: an applied econometric approach. Journal of Economic Perspectives, 31(2):87–106, 2017.
  • Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456, 2020.
  • Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. Stress test evaluation for natural language inference. arXiv preprint arXiv:1806.00692, 2018.
  • Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1g5sA4twr.
  • National Institute for Health and Care Excellence (NICE). Acute kidney injury: prevention, detection and management. NICE Guideline NG148, 2019.
  • Anna C Need and David B Goldstein. Next generation disparities in human genomics: concerns and remedies. Trends Genet., 25(11):489–494, November 2009.
  • Bret Nestor, Matthew B. A. McDermott, Willie Boag, Gabriela Berner, Tristan Naumann, Michael C Hughes, Anna Goldenberg, and Marzyeh Ghassemi. Feature robustness in non-stationary health records: Caveats to deployable model performance in common clinical machine learning tasks. Proceedings of Machine Learning Research, 106:1–23, 2019. URL http://arxiv.org/abs/1908.00690.
  • Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 151–159, 2020.
  • Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, oct 2019. ISSN 10959203. doi: 10.1126/science.aax2342.
  • Cecilia Panigutti, Alan Perotti, and Dino Pedreschi. Doctor XAI: An ontology-based approach to black-box sequential data classification explanations. In FAT* 2020 - Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 629–639, 2020. ISBN 9781450369367. doi: 10.1145/3351095.3372855. URL https://doi.org/10.1145/3351095.3372855.
  • Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016. doi: 10.1111/rssb.12167. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12167.
  • Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, 2018.
  • Alice B Popejoy and Stephanie M Fullerton. Genomics is failing on diversity. Nature, 538(7624): 161–164, October 2016.
  • Mihail Popescu and Mohammad Khalilia. Improving disease prediction using ICD-9 ontological features. In IEEE International Conference on Fuzzy Systems, pages 1805–1809, 2011. ISBN 9781424473175. doi: 10.1109/FUZZY.2011.6007410.
  • Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet., 38(8):904–909, August 2006.
  • Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A R Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul I W de Bakker, Mark J Daly, and Pak C Sham. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81(3):559–575, September 2007.
  • Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang. Understanding and mitigating the tradeoff between robustness and accuracy. arXiv preprint arXiv:2002.10716, 2020.
  • Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
  • Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.442. URL https://www.aclweb.org/anthology/2020.acl-main.442.
  • Samantha Cruz Rivera, Xiaoxuan Liu, An-Wen Chan, Alastair K Denniston, and Melanie J Calvert. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. BMJ, 370, 2020.
  • Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. Right for the right reasons: training differentiable models by constraining their explanations. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2662–2670. AAAI Press, 2017.
  • Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, 2018.
  • Bernhard Schölkopf. Causality for machine learning. arXiv preprint arXiv:1911.10500, 2019.
  • Lesia Semenova, Cynthia Rudin, and Ronald Parr. A study in rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning. arXiv preprint arXiv:1908.01755, 2019.
  • Montgomery Slatkin. Linkage disequilibrium — understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics, 9:477–485, 2008.
  • Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980, 2019.
  • Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, Bette Liu, Paul Matthews, Giok Ong, Jill Pell, Alan Silman, Alan Young, Tim Sprosen, Tim Peakman, and Rory Collins. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med., 12(3):e1001779, March 2015.
  • Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017.
  • Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence, 2017.
  • Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. arXiv preprint arXiv:2007.00644, 2020.
  • Daniel Shu Wei Ting, Carol Yim-Lui Cheung, Gilbert Lim, Gavin Siew Wei Tan, Nguyen D Quang, Alfred Gan, Haslina Hamzah, Renata Garcia-Franco, Ian Yew San Yeo, Shu Yen Lee, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. Jama, 318(22):2211–2223, 2017.
  • Nenad Tomašev, Xavier Glorot, Jack W Rae, Michal Zielinski, Harry Askham, Andre Saraiva, Anne Mottram, Clemens Meyer, Suman Ravuri, Ivan Protsyuk, Alistair Connell, Cían O Hughes, Alan Karthikesalingam, Julien Cornebise, Hugh Montgomery, Geraint Rees, Chris Laing, Clifton R Baker, Kelly Peterson, Ruth Reeves, Demis Hassabis, Dominic King, Mustafa Suleyman, Trevor Back, Christopher Nielson, Joseph R Ledsam, and Shakir Mohamed. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature, 572(7767):116–119, aug 2019a. ISSN 0028-0836. doi: 10.1038/s41586-019-1390-1.
  • Nenad Tomašev, Xavier Glorot, Jack W. Rae, Michal Zielinski, Harry Askham, Andre Saraiva, Anne Mottram, Clemens Meyer, Suman Ravuri, Ivan Protsyuk, Alistair Connell, Cían O. Hughes, Alan Karthikesalingam, Julien Cornebise, Hugh Montgomery, Geraint Rees, Chris Laing, Clifton R. Baker, Kelly Peterson, Ruth Reeves, Demis Hassabis, Dominic King, Mustafa Suleyman, Trevor Back, Christopher Nielson, Joseph R. Ledsam, and Shakir Mohamed. Developing deep learning continuous risk models for early adverse event prediction in electronic health records: an AKI case study. PROTOCOL available at Protocol Exchange, version 1, jul 2019b. doi: 10.21203/rs.2.10083/v1.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Bjarni J Vilhjálmsson, Jian Yang, Hilary K Finucane, Alexander Gusev, Sara Lindström, Stephan Ripke, Giulio Genovese, Po-Ru Loh, Gaurav Bhatia, Ron Do, Tristan Hayeck, Hong-Hee Won, Schizophrenia Working Group of the Psychiatric Genomics Consortium, Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) study, Sekar Kathiresan, Michele Pato, Carlos Pato, Rulla Tamimi, Eli Stahl, Noah Zaitlen, Bogdan Pasaniuc, Gillian Belbin, Eimear E Kenny, Mikkel H Schierup, Philip De Jager, Nikolaos A Patsopoulos, Steve McCarroll, Mark Daly, Shaun Purcell, Daniel Chasman, Benjamin Neale, Michael Goddard, Peter M Visscher, Peter Kraft, Nick Patterson, and Alkes L Price. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet., 97(4):576–592, October 2015.
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018.
  • Kellie Webster, Xuezhi Wang, Ian Tenney, Alex Beutel, Emily Pitler, Ellie Pavlick, Jilin Chen, and Slav Petrov. Measuring and reducing gendered correlations in pre-trained models. arXiv preprint arXiv:2010.06032, 2020.
  • Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensembles for robustness and uncertainty quantification. arXiv preprint arXiv:2006.13570, 2020.
  • Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, 2018.
  • Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
  • Julia K Winkler, Christine Fink, Ferdinand Toberer, Alexander Enk, Teresa Deinlein, Rainer Hofmann-Wellenhof, Luc Thomas, Aimilios Lallas, Andreas Blum, Wilhelm Stolz, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA dermatology, 155(10):1135–1141, 2019.
  • Naomi R Wray, Michael E Goddard, and Peter M Visscher. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res., 17(10):1520–1528, October 2007.
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5753–5763, 2019.
  • Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. In Advances in Neural Information Processing Systems, pages 13255–13265, 2019.
  • Bin Yu et al. Stability. Bernoulli, 19(4):1484–1500, 2013.
  • Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. Swag: A large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326, 2018.
  • Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.
  • Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, 2018.
  • Xiang Zhou, Yixin Nie, Hao Tan, and Mohit Bansal. The curse of performance instability in analysis datasets: Consequences, source, and suggestions. arXiv preprint arXiv:2004.13606, 2020.
Author
Alexander D'Amour
Dan Moldovan
Ben Adlam
Babak Alipanahi
Christina Chen
Jonathan Deaton