Responsible AI Dashboard
Git repository:

As of December 2021, we have released the Responsible AI Dashboard to the open source community. The dashboard integrates several mature efforts developed as a collaboration between Microsoft Research / Aether Committee and Azure Machine Learning. The main functionalities focus on debugging machine learning models and responsible decision-making. The dashboard is fully customizable and enables practitioners to construct their own workflows using visualizations and capabilities from Error Analysis, InterpretML, Fairlearn, EconML, and DiCE. The dashboard is part of the Responsible AI Toolbox, a larger open source effort at Microsoft for integrating Responsible AI into the end-to-end Machine Learning lifecycle.

Read more about the tool in our blog: Responsible AI dashboard: A one-stop shop for operationalizing Responsible AI in practice or watch our demo at the latest MSR Summit: Demo: RAI Toolbox: An open-source framework for building responsible AI.

I am a researcher in the Adaptive Systems and Interaction group at Microsoft Research. My research lies at the intersection of human and machine intelligence and aims at improving current systems, either with better debugging tools or by optimizing them for human-centered properties. I am currently excited about two main directions in this intersection:

Debugging and Failure Analysis of AI/ML Systems for accelerating the software development lifecycle of reliable and robust learning systems. I build tools that enable machine learning practitioners to identify and diagnose failures in learned models. Take a look at Error Analysis, BackwardCompatibilityML, and more recently the Responsible AI Toolbox as examples of such tools in the open source community.
Human-AI Collaboration for enhancing human capabilities while solving complex decision-making tasks. I study properties of ML models that make them better collaborators with people and design new optimization techniques that encode such properties into models. Check out my recent talk on this topic: The Utopia of Human-AI Collaboration.
I am also involved in various research initiatives that study the societal impact of artificial intelligence as well as various quality-of-service aspects of AI including interpretability, reliability, accountability, and fairness. This is a recent research podcast on my current interests.

If you are a PhD student looking for an internship position around these topics, send me an email. The Adaptive Systems and Interaction group is a fun bunch of excellent and diverse researchers.

Prior to joining Microsoft Research, in 2016 I completed my PhD degree at ETH Zurich (Switzerland) in the Systems Group, advised by Prof. Donald Kossmann and Prof. Andreas Krause. My doctoral thesis focuses on building cost and quality-aware models for integrating crowdsourcing in the process of building machine learning algorithms and systems. In 2011, I completed my master studies in computer science in a double-degree MSc program at RWTH University of Aachen (Germany) and University of Trento (Italy) as an Erasmus Mundus scholar. I also have a Diploma in Informatics from University of Tirana (Albania) from where I graduated in 2007.


February 2022 - Our new paper Investigations of Performance and Bias in Human-AI Teamwork in Hiring, led by Andi Peng, was presented at AAAI 2022. The work contributes a study on human-AI decision-making for hiring decisions, investigating the dynamics between model and human decision-making performance and bias. The study reveals that, depending on the model architecture, some models may mitigate bias and others may reinforce it, motivating the explicit need to assess these complex dynamics prior to deployment.

February 2022 - HINT was accepted as a full paper at the IUI 2022 conference. This is a contribution led by Quanze Chen (Jim Chen) during his MSR internship on Human-AI INtegration Testing. The HINT framework addresses challenges around testing AI-based features over time with people in the loop.

October 2021 - We presented a first demo of the Responsible AI Toolbox at the Microsoft Research Summit. The toolbox is a collaboration between Azure Machine Learning and several MSR groups. It provides an open-source framework for integrating various RAI tools and components into fluid, interactive workflows. Take a peek at the recorded demo on the MSR Summit website.

June 2021 - New work on Understanding Failures of Deep Networks via Robust Feature Extraction was accepted at CVPR 2021. The project was led by Sahil Singla as part of his MSR summer internship. It extends our efforts on failure explanation by proposing a semi-automated technique for extracting visual attributes useful for error analysis on image data.

May 2021 - Our work investigating the usefulness of explanations for Human-AI decision-making was presented at CHI 2021: Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. The work was led by Tongshuang Wu and Gagan Bansal, who continue to make impressive contributions to Human-Centered AI.

February 2021 - We are happy to share the Error Analysis tool with the open source community. The tool is based on our earlier work for failure explanation in ML systems and was developed with the help of our amazing partners in Azure Machine Learning and Microsoft Mixed Reality. Read more about the tool in our blog: Responsible Machine Learning with Error Analysis.

February 2021 - Pursuing our Human-AI Collaboration projects, we presented our latest work at AAAI 2021: Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork. The project, led by Gagan Bansal, introduces new challenges at the intersection of Human-AI Collaboration and Machine Learning Optimization, proposing to optimize ML models such that they maximize team utility in an environment where humans make decisions assisted by ML predictions.

December 2020 - Our team, in collaboration with the Microsoft Aether Committee, released the BackwardCompatibilityML tool. The open source tool enables ML developers to retrain models that do not introduce new errors during updates, through a series of loss functions and compatibility metrics. It also provides a set of visualization widgets for model comparison in Jupyter notebooks.

How can we better understand failures of an AI system?

Building reliable AI requires a deep understanding of the potential system failures. The focus of this project is to build tools that can help engineers to accelerate development and improvement cycles by assisting them in debugging and troubleshooting. For example, in 2018 we proposed Pandora as a set of hybrid human-machine methods for describing and explaining system failures. The approach provided descriptive performance reports to engineers correlating input conditions with errors, guiding them towards discovering hidden conditions of failure.
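As a rough illustration of what "correlating input conditions with errors" can look like in practice, the sketch below groups evaluation records by one input condition and reports the error rate per group. It is not the Pandora or Error Analysis implementation; the record fields (`lighting`, `label`, `prediction`) and the helper name are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def error_rate_by_condition(records, condition):
    """Group records by the value of `condition` and return
    {value: (num_errors, num_records, error_rate)}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [errors, total]
    for record in records:
        key = record[condition]
        counts[key][0] += int(record["prediction"] != record["label"])
        counts[key][1] += 1
    return {k: (e, n, e / n) for k, (e, n) in counts.items()}


# Hypothetical evaluation records: one input condition plus label and prediction.
records = [
    {"lighting": "dark", "label": "cat", "prediction": "dog"},
    {"lighting": "dark", "label": "dog", "prediction": "cat"},
    {"lighting": "bright", "label": "cat", "prediction": "cat"},
    {"lighting": "bright", "label": "dog", "prediction": "dog"},
]

report = error_rate_by_condition(records, "lighting")
for value, (errors, total, rate) in sorted(report.items()):
    print(f"lighting={value}: {errors}/{total} errors ({rate:.0%})")
```

A report like this only surfaces single-condition cohorts; combining several conditions at once is what makes hidden failure conditions harder to find by hand.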

Based on Pandora, we built the Error Analysis tool and later the Responsible AI Toolbox as a collaboration between Microsoft Research, Azure Machine Learning, and the Microsoft Aether Committee. The vision of this collaboration is to build tools that help engineers accelerate development iterations by identifying errors faster, systematically, and rigorously. We are also continuing to innovate with extended debugging techniques for vision by extracting interpretable attributes from robust representations. Moving forward, our team is also working on closing the loop of error identification, diagnosis, and mitigation by providing a library of mitigation strategies and by including model comparison capabilities in our toolchain. To get a comprehensive summary of our efforts on model comparison, take a look at our BackwardCompatibilityML tool for training compatible models and comparing them with each other.

Pandora workflow for error analysis

Error Analysis views for error identification and diagnosis

In the same vein, this project has also explored ideas on enabling troubleshooting techniques for AI systems that contain multiple learning components. Diagnosing such systems is a challenging task. Often, errors get propagated, suppressed or even amplified down the computation pipelines. We propose a troubleshooting methodology that generates counterfactual improved states of system components by using crowd intelligence. These states, which would have been too expensive or infeasible to generate otherwise, are then integrated in the system execution to create insights about which component fixes are the most efficient given the current system architecture.
Troubleshooting Integrative AI systems with humans in the loop

Current and past collaborators in the project:
Ece Kamar, Shital Shah, Gagan Bansal, Eric Horvitz, Juan Lema, Xavier Fernandes, Nicholas King, Mihaela Vorvoreanu, Jingya Chen (Microsoft Research and Microsoft Aether Committee)
Ilya Matiach, Mehrnoosh Sameki, Hyemi Song, Minsoo Thigpen, Richard Edgar, Roman Lutz (Azure Machine Learning)
Parham Mohadjer, Josh Hinds, Russell Eames, Yan Esteve Balducci, Mark Flick (Microsoft Mixed Reality)
Sahil Singla, Megha Srivastava (Microsoft Research Interns and Residents)

What are the properties of a good AI collaborator? How to optimize ML models for collaboration?

Machine learning models are currently optimized to maximize model accuracy on given benchmarks and test datasets. When a learning model is used by a human either to accomplish a complex task or to make a high-stakes decision (e.g. medical diagnosis or recidivism), team performance depends not only on model accuracy but also on how well humans understand when to trust the AI, so that they can learn when to override its decisions. In this project, we instead aim at optimizing for joint human-model performance.

AI-advised Human Decision Making

Our first steps towards this goal have been to study how models should be updated so that they do not violate previous trust that users might have built during their interaction over time. By incorporating in the loss function the goal of staying backward compatible to the previous model, and therefore to the previous user experience, we minimize update disruption for the whole team.
Human-AI teams undergoing a model update
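The idea of adding a backward-compatibility term to the loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the BackwardCompatibilityML tool: it assumes binary labels, uses log loss as the base loss, and penalizes the updated model more heavily on examples the previous model classified correctly. The function names and the weight `lam` are hypothetical.

```python
import math


def log_loss(y_true, p_pred, eps=1e-12):
    """Per-example binary cross-entropy for a label in {0, 1}."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def compatible_loss(y_true, old_pred, new_prob, lam=1.0):
    """Mean loss of the new model, with an extra penalty (weight `lam`)
    on examples that the old model predicted correctly."""
    total = 0.0
    for y, old, p in zip(y_true, old_pred, new_prob):
        base = log_loss(y, p)
        penalty = base if old == y else 0.0
        total += base + lam * penalty
    return total / len(y_true)


def compatibility_score(y_true, old_pred, new_pred):
    """Fraction of the old model's correct predictions that the new model keeps."""
    old_correct = [i for i, (y, o) in enumerate(zip(y_true, old_pred)) if y == o]
    if not old_correct:
        return 1.0  # nothing to preserve
    kept = sum(1 for i in old_correct if new_pred[i] == y_true[i])
    return kept / len(old_correct)
```

With `lam=0` the objective reduces to the ordinary mean log loss; larger values trade some accuracy on new examples for fewer newly introduced errors on examples users have already seen the system handle well.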

Within the same context, we have also studied further properties of the machine learning error boundary (i.e. when does the model make an error?) such as parsimony and stochasticity to understand their impact in human decision making. The parsimony of an error boundary expresses how simple it is to express when the model errs or succeeds (e.g. How many feature rules are needed?). Stochasticity instead expresses how clean that description would be. Both of them are of course related to the learnability of the error boundary as a function of the data representation from a human perspective. Based on our study, more parsimonious and less stochastic error boundaries are easier to learn, which opens up a new opportunity in machine learning optimization for training and deploying models that are easier to work with.

In the broader context of designing for Human-AI Collaboration, in 2019 our research initiative led by Saleema Amershi also devised a set of Human-AI Interaction Guidelines, which synthesize more than 20 years of thinking and research in human-AI interaction. They recommend best practices for how AI systems should behave upon initial interaction, during regular interaction, when they’re inevitably wrong, and over time.

Current and past collaborators in the project:
Gagan Bansal, Ece Kamar, Kori Inkpen, Tobias Schnabel, Adam Fourney, Eric Horvitz, Saleema Amershi, Mihaela Vorvoreanu (Microsoft Research and Microsoft Aether Committee)
Riccardo Fogliato, Shreya Chappidi, Divya Ramesh, Keri Mallari, Andi Peng, Tongshuang Wu (Microsoft Research Interns and Residents)


Investigations of Performance and Bias in Human-AI Teamwork in Hiring. Andi Peng, Besmira Nushi, Emre Kıcıman, Kori Inkpen, Ece Kamar; AAAI 2022. pdf

HINT: Integration Testing for AI-based features with Humans in the Loop. Quanze Chen, Tobias Schnabel, Besmira Nushi, Saleema Amershi; IUI 2022. pdf

Understanding Failures of Deep Networks via Robust Feature Extraction. Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, Eric Horvitz; CVPR 2021. pdf

Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, Daniel S. Weld; CHI 2021. pdf

Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork. Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, Dan Weld; AAAI 2021. pdf

An Empirical Analysis of Backward Compatibility in Machine Learning Systems. Megha Srivastava, Besmira Nushi, Ece Kamar, Shital Shah, Eric Horvitz; KDD 2020. pdf

SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions. Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar; CVPR 2020. pdf

Characterizing Search-Engine Traffic to Internet Research Agency Web Properties. Alexander Spangher, Gireeja Ranade, Besmira Nushi, Adam Fourney, Eric Horvitz; WebConf 2020. pdf

Metareasoning in Modular Software Systems: On-the-Fly Configuration using Reinforcement Learning with Rich Contextual Representations. Aditya Modi, Debadeepta Dey, Alekh Agarwal, Adith Swaminathan, Besmira Nushi, Sean Andrist, Eric Horvitz; AAAI 2020. pdf

Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S Weld, Walter Lasecki**, Eric Horvitz; HCOMP 2019. pdf
**Statement about author's misconduct

What You See Is What You Get? The Impact of Representation Criteria on Human Bias in Hiring. Andi Peng, Besmira Nushi, Emre Kiciman, Kori Inkpen, Siddharth Suri, Ece Kamar; HCOMP 2019. pdf

Software Engineering for Machine Learning: A Case Study. Saleema Amershi, Andrew Begel, Christian Bird, Rob DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, Thomas Zimmermann; ICSE 2019. pdf

Guidelines for Human-AI Interaction. Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, Eric Horvitz; CHI 2019. pdf

Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff. Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S Weld, Walter Lasecki**, Eric Horvitz; AAAI 2019. pdf
**Statement about author's misconduct

Overcoming Blind Spots in the Real World: Leveraging Complementary Abilities for Joint Execution. Ramya Ramakrishnan, Ece Kamar, Besmira Nushi, Debadeepta Dey, Julie Shah, Eric Horvitz; AAAI 2019. pdf

Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure. Besmira Nushi, Ece Kamar, Eric Horvitz; HCOMP 2018. pdf

Analysis of Strategy and Spread of Russia-sponsored Content in the US in 2017. Alexander Spangher, Gireeja Ranade, Besmira Nushi, Adam Fourney, Eric Horvitz; arXiv 2018. pdf

On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems. Besmira Nushi, Ece Kamar, Eric Horvitz, Donald Kossmann; AAAI 2017. pdf

Quality Control and Optimization for Hybrid Crowd-Machine Learning Systems. Besmira Nushi; ETH PhD Thesis 2016. pdf

Learning and Feature Selection under Budget Constraints in Crowdsourcing. Besmira Nushi, Adish Singla, Andreas Krause, Donald Kossmann; HCOMP 2016. pdf

Fault-Tolerant Entity Resolution with the Crowd. Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann; arXiv 2016. full technical report

Crowd Access Path Optimization: Diversity Matters. Besmira Nushi, Adish Singla, Anja Gruenheid, Erfan Zamanian, Andreas Krause, Donald Kossmann; HCOMP 2015. pdf

CrowdSTAR: A Social Task Routing Framework for Online Communities. Besmira Nushi, Omar Alonso, Martin Hentschel, and Vasileios Kandylas; ICWE 2015. pdf full technical report

When is A = B? Anja Gruenheid, Donald Kossmann, Besmira Nushi, Yuri Gurevich; EATCS Bulletin 111 (2013) pdf

Uncertain time-series similarity: Return to the basics. Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, and Themis Palpanas; Proceedings of the VLDB Endowment 5, no. 11 (2012): 1662-1673. pdf

Similarity matching for uncertain time series: analytical and experimental comparison. Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, and Themis Palpanas. Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data, pp. 8-15. ACM, 2011. pdf


Microsoft building 99 (3137)
14820 NE 36th St, Redmond, WA 98052, USA