Error Analysis Tool
Git repository: https://github.com/microsoft/responsible-ai-widgets

We are happy to share the Error Analysis tool with the open source community. The tool is based on our earlier work for failure explanation in ML systems and was developed with the help of our amazing partners in Azure Machine Learning and Microsoft Mixed Reality. Read more about the tool in our blog: Responsible Machine Learning with Error Analysis.

BIOGRAPHY
I am a researcher in the Adaptive Systems and Interaction group at Microsoft Research. My research lies at the intersection of human and machine intelligence and aims to improve current AI systems, either with better debugging tools or by optimizing them for human-centered properties. I am currently excited about two main directions in this intersection:

Debugging and Failure Analysis of AI/ML Systems for accelerating the software development lifecycle of reliable and robust learning systems. I build tools that enable machine learning practitioners to identify and diagnose failures in learned models. Take a look at Error Analysis and BackwardCompatibilityML as examples of such tools in the open source community.
Human-AI Collaboration for enhancing human capabilities while solving complex decision-making tasks. I study properties of ML models that make them better collaborators with people and design new optimization techniques that encode such properties into models. Check out my recent talk on this topic: The Utopia of Human-AI Collaboration.
I am also involved in various research initiatives that study the societal impact of artificial intelligence, as well as quality-of-service aspects of AI including interpretability, reliability, accountability, and fairness. Here is a recent research podcast on my current interests.

If you are a PhD student looking for an internship position around these topics, send me an email. The Adaptive Systems and Interaction group is a fun bunch of excellent and diverse researchers.

Prior to joining Microsoft Research, I completed my PhD at ETH Zurich (Switzerland) in the Systems Group in 2016, advised by Prof. Donald Kossmann and Prof. Andreas Krause. My doctoral thesis focused on building cost- and quality-aware models for integrating crowdsourcing into the process of building machine learning algorithms and systems. In 2011, I completed my master's studies in computer science in a double-degree MSc program at RWTH Aachen University (Germany) and the University of Trento (Italy) as an Erasmus Mundus scholar. I also hold a Diploma in Informatics from the University of Tirana (Albania), where I graduated in 2007.

NEWS

Coming soon - New work on Understanding Failures of Deep Networks via Robust Feature Extraction was accepted at CVPR 2021. The project was led by Sahil Singla as part of his MSR summer internship and extends our efforts on failure explanation by proposing a semi-automated technique for extracting visual attributes useful for error analysis on image data.

Coming soon - Our work investigating the usefulness of explanations for Human-AI decision-making will appear at CHI 2021: Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. The work was led by Tongshuang Wu and Gagan Bansal, who continue to make impressive contributions to Human-Centered AI.

February 2021 - We are happy to share the Error Analysis tool with the open source community. The tool is based on our earlier work for failure explanation in ML systems and was developed with the help of our amazing partners in Azure Machine Learning and Microsoft Mixed Reality. Read more about the tool in our blog: Responsible Machine Learning with Error Analysis.

February 2021 - Pursuing our Human-AI Collaboration projects, we presented our latest work at AAAI 2021: Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork. The project, led by Gagan Bansal, introduces new challenges at the intersection of Human-AI Collaboration and Machine Learning Optimization, proposing to optimize ML models so that they maximize team utility in settings where humans make decisions assisted by ML predictions.

December 2020 - Our team, in collaboration with the Microsoft Aether Committee, released the BackwardCompatibilityML tool. The open source tool enables ML developers to retrain models that do not introduce new errors during updates, through a series of loss functions and compatibility metrics. It also provides a set of visualization widgets for model comparison in Jupyter notebooks.

August 2020 - Our new paper, An Empirical Analysis of Backward Compatibility in Machine Learning Systems, was presented at KDD 2020. The work characterizes when and how backward incompatibility and newly introduced errors may arise while retraining machine learning models. The project was led by Megha Srivastava during her AI Residency year at Microsoft Research.

August 2020 - I gave an invited talk at the Machine Learning for Healthcare conference, titled "The Unpaved Path of Deploying Reliable and Human-Centered Machine Learning Systems".

July 2020 - I organized a focused session at the Frontiers in Machine Learning event at Microsoft Research, on Machine Learning Reliability and Robustness. Make sure to hear Tom Dietterich, Suchi Saria, and Ece Kamar speak about their thoughts and work in this space.

June 2020 - Our paper SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions was presented at CVPR 2020. Ramprasaath R. Selvaraju led this work during his internship at MSR, bringing to life important ideas on enabling VQA model reasoning by enforcing sub-question consistency. The paper also contributes a novel dataset to the community, VQA-Introspect, for training and evaluating VQA models that are right for the right reasons.

March 2020 - New paper on Characterizing Search-Engine Traffic to Internet Research Agency Web Properties presented at WebConf 2020. The work, led by Alex Spangher, presents a thorough analysis of the impact of IRA-related posts and ads on web search activity.

February 2020 - Our group presented a new paper on Metareasoning in Modular Software Systems at AAAI 2020. The work, led by Aditya Modi and Debadeepta Dey, optimizes integrative AI systems on-the-fly using reinforcement learning with rich contextual representations. Read our blog post summarizing the work and vision.

February 2020 - Together with Dan Weld, Adam Fourney, and Saleema Amershi, we organized a tutorial on Guidelines for Human-AI Interaction at AAAI 2020 in New York, February 8th 2020. We also published a detailed blog post on How to build effective human-AI interaction: Considerations for machine learning and software engineering.

DEBUGGING AND FAILURE ANALYSIS FOR AI SYSTEMS
How can we better understand failures of an AI system?

Building reliable AI requires a deep understanding of potential system failures. The focus of this project is to build tools that help engineers accelerate development and improvement cycles by assisting them in debugging and troubleshooting. For example, in 2018 we proposed Pandora, a set of hybrid human-machine methods for describing and explaining system failures. The approach provided engineers with descriptive performance reports that correlate input conditions with errors, guiding them towards discovering hidden conditions of failure.

Based on Pandora, we built the Error Analysis tool as a collaboration between Microsoft Research, Azure Machine Learning, Microsoft Mixed Reality, and the Microsoft Aether Committee. The vision of this collaboration is to build tools that help engineers accelerate development iterations by identifying errors faster, more systematically, and more rigorously. We are also continuing to innovate with extended debugging techniques for vision by extracting interpretable attributes from robust representations. Moving forward, our team is also working on closing the loop of error identification, diagnosis, and mitigation by providing a library of mitigation strategies and by including model comparison capabilities in our toolchain. For a comprehensive summary of our efforts on model comparison, take a look at our BackwardCompatibilityML tool for training compatible models and comparing them with each other.
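
To give a flavor of the underlying idea, the sketch below approximates where a model fails by fitting an interpretable surrogate (a shallow decision tree) on the model's error labels and reading off high-error cohorts. This is a minimal illustration of the concept, not the Error Analysis tool's API; the dataset and model are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset and model under analysis; any trained classifier works.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Label each evaluation instance with whether the model got it wrong.
errors = (model.predict(X_test) != y_test).astype(int)

# Fit a shallow tree on the error labels: each leaf describes a cohort
# (a conjunction of feature conditions) with a systematic error rate.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_test, errors)
print(export_text(surrogate, feature_names=list(data.feature_names)))
```

The printed tree exposes feature conditions under which errors concentrate, which is the starting point for diagnosis.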

Figure: Pandora workflow for error analysis

Figure: Error Analysis views for error identification and diagnosis

In the same vein, this project has also explored ideas for troubleshooting AI systems that contain multiple learning components. Diagnosing such systems is challenging: errors often get propagated, suppressed, or even amplified down the computation pipeline. We propose a troubleshooting methodology that generates counterfactual improved states of system components by using crowd intelligence. These states, which would have been too expensive or infeasible to generate otherwise, are then integrated into the system execution to create insights about which component fixes are the most efficient given the current system architecture (see the sketch below the figure).
Figure: Troubleshooting Integrative AI systems with humans in the loop
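
The following sketch illustrates the fix-ranking intuition on a toy two-stage pipeline: substitute each component's output with a counterfactual "fixed" version, re-run the pipeline, and compare the end-to-end gain. All components and data here are hard-coded stand-ins; in our work the counterfactual improved states come from crowd workers, not oracles.

```python
# Each example: (raw input, counterfactual "fixed" stage-1 output, gold label).
examples = [("img1", "crop1", "cat"), ("img2", "crop2", "dog"),
            ("img3", "crop3", "cat"), ("img4", "crop4", "bird")]

# Imperfect stage outputs, stand-ins for real components (e.g., a detector
# feeding a recognizer). Stage 1 errs on img2/img4; stage 2 errs on crop3.
stage1 = {"img1": "crop1", "img2": "cropX", "img3": "crop3", "img4": "cropY"}
stage2 = {"crop1": "cat", "crop2": "dog", "crop3": "dog", "crop4": "bird",
          "cropX": "cat", "cropY": "cat"}

def accuracy(fix_stage1=False, fix_stage2=False):
    hits = 0
    for x, oracle1, gold in examples:
        mid = oracle1 if fix_stage1 else stage1[x]
        if fix_stage2:
            # A perfected stage 2 labels correct intermediates correctly,
            # but cannot recover from an upstream (wrong-crop) failure.
            out = gold if mid == oracle1 else stage2[mid]
        else:
            out = stage2[mid]
        hits += (out == gold)
    return hits / len(examples)

base = accuracy()                                                # 0.25
print("gain from fixing stage 1:", accuracy(fix_stage1=True) - base)  # +0.50
print("gain from fixing stage 2:", accuracy(fix_stage2=True) - base)  # +0.25
```

Here fixing the first component yields the larger end-to-end gain, so its fix would be prioritized, which is exactly the kind of insight the methodology surfaces.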

Current and past collaborators in the project:
Ece Kamar, Shital Shah, Eric Horvitz, Juan Lema, Xavier Fernandes, Nicholas King, Mihaela Vorvoreanu (Microsoft Research and Microsoft Aether Committee)
Ilya Matiach, Mehrnoosh Sameki, Hyemi Song, Richard Edgar, Roman Lutz (Azure Machine Learning)
Parham Mohadjer, Josh Hinds, Russell Eames, Yan Esteve Balducci, Mark Flick (Microsoft Mixed Reality)
Gagan Bansal, Sahil Singla, Megha Srivastava (Microsoft Research Interns and Residents)

HUMAN-AI COLLABORATION
What are the properties of a good AI collaborator? How can we optimize ML models for collaboration?

Machine learning models are currently optimized to maximize accuracy on given benchmarks and test datasets. When a learning model is used by a human either to accomplish a complex task or to make a high-stakes decision (e.g., medical diagnosis or recidivism assessment), team performance depends not only on model accuracy but also on how well humans understand when to trust the AI, so that they know when to override its decisions. In this project, we instead aim at optimizing joint human-model performance.

Figure: AI-advised Human Decision Making

Our first steps towards this goal have been to study how models should be updated so that they do not violate the trust that users have built during their interaction with the previous model over time. By incorporating into the loss function the goal of staying backward compatible with the previous model, and therefore with the previous user experience, we minimize update disruption for the whole team.
Figure: Human-AI teams undergoing a model update
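
As a minimal sketch of these ideas: the compatibility score below follows our AAAI 2019 formulation (the fraction of examples the old model got right that the new model still gets right), while the per-example weighting is just one simple way to realize a backward-compatibility penalty during retraining. The helper names are illustrative, not the BackwardCompatibilityML API.

```python
import numpy as np

def compatibility(y_true, old_pred, new_pred):
    """C(h1, h2) = P(new model correct | old model correct)."""
    old_correct = old_pred == y_true
    return ((new_pred == y_true) & old_correct).sum() / old_correct.sum()

def backward_loss_weights(y_true, old_pred, lam=1.0):
    """Per-example loss weights that upweight examples the old model solved,
    discouraging the retrained model from breaking them (lam trades off
    accuracy vs. compatibility)."""
    return 1.0 + lam * (old_pred == y_true).astype(float)

y_true   = np.array([0, 1, 1, 0, 1])
old_pred = np.array([0, 1, 0, 0, 1])   # old model errs on index 2
new_pred = np.array([0, 0, 1, 0, 1])   # candidate update breaks index 1
print(compatibility(y_true, old_pred, new_pred))   # 0.75: one regression
```

A fully compatible update scores 1.0; anything lower signals newly introduced errors that users of the old model would experience as regressions.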

Within the same context, we have also studied further properties of a machine learning model's error boundary (i.e., when does the model make an error?), such as parsimony and stochasticity, to understand their impact on human decision-making. The parsimony of an error boundary expresses how simple it is to describe when the model errs or succeeds (e.g., how many feature rules are needed?). Stochasticity instead expresses how clean that description is. Both properties relate to the learnability of the error boundary, as a function of the data representation, from a human perspective. Based on our studies, more parsimonious and less stochastic error boundaries are easier to learn, which opens up a new opportunity in machine learning optimization: training and deploying models that are easier to work with.
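
As a rough illustration of how these two properties could be quantified (the tree-based proxy below is an assumption of this sketch, not the exact formulation from our papers): approximate the error boundary with a shallow decision tree over error labels, then read parsimony off the number of rules and stochasticity off how noisy each rule is.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boundary_properties(X, errors, max_depth=3):
    """Proxy measures over numpy arrays: rule count (parsimony) and
    average leaf noise (stochasticity) of the model's error boundary."""
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, errors)
    n_rules = tree.get_n_leaves()       # fewer leaves -> more parsimonious
    leaves = tree.apply(X)
    # Stochasticity: average disagreement with each leaf's majority label,
    # i.e., how "clean" the rule-based description of the boundary is.
    noise = np.mean([
        np.mean(errors[leaves == leaf] != round(errors[leaves == leaf].mean()))
        for leaf in np.unique(leaves)
    ])
    return n_rules, noise
```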

In the broader context of designing for Human-AI Collaboration, in 2019 our research initiative led by Saleema Amershi also devised a set of Human-AI Interaction Guidelines, which synthesize more than 20 years of thinking and research in human-AI interaction. They recommend best practices for how AI systems should behave upon initial interaction, during regular interaction, when they're inevitably wrong, and over time.

Current and past collaborators in the project:
Ece Kamar, Kori Inkpen, Tobias Schnabel, Adam Fourney, Eric Horvitz, Saleema Amershi, Mihaela Vorvoreanu (Microsoft Research and Microsoft Aether Committee)
Gagan Bansal, Andi Peng, Tongshuang Wu (Microsoft Research Interns and Residents)

PUBLICATIONS

Understanding Failures of Deep Networks via Robust Feature Extraction. Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, Eric Horvitz; CVPR 2021. pdf

Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, Daniel S. Weld; CHI 2021. pdf

Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork. Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, Dan Weld; AAAI 2021. pdf

An Empirical Analysis of Backward Compatibility in Machine Learning Systems. Megha Srivastava, Besmira Nushi, Ece Kamar, Shital Shah, Eric Horvitz; KDD 2020. pdf

SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions. Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar; CVPR 2020. pdf

Characterizing Search-Engine Traffic to Internet Research Agency Web Properties. Alexander Spangher, Gireeja Ranade, Besmira Nushi, Adam Fourney, Eric Horvitz; WebConf 2020. pdf

Metareasoning in Modular Software Systems: On-the-Fly Configuration using Reinforcement Learning with Rich Contextual Representations. Aditya Modi, Debadeepta Dey, Alekh Agarwal, Adith Swaminathan, Besmira Nushi, Sean Andrist, Eric Horvitz; AAAI 2020. pdf

Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S Weld, Walter S Lasecki, Eric Horvitz; HCOMP 2019. pdf

What You See Is What You Get? The Impact of Representation Criteria on Human Bias in Hiring. Andi Peng, Besmira Nushi, Emre Kiciman, Kori Inkpen, Siddharth Suri, Ece Kamar; HCOMP 2019. pdf

Software Engineering for Machine Learning: A Case Study. Saleema Amershi, Andrew Begel, Christian Bird, Rob DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, Thomas Zimmermann; ICSE 2019. pdf

Guidelines for Human-AI Interaction. Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, Eric Horvitz; CHI 2019. pdf

Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff. Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S Weld, Walter S Lasecki, Eric Horvitz; AAAI 2019. pdf

Overcoming Blind Spots in the Real World: Leveraging Complementary Abilities for Joint Execution. Ramya Ramakrishnan, Ece Kamar, Besmira Nushi, Debadeepta Dey, Julie Shah, Eric Horvitz; AAAI 2019. pdf

Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure. Besmira Nushi, Ece Kamar, Eric Horvitz; HCOMP 2018. pdf

Analysis of Strategy and Spread of Russia-sponsored Content in the US in 2017. Alexander Spangher, Gireeja Ranade, Besmira Nushi, Adam Fourney, Eric Horvitz; arXiv 2018. pdf

On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems. Besmira Nushi, Ece Kamar, Eric Horvitz, Donald Kossmann; AAAI 2017. pdf

Quality Control and Optimization for Hybrid Crowd-Machine Learning Systems. Besmira Nushi; ETH PhD Thesis 2016. pdf

Learning and Feature Selection under Budget Constraints in Crowdsourcing. Besmira Nushi, Adish Singla, Andreas Krause, Donald Kossmann; HCOMP 2016. pdf

Fault-Tolerant Entity Resolution with the Crowd. Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, Donald Kossmann; arXiv 2016. full technical report

Crowd Access Path Optimization: Diversity Matters. Besmira Nushi, Adish Singla, Anja Gruenheid, Erfan Zamanian, Andreas Krause, Donald Kossmann; HCOMP 2015. pdf

CrowdSTAR: A Social Task Routing Framework for Online Communities. Besmira Nushi, Omar Alonso, Martin Hentschel, Vasileios Kandylas; ICWE 2015. pdf full technical report

When is A = B? Anja Gruenheid, Donald Kossmann, Besmira Nushi, Yuri Gurevich; EATCS Bulletin 111 (2013). pdf

Uncertain Time-Series Similarity: Return to the Basics. Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas; Proceedings of the VLDB Endowment 5, no. 11 (2012): 1662-1673. pdf

Similarity Matching for Uncertain Time Series: Analytical and Experimental Comparison. Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas; Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data, pp. 8-15. ACM, 2011. pdf

CONTACT INFO

Microsoft building 99 (3121)
14820 NE 36th St, Redmond, WA 98052, USA