Research Theme: Evaluation and Benchmarking

This research theme focuses on the design and development of new metrics, benchmarks, and approaches for evaluating information retrieval, AI, and other computing systems.

Keynotes, invited talks, and lectures

So, You Want to Release a Dataset? Reflections on Benchmark Development, Community Building, and Making Robust Scientific Progress

Spotify
Virtual, September 2022
SlideShare | PDF

Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond

NLIWOD workshop, International Semantic Web Conference
Virtual, November 2020
SlideShare | PPT

Shared task organization

Tip-of-the-Tongue task, NII Test Collection for IR Systems (NTCIR), 2026
Tip-of-the-Tongue track, Text REtrieval Conference (TREC), 2023-2025
Deep Learning track, Text REtrieval Conference (TREC), 2019-2023
Microsoft MAchine Reading COmprehension (MS MARCO) passage and document ranking leaderboards, November 2018

Workshop organization

LLM4Eval: Large Language Model for Evaluation in IR, SIGIR, July 2025
LLM4Eval: Large Language Model for Evaluation in IR, WSDM, March 2025
LLM4Eval: Large Language Model for Evaluation in IR, SIGIR, July 2024

Publications

Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz
Preprint, 2025
PDF | ArXiv

Towards Understanding Bias in Synthetic Data for Evaluation

Hossein A. Rahmani, Varsha Ramineni, Nick Craswell, Bhaskar Mitra, and Emine Yilmaz
In proc. ACM CIKM, 2025
Publication | PDF | ArXiv

LLM4Eval: Large Language Model for Evaluation in IR

Clemencia Siro, Hossein A. Rahmani, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz
In proc. ACM SIGIR, 2025
Publication | PDF

Tip of the Tongue Query Elicitation for Simulated Evaluation

Yifan He, To Eun Kim, Fernando Diaz, Jaime Arguello, and Bhaskar Mitra
In proc. ACM SIGIR, 2025
Publication | PDF | ArXiv

JudgeBlender: Ensembling Automatic Relevance Judgments

Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, and Bhaskar Mitra
In proc. ACM TheWebConf, 2025
Publication | PDF | ArXiv

SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval

Hossein A. Rahmani, Xi Wang, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, and Paul Thomas
In proc. ACM TheWebConf, 2025
Publication | PDF | ArXiv

LLM4Eval@WSDM 2025: Large Language Model for Evaluation in Information Retrieval

Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz
In proc. ACM WSDM, 2025
Publication | PDF

Recall, Robustness, and Lexicographic Evaluation

Fernando Diaz, Michael D. Ekstrand, and Bhaskar Mitra
In ACM Transactions on Recommender Systems (TORS), 2025
Publication | PDF | ArXiv

Overview of the TREC 2024 Tip-of-the-Tongue Track

Jaime Arguello, Samarth Bhargav, Fernando Diaz, To Eun Kim, Yifan He, Evangelos Kanoulas, and Bhaskar Mitra
In proc. Text REtrieval Conference (TREC), 2025
Publication | PDF

LLMJudge: LLMs for Relevance Judgments

Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, Bhaskar Mitra, Paul Thomas, Charles L. A. Clarke, Mohammad Aliannejadi, Clemencia Siro, and Guglielmo Faggioli
In proc. LLM4Eval: The First Workshop on Large Language Models for Evaluation in Information Retrieval, ACM SIGIR, 2024
Publication | PDF | ArXiv

Proceedings of The First Workshop on Large Language Models for Evaluation in Information Retrieval (LLM4Eval 2024)

Clemencia Siro, Mohammad Aliannejadi, Hossein A. Rahmani, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz
Proceedings

Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024

Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz
In ACM SIGIR Forum, 2024
Publication | PDF | ArXiv

LLM4Eval: Large Language Model for Evaluation in IR

Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine Yilmaz
In proc. ACM SIGIR, 2024
Publication | PDF

Synthetic Test Collections for Retrieval Evaluation

Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, and Daniel Campos
In proc. ACM SIGIR, 2024
Publication | PDF | ArXiv

Large Language Models can Accurately Predict Searcher Preferences

Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra
In proc. ACM SIGIR, 2024
Publication | PDF | ArXiv

Towards Group-aware Search Success

Haolun Wu, Bhaskar Mitra, and Nick Craswell
In proc. ACM ICTIR, 2024
Publication | PDF | ArXiv

Learning to Extract Structured Entities Using Language Models

Haolun Wu, Ye Yuan, Liana Mikaelyan, Alexander Meulemans, Xue Liu, James Hensman, and Bhaskar Mitra
In proc. EMNLP, 2024
Publication | PDF | ArXiv

Overview of the TREC 2023 Tip-of-the-Tongue Track

Jaime Arguello, Samarth Bhargav, Fernando Diaz, Evangelos Kanoulas, and Bhaskar Mitra
In proc. Text REtrieval Conference (TREC), 2024
Publication | PDF

Overview of the TREC 2023 Deep Learning Track

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff
In proc. Text REtrieval Conference (TREC), 2024
Publication | PDF | ArXiv

Overview of the TREC 2022 Deep Learning Track

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff
In proc. Text REtrieval Conference (TREC), 2023
Publication | PDF | ArXiv

Are We There Yet? A Decision Framework for Replacing Term-Based Retrieval with Dense Retrieval Systems

Sebastian Hofstätter, Nick Craswell, Bhaskar Mitra, Hamed Zamani, and Allan Hanbury
Preprint, 2022
PDF | ArXiv

Fostering Coopetition While Plugging Leaks: The Design and Implementation of the MS MARCO Leaderboards

Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, and Emine Yilmaz
In proc. ACM SIGIR, 2022
Publication | PDF

Joint Multisided Exposure Fairness for Recommendation

Haolun Wu, Bhaskar Mitra, Chen Ma, Fernando Diaz, and Xue Liu
In proc. ACM SIGIR, 2022
Publication | PDF | ArXiv

Overview of the TREC 2021 Deep Learning Track

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin
In proc. Text REtrieval Conference (TREC), 2022
Publication | PDF | ArXiv

MS MARCO Chameleons: Challenging the MS MARCO Leaderboard with Extremely Obstinate Queries

Negar Arabzadeh, Bhaskar Mitra, and Ebrahim Bagheri
In proc. ACM CIKM, 2021
Publication | PDF

MS MARCO: Benchmarking Ranking Models in the Large-Data Regime

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin
In proc. ACM SIGIR, 2021
Publication | PDF | ArXiv

TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M. Voorhees, and Ian Soboroff
In proc. ACM SIGIR, 2021
Publication | PDF | ArXiv

Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard

Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, and Emine Yilmaz
In proc. ACM SIGIR, 2021
Publication | PDF | ArXiv

Tip of the Tongue Known-Item Retrieval: A Case Study in Movie Identification

Jaime Arguello, Adam Ferguson, Emery Fine, Bhaskar Mitra, Hamed Zamani, and Fernando Diaz
In proc. ACM CHIIR, 2021
Publication | PDF | ArXiv

Neural methods for effective, efficient, and exposure-aware information retrieval

Bhaskar Mitra
In ACM SIGIR Forum, 2021
Publication | PDF

Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval

Bhaskar Mitra
PhD thesis, University College London, 2021
Publication | PDF | ArXiv

Overview of the TREC 2020 Deep Learning Track

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos
In proc. Text REtrieval Conference (TREC), 2021
Publication | PDF | ArXiv

Evaluating Stochastic Rankings with Expected Exposure

Fernando Diaz, Bhaskar Mitra, Michael D. Ekstrand, Asia J. Biega, and Ben Carterette
In proc. ACM CIKM, 2020
🏆 Best Long Research Paper Nominee
Publication | PDF | ArXiv

ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search

Nick Craswell, Daniel Campos, Bhaskar Mitra, Emine Yilmaz, and Bodo Billerbeck
In proc. ACM CIKM, 2020
Publication | PDF | ArXiv

On the Reliability of Test Collections for Evaluating Systems of Different Types

Emine Yilmaz, Nick Craswell, Bhaskar Mitra, and Daniel Campos
In proc. ACM SIGIR, 2020
Publication | PDF | ArXiv

Overview of the TREC 2019 Deep Learning Track

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees
In proc. Text REtrieval Conference (TREC), 2020
Publication | PDF | ArXiv

Benchmark for Complex Answer Retrieval

Federico Nanni, Bhaskar Mitra, Matt Magnusson, and Laura Dietz
In proc. ACM ICTIR, 2017
Publication | PDF | ArXiv

An Eye-tracking Study of User Interactions with Query Auto Completion

Katja Hofmann, Bhaskar Mitra, Filip Radlinski, and Milad Shokouhi
In proc. ACM CIKM, 2014
Publication | PDF

Bhaskar Mitra | ভাস্কর মিত্র
