Publications | Shramay Palta

2026

ACL 2026

Arguments that Alter Minds: LLM Rationales Sway Human (and LLM) Notions of Plausibility

Shramay Palta, Peter Rankel, Sarah Wiegreffe, and Rachel Rudinger

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate the degree to which human (and LLM) plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are “experts” (i.e., common sense), LLMs have the potential to exert considerable influence on people’s beliefs.

@inproceedings{palta-etal-2026-arguments,
  abbr = "ACL",
  title = "Arguments that Alter Minds: {LLM} Rationales Sway Human (and {LLM}) Notions of Plausibility",
  author = "Palta, Shramay and Rankel, Peter A. and Wiegreffe, Sarah and Rudinger, Rachel",
  editor = "Liakata, Maria and Moreira, Viviane P. and Zhang, Jiajun and Jurgens, David",
  booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
  month = jul,
  year = "2026",
  publisher = "Association for Computational Linguistics",
  address = "San Diego, California, United States",
  url = "https://aclanthology.org/2026.acl-long.599/",
  pages = "13132--13152",
  ISBN = "979-8-89176-390-6",
  abstract = "We investigate the degree to which human (and LLM) plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are ``experts'' (i.e., common sense), LLMs have the potential to exert considerable influence on people’s beliefs."
}

2025

IJCNLP-AACL 2025

Speaking the Right Language: The Impact of Expertise (Mis)Alignment in User-AI Interactions

Shramay Palta, Nirupama Chandrasekaran, Rachel Rudinger, and Scott Counts

Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Using a sample of 25,000 Bing Copilot conversations, we study how the agent responds to users of varying levels of domain expertise and the resulting impact on user experience along multiple dimensions. Our findings show that across a variety of topical domains, the agent largely responds at proficient or expert levels of expertise (77% of conversations) which correlates with positive user experience regardless of the user’s level of expertise. Misalignment, such that the agent responds at a level of expertise below that of the user, has a negative impact on overall user experience, with the impact more profound for more complex tasks. We also show that users engage more, as measured by the number of words in the conversation, when the agent responds at a level of expertise commensurate with that of the user. Our findings underscore the importance of alignment between users and AI when designing human-centered AI systems, to ensure satisfactory and productive interactions.

@inproceedings{palta-etal-2025-speaking,
  abbr = "IJCNLP-AACL 2025",
  title = "Speaking the Right Language: The Impact of Expertise (Mis)Alignment in User-{AI} Interactions",
  author = "Palta, Shramay and Chandrasekaran, Nirupama and Rudinger, Rachel and Counts, Scott",
  editor = "Inui, Kentaro and Sakti, Sakriani and Wang, Haofen and Wong, Derek F. and Bhattacharyya, Pushpak and Banerjee, Biplab and Ekbal, Asif and Chakraborty, Tanmoy and Singh, Dhirendra Pratap",
  booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
  month = dec,
  year = "2025",
  publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
  address = "Mumbai, India",
  url = "https://aclanthology.org/2025.ijcnlp-short.5/",
  doi = "10.18653/v1/2025.ijcnlp-short.5",
  pages = "58--69",
  ISBN = "979-8-89176-299-2",
  abstract = "Using a sample of 25,000 Bing Copilot conversations, we study how the agent responds to users of varying levels of domain expertise and the resulting impact on user experience along multiple dimensions. Our findings show that across a variety of topical domains, the agent largely responds at proficient or expert levels of expertise (77{\%} of conversations) which correlates with positive user experience regardless of the user{'}s level of expertise. Misalignment, such that the agent responds at a level of expertise below that of the user, has a negative impact on overall user experience, with the impact more profound for more complex tasks. We also show that users engage more, as measured by the number of words in the conversation, when the agent responds at a level of expertise commensurate with that of the user. Our findings underscore the importance of alignment between users and AI when designing human-centered AI systems, to ensure satisfactory and productive interactions."
}

2024

EMNLP 2024

Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning

Shramay Palta, Nishant Balepur, Peter A. Rankel, Sarah Wiegreffe, Marine Carpuat, and Rachel Rudinger

Findings of the Association for Computational Linguistics: EMNLP 2024

Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.

@inproceedings{palta-etal-2024-plausibly,
  abbr = "EMNLP",
  title = "Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning",
  author = "Palta, Shramay and Balepur, Nishant and Rankel, Peter and Wiegreffe, Sarah and Carpuat, Marine and Rudinger, Rachel",
  editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
  month = nov,
  year = "2024",
  publisher = "Association for Computational Linguistics",
  address = "Miami, Florida, USA",
  url = "https://aclanthology.org/2024.findings-emnlp.198/",
  doi = "10.18653/v1/2024.findings-emnlp.198",
  pages = "3451--3473",
  abstract = "Questions involving commonsense reasoning about everyday situations often admit many possible or plausible answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the most plausible answer choice. On 250 MCQ items sampled from two commonsense reasoning benchmarks, we collect 5,000 independent plausibility judgments on answer choices. We find that for over 20{\%} of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation."
}

ECCV 2024

Investigating Style Similarity in Diffusion Models

Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein

The 18th European Conference on Computer Vision: ECCV 2024

Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence as their proliferation increases, it has become important to perform a database search to determine whether the properties of the image are attributable to specific training data, every time before a generated image is used for professional purposes. Existing tools for this purpose focus on retrieving images of similar semantic content. Meanwhile, many artists are concerned with style replication in text-to-image models. We present a framework for understanding and extracting style descriptors from images. Our framework comprises a new dataset curated using the insight that style is a subjective property of an image that captures complex yet meaningful interactions of factors including but not limited to colors, textures, shapes, etc. We also propose a method to extract style descriptors that can be used to attribute style of a generated image to the images used in the training dataset of a text-to-image model. We showcase promising results in various style retrieval tasks. We also quantitatively and qualitatively analyze style attribution and matching in the Stable Diffusion model. Code and artifacts are available at this https URL.

@inproceedings{somepalli2024measuring,
  title = "Investigating Style Similarity in Diffusion Models",
  author = "Gowthami Somepalli and Anubhav Gupta and Kamal Gupta and Shramay Palta and Micah Goldblum and Jonas Geiping and Abhinav Shrivastava and Tom Goldstein",
  booktitle = "Computer Vision -- ECCV 2024",
  publisher = "Springer Nature Switzerland",
  pages = "143--160",
  address = "Cham",
  url = "https://link.springer.com/chapter/10.1007/978-3-031-72848-8_9",
  year = "2025"
  ISBN = "978-3-031-72848-8"
  abstract = "Generative models are now widely used by graphic designers and artists. Prior works have shown that these models remember and often replicate content from their training data during generation. Hence as their proliferation increases, it has become important to perform a database search to determine whether the properties of the image are attributable to specific training data, every time before a generated image is used for professional purposes. Existing tools for this purpose focus on retrieving images of similar semantic content. Meanwhile, many artists are concerned with style replication in text-to-image models. We present a framework for understanding and extracting style descriptors from images. Our framework comprises a new dataset curated using the insight that style is a subjective property of an image that captures complex yet meaningful interactions of factors including but not limited to colors, textures, shapes, etc.. We also propose a method to extract style descriptors that can be used to attribute style of a generated image to the images used in the training dataset of a text-to-image model. We showcase promising results in various style retrieval tasks. We also quantitatively and qualitatively analyze style attribution and matching in the Stable Diffusion model."
  
}

ACL 2024

It’s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning

Nishant Balepur, Shramay Palta, and Rachel Rudinger

Findings of the Association for Computational Linguistics: ACL 2024

Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This strategy of process of elimination (PoE), when used with COT, has the potential to enhance interpretability in tasks like medical diagnoses of exclusion. Thus, we propose PoE with COT, a new task where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on 2-choice commonsense and scientific reasoning datasets. We show that PoE consistently underperforms directly choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct an error analysis and give suggestions for future work.

@inproceedings{balepur-etal-2024-easy,
    abbr = "ACL",
    title = "It{'}s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning",
    author = "Balepur, Nishant and Palta, Shramay and Rudinger, Rachel",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    publisher = "Association for Computational Linguistics",
    address = "Bangkok, Thailand",
    url = "https://aclanthology.org/2024.findings-acl.604/",
    doi = "10.18653/v1/2024.findings-acl.604",
    pages = "10143--10166",
    abstract = "Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This process of elimination (PoE), when used with COT, can enhance self-consistency, interpretability, and tasks such as medical diagnoses of exclusion. Thus, we propose PoE with COT, where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on a total of four commonsense and scientific reasoning datasets. We find that the strategy of PoE always underperforms the strategy of choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct error analyses and give suggestions for future work."
}

2023

ACL 2023

FORK: A Bite-Sized Test Set for Probing Culinary Cultural Biases in Commonsense Reasoning Models

Shramay Palta and Rachel Rudinger

Findings of the Association for Computational Linguistics: ACL 2023

It is common sense that one should prefer to eat a salad with a fork rather than with a chainsaw. However, for eating a bowl of rice, the choice between a fork and a pair of chopsticks is culturally relative. We introduce FORK, a small, manually-curated set of CommonsenseQA-style questions for probing cultural biases and assumptions present in commonsense reasoning systems, with a specific focus on food-related customs. We test several CommonsenseQA systems on FORK, and while we see high performance on questions about the US culture, the poor performance of these systems on questions about non-US cultures highlights systematic cultural biases aligned with US over non-US cultures.

@inproceedings{palta-rudinger-2023-fork,
  abbr = "ACL",
  title = "{FORK}: A Bite-Sized Test Set for Probing Culinary Cultural Biases in Commonsense Reasoning Models",
  author = "Palta, Shramay and Rudinger, Rachel",
  editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
  month = jul,
  year = "2023",
  publisher = "Association for Computational Linguistics",
  address = "Toronto, Canada",
  url = "https://aclanthology.org/2023.findings-acl.631/",
  doi = "10.18653/v1/2023.findings-acl.631",
  pages = "9952--9962",
  abstract = "It is common sense that one should prefer to eat a salad with a fork rather than with a chainsaw. However, for eating a bowl of rice, the choice between a fork and a pair of chopsticks is culturally relative. We introduce FORK, a small, manually-curated set of CommonsenseQA-style questions for probing cultural biases and assumptions present in commonsense reasoning systems, with a specific focus on food-related customs. We test several CommonsenseQA systems on FORK, and while we see high performance on questions about the US culture, the poor performance of these systems on questions about non-US cultures highlights systematic cultural assumptions aligned with US over non-US cultures."
}

2022

arXiv Preprint

Investigating Information Inconsistency in Multilingual Open-Domain Question Answering

Shramay Palta, Haozhe An, Yifan Yang, Shuaiyi Huang, and Maharshi Gor

arXiv preprint, 2022

Multilingual open-domain question answering can unlock information that might be unavailable in a user's primary language. However, can that user trust the information they get? Multilingual question answering can potentially expose users to unreliable information through cultural differences, divergent national laws, or uneven resources. To understand the effects of the biased availability of information and cultural influence, we analyze the behavior of multilingual open-domain question answering models with a focus on retrieval bias. We analyze whether different retriever models present different passages (and answers) given the same question in different languages on TyDi QA and XOR-TyDi QA, two multilingual QA datasets. While most answers are consistent, the cases where they differ reveal valuable information about per-language resource disparities and linguistic variation.

@article{palta2022investigating,
  title = {Investigating Information Inconsistency in Multilingual Open-Domain Question Answering},
  author = {Palta, Shramay and An, Haozhe and Yang, Yifan and Huang, Shuaiyi and Gor, Maharshi},
  journal = {arXiv preprint arXiv:2205.12456},
  year = {2022}
}