Semantically equivalent prompt reformulations caused accuracy swings of up to 76% on the same benchmark. (Sclar et al., 'Quantifying Language Models' Sensit…)