Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
University of Toronto¹ KAUST² MIT³ Stanford University⁴
🌟 Equal contribution 📮 Corresponding author
Outline

Introduction

Evaluation Framework

DeliberationBank

DeliberationJudge

Main Results

Influencing Factors

Minority Underrepresentation
Public deliberations generate vast amounts of free-form input, yet producing summaries that are both informative and fair remains costly and difficult. Recent approaches attempt to harness LLMs to assist this process, offering efficiency gains but also raising concerns: they risk over-representing majority views and introducing bias, challenging the reliability of current practices. These issues motivate our central question:

Can LLMs truly support deliberation by producing summaries that are representative, informative, and neutral for policy useโ€”and how can we evaluate them reliably at scale?


To tackle this challenge, we propose DeliberationBank, a large-scale dataset of diverse human opinions, together with DeliberationJudge, an automatic evaluation framework that integrates LLM-based and model-based judging for opinion summarization in deliberations.

Automatic Deliberation Summarization Evaluation Framework

The overall goal is to evaluate LLMs' capability to generate representative and effective summaries for public deliberations. We define four deliberation-relevant metrics, grounded in prior work on summarization and societal deliberation: Representativeness, Informativeness, Neutrality, and Policy Approval, and adopt an automatic evaluation framework. For each deliberation question \(q\), a subset of opinions \(\tilde{\mathcal{O}}\) from the public opinion subset of DeliberationBank is given to a summarization model \(\mathcal{M}\), which produces a summary:
\[ S_{\mathcal{M},\,\tilde{\mathcal{O}}} = \mathcal{M}(q,\,\tilde{\mathcal{O}}). \]
Each summary is then paired with an individual opinion \(o_i\) and scored by DeliberationJudge, a fine-tuned DeBERTa judge, yielding a 4-dimensional vector:
\[ \mathcal{J}_\theta(q,\, o_i,\, S_{\mathcal{M},\,\tilde{\mathcal{O}}}) = (\hat{y}^{(\mathrm{rep})},\, \hat{y}^{(\mathrm{inf})},\, \hat{y}^{(\mathrm{neu})},\, \hat{y}^{(\mathrm{pol})}) \in [0,1]^4. \]
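To make the pipeline concrete, here is a minimal sketch of the scoring loop, assuming an illustrative `summarize` helper (any LLM backend) and a `judge` wrapper around DeliberationJudge; the function names and return format are assumptions for exposition, not a released API.

```python
# Minimal sketch of the evaluation loop described above. The helpers
# `summarize` (any LLM backend) and `judge` (a DeliberationJudge wrapper)
# are illustrative placeholders, not part of a released API.
from statistics import mean

def evaluate_model(question, opinions, sample_size, summarize, judge):
    """Score one summarization model M on one deliberation question q."""
    # Sample a subset O~ of opinions and produce the summary S_{M,O~}.
    subset = opinions[:sample_size]
    summary = summarize(question, subset)          # S = M(q, O~)

    # Pair the summary with every individual opinion o_i and collect
    # the four-dimensional judge scores in [0, 1]^4.
    scores = [judge(question, o_i, summary) for o_i in opinions]

    # Average each dimension (representativeness, informativeness,
    # neutrality, policy approval) over all opinions.
    dims = ("rep", "inf", "neu", "pol")
    return {d: mean(s[d] for s in scores) for d in dims}
```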

DeliberationBank

DeliberationBank is a large-scale deliberation benchmark dataset created with 7,500 participants from a US-representative sample, comprising two subsets: (i) a public opinion dataset of 3,000 free-form opinions collected for 10 societal deliberation questions on trending topics, and (ii) a summary judgement dataset of 4,500 annotations that evaluate deliberation summaries from individual perspectives. The figure below shows the construction process of DeliberationBank.
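For illustration, the two subsets could be represented with records along the following lines; the field names are assumptions for exposition, not the released schema.

```python
# Illustrative record layouts for the two DeliberationBank subsets.
# Field names are assumptions, not the released schema.
from dataclasses import dataclass

@dataclass
class OpinionRecord:            # public opinion subset (3,000 records)
    question_id: str            # one of the 10 deliberation questions
    question: str               # the deliberation question text
    opinion: str                # free-form participant opinion

@dataclass
class JudgementRecord:          # summary judgement subset (4,500 annotations)
    question_id: str
    opinion: str                # the annotator's own opinion o_i^(j)
    summary: str                # candidate summary S_{M, O~_i}
    rep: float                  # representativeness rating
    inf: float                  # informativeness rating
    neu: float                  # neutrality rating
    pol: float                  # policy-approval rating
```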

DeliberationJudge

Recent work has adopted pretrained LLMs as automated judges, but studies reveal systematic biases and instability rooted in their black-box nature and alignment limits. To improve reliability while retaining efficiency, we introduce DeliberationJudge, a DeBERTa-based model fine-tuned on human judgments for deliberation summarization and used for automatic evaluation. We utilize the summary judgement dataset to fine-tune the language model.

DeliberationJudge is trained with normalized labels from both rating and comparison tasks on a unified [0,1] scale. Formally, given a deliberation question \(q_i\), an annotator opinion \(o_i^{(j)}\), and a candidate summary \(S_{\mathcal{M},\tilde{\mathcal{O}}_i}\), the judge encodes:

\[ [\texttt{[CLS]};\, q_i;\, \texttt{[SEP]};\, o_i^{(j)};\, \texttt{[SEP]};\, S_{\mathcal{M},\tilde{\mathcal{O}}_i};\, \texttt{[SEP]}] \]

and outputs a four-dimensional score vector:

\[ \hat{\mathbf{y}} = \mathcal{J}_{\theta}\!\left(q_i,\, o_i^{(j)},\, S_{\mathcal{M},\tilde{\mathcal{O}}_i}\right) = \bigl(\hat{y}^{(\mathrm{rep})},\, \hat{y}^{(\mathrm{inf})},\, \hat{y}^{(\mathrm{neu})},\, \hat{y}^{(\mathrm{pol})}\bigr) \in [0,1]^4 \]

Here the [CLS] representation from the final encoder layer is passed through a hidden layer and a linear projection to produce the four regression outputs. Human annotations \(\mathbf{y}_{\mathrm{raw}} \in [-1,7]^4\) are linearly normalized to \(\mathbf{y} \in [0,1]^4\) for training stability. The model is trained with the Huber loss averaged across dimensions. At inference time, predictions remain in the \([0,1]\) range and are used directly as summary scores.
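The sketch below shows how such a DeBERTa regression judge could be assembled with Hugging Face Transformers; the backbone checkpoint, hidden size, sigmoid output head, and the exact way the opinion and summary are concatenated are illustrative assumptions, not the paper's released configuration.

```python
# Sketch of the DeliberationJudge forward pass and loss, assuming a
# Hugging Face DeBERTa encoder; hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DeliberationJudge(nn.Module):
    def __init__(self, backbone="microsoft/deberta-v3-large", hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        dim = self.encoder.config.hidden_size
        # Hidden layer + linear projection on top of the [CLS] token;
        # the sigmoid keeping outputs in [0, 1] is an assumption.
        self.head = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, 4), nn.Sigmoid(),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] representation
        return self.head(cls)                      # (rep, inf, neu, pol)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")

def encode(question, opinion, summary):
    # Approximates [CLS]; q_i; [SEP]; o_i^(j); [SEP]; S; [SEP] via the
    # tokenizer's pair encoding plus an explicit [SEP] between segments.
    return tokenizer(question, opinion + " [SEP] " + summary,
                     truncation=True, return_tensors="pt")

def loss_fn(pred, y_raw):
    # Linearly normalize raw labels from [-1, 7] to [0, 1], then apply
    # the Huber loss averaged across the four dimensions.
    y = (y_raw + 1.0) / 8.0
    return nn.functional.huber_loss(pred, y)
```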

As shown above, DeliberationJudge offers the best trade-off between accuracy and efficiency. In large-scale deliberation, LLM inference becomes prohibitively costly as inputs grow, whereas DeliberationJudge maintains strong human alignment with evaluation time largely independent of input size.

Main Results

The table below reports each LLM's global average score (GAS), expressed as GAS ± 95% CI, computed across the four evaluation dimensions (Representativeness, Informativeness, Neutrality, and Policy Approval) over all data samples. Models are arranged from left to right and top to bottom in descending order of mean performance.

Influencing Factors

We conduct a comprehensive analysis of the factors that influence LLMs' performance on deliberation summarization. These factors fall into four categories: (i) summarization input size, (ii) question topic, (iii) question type, and (iv) model size.

Minority Underrepresentation

To examine LLMs' ability to represent minority opinions, we collected 1,000 U.S. responses to two deliberation questions (Tariff Policy and AI Change Life), in which participants self-identified whether their stance reflected a minority view, providing a ground-truth partition into minority and non-minority subsets. Focusing on the representativeness dimension, we find that all models consistently score lower on minority opinions, revealing a systematic bias.
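As a rough illustration, this gap can be measured by averaging the judge's representativeness score separately over the two self-reported groups; the record fields and `judge` helper below are hypothetical placeholders.

```python
# Hedged sketch of the minority-representation analysis: compare mean
# representativeness scores on self-identified minority vs. non-minority
# opinions. `judge` and the record fields are illustrative placeholders.
from statistics import mean

def minority_gap(records, summary, judge):
    """records: iterable of dicts with 'question', 'opinion', 'is_minority'."""
    minority = [r for r in records if r["is_minority"]]
    majority = [r for r in records if not r["is_minority"]]

    def avg_rep(group):
        return mean(judge(r["question"], r["opinion"], summary)["rep"]
                    for r in group)

    # A positive gap means minority opinions are represented less well.
    return avg_rep(majority) - avg_rep(minority)
```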

