Consortium Plugin for gptme

Multi-model consensus decision-making for gptme.

Status: ✅ Phase 1 Complete - Core functionality implemented and tested

Overview

The consortium plugin orchestrates multiple LLMs to provide diverse perspectives and synthesize consensus responses. It queries multiple frontier models in parallel, then uses an arbiter model to analyze and synthesize a consensus answer with confidence scoring.

Key improvements (Phase 1):

✅ Real model integration via gptme.llm
✅ Robust JSON extraction (handles markdown code blocks, embedded JSON)
✅ Error handling with graceful fallbacks
✅ Comprehensive test coverage (14 unit tests + integration tests)
✅ Type-safe confidence scoring

Features

Multi-model orchestration: Query multiple models in parallel
Consensus synthesis: Arbiter model synthesizes best answer
Confidence scoring: Quantifies agreement between models
Flexible configuration: Choose models and arbiter
Detailed output: See individual responses and synthesis reasoning

Installation

The plugin is automatically discovered when placed in a configured plugin path. Add to your gptme.toml (user or project level):

[plugins]
paths = ["path/to/plugins"]
enabled = ["gptme_consortium"]  # Optional: limit which plugins load

Usage

Basic Query

query_consortium(
    question="What's the best approach for handling rate limiting?"
)

With Custom Models

query_consortium(
    question="Should we use microservices or monolith?",
    models=[
        "anthropic/claude-sonnet-4-5",
        "openai/gpt-4o",
        "openai/o1"
    ],
    arbiter="anthropic/claude-opus-4"
)

With Confidence Threshold

query_consortium(
    question="Critical architectural decision...",
    confidence_threshold=0.9  # Require 90% confidence
)

Output Format

The tool returns:

Consensus: Synthesized answer incorporating all perspectives
Confidence: Score from 0-1 indicating model agreement
Individual Responses: Each model's perspective
Synthesis Reasoning: Why the arbiter chose this consensus
Metadata: Models used, arbiter model

Use Cases

Architectural Decisions: Get multiple expert perspectives
Code Review: Multiple models review the same code
Quality Checking: Validate important outputs
Model Comparison: See how different models approach a problem
High-Stakes Decisions: Require consensus before proceeding

Implementation Status

✅ Phase 1 Complete (Core Functionality)

Real model integration via gptme.llm.reply()
Robust JSON parsing from arbiter responses
Error handling with fallback synthesis
Comprehensive test suite (14 tests, 100% pass)
Confidence type validation

🚧 Phase 2 Planned (Advanced Features)

Iterative refinement (multi-round consensus)
Response caching (avoid redundant queries)
Parallel querying (faster execution)
Voting mechanisms (for discrete choices)

🔮 Phase 3 Future (Production Polish)

Detailed metadata tracking (tokens, costs)
Custom arbiter strategies
Performance optimization
Cost tracking dashboard

Dependencies

gptme >= 0.27.0
Access to configured LLM providers (Anthropic, OpenAI, etc.)
Valid API keys in environment or config

Testing

# Run all tests
uv run --with pytest --with pytest-mock pytest tests/test_consortium.py -v

# Run fast tests only (skip integration)
uv run --with pytest --with pytest-mock pytest tests/ -v -m "not slow"

# Run with coverage
uv run --with pytest --with pytest-mock --with pytest-cov pytest tests/ --cov=src/gptme_consortium

Configuration

Default models (used if not specified):

anthropic/claude-sonnet-4-5 (Claude Sonnet 4.5, Sept 2025)
openai/gpt-5.1 (GPT-5.1, Nov 2025)
google/gemini-3-pro (Gemini 3 Pro, Nov 2025)
xai/grok-4 (Grok 4)

Default arbiter:

anthropic/claude-sonnet-4-5 (Claude Sonnet 4.5)

These represent diverse frontier models for comprehensive perspectives.

Future Enhancements

[ ] Iterative refinement with multiple rounds
[ ] Voting mechanisms for discrete choices
[ ] Integration with gptme's model configuration
[ ] Caching of model responses
[ ] Async parallel querying for speed
[ ] Support for structured output formats