I built a tool that leverages user-defined rubrics to drive an autonomous reasoning loop on hard-to-verify problems. The user defines a problem (e.g., a possible roadmap for Intel's transition to a foundry model), specifies a compute budget (e.g., $5), and the tool iterates on a solution, incorporating rubric-based feedback at each step. The loop uses three LLM roles:
- Researcher – proposes solutions in JSON.
- Critic – returns strengths, weaknesses, and improvement suggestions following the rubric.
- Grader – performs pairwise evaluations that update an Elo rating for each idea (sketched below).
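A minimal sketch of the Grader's pairwise Elo update. The `Idea` shape, field names, and K-factor here are illustrative assumptions, not the tool's actual API:

```ts
interface Idea {
  id: string;
  solution: unknown; // Researcher output, parsed from its JSON proposal (assumed shape)
  elo: number;
  visits: number;
}

// Standard Elo update after the Grader judges a head-to-head comparison.
// `scoreA` is 1 if `a` wins, 0 if `b` wins, 0.5 for a tie.
function updateElo(a: Idea, b: Idea, scoreA: number, k = 32): void {
  const expectedA = 1 / (1 + Math.pow(10, (b.elo - a.elo) / 400));
  a.elo += k * (scoreA - expectedA);
  b.elo += k * ((1 - scoreA) - (1 - expectedA));
}

// Example: idea A beats idea B in a rubric-guided pairwise grading.
const a: Idea = { id: "a", solution: {}, elo: 1000, visits: 3 };
const b: Idea = { id: "b", solution: {}, elo: 1000, visits: 5 };
updateElo(a, b, 1); // a.elo ≈ 1016, b.elo ≈ 984
```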
A multi-armed bandit / UCB selection policy balances exploration of new branches against exploitation of high-scoring ones. Embeddings score each idea's novelty and prevent near-duplicates.
selectionScore = normalizedElo + noveltyBonus + explorationTerm
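To make the formula concrete, here is a self-contained sketch of how the three terms could be computed. The weights, the UCB constant, and the `Candidate` shape are illustrative assumptions rather than the tool's actual values:

```ts
interface Candidate {
  elo: number;
  visits: number;
  embedding: number[]; // embedding of the idea text, used for novelty scoring
}

// Cosine similarity between two embedding vectors.
function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function selectionScore(idea: Candidate, pool: Candidate[], totalVisits: number): number {
  // Normalize Elo into [0, 1] relative to the current pool.
  const elos = pool.map(p => p.elo);
  const minElo = Math.min(...elos), maxElo = Math.max(...elos);
  const normalizedElo = maxElo > minElo ? (idea.elo - minElo) / (maxElo - minElo) : 0.5;

  // Novelty bonus: distance from the most similar existing idea (weight is a guess).
  const others = pool.filter(p => p !== idea);
  const maxSim = others.length
    ? Math.max(...others.map(p => cosineSim(idea.embedding, p.embedding)))
    : 0;
  const noveltyBonus = 0.2 * (1 - maxSim);

  // UCB-style exploration term favors under-visited branches (constant is a guess).
  const explorationTerm = 0.5 * Math.sqrt(Math.log(totalVisits + 1) / (idea.visits + 1));

  return normalizedElo + noveltyBonus + explorationTerm;
}
```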
There's an interesting multiplayer version of this where you can effectively crowd-fund compute dedicated to a specific problem.