I built a tool that leverages user-defined rubrics to drive an autonomous reasoning loop on hard-to-verify problems. The user defines a problem (e.g., a possible roadmap for Intel's transition to a foundry model), specifies a compute budget (e.g., $5), and the tool iterates on a solution, incorporating rubric-based feedback at each step. The loop uses three LLM roles:
- Researcher – proposes solutions in JSON.
- Critic – returns strengths, weaknesses, and improvement suggestions following the rubric.
- Grader – performs pairwise evaluations that update an Elo rating for each idea (sketched below).
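A minimal sketch of the Grader's pairwise Elo update. The `Idea` shape, field names, and K-factor here are illustrative assumptions, not the tool's actual API:

```ts
interface Idea {
  id: string;
  solution: unknown; // Researcher output, parsed from its JSON proposal (assumed shape)
  elo: number;
  visits: number;
}

// Standard Elo update after the Grader judges a head-to-head comparison.
// `scoreA` is 1 if `a` wins, 0 if `b` wins, 0.5 for a tie.
function updateElo(a: Idea, b: Idea, scoreA: number, k = 32): void {
  const expectedA = 1 / (1 + Math.pow(10, (b.elo - a.elo) / 400));
  a.elo += k * (scoreA - expectedA);
  b.elo += k * ((1 - scoreA) - (1 - expectedA));
}

// Example: idea A beats idea B in a rubric-guided pairwise grading.
const a: Idea = { id: "a", solution: {}, elo: 1000, visits: 3 };
const b: Idea = { id: "b", solution: {}, elo: 1000, visits: 5 };
updateElo(a, b, 1); // a.elo ≈ 1016, b.elo ≈ 984
```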
A multi-armed bandit / UCB selection policy balances exploration of new branches against exploitation of high-scoring ones. Embeddings score each idea's novelty and prevent near-duplicates.
selectionScore = normalizedElo + noveltyBonus + explorationTerm
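To make the formula concrete, here is a self-contained sketch of how the three terms could be computed. The weights, the UCB constant, and the `Candidate` shape are illustrative assumptions rather than the tool's actual values:

```ts
interface Candidate {
  elo: number;
  visits: number;
  embedding: number[]; // embedding of the idea text, used for novelty scoring
}

// Cosine similarity between two embedding vectors.
function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function selectionScore(idea: Candidate, pool: Candidate[], totalVisits: number): number {
  // Normalize Elo into [0, 1] relative to the current pool.
  const elos = pool.map(p => p.elo);
  const minElo = Math.min(...elos), maxElo = Math.max(...elos);
  const normalizedElo = maxElo > minElo ? (idea.elo - minElo) / (maxElo - minElo) : 0.5;

  // Novelty bonus: distance from the most similar existing idea (weight is a guess).
  const others = pool.filter(p => p !== idea);
  const maxSim = others.length
    ? Math.max(...others.map(p => cosineSim(idea.embedding, p.embedding)))
    : 0;
  const noveltyBonus = 0.2 * (1 - maxSim);

  // UCB-style exploration term favors under-visited branches (constant is a guess).
  const explorationTerm = 0.5 * Math.sqrt(Math.log(totalVisits + 1) / (idea.visits + 1));

  return normalizedElo + noveltyBonus + explorationTerm;
}
```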
There's an interesting multiplayer version of this where you can effectively crowd-fund compute dedicated to a specific problem.