Blog Post
Part II: We Compared AdviserGPT and the Leading LLMs for RFP Completion. This One Is for Medium-Sized Asset Managers.
Key Takeaways
All three models produced semantically similar drafts (cosine score >0.80), meaning any of them can effectively capture your firm’s tone and voice at the mid-sized manager scale.
Factual grounding is where the models diverge. AdviserGPT’s Expanded Risk Rate (5.36%) is substantially lower, and its Grounding Rate (75.60%) substantially higher, than those of Claude (ERR 17.44%, GR 40.12%) and Copilot (ERR 17.46%, GR 32.28%), suggesting it can handle more complex data without losing accuracy.
AdviserGPT completed its RFP response in approximately 2 minutes versus 20 minutes for Claude and 40 minutes for Copilot, with Copilot requiring a second run after hitting its context window. Speed reflects architectural differences more than raw quality, but combined with the grounding results, AdviserGPT handled high document volumes more efficiently end-to-end.
The right tool depends on your firm's resources and technical acumen. Teams with access to technical infrastructure may find a well-configured, off-the-shelf model sufficient. Teams without the technical resources and significant budget, or operating at high RFP volume, will likely find a purpose-built tool reduces overhead and increases accuracy from the start.
Previously, we conducted a series of scientifically based experiments on completing net-new RFPs, comparing Anthropic Claude (Opus 4.7), Microsoft Copilot, and AdviserGPT, a generative RFP/DDQ AI tool specifically tailored for investment managers. Using meticulously crafted source documents and prompts, we aimed to provide an objective comparison for emerging asset managers (<$5B AUM) of the best use cases for frontier models and AdviserGPT. We measured two critical quantitative data points: Semantic Similarity to benchmark documents and Hallucination Rates among generated responses.
While all three models produced drafts with high semantic similarity, indicating they can all effectively mimic a firm's established voice, a significant gap emerged in their ability to verify facts. Off-the-shelf models were far less reliable, generating a high 'fact check tax' due to factual errors and hallucinations.
In this post, we expand our scope to medium-sized firms ($7.5-$25B AUM) with more than 5x the volume of source documents across a wide array of investment strategies. Because larger managers present a greater challenge, we introduce two deeper measures: Expanded Risk Rate, which adds unverified information to the hallucinations and contradictions counted in the Hallucination Rate, and Grounding Rate, which measures the proportion of statements that are strongly supported by the benchmark document (cosine score >0.75). Statements below this threshold are flagged as needing additional information, giving us a better view of each approach’s reliability.
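To make the Grounding Rate definition concrete, here is a minimal sketch of how a cosine threshold of 0.75 could be applied to a set of statements. It is an illustration, not the pipeline used in this experiment: the function names, the toy two-dimensional vectors, and the choice to score each statement against its best-matching benchmark passage are all assumptions, and a real setup would use embeddings from a text-embedding model.

```python
import math

GROUNDING_THRESHOLD = 0.75  # cosine threshold stated in the post


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def grounding_rate(statement_vecs, benchmark_vecs, threshold=GROUNDING_THRESHOLD):
    """Return (share of grounded statements, indices needing more information).

    Assumption: a statement counts as grounded when its best match among
    the benchmark passages scores above the threshold.
    """
    if not statement_vecs:
        return 0.0, []
    grounded = 0
    needs_info = []
    for i, vec in enumerate(statement_vecs):
        best = max((cosine(vec, b) for b in benchmark_vecs), default=0.0)
        if best > threshold:
            grounded += 1
        else:
            needs_info.append(i)
    return grounded / len(statement_vecs), needs_info


# Toy vectors standing in for real embeddings (hypothetical data).
benchmark = [[1.0, 0.0], [0.0, 1.0]]
statements = [[0.9, 0.1], [0.7, 0.7], [0.1, 1.0], [-1.0, 0.0]]

rate, flagged = grounding_rate(statements, benchmark)
print(f"Grounding rate: {rate:.2%}; statements needing more information: {flagged}")
```

Statements that land in the flagged list are the ones a reviewer would need to trace back to a source document before submission.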
For this experiment, we provided Claude and Copilot with custom index files to assist with document retrieval across 15+ source files, a step beyond basic prompting, but short of a fully engineered RAG pipeline. We made this choice because building a production-grade retrieval system for Claude or Copilot would require meaningful technical investment that most medium-sized managers would need to contract out or build internally. The results for Claude and Copilot therefore reflect a partially configured setup, more capable than general use, but not the ceiling of what these models could achieve with full infrastructure. Readers evaluating horizontal LLMs for their own workflows should factor in whether that build-out is feasible for their team, as it may narrow the performance gap shown here.
Most important for medium-sized managers, quantitative measures such as similarity and hallucination rates are not the only factors in choosing the best AI tool for your firm. In addition to our quantitative findings, we tracked several user experience factors, specifically onboarding friction, context and token window ceilings, and each model’s ability to handle large volumes of document uploads. One tool might perform better statistically than another, but if it is difficult to onboard users, requires constant re-prompting, forces delays while context windows reopen, or needs to be re-run due to large RFP inputs, its statistical benefits are offset by workflow challenges.
How AI Models Perform for Medium-Sized Managers
Analyzing Grounding and Precision
Benchmark Comparison Empirical Results

| Metrics | AdviserGPT | Claude | Copilot |
|---|---|---|---|
| Statement Classification | | | |
| Cosine Score | 0.9003 | 0.8249 | 0.8043 |
| Total Statements Analyzed | 168 | 172 | 189 |
| Supported Statements | 116 | 39 | 31 |
| Statements with Weak Support | 32 | 73 | 95 |
| Possible Hallucinations | 0 | 9 | 7 |
| Possible Contradictions | 2 | 5 | 2 |
| Reliability Rates | | | |
| Risk Rate | 1.19% | 8.14% | 4.76% |
| Expanded Risk Rate | 5.36% | 17.44% | 17.46% |
| Grounding Rate | 75.60% | 40.12% | 32.28% |
Our hallucination analysis reinforces the grounding findings above. When ‘unverified information’ is factored into the risk calculation alongside hallucinations and contradictions, the Expanded Risk Rate for Claude and Copilot rises to roughly 17.4%, compared to 5.36% for AdviserGPT. That gap is meaningful, though worth contextualizing: AdviserGPT's standard Risk Rate of 1.19% reflects only possible hallucinations and contradictions, while its expanded rate climbs to 5.36% once unverified information is included. All three models produce output that benefits from review; the difference is the degree of verification required.
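The Risk Rate arithmetic can be checked directly from the table: possible hallucinations plus possible contradictions, divided by total statements analyzed. The sketch below reproduces those figures. The table does not report a count of unverified statements, so the second function only backs out the count implied by the published Expanded Risk Rate; treat that number as an inference from rounded percentages, not a reported result. The function and variable names are illustrative.

```python
# Counts copied from the benchmark table.
COUNTS = {
    "AdviserGPT": {"total": 168, "hallucinations": 0, "contradictions": 2},
    "Claude": {"total": 172, "hallucinations": 9, "contradictions": 5},
    "Copilot": {"total": 189, "hallucinations": 7, "contradictions": 2},
}

# Expanded Risk Rates as published in the table, in percent.
PUBLISHED_EXPANDED = {"AdviserGPT": 5.36, "Claude": 17.44, "Copilot": 17.46}


def risk_rate(counts):
    """Risk Rate in percent: (hallucinations + contradictions) / total."""
    flagged = counts["hallucinations"] + counts["contradictions"]
    return 100 * flagged / counts["total"]


def implied_unverified(counts, expanded_pct):
    """Unverified-statement count implied by a published Expanded Risk Rate.

    Assumption: Expanded Risk Rate =
    (hallucinations + contradictions + unverified) / total.
    """
    expanded_flagged = round(expanded_pct / 100 * counts["total"])
    return expanded_flagged - counts["hallucinations"] - counts["contradictions"]


for name, counts in COUNTS.items():
    unverified = implied_unverified(counts, PUBLISHED_EXPANDED[name])
    print(
        f"{name}: Risk Rate {risk_rate(counts):.2f}%, "
        f"implied unverified statements {unverified}"
    )
```

Reading the rates this way makes the review workload tangible: the expanded rate is the share of statements in a draft that a reviewer should expect to verify or correct by hand.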
For medium-sized firms, that degree matters at scale. A higher rate of unverified statements translates to more time spent cross-checking responses before submission, a real operational cost that compounds across a high volume of RFPs and DDQs. Off-the-shelf frontier models configured with proper infrastructure can reduce that overhead, but as noted earlier in the description of our setup, getting there requires an investment of time and resources. AdviserGPT's pre-built retrieval and validation layer reduces the verification burden compared to a partially configured horizontal model, though it does not eliminate the need for human review entirely.
Assessing Risk at Scale
Our analysis of hallucination rates echoes our earlier findings. AdviserGPT produced highly reliable responses, and even though Claude and Copilot generated suitable answers with error rates below 10%, a deeper analysis reveals a more nuanced reality. When we factor in 'unverified information' alongside standard hallucination and contradiction metrics, the gap between specialized and general-use models widens significantly. This highlights that horizontal models still face challenges in maintaining rigorous factual accuracy.
For firms, these discrepancies have practical implications. At the scale of medium-sized RFP/DDQ workflows, even minor inaccuracies compound, creating a 'hidden tax' of manual auditing and verification. While horizontal models can be improved through extensive skills development and specialized technical infrastructure, such as a vector database of embedded content, the operational overhead remains.
The Operational Reality
Statistics tell only part of the story when selecting an AI tool to accelerate RFP and DDQ workflows. While many firms gravitate toward simply using off-the-shelf horizontal LLMs, we wanted to see whether they hold up under our experimental conditions. Beyond our quantitative benchmarks, we tracked several additional user experience factors that are critical to daily operations.
Despite receiving identical inputs, model performance varied dramatically. AdviserGPT provided the RFP response across 20 standard but varied questions in 2 minutes. Claude took 20 minutes for the same RFP, while Copilot responded in 40 minutes, as it required a second run after hitting its token limit.
Copilot was also the only model to reach its maximum context window, whereas AdviserGPT and Claude were able to provide answers in a single run. The context window issue could pose a number of problems. Not only would your firm face increased computational costs, but exceeding the context window forces your model to potentially forget vital parts of a questionnaire and dilute the accuracy of responses.
The results highlight a genuine performance gap at medium-manager scale, but the right tool still depends on your firm's specific technical resources and budget constraints. If your firm lacks the technical infrastructure and resources, or operates at high RFP/DDQ volume where manual review time compounds quickly, a purpose-built tool like AdviserGPT reduces that overhead from the start. Off-the-shelf frontier models require more setup and more review to reach comparable output quality at this scale.
Our series will conclude later this month with a similar experiment for the $25B+ AUM Enterprise manager segment. This will be the most difficult task for each approach, with several times more source documents, investment strategies, and qualitative questions to start. Once we complete this test, we’ll graduate to a comparison on even more arduous, data-intensive Due Diligence Questionnaires.
