Blog Post
We Compared AdviserGPT and the Leading LLMs for RFP Completion. Here's What We Learned.

General-purpose LLMs have entered the RFP/DDQ workflow because investment managers need a faster, more accurate, more scalable path to a first draft. We ran a series of controlled experiments on completing net new RFPs, comparing Anthropic Claude (Opus 4.7), Microsoft Copilot, and AdviserGPT, a generative AI tool for RFPs and DDQs built specifically for investment managers. We used identical source documents, identical prompts, and the latest model versions to compare each approach on a task that asset managers perform as often as weekly: completing RFPs and DDQs.
This first post in the series covers the experiment setup and introduces two quantitative measures that every institutional asset manager should consider: Semantic Similarity to benchmark documents and Hallucination Rates among generated responses. This first phase of testing focuses on the needs of Emerging Asset Managers with under $5 billion in AUM, where every RFP submission is a critical, resource-intensive opportunity. Subsequent posts in this series will extend the analysis to Medium-sized managers ($5–$20B AUM) and Enterprise-class firms ($20B+ AUM), introducing more quantitative results and qualitative differences among the three models.
Our goal is not to declare a winner, but to give investment managers an objective comparison of which use cases suit frontier models and which suit AdviserGPT. Here’s what we’ve found so far.
Experiment Setup
This experiment requires three inputs. The first is a “Golden Copy” benchmark RFP response for each of the Emerging, Medium-sized, and Enterprise-class asset manager segments. These RFP and DDQ responses are human-created, production-quality responses drawn from anonymized investment firm content. The questions were carefully selected to cover the typical sections of an institutional RFP, from investment strategy and firm overview to portfolio management and trading, compliance, and information security. The human-written “correct” answers are the sole source of truth for all scoring.
Second, fewer than ten anonymized, previously completed RFPs were provided to each AI as its source document vault. These documents give each model the factual foundation needed to answer the benchmark questions, the same information a human drafter would require. For this first Emerging Manager comparison, we created multiple Vaults from anonymized ‘prior art’ responses. The number of previously completed source RFPs will be at least 3-4X higher for a Medium-sized manager than for an Emerging manager, with a similar increase from Medium-sized to Enterprise-class.
Third, a blank version of the new 20-question benchmark RFP document was given to each model to complete. It is identical to the Golden Copy but with the answers removed, which ensures that no model is exposed to the approved answers before generating its response. The only difference across the three tests was the model processing the inputs.
Variables Measured
Semantic Similarity: Semantic similarity measures how closely each model’s response tracks the vocabulary and phrasing of the Golden benchmark, weighted by word frequency. Scores are expressed as a decimal between 0 and 1, where 1.00 indicates identical language and 0 indicates no vocabulary overlap.
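For readers who want to see the idea concretely, here is a minimal sketch of a frequency-weighted cosine similarity between a model response and a Golden answer. It is an illustration only, not the scoring pipeline used in this experiment: the tokenizer, the use of raw word counts as weights, and the function names term_frequencies and cosine_similarity are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word occurrences (the tokenizer is an assumption)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(response, golden):
    """Cosine similarity of two texts using raw word counts as weights.

    Returns a value between 0 (no shared vocabulary) and 1 (same word
    distribution). Hypothetical helper for illustration only.
    """
    tf_a = term_frequencies(response)
    tf_b = term_frequencies(golden)
    # Dot product over the words the two texts share.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text shares no vocabulary with anything.
    return dot / (norm_a * norm_b)
```

A response that reuses the Golden answer's wording scores close to 1, while a response with no words in common scores 0. Production systems often use TF-IDF weights or embedding vectors in place of raw counts.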
Hallucination Score: We inventoried and measured every statement in each response that had no counterpart in the source documents or the Golden benchmark. A hallucination is defined as a fabrication with no meaningful basis in the source documents provided to the model. Each statement was classified using the following cosine score thresholds:
≥ 0.90: Statement is strongly supported
≥ 0.75: Statement might require additional information
≥ 0.50: Statement lacks strong support
< 0.50: Possible hallucination
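The threshold bands above can be read as a simple top-down check, sketched below. The function name classify_support is hypothetical; the cut-offs and labels are taken directly from the list.

```python
def classify_support(cosine_score):
    """Map a statement's cosine score to the support bands listed above.

    The bands are checked from highest to lowest, so a score lands in the
    first band whose threshold it meets. Hypothetical helper name.
    """
    if cosine_score >= 0.90:
        return "strongly supported"
    if cosine_score >= 0.75:
        return "might require additional information"
    if cosine_score >= 0.50:
        return "lacks strong support"
    return "possible hallucination"
```

For example, a statement scoring 0.80 falls in the second band, and a statement scoring 0.42 is flagged as a possible hallucination.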
We then calculated the Risk Rate for each AI by converting the number of statements identified as either hallucinations or contradictions across an entire response into a percentage. In this experiment, a contradiction occurred when a response maintained semantic similarity to the Golden benchmark, but contained conflicting factual details, such as figures or dates.
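As a rough sketch of that calculation, the Risk Rate can be expressed as flagged statements over total statements. The denominator (all statements in the response) is our reading of the description above and is an assumption, as is the function name risk_rate; the counts in the usage note are made up for illustration and are not experiment data.

```python
def risk_rate(hallucinations, contradictions, total_statements):
    """Percentage of statements flagged as hallucinations or contradictions.

    Assumes the denominator is the total number of statements in the
    response. Hypothetical helper for illustration only.
    """
    if total_statements <= 0:
        raise ValueError("total_statements must be positive")
    flagged = hallucinations + contradictions
    return 100.0 * flagged / total_statements
```

With illustrative counts of 3 hallucinations and 2 contradictions in a 100-statement response, risk_rate(3, 2, 100) gives a Risk Rate of 5.0%.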
Key Takeaways
The good news is that investment management firms can definitely use AI to quickly generate a first draft for a net new RFP response or recurring DDQ. All that is needed is carefully constructed, well-defined prompts and a catalog of ‘prior art’ source documents. In sum:
All models that we tested produced drafts with high semantic similarity (cosine scores > 0.83). AdviserGPT holds an edge at 0.8992, versus 0.8323 for Claude and 0.8315 for Copilot, in capturing a firm's specific linguistic nuances and established voice.
Frontier models alone, however, are less reliable: their factual errors and hallucinations create a high 'fact check tax' that likely cancels out efficiency gains.
AdviserGPT's risk rate (the share of inaccurate answers) is substantially lower at 1.86%, compared with Claude's 9.70% (5.2X higher) and Copilot's 10.29% (5.5X higher) for the simplest Emerging manager RFP use cases.
Our next posts dive further into the experiment results for the Emerging, Medium-sized, and Enterprise asset manager segments. They also take up a central question: what are the potential use cases for frontier models alone versus an industry-specific, tightly tailored agent for RFP and DDQ acceleration and completion?
