Blog Post

The Data Story: Which Tool Actually Scales with You?

Jun 8, 2026

Key Takeaways

The good news is that investment management firms can use AI to quickly generate a first draft of a net-new RFP response or a recurring DDQ. All that is needed is a set of carefully constructed, well-defined prompts and a catalog of ‘prior art’ source documents. In sum:

All of the models we tested produced drafts with high semantic similarity (cosine scores > 0.83); AdviserGPT holds an edge (0.8992) in capturing a firm's specific linguistic nuances and established voice.
Horizontal models alone, however, are less reliable: their factual errors and hallucinations create a high 'fact check tax' that likely cancels out efficiency gains.
AdviserGPT's risk rate (the share of inaccurate answers) is substantially lower at 1.86%, compared with Claude's 9.70% (5.2X higher) and Copilot's 10.29% (5.5X higher) for the simplest RFP use cases.
AdviserGPT up-levels the use of AI for the first draft of a completed questionnaire from a "creative writer needing continual supervision" into a "reliable senior editor" by verifying the output against source documents.

For emerging asset managers, every RFP is a high-stakes opportunity to prove institutional-grade rigor. Our testing reveals that while all of the models “speak your firm’s language,” there is a significant delta in their ability to “actually tell the truth, and nothing but the truth.” To find the right partner for your growth, we looked at two critical metrics: how well the AI mimics your unique voice (Semantic Similarity) and how often it makes things up or takes creative liberties (Hallucination Rate).

Semantic Similarity: Can They Capture Your Voice?

Model	Cosine Score	Spread vs. Golden benchmark
AdviserGPT	0.8992	-0.1008
Claude	0.8323	-0.1677
Copilot	0.8315	-0.1685

At first glance, the results look similar, and any model or approach will do just fine. All three models produced text that looks and feels like a professional RFP. With cosine scores all above 0.80, any of these tools can deliver a draft that sounds like your firm. However, AdviserGPT (0.8992) holds an edge in capturing the specific linguistic nuances of the Golden benchmark. For a lean team where brand consistency is paramount, that gap of roughly 0.07 in cosine score over Claude and Copilot means fewer hours spent manually fixing the tone to match your firm’s established identity.

If tone of voice and semantic similarity are not critical for your firm from one RFP or DDQ to the next, and some variance is acceptable, then all three models perform directionally the same.
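For readers who want to see how scores like these are produced, here is a minimal sketch in Python. It assumes each draft and the Golden benchmark have already been turned into embedding vectors (the post does not say which embedding model was used, so the toy vectors below are purely illustrative). It also assumes the spread column is the cosine score minus a perfect score of 1.0, which is consistent with the figures in the table.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def spread_vs_benchmark(score, benchmark=1.0):
    """Assumption: the spread is the score minus a perfect match (1.0)."""
    return round(score - benchmark, 4)


# Toy vectors (hypothetical, not real embeddings): identical directions
# score 1.0, orthogonal directions score 0.0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Cosine scores as reported in the table above.
reported_scores = {"AdviserGPT": 0.8992, "Claude": 0.8323, "Copilot": 0.8315}
spreads = {model: spread_vs_benchmark(s) for model, s in reported_scores.items()}

for model, spread in spreads.items():
    print(f"{model}: score {reported_scores[model]:.4f}, spread {spread:+.4f}")
```

A score closer to 1.0 means the draft's embedding points in nearly the same direction as the benchmark's, which is why the metric works as a proxy for voice rather than for factual accuracy.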

The Trust Gap: Where the Story Diverges

This is where the narrative shifts from how it sounds to how much it costs a firm in risk. For an asset manager, a single factual error in a DDQ can lose a prospective mandate. While Claude and Copilot can mimic your prose, they struggle with your facts. Our study found that for roughly every two accurate statements Claude produced (74 strongly supported), it generated a corresponding hallucination or contradiction (40 in total). For emerging asset managers, this high ratio of fiction to truth creates an editorial tax that may negate the efficiency gains of using AI. For an emerging firm, minimizing the factual errors of out-of-the-box frontier models is possible, but it could require a long-term investment in building and maintaining custom technical infrastructure, as we do with AdviserGPT.

Hallucination Rates between Models
Variables	AdviserGPT	Claude	Copilot
Strongly Supported Statements	157	74	46
Statements with Weaker Support	54	96	103
Likely Hallucinations	4	32	27
Possible Contradictions	3	8	15
Risk Rate	1.86%	9.70%	10.29%
Risk Spread vs. AdviserGPT (percentage points)	--	7.84	8.43

AdviserGPT tells a different story. With a risk rate that is 5.2X lower than Claude's and 5.5X lower than Copilot's, it transforms the AI from a creative writer that needs constant supervision into a reliable senior analyst or editor. For emerging managers without compliance teams to fact-check every response, AdviserGPT's ability to verify almost all of its output against your source vault is the difference between a tool that helps you scale and one that may create extra work and possibly a liability. Of course, AdviserGPT leverages the leading LLMs, from Google Gemini to ChatGPT and Claude; it’s our content embeddings, RAG engineering, and reinforcement learning that make the difference.
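The comparison figures can be checked with a few lines of Python. This is a sketch, not the study's own methodology. The post does not state the denominator behind the risk rate, so the rates are taken as reported rather than recomputed from the statement counts. The function names are hypothetical, and treating "accurate" as "strongly supported" is an assumption.

```python
# Risk rates (in percent) and statement counts as reported in the table above.
reported_risk_rates = {"AdviserGPT": 1.86, "Claude": 9.70, "Copilot": 10.29}

statement_counts = {
    "AdviserGPT": {"strong": 157, "weak": 54, "hallucination": 4, "contradiction": 3},
    "Claude": {"strong": 74, "weak": 96, "hallucination": 32, "contradiction": 8},
    "Copilot": {"strong": 46, "weak": 103, "hallucination": 27, "contradiction": 15},
}


def compare_to_baseline(rates, baseline="AdviserGPT"):
    """Spread (percentage points) and multiple of each rate vs. the baseline."""
    base = rates[baseline]
    return {
        model: {
            "spread_pts": round(rate - base, 2),
            "multiple": round(rate / base, 1),
        }
        for model, rate in rates.items()
        if model != baseline
    }


def strong_per_flagged(counts):
    """Strongly supported statements per likely hallucination or contradiction."""
    flagged = counts["hallucination"] + counts["contradiction"]
    return round(counts["strong"] / flagged, 2)


comparison = compare_to_baseline(reported_risk_rates)
ratios = {model: strong_per_flagged(c) for model, c in statement_counts.items()}

for model, row in comparison.items():
    print(f"{model}: +{row['spread_pts']} pts, {row['multiple']}X AdviserGPT's rate")
for model, ratio in ratios.items():
    print(f"{model}: {ratio} strongly supported statements per flagged statement")
```

Run against the table, this reproduces the 7.84 and 8.43 point spreads and the 5.2X and 5.5X multiples. It also shows Claude at roughly two strongly supported statements per flagged one.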

AdviserGPT gives emerging asset managers an edge in crafting more semantically and factually accurate RFPs, but our findings show that horizontal models like Claude and Copilot can function as suitable alternatives with the right amount of source documents combined with ‘TLC’ in the form of review and corrections.

Next, our series will focus on the medium-size manager segment ($7.5B-$20B AUM) and on empirical study results based on 3-4X more source documents for a specific investment strategy. Our hypothesis is that with 5X the source documents, the Semantic Similarity scores should be mostly equivalent, but we're not sure what will happen with the Hallucination and Risk Rates, and of course cost (token windows) is also a variable.

