
Verina System Card: Reimagining AI-Powered Search

October 20, 2025
19 min read
AI Search · LLM · AI Agent System · Context Engineering

Version: 1.0 · Date: 2025-10-20 · Author: Li Yang

Abstract

Verina is an experimental AI-powered search engine that fundamentally reimagines how users interact with information. Unlike traditional chatbot interfaces that force users to master prompting techniques, Verina offers three specialized modes tailored to different research depths: Fast/Deep Mode for quick answers, Chat Mode for interactive exploration, and Agent Mode for sustained deep research. The system leverages cutting-edge LLM capabilities, intelligent tool orchestration, and an innovative external file system architecture to deliver results that balance speed, depth, and accuracy.

Our core philosophy: Give AI agents the right tools, and they'll reason their way to better answers than relying on pre-trained knowledge alone.


Table of Contents

  1. System Overview
  2. Architecture & Tech Stack
  3. Mode Deep-Dive
    • Fast/Deep Mode
    • Chat Mode
    • Agent Mode
  4. Technical Innovations
  5. Model Selection & Rationale
  6. Challenges & Solutions
  7. Future Work

System Overview

Verina addresses three fundamental problems in current AI search products:

  1. Prompting burden: Users shouldn't need technical skills to get good results
  2. Shallow reasoning: Most AI search engines rely too heavily on cached knowledge
  3. Context limitations: Long research sessions hit token limits quickly

Our solution: Three specialized modes, each optimized for a specific use case.

| Mode | Speed | Depth | Use Case | LLM Engine |
|------|-------|-------|----------|------------|
| Fast Mode | 5s | Basic | Quick facts, breaking news | Gemini 2.5 Flash |
| Deep Mode | 15-30s | Extended | Multi-perspective analysis | Gemini 2.5 Flash |
| Chat Mode | Variable | Interactive | Follow-up questions, file reading | Claude Sonnet 4.5 |
| Agent Mode | 10-30 min | Research-grade | Deep investigation, reports | GPT-5 Codex |

Architecture & Tech Stack

Core Components

Four-Layer Architecture:

1. Frontend Layer (Next.js 15 + React 19)

  • Multi-mode UI: Fast/Deep, Chat, and Agent
  • Real-time streaming via SSE
  • Citation rendering & artifact viewer

↓ HTTP/SSE

2. Backend Layer (FastAPI + Python)

  • SearchAgent (v1 engine for Fast/Deep Mode)
  • ChatModeAgent (interactive Q&A)
  • AgentMode Agent (deep research)

↓ Unified Tool Layer (13+ tools)

3. Tool & Service Layer

  • Core Tools: web_search, execute_python, file_* operations
  • Intelligence Tools: compact_context, research_assistant
  • MCP Tools: Dynamically loaded (browser automation, etc.)

↓ External APIs

4. External Services & Storage

  • OpenRouter API (GPT-5, Claude, Gemini)
  • Exa API (neural search engine)
  • E2B Sandbox (optional code execution)
  • Local file system (workspace & cache)

Tech Stack

Frontend

  • Framework: Next.js 15 (App Router), React 19
  • Language: TypeScript
  • Streaming: Server-Sent Events (SSE)
  • UI: Custom components (no shadcn/ui), responsive design

Backend

  • Framework: FastAPI (Python 3.11+)
  • LLM Access: OpenRouter (unified API for multiple providers)
  • Search: Exa API (neural search, superior to traditional engines for LLMs)
  • Code Execution: E2B Sandbox (optional, secure Python environment)
  • Storage: Local file system (workspace-based architecture)

Infrastructure

  • Containerization: Docker (frontend, backend, Redis optional)
  • Deployment: Single-command CLI (verina package)
  • Development: Hot reload, monorepo structure

Mode Deep-Dive

1. Fast/Deep Mode: The Search Engine

Model: Gemini 2.5 Flash (google/gemini-2.5-flash-preview-09-2025)

Why Gemini 2.5 Flash?

  • Blazing fast (5s end-to-end for Fast Mode)
  • Excellent tool-calling capabilities
  • Cost-effective
  • Native multimodal support (future expansion)

Architecture: Optimized Pipeline

Fast Mode (2 LLM calls):

  1. User Query → LLM Call 1: Tool Selection
  2. fast_search(query) → Exa API
  3. → Sources returned
  4. → LLM Call 2: Answer Streaming
  5. Final Answer (5-8 seconds)
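
In code, the pipeline boils down to two model calls around one search call. The sketch below is illustrative only; the llm and exa_search interfaces are assumptions, not Verina's actual internals:

def fast_mode(query: str, llm, exa_search) -> str:
    # LLM call 1: tool selection / query refinement
    refined_query = llm.complete(
        f"Rewrite this user query as a concise web search query:\n{query}"
    )

    # Tool call: single-round Exa search returning titles + highlight snippets
    sources = exa_search(refined_query)

    # LLM call 2: stream the final answer grounded in the numbered sources
    numbered = "\n".join(f"[{i}] {s['title']}: {s['snippet']}" for i, s in enumerate(sources, 1))
    return llm.complete(
        f"Answer the question using the numbered sources. Cite with [1][2].\n"
        f"Question: {query}\nSources:\n{numbered}"
    )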

Deep Mode (3 LLM calls + Test-Time Scaling):

  1. User Query → LLM Call 1: Query Analysis + Tool
  2. → Reasoning: Multi-perspective decomposition
  3. deep_search(refined_query) → Exa API
  4. → First Batch Sources
  5. LLM Call 2: Deep Exploration (Forced)
  6. → Insight: What's missing? Alternative angles?
  7. deep_search(supplemental_query) → Exa API
  8. → Second Batch Sources (deduplicated)
  9. LLM Call 3: Answer Streaming
  10. Comprehensive Answer (15-30 seconds)

Key Innovation: Deep Mode uses mandatory two-round search as a test-time scaling trick. The first round gathers initial information, then the LLM is forced to perform a second search to fill gaps or explore alternative perspectives. This dramatically improves answer quality without training.

Tools:

  • fast_search: Single-round Exa search with auto query refinement
  • deep_search: Multi-round search with reflection and supplemental queries

Technical Details:

  • Deduplication: Second batch merges with first, preserving citation indices
  • Highlights extraction: Only relevant snippets sent to LLM (saves tokens)
  • Citation format: [1][2][3] inline references
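
The deduplication step can be sketched as a simple merge that keeps round-one citation indices stable. Keying on URL is an assumption here; the real implementation may deduplicate differently:

def merge_batches(first_batch: list[dict], second_batch: list[dict]) -> list[dict]:
    """Merge the supplemental batch into the first, preserving citation indices."""
    seen_urls = {src["url"] for src in first_batch}
    merged = list(first_batch)
    next_idx = len(first_batch) + 1
    for src in second_batch:
        if src["url"] in seen_urls:
            continue  # already cited in round one, keep its original [idx]
        merged.append({**src, "idx": next_idx})
        seen_urls.add(src["url"])
        next_idx += 1
    return merged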

2. Chat Mode: Interactive Exploration

Model: Claude Sonnet 4.5 (anthropic/claude-sonnet-4.5)

Why Claude Sonnet 4.5?

  • Best-in-class agentic reasoning (our observation: coding ability correlates with general task performance)
  • Excellent at using tools precisely
  • Strong at citation management
  • Fast enough for interactive use

Architecture: ReAct Loop with External File System

ReAct Loop Flow:

  1. User Message → MessageManager
  2. → Enter ReAct Loop
  3. Decision Point:
    • Yes, need tools → Execute Tools → Tool Results → Loop back
    • No, ready → Final Answer
  4. → (Max 200 iterations)
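
The loop itself is conceptually small. Here is a minimal sketch; the message format, llm client, and tool-dispatch details are assumptions rather than the actual ChatModeAgent code:

MAX_ITERATIONS = 200  # matches the cap described above

def react_loop(messages: list[dict], llm, tools: dict) -> str:
    """Minimal ReAct-style loop: ask the LLM, run any requested tools, repeat."""
    for _ in range(MAX_ITERATIONS):
        # The llm client is assumed to accept the tool registry and return tool calls
        response = llm.chat(messages, tools=tools)
        if not response.tool_calls:
            return response.content  # no tools requested: this is the final answer
        messages.append(response.as_message())
        for call in response.tool_calls:
            result = tools[call.name](**call.arguments)  # execute the requested tool
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    return "Stopped: iteration limit reached."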

Key Feature: External File System

Instead of cramming everything into context, Chat Mode uses a workspace:

workspace_chat_{session_id}/
├── cache/                    # Downloaded articles (full text)
│   ├── article_001.md
│   ├── article_002.md
│   └── ...
└── analysis/                 # Python execution outputs
    ├── images/               # Matplotlib charts
    ├── data/                 # CSV, JSON results
    └── ...

Workflow:

  1. User: "Compare pricing of GPT-5 vs Claude 4"
  2. Agent: Calls web_search(query="GPT-5 pricing") → Exa returns 5 articles
  3. System: Articles saved to cache/, LLM receives only highlights (200-500 chars each)
  4. Agent: Provides initial answer using highlights
  5. User: "What about enterprise contracts?"
  6. Agent: "Let me read the full article." → Calls file_read(filename="cache/article_003.md")
  7. System: Returns full 5000-word article text
  8. Agent: Provides detailed answer about enterprise pricing

Why This Works:

  • Context efficiency: Highlights (~100 tokens) vs full articles (~3000 tokens)
  • Human-in-the-loop: User can request deep dives on specific sources
  • Flexibility: Agent decides when to read full content

Tools (8 total):

  • web_search: Search, cache articles, return highlights with citations
  • execute_python: E2B sandbox for data analysis (optional, requires E2B_API_KEY)
  • file_read: Read cached articles or analysis outputs
  • MCP Tools: Dynamically loaded from Model Context Protocol servers (extensible)

MCP Integration: Chat Mode automatically loads tools from configured MCP servers. Currently, Verina includes:

Configured MCP Server:

  • chrome-devtools: Browser automation and web interaction
    • Tools: take_snapshot, navigate_page, click, fill, take_screenshot, list_pages, evaluate_script, etc.
    • Runs in headless Chromium with Docker-optimized settings
    • Enables web browsing, form filling, screenshot capture, and JavaScript execution

The MCP architecture is extensible: additional servers (PostgreSQL, Filesystem, GitHub, etc.) can be added to the MCP_SERVERS configuration without changing agent code.


3. Agent Mode: Deep Research & Report Generation

Model: GPT-5 Codex (openai/gpt-5-codex)

Why GPT-5 Codex? This was the most surprising finding in our research. Initially, we expected GPT-5 (the chat variant) to be better for research tasks. However, testing revealed:

  • Codex's reasoning is decisive: Short, focused reasoning that gets to the point
  • GPT-5's reasoning is verbose: Often reasons about irrelevant details
  • Codex excels at tool orchestration: Likely due to its training on function calling in code
  • KV-cache efficiency: Codex seems better optimized for long-context scenarios

In 18-minute research sessions consuming ~150k tokens per round, Codex consistently outperformed GPT-5 in both speed and reasoning quality.

Architecture: Two-Stage Progression

Agent Mode uses a Human-in-the-Loop (HIL) → Research progression:

Stage 1: HIL (Quick Search + User Confirmation)

  1. User Query → web_search → Quick Results
  2. → LLM provides initial analysis
  3. → User decides: "Good enough" or "Go deeper"
  4. → If deeper → call start_research → Stage 2

Stage 2: Research (Full Toolset Unleashed)

Workspace Initialization:

  • progress.md (strategy tracker)
  • notes.md (research findings)
  • draft.md (answer composition)
  • cache/ (article storage)
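
A minimal sketch of that initialization, assuming a pathlib-based layout that mirrors the files listed above:

from pathlib import Path

def init_workspace(session_id: str, root: Path = Path(".")) -> Path:
    """Create the Agent Mode workspace with its tracking files and cache directory."""
    ws = root / f"workspace_agent_{session_id}"
    (ws / "cache").mkdir(parents=True, exist_ok=True)
    (ws / "progress.md").write_text("# Progress\n\nOverall goal:\nCurrent stage:\nNext steps:\n")
    (ws / "notes.md").write_text("# Research Notes\n")
    (ws / "draft.md").write_text("# Draft\n")
    return ws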

Research Loop (up to 200 iterations):

  1. Tool calls → Results → Update workspace files
  2. Agent maintains progress.md:
    • Overall goal
    • Current stage (searching? analyzing? writing?)
    • Next steps
  3. Agent fills notes.md:
    • Key findings from each article
    • Data points, quotes, insights
  4. Agent drafts answer in draft.md:
    • Structured argument with [1][2] citations
  5. If context > 280k tokens:
    • compact_context auto-triggers
  6. When done: call stop_answer

HTML Blog Generation Phase:

  • Load draft.md and notes.md
  • Inject 2000-word prompt template
  • Generate Notion-inspired HTML blog
  • Extract to artifact.html

Tools (12+ total):

Core Research Tools:

  • web_search: Search and cache articles
  • file_read, file_write, file_list: Workspace file management
  • execute_python: E2B sandbox for data analysis, visualization (optional)

Intelligence Amplification Tools:

  • research_assistant: Auxiliary LLM agent (GPT-5) with independent conversation threads

    • Use case: "Read cache/quantum_article.md and explain qubit stability"
    • Maintains separate conversation history (multi-turn dialogue)
    • Saves main agent's context by delegating file reading
  • compact_context: Intelligent context compression

    • Uses a mini-agent (Gemini 2.5 Pro) that can read workspace files
    • Generates structured 5-section summary:
      1. Overall goal
      2. File system state (what's created/modified)
      3. Key knowledge (facts, data, insights)
      4. Recent actions (last 5-10 tool calls with full details)
      5. Current plan (next steps)
    • Auto-triggers at 280k tokens (limit: 400k)
    • Preserves file paths and navigation hints

Meta Tools:

  • start_research: Trigger stage transition (HIL → Research)
  • stop_answer: Signal completion and trigger blog generation

MCP Tools: Same as Chat Mode (chrome-devtools for browser automation), dynamically loaded in Research stage

Workspace Structure (Example from 18-minute session):

workspace_agent_{session_id}/
├── progress.md              # Research strategy (300 lines)
├── notes.md                 # Detailed findings (2000 lines)
├── draft.md                 # Structured answer (1500 lines)
├── cache/
│   ├── article_001.md       # 5000 words from Nature
│   ├── article_002.md       # 3000 words from ArXiv
│   ├── ... (15 articles)
├── conversations/
│   ├── conv_a1b2c3/         # Research assistant dialogue #1
│   │   └── messages.json
│   ├── conv_d4e5f6/         # Research assistant dialogue #2
│   │   └── messages.json
├── analysis/
│   ├── images/
│   │   ├── trend_chart.png  # Matplotlib output
│   │   └── comparison.png
│   ├── data/
│   │   ├── processed.csv
│   │   └── stats.json
│   └── reports/
│       └── analysis.md
└── artifact.html            # Final blog (auto-generated)

HTML Blog Generation:

When the agent calls stop_answer, a special prompt (2000+ words) is injected that instructs the LLM to:

  1. Read draft.md and notes.md from workspace
  2. Generate two deliverables:
    • Brief overview (2-3 paragraphs for chat display)
    • Deep technical blog (HTML format, Notion-inspired design)

Blog Specifications:

  • Design: Minimalist, Notion-inspired (800px max width, clean typography)
  • Content: Deep technical analysis (3000-5000 words typical)
  • Structure: Title, Executive Summary, Background, Core Analysis, Deep Dives, Practical Implications, References
  • Format: Standalone HTML (inline CSS, no external dependencies)
  • Citations: All references are clickable <a> tags
  • Responsive: Mobile-friendly with media queries
  • Accessibility: Semantic HTML5, ARIA labels, proper contrast

Example Output: A query like "Analyze the scalability challenges of quantum computing" results in:

  • 18-minute research session
  • 15 articles read and analyzed
  • 5 Python data analysis scripts executed
  • 1500-line draft.md with 30+ citations
  • 4000-word HTML blog with charts and deep technical insights

Context Management Innovation:

The compact_context tool is a mini-agent itself. Here's how it works:

  1. Triggered: When context exceeds 280k tokens (or manually called)
  2. Mini-Agent Spawned: Uses Gemini 2.5 Pro with file_read tool access
  3. Review Phase: Agent reads workspace files (progress.md, notes.md, draft.md)
  4. Compression: Generates structured 5-section summary (see above)
  5. Confirmation: Main LLM reviews summary and confirms understanding
  6. Rebuild: Replace old messages with [summary + confirmation] + recent 10 user turns
  7. Resume: Agent continues work seamlessly

This approach preserves critical information (file paths, data points, strategic decisions) while reducing token count by 60-80%.


Technical Innovations

1. External File System Architecture

Problem: LLM context windows are limited, but research involves reading dozens of articles.

Solution: Workspace-based storage with selective loading.

Benefits:

  • Articles stored once, referenced many times
  • Agent decides what to read (highlights → full text only when needed)
  • Workspace persists between sessions (future: resume research)
  • Python outputs (charts, data) accessible via file paths

Implementation:

# web_search tool (simplified)
articles = exa_api.search(query)
highlights = []
for idx, article in enumerate(articles, start=1):
    # Save full text to cache
    cache_path = workspace / "cache" / f"article_{idx:03d}.md"
    cache_path.write_text(article.full_text)

    # Return only highlights to LLM
    highlights.append({
        "idx": idx,
        "title": article.title,
        "snippet": article.highlights[0][:500],  # First 500 chars
        "cache_path": str(cache_path)
    })

return {"highlights": highlights}  # LLM sees ~100 tokens per source instead of ~3000

2. Test-Time Scaling: Mandatory Two-Round Search

Problem: A single search round often misses important perspectives.

Solution: Mandatory two-round search in Deep Mode.

Implementation:

  1. Round 1: LLM analyzes query → searches → receives results
  2. Forced reflection: LLM must identify gaps and search again
  3. Round 2: Supplemental search fills gaps or explores alternatives
  4. Deduplication: Merge results with continuous citation indices

Why It Works:

  • Forces LLM to critique its own results
  • Explores alternative angles and queries the first round may have missed
  • Costs 2x search API calls but improves answer comprehensiveness significantly
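
Put together, the control flow looks roughly like the sketch below (illustrative only; llm and exa_search are assumed interfaces, and deduplication is simplified to URL matching):

def deep_search_pipeline(query: str, llm, exa_search) -> list[dict]:
    # Round 1: refine the query and search
    refined = llm.complete(f"Decompose and refine this query for web search: {query}")
    first_batch = exa_search(refined)

    # Forced reflection: the LLM must name a gap or alternative angle, then search again
    gap_query = llm.complete(
        "Given these sources, what perspective is still missing? "
        "Reply with one supplemental search query.\n"
        + "\n".join(src["title"] for src in first_batch)
    )
    second_batch = exa_search(gap_query)

    # Merge with continuous citation indices, dropping duplicates from round two
    seen = {src["url"] for src in first_batch}
    merged = list(first_batch)
    for src in second_batch:
        if src["url"] not in seen:
            merged.append({**src, "idx": len(merged) + 1})
            seen.add(src["url"])
    return merged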

3. Research Assistant: Multi-Turn Auxiliary Agent

Problem: Reading files consumes main agent's context. Asking follow-up questions requires repeating file content.

Solution: Separate auxiliary agent with independent conversation memory.

Use Cases:

  1. File Reading: "Read cache/article_005.md and summarize quantum decoherence challenges"

    • Research Assistant reads file, maintains understanding
    • Main agent receives summary (saves ~2000 tokens)
  2. Multi-Turn Analysis:

    • Main: "Compare articles 3 and 7 on qubit stability" (returns conv_id: "conv_a1b2")
    • Main: "What about temperature requirements?", conv_id="conv_a1b2"
    • Research Assistant remembers previous comparison, provides focused answer
  3. Draft Review: "Review my draft.md and suggest improvements"

    • Assistant reads draft, provides feedback
    • Multi-turn editing session without cluttering main context

Technical Implementation:

  • Each conv_id = independent conversation thread stored in workspace/conversations/{conv_id}/
  • Uses GPT-5 for strategic guidance
  • Can call file_read tool to access workspace files
  • Returns results to main agent as simple text
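
A sketch of how such a tool could persist per-conversation threads under the workspace (the llm interface and helper shape are assumptions; the real tool also exposes file_read to the auxiliary model):

import json
import os
from pathlib import Path

def research_assistant(question: str, workspace: Path, llm, conv_id: str | None = None) -> dict:
    """Answer a question in an independent conversation thread keyed by conv_id."""
    conv_id = conv_id or f"conv_{os.urandom(3).hex()}"
    conv_dir = workspace / "conversations" / conv_id
    conv_dir.mkdir(parents=True, exist_ok=True)
    msg_file = conv_dir / "messages.json"

    # Load this thread's prior turns (empty list for a new conversation)
    history = json.loads(msg_file.read_text()) if msg_file.exists() else []
    history.append({"role": "user", "content": question})

    # The auxiliary model answers with its own memory, separate from the main agent
    answer = llm.chat(history)
    history.append({"role": "assistant", "content": answer})
    msg_file.write_text(json.dumps(history, indent=2))

    # The main agent receives only the answer text plus the conv_id for follow-ups
    return {"conv_id": conv_id, "answer": answer}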

4. Intelligent Context Compression

Problem: Traditional context compression loses critical information (file paths, data points, strategic decisions).

Solution: File-aware compression agent with structured output.

Innovation: The compactor is a mini-agent that:

  • Reads workspace files to understand current state
  • Reviews old conversation messages
  • Generates structured summary (5 sections, XML format)
  • Main LLM confirms understanding before continuing

How It Works:

  1. Triggered: When context exceeds 280k tokens (or manually called)
  2. Mini-Agent Spawned: Uses Gemini 2.5 Pro with file_read tool access
  3. Review Phase: Agent reads workspace files (progress.md, notes.md, draft.md) if needed
  4. Compression: Generates structured 5-section XML summary:
    • <overall_goal>: User's ultimate objective
    • <file_system_state>: All file operations (CREATED/MODIFIED/READ) with navigation hints
    • <key_knowledge>: Hard facts, data points, URLs, constraints, strategic decisions
    • <recent_actions>: Last 5-10 tool calls with full parameters and results
    • <current_plan>: Next steps and continuation strategy
  5. Confirmation: Main LLM reviews summary and confirms understanding
  6. Rebuild: Replace old messages with [summary + confirmation] + recent 10 user turns
  7. Resume: Agent continues work seamlessly

Result: Preserves critical information (file paths, data points, strategic decisions) while reducing token count by 60-80%.
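
A minimal sketch of the trigger-and-rebuild step; the compactor object stands in for the Gemini-based mini-agent, and its interface here is an assumption:

def maybe_compact(messages: list[dict], count_tokens, compactor,
                  limit: int = 280_000, keep_recent_user_turns: int = 10) -> list[dict]:
    """If over the token limit, summarize old turns and keep recent ones verbatim."""
    if count_tokens(messages) <= limit:
        return messages  # under the threshold: nothing to do

    # Cut point: everything from the Nth-most-recent user turn onward stays intact
    user_positions = [i for i, m in enumerate(messages) if m["role"] == "user"]
    cut = user_positions[-keep_recent_user_turns] if len(user_positions) > keep_recent_user_turns else 0

    summary = compactor.summarize(messages[:cut])  # 5-section XML summary
    return [
        {"role": "user", "content": summary},
        {"role": "assistant", "content": "Understood. Continuing from this state."},  # confirmation
    ] + messages[cut:]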

5. Model Context Protocol (MCP) Integration

Problem: Tool ecosystems are closed. Adding new capabilities requires code changes.

Solution: MCP (Model Context Protocol), a standardized protocol for connecting tools to LLMs.

How It Works:

  1. MCP servers are configured in backend/src/chat/mcp_client.py in the MCP_SERVERS dictionary
  2. Verina automatically loads tools from all servers on startup
  3. Tools appear in Chat Mode and Agent Mode (Research stage) automatically
  4. Adding new MCP servers only requires editing the configuration dictionary

Current Implementation: chrome-devtools MCP server

# backend/src/chat/mcp_client.py
MCP_SERVERS = {
    "chrome-devtools": {
        "command": "chrome-devtools-mcp",
        "args": [
            "--headless",
            "--executablePath", "/usr/bin/chromium",
            "--isolated",
            "--chromeArg=--no-sandbox",
            "--chromeArg=--disable-setuid-sandbox",
            "--chromeArg=--disable-dev-shm-usage",
        ],
        "env": None
    }
}

This provides Claude/GPT-5 with browser automation capabilities: navigate websites, fill forms, take screenshots, execute JavaScript, and interact with web pages, all without modifying agent code.

Extensibility: Additional MCP servers (PostgreSQL, Filesystem, GitHub, etc.) can be added to the MCP_SERVERS dictionary with zero changes to agent logic, as in the hypothetical sketch below.
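
For example, an entry for the reference filesystem MCP server might look like this (the command and args are assumptions; check the target server's documentation before adding it):

# Hypothetical addition to backend/src/chat/mcp_client.py
MCP_SERVERS["filesystem"] = {
    "command": "npx",
    "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"],
    "env": None,
}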


Model Selection & Rationale

Our model choices are based on extensive testing and practical observations:

Gemini 2.5 Flash (Fast/Deep Mode)

Selected for: Search pipeline (tool calling + answer generation)

Why:

  • Speed: 5-second end-to-end for Fast Mode
  • Tool calling: Excellent at using fast_search and deep_search precisely
  • Streaming: Low latency for answer generation

Trade-offs:

  • Not as strong at deep reasoning as Claude or GPT-5
  • Acceptable for search tasks where tools do heavy lifting

Claude Sonnet 4.5 (Chat Mode)

Selected for: Interactive Q&A with tool access

Why:

  • Agentic reasoning: Best-in-class for deciding when to use tools
  • Citation management: Excellent at using [1][2] format consistently
  • File reading: Great at selectively choosing when to read full articles vs using highlights
  • MCP compatibility: Strong tool-calling capabilities for dynamic tools

Observations:

  • Models good at coding tend to be good at general tool use
  • Claude's function calling is more reliable than Gemini's
  • Faster than GPT-5 for interactive use

Trade-offs:

  • More expensive than Gemini
  • Acceptable for Chat Mode where quality > speed

GPT-5 Codex (Agent Mode)

Selected for: Deep research with multi-tool orchestration

Why (most surprising finding):

  • Decisive reasoning: Codex's reasoning is short and focused, unlike GPT-5's verbosity
  • Tool orchestration: Excels at complex multi-step workflows (search → read → analyze → write)
  • Long-context efficiency: Better KV-cache design for 150k+ token contexts
  • Strategic planning: Maintains coherent research strategy over 200 iterations

Our Hypothesis: Codex's training on code (function composition, API calls) transfers to tool use better than pure chat training.

Evidence: In 18-minute research sessions, Codex:

  • Called 40+ tools with 95%+ success rate
  • Maintained coherent workspace state (progress.md, notes.md, draft.md)
  • Generated higher-quality HTML blogs than GPT-5

Trade-offs:

  • Most expensive
  • Justified for Agent Mode where depth matters most

Auxiliary Models

  • Gemini 2.5 Pro (context compression): Chosen for speed and cost-effectiveness in the compact_context mini-agent
  • GPT-5 (research assistant): Same reasoning capabilities as Codex, used for auxiliary tasks

Challenges & Solutions

Challenge 1: Context Explosion in Long Research

Problem: Long research sessions can quickly consume massive amounts of context with full article content.

Solution:

  1. External file system (articles stored in cache/)
  2. Highlight-first approach (LLM sees snippets, reads full text only when needed)
  3. Intelligent compression (file-aware compaction at 280k tokens)

Result: Enables sustained research sessions without hitting context limits prematurely.

Challenge 2: Tool Calling Reliability

Problem: LLMs sometimes call tools with wrong parameters or hallucinate tool names.

Solution:

  1. Strict tool schemas with detailed descriptions
  2. Error messages guide LLM to retry correctly
  3. Model selection (Claude/Codex have higher tool-calling accuracy than Gemini)

Example:

# Strict schema with examples
{
  "name": "file_read",
  "parameters": {
    "type": "object",
    "properties": {
      "filename": {
        "type": "string",
        "description": "Relative path to file, e.g., 'cache/article_001.md' or 'notes.md'. Do NOT include the workspace path prefix."
      }
    },
    "required": ["filename"]
  }
}

# Error handling with guidance
if not file_exists(filename):
    return {
        "error": f"File '{filename}' not found. Available files: {list_files()}. Did you mean 'cache/{filename}'?"
    }

Result: Tool call success rate improved from ~70% (early versions) to 95%+ (current).

Challenge 3: Citation Consistency

Problem: LLMs sometimes forget citation format, use wrong numbers, or duplicate citations.

Solution:

  1. System prompt emphasizes citation format
  2. Sources provided with clear [idx] markers in highlights
  3. Post-processing validation (future: detect missing citations)

Example Prompt:

When using search results, ALWAYS cite with [1][2][3] format.
The number corresponds to the source's idx field.

Example:
Quantum computers face decoherence challenges [1]. However,
recent advances in error correction show promise [2][3].

Result: ~90% citation accuracy in generated answers.
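
The planned post-processing validation could be as simple as the following sketch, which flags out-of-range citations and sources that were never cited:

import re

def find_citation_issues(answer: str, num_sources: int) -> dict:
    """Flag citation numbers with no matching source and sources that go uncited."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {
        "out_of_range": sorted(n for n in cited if n < 1 or n > num_sources),
        "uncited_sources": sorted(set(range(1, num_sources + 1)) - cited),
    }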


Future Work

Our current focus areas:

  1. Exploring new LLM interaction paradigms: Moving beyond traditional prompting to more natural, intuitive ways of interacting with AI agents.

  2. Long-term memory optimization: Building persistent user memory and cross-session knowledge accumulation.

  3. Advanced context management: Improving context compression and enabling longer research sessions without quality degradation.

We're committed to continuously building and evolving the Verina brand as a platform for the AI-era search paradigm.


Conclusion

Verina demonstrates that giving AI agents the right tools and architecture can overcome traditional limitations:

  • External file system solves context constraints
  • Multi-round search improves answer quality via test-time scaling
  • Specialized modes serve different user needs efficiently
  • Intelligent compression enables sustained 18-minute research sessions
  • Model selection matters: Codex > GPT-5 for tool orchestration, Claude > Gemini for chat

Our core insight: LLM capability = Base reasoning × Tool quality × Architecture. While everyone focuses on base reasoning (bigger models, more training), we believe tools and architecture are equally important.

Verina is an experiment in pushing architectural boundaries. We're excited to see where this leads.


Appendix: Tool Reference

Agent Mode Tools (Research Stage)

| Tool | Purpose | Example |
|------|---------|---------|
| web_search | Search web, cache articles, return highlights | web_search(query="quantum computing challenges") |
| file_read | Read workspace files | file_read(filename="cache/article_001.md") |
| file_write | Write to workspace | file_write(filename="notes.md", content="...") |
| file_list | List workspace files | file_list(directory="cache") |
| execute_python | Run Python code in E2B sandbox | execute_python(code="import matplotlib...") |
| compact_context | Compress context intelligently | compact_context(keep_recent_user_messages=10) |
| research_assistant | Multi-turn auxiliary agent | research_assistant(question="Summarize article 5", conv_id="conv_001") |
| stop_answer | Signal completion | stop_answer() |
| MCP Tools | Browser automation (chrome-devtools) | mcp_chrome-devtools_take_snapshot(), mcp_chrome-devtools_navigate_page(url="..."), etc. |

Context Management

  • Limit: 400,000 tokens (GPT-5 Codex)
  • Auto-compact: Triggered at 280,000 tokens
  • Compaction strategy: Keep recent 10 user turns intact, summarize older messages
  • Compaction agent: Gemini 2.5 Pro with file_read access

Workspace Files

| File | Purpose | Example Content |
|------|---------|-----------------|
| progress.md | Research strategy, status | "Overall goal: ... Current stage: Analyzing data... Next: Write draft..." |
| notes.md | Detailed findings | "Article 1 (Nature): Quantum decoherence rates... Article 2 (ArXiv): Error correction codes..." |
| draft.md | Answer composition | "# Introduction\nQuantum computing faces three key challenges [1][2]..." |
| cache/*.md | Downloaded articles | Full article text from Exa API |
| conversations/*/messages.json | Research assistant dialogues | Conversation history for multi-turn consultations |
| analysis/images/*.png | Python outputs | Matplotlib charts, visualizations |
| artifact.html | Final blog | Auto-generated HTML report |

Contact

For questions, feedback, or contributions:


Verina is an experimental project. All findings, observations, and design decisions are based on practical testing and may evolve as we learn more.

Version History:

  • v1.0 (2025-10-20): Initial comprehensive system card