
II-Researcher: An Open-Source Deep Research Agent



Introducing II-Researcher: a new open-source framework designed to aid in building research agents. By providing an open tool capable of tackling complex inquiries, II-Researcher directly counters the trend of proprietary systems and lowers barriers for innovators worldwide. II-Researcher embodies our commitment to Universal Basic AI (UBAI), empowering users across the network—from individual researchers using Edge Nodes to institutions leveraging Specialized AI—with sophisticated tools to navigate, analyze, and contribute to the global knowledge commons.


To better contextualize the contributions of II-Researcher, it is essential to examine the current landscape of advanced AI agents, particularly those addressing complex research tasks. OpenAI's Deep Research initiatives, for instance, demonstrate the scale of challenges in AI today.


OpenAI's Deep Research: An Autonomous Research Agent

OpenAI's Deep Research is a new AI agent for in-depth, multi-step research tasks [1]. Unlike standard chatbots that answer in one go, Deep Research iteratively browses the web, reads multiple sources, and compiles information into structured outputs. It is powered by a version of OpenAI's upcoming o3 large language model, which has been specially optimized for reasoning and web-based analysis. Deep Research behaves like a digital research analyst: it can plan a research strategy and gather data from various websites and documents to produce a synthesized report complete with citations and reasoning steps. For example, when tasked with comparing market trends or summarizing academic literature, Deep Research will navigate through relevant articles, refine its queries, and document its findings with references, much as a human researcher would.

How Does It Work? While OpenAI has not open-sourced Deep Research, their blog post provides an overview of how the system works [1]. It uses an advanced GPT-based model (the "o3" model) combined with tool-use capabilities such as web browsing and even a Python interpreter for data analysis. The system was trained with reinforcement learning on real browsing and reasoning tasks – essentially learning how to follow multi-step research plans that yield correct answers by trial and error. This training helps it break down complex queries into sub-tasks, find relevant information, and verify facts.

Figure 1: Overview of OpenAI’s Deep Research Methodology


Performance: Early benchmarks show that OpenAI's Deep Research is a leap ahead of previous AI models in research tasks. On the challenging Humanity's Last Exam [14] test – a broad set of expert-level questions across 100+ subjects – Deep Research achieved 26.6% accuracy, far surpassing earlier GPT-based models (for context, OpenAI's older o1 model scored 9.1%). This indicates that it can handle complex, cross-disciplinary questions much better than standard LLMs. It also sets a new state of the art on the GAIA benchmark for AI agents, leading with strong performance on multi-step reasoning tasks.

Open-Source Ecosystem

The emergence of closed-source systems like Deep Research has spurred development within the open-source community to create analogous agents, offering transparency and customization. Many open-source projects are emerging, showcasing the community's excitement and innovation around Deep Research implementations. Some notable efforts include the following:

- Hugging Face's Open DeepResearch (Smol Agents) [2]
- Jina AI's Deep Research Clone [3]
- LangChain's Open Deep Research [4]
- U14App's Deep Research [20]
- Independent projects by assafelovic, dzhng, btahir, nickscramara, and mshumer [5][8][17][18][19]
- And many others…

Among those, the Hugging Face and Jina AI projects have garnered the most attention thanks to their rapid development, strong community engagement, and promising performance on benchmarks:

- Hugging Face – Open DeepResearch (Smol Agents): This project rapidly developed an open-source agent framework, smolagents [2]. A key innovation is the CodeAct [6] approach, which represents agent plans as executable code rather than declarative structures (e.g., JSON); a toy illustration of this distinction follows below. This reportedly reduced reasoning steps by approximately 30%, enhancing efficiency. Equipped with basic web browsing and text reading tools, their agent achieved 55.15% accuracy on the GAIA validation set, compared to approximately 67% for OpenAI's closed system.
- Jina AI's Deep Research Clone: Jina AI developed a replica leveraging their expertise in search workflows [3]. While specific implementation details are limited, it likely utilizes components like Jina's DocArray, open LLMs (e.g., Llama 2 [13]), search providers (e.g., Brave, DuckDuckGo), and Jina's reader models to execute a search-read-synthesize loop.
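To make the code-versus-JSON distinction concrete, the toy comparison below contrasts a declarative JSON action with a CodeAct-style code action. The tool names and schemas here are hypothetical illustrations of the idea behind CodeAct [6], not smolagents' actual API.

```python
# Illustrative contrast between a declarative (JSON) action and a CodeAct-style
# action. All tool names and fields are hypothetical.
import json

# Declarative style: the agent emits one structured tool call per step,
# and the orchestrator interprets it.
json_action = json.dumps({
    "tool": "web_search",
    "arguments": {"query": "GAIA benchmark leaderboard"},
})

# CodeAct style: the agent emits a short executable snippet, so multiple tool
# calls, loops, and intermediate logic can be expressed in a single step.
code_action = """
results = web_search("GAIA benchmark leaderboard")
pages = [visit(r.url) for r in results[:3]]
summary = summarize(pages)
"""

print(json.loads(json_action)["tool"])            # -> web_search
print(code_action.strip().splitlines()[0])        # first line of the code plan
```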

Conceptual Frameworks for Deep Research


As analyzed by Lee [7], the term "Deep Research" lacks a formal definition, similar to the ambiguity surrounding terms like Retrieval-Augmented Generation (RAG) in 2025. Lee defines it as a report generation system using LLMs for iterative search, analysis, and synthesis. Implementations are broadly categorized as:

Untrained Approaches (both the Jina AI and Hugging Face projects follow this direction):

- Directed Acyclic Graph (DAG): Decomposes queries, retrieves information for each part, and synthesizes a report (e.g., GPT-Researcher [8]).
- State Machine (SM): Extends DAGs by incorporating self-reflection, enabling LLMs to review and refine outputs dynamically. Both the Hugging Face and Jina AI efforts align with this untrained direction.

Trained Approaches:

- End-to-End Systems: Holistically optimized systems (e.g., Stanford's STORM [9]) producing high-quality, structured outputs.
- Large Reasoning Models: Models specifically fine-tuned for reasoning tasks, enhancing performance in report generation (e.g., OpenAI's Deep Research [1]).

II-Researcher's Purpose

The capabilities exhibited by agents such as Deep Research illustrate the potential of AI in complex information analysis. However, our commitment is to democratize such power. Motivated by both the potential demonstrated and the limitations of proprietary approaches, we developed II-Researcher as an open-source framework specifically designed to tackle complex inquiries, counter the trend towards proprietary systems, and lower barriers for innovators worldwide. This framework provides users with sophisticated tools to build capable research agents, directly furthering our mission to foster an open, distributed, and inclusive AI ecosystem.


In the following sections, we dive into the methodology, components, and examples.


II-Researcher Implementation

Our II-Researcher framework investigates both an untrained state-machine methodology and an approach inspired by trained reasoning models for autonomous research.


Approach 1: Untrained State Machine Pipeline


Figure 2: The first approach uses a state-machine pipeline.


This approach implements a state machine architecture, facilitating iterative refinement and dynamic state transitions that mirror human research processes. The pipeline consists of the following stages (a minimal code sketch follows the list):

1. Query Evaluation: The initial user query is analyzed to determine key requirements for the answer, precisely:
   - Freshness: Whether the response requires the most up-to-date information available.
   - Plurality: Whether the response should incorporate multiple perspectives or sources.
   - Completeness: Whether the response demands a thorough and detailed explanation or solution.
   This ensures that the system tailors its processing to deliver an accurate and relevant answer based on the query's requirements.
2. Web Search Query Generation and Execution: For each sub-query, the system autonomously generates search queries and utilizes browser integration through search engines like Tavily or SerpAPI to gather relevant resources.
3. Information Retrieval and Compression: Retrieved web content undergoes a compression process using LLM-based and embedding-based methods to efficiently extract and consolidate essential information, facts, and data (detailed in the Context Compression section below).
4. Self-Reflection and Critique Cycle: The agent critically evaluates synthesized information, reflecting on knowledge gaps, inconsistencies, or inaccuracies and determining necessary follow-up actions. This step can trigger further searches or refinement cycles.
5. State Management: A memory module maintains state, including accumulated knowledge, generated queries, action logs, and records of failed attempts, informing future decisions.
6. Final Report Generation: After the draft answer is successfully evaluated against the aspects determined in the Query Evaluation step, and once all information is thoroughly verified and deemed accurate, the final structured report, complete with detailed citations and references, is compiled.
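The sketch below is a minimal illustration of this stage loop. All class and function names are placeholders rather than II-Researcher's actual API, and the search, compression, and LLM calls are stubbed out.

```python
# Minimal sketch of the untrained state-machine pipeline described above.
# Every function here is a stand-in for an LLM call, a search provider
# (e.g., Tavily/SerpAPI), or the compression module.
from dataclasses import dataclass, field

@dataclass
class QueryRequirements:          # output of stage 1 (Query Evaluation)
    freshness: bool
    plurality: bool
    completeness: bool

@dataclass
class ResearchState:              # stage 5 (State Management): the memory module
    question: str
    knowledge: list[str] = field(default_factory=list)   # compressed findings
    queries: list[str] = field(default_factory=list)     # search queries issued
    failures: list[str] = field(default_factory=list)    # dead-end attempts

def evaluate_query(question: str) -> QueryRequirements:
    return QueryRequirements(True, True, True)            # LLM judgment, stubbed

def generate_queries(state: ResearchState) -> list[str]:
    return [state.question]                                # LLM generation, stubbed

def search_and_compress(query: str, question: str) -> list[str]:
    return [f"compressed notes for '{query}'"]             # search + compression, stubbed

def reflect(state: ResearchState) -> bool:
    return len(state.knowledge) >= 3                       # "enough evidence?" critique, stubbed

def write_report(state: ResearchState, reqs: QueryRequirements) -> str:
    return "Final report with citations:\n" + "\n".join(state.knowledge)

def run_pipeline(question: str, max_rounds: int = 5) -> str:
    reqs = evaluate_query(question)                        # 1. Query Evaluation
    state = ResearchState(question)
    for _ in range(max_rounds):
        for q in generate_queries(state):                  # 2. Query Generation & Execution
            state.queries.append(q)
            state.knowledge += search_and_compress(q, question)  # 3. Retrieval & Compression
        if reflect(state):                                 # 4. Self-Reflection & Critique
            break                                          # state carries across rounds
    return write_report(state, reqs)                       # 6. Final Report Generation

print(run_pipeline("What drove the recent growth in open-source AI agents?"))
```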

Key Components

Context Compression

Pushing all content into an LLM is not ideal in terms of both quality and cost. Additionally, the content may exceed the model's maximum context length.


However, over-compressing or omitting necessary information can significantly degrade pipeline performance.


To address this, we adopt a hybrid approach that leverages an LLM and an embedding model for compression.

1. Text Segmentation: We use a simple approach, splitting the document into paragraphs or fixed sentence chunks.
2. Embedding-Based Filtering:
   - Each chunk is converted into a vector representation using an embedding model; the embedding model we use is text-embedding-3-large from OpenAI.
   - The vector includes the website title and the current query/question.
   - Only chunks that pass a predefined relevance threshold are retained.
3. LLM Generative Retrieval:
   - We number each chunk and feed it into the LLM.
   - The LLM is prompted to identify and rank relevant chunks in order of decreasing relevance.
   - This method, known as generative retrieval, is different from paraphrasing or rewriting—it helps us save a significant number of output tokens, which are more expensive than input tokens.

Figure 3: Generative Retrieval


4. Final Selection & Compression:
   - We combine the results from both retrieval methods.
   - Based on a predefined word limit for each website, we use a voting mechanism and the ranking order to compress the text effectively (a sketch of this hybrid step follows the list).
   - Following this approach, we ensure that the most relevant content is retained while staying within each website's maximum word limit.
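The sketch below shows one way the hybrid compression step could be wired together. It assumes the page title and question are embedded together as the relevance reference, and it uses toy stand-ins for the embedding model (text-embedding-3-large in the real pipeline) and the LLM ranking call; it illustrates the voting idea, not the production implementation.

```python
# Sketch of hybrid compression: embedding-based filtering plus LLM
# "generative retrieval", merged by a simple vote and a word limit.
# embed() and llm_rank() are toy stubs for the real model calls.
import math

def split_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    sents = [s.strip() for s in text.split(".") if s.strip()]
    return [". ".join(sents[i:i + sentences_per_chunk])
            for i in range(0, len(sents), sentences_per_chunk)]

def embed(text: str) -> list[float]:
    # Stub: the real pipeline calls text-embedding-3-large. Here we use a
    # toy bag-of-letters vector so the sketch runs offline.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def llm_rank(chunks: list[str], query: str) -> list[int]:
    # Stub for generative retrieval: the LLM sees numbered chunks and returns
    # only the indices of relevant ones in decreasing relevance (cheap on
    # output tokens). Faked here with keyword overlap.
    order = sorted(range(len(chunks)),
                   key=lambda i: -len(set(query.lower().split())
                                      & set(chunks[i].lower().split())))
    return order[: max(1, len(chunks) // 2)]

def compress(page_text: str, title: str, query: str,
             sim_threshold: float = 0.2, word_limit: int = 120) -> str:
    chunks = split_chunks(page_text)
    qvec = embed(title + " " + query)                       # title + question reference
    emb_keep = {i for i, c in enumerate(chunks)
                if cosine(embed(c), qvec) >= sim_threshold}
    llm_keep = llm_rank(chunks, query)
    # Voting: chunks chosen by both methods first, then by LLM rank order.
    ordered = [i for i in llm_keep if i in emb_keep] + \
              [i for i in llm_keep if i not in emb_keep]
    out, used = [], 0
    for i in ordered:
        words = chunks[i].split()
        if used + len(words) > word_limit:                   # per-website word budget
            break
        out.append(chunks[i])
        used += len(words)
    return " ".join(out)

if __name__ == "__main__":
    text = ("II-Researcher compresses pages before answering. The framework splits "
            "text into chunks. Unrelated sentences talk about cooking pasta. "
            "Compression keeps only query-relevant chunks within a word limit.")
    print(compress(text, title="II-Researcher blog", query="how does compression work"))
```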

Self-Reflection Mechanism


Relying solely on the model to choose the right action is not always ideal, especially for non-reasoning models. We introduce an additional reflection step to enhance decision-making after the agent visits a website.


After each visit, the agent evaluates:

- Knowledge gained: What new information was obtained from this visit?
- Previous actions: What steps have already been taken?
- Gaps & next steps: What information is still missing or requires deeper investigation?

This self-reflection process serves as context for the model, helping it make more informed decisions in subsequent steps based on newly acquired information.


Figure 4: Retrieval and Self-Reflection Process

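A minimal sketch of such a reflection record and prompt is given below. The field names and prompt wording are illustrative rather than the exact II-Researcher schema, and `ask_llm` stands in for whatever chat-model call the pipeline uses.

```python
# Sketch of the reflection record produced after each page visit and fed back
# into the model's context. Field names are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Reflection:
    knowledge_gained: str    # what new information did this visit provide?
    previous_actions: str    # which steps have already been taken?
    gaps_next_steps: str     # what is still missing or needs deeper digging?

REFLECTION_PROMPT = """You just visited: {url}
Compressed notes from the page:
{notes}

Answer briefly, one line each:
1. Knowledge gained: what new information was obtained from this visit?
2. Previous actions: what steps have already been taken?
3. Gaps & next steps: what is still missing or requires deeper investigation?
"""

def reflect_on_visit(url: str, notes: str, ask_llm) -> Reflection:
    # `ask_llm` is any callable that sends a prompt to a chat model and returns text.
    raw = ask_llm(REFLECTION_PROMPT.format(url=url, notes=notes))
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    lines += [""] * (3 - len(lines))        # pad if the model returned fewer lines
    return Reflection(*lines[:3])
```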

Examples

Example 1:


Figure 5: Prompt for Pipeline Question 1


Example 2:


Figure 6: Prompt for Pipeline Question 2


Approach 2: Reasoning Model via Prompting

In addition to our State Machine (SM) approach, we explored an alternative method that uses a reasoning model through prompting rather than fine-tuning. This approach builds on the strengths of large reasoning models like DeepSeek R1 or QwQ [15][16], enhancing their ability to process complex queries while maintaining logical consistency and factual accuracy.


Unlike the traditional State Machine approach—which decomposes tasks into discrete components executed sequentially via a predefined pipeline—this method avoids rigid compartmentalization. Pipeline-based designs can sometimes result in overly generalized logic and insufficient contextual awareness across steps. In contrast, our observations of open-ended reasoning models like DeepSeek R1 and QwQ reveal a distinct pattern: these models internally perform actions and reflections as part of a dynamic, self-directed reasoning process, as illustrated below:


Figure 7: DeepSeek R1 reasoning process


Inspired by this emergent behavior, we propose an architecture where tool usage and reflection are embedded within the model's internal thought process (e.g., within <think> ... </think> blocks). Given the user input, we defined a system prompt (Appendix A) including tool definitions, allowing the model to use them at will. For II-Researcher, we provided two primary tools: WebSearch and Visit. We adapted the CodeAct [6] style for robustness, instructing the model to generate a Python snippet to execute these tools.
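The sketch below shows how such a loop could be orchestrated, assuming a `generate` callable that continues the transcript and stubbed tool bodies. The snippet-execution convention (the model's code setting a `result` variable) is our illustrative choice, not the exact II-Researcher protocol.

```python
# Sketch of the reasoning-model loop: the model thinks inside <think>...</think>,
# emits CodeAct-style Python snippets that call the provided tools, and the
# orchestrator executes them and feeds the results back.
import re

FENCE = "`" * 3                                   # avoids literal backtick fences here
SNIPPET_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def web_search(query: str) -> list[dict]:
    # Stub: the real tool queries a search provider (e.g., Tavily or SerpAPI).
    return [{"title": "stub result", "url": "https://example.com"}]

def page_visit(url: str) -> str:
    # Stub: the real tool fetches and compresses the page content.
    return "stub page content"

def run_snippet(snippet: str) -> str:
    # Execute the model-written snippet with only the two tools in scope.
    scope = {"web_search": web_search, "page_visit": page_visit}
    try:
        exec(snippet, scope)
        return str(scope.get("result", "(snippet set no `result` variable)"))
    except Exception as err:                      # surface tool errors back to the model
        return f"Tool error: {err}"

def agent_loop(generate, system_prompt: str, question: str, max_turns: int = 8) -> str:
    # `generate` is any callable that continues the transcript with model text.
    transcript = f"{system_prompt}\nUser: {question}\nAssistant: <think>\n"
    for _ in range(max_turns):
        output = generate(transcript)
        transcript += output
        if "</think>" in output:                  # reasoning closed: the answer follows
            return output.split("</think>", 1)[1].strip()
        snippets = SNIPPET_RE.findall(output)
        if snippets:                              # execute the latest tool snippet
            transcript += "\nTool output:\n" + run_snippet(snippets[-1]) + "\n"
    return "No final answer produced within the turn budget."
```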

However, a key challenge with this reasoning-focused approach is that models like DeepSeek R1 may have limitations in strictly adhering to instructions compared to more general-purpose models (e.g., GPT-4o), especially within their internal thought process (<think> blocks), as they aren't explicitly trained for instruction following. We observed that with longer contexts (e.g., after 2-3 tool uses), the model could struggle to recall all the initial instructions from the system prompt. To address this, we introduce a prefilled thinking process.

Prefilling the Thinking Process


To improve the model's performance, we prepend a structured thinking template to the input, which the model uses as a starting point for its reasoning. This prefilled process acts as a scaffold, directing the model to follow a systematic approach while leveraging the available tools. The detailed prefill thinking prompt can be found in Appendix A.

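A minimal sketch of what such a prefill could look like is given below; the wording is illustrative and not the actual Appendix A prompt.

```python
# Sketch of prefilling the thinking process: the scaffold becomes the opening
# of the assistant turn, so the model continues it instead of having to recall
# the instructions from the (now distant) system prompt. Wording is illustrative.
PREFILLED_THINKING = """<think>
I will work on this question step by step.
1. Restate what is being asked and what a complete answer requires.
2. When information is missing, call the provided tools (web_search, page_visit)
   by writing a short Python snippet, then wait for the tool output.
3. After each tool result, note what was learned, what is still missing,
   and whether any earlier assumption needs revision.
4. Close the thinking block only when every claim is backed by a visited source,
   then write the final, cited answer.
First, let me restate the question and identify what I still need to find out.
"""

def build_prompt(system_prompt: str, question: str) -> str:
    return f"{system_prompt}\nUser: {question}\nAssistant: {PREFILLED_THINKING}"
```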

How It Works


Figure 8: Reasoning Model with Tool Execution Flow


By prefilling the thinking process with these instructions, the model is primed to approach each query methodically. For instance, when faced with a question, it begins by reflecting on the task within a <think> block, identifying gaps in its knowledge, and invoking tools like web_search or page_visit to gather data. Once sufficient information is collected and validated, the model finalizes its reasoning and presents a detailed, evidence-based response after the </think> tag.

The underlying implementations for the WebSearch and Page Visit tools remain consistent with those used in the State Machine approach.


Examples

Example 1:


Figure 9: Reasoning Question 1 Prompt


Example 2:


Figure 10: Reasoning Question 2 Prompt


Benchmarking Results

1. GAIA Dataset

To evaluate the performance of our implementations, we used the GAIA [11] benchmark's validation set, focusing on questions requiring only Web Browsers and Search Engines. This subset includes 44 questions across three difficulty levels. Performance on the GAIA validation subset is shown in Figure 11.

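For illustration, selecting and scoring such a web-only subset could look roughly like the sketch below. The dataset config and the "Annotator Metadata"/"Tools" field names are assumptions about the public GAIA release, and plain exact match is a simplification of GAIA's official quasi-exact-match scoring, so the resulting counts may differ from the 44-question subset used here.

```python
# Rough sketch: filter GAIA validation questions to those whose annotated tools
# are only web browsers / search engines, then score an agent by exact match.
from datasets import load_dataset

WEB_ONLY = {"web browser", "search engine"}

def uses_only_web_tools(example: dict) -> bool:
    # Assumed metadata format: a numbered, newline-separated list of tool names.
    tools = example["Annotator Metadata"]["Tools"].lower()
    listed = {line.split(".", 1)[-1].strip() for line in tools.splitlines() if line.strip()}
    return bool(listed) and listed.issubset(WEB_ONLY)

def evaluate(agent) -> float:
    # `agent` is any callable mapping a question string to an answer string.
    val = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")
    subset = [ex for ex in val if uses_only_web_tools(ex)]
    hits = sum(agent(ex["Question"]).strip().lower() == ex["Final answer"].strip().lower()
               for ex in subset)
    return hits / len(subset)
```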

Figure 11: Benchmark Results on GAIA Validation Subset (WebTool only)


The GAIA evaluation details can be found here.

2. FRAMES Test Set

FRAMES [12] is a comprehensive evaluation dataset designed to assess Retrieval-Augmented Generation (RAG) systems regarding factuality, retrieval accuracy, and reasoning capabilities. Figure 12 presents the results on the FRAMES test set.

The dataset is available here.

Details of the FRAMES results can be checked here.


Figure 12: Benchmark Results on FRAMES Test Set


Conclusion

Our experiments demonstrate that utilizing large reasoning models via advanced prompting techniques can significantly enhance the accuracy of autonomous AI research agents on complex, multi-step tasks. This surpasses the performance of our implemented untrained state machine approach and baseline systems on the selected GAIA subset. While the prompt-guided approach requires careful prompt engineering, the gains in reasoning accuracy and report quality suggest its viability for demanding research applications.


Future work will focus on developing hybrid architectures that integrate the efficiency and structured process of the State Machine approach with the deep reasoning capabilities elicited from models like DeepSeek R1 through optimized prompting. The goal is to create robust and performant AI research agents applicable to a broader range of complex inquiries.


Translated from: https://www.ii.inc/web/blog/post/ii-researcher

