A Blueprint Architecture of Compound AI Systems for Enterprise

Eser Kandogan, Sajjadur Rahman, Nikita Bhutani, Dan Zhang, Rafael Li Chen, Kushan Mitra, Sairam Gurajada, Pouya Pezeshkpour, Hayate Iso, Yanlin Feng, Hannah Kim, Chen Shen, Jin Wang, Estevam Hruschka
Megagon Labs, USA
Abstract.

Large Language Models (LLMs) have showcased remarkable capabilities that extend beyond conventional NLP challenges, creating opportunities for production use cases. Towards this goal, there is a notable shift to building compound AI systems, wherein LLMs are integrated into an expansive software infrastructure with a multitude of components like models, retrievers, databases and tools. In this paper, we introduce a blueprint architecture for compound AI systems to operate in enterprise settings, in a cost-effective and feasible manner. Our proposed architecture aims for seamless integration with existing compute and data infrastructure, with ‘stream’ serving as the key orchestration concept to coordinate data and instructions among agents and other components. Task and data planners break down, map, and optimize tasks and data operations, respectively, across the available agents and data sources defined in their registries, given production constraints such as accuracy and latency.

conference: June 13, 2024; San Francisco, CA

1. Introduction

LLMs have demonstrated impressive capabilities in various tasks that extend beyond traditional NLP problems (Petroni et al., 2021; Izacard et al., 2022; Li et al., 2022; Schick et al., 2024; Zhang et al., 2023), ushering in a new era of LLM-powered applications that leverage their abilities across multiple domains. In current approaches, LLMs assume a central role in nearly every aspect, encompassing task planning, data discovery and retrieval, and interfacing with other tools and services. However, such extensive involvement often poses challenges to deployment in production settings, where additional task and data constraints, such as latency, accuracy, cost, availability and quality, among others, must be considered (Xi et al., 2023; Yang et al., 2024; Wu et al., 2023; Khattab et al., 2023).

Towards productionalization, there is a shift from monolithic models to compound AI systems that incorporate various components other than LLMs, e.g. components for data retrieval, control flow, proprietary models, and databases. Such systems provide enhanced performance for complex tasks, greater flexibility and adaptability across different use cases, easier integration of existing models and data, and greater control and trust. For example, LinkedIn and Indeed, two global job matching and hiring platforms, are productionalizing compound AI systems for a multitude of tasks in HR such as matching, recruitment, and career guidance, among others (OpenAI, 2024; Bottaro and Ramgopal, 2024).

We propose a blueprint architecture of a compound AI system tailored for enterprise use, unlike existing work (Wu et al., 2023; Liu, 2022), which lacks support for agent orchestration and optimization of agentic workflows. Key factors we consider in the design include: (1) ensuring seamless integration into existing infrastructure through suitable touch points and interfaces, (2) effectively orchestrating work within and external to the compound system with appropriate resource allocation, and (3) maximizing utilization of the system in a cost-effective manner.

2. Blueprint Architecture

Key components in the blueprint architecture include: (1) agents, along with agent and data registries, as key touch points and interfaces to seamlessly integrate with existing deployed models, APIs, databases, and tools, (2) streams to orchestrate data and instructions across components, and (3) task and data planners to optimize for cost and quality constraints in task execution and data retrieval (Figure 1).

Figure 1. Blueprint Architecture: Data and Agent Registries are touch points that define existing data, models, APIs, and services in the enterprise for utilization by agents.

2.1. Integration: Touchpoints and Interfaces

Agents. Agents are ‘compute’ constructs to perform tasks (Figure 2). They do so by calling service APIs (e.g. JobSearch), interfacing with LLMs (e.g. OpenAI), running predictive models (e.g. MatchPredict), etc. As part of agent specification, inclusion and exclusion rules dictate when agent execution gets triggered. To improve utilization and concurrency, each agent has a group of workers, which process input data through a ‘processor’ function, defined by the agent.
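The agent construct described above can be illustrated with a minimal sketch. It assumes that inclusion and exclusion rules are regular expressions matched against a stream tag and that exclusion takes precedence; the class name `Agent`, its fields, and the stand-in processor are hypothetical and not taken from the blueprint itself.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Agent:
    """Hypothetical agent: rules decide triggering, workers run the processor."""
    name: str
    processor: Callable[[Any], Any]  # per-message compute, defined by the agent
    include: List[str] = field(default_factory=lambda: [".*"])  # regexes on stream tags
    exclude: List[str] = field(default_factory=list)
    num_workers: int = 2

    def is_triggered(self, tag: str) -> bool:
        # Assumption of this sketch: exclusion rules win over inclusion rules.
        if any(re.fullmatch(p, tag) for p in self.exclude):
            return False
        return any(re.fullmatch(p, tag) for p in self.include)

    def run(self, tag: str, messages: List[Any]) -> List[Any]:
        if not self.is_triggered(tag):
            return []
        # A group of workers processes messages concurrently;
        # Executor.map returns results in input order.
        with ThreadPoolExecutor(max_workers=self.num_workers) as pool:
            return list(pool.map(self.processor, messages))


job_search = Agent(
    name="JobSearch",
    processor=lambda query: f"results for {query!r}",  # stand-in for a service API call
    include=[r"user\..*"],
    exclude=[r"user\.debug"],
)
outputs = job_search.run("user.text", ["data scientist", "ML engineer"])
skipped = job_search.run("user.debug", ["ignored"])
```

Here the thread pool plays the role of the agent's worker group; a production system would more likely use separate processes or services.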

Figure 2. Agents: Triggered by data/instruction messages from multiple incoming streams, agents process them and produce output data and instructions to multiple output streams.

Agent Registry. An agent registry stores and organizes metadata (e.g., agent descriptions, inputs, outputs and specifications) of agents, and supports search and retrieve functions. Existing APIs, tools and models in the infrastructure can be defined as agents, with their descriptive metadata in the registry.
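As a rough illustration of the registry's store, search, and retrieve functions, the sketch below keeps agent metadata in memory and ranks search results by keyword overlap with the agent description. The names `AgentRecord` and `AgentRegistry` and the ranking scheme are assumptions made for this example; a real registry might use embeddings or a database index instead.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AgentRecord:
    """Hypothetical metadata entry for one agent."""
    name: str
    description: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


class AgentRegistry:
    def __init__(self) -> None:
        self._records: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._records[record.name] = record

    def retrieve(self, name: str) -> AgentRecord:
        return self._records[name]

    def search(self, query: str) -> List[str]:
        # Keyword overlap is a simple stand-in for a real relevance model.
        terms = set(query.lower().split())
        scored = []
        for rec in self._records.values():
            score = len(terms & set(rec.description.lower().split()))
            if score:
                scored.append((score, rec.name))
        # Highest score first; ties broken alphabetically by agent name.
        return [name for _, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


registry = AgentRegistry()
registry.register(AgentRecord(
    "JobSearch", "search open job postings by title and location",
    inputs=("query",), outputs=("postings",)))
registry.register(AgentRecord(
    "MatchPredict", "predict match score between a candidate and a job",
    inputs=("candidate", "job"), outputs=("score",)))

hits = registry.search("job match")
```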

Data Registry. Analogous to an agent registry, a data registry stores metadata about data in the enterprise, and as such is a key touch point to the existing data infrastructure. The data registry aids the search and discovery of multi-modal enterprise data stored in data lakes and warehouses at various levels of granularity (e.g., raw data, summaries, and metadata such as schema). For instance, a lake-house architecture can enable agents to operate over both data lakes (e.g., models) and data warehouses (e.g., OLAP).

2.2. Orchestration: Streams and Sessions

Streams. A ‘stream’ is the central ‘orchestration’ concept in the blueprint architecture. A stream is essentially a sequence of messages, e.g., data or instructions, that can be dynamically produced, monitored, and consumed. Streams serve as the universal communication facilitators. For instance, a user typing text in a chat can be modeled as a stream, with each word as an individual message. Similarly, an LLM agent generating content can be another stream. Streams can contain data and instruction messages and can include data of various types, e.g., int, str, json.
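One way to picture a stream is as an append-only message log with subscribers. The sketch below is an in-memory illustration under that assumption; the `Message` and `Stream` classes and their method names are hypothetical, and a deployed system would typically back streams with a message broker.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, FrozenSet, List


@dataclass(frozen=True)
class Message:
    kind: str     # "data" or "instruction"
    payload: Any  # e.g. an int, a str, or a JSON-like dict


@dataclass
class Stream:
    """Hypothetical stream: an append-only sequence of messages."""
    name: str
    tags: FrozenSet[str] = frozenset()
    messages: List[Message] = field(default_factory=list)
    listeners: List[Callable[[Message], None]] = field(default_factory=list)

    def subscribe(self, listener: Callable[[Message], None]) -> None:
        # Monitoring: the listener is called for every message published later.
        self.listeners.append(listener)

    def publish(self, message: Message) -> None:
        # Producing: append to the log, then notify listeners.
        self.messages.append(message)
        for listener in self.listeners:
            listener(message)

    def read(self, offset: int = 0) -> List[Message]:
        # Consuming: replay messages from a given position.
        return self.messages[offset:]


# A user typing in a chat, modeled with each word as an individual message.
user_stream = Stream(name="USER:text", tags=frozenset({"user_text"}))
seen: List[str] = []
user_stream.subscribe(lambda m: seen.append(m.payload))
for word in "find data jobs".split():
    user_stream.publish(Message(kind="data", payload=word))
```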

Session. ‘Session’ is the key ‘context’ concept that defines the scope of work, where agents join and accomplish the overall task.

Streams and sessions together facilitate an event-driven orchestration as shown in Figure 3. A user agent initializes a session by creating an initial stream. Other agents are then added to the session, either by the user or by default as part of the session configuration, to coordinate a response to the initial user input. Each agent announces when it joins and leaves the session in the session stream. Agents have the capability to behave autonomously by listening to a stream. If they decide to listen, they process data in the input stream and generate data (or instructions) into new output streams within the session. Streams and messages are tagged to enable other agents to selectively consume them. Alternatively, agents can be invoked by centralized planners discussed next.
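The event-driven flow described above can be sketched as follows. This is a synchronous, in-memory approximation: the `Session` class, the event names, and the convention that an agent's output stream is tagged with the agent's lowercased name are all assumptions of the example, and streams are reduced to plain lists of strings.

```python
from typing import Callable, Dict, List, Set, Tuple


class Session:
    """Hypothetical session: scopes streams and announces events."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str]] = []   # stands in for the session stream
        self.streams: Dict[str, List[str]] = {}   # stream name -> messages
        self.agents: Dict[str, Tuple[Set[str], Callable[[str], str]]] = {}

    def join(self, agent: str, listens_to: Set[str],
             processor: Callable[[str], str]) -> None:
        self.agents[agent] = (listens_to, processor)
        self.events.append(("JOIN", agent))

    def leave(self, agent: str) -> None:
        del self.agents[agent]
        self.events.append(("LEAVE", agent))

    def create_stream(self, name: str, tags: Set[str], messages: List[str]) -> None:
        self.streams[name] = messages
        self.events.append(("STREAM", name))
        # Agents whose listen tags overlap the stream's tags consume it and
        # emit a new output stream, which may trigger further agents.
        for agent, (listens_to, processor) in list(self.agents.items()):
            if tags & listens_to:
                output = [processor(m) for m in messages]
                self.create_stream(f"{agent}:{name}", {agent.lower()}, output)


session = Session()
session.join("USER", set(), lambda m: m)
session.join("Extractor", {"user_text"}, str.upper)
session.join("Summarizer", {"extractor"}, lambda m: m[:3])
session.create_stream("USER:input", {"user_text"}, ["hello world"])
session.leave("Summarizer")
```

In this run the user stream triggers the extractor, whose tagged output in turn triggers the summarizer, without any central controller. Note that an agent listening to its own output tag would loop indefinitely in this simplified version.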

Figure 3. Orchestration of Agents: As agents join a session and generate output streams, these events are broadcast in session streams. Other agents may choose to respond to a stream and initiate a worker to process data in the stream, interacting with external services and databases. Computation occurs in various layers for optimal utilization.

2.3. Utilization: Planners and Coordinators

While the architecture enables agents to accomplish tasks in a decentralized manner by utilizing stream and/or message tags, we introduce task and data planners to optimize the execution of tasks and data operations according to production constraints.

Task Planner and Coordinator. A task planner, modeled as an agent, listens to the initial user agent stream and generates a task plan in the form of a directed acyclic graph (DAG), utilizing metadata from the agent registry to identify the appropriate agents. The task planner can be interactive and adaptive, learning from user or agent feedback in the current and previous sessions. The output of the task planner is also a stream containing the DAG. A task coordinator takes the DAG plan and invokes the corresponding agents by issuing instruction messages with input parameters into its own output stream (which agents in the session listen to). Coordinators closely monitor and guide the execution according to the plan DAG with constraints: collecting agent output and passing it on to the following agent upon task completion (as instruction messages); intervening upon constraint violation (e.g., timeout or low-quality result); and invoking the task planner to replan when necessary.
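A minimal sketch of the coordinator's role, assuming the plan arrives as a mapping from each task to the tasks it depends on and that agents are plain callables. The function `coordinate`, the `ReplanRequired` exception, and the `is_acceptable` check are hypothetical names; the check stands in for constraints such as timeouts or quality thresholds, and direct function calls stand in for instruction messages on a stream.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Set

Plan = Dict[str, Set[str]]  # task -> set of tasks it depends on


class ReplanRequired(Exception):
    """Signals that the task planner should be invoked again."""


def coordinate(plan: Plan,
               agents: Dict[str, Callable[[Dict[str, Any]], Any]],
               user_input: Any,
               is_acceptable: Callable[[str, Any], bool]) -> Dict[str, Any]:
    results: Dict[str, Any] = {}
    # static_order yields each task only after all of its dependencies.
    for task in TopologicalSorter(plan).static_order():
        upstream = {dep: results[dep] for dep in plan.get(task, set())}
        # Root tasks receive the user input; others receive upstream outputs.
        inputs = upstream or {"user": user_input}
        output = agents[task](inputs)
        if not is_acceptable(task, output):
            # Intervene on a constraint violation and ask for a new plan.
            raise ReplanRequired(task)
        results[task] = output
    return results


plan: Plan = {"profile": set(), "search": {"profile"}, "rank": {"search"}}
agents = {
    "profile": lambda inp: inp["user"].lower(),
    "search": lambda inp: [f"{inp['profile']} job {i}" for i in range(2)],
    "rank": lambda inp: sorted(inp["search"], reverse=True)[0],
}
results = coordinate(plan, agents, "Data Scientist", lambda task, out: bool(out))
```

Treating an empty output as a violation is only an illustrative rule; in practice the constraint check would compare latency, cost, or quality scores against targets attached to the plan.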

Data Planner. The data planner helps optimize the data operations within specified constraints on cost, performance and/or quality. It decomposes a complex data retrieval task into sub-tasks (e.g. discover, query, extract, summarize, join, compare). For each sub-task, it utilizes metadata from the data registry to determine the most efficient and effective way to accomplish the task.
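To make the data planner's selection step concrete, the sketch below picks, for each sub-task, the cheapest data source whose quality meets a threshold. The `SourceOption` record, the source names, and all cost and quality figures are invented for illustration; the blueprint does not prescribe this particular selection rule.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SourceOption:
    """Hypothetical data registry entry describing one way to serve a sub-task."""
    source: str
    operation: str  # sub-task this source can serve, e.g. "query"
    cost: float     # illustrative relative cost
    quality: float  # illustrative quality estimate in [0, 1]


def plan_data_ops(subtasks: List[str],
                  registry: List[SourceOption],
                  min_quality: float) -> Dict[str, str]:
    plan: Dict[str, str] = {}
    for op in subtasks:
        candidates = [o for o in registry
                      if o.operation == op and o.quality >= min_quality]
        if not candidates:
            raise LookupError(f"no source satisfies constraints for {op!r}")
        # Among sources meeting the quality constraint, choose the cheapest.
        plan[op] = min(candidates, key=lambda o: o.cost).source
    return plan


data_registry = [
    SourceOption("warehouse.jobs_table", "query", cost=1.0, quality=0.95),
    SourceOption("lake.raw_postings", "query", cost=0.2, quality=0.70),
    SourceOption("llm.summarizer", "summarize", cost=5.0, quality=0.90),
    SourceOption("cache.summaries", "summarize", cost=0.1, quality=0.85),
]
relaxed = plan_data_ops(["query", "summarize"], data_registry, min_quality=0.8)
strict = plan_data_ops(["query", "summarize"], data_registry, min_quality=0.9)
```

Raising the quality threshold shifts the summarize step from the cached summaries to the more expensive summarizer, which is the kind of cost and quality trade-off the planner is meant to manage.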

3. Conclusion

We echo the sentiment of Zaharia et al. (2024) that a systems approach offers a viable path to develop reliable, effective and usable AI applications. While best practices for developing AI systems remain an open problem, we believe our proposed blueprint architecture will aid design and experimentation and encourage interdisciplinary research in AI, NLP, Databases, Systems, and HCI.

References

  • Bottaro and Ramgopal (2024) Juan Pablo Bottaro and Karthik Ramgopal. 2024. Musings on building a Generative AI product. https://www.linkedin.com/blog/engineering/generative-ai/musings-on-building-a-generative-ai-product/.
  • Izacard et al. (2022) Gautier Izacard et al. 2022. Atlas: Few-shot learning with retrieval augmented language models. arXiv:2208.03299 (2022).
  • Khattab et al. (2023) Omar Khattab et al. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714 (2023).
  • Li et al. (2022) Yujia Li et al. 2022. Competition-level code generation with AlphaCode. Science (2022).
  • Liu (2022) Jerry Liu. 2022. LlamaIndex. https://doi.org/10.5281/zenodo.1234
  • OpenAI (2024) OpenAI. 2024. Introducing improvements to the fine-tuning API and expanding our custom models program. https://openai.com/index/introducing-improvements-to-the-fine-tuning-api-and-expanding-our-custom-models-program/.
  • Petroni et al. (2021) Fabio Petroni et al. 2021. KILT: a Benchmark for Knowledge Intensive Language Tasks. In NAACL. ACL, Online, 2523–2544.
  • Schick et al. (2024) Timo Schick et al. 2024. Toolformer: Language models can teach themselves to use tools. NeurIPS (2024).
  • Wu et al. (2023) Qingyun Wu et al. 2023. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv:2308.08155 (2023).
  • Xi et al. (2023) Zhiheng Xi et al. 2023. The rise and potential of large language model based agents: A survey. arXiv:2309.07864 (2023).
  • Yang et al. (2024) Jingfeng Yang et al. 2024. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. TKDD (2024).
  • Zaharia et al. (2024) Matei Zaharia et al. 2024. The Shift from Models to Compound AI Systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/.
  • Zhang et al. (2023) Zhuosheng Zhang et al. 2023. Multimodal chain-of-thought reasoning in language models. arXiv:2302.00923 (2023).