Coordination Without Contracts: Toward Formally Grounded Agentic AI Systems

Authors

  • Sai Manoj Jayakannan

Keywords:

Multi-Agent Systems, Compositional Safety, Memory Architecture, Trust Propagation, Long-Horizon Evaluation

Abstract

Autonomous AI agents (systems that plan, invoke external tools, spawn sub-agents, and iterate toward long-horizon goals) are rapidly moving from research prototypes to production deployments. Yet the theoretical scaffolding needed to reason about agent behavior remains conspicuously thin. Unlike single-turn language models, which inherit decades of statistical learning theory and empirical benchmarking infrastructure, multi-agent LLM pipelines operate without formal contracts between participants, without verified memory semantics, and without evaluation protocols that reflect the temporal depth of real tasks. This paper argues that the central bottleneck in agentic AI research is not the capability of current frontier models, which are already impressive planners in isolation, but the absence of compositional safety guarantees that survive agent-to-agent delegation.

This work diagnoses four structural limits of the dominant paradigm: (1) context-window memory creates ephemeral, unverifiable state; (2) informal tool-calling interfaces lack precondition/postcondition semantics; (3) inter-agent trust is implicitly inherited rather than explicitly negotiated; and (4) existing benchmarks measure shallow reactive competence rather than long-horizon coherence under adversarial perturbation. Against this diagnosis, four technically detailed research directions are proposed: typed agent communication protocols with verifiable postconditions; hierarchical memory architectures grounded in external write-ahead logs; a trust-propagation algebra for multi-agent delegation chains; and a new benchmark family, Long Horizon Agent Bench (LHAB), designed to stress-test agents over multi-day, multi-session task horizons. The paper outlines proof-of-concept experiments feasible in 2026–2027 and closes with a 36-month research agenda for the community.
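To make limit (2) concrete, the following is a minimal sketch of a tool call guarded by an explicit precondition and postcondition. It is an illustration only, not the protocol proposed in the paper: `ToolContract`, `ContractViolation`, `TransferRequest`, and the `debit` tool are hypothetical names chosen for this example, and the checks are plain Python predicates rather than formally verified specifications.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

Req = TypeVar("Req")
Res = TypeVar("Res")


class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


@dataclass(frozen=True)
class ToolContract(Generic[Req, Res]):
    # Hypothetical wrapper (assumption): pairs a tool with the two
    # predicates that an informal tool-calling interface leaves implicit.
    name: str
    run: Callable[[Req], Res]
    precondition: Callable[[Req], bool]
    postcondition: Callable[[Req, Res], bool]

    def call(self, request: Req) -> Res:
        # Refuse to invoke the tool when the caller's request is invalid.
        if not self.precondition(request):
            raise ContractViolation(f"{self.name}: precondition failed")
        result = self.run(request)
        # Reject the result when the tool did not deliver what it promised.
        if not self.postcondition(request, result):
            raise ContractViolation(f"{self.name}: postcondition failed")
        return result


@dataclass(frozen=True)
class TransferRequest:
    balance: int
    amount: int


def debit(req: TransferRequest) -> int:
    """Toy tool: return the balance left after debiting `amount`."""
    return req.balance - req.amount


debit_tool = ToolContract(
    name="debit",
    run=debit,
    precondition=lambda r: 0 < r.amount <= r.balance,
    postcondition=lambda r, new_balance: (
        new_balance == r.balance - r.amount and new_balance >= 0
    ),
)

if __name__ == "__main__":
    print(debit_tool.call(TransferRequest(balance=100, amount=30)))
    try:
        debit_tool.call(TransferRequest(balance=10, amount=30))
    except ContractViolation as err:
        print("rejected:", err)
```

In this sketch a delegating agent can rely on the postcondition instead of re-inspecting the tool's output, which is the kind of guarantee an untyped tool-calling interface does not provide. Whether runtime checks of this kind are sufficient, or whether static verification is required, is left open here.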

Published

30.06.2026

How to Cite

Sai Manoj Jayakannan. (2026). Coordination Without Contracts: Toward Formally Grounded Agentic AI Systems. International Journal of Intelligent Systems and Applications in Engineering, 14(1s), 1712–1724. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/8410

Section

Research Article