The Rise of Intelligent Infrastructure for LLM Applications
Salman Paracha
Co-Founder/CEO
March 13, 2025

The rise of LLMs isn’t just another tech trend — it’s a seismic shift in computing. Much like the early internet or the advent of cloud computing, LLMs are redefining how applications are built, how users interact with software, and what developers expect from infrastructure. As someone who’s built on the internet and helped shape the cloud at AWS and Oracle, I believe this shift will dramatically improve the lives of billions.

But here’s the catch: traditional application patterns won’t cut it anymore. Users will expect work to get done via prompts — and they'll expect you to get the nuanced details right. In other words, point-and-click apps are dead; conversational experiences are taking their place. Every day, as users interact with ChatGPT and similar apps, they are trained on what a high-quality experience feels like and what it means to get work done with LLMs. That behavioral shift becomes table stakes for every application. Developers now face a new challenge: how do they build high-quality LLM apps quickly and reliably?

That’s where intelligent infrastructure comes in. By handling the pesky heavy lifting of processing prompts outside the application layer, developers can focus on what matters most: higher-level objectives, and move faster. Let’s dig into what that heavy lifting actually is and why we need new building blocks for AI applications.

Why do we need new primitives (building blocks) for LLM apps? 

There is a lot to unpack in the term “intelligent primitives”, especially given how frequently terms like platforms and frameworks (e.g. LangChain, LlamaIndex and Stack AI) are thrown around these days. New frameworks are being launched roughly as fast as new LLMs and model updates are being released. It's hard to keep up.

And I’ll admit that these new AI frameworks are helpful in many ways, but they have serious downsides: they try to encapsulate and abstract patterns that are still rapidly evolving, often forcing developers to build and maintain everything in the application layer. The resulting applications are harder to debug given the black-box nature of these abstractions, slower to build because you have to cobble everything together yourself in code, from guardrails to routing to observability, and often slower and more expensive at runtime because several LLM calls may be made under the covers. This is why many developers are moving away from such frameworks in favor of simpler techniques.

Drawing from my decades of experience building infrastructure software, I asked myself, "what won't change in AI?" The answer is outcomes. For example, developers will always prioritize fast response times for common tasks (low latency and quick time-to-first-token), will want to rapidly incorporate new LLMs to improve the user experience, will need to quickly build guardrails for safe interactions, and will need rich observability for debugging. By focusing on outcomes, we can start separating the application or business-logic layer from the crufty, pesky heavy lifting of building LLM apps. This is where intelligent primitives come in: building blocks designed for this new workload that empower developers to innovate faster and more reliably by pushing the critical but undifferentiated parts of the development process outside the application layer.

For example, no one wants to store vector embeddings of documents in flat files. Vector databases like Pinecone, Qdrant and Weaviate emerged as new primitives in this space to help developers bring context to the LLM and to personalize the search and retrieval experiences for their applications. Vector databases are purpose-built storage primitives to help developers unlock value from their data and knowledge via LLMs and move faster by focusing on higher level objectives. Vector DBs are just one of the critical primitives needed to build and run an LLM application. What are some other primitives that developers need to move faster? 
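Before answering that, here is a minimal sketch of what the vector-database primitive looks like in practice, using the qdrant-client Python package. The collection name, toy embeddings and documents below are placeholders; in a real application the vectors come from an embedding model.

```python
# A minimal retrieval sketch with Qdrant (in-memory mode for illustration).
# Collection name, vector size and the toy embeddings are placeholders;
# in practice the vectors come from an embedding model.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")  # swap for a real Qdrant URL in production

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

# Store document chunks alongside their embeddings.
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"text": "refund policy ..."}),
        PointStruct(id=2, vector=[0.4, 0.3, 0.2, 0.1], payload={"text": "shipping times ..."}),
    ],
)

# At query time: embed the user prompt, fetch the nearest chunks,
# and pass their text to the LLM as context.
hits = client.search(collection_name="docs", query_vector=[0.1, 0.2, 0.3, 0.4], limit=1)
context = [hit.payload["text"] for hit in hits]
```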

In compute, open source projects like vLLM, Ollama and Llama.cpp offer specialized primitives to run LLMs. We are seeing LLM-enriched caches that aim to offer a specialized memory buffer to improve retrieval performance and latency and lower cost. But what about primitives at the transport and communications layer? What capabilities would unlock value for developers so that they can focus more on higher-level objectives? It so happens that while history doesn’t repeat itself, it sure does rhyme.

Say hello to Arch Gateway. The third pillar 🏛 of the intelligent AI stack

I’ve talked to hundreds of developers about the walls they hit in structuring, building and deploying apps with LLMs. Past the thrill of a quick demo using the OpenAI SDK and some prompts, the list gets long very fast. Every developer struggles to incorporate new LLMs into their application; this is partly a matter of evaluating new models and partly of having a battle-tested access layer to local and third-party LLMs. And the single long, messy prompt that once handled everything now needs to be broken into smaller task units, because optimizing for one kind of input can hurt performance on other inputs, and developers need to route to different endpoints (or agents) based on user input.

Going beyond nascent demos is hard for developers.

They also want responses to be fast for common scenarios, but are left to their own devices to parse intent and critical information from user queries and route to smaller, faster LLMs. Last, but not least, developers must build usage governance (keep the user topical, observe traffic) and prevent harmful outcomes via guardrails. They can choose to build, integrate and scale ALL these capabilities themselves, or push this pesky heavy lifting to Arch: an open-source intelligent proxy inspired by NGINX, designed exclusively to handle incoming and outgoing prompts and built on the battle-tested Envoy proxy.

Arch was built with the belief that prompts are nuanced and opaque user requests that require the same capabilities as traditional HTTP requests, including secure handling, intelligent routing and handoff, robust observability, and seamless integration with backends (tools) to build fast agentic scenarios – all outside application code, so that you can focus on higher-level objectives and move faster.
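In practice, that can be as simple as pointing an existing OpenAI-compatible client at the gateway instead of at a provider directly. A rough sketch follows; the gateway address and model name are placeholders, and the actual endpoints and configuration are described in the Arch docs.

```python
# Sketch: the application keeps one OpenAI-compatible client, but points it at
# a locally running Arch gateway instead of a specific provider. The base_url
# and model name are placeholders; the gateway's configuration decides how
# prompts are guarded, routed and observed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12000/v1", api_key="unused-locally")

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # or any model the gateway is configured to reach
    messages=[{"role": "user", "content": "Draft a PRD outline for a billing agent."}],
)
print(resp.choices[0].message.content)
```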

I don’t want to use this blog to describe a laundry list of features, but I'll cover some fundamental concepts that might help you understand how we transparently operate at the communication and transport layers to handle and manage prompts, and why Arch is fundamentally a new building block designed for AI workloads. Remember how I mentioned that history doesn’t repeat itself but rhymes? Something similar is going on here with Arch.

In the early days of the internet, the load balancer quickly became a critical part of the stack for protecting and scaling application workloads. It offered SSL termination for security, steered traffic to less busy servers, and fundamentally improved the responsiveness and reliability of web applications. From low-level network traffic, we moved to higher-level APIs, where usage had to be monitored, access to APIs secured via keys, read and write operations separated to improve responsiveness, and so on. Over time microservices emerged, where Envoy now sits comfortably as the most widely deployed proxy managing the communications layer for microservices workloads. It so happens that Adil (Co-Founder) is deeply familiar with Envoy because he helped build and scale it at Lyft.

Now, we have a new abstraction and workload pattern: prompts and LLMs. Prompts are the highest-level representation of work from users, and LLMs operate on prompts. The challenges of handling, processing and securing traffic remain the same in shape, but not in implementation, because prompts are opaque, nuanced and non-deterministic user requests. This is why Arch is designed around purpose-built LLMs that deliver exceptional speed and efficiency for scenarios like agent routing and hand-off, input validation and task clarification, guardrails, and unified access and observability for ANY LLM. Arch is a force multiplier that helps you focus on high-level objectives, move faster and cross the GenAI chasm with confidence.

A high-level architectural representation of the ingress traffic that Arch handles.

A little peek into the integrated science work behind Arch

Building a proxy is non-trivial work. It's tempting to ask, “can’t I just write routing logic in the application layer by prompting an LLM?”, soon followed by “can’t I just add guardrails in my application layer?”, and the list goes on, until you realize you are building, integrating and maintaining infrastructure code for LLM applications instead of focusing on the business goals and higher-level objectives of your AI application. For example, you might be building the next best PRD agent, or a sales engineering agent, or whatever scenario you imagine. The focus should be on the UX, not on crufty infrastructure work.
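For contrast, here is roughly what that DIY path looks like: a hand-rolled intent router that prompts a general-purpose LLM on every request before the real work even starts. The model name, intent labels and downstream handling below are illustrative, not a prescribed pattern.

```python
# Sketch of DIY routing in the application layer: an extra LLM round trip per
# request just to decide where the prompt should go. Model name, intent labels
# and downstream handlers are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
INTENTS = ["create_prd", "sales_question", "off_topic"]

def route(user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Classify the request into one of {INTENTS}. Reply with the label only."},
            {"role": "user", "content": user_prompt},
        ],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in INTENTS else "off_topic"

# Every request now pays this extra latency and cost, and you still have to
# build guardrails, observability and model access on top of it yourself.
print(route("Help me draft a PRD for the new billing flow."))
```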

More importantly, building a specialized piece of software means we can design for speed, efficiency and robustness from the ground up. We are taking meticulous care to solve hard problems in ways that offer developers outsized value. We’ve built world-class LLMs that punch above their weight class, offering exceptional accuracy and speed for scenarios like routing, function calling for common agentic scenarios, input validation and query clarification. These models are open source, much like the project, and are neatly integrated as a subsystem of Arch. Plus, you have the option to deploy the project and its models locally or in your VPC, or you can use our hosted versions for more predictable performance.

Arch-Function (SOTA function calling model) - designed for speed, efficiency and task accuracy
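To make the function-calling scenario concrete, here is a minimal sketch of the standard "tools" style request such a model handles, assuming it is served behind an OpenAI-compatible, tool-calling-capable endpoint. The base_url, model name and weather tool are hypothetical placeholders; see the project docs for how the models are actually served.

```python
# Sketch: a standard tools-style chat request, the kind of workload a function
# calling model is built to handle. The base_url, model name and the weather
# tool below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Arch-Function-3B",  # placeholder for however the model is served
    messages=[{"role": "user", "content": "What's the weather in Seattle?"}],
    tools=tools,
)
# The model responds with a structured tool call the application can execute.
print(resp.choices[0].message.tool_calls)
```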

The journey has just begun

I like to describe us as operating at the intersection of science and infrastructure, inventing the primitives needed to serve this new, emerging workload. And given the team’s deep pedigree in developer tools at AWS, machine learning at Microsoft and Meta, and internet-scale infrastructure services at Lyft, we hope to delight you with our projects and offerings so that you can move faster and do more with LLMs.

On the near-term roadmap, we are working on adaptive routing (sending prompts to the right LLM based on usage scenarios described by developers), lightweight orchestration to manage communication between agents, and a short-lived cache to improve the responsiveness of your applications and lower token costs. These are exciting feature updates and research efforts that will continue to fuel developer innovation.

Check out the project, drop us a star ⭐️ and give it a spin. We are actively building with developers in the community and routinely share updates in our Discord. We’d love to see you there and hear from you. Happy building with Arch Gateway!