I built an AI system that reads codebases and generates architecture diagrams. Here's what I learned about the technology, the challenges, and why it's not magic.

AI-Powered Architecture Discovery: How It Works

Last year, I inherited a codebase with zero documentation. 200,000 lines of code across 15 services, written by a team that had since moved on. My task was to understand it well enough to add a new payment integration. The company estimated it would take 2-3 weeks just to map out the architecture.

That experience is why I built Archyl's AI discovery feature. And now, after spending months working with large language models to analyze code, I want to share what actually works, what doesn't, and why AI-powered architecture discovery is powerful but not magical.

The Problem With Manual Discovery

Before diving into AI, let's acknowledge why we need this in the first place.

When I faced that undocumented codebase, here's what my discovery process looked like:

Week 1: Grep through the code looking for keywords. Find the main entry points. Draw some boxes on a whiteboard. Realize I misunderstood the service boundaries. Erase and redraw.

Week 2: Interview the one engineer who's been around long enough to remember some history. Half of what he tells me contradicts what I found in the code. Turns out things changed but nobody updated his mental model.

Week 3: Finally feel like I understand the system well enough to make changes. Create documentation that I swear I'll keep updated. (Spoiler: I didn't.)

This process is slow, error-prone, and doesn't scale. Every new team member goes through the same painful discovery. The documentation drifts out of date within months.

How AI Discovery Actually Works

When you connect a repository to Archyl and run discovery, here's what happens under the hood:

Step 1: Repository Scanning

First, we build a map of your codebase. This isn't AI yet — it's straightforward file system traversal:

  • List all files and directories
  • Identify configuration files (package.json, go.mod, docker-compose.yml)
  • Find entry points (main functions, index files, API routes)
  • Build a dependency graph from imports

This gives us the skeleton. We know what files exist and how they reference each other. But we don't yet understand what they do.
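The scanning pass above can be sketched in a few lines. This is an illustrative version, not Archyl's actual implementation — the function name, the config-file list, and the entry-point heuristic are all assumptions for the example:

```python
# Sketch of the non-AI scanning pass: walk a repo, note config files
# and likely entry points, and build a naive import map from Python
# files. A real scanner would handle many languages and import styles.
import os
import re

CONFIG_FILES = {"package.json", "go.mod", "docker-compose.yml", "Dockerfile"}
ENTRY_HINTS = ("main.", "index.")

def scan_repo(root):
    files, configs, entries, imports = [], [], [], {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.relpath(os.path.join(dirpath, name), root)
            files.append(path)
            if name in CONFIG_FILES:
                configs.append(path)
            if name.startswith(ENTRY_HINTS):
                entries.append(path)
            if name.endswith(".py"):
                with open(os.path.join(root, path)) as f:
                    imports[path] = re.findall(
                        r"^\s*(?:from|import)\s+([\w.]+)", f.read(), re.M)
    return {"files": files, "configs": configs,
            "entries": entries, "imports": imports}
```

The output of this pass feeds everything downstream: the config files hint at the tech stack, the entry points seed the analysis, and the import map becomes the dependency graph.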

Step 2: Chunked Analysis

Here's where LLMs come in, and also where it gets tricky.

Modern language models have context limits. GPT-4 can handle about 128K tokens; Claude can handle 200K. That sounds like a lot, but a medium-sized codebase easily exceeds it. So we can't just dump the entire codebase into a prompt and ask "what is this?"

Instead, we chunk the codebase into digestible pieces:

  1. Group files by directory or module
  2. Send each chunk to the LLM with context about its location in the project
  3. Ask the model to identify: What is this code responsible for? What patterns does it use? What external systems does it interact with?

The responses come back as structured data — JSON describing systems, containers, and components with their relationships.
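A minimal sketch of that chunking and prompting step might look like this. The grouping rule, the prompt wording, and the JSON field names are assumptions for illustration — the actual LLM call is left out, since it depends on which provider's client you use:

```python
# Hypothetical sketch of the chunking step: group files by their
# directory, then build one analysis prompt per chunk that tells the
# model where the code sits in the project and what to report back.
from collections import defaultdict

def chunk_by_module(file_paths):
    chunks = defaultdict(list)
    for path in file_paths:
        # One chunk per directory keeps related code together.
        module = "/".join(path.split("/")[:-1]) or "(root)"
        chunks[module].append(path)
    return dict(chunks)

def build_prompt(chunk_name, sources):
    header = (f"You are analyzing the '{chunk_name}' module of a larger "
              "project. Identify: what this code is responsible for, "
              "what patterns it uses, and what external systems it "
              "interacts with. Reply as JSON.")
    body = "\n\n".join(f"--- {path} ---\n{code}" for path, code in sources)
    return header + "\n\n" + body
```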

Step 3: Aggregation and Reconciliation

This is the hardest part, and where I spent most of my development time.

Each chunk analysis gives us a partial view. The user service chunk knows about the user database. The payment chunk knows about Stripe. But neither knows the full picture.

We need to reconcile these partial views:

  • Merge duplicate entities (is "UserDB" the same as "users_database"?)
  • Infer relationships between chunks (the order service calls the user service, but they were analyzed separately)
  • Resolve conflicts (one chunk says we use PostgreSQL, another says MySQL — which is right?)

This reconciliation uses another round of LLM analysis, plus heuristics based on common patterns. It's imperfect. Sometimes the AI gets it wrong. That's why discovery produces suggestions that humans review, not final documentation.
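To make the merge step concrete, here is a toy version of the name-normalization heuristic. It is deliberately crude (lowercase, strip separators and common suffixes, drop a trailing plural "s") and only stands in for the heuristic layer — the real pipeline layers an LLM pass on top of rules like these:

```python
# Minimal duplicate-merge heuristic: normalize entity names and merge
# records that collapse to the same key. "UserDB" and "users_database"
# both normalize to "user" and get merged.
import re

def normalize(name):
    n = re.sub(r"[\s_\-]+", "", name.lower())
    for suffix in ("database", "db", "service", "svc"):
        if n.endswith(suffix):
            n = n[: -len(suffix)]
            break
    return n.rstrip("s")  # crude plural handling

def merge_entities(entities):
    merged = {}
    for ent in entities:
        key = normalize(ent["name"])
        if key in merged:
            merged[key]["aliases"].append(ent["name"])
        else:
            merged[key] = {"name": ent["name"], "aliases": []}
    return list(merged.values())
```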

Step 4: C4 Model Generation

Finally, we map the discovered entities to C4 model elements:

  • External systems (third-party APIs, databases we don't manage)
  • Containers (our deployable units)
  • Components (major modules within containers)
  • Relationships (who calls whom, what data flows where)

The result is a set of draft C4 diagrams that capture the AI's understanding of your architecture.
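As a sketch, the mapping step can be modeled with a few plain data classes. The field names and the `external` flag are assumptions for the example, not Archyl's real schema:

```python
# Illustrative data model for the final step: discovered entities
# become C4 elements, discovered calls become relationships.
from dataclasses import dataclass, field

@dataclass
class Element:
    name: str
    kind: str          # "external_system" | "container" | "component"
    technology: str = ""

@dataclass
class Relationship:
    source: str
    target: str
    description: str

@dataclass
class C4Model:
    elements: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

def to_c4(discovered):
    model = C4Model()
    for ent in discovered["entities"]:
        kind = "external_system" if ent.get("external") else "container"
        model.elements.append(Element(ent["name"], kind, ent.get("tech", "")))
    for src, dst, desc in discovered["relations"]:
        model.relationships.append(Relationship(src, dst, desc))
    return model
```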

What the AI Gets Right

After running discovery on dozens of codebases during development, here's what impressed me:

Technology Stack Detection

LLMs are remarkably good at identifying what technologies a project uses. They recognize framework patterns, library idioms, and configuration file formats. When GPT sees a @Controller annotation, it knows you're using Spring. When it sees fiber.New(), it knows you're using Go Fiber.

Service Boundary Detection

In microservice architectures, the AI reliably identifies service boundaries. It understands that code in /services/user/ is probably a separate service from /services/order/. It recognizes Docker Compose files as indicators of service topology.

Common Pattern Recognition

The AI has seen millions of codebases in its training data. It recognizes repository patterns, MVC structures, event-driven architectures, and API gateway setups. When your code follows common patterns, the AI identifies them quickly.

External Integration Discovery

Every API key constant, webhook URL, or SDK import is a clue about external integrations. The AI catches most of these, building a picture of what third-party services your system depends on.

What the AI Gets Wrong

Here's where I had to set realistic expectations:

Custom Domain Logic

The AI doesn't understand your business domain. It can tell that you have a processOrder function, but it doesn't know what "processing an order" means in your specific business context. It might misidentify the purpose of domain-specific components.

Unusual Architectures

If your architecture doesn't follow common patterns, the AI struggles. A custom plugin system, an unconventional folder structure, or a homegrown framework will confuse it. The AI expects Rails apps to look like Rails apps.

Hidden Dependencies

Not all dependencies are explicit in code. Maybe your service requires a specific version of Redis that only exists in production. Maybe there's a sidecar container that the AI never sees. Runtime dependencies are often invisible to static analysis.

Stale Code Paths

The AI doesn't know which code is actively used versus which is legacy cruft nobody's touched in years. It might prominently feature a deprecated service that's still in the codebase but no longer deployed.

Making AI Discovery Work Better

Through trial and error, I've found ways to improve discovery accuracy:

Provide Context

Before running discovery, tell the AI about your system. "This is an e-commerce platform with payment processing" gives the model a frame of reference. Without context, it's guessing blind.

Start with Structure

If you have any existing documentation — even a README with a rough architecture sketch — provide it. The AI uses this as a prior to guide its analysis.

Review Incrementally

Don't run discovery on your entire codebase at once. Start with one service. Review and correct the results. Then expand to the next service. Corrections you make inform future analysis.

Trust But Verify

Treat AI suggestions as a starting point, not the final answer. The AI might be 80% accurate. You need to verify the other 20%. Click into the source code links, confirm the relationships make sense, and correct mistakes.

The Technical Details

For those curious about the implementation:

Chunking Strategy

We use semantic chunking rather than fixed-size chunks. A chunk is typically one module, one service, or one directory tree. This keeps related code together, which improves the AI's understanding.
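When a single directory is still too big for the context window, it has to be split. A rough sketch of that budgeting logic, using the common ~4-characters-per-token rule of thumb (the budget number and function names are assumptions, not Archyl's actual values):

```python
# Keep semantic chunks within a model's context window: estimate
# tokens cheaply and split a directory's files into sub-chunks only
# when the budget would be exceeded.
def approx_tokens(text):
    return len(text) // 4  # rough heuristic, not a real tokenizer

def fit_to_budget(files, budget=100_000):
    """files: list of (path, source) pairs. Yields lists that fit."""
    current, used = [], 0
    for path, source in files:
        cost = approx_tokens(source)
        if current and used + cost > budget:
            yield current
            current, used = [], 0
        current.append((path, source))
        used += cost
    if current:
        yield current
```

A real tokenizer (tiktoken, for instance) gives exact counts, but for deciding chunk boundaries an estimate is usually close enough.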

Prompt Engineering

The prompts evolved significantly. Early versions produced verbose, narrative descriptions. Current prompts demand structured output with specific fields. We use few-shot examples to demonstrate the expected format.
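The shift from narrative to structured output mostly comes down to showing the model exactly what you want back. A simplified version of that prompt style, with illustrative field names:

```python
# Structured-output prompt sketch: demand specific JSON fields and
# include a few-shot example of the expected shape.
import json

FEW_SHOT = {
    "responsibility": "Handles user registration and login",
    "patterns": ["repository", "JWT auth"],
    "external_systems": ["PostgreSQL", "SendGrid"],
}

def analysis_prompt(chunk_name, code):
    return "\n".join([
        f"Analyze the '{chunk_name}' module below.",
        "Respond ONLY with JSON matching this example's shape:",
        json.dumps(FEW_SHOT, indent=2),
        "Do not add narrative text outside the JSON.",
        "--- CODE ---",
        code,
    ])
```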

Concurrency

Large codebases have thousands of files. Processing sequentially would take forever. We analyze chunks in parallel, with configurable concurrency limits to avoid API rate limits.
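A bounded-parallelism version of that loop is straightforward with a thread pool, where the worker count doubles as the rate-limit cap. `analyze_chunk` here is a stand-in for the real LLM call:

```python
# Analyze chunks concurrently, but cap in-flight requests so we stay
# under the provider's API rate limits.
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(name):
    # Placeholder: a real version would call the model's API here.
    return {"chunk": name, "status": "analyzed"}

def analyze_all(chunk_names, max_in_flight=4):
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() preserves input order, which keeps results predictable.
        return list(pool.map(analyze_chunk, chunk_names))
```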

Model Selection

Different models have different strengths. GPT-4 produces more accurate analysis but costs more. Claude is better at following structured output requirements. We support both, plus local models via Ollama for teams who can't send code to external APIs.
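Supporting multiple backends ends up looking like a small routing table. This is a hypothetical sketch of the idea, not Archyl's actual configuration format:

```python
# Hypothetical model-routing table: pick a backend per constraint.
# The trade-offs mirror the prose above; names are illustrative.
MODEL_PROFILES = {
    "accuracy":  {"provider": "openai",    "model": "gpt-4"},
    "structure": {"provider": "anthropic", "model": "claude"},
    "private":   {"provider": "ollama",    "model": "local"},
}

def pick_model(needs_privacy=False, prefers_structured=False):
    if needs_privacy:
        # Code never leaves the machine with a local model.
        return MODEL_PROFILES["private"]
    return MODEL_PROFILES["structure" if prefers_structured else "accuracy"]
```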

The Future of AI-Powered Discovery

What we have today is version 1. Here's what I'm working toward:

Continuous Discovery

Instead of one-time analysis, monitor your codebase continuously. When code changes, update the relevant diagrams automatically. Detect architectural drift before it becomes a problem.

Deeper Understanding

Current analysis is mostly structural. Future versions could understand behavior: "This endpoint validates input, calls the payment service, then sends a confirmation email." Sequence diagrams generated from code.

Cross-Repository Analysis

Most organizations have multiple repositories. Discovery should understand how they connect — which services in repo A call services in repo B.

Confidence Scoring

Not all discoveries are equally certain. We're adding confidence scores so you know which suggestions to scrutinize more carefully.

Conclusion

AI-powered architecture discovery isn't magic. It's a tool that accelerates the tedious parts of understanding a codebase while still requiring human judgment for the nuanced parts.

When I run discovery on that 200K-line codebase today, I get a draft architecture diagram in 10 minutes instead of 3 weeks. It's not perfect — I still need to review and correct it. But it's a dramatically better starting point than a blank whiteboard.

If you're drowning in undocumented code, give AI discovery a try. Go in with realistic expectations: it won't understand your business domain, it might miss unusual patterns, and it definitely requires human review. But it'll get you 80% of the way there in a fraction of the time.


Want to learn more? Check out our introduction to the C4 model that AI discovery generates, or read about why architecture documentation matters in the first place.