AI-Assisted MLR Review — Case Study

The Problem

Manual review meant everyone had their own rulebook

At Eli Lilly, every piece of promotional and medical content — from web banners to HCP materials — goes through a Medical, Legal, and Regulatory (MLR) review before it can be published. For a company operating across 125 countries with thousands of branded assets, the stakes of getting that review wrong are significant: regulatory fines, product recalls, and reputational damage.

But the review process itself was deeply manual and inconsistent. There was no formal training or onboarding documentation for new content reviewers. Instead, new hires shadowed experienced reviewers for weeks before being left to develop their own approach, complete with personal cheat sheets, bookmark folders, and private SharePoint documents, plus rules of thumb that lived entirely in their own heads.

The Real Cost

Creatives called them "volleys" — submissions that bounced back with conflicting feedback depending on which reviewer picked them up. One reviewer might flag an issue another had never mentioned. The same content, reviewed twice, could get two different answers. Every volley meant rework, delay, and eroding trust between creative and review teams.

The opportunity was clear: if the review rules lived in code rather than people's heads, every submission would be evaluated against the same standards, every time. And when a rule changed, it would propagate instantly — no retraining, no interpretation drift, no lag.

My Role

The PM behind the product, within the constraints of the client

As a third-party consultant, I was barred by Lilly's governance policies from holding formal system ownership, which sat with a Lilly employee. In practice, however, I functioned as the full product lead: running discovery, rewriting the PRD from scratch after inheriting incomplete requirements, building the product roadmap, defining the release strategy, and managing the engineering backlog day to day.

My Lilly counterpart focused on timeline management, navigating internal bureaucracy, and stakeholder communications — the organizational interface a consultant can't fully occupy. The product thinking, the engineering partnership, and the decisions about what to build and in what order were mine.

AI as a PM Tool

I also used Microsoft Copilot agents to generate the PRD, Epics, and User Stories — reducing documentation overhead by 30% and freeing time for the work that actually required judgment. The product I was building used AI to eliminate manual review. It would have been inconsistent not to apply the same thinking to my own workflow.

The Approach

A deliberate Alpha — not a cautious one

Lilly's internal AI platform, Cortex, sat on top of multiple LLMs including GPT-4.5. On paper, it could do a lot. In practice, nobody on the project had shipped something on it before. Rather than commit to a full multi-agent architecture before understanding what the platform could actually do, I designed the Alpha as a technical probe: one agent, one brand, one file type, one user type.

Alpha

Probe & Learn

Single Legal agent
6 business rules (Kisunla brand)
PDF only, Legal reviewer
Validate Cortex capabilities
Stand up full infrastructure

Beta → Production

Scale & Harden

35 additional legal rules
Evaluation framework
95% accuracy threshold
Full regulatory documentation
Prod environment readiness

This constraint — a single agent reviewing a single brand — wasn't timidity. It was risk management. The Alpha would tell us what Cortex could and couldn't do, which would inform every architectural decision that followed. Without that signal, we'd be building on assumptions.

Delivery

Alpha shipped in 6 weeks — from a cold start

The team I inherited was distributed across Canada, Mexico, Costa Rica, and India. We hadn't worked together before. The business requirements I received were incomplete enough that I rewrote the PRD from scratch before the first sprint. And we were operating inside Lilly's enterprise governance framework, which required formal application registration, architecture solution reviews, cybersecurity reviews, and environment configuration before a single user could log in.

In six weeks, the team delivered: repositories and deployment pipelines, separate Dev and QA environments, application registration within Lilly's governance framework, a full web UI integrated into the CATS ecosystem, core logging and error handling, and the Legal agent processing six business rules on day one.

Prompt Engineering

Throughout development, I worked directly with the engineering team on prompt engineering and refinement — not just defining what the agent should check, but how it should reason about edge cases. I also sourced and created positive and negative test cases to feed the evaluation framework, building the quality foundation the agents would be measured against.
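To make the role of those test cases concrete, here is a minimal sketch of how labeled examples might be scored against an agent. Everything in it is an assumption for illustration: `LabeledCase`, `evaluate`, and `toy_agent` are hypothetical names, the keyword check merely stands in for the real LLM-backed Legal agent, and the sample sentences are invented rather than taken from Lilly's rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class LabeledCase:
    """One labeled example: content text plus the expected verdict."""
    text: str
    violates_rule: bool  # True if a reviewer would expect this content to be flagged


def evaluate(agent: Callable[[str], bool], cases: Iterable[LabeledCase]) -> Dict[str, float]:
    """Run the agent over every case and tally a confusion matrix plus accuracy."""
    tp = fp = tn = fn = 0
    for case in cases:
        flagged = agent(case.text)
        if flagged and case.violates_rule:
            tp += 1  # correctly flagged
        elif flagged and not case.violates_rule:
            fp += 1  # flagged compliant content
        elif not flagged and case.violates_rule:
            fn += 1  # missed a violation
        else:
            tn += 1  # correctly passed
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn, "accuracy": accuracy}


def toy_agent(text: str) -> bool:
    """Hypothetical stand-in for the agent: flags any text containing 'guaranteed'."""
    return "guaranteed" in text.lower()


# Invented sample cases covering all four outcomes.
cases = [
    LabeledCase("Results are guaranteed for every patient.", True),
    LabeledCase("Individual results may vary.", False),
    LabeledCase("A guaranteed way to learn more.", False),
    LabeledCase("Works for everyone, every time.", True),
]

report = evaluate(toy_agent, cases)
print(report)
```

Keeping false positives and false negatives separate, rather than reporting accuracy alone, is a design choice worth considering in a review setting, since a missed violation and an unnecessary flag carry very different costs.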

The Hard Part

Probabilistic AI meets a deterministic client

When my Lilly counterpart performed UAT on the Alpha release, she was not satisfied with the agent's accuracy. The complaints were understandable. What surfaced underneath them was a more fundamental misalignment that nobody had explicitly addressed: Lilly expected the AI to behave like traditional software — either right or wrong, with 100% accuracy as the baseline expectation.

Our team's model was different. For an Alpha release, we were targeting 65% accuracy as a reasonable floor for a first agent on a new platform. The path to Production ran through a 95% accuracy threshold — but you don't get there by waiting until everything is perfect before releasing. You get there by releasing, observing, and improving.
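The staged thresholds described above can be sketched as a simple release gate. This is only an illustration under stated assumptions: the 65% and 95% figures come from the text, while `ACCURACY_GATES`, `passes_gate`, and the sample counts are hypothetical and not the project's actual tooling.

```python
# Minimum accuracy per release stage, in whole percent (figures from the text).
ACCURACY_GATES = {"alpha": 65, "production": 95}


def passes_gate(stage: str, correct: int, total: int) -> bool:
    """Return True if correct/total meets the accuracy floor for the stage.

    Integer arithmetic is used so the comparison is exact at the boundary.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return correct * 100 >= ACCURACY_GATES[stage] * total


# Illustrative counts only: 68 correct findings out of 100 evaluated.
alpha_ok = passes_gate("alpha", 68, 100)
production_ok = passes_gate("production", 68, 100)
print(alpha_ok, production_ok)
```

The same result can be a pass for one stage and a fail for another, which is the point of treating accuracy as a stage-dependent gate rather than a single right-or-wrong verdict.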

The Core Tension

Errors from a probabilistic AI system are not software bugs waiting to be fixed. A Legal agent that's right 68% of the time in Alpha isn't broken; it's a starting point. The feedback loop between user accepts/rejects and prompt refinement is the improvement mechanism. This distinction between deterministic and probabilistic systems is one of the most common, and most consequential, gaps in enterprise AI adoption.
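One way the accept/reject feedback loop could drive prompt refinement is sketched below. It assumes, hypothetically, that each reviewer decision is logged as a `(rule_id, accepted)` pair; the function names, rule IDs, and the default target are illustrative and not drawn from the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Feedback = Tuple[str, bool]  # (rule_id, reviewer accepted the agent's finding)


def acceptance_by_rule(feedback: Iterable[Feedback]) -> Dict[str, float]:
    """Compute the share of agent findings that reviewers accepted, per rule."""
    counts = defaultdict(lambda: [0, 0])  # rule_id -> [accepted, total]
    for rule_id, accepted in feedback:
        counts[rule_id][1] += 1
        if accepted:
            counts[rule_id][0] += 1
    return {rule: acc / total for rule, (acc, total) in counts.items()}


def rules_needing_refinement(feedback: Iterable[Feedback], target: float = 0.95) -> List[str]:
    """List rules whose acceptance rate is below target, worst first."""
    rates = acceptance_by_rule(feedback)
    below = [rule for rule, rate in rates.items() if rate < target]
    return sorted(below, key=lambda rule: rates[rule])


# Invented feedback log with hypothetical rule IDs.
log = [
    ("R1", True), ("R1", True),
    ("R2", True), ("R2", False),
    ("R3", False), ("R3", False),
]

rates = acceptance_by_rule(log)
queue = rules_needing_refinement(log)
print(rates, queue)
```

Ranking rules by acceptance rate turns reviewer decisions into a prioritized list of prompts to revisit, which is the improvement mechanism the paragraph describes.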

We proceeded to Beta with a clear goal: hit 95% accuracy, complete all regulatory documentation, and push to Production within 8 weeks. The team expanded the Legal agent from 6 to 41 business rules across Trademark & IP, Privacy & Consent, Digital & Web Compliance, and Disclaimers & Branding.

What Happened

A skunkworks team changed the calculus

While we were navigating Lilly's regulatory requirements for a Production release — the architecture reviews, security approvals, documentation, and environment hardening that enterprise software deployments require — an internal Lilly team, working in isolation, built a competing prototype in three weeks. Leveraging the business rules and requirements our team had spent months gathering, they demonstrated a tool with more visible features in a fraction of the time.

The project was placed on hold. I understood the decision. If the internal team had built something better and faster, pausing to evaluate was the right call for the business.

An Honest Observation

The internal team built in isolation, without the overhead our team carried: no environment setup, no security reviews, no application registration, no enterprise governance. They built a prototype. We built a production-ready platform. Whether they'll hit the same regulatory hurdles on the path to Production, and whether the business rules they inherited will hold up to Medical and Regulatory scrutiny, remains to be seen.

Looking Back

What I'd do differently

Push harder to release Alpha to real users

I should have advocated more forcefully for getting the Alpha in front of end users despite the accuracy concerns. Yes, some confidence might have been lost in the short term. But the iterative feedback loop — real reviewers accepting and rejecting agent findings — is the mechanism that improves the model. Waiting for perfection in the lower environments was the wrong trade-off.

Own the stakeholder communication more directly

I relied too heavily on my Lilly counterpart to keep stakeholders informed. In hindsight, I should have created more forcing functions — regular written updates, a visible dashboard of agent accuracy progress, something that made the team's work legible to people who weren't in the sprints. Stakeholders who feel in the dark become nervous stakeholders, and nervous stakeholders are a project risk.

Push the engineering team harder on AI-assisted development

I asked several times whether we could leverage Claude Code or Copilot to accelerate development. The answers never quite landed. The team — eight engineers spanning AI/ML, full-stack, architecture, and technical leadership — was accustomed to building in more traditional ways. I even used AI to produce a hard review of the architecture solution document, surfacing questions I thought were worth discussing. The team declined to engage with the results. I trusted their expertise and moved on. Looking back, I should have found a way to make the conversation happen rather than accepting the deflection.

Skills Demonstrated

What this project required

AI Product Management
Prompt Engineering
Enterprise Stakeholder Management
Regulatory Navigation
Distributed Team Leadership
Product Discovery
Requirements Definition
Release Strategy
RAG Architecture
Agile / Scrum
Evaluation Framework Design
Backlog Management