Designed a Copilot feature that analyzes data shape, infers user intent, and recommends optimal chart configs with story-first titles like “Quarterly Trends.” Collaborated with ML engineers on RAG model tuning, eliminating chart type decision fatigue and making great visualizations accessible to all.
Introduction
How I designed an AI system that bridges the gap between chart creation and chart communication — for 400M Excel users.
Creating charts is easy. Making them GOOD is hard. Users struggled with chart type selection, styling decisions, and best practices—resulting in suboptimal visualisations even when data was correct.
As Lead Designer for Copilot Chart Design Recommendations, I designed an LLM-powered system that analyses data shape, infers user intent, and suggests optimal chart configurations. This required deep collaboration with ML engineers to train and tune the RAG models with data visualisation principles—essentially encoding expert knowledge into AI prompts.
The result bridges the gap between 'chart exists' and 'chart communicates effectively,' democratising data visualisation expertise for 400M users.
Results Overview
The feature shipped, scaled, and proved that AI-assisted design guidance moves real product needles.
Execution Success
User Effort Saved
Clicks Eliminated
Shipping Timeline
The Problem: The Chart Design Expertise Gap
Most users could create a chart. Almost none could create a good one. The expertise gap was the product gap.
"No suggestions... I have to try different charts and hope one communicates my idea." — Usability Study 2024
"I spend hours tweaking charts to look professional." — Power User
"It looks boring. How do I make it presentation-ready?" — Enterprise Analyst
"I created a chart but don't know if I picked the right type." — Intermediate User
| Pain Point | User Behaviour | Business Impact |
|---|---|---|
| Chart Type Uncertainty | Try multiple types, delete, start over. 5-10 minute cycle. | Wasted time, user frustration, suboptimal final choices |
| Styling Paralysis | Don't know which formatting options matter. Either over-style or under-style. | Charts look unprofessional or cluttered |
| Best Practice Ignorance | Unaware of data viz principles (e.g., start axis at zero, use direct labels) | Misleading visuals, poor communication |
The pattern was clear: users could INSERT charts (thanks to our P0 improvements), but they couldn't optimise them.
Why Competitors Had the Advantage
Competitive analysis revealed that competing tools already offered sophisticated design assistance.
Tools that handle complex data are hard to use. User-friendly tools handle simple data.
40% of Excel charts were deleted in the same session they were created. Canva/Flourish users kept charts because they looked presentation-ready on first insert.
Google Explore, Napkin.ai, Tableau Show Me — every modern tool reduces data→chart to 1-2 clicks. Excel required 5+ steps with no guidance.
Pitch, Miro, Figma use click-to-format context menus. Excel used ribbon + dialog boxes — 3+ clicks to format a single element.
In our AI compete benchmark, Copilot in Excel scored 48/100 — below ChatGPT (85), Gemini (72), even Gemini Sheets (56). Task success was 40%.
BI tools provide insights alongside charts. Google Gemini explains trends. Excel charts were 'purely graphical — static visuals, with no story.'
Synthesis
Across all 30+ tools, three truths emerged:
Reduce friction at the start — suggestions, templates, one-click creation.
Make the default output impressive — users keep charts that look good on first insert.
Add intelligence — the tool should explain the data, not just display it. Excel had the data capability. It needed the ease and the storytelling.
As a User
Functional & Emotional JTBDs
What users wanted to accomplish — and how they wanted to feel — when reaching for charts in Excel.
"When I insert a chart, help me create a visualization that tells my story effectively — without needing to be a data viz expert."
| Job Category | User Statement | Pain Point It Solves |
|---|---|---|
| Chart Type Selection | "Help me figure out which chart best represents my data" | 5-10 minute trial-and-error cycles; users try multiple types, delete, start over |
| Visual Design | "Make my chart look professional and presentation-ready" | Charts described as "boring," "old-fashioned," "embarrassing to present" |
| Best Practice Application | "Tell me what I don't know about good data visualization" | Users unaware of principles like "start Y-axis at zero" or "use direct labels" |
| JTBD | Description | Copilot Intent Share |
|---|---|---|
| Comparative Analysis | Compare values across categories, geographies, or periods to uncover insights | Part of 83% "Create Chart" intents |
| Presentation & Storytelling | Make complex information clear, engaging, persuasive in meetings/reports | 9.6% of explicit intents |
| Trend Analysis | Visualize how metrics change over time to identify patterns | Primary use case for Line charts |
| Answering Business Questions | Create ad-hoc visuals to answer specific questions quickly | Core Excel workflow |
The "Magic Wand" Quote
from User Research
The single question that unlocked what users truly needed — and reframed the entire design brief.
"If you could wave a magic wand, what would you change?"
Users wanted three things:
Automatic chart creation
"Based on my specific goal and storytelling needs, help me tell my story"
Automatic beautification
"Make my charts look beautiful without me having to figure it out"
Natural language customization
"Let me ask for customizations in plain English"
As a Business
Strategic JTBDs
The commercial imperatives driving investment in chart intelligence — retention, ecosystem depth, and competitive parity.
"Increase chart adoption and retention to keep users within the M365 ecosystem for their data visualization needs — preventing defection to competitors."
| Metric | Baseline Problem | Target Impact |
|---|---|---|
| Chart Kept Rate | ~45% of charts deleted in same session | Push toward >70% retention |
| Chart Create MAU | Only 2% of MAU on web create charts | Increase top-of-funnel creation |
| Net Chart Creation | Inserts minus deletes was too low | Increase net positive |
| Data Viz NPS | Charting issues dragging down Excel NPS | Measurable improvement |
| Copilot Tried/Enabled | Design Recommendations as gateway | Lift adoption rate |
| Business Job | Why It Matters | How Design Recommendations Solves It |
|---|---|---|
| Compete Defence | Tableau, Power BI, ChatGPT Code Interpreter, Napkin AI democratizing design expertise | Embed expertise IN the tool; no learning curve required |
| Copilot Adoption | Only ~9% of Copilot users engaged with chart-related prompts | Proactive recommendations at insert = gateway to Copilot |
| User Retention | Users looking outside M365 for data viz needs | "Wow moment" on first chart = sticky behavior |
| Unlock Latent Demand | 33% of commercial users want to create charts but don't | Remove friction to convert intent → action |
The Business Funnel Problem
The massive drop from awareness → creation is where AI Design Recommendations lives. It attacks the -98% conversion gap.
My Role
Designing AI as Design Partner
I wasn't just designing a UI — I was co-designing the intelligence behind it, working across ML, data science, engineering, and research simultaneously.
Systems thinking — considering charts across the complete M365 ecosystem
RAG model training & tuning — defined data viz properties that inform chart type recommendations
Recommendation interaction patterns — preview, apply, undo flows
AI prompt engineering collaboration — co-designed LLM prompts with the ML team for chart analysis and provided examples of visually stunning data viz
End-to-end UX strategy for Copilot-powered design recommendations
Multi-recommendation handling — when LLM suggests 3-5 improvements, how to present without overwhelming
Trust-building mechanisms — explainability, rationale, learn more links
Design recommendations should feel like:
The design philosophy that shaped every interaction pattern, recommendation format, and piece of copy in the system.
A helpful colleague, not a know-it-all boss
Educational — explain WHY, don't just say WHAT
Suggestions, not mandates — users always have final say
Confidence-building — help users become better designers over time
Phase 1:
Analyzed telemetry revealing a -98% funnel drop from chart awareness to creation. Synthesized OCV feedback — users called charts "boring" and "embarrassing." Defined the core JTBD: help users tell data stories without being viz experts.
The OCV analysis showed 38% of chart complaints were about poor quality — tables instead of charts, wrong grouping, blank outputs. That's what we were solving.
Phase 2:
Explored 3 directions: auto-apply magic, inline tooltips, side-by-side preview. User testing rejected "AI takeover" — they wanted to see options first. Landed on story-first titles and preview-before-commit as guiding principles.
Phase 3:
Partnered with engineering to map hard limits: 2-4s LLM latency, ~85% preview fidelity, single Copilot pane. Made key tradeoffs — 4 recommendations, refresh button, dropdown for placement. Designed around constraints, not against them.
Phase 4:
Built a golden dataset of 50+ data scenarios with ideal chart recommendations. Defined statistical signals for preprocessing — time-series, part-to-whole, category comparison. Reviewed model outputs weekly to catch and correct bad patterns.
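As a rough illustration of what a preprocessing signal for the three shapes named above could look like, here is a minimal sketch. The function name `detect_shape` and the heuristics inside it (date-typed categories, values summing to 100 or 1) are assumptions made for this example, not the signals the team actually defined.

```python
from datetime import date
from typing import Sequence


def detect_shape(categories: Sequence[object], values: Sequence[float]) -> str:
    """Return a coarse data-shape label for one category column and one value column."""
    # Assumed signal: every category is a date, so the data is ordered in time.
    if categories and all(isinstance(c, date) for c in categories):
        return "time-series"

    total = sum(values)
    non_negative = all(v >= 0 for v in values)
    # Assumed signal: non-negative values that sum to 100 (percent) or 1 (fractions)
    # are treated as shares of a whole.
    if non_negative and (abs(total - 100) < 1e-6 or abs(total - 1) < 1e-6):
        return "part-to-whole"

    # Everything else falls back to a plain comparison across categories.
    return "category comparison"
```

A label like this would then be passed along with the data so the model does not have to infer the shape from raw cells alone.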
Phase 5:
Working with the ML team, I co-created prompts optimized for chart type selection. The key was encoding data visualization best practices into the prompt structure — story-first titles, rationale text, diverse recommendations.
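To make the idea of a structured prompt concrete, the sketch below fills in the Prompt Architecture template quoted later in this phase. The wording of the template comes from that section; the names `PROMPT_TEMPLATE` and `build_prompt` and the placeholder fields are hypothetical, and the production prompt is likely to be longer than this.

```python
# Template wording follows the "Prompt Architecture" text in this case study;
# the placeholder names are assumptions made for this sketch.
PROMPT_TEMPLATE = (
    "Given a chart with {data_structure}, current type {current_type}, "
    "analyze if a better visualization exists.\n"
    "Consider:\n"
    "1) Data relationships,\n"
    "2) Storytelling intent,\n"
    "3) Visual clarity\n"
    "Return top {n} recommendations with executable chart config and brief rationale."
)


def build_prompt(data_structure: str, current_type: str, n: int = 4) -> str:
    """Fill the template with a description of the data and the current chart type."""
    return PROMPT_TEMPLATE.format(
        data_structure=data_structure,
        current_type=current_type,
        n=n,
    )
```

Calling `build_prompt("12 monthly revenue values", "Pie Chart")` would produce a prompt asking for the top 4 recommendations for that data.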
I defined the decision tree that the model uses to recommend chart types:
| User Intent / Data Shape | Recommended Chart | Why This Works |
|---|---|---|
| Time series with trend | Line Chart | Shows change over time; eye follows the trajectory |
| Categorical comparison | Clustered Bar/Column | Easy side-by-side comparison; clear value differences |
| Part-to-whole (<7 categories) | Pie/Donut Chart | Intuitive percentage representation; limited categories |
| Part-to-whole (>7 categories) | Stacked Bar/Area | Handles many categories; shows composition |
| Correlation/distribution | Scatter/Bubble Chart | Reveals relationships; shows outliers clearly |
| Actual vs. Target | Combo Chart | Different visual encoding for different data types |
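The table above can be read as a small lookup with one branch on category count. The sketch below is one way to express it. The function name `recommend_chart` is hypothetical, and sending exactly 7 categories to the stacked option is an assumption, since the table only covers fewer than 7 and more than 7.

```python
# Data-shape labels and chart names are taken from the decision table.
FIXED_RULES = {
    "time series with trend": "Line Chart",
    "categorical comparison": "Clustered Bar/Column",
    "correlation/distribution": "Scatter/Bubble Chart",
    "actual vs. target": "Combo Chart",
}


def recommend_chart(shape: str, n_categories: int = 0) -> str:
    """Map a data shape (and category count, for part-to-whole) to a chart type."""
    key = shape.strip().lower()
    if key == "part-to-whole":
        # Assumption: exactly 7 categories is treated like "many categories".
        return "Pie/Donut Chart" if n_categories < 7 else "Stacked Bar/Area"
    if key in FIXED_RULES:
        return FIXED_RULES[key]
    raise ValueError(f"No rule for data shape: {shape!r}")
```

In the shipped feature the model makes this choice from the prompt, so a table lookup like this is best seen as a reference point for reviewing model outputs.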
Through iterative testing on 20+ sample datasets across industries (Telecom, Finance, Manufacturing, Retail), we tuned the prompts to:
**Prompt Architecture**
Given a chart with [data structure], current type [X], analyze if a better visualization exists.
Consider:
1) Data relationships,
2) Storytelling intent,
3) Visual clarity
Return top 4 recommendations with executable chart config and brief rationale.
Phase 6:
Designed the List → Detail two-panel flow. Specified card anatomy: thumbnail, story-first title, rationale, one-click apply. Added "Review changes" section for transparency. Created interaction specs for hover, dropdown, and back navigation.
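The card anatomy can be pictured as a small record with one field per element. The sketch below is illustrative only: the class name `RecommendationCard`, the field types, and the `is_story_first` check are assumptions made for this example, not the product's data model.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical list of raw chart-type names that a story-first title should avoid.
CHART_TYPE_NAMES = {"line chart", "clustered column chart", "pie chart", "bar chart"}


@dataclass
class RecommendationCard:
    thumbnail: str   # identifier of the preview image (assumed to be a string)
    title: str       # story-first title, e.g. "Quarterly Trends"
    rationale: str   # why this chart was suggested
    changes: List[str] = field(default_factory=list)  # items listed under "Review changes"


def is_story_first(card: RecommendationCard) -> bool:
    """True when the title is non-empty and is not just a chart-type name."""
    title = card.title.strip().lower()
    return bool(title) and title not in CHART_TYPE_NAMES
```

A check like `is_story_first` could be used when reviewing model outputs to flag cards whose titles fall back to a chart-type label.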
Phase 7:
Ran usability sessions validating story-first titles. A/B tested model versions tracking Kept rate. Iterated on "Show details" for power users. Shipped to 10% Fastfood — poor quality dropped 20pp, satisfaction hit 64%.
Initial direction, design and concepts
Three directions explored before converging: auto-apply magic, inline guidance, and story-first previews. User testing killed option one fast.
The MVP
This feature is part of a broader Excel Charting strategy. Read the strategy case study
99% Execution success rate
9 min Saved per user per session
172 Clicks eliminated
79% Chart kept rate (Month 2, up 8pp)
18% Poor quality rate (down from 38%)
+52% Copilot tried/enabled lift
64% Net satisfaction score (up 16pp)
25% FY26 H1 public rollout
45% of Excel charts got deleted in the same session they were created.
That's not a usage problem. That's a design problem. Users could insert a chart just fine. What they couldn't do was make it good. Wrong chart type, ugly defaults, no sense of whether they'd even communicated anything.
The feedback in OCV was blunt: "boring," "embarrassing," "not presentation-ready." One user put it plainly in a 2024 usability study: "No suggestions... I have to try different charts and hope one communicates my idea."
And they weren't wrong to be frustrated. The tool gave them 5+ steps to format a single element. Competitors had reduced the same flow to one or two clicks. In our own AI benchmark, Copilot in Excel scored 48/100 — below ChatGPT at 85, below Gemini at 72, below even Gemini Sheets at 56. Task success was 40%.
The competitive analysis made three things clear:
The question wasn't "how do we improve chart formatting?" It was: how do we give 400 million users the expertise they don't have?
The user research unlocked a clear job to be done: "Help me create a visualization that tells my story — without needing to be a data viz expert."
Three things came up constantly when people were asked what they'd change with a magic wand:
On the business side, the numbers told the same story from a different angle. Only 2% of monthly active users on the web created charts. 33% of commercial users wanted to create charts but didn't. There was a -98% funnel drop from chart awareness to creation. That's not friction. That's a wall.
Design Recommendations had to sit right at that wall and remove it.
I led the UX for Copilot Chart Design Recommendations end-to-end. That meant the interaction design, yes. But it also meant sitting with ML engineers and data scientists to co-design the intelligence behind it.
Most AI product design projects treat the model as a black box. You design around it. I didn't want to do that here.
The core of the work was defining "good" — before the model could learn it.
I built the training dataset from scratch with the data science team. Hundreds of annotated chart examples: good ones tagged against six quality dimensions, bad ones tagged with what specifically failed. This wasn't a design brief. It was a structured definition of chart quality, written as an engineering specification.
The six quality dimensions we coded every example against:
The LLM system prompt injected these rules as hard instructions:
I reviewed model outputs weekly against this rubric. Bad outputs went into a negative examples set with annotations on what failed and why. That feedback loop ran across the full pilot and is what got the poor quality rate from 38% to 18%.
The dataset composition at the end of the pilot: 60% positive examples (all six dimensions passing), 25% negative examples (at least one dimension failing), 15% edge cases (ambiguous data, sparse datasets, low-confidence routing).
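The bucketing rule behind that composition (all dimensions passing, at least one failing, or flagged as an edge case) is simple enough to sketch. The six dimensions are not named here, so the example uses placeholder keys; `bucket` and `composition` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, List


def bucket(dimensions: Dict[str, bool], is_edge_case: bool = False) -> str:
    """Label one annotated example as positive, negative, or edge."""
    if is_edge_case:
        # Ambiguous data, sparse datasets, low-confidence routing.
        return "edge"
    # Positive only when every quality dimension passes.
    return "positive" if all(dimensions.values()) else "negative"


def composition(labels: List[str]) -> Dict[str, float]:
    """Return the percentage share of each label in the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * n / total for label, n in counts.items()}
```

For a toy set of 20 examples with 12 positive, 5 negative, and 3 edge cases, `composition` returns shares of 60, 25, and 15 percent, the same split described above.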
Three directions. One got killed fast.
Option 1: AI Takeover. Auto-apply the best recommendation immediately. Users hated it. "Let me see my chart first." That was consistent across every test session. No one wanted the AI to make the call for them.
Option 2: Inline tooltips. Minimal guidance surfaced on hover. Too subtle. Didn't break users out of their current habit of guessing.
Option 3: Story-first previews with explicit commit. Show 4 ranked recommendations. Each with a thumbnail, a story-first title ("Quarterly Trends" not "Line Chart"), and a rationale. User picks one. User applies it. User controls the outcome.
Option 3 won because it respected agency while removing effort. Users aren't experts, but they want to feel like they made a decision, not that the machine made it for them.
I debated this with the PM. The instinct was "design-first" — lead with the visual. My view was that sequencing trust matters. Users needed to see options before they committed. A preview-first, explicit-commit model built more trust than magic.
Real constraints, not theoretical ones.
LLM latency was 2-4 seconds. Preview fidelity was around 85%, not pixel-perfect. Everything had to live inside a single Copilot pane. These weren't problems to solve. They were facts to design around.
So: a loading state that feels worth the wait. A disclaimer that sets expectations on preview fidelity. Four recommendations rather than five (which overwhelmed) or three (which felt too narrow). A refresh button so users could explore more without feeling stuck.
I pushed back on one thing hard: auto-apply. Engineering could have shipped it. It would have been technically impressive. But it would have killed trust. The preview-first model was non-negotiable.
The card had to do four things at once: show the chart quickly, tell the story behind it, explain why it was suggested, and make the action obvious.
The anatomy I specified: thumbnail, story-first title, rationale text, action row (Apply, Copy, Review changes). The "Review changes" expandable section was a deliberate trust mechanism. Users could see exactly what the AI was changing before they committed. For power users, it was critical.
I also specified: hover states, dropdown behavior on the Replace action (so users could control chart placement), back navigation between list view and detail view. None of this is glamorous work. All of it matters.
Fourteen usability sessions confirmed the story-first title naming resonated. Users understood titles like "Revenue by Region" immediately. "Clustered Column Chart" meant nothing to them.
The A/B test on model versions tracked Kept rate as the primary signal. Month 1 was 71%. Month 2 was 79%. The target was 80%. We were close.
The satisfaction feedback loop — thumbs up/thumbs down — started in a more prominent position. It moved to the footer after testing showed it was creating hesitation before people even tried a recommendation. Less intrusion, same signal quality.
We shipped to 10% Fastfood first. Poor quality rate dropped 20 percentage points. Satisfaction hit 64%. That gave us confidence for 25% rollout.
The feature shipped to a 25% public rollout in FY26 H1.
| Metric | Target | Month 1 | Month 2 | Change |
|---|---|---|---|---|
| Kept/Tried Rate | >=80% | 71% | 79% | +8pp |
| Error-Free Load Rate | >=80% | 74% | 85% | +11pp |
| Poor Quality Rate | <20% | 32% | 18% | -14pp |
| Pane Dismiss Rate | <30% | 41% | 28% | -13pp |
| Net Satisfaction | >60% | 48% | 64% | +16pp |
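The Change column and the target status can be recomputed directly from the table. The sketch below uses the table's figures as given; the tuple layout and the `summarize` helper are assumptions made for this example.

```python
import operator

# (metric, target comparison, target %, Month 1 %, Month 2 %), copied from the table.
METRICS = [
    ("Kept/Tried Rate", ">=", 80, 71, 79),
    ("Error-Free Load Rate", ">=", 80, 74, 85),
    ("Poor Quality Rate", "<", 20, 32, 18),
    ("Pane Dismiss Rate", "<", 30, 41, 28),
    ("Net Satisfaction", ">", 60, 48, 64),
]

OPS = {">=": operator.ge, "<": operator.lt, ">": operator.gt}


def summarize(metrics):
    """Return {metric: (change in percentage points, target met in Month 2)}."""
    return {
        name: (month2 - month1, OPS[op](month2, target))
        for name, op, target, month1, month2 in metrics
    }
```

Run on these figures, every metric except Kept/Tried Rate meets its target in Month 2; Kept/Tried sits at 79% against a target of 80%.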
The result that surprised me most: novice and intermediate users, who typically had a 0.3-3.7% Copilot engagement baseline, hit 54-61% engagement rates with Design Recommendations. The feature reached exactly the people it was built for. Not power users looking for shortcuts, but everyday Excel users who had never touched AI features before.
Copilot tried/enabled rate lifted 52%. Design Recommendations became the #2 most-used Copilot feature in Excel, after Insights.
Designer as model co-designer. I expected to design the interface around an AI. What I didn't expect was how much the interface quality depended on model quality, and how much model quality depended on design judgment. Co-authoring prompts with the ML team, building the golden dataset, reviewing outputs weekly — that wasn't extra credit. It was the job.
Trust is the product, not the chart. Users who saw rationale text didn't just apply recommendations. They started making better choices on their own over time. The goal was always to give people expertise, not dependency. The data suggests it worked.
Constraints were honest. 85% preview fidelity felt like a problem until we treated it as a design fact. A disclaimer plus a "Review changes" section turned a limitation into a transparency feature. Users appreciated the honesty.
One thing I'd do differently: the side-by-side comparison view. Users wanted it in the MVP. We deferred it. It shipped in a later phase, but I should have pushed harder to include a simpler version earlier. Even a toggle between two options would have helped users feel more confident in their choices.
This project changed how I think about AI product design.
The trap is building AI features that impress people once. The goal is building features that make people better over time. Those are different design problems.
The first one optimizes for wow. The second optimizes for trust, transparency, and education. It's slower to build and harder to demo. It's also the only version that compounds.
Key Design Decisions & Trade-offs
The four biggest calls I made — and the reasoning, constraints, and user evidence that shaped each one.
Choice: Generate NATIVE Excel charts, not PNG images
Why: Editable, data-bound, refreshable. Competitors' AI-generated images look good but can't be tweaked.
Choice: Show 1-4 recommendations, prioritized by confidence
Why: Balance guidance with choice. 1 felt prescriptive, 5+ overwhelmed. 4 was the sweet spot.
Choice: Thumbnail preview in pane, NOT live chart manipulation on hover
Why: Live preview felt overwhelming. Thumbnails gave control without distraction.
Choice: Always show WHY, not just WHAT to change
Why: Builds user understanding over time. Trust through transparency.
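The second choice above (show 1-4 recommendations, prioritized by confidence) amounts to a sort followed by a cap. A minimal sketch, assuming each candidate arrives as a title paired with a confidence score; the function name and the score scale are hypothetical.

```python
from typing import List, Tuple


def pick_recommendations(
    candidates: List[Tuple[str, float]], max_shown: int = 4
) -> List[str]:
    """Return up to max_shown titles, highest confidence first."""
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [title for title, _ in ranked[:max_shown]]
```

With five candidates, the lowest-scoring one is dropped; with a single candidate, only that one is shown.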
Impact & Results
From controlled testing to 25% public rollout — every metric moved in the right direction.
Kept/Tried Rate
Users who apply a recommendation keep the chart
Error-Free Load Rate
Down from 38% baseline
Copilot Tried/Enabled Lift
Uplift in users who try Copilot after seeing recommendations
Chart Retention Improvement
Reduction in same-session chart deletions
The feature is successfully reaching Novice/Intermediate users who traditionally DON'T use Copilot for charts (0.3%-3.7% baseline). Their engagement rates (54-61%) far exceed their typical Copilot usage.
| Metric | Target | Month 1 | Month 2 | Trend |
|---|---|---|---|---|
| Kept/Tried Rate | ≥80% | 71% | 79% | 📈 +8pp |
| Error-Free Load Rate | ≥80% | 74% | 85% | 📈 +11pp |
| Poor Quality Rate | <20% | 32% | 18% | 📉 -14pp |
| Pane Dismiss Rate | <30% | 41% | 28% | 📉 -13pp |
| Net Satisfaction (👍-👎) | >60% | 48% | 64% | 📈 +16pp |
Feedback from Users
Key finding: Users didn't just apply recommendations — they LEARNED from them. Over time, they started making better initial choices.
"This is like having a data viz expert sitting next to me."
Power User, Internal Preview
"Finally! I don't have to guess if my chart is good."
Intermediate User
"I learned more about charting from these suggestions than from any tutorial."
Novice User, Usability Study
"This is like having a data analyst whispering in my ear when I make a chart."
Financial Analyst, Early Adopter
Internal Testing & Strategic Impact
Pre-launch benchmarks that validated both the technical approach and design decisions before a single user saw it.
✅ 100% execution success rate
✅ LLM-generated chart code worked every time in controlled tests
✅ 9 mins, 172 clicks saved
✅ Measured against manual chart optimization workflow
✅ All common chart types supported
✅ Column, bar, line, scatter, pie, combo — full MVP coverage
Key Learnings: Designing for AI Collaboration
What I'd do the same, what I'd change, and what this project taught me about designing with — not around — AI.
The Bigger Lesson for AI Product Design