Why We Chose Logfire Over Braintrust

Published: 2025-12-01

TL;DR — For us, Logfire was by far the best option.

We needed observability for production AI agents in regulated industries. After evaluating the landscape, we went with Pydantic Logfire over Braintrust. We didn't even look at LangSmith after our LangChain experience.

Here's why Logfire won.

✨ Full Observability, Not Just AI

Braintrust is an AI observability platform. Logfire is an observability platform that happens to be excellent for AI.

That distinction matters. Our stack is FastAPI + Postgres + Pydantic AI. When something goes wrong in production, the problem isn't always in the LLM call. Sometimes it's a slow database query. Sometimes it's a downstream API timing out. Sometimes it's a subtle bug in our business logic.

Logfire traces everything. It's built on OpenTelemetry, the industry standard. One dashboard shows our AI agent spans right next to our HTTP requests right next to our SQL queries. When a request is slow, I can see exactly where the time went—was it the LLM, the database, or something else entirely?
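Here's a rough sketch of what "everything in one place" looks like wired up. It's just our usual setup, not a canonical recipe, and the httpx and Pydantic AI lines assume a recent Logfire release with the corresponding extras installed:

import logfire
from fastapi import FastAPI

app = FastAPI()

# One configure() call; every instrumentation below emits OpenTelemetry spans
# into the same trace tree, so the dashboard shows them side by side.
logfire.configure()
logfire.instrument_fastapi(app)   # HTTP requests and responses
logfire.instrument_psycopg()      # SQL queries
logfire.instrument_httpx()        # outbound API calls
logfire.instrument_pydantic_ai()  # agent runs and LLM calls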

With an AI-only tool like Braintrust, we'd need a separate APM for everything else. Two dashboards. Two mental models. Two bills. No thanks.

✨ Developer Experience That Actually Works

Here's how you set up Logfire:

import logfire

logfire.configure()
logfire.info('Hello from {city}!', city='Paris')

That's it. Run logfire auth once in your terminal, and you're done.

Need manual tracing? Add a span:

import logfire


def process_order(order_id: str) -> None:
    with logfire.span('processing_order', order_id=order_id):
        # Your code here - automatically timed and traced
        ...

FastAPI instrumentation? One line:

import logfire
from fastapi import FastAPI

app = FastAPI()
logfire.configure()
logfire.instrument_fastapi(app)

Postgres query tracing? One line:

import logfire

logfire.configure()
logfire.instrument_psycopg()

Every SQL query now shows up in your traces with timing, parameters, and row counts. When that one query is suddenly taking 500ms instead of 5ms, you see it immediately.
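As a quick sketch of what that means in practice (the connection string and table name below are placeholders), an ordinary query becomes a span nested under whatever request or agent span is active:

import logfire
import psycopg

logfire.configure()
logfire.instrument_psycopg()

# Placeholder DSN and table name - point these at your own database.
with psycopg.connect('postgresql://localhost/app') as conn:
    # This query shows up in the trace as its own span, with the SQL and its duration.
    conn.execute('SELECT count(*) FROM orders')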

Our whole team uses Logfire as a better version of print debugging. Instead of scattering print() statements everywhere and then removing them, we use logfire.info() and logfire.span(). The logs persist. They're structured. They're searchable. And when something breaks in production, we already have the observability we need.
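A minimal sketch of that pattern (the order-processing names here are invented for illustration):

import logfire

logfire.configure()


def apply_discount(order_id: str, items: list[dict]) -> float:
    # Where a print() used to go, open a span or log a structured event instead.
    with logfire.span('apply_discount', order_id=order_id):
        total = sum(item['price'] for item in items)
        logfire.info('order {order_id}: total {total} across {count} items',
                     order_id=order_id, total=total, count=len(items))
        return total * 0.9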

The live view in Logfire's dashboard is particularly nice during local development. Watch your traces stream in real-time as you hit your endpoints. It's like having a debugger that works across your entire request lifecycle.

✨ Compliance for Regulated Industries

We build AI automation for insurance, finance, and healthcare. Compliance isn't optional—it's table stakes.

Logfire is SOC2 Type II certified and HIPAA compliant. For our clients, this means fewer security questionnaires and faster procurement cycles.

Braintrust has SOC2 as well, to be fair. But there's something reassuring about the Pydantic pedigree. These are the same folks who built the most widely-used data validation library in Python. They understand what it means to be trusted by enterprises.

✨ Evals That Actually Work

Pydantic Evals takes a code-first approach to AI evaluation. No web UI for defining test cases. No YAML configs. Just Python:

from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[
        Case(
            name='capital_question',
            inputs='What is the capital of France?',
            expected_output='Paris',
        ),
        Case(
            name='whiskey_region',
            inputs='What region is Yamazaki whiskey from?',
            expected_output='Japan',
        ),
    ]
)

# Run your agent against all cases, get a typed report
# report = dataset.evaluate_sync(my_agent_function)

It feels like writing unit tests—because it basically is. Define your cases, define your evaluators, run your experiments. Results automatically appear in Logfire for visualization and comparison.
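Evaluators are the same story: a small class that implements evaluate(). A rough sketch (ExactMatch is our own example, not a built-in, and the base-class details may vary by pydantic_evals version):

from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class ExactMatch(Evaluator):
    """Passes only when the output matches the expected string exactly."""

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return ctx.output == ctx.expected_output


dataset = Dataset(
    cases=[Case(inputs='What is the capital of France?', expected_output='Paris')],
    evaluators=[ExactMatch()],
)

# report = dataset.evaluate_sync(my_agent_function)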

Braintrust's eval system works, but it felt clunkier. More platform-dependent. More clicking around in a web UI to set things up. Pydantic Evals is just Python code that lives in your repo, runs in CI, and integrates with everything else you're already using.

The type safety is the real win. Your cases and evaluators are fully typed. When you refactor your agent's output schema, your IDE tells you which evals need updating.

Braintrust Isn't Bad

I want to be fair here. Braintrust is a solid platform for teams that want a UI-first workflow. Their AI-assisted prompt optimization is clever. The playground is nice for prototyping.

But it's AI-only. If you're building a pure LLM application with no database, no HTTP calls, no complex business logic—Braintrust might be perfect.

That's not us. We have FastAPI services, Postgres databases, external APIs, background workers. We needed observability that covers the whole stack, not just the AI parts. Running two separate observability platforms felt like unnecessary complexity.

The Bottom Line

Logfire gives us full-stack observability built on OpenTelemetry, a setup measured in minutes, the compliance posture our regulated clients require, and code-first evals that live in our repo and run in CI.

For teams building AI in regulated industries with real backend infrastructure, it's the obvious choice.

Now if you'll excuse me, I have some traces to review—and a glass of Hibiki to pour. À la prochaine. 🥃