Tools
A curated directory of 21 tools we use, evaluate, and recommend across the AI/ML engineering landscape, with our take on each.
LLM Observability
Langfuse
Our take
Default starting point for new LLM teams. Self-host on a single VM or use SaaS — either scales surprisingly far.
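Getting a first trace flowing is mostly one decorator. A minimal sketch, assuming the Langfuse Python SDK v2 and credentials supplied via the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables (pointing at either your self-hosted instance or the SaaS endpoint); the function and its contents are placeholders.

```python
# Minimal Langfuse tracing sketch; credentials are read from LANGFUSE_* env vars.
from langfuse.decorators import observe


@observe()  # records this call as a trace in Langfuse, capturing inputs and outputs
def summarize(text: str) -> str:
    # Call your LLM of choice here; this stub just truncates.
    return text[:100]


summarize("Some long document ...")
```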
Phoenix (Arize)
Our take
Strong if you have classical ML alongside LLMs. Embeddings-drift visualizations are best-in-class.
Helicone
Our take
Lowest setup cost. Less feature-rich than Langfuse but starts paying off in 5 minutes.
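The "5 minutes" is the proxy integration: point your existing OpenAI client at Helicone's gateway and add an auth header, with no other code changes. A sketch assuming the OpenAI Python SDK v1, Helicone's OpenAI-compatible proxy endpoint, and an API key in the HELICONE_API_KEY environment variable.

```python
import os

from openai import OpenAI

# Route existing OpenAI traffic through Helicone's gateway; everything else stays the same.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```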
Honeyhive
Our take
Commercial-only; competitive with LangSmith. Good if you need a hosted product without DIY.
ML Experiment Tracking
MLflow
Our take
Boring choice that works. Self-host the tracking server; integrate with whatever you use for serving.
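The "boring choice that works" workflow against a self-hosted tracking server looks roughly like this; the server URL, experiment name, and logged values are placeholders.

```python
import mlflow

# Point at your self-hosted tracking server (placeholder URL).
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)      # hyperparameters
    mlflow.log_metric("val_auc", 0.91)         # evaluation metrics
    mlflow.log_artifact("model_card.md")       # any local file worth keeping with the run
```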
Weights & Biases
Our take
The premium option. Pricey at team scale, but the UX gain is real for research-heavy teams.
Aim
Our take
Underrated alternative to MLflow if you find MLflow's UI clunky. Smaller ecosystem.
ML Monitoring & Drift
Evidently AI
Our take
Best OSS ML monitoring tool. Generates HTML reports out of the box; exports to Grafana for dashboards.
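The out-of-the-box HTML report is about this much code. A sketch assuming the Report/preset API from Evidently around 0.4.x (newer releases have moved some module paths) and two placeholder parquet files for reference and current data.

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("reference.parquet")   # training-time snapshot
current = pd.read_parquet("last_week.parquet")     # recent production data

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")              # self-contained HTML report
```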
WhyLabs
Our take
Profiles your data locally and ships only statistical summaries, so raw rows never leave your environment; appealing for regulated industries.
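The mechanism is whylogs profiles. A sketch assuming the whylogs v1 Python API, a pandas batch in a placeholder file, and WhyLabs credentials configured via the WHYLABS_* environment variables.

```python
import pandas as pd
import whylogs as why

df = pd.read_parquet("inference_batch.parquet")

# Profile locally: only statistical sketches (counts, distributions, cardinality)
# are produced; the raw rows stay where they are.
results = why.log(df)

# Ship the profile to WhyLabs (org, dataset, and API key assumed to be set via
# WHYLABS_* environment variables).
results.writer("whylabs").write()
```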
Fiddler
Our take
Enterprise-grade with explainability emphasis. Worth evaluating if model risk management is a procurement requirement.
Vector DBs
Qdrant
Our take
Performance leader for self-hosted. Filtering + payloads work well; ops are simple.
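"Filtering + payloads" in practice: attach metadata to points and constrain vector search with it. A sketch using the qdrant-client Python SDK against a local instance; the collection name, toy 4-dimensional vectors, and payload fields are made up.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"lang": "en", "team": "platform"}),
        PointStruct(id=2, vector=[0.2, 0.1, 0.4, 0.3], payload={"lang": "de", "team": "data"}),
    ],
)

# Vector search constrained by a payload filter.
hits = client.search(
    collection_name="docs",
    query_vector=[0.1, 0.2, 0.3, 0.4],
    query_filter=Filter(must=[FieldCondition(key="lang", match=MatchValue(value="en"))]),
    limit=3,
)
print(hits)
```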
Weaviate
Our take
Strong hybrid search (BM25 + vectors). Heavier than Qdrant but offers more out-of-the-box.
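Hybrid search is a single call in the v4 Python client; alpha blends the two scores (0 is pure BM25, 1 is pure vector). A sketch assuming a local instance and an existing "Article" collection with a vectorizer already configured.

```python
import weaviate

client = weaviate.connect_to_local()
articles = client.collections.get("Article")

# alpha=0.5 weights keyword (BM25) and vector similarity equally.
results = articles.query.hybrid(query="rollback procedure", alpha=0.5, limit=5)
for obj in results.objects:
    print(obj.properties)

client.close()
```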
pgvector
Our take
If you already run Postgres, start here. Performance is good enough for most apps; move to a dedicated vector DB only when you outgrow it.
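Starting from an existing Postgres is roughly: enable the extension, add a vector column, and order by distance. A sketch using psycopg (v3) with a placeholder connection string, a toy 3-dimensional embedding, and pgvector's cosine-distance operator.

```python
import psycopg

with psycopg.connect("postgresql://app@localhost/appdb") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "id bigserial PRIMARY KEY, body text, embedding vector(3))"
    )
    conn.execute(
        "INSERT INTO items (body, embedding) VALUES (%s, %s::vector)",
        ("hello world", "[0.1, 0.2, 0.3]"),
    )
    # Nearest-neighbour query by cosine distance (the <=> operator).
    rows = conn.execute(
        "SELECT body FROM items ORDER BY embedding <=> %s::vector LIMIT 5",
        ("[0.1, 0.2, 0.3]",),
    ).fetchall()
    print(rows)
```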
Pinecone
Our take
Pricey at scale. Use only if the team's bandwidth for ops is zero and the budget allows.
LLM Frameworks & Orchestration
LangChain
Our take
Inevitable in many stacks. Critique it all you want; the ecosystem and integration coverage make it hard to avoid.
LlamaIndex
Our take
Better than LangChain for pure RAG use cases. Cleaner abstractions; less framework drama.
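"Cleaner abstractions" mostly means the core RAG loop fits in a few lines. A sketch assuming llama-index 0.10+ (the llama_index.core namespace), a local ./docs folder, and an OpenAI key in the environment for the default embedding model and LLM.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local files, embed them, and build an in-memory vector index.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask a question over the indexed documents.
query_engine = index.as_query_engine()
print(query_engine.query("What does the deployment runbook say about rollbacks?"))
```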
DSPy
Our take
A different paradigm from LangChain. Worth trying for production prompt tuning that doesn't depend on a single model version.
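The paradigm shift is that you declare a signature and let DSPy own (and later re-optimize) the prompt, instead of hand-editing prompt strings per model. A sketch assuming DSPy 2.5+ and its provider/model string convention; the model name and question are placeholders.

```python
import dspy

# Configure the default LM (model string follows DSPy's "provider/model" convention).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Declare *what* you want; DSPy generates the prompt and can re-optimize it
# when you switch models or add training examples.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="When does caching make a retrieval pipeline slower?").answer)
```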
Instructor
Our take
Drop-in for any structured-output use case. Cleaner than rolling your own JSON-mode wrappers.
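"Drop-in" looks like this: wrap the OpenAI client and pass a Pydantic model as response_model, and you get a validated object back instead of JSON to parse. A sketch assuming instructor's from_openai wrapper and the OpenAI Python SDK v1; the Incident schema is a made-up example.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel


class Incident(BaseModel):
    severity: str
    affected_service: str
    summary: str


client = instructor.from_openai(OpenAI())

# Returns a validated Incident instance rather than a raw JSON string.
incident = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Incident,
    messages=[{"role": "user", "content": "Parse this incident report: ..."}],
)
print(incident.severity, incident.affected_service)
```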
Inference & Serving
vLLM
Our take
Best self-hosted serving for latency-tolerant batch workloads. Expect a 2-5x throughput improvement over naive Transformers-based serving.
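For the batch/throughput story, the offline engine is the simplest demonstration: you hand it a list of prompts and it handles batching internally. A sketch assuming a machine with a supported GPU; the small open model and prompts are placeholders.

```python
from vllm import LLM, SamplingParams

# The engine batches these prompts itself (continuous batching + PagedAttention).
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Summarize the idea of continuous batching in one sentence.",
    "List three ways to reduce tail latency in an inference service.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```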
TGI (Text Generation Inference)
Our take
Solid choice when you're deploying HF-hosted models; tight integration with the hub.
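In practice you point TGI's container at a Hub model id and then talk to the endpoint with huggingface_hub's client. A sketch assuming a TGI server is already running locally on port 8080; the prompt is a placeholder.

```python
from huggingface_hub import InferenceClient

# Talk to a running TGI server (e.g. started from its official Docker image
# with --model-id pointing at a Hub model).
client = InferenceClient("http://localhost:8080")
print(client.text_generation("Explain continuous batching in one sentence.", max_new_tokens=64))
```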
Ollama
Our take
Best dev/local environment story. Not a production server, but invaluable for local prototyping and integration tests.
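The local-dev story is that Ollama exposes an OpenAI-compatible endpoint, so prototypes and integration tests can reuse production client code with a swapped base URL. A sketch assuming `ollama pull llama3.2` has been run and the daemon is listening on its default port.

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost:11434; the api_key value
# is unused but the SDK requires one.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Return the word 'pong'."}],
)
print(resp.choices[0].message.content)
```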