Tools
A curated directory of 21 tools we use, evaluate, and recommend across the AI/ML engineering landscape, with our take on each.
LLM Observability
Langfuse
Our take
Default starting point for new LLM teams. Self-host on a single VM or use SaaS — either scales surprisingly far.
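Getting a first trace flowing is mostly one decorator. A minimal sketch, assuming the Langfuse Python SDK v2 and credentials supplied via the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables (pointing at either your self-hosted instance or the SaaS endpoint); the function and its contents are placeholders.

```python
# Minimal Langfuse tracing sketch; credentials are read from LANGFUSE_* env vars.
from langfuse.decorators import observe


@observe()  # records this call as a trace in Langfuse, capturing inputs and outputs
def summarize(text: str) -> str:
    # Call your LLM of choice here; this stub just truncates.
    return text[:100]


summarize("Some long document ...")
```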
Phoenix (Arize)
Our take
Strong if you have classical ML alongside LLMs. Embeddings-drift visualizations are best-in-class.
Helicone
Our take
Lowest setup cost. Less feature-rich than Langfuse but starts paying off in 5 minutes.
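The "5 minutes" is the proxy integration: point your existing OpenAI client at Helicone's gateway and add an auth header, with no other code changes. A sketch assuming the OpenAI Python SDK v1, Helicone's OpenAI-compatible proxy endpoint, and an API key in the HELICONE_API_KEY environment variable.

```python
import os

from openai import OpenAI

# Route existing OpenAI traffic through Helicone's gateway; everything else stays the same.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```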
Honeyhive
Our take
Commercial-only; competitive with LangSmith. Good if you need a hosted product without DIY.
ML Experiment Tracking
MLflow
Our take
Boring choice that works. Self-host the tracking server; integrate with whatever you use for serving.
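The "boring choice that works" workflow against a self-hosted tracking server looks roughly like this; the server URL, experiment name, and logged values are placeholders.

```python
import mlflow

# Point at your self-hosted tracking server (placeholder URL).
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)      # hyperparameters
    mlflow.log_metric("val_auc", 0.91)         # evaluation metrics
    mlflow.log_artifact("model_card.md")       # any local file worth keeping with the run
```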
Weights & Biases
Our take
The premium option. Pricey at team scale, but the UX gain is real for research-heavy teams.
Aim
Our take
Underrated alternative to MLflow if you find MLflow's UI clunky. Smaller ecosystem.
ML Monitoring & Drift
Evidently AI
Our take
Best OSS ML monitoring tool. Generates HTML reports out of the box; exports to Grafana for dashboards.
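The out-of-the-box HTML report is about this much code. A sketch assuming the Report/preset API from Evidently around 0.4.x (newer releases have moved some module paths) and two placeholder parquet files for reference and current data.

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("reference.parquet")   # training-time snapshot
current = pd.read_parquet("last_week.parquet")     # recent production data

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")              # self-contained HTML report
```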
WhyLabs
Our take
Profiles your data locally and ships only statistical summaries, so raw rows never leave your environment; appealing for regulated industries.
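The mechanism is whylogs profiles. A sketch assuming the whylogs v1 Python API, a pandas batch in a placeholder file, and WhyLabs credentials configured via the WHYLABS_* environment variables.

```python
import pandas as pd
import whylogs as why

df = pd.read_parquet("inference_batch.parquet")

# Profile locally: only statistical sketches (counts, distributions, cardinality)
# are produced; the raw rows stay where they are.
results = why.log(df)

# Ship the profile to WhyLabs (org, dataset, and API key assumed to be set via
# WHYLABS_* environment variables).
results.writer("whylabs").write()
```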
Fiddler
Our take
Enterprise-grade with explainability emphasis. Worth evaluating if model risk management is a procurement requirement.
Vector DBs
Qdrant
Our take
Performance leader for self-hosted. Filtering + payloads work well; ops are simple.
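"Filtering + payloads" in practice: attach metadata to points and constrain vector search with it. A sketch using the qdrant-client Python SDK against a local instance; the collection name, toy 4-dimensional vectors, and payload fields are made up.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"lang": "en", "team": "platform"}),
        PointStruct(id=2, vector=[0.2, 0.1, 0.4, 0.3], payload={"lang": "de", "team": "data"}),
    ],
)

# Vector search constrained by a payload filter.
hits = client.search(
    collection_name="docs",
    query_vector=[0.1, 0.2, 0.3, 0.4],
    query_filter=Filter(must=[FieldCondition(key="lang", match=MatchValue(value="en"))]),
    limit=3,
)
print(hits)
```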
Weaviate
Our take
Strong hybrid search (BM25 + vectors). Heavier than Qdrant but offers more out-of-the-box.
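Hybrid search is a single call in the v4 Python client; alpha blends the two scores (0 is pure BM25, 1 is pure vector). A sketch assuming a local instance and an existing "Article" collection with a vectorizer already configured.

```python
import weaviate

client = weaviate.connect_to_local()
articles = client.collections.get("Article")

# alpha=0.5 weights keyword (BM25) and vector similarity equally.
results = articles.query.hybrid(query="rollback procedure", alpha=0.5, limit=5)
for obj in results.objects:
    print(obj.properties)

client.close()
```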
pgvector
Our take
If you already run Postgres, start here. Performance is good enough for most apps; move to a dedicated vector DB only when you outgrow it.
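Starting from an existing Postgres is roughly: enable the extension, add a vector column, and order by distance. A sketch using psycopg (v3) with a placeholder connection string, a toy 3-dimensional embedding, and pgvector's cosine-distance operator.

```python
import psycopg

with psycopg.connect("postgresql://app@localhost/appdb") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "id bigserial PRIMARY KEY, body text, embedding vector(3))"
    )
    conn.execute(
        "INSERT INTO items (body, embedding) VALUES (%s, %s::vector)",
        ("hello world", "[0.1, 0.2, 0.3]"),
    )
    # Nearest-neighbour query by cosine distance (the <=> operator).
    rows = conn.execute(
        "SELECT body FROM items ORDER BY embedding <=> %s::vector LIMIT 5",
        ("[0.1, 0.2, 0.3]",),
    ).fetchall()
    print(rows)
```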
Pinecone
Our take
Pricey at scale. Use only if the team's bandwidth for ops is zero and the budget allows.
LLM Frameworks & Orchestration
LangChain
Our take
Inevitable in many stacks. Critique it all you want; the ecosystem and integration coverage make it hard to avoid.
LlamaIndex
Our take
Better than LangChain for pure RAG use cases. Cleaner abstractions; less framework drama.
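"Cleaner abstractions" mostly means the core RAG loop fits in a few lines. A sketch assuming llama-index 0.10+ (the llama_index.core namespace), a local ./docs folder, and an OpenAI key in the environment for the default embedding model and LLM.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load local files, embed them, and build an in-memory vector index.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Ask a question over the indexed documents.
query_engine = index.as_query_engine()
print(query_engine.query("What does the deployment runbook say about rollbacks?"))
```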
DSPy
Our take
A different paradigm from LangChain. Worth trying for production prompt tuning that doesn't depend on a single model version.
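The paradigm shift is that you declare a signature and let DSPy own (and later re-optimize) the prompt, instead of hand-editing prompt strings per model. A sketch assuming DSPy 2.5+ and its provider/model string convention; the model name and question are placeholders.

```python
import dspy

# Configure the default LM (model string follows DSPy's "provider/model" convention).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Declare *what* you want; DSPy generates the prompt and can re-optimize it
# when you switch models or add training examples.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="When does caching make a retrieval pipeline slower?").answer)
```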
Instructor
Our take
Drop-in for any structured-output use case. Cleaner than rolling your own JSON-mode wrappers.
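"Drop-in" looks like this: wrap the OpenAI client and pass a Pydantic model as response_model, and you get a validated object back instead of JSON to parse. A sketch assuming instructor's from_openai wrapper and the OpenAI Python SDK v1; the Incident schema is a made-up example.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel


class Incident(BaseModel):
    severity: str
    affected_service: str
    summary: str


client = instructor.from_openai(OpenAI())

# Returns a validated Incident instance rather than a raw JSON string.
incident = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Incident,
    messages=[{"role": "user", "content": "Parse this incident report: ..."}],
)
print(incident.severity, incident.affected_service)
```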
Inference & Serving
vLLM
Our take
Best self-hosted serving for latency-tolerant batch workloads. Expect a 2-5x throughput improvement over naive Transformers-based serving.
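For the batch/throughput story, the offline engine is the simplest demonstration: you hand it a list of prompts and it handles batching internally. A sketch assuming a machine with a supported GPU; the small open model and prompts are placeholders.

```python
from vllm import LLM, SamplingParams

# The engine batches these prompts itself (continuous batching + PagedAttention).
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Summarize the idea of continuous batching in one sentence.",
    "List three ways to reduce tail latency in an inference service.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```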
TGI (Text Generation Inference)
Our take
Solid choice when you're deploying HF-hosted models; tight integration with the hub.
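In practice you point TGI's container at a Hub model id and then talk to the endpoint with huggingface_hub's client. A sketch assuming a TGI server is already running locally on port 8080; the prompt is a placeholder.

```python
from huggingface_hub import InferenceClient

# Talk to a running TGI server (e.g. started from its official Docker image
# with --model-id pointing at a Hub model).
client = InferenceClient("http://localhost:8080")
print(client.text_generation("Explain continuous batching in one sentence.", max_new_tokens=64))
```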
Ollama
Our take
Best dev/local environment story. Not a production server, but invaluable for local prototyping and integration tests.
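The local-dev story is that Ollama exposes an OpenAI-compatible endpoint, so prototypes and integration tests can reuse production client code with a swapped base URL. A sketch assuming `ollama pull llama3.2` has been run and the daemon is listening on its default port.

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost:11434; the api_key value
# is unused but the SDK requires one.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Return the word 'pong'."}],
)
print(resp.choices[0].message.content)
```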