Union.ai is an enterprise AI orchestration platform built on Flyte, supporting the complete AI development lifecycle from experiment to production. It offers dynamic workflows, agentic AI runtime, and multi-cloud deployment, serving 30+ Fortune 100 companies with proven ROI in ML operations.




The transition from machine learning experimentation to production deployment remains one of the most significant challenges facing data science teams today. Organizations invest heavily in building sophisticated models, only to encounter bottlenecks when orchestrating complex ML workflows across distributed infrastructure. The fragmentation between data processing, model training, and inference stages creates operational silos, while the complexity of managing multi-cloud environments adds another layer of overhead that diverts engineering resources from actual model development.
Union.ai addresses these challenges by providing an enterprise-grade AI orchestration platform built on Flyte, the open-source workflow automation engine originally developed at Lyft in 2016. The platform unifies the entire ML development lifecycle—from data preparation and feature engineering through model training and deployment—into a single, coherent system that eliminates the friction traditionally associated with moving ML projects from prototype to production scale.
With more than 30 Fortune 100 companies trusting Union.ai to power their AI initiatives, including industry leaders such as Spotify, Toyota (Woven by Toyota), Johnson & Johnson, and Lockheed, the platform has proven its capability to handle mission-critical workloads at enterprise scale. Companies leverage Union.ai to orchestrate everything from large-scale model training pipelines to real-time inference services, all within a unified architecture that provides consistent visibility, reproducibility, and cost control across the development lifecycle.
Union.ai delivers a comprehensive set of capabilities designed to address the full spectrum of ML engineering challenges, from individual task execution to enterprise-wide workflow orchestration. The platform's architecture emphasizes developer productivity, operational efficiency, and seamless scaling without requiring teams to sacrifice control over their infrastructure.
The platform enables teams to author workflows entirely in Python, supporting runtime-defined branching, loops, and automatic retry logic that adapts to execution context. This dynamic approach eliminates the need for static pipeline definitions that break when real-world data introduces unexpected conditions. The Agentic AI runtime extends these capabilities to orchestrate complex multi-agent workflows, supporting use cases ranging from automated research synthesis to adaptive data processing pipelines. Organizations have demonstrated the ability to execute more than 50,000 actions in a single workflow run, enabling massive parallelization of compute-intensive tasks.
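The idea of runtime-defined branching can be illustrated with a small plain-Python sketch. This is conceptual, not Union's actual API; `clean_records` is a hypothetical helper that picks its execution path from actual task output rather than a static pipeline definition:

```python
# Conceptual sketch (plain Python, not Union's API): the execution path is
# chosen at runtime from the data itself, not fixed at definition time.
def clean_records(records):
    """Drop missing values, or fall back to imputation if too many are missing."""
    cleaned = [r for r in records if r is not None]
    if len(cleaned) < len(records) / 2:
        # Runtime branch: more than half the rows are bad, so repair
        # (impute zeros) instead of dropping and starving downstream tasks.
        cleaned = [r if r is not None else 0 for r in records]
    return cleaned

print(clean_records([1, None, 2, None, None]))  # [1, 0, 2, 0, 0]
```

A static DAG would have to anticipate both paths up front; here the branch simply falls out of ordinary Python control flow.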
Union.ai simplifies distributed model training by providing automatic resource provisioning and scaling across Kubernetes clusters. The platform handles the complexity of coordinating PyTorch, TensorFlow, and other training frameworks across multiple nodes, while maintaining full reproducibility through automatic caching of intermediate results and version-controlled artifact management. Teams can scale from single-node experiments to multi-GPU training clusters without modifying their code, with the platform handling all cluster provisioning and cleanup automatically.
The unified training and inference architecture eliminates the traditional separation between development and production environments, enabling models to move seamlessly from training to serving within the same platform. Dynamic resource allocation ensures that inference endpoints scale automatically based on demand, while the <100ms latency target ensures suitability for real-time applications. Organizations deploy inference services alongside training pipelines, creating continuous learning loops where model performance metrics automatically trigger retraining workflows.
Comprehensive observability tools provide visibility across the entire ML development lifecycle, with cost allocation dashboards that attribute spending to specific teams, projects, or individual workflow executions. Data lineage tracking enables teams to trace predictions back to specific training data versions, supporting compliance requirements and debugging workflows. Integration with monitoring systems like Prometheus and Grafana ensures that ML operations integrate seamlessly with existing operational infrastructure.
Enterprise deployments benefit from role-based access control (RBAC), single sign-on support via SAML and OIDC protocols, and VPC isolation that keeps sensitive workloads separated from shared infrastructure. The platform maintains SOC 2 Type I and Type II certifications alongside HIPAA compliance, addressing the stringent security requirements of healthcare, financial services, and government customers. All customer data, including workflow executions, code, images, logs, and secrets, remains within the customer's VPC, ensuring data sovereignty and minimizing exposure to third-party systems.
The container pooling mechanism maintains a warm pool of pre-initialized containers, reducing task startup time to under 100 milliseconds by eliminating the traditional container initialization overhead. Remote debugging capabilities allow engineers to attach debuggers directly to tasks running on the actual production infrastructure, enabling line-by-line inspection of remote task execution without requiring local reproduction of complex environment configurations.
Organizations retain full control over their infrastructure choices through support for bring-your-own-cloud (BYOC) deployments across AWS, GCP, Azure, and neo-cloud environments. Self-hosted deployment options support on-premises, hybrid, and air-gapped configurations for organizations with specific compliance or data residency requirements. This flexibility enables enterprises to execute multi-cloud strategies without platform lock-in while maintaining consistent tooling and workflow definitions across environments.
Union.ai serves organizations across diverse industries, with particular strength in sectors requiring large-scale compute orchestration, rigorous reproducibility, and strict data governance. The following case studies illustrate how different industries leverage the platform to address their unique challenges.
The biotechnology sector relies on Union.ai to accelerate drug discovery and genomic analysis workflows that require processing vast datasets across thousands of parallel compute tasks. Rezo utilizes the platform to orchestrate drug discovery pipelines, achieving over 90% reduction in compute costs while dramatically accelerating the identification of promising therapeutic candidates. Artera leverages Union.ai to personalize cancer treatments by analyzing patient-specific data at scale, while Delve Bio applies the platform to accelerate infectious disease diagnosis through rapid pathogen identification. Cradle uses Union.ai to streamline protein design workflows, enabling ML researchers to iterate on protein structures faster than traditional laboratory approaches permit.
Autonomous vehicle development demands efficient orchestration of massive data processing pipelines, simulation workloads, and continuous model training cycles. Woven by Toyota employs Union.ai to manage the computational infrastructure supporting autonomous vehicle development, generating millions of dollars in savings while enabling unprecedented scaling of autonomous driving research. Wayve leverages the platform's dynamic workflow capabilities to accelerate autonomous driving R&D, using the platform's ability to coordinate complex multi-stage training pipelines across distributed infrastructure.
Organizations processing global-scale geospatial data benefit from Union.ai's ability to coordinate massive parallel processing workloads across geographically distributed compute resources. MethaneSAT uses the platform to orchestrate global methane emission monitoring workflows, processing satellite imagery and sensor data to track climate change indicators at planetary scale. Blackshark.ai applies Union.ai to build and maintain digital twins of Earth's surface, processing petabytes of imagery and geographic data to create comprehensive digital representations of physical environments.
Organizations modernizing their data infrastructure leverage Union.ai to unify previously siloed data and ML operations. Porch migrated from Apache Airflow to Union.ai, achieving operational consistency between data engineering and machine learning teams while gaining the reproducibility guarantees essential for regulated industries. The platform's unified approach eliminates the need for maintaining separate tooling for batch ETL pipelines and ML training workflows.
Financial services organizations use Union.ai to optimize compute-intensive forecasting and risk modeling workflows. Spotify applies the platform to orchestrate quarterly prediction pipelines, achieving 50% reduction in forecasting cycle time while maintaining the accuracy required for business-critical decisions. Stash reduced pipeline compute costs by 67% through Union.ai's resource optimization capabilities, demonstrating the platform's ability to deliver significant operational savings at scale.
Emerging Agentic AI applications require sophisticated workflow orchestration capable of coordinating multiple AI agents executing complex, multi-step reasoning tasks. Dragonfly uses Union.ai to scale agentic research workflows across 250,000 products, enabling AI-driven research at a scale previously impossible with traditional pipeline tools. The platform's support for dynamic branching and conditional execution enables researchers to build adaptive agent behaviors that respond to intermediate results.
Organizations in biotechnology and autonomous systems should prioritize evaluation of Union.ai's dynamic workflow capabilities, as these industries frequently require adaptive pipelines that respond to experimental results. Financial services and fintech teams should focus on the cost tracking and resource optimization features, which have demonstrated 67%+ compute cost reductions in production deployments.
Getting started with Union.ai requires minimal setup for development environments, with production deployments supporting various architectural patterns depending on organizational requirements.
The platform provides a Python-native client that integrates seamlessly with existing ML toolchains:
```shell
pip install union
union login
```
The installation requires Python 3.8 or higher; self-managed deployments additionally need access to a Kubernetes cluster. Teams opting for Union's managed service can bypass infrastructure setup entirely and begin developing workflows immediately.
Creating a basic workflow requires defining tasks and composing them into a workflow:
```python
from union import task, workflow

@task
def preprocess_data(input_path: str) -> str:
    # Data preprocessing logic (placeholder: derive an output location)
    processed_path = input_path.replace("raw", "processed")
    return processed_path

@task
def train_model(data_path: str) -> str:
    # Model training logic (placeholder: derive a model artifact path)
    model_path = data_path + ".model"
    return model_path

@workflow
def ml_pipeline(input_path: str) -> str:
    processed = preprocess_data(input_path=input_path)
    model = train_model(data_path=processed)
    return model
```
This minimal example demonstrates the Python-native approach that eliminates the need for separate configuration files or YAML definitions. The @task and @workflow decorators automatically handle serialization, distributed execution, and retry logic.
Organizations should select deployment architectures based on their specific requirements:
Union Managed: The fastest path to production, with Union operating and maintaining the orchestration infrastructure. Recommended for teams prioritizing rapid development velocity over infrastructure control.
Bring Your Own Cloud (BYOC): Customers provide their own AWS, GCP, Azure, or neo-cloud accounts while Union manages the platform software. This option maintains data residency within customer-controlled VPCs while reducing operational burden. Recommended for organizations with data sovereignty requirements or existing cloud commitments.
Self-Hosted: Complete deployment on-premises, in hybrid configurations, or within air-gapped environments. Recommended for organizations with strict compliance requirements, government agencies, or those operating in environments without external network connectivity.
Development teams new to workflow orchestration should begin with Union's managed service to experience the full platform capabilities without infrastructure overhead. Production deployments handling sensitive data or requiring compliance certifications should evaluate BYOC or self-hosted options to maintain full control over data residency and infrastructure security.
Union.ai's architecture builds on Kubernetes as the underlying orchestration layer, extending containerized workload management with specialized capabilities for ML workflow automation. The platform's design philosophy emphasizes extensibility, reproducibility, and operational efficiency while maintaining simplicity for developers.
The platform integrates with the broader data science ecosystem through native support for Spark, Ray, Dask, PyTorch, and other distributed computing frameworks. Native integrations with Snowflake, Databricks, and BigQuery enable seamless data access without requiring custom connector development. The Python-native domain-specific language (DSL) allows developers to define workflows using familiar programming constructs, while metadata validation through Pandera and experiment tracking via Weights & Biases integrate into existing MLOps toolchains.
Flyte 2 introduces significant developer experience improvements, including support for local workflow execution that enables rapid iteration without cluster access. Developers can test workflows locally using the same execution engine that powers production deployments, eliminating the gap between local development and production behavior that plagues many ML platforms.
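Because tasks and workflows are ordinary Python functions, local iteration can be as simple as calling them directly. A minimal sketch, with stand-in pass-through decorators in place of the real `union` imports (which behave similarly when run locally):

```python
# Sketch: locally, decorated tasks behave like ordinary Python functions,
# so a workflow can be exercised without any cluster access.
def task(fn):          # stand-in for union's @task decorator
    return fn

def workflow(fn):      # stand-in for union's @workflow decorator
    return fn

@task
def double(x: int) -> int:
    return x * 2

@workflow
def pipeline(x: int) -> int:
    return double(x=double(x=x))

print(pipeline(x=3))  # 12
```

The same function body later runs remotely under the real decorators, which is what closes the gap between local development and production behavior.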
The dynamic workflow architecture enables runtime decisions about execution paths, branching logic, and retry behavior based on actual task outputs. This approach differs fundamentally from static pipeline definitions that must anticipate all possible execution paths at definition time. The 96% reduction in iteration time documented by Union.ai customers stems directly from this dynamic capability, eliminating the need to modify pipeline definitions when data characteristics or business logic evolves.
The platform demonstrates strong performance across key operational metrics, including task startup times under 100 milliseconds via container pooling and single workflow runs exceeding 50,000 actions.
These benchmarks reflect the platform's ability to handle production ML workloads at scale without the infrastructure overhead that characterizes traditional workflow orchestration systems.
The containerized design ensures consistent execution environments across development, testing, and production stages. Task caching eliminates redundant computation by detecting when task inputs match previously executed work, while container reuse minimizes cold-start delays that typically impact workflow execution times. The Kubernetes-native architecture enables horizontal scaling by adding worker nodes without platform modifications, supporting organizations as their ML workloads grow.
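Input-based task caching can be sketched in plain Python by keying results on a hash of the task name and its inputs. This is a conceptual illustration, not Union's implementation; `cached_task` is a hypothetical decorator:

```python
import hashlib
import json

_cache = {}

def cached_task(fn):
    """Return a cached result when the task name and inputs match prior work."""
    def wrapper(**kwargs):
        key = (fn.__name__,
               hashlib.sha256(json.dumps(kwargs, sort_keys=True).encode()).hexdigest())
        if key not in _cache:
            _cache[key] = fn(**kwargs)   # cache miss: actually run the task
        return _cache[key]               # cache hit: skip redundant compute
    return wrapper

@cached_task
def expensive(n: int) -> int:
    expensive.calls = getattr(expensive, "calls", 0) + 1
    return n * n

expensive(n=4)
expensive(n=4)
print(expensive.calls)  # 1 — the second call is served from the cache
```

The real platform applies the same idea across distributed executions, with versioned artifacts standing in for the in-memory dictionary.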
The monthly plan fee serves as a usage credit that offsets actual compute and action consumption. This structure means the plan cost effectively becomes the minimum monthly spending commitment, with any unused credit rolling forward to offset future usage charges.
An Action represents a single task execution—the specific invocation of a task with defined inputs. Each workflow execution generates multiple Actions as tasks execute, with Action count serving as the primary billing metric for Team and Enterprise plans.
Yes, Enterprise plans include custom SSO integration supporting both SAML and OIDC protocols. This enables organizations to integrate Union.ai with their existing identity management systems, maintaining centralized access control and simplifying compliance with corporate security policies.
Yes, the platform supports fully self-managed deployments including on-premises installations, hybrid configurations combining cloud and on-premises resources, and air-gapped environments for organizations requiring complete network isolation.
Resource consumption is calculated per-second based on the allocated resources (CPU, memory, GPU) for containers executing tasks. The platform reports usage at the container level, providing granular visibility into compute consumption by workflow, team, or project.
Yes, all customer data including workflow executions, source code, container images, input data, execution logs, and secrets remain within the customer's VPC. Union.ai never extracts customer data from customer-controlled environments, ensuring data sovereignty and minimizing security exposure.
Fanout refers to the total number of Actions created by a workflow execution—the aggregate count of individual task invocations across the entire pipeline. Concurrency represents the maximum number of Actions executing simultaneously at any given moment during workflow execution. Understanding this distinction helps organizations optimize workflow designs for their specific throughput requirements.
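The distinction can be made concrete with a small plain-Python sketch (hypothetical, using a thread pool as a stand-in for the platform's scheduler): 100 actions fan out, but at most 8 run concurrently.

```python
# Sketch: fanout = total actions created; concurrency = actions running
# at the same moment. Here the fanout is 100 with a concurrency cap of 8.
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()
active = 0
peak = 0

def action(i: int) -> int:
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)   # record the highest observed concurrency
    result = i * i                 # stand-in task body
    with lock:
        active -= 1
    return result

fanout = 100
with ThreadPoolExecutor(max_workers=8) as pool:  # concurrency cap
    results = list(pool.map(action, range(fanout)))

print(len(results), peak)  # 100 actions total; peak concurrency never exceeds 8
```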
Yes, BYOC deployments run Union.ai within customer-provided AWS, GCP, Azure, or neo-cloud accounts. This model provides the operational simplicity of managed services while maintaining full data residency within customer-controlled cloud infrastructure.
Union.ai offers two primary pricing tiers designed to accommodate teams at different scales and organizational requirements.
The Team plan includes $950 per month of usage credit (billed monthly), offering an entry point for teams adopting ML orchestration. The Enterprise plan provides custom pricing tailored to organizational requirements.
Compute resources are billed separately based on actual consumption:
| Resource | Price |
|---|---|
| vCPU | $0.0417/hour |
| Memory (GB) | $0.0051/hour |
| GPU (T4g) | $0.1516/hour |
| GPU (A100) | $0.6176/hour |
| GPU (H100) | $1.3760/hour |
| Action (Base) | $0.0075/action |
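As a rough illustration of how these rates combine, the helper below (hypothetical, not an official calculator) estimates the cost of a single task from the hourly list prices; actual billing is per-second on allocated resources, as described earlier.

```python
# Hypothetical cost estimator built from the published hourly list prices.
RATES_PER_HOUR = {"vcpu": 0.0417, "mem_gb": 0.0051, "gpu_a100": 0.6176}
ACTION_PRICE = 0.0075  # base price per action

def task_cost(seconds: float, vcpus: float, mem_gb: float,
              a100s: int = 0, actions: int = 1) -> float:
    hours = seconds / 3600
    compute = hours * (vcpus * RATES_PER_HOUR["vcpu"]
                       + mem_gb * RATES_PER_HOUR["mem_gb"]
                       + a100s * RATES_PER_HOUR["gpu_a100"])
    return compute + actions * ACTION_PRICE

# A 10-minute task on 4 vCPU / 16 GB comes to roughly $0.0489:
print(round(task_cost(600, 4, 16), 4))
```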
For detailed pricing information, visit https://union.ai/pricing.
Ready to transform your ML operations? Visit https://union.ai to start your journey, or explore the documentation at https://www.union.ai/docs/ for technical details.