Union.ai is an enterprise AI orchestration platform built on Flyte, supporting the complete AI development lifecycle from experiment to production. It offers dynamic workflows, agentic AI runtime, and multi-cloud deployment, serving 30+ Fortune 100 companies with proven ROI in ML operations.




The transition from machine learning experimentation to production deployment remains one of the most significant challenges facing data science teams today. Organizations invest heavily in building sophisticated models, only to encounter bottlenecks when orchestrating complex ML workflows across distributed infrastructure. The fragmentation between data processing, model training, and inference stages creates operational silos, while the complexity of managing multi-cloud environments adds another layer of overhead that diverts engineering resources from actual model development.
Union.ai addresses these challenges by providing an enterprise-grade AI orchestration platform built on Flyte, the open-source workflow automation engine originally developed at Lyft in 2016. The platform unifies the entire ML development lifecycle—from data preparation and feature engineering through model training and deployment—into a single, coherent system that eliminates the friction traditionally associated with moving ML projects from prototype to production scale.
With more than 30 Fortune 100 companies trusting Union.ai to power their AI initiatives, including industry leaders such as Spotify, Toyota (Woven by Toyota), Johnson & Johnson, and Lockheed, the platform has proven its capability to handle mission-critical workloads at enterprise scale. Companies leverage Union.ai to orchestrate everything from large-scale model training pipelines to real-time inference services, all within a unified architecture that provides consistent visibility, reproducibility, and cost control across the development lifecycle.
Union.ai delivers a comprehensive set of capabilities designed to address the full spectrum of ML engineering challenges, from individual task execution to enterprise-wide workflow orchestration. The platform's architecture emphasizes developer productivity, operational efficiency, and seamless scaling without requiring teams to sacrifice control over their infrastructure.
The platform enables teams to author workflows entirely in Python, supporting runtime-defined branching, loops, and automatic retry logic that adapts to execution context. This dynamic approach eliminates the need for static pipeline definitions that break when real-world data introduces unexpected conditions. The Agentic AI runtime extends these capabilities to orchestrate complex multi-agent workflows, supporting use cases ranging from automated research synthesis to adaptive data processing pipelines. Organizations have demonstrated the ability to execute more than 50,000 actions in a single workflow run, enabling massive parallelization of compute-intensive tasks.
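The idea of runtime-defined branching can be illustrated with a small plain-Python sketch. This is conceptual, not Union's actual API; `clean_records` is a hypothetical helper that picks its execution path from actual task output rather than a static pipeline definition:

```python
# Conceptual sketch (plain Python, not Union's API): the execution path is
# chosen at runtime from the data itself, not fixed at definition time.
def clean_records(records):
    """Drop missing values, or fall back to imputation if too many are missing."""
    cleaned = [r for r in records if r is not None]
    if len(cleaned) < len(records) / 2:
        # Runtime branch: more than half the rows are bad, so repair
        # (impute zeros) instead of dropping and starving downstream tasks.
        cleaned = [r if r is not None else 0 for r in records]
    return cleaned

print(clean_records([1, None, 2, None, None]))  # [1, 0, 2, 0, 0]
```

A static DAG would have to anticipate both paths up front; here the branch simply falls out of ordinary Python control flow.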
Union.ai simplifies distributed model training by providing automatic resource provisioning and scaling across Kubernetes clusters. The platform handles the complexity of coordinating PyTorch, TensorFlow, and other training frameworks across multiple nodes, while maintaining full reproducibility through automatic caching of intermediate results and version-controlled artifact management. Teams can scale from single-node experiments to multi-GPU training clusters without modifying their code, with the platform handling all cluster provisioning and cleanup automatically.
The unified training and inference architecture eliminates the traditional separation between development and production environments, enabling models to move seamlessly from training to serving within the same platform. Dynamic resource allocation ensures that inference endpoints scale automatically based on demand, while the <100ms latency target ensures suitability for real-time applications. Organizations deploy inference services alongside training pipelines, creating continuous learning loops where model performance metrics automatically trigger retraining workflows.
Comprehensive observability tools provide visibility across the entire ML development lifecycle, with cost allocation dashboards that attribute spending to specific teams, projects, or individual workflow executions. Data lineage tracking enables teams to trace predictions back to specific training data versions, supporting compliance requirements and debugging workflows. Integration with monitoring systems like Prometheus and Grafana ensures that ML operations integrate seamlessly with existing operational infrastructure.
Enterprise deployments benefit from role-based access control (RBAC), single sign-on support via SAML and OIDC protocols, and VPC isolation that keeps sensitive workloads separated from shared infrastructure. The platform maintains SOC 2 Type I and Type II certifications alongside HIPAA compliance, addressing the stringent security requirements of healthcare, financial services, and government customers. All customer data, including workflow executions, code, images, logs, and secrets, remains within the customer's VPC, ensuring data sovereignty and minimizing exposure to third-party systems.
The container pooling mechanism maintains a warm pool of pre-initialized containers, reducing task startup time to under 100 milliseconds by eliminating the traditional container initialization overhead. Remote debugging capabilities allow engineers to attach debuggers directly to tasks running on the actual production infrastructure, enabling line-by-line inspection of remote task execution without requiring local reproduction of complex environment configurations.
Organizations retain full control over their infrastructure choices through support for bring-your-own-cloud (BYOC) deployments across AWS, GCP, Azure, and neo-cloud environments. Self-hosted deployment options support on-premises, hybrid, and air-gapped configurations for organizations with specific compliance or data residency requirements. This flexibility enables enterprises to execute multi-cloud strategies without platform lock-in while maintaining consistent tooling and workflow definitions across environments.
Union.ai serves organizations across diverse industries, with particular strength in sectors requiring large-scale compute orchestration, rigorous reproducibility, and strict data governance. The following case studies illustrate how different industries leverage the platform to address their unique challenges.
The biotechnology sector relies on Union.ai to accelerate drug discovery and genomic analysis workflows that require processing vast datasets across thousands of parallel compute tasks. Rezo utilizes the platform to orchestrate drug discovery pipelines, achieving over 90% reduction in compute costs while dramatically accelerating the identification of promising therapeutic candidates. Artera leverages Union.ai to personalize cancer treatments by analyzing patient-specific data at scale, while Delve Bio applies the platform to accelerate infectious disease diagnosis through rapid pathogen identification. Cradle uses Union.ai to streamline protein design workflows, enabling ML researchers to iterate on protein structures faster than traditional laboratory approaches permit.
Autonomous vehicle development demands efficient orchestration of massive data processing pipelines, simulation workloads, and continuous model training cycles. Woven by Toyota employs Union.ai to manage the computational infrastructure supporting autonomous vehicle development, generating millions of dollars in savings while enabling unprecedented scaling of autonomous driving research. Wayve leverages the platform's dynamic workflow capabilities to accelerate autonomous driving R&D, using the platform's ability to coordinate complex multi-stage training pipelines across distributed infrastructure.
Organizations processing global-scale geospatial data benefit from Union.ai's ability to coordinate massive parallel processing workloads across geographically distributed compute resources. MethaneSAT uses the platform to orchestrate global methane emission monitoring workflows, processing satellite imagery and sensor data to track climate change indicators at planetary scale. Blackshark.ai applies Union.ai to build and maintain digital twins of Earth's surface, processing petabytes of imagery and geographic data to create comprehensive digital representations of physical environments.
Organizations modernizing their data infrastructure leverage Union.ai to unify previously siloed data and ML operations. Porch migrated from Apache Airflow to Union.ai, achieving operational consistency between data engineering and machine learning teams while gaining the reproducibility guarantees essential for regulated industries. The platform's unified approach eliminates the need for maintaining separate tooling for batch ETL pipelines and ML training workflows.
Financial services organizations use Union.ai to optimize compute-intensive forecasting and risk modeling workflows. Spotify applies the platform to orchestrate quarterly prediction pipelines, achieving 50% reduction in forecasting cycle time while maintaining the accuracy required for business-critical decisions. Stash reduced pipeline compute costs by 67% through Union.ai's resource optimization capabilities, demonstrating the platform's ability to deliver significant operational savings at scale.
Emerging Agentic AI applications require sophisticated workflow orchestration capable of coordinating multiple AI agents executing complex, multi-step reasoning tasks. Dragonfly uses Union.ai to scale agentic research workflows across 250,000 products, enabling AI-driven research at a scale previously impossible with traditional pipeline tools. The platform's support for dynamic branching and conditional execution enables researchers to build adaptive agent behaviors that respond to intermediate results.
Organizations in biotechnology and autonomous systems should prioritize evaluation of Union.ai's dynamic workflow capabilities, as these industries frequently require adaptive pipelines that respond to experimental results. Financial services and fintech teams should focus on the cost tracking and resource optimization features, which have demonstrated 67%+ compute cost reductions in production deployments.
Getting started with Union.ai requires minimal setup for development environments, with production deployments supporting various architectural patterns depending on organizational requirements.
The platform provides a Python-native client that integrates seamlessly with existing ML toolchains:
```shell
pip install union
union login
```
The installation requires Python 3.8 or higher; self-managed deployments additionally need access to a Kubernetes cluster. Teams opting for Union's managed service can bypass infrastructure setup entirely and begin developing workflows immediately.
Creating a basic workflow requires defining tasks and composing them into a workflow:
```python
from union import task, workflow

@task
def preprocess_data(input_path: str) -> str:
    # Data preprocessing logic (placeholder: derive an output location)
    processed_path = input_path.replace("raw", "processed")
    return processed_path

@task
def train_model(data_path: str) -> str:
    # Model training logic (placeholder: derive a model artifact path)
    model_path = data_path + ".model"
    return model_path

@workflow
def ml_pipeline(input_path: str) -> str:
    processed = preprocess_data(input_path=input_path)
    model = train_model(data_path=processed)
    return model
```
This minimal example demonstrates the Python-native approach that eliminates the need for separate configuration files or YAML definitions. The @task and @workflow decorators automatically handle serialization, distributed execution, and retry logic.
Organizations should select deployment architectures based on their specific requirements:
Union Managed: The fastest path to production, with Union operating and maintaining the orchestration infrastructure. Recommended for teams prioritizing rapid development velocity over infrastructure control.
Bring Your Own Cloud (BYOC): Customers provide their own AWS, GCP, Azure, or neo-cloud accounts while Union manages the platform software. This option maintains data residency within customer-controlled VPCs while reducing operational burden. Recommended for organizations with data sovereignty requirements or existing cloud commitments.
Self-Hosted: Complete deployment on-premises, in hybrid configurations, or within air-gapped environments. Recommended for organizations with strict compliance requirements, government agencies, or those operating in environments without external network connectivity.
Development teams new to workflow orchestration should begin with Union's managed service to experience the full platform capabilities without infrastructure overhead. Production deployments handling sensitive data or requiring compliance certifications should evaluate BYOC or self-hosted options to maintain full control over data residency and infrastructure security.
Union.ai's architecture builds on Kubernetes as the underlying orchestration layer, extending containerized workload management with specialized capabilities for ML workflow automation. The platform's design philosophy emphasizes extensibility, reproducibility, and operational efficiency while maintaining simplicity for developers.
The platform integrates with the broader data science ecosystem through native support for Spark, Ray, Dask, PyTorch, and other distributed computing frameworks. Native integrations with Snowflake, Databricks, and BigQuery enable seamless data access without requiring custom connector development. The Python-native domain-specific language (DSL) allows developers to define workflows using familiar programming constructs, while metadata validation through Pandera and experiment tracking via Weights & Biases integrate into existing MLOps toolchains.
Flyte 2 introduces significant developer experience improvements, including support for local workflow execution that enables rapid iteration without cluster access. Developers can test workflows locally using the same execution engine that powers production deployments, eliminating the gap between local development and production behavior that plagues many ML platforms.
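Because tasks and workflows are ordinary Python functions, local iteration can be as simple as calling them directly. A minimal sketch, with stand-in pass-through decorators in place of the real `union` imports (which behave similarly when run locally):

```python
# Sketch: locally, decorated tasks behave like ordinary Python functions,
# so a workflow can be exercised without any cluster access.
def task(fn):          # stand-in for union's @task decorator
    return fn

def workflow(fn):      # stand-in for union's @workflow decorator
    return fn

@task
def double(x: int) -> int:
    return x * 2

@workflow
def pipeline(x: int) -> int:
    return double(x=double(x=x))

print(pipeline(x=3))  # 12
```

The same function body later runs remotely under the real decorators, which is what closes the gap between local development and production behavior.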
The dynamic workflow architecture enables runtime decisions about execution paths, branching logic, and retry behavior based on actual task outputs. This approach differs fundamentally from static pipeline definitions that must anticipate all possible execution paths at definition time. The 96% reduction in iteration time documented by Union.ai customers stems directly from this dynamic capability, eliminating the need to modify pipeline definitions when data characteristics or business logic evolves.
The platform demonstrates strong performance across key operational metrics, including task startup times under 100 milliseconds via container pooling and single workflow runs exceeding 50,000 actions.
These benchmarks reflect the platform's ability to handle production ML workloads at scale without the infrastructure overhead that characterizes traditional workflow orchestration systems.
The containerized design ensures consistent execution environments across development, testing, and production stages. Task caching eliminates redundant computation by detecting when task inputs match previously executed work, while container reuse minimizes cold-start delays that typically impact workflow execution times. The Kubernetes-native architecture enables horizontal scaling by adding worker nodes without platform modifications, supporting organizations as their ML workloads grow.
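Input-based task caching can be sketched in plain Python by keying results on a hash of the task name and its inputs. This is a conceptual illustration, not Union's implementation; `cached_task` is a hypothetical decorator:

```python
import hashlib
import json

_cache = {}

def cached_task(fn):
    """Return a cached result when the task name and inputs match prior work."""
    def wrapper(**kwargs):
        key = (fn.__name__,
               hashlib.sha256(json.dumps(kwargs, sort_keys=True).encode()).hexdigest())
        if key not in _cache:
            _cache[key] = fn(**kwargs)   # cache miss: actually run the task
        return _cache[key]               # cache hit: skip redundant compute
    return wrapper

@cached_task
def expensive(n: int) -> int:
    expensive.calls = getattr(expensive, "calls", 0) + 1
    return n * n

expensive(n=4)
expensive(n=4)
print(expensive.calls)  # 1 — the second call is served from the cache
```

The real platform applies the same idea across distributed executions, with versioned artifacts standing in for the in-memory dictionary.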
The monthly plan fee serves as a usage credit that offsets actual compute and action consumption. This structure means the plan cost effectively becomes the minimum monthly spending commitment, with any unused credit rolling forward to offset future usage charges.
An Action represents a single task execution—the specific invocation of a task with defined inputs. Each workflow execution generates multiple Actions as tasks execute, with Action count serving as the primary billing metric for Team and Enterprise plans.
Yes, Enterprise plans include custom SSO integration supporting both SAML and OIDC protocols. This enables organizations to integrate Union.ai with their existing identity management systems, maintaining centralized access control and simplifying compliance with corporate security policies.
Yes, the platform supports fully self-managed deployments including on-premises installations, hybrid configurations combining cloud and on-premises resources, and air-gapped environments for organizations requiring complete network isolation.
Resource consumption is calculated per-second based on the allocated resources (CPU, memory, GPU) for containers executing tasks. The platform reports usage at the container level, providing granular visibility into compute consumption by workflow, team, or project.
Yes, all customer data including workflow executions, source code, container images, input data, execution logs, and secrets remain within the customer's VPC. Union.ai never extracts customer data from customer-controlled environments, ensuring data sovereignty and minimizing security exposure.
Fanout refers to the total number of Actions created by a workflow execution—the aggregate count of individual task invocations across the entire pipeline. Concurrency represents the maximum number of Actions executing simultaneously at any given moment during workflow execution. Understanding this distinction helps organizations optimize workflow designs for their specific throughput requirements.
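The distinction can be made concrete with a small plain-Python sketch (hypothetical, using a thread pool as a stand-in for the platform's scheduler): 100 actions fan out, but at most 8 run concurrently.

```python
# Sketch: fanout = total actions created; concurrency = actions running
# at the same moment. Here the fanout is 100 with a concurrency cap of 8.
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()
active = 0
peak = 0

def action(i: int) -> int:
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)   # record the highest observed concurrency
    result = i * i                 # stand-in task body
    with lock:
        active -= 1
    return result

fanout = 100
with ThreadPoolExecutor(max_workers=8) as pool:  # concurrency cap
    results = list(pool.map(action, range(fanout)))

print(len(results), peak)  # 100 actions total; peak concurrency never exceeds 8
```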
Yes, BYOC deployments run Union.ai within customer-provided AWS, GCP, Azure, or neo-cloud accounts. This model provides the operational simplicity of managed services while maintaining full data residency within customer-controlled cloud infrastructure.
Union.ai offers two primary pricing tiers designed to accommodate teams at different scales and organizational requirements.
The Team plan includes $950 per month of usage credit (billed monthly), offering an entry point for teams adopting ML orchestration. The Enterprise plan provides custom pricing tailored to organizational requirements.
Compute resources are billed separately based on actual consumption:
| Resource | Price |
|---|---|
| vCPU | $0.0417/hour |
| Memory (GB) | $0.0051/hour |
| GPU (T4g) | $0.1516/hour |
| GPU (A100) | $0.6176/hour |
| GPU (H100) | $1.3760/hour |
| Action (Base) | $0.0075/action |
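As a rough illustration of how these rates combine, the helper below (hypothetical, not an official calculator) estimates the cost of a single task from the hourly list prices; actual billing is per-second on allocated resources, as described earlier.

```python
# Hypothetical cost estimator built from the published hourly list prices.
RATES_PER_HOUR = {"vcpu": 0.0417, "mem_gb": 0.0051, "gpu_a100": 0.6176}
ACTION_PRICE = 0.0075  # base price per action

def task_cost(seconds: float, vcpus: float, mem_gb: float,
              a100s: int = 0, actions: int = 1) -> float:
    hours = seconds / 3600
    compute = hours * (vcpus * RATES_PER_HOUR["vcpu"]
                       + mem_gb * RATES_PER_HOUR["mem_gb"]
                       + a100s * RATES_PER_HOUR["gpu_a100"])
    return compute + actions * ACTION_PRICE

# A 10-minute task on 4 vCPU / 16 GB comes to roughly $0.0489:
print(round(task_cost(600, 4, 16), 4))
```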
For detailed pricing information, visit https://union.ai/pricing.
Ready to transform your ML operations? Visit https://union.ai to start your journey, or explore the documentation at https://www.union.ai/docs/ for technical details.