Dependencies and DAG Orchestration¶

Master service dependencies and build reliable orchestration graphs with Krill.

Table of Contents¶

Overview
Dependency Basics
Dependency Conditions
Startup Order
Failure Handling
Common Patterns
Best Practices
Troubleshooting

Overview¶

Krill uses a Directed Acyclic Graph (DAG) to orchestrate service startup, ensuring that:

- Services start in the correct order
- Dependencies are satisfied before dependents start
- Failures cascade appropriately to dependent services
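
The ordering guarantee can be illustrated with a topological sort. The sketch below is not Krill's implementation; it assumes a recipe has already been reduced to a hypothetical mapping from each service name to the names it depends on, and uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory form of a recipe: service name -> names it depends on.
dependencies = {
    "database": [],
    "api": ["database"],
}

def startup_order(deps):
    """Return one valid order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())

print(startup_order(dependencies))  # ['database', 'api']
```

When several services are independent of each other, more than one order is valid; the sort only guarantees that a dependency never appears after a service that needs it.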

Dependency Basics¶

Simple Dependencies¶

The simplest form waits for a service to start:

services:
  database:
    execute:
      type: docker
      image: postgres:15

  api:
    execute:
      type: pixi
      task: start-api
    dependencies:
      - database  # Wait for database to start

Multiple Dependencies¶

Services can depend on multiple others:

services:
  lidar:
    execute:
      type: ros2
      package: ldlidar_ros2
      launch_file: ldlidar.launch.py

  camera:
    execute:
      type: ros2
      package: realsense2_camera
      launch_file: rs_launch.py

  perception:
    execute:
      type: pixi
      task: run-perception
    dependencies:
      - lidar
      - camera  # Wait for both sensors

Dependency Conditions¶

Control when a dependency is considered satisfied:

`started` (Default)¶

Dependency is satisfied when the service process has started:

dependencies:
  - service-name  # Shorthand for "started"
  - other-service: started  # Explicit

Use when:

- Service doesn't need to be fully initialized
- You just need the process running
- Fast startup is more important than readiness

`healthy`¶

Dependency is satisfied when the service's health check passes (requires a configured health_check):

dependencies:
  - service-name: healthy

Use when:

- Dependent needs service to be fully operational
- Service exposes a health endpoint or port
- Correctness is more important than speed

Requirements:

- Dependency must have a health_check configured
- Health check must pass before dependent starts
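
The shorthand and explicit forms can be read into one uniform shape. This is a minimal sketch, assuming the recipe YAML has already been parsed so that `- service-name` arrives as a string and `- service-name: healthy` as a one-key mapping. `normalize_dependency` is a hypothetical helper, not part of Krill.

```python
# Only the two conditions described in this guide are accepted.
KNOWN_CONDITIONS = ("started", "healthy")

def normalize_dependency(entry):
    """Turn one parsed dependency entry into a (service_name, condition) pair."""
    if isinstance(entry, str):
        # Shorthand form: a bare name means the default "started" condition.
        return entry, "started"
    if isinstance(entry, dict) and len(entry) == 1:
        name, condition = next(iter(entry.items()))
        if condition not in KNOWN_CONDITIONS:
            raise ValueError(f"unknown condition {condition!r} for {name!r}")
        return name, condition
    raise ValueError(f"unsupported dependency entry: {entry!r}")

print(normalize_dependency("service-name"))              # ('service-name', 'started')
print(normalize_dependency({"other-service": "healthy"}))  # ('other-service', 'healthy')
```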

Mixed Conditions¶

Combine different conditions for different dependencies:

services:
  database:
    execute:
      type: docker
      image: postgres:15
    health_check:
      type: tcp
      port: 5432

  cache:
    execute:
      type: docker
      image: redis:7
    health_check:
      type: tcp
      port: 6379

  worker:
    execute:
      type: shell
      command: python worker.py

  api:
    execute:
      type: pixi
      task: start-api
    dependencies:
      - database: healthy  # Must be ready for queries
      - cache: healthy     # Must be ready for caching
      - worker: started    # Just needs to be running

Startup Order¶

Linear Chain¶

Services start sequentially:

services:
  step1:
    execute:
      type: shell
      command: ./init.sh

  step2:
    execute:
      type: pixi
      task: setup
    dependencies:
      - step1

  step3:
    execute:
      type: docker
      image: app:latest
    dependencies:
      - step2

Startup: step1 → step2 → step3

Parallel Branches¶

Independent services start in parallel:

services:
  # These start in parallel
  sensor-a:
    execute:
      type: ros2
      package: sensor_a
      launch_file: sensor.launch.py

  sensor-b:
    execute:
      type: ros2
      package: sensor_b
      launch_file: sensor.launch.py

  # This waits for both
  fusion:
    execute:
      type: pixi
      task: sensor-fusion
    dependencies:
      - sensor-a: healthy
      - sensor-b: healthy

Startup: sensor-a and sensor-b start together → fusion starts when both are healthy

Diamond Pattern¶

Multiple paths converge:

services:
  config:
    execute:
      type: shell
      command: ./load-config.sh

  service-a:
    execute:
      type: pixi
      task: start-a
    dependencies:
      - config

  service-b:
    execute:
      type: pixi
      task: start-b
    dependencies:
      - config

  aggregator:
    execute:
      type: docker
      image: aggregator:latest
    dependencies:
      - service-a: healthy
      - service-b: healthy

Startup: config → (service-a and service-b in parallel) → aggregator
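
The diamond's startup sequence can be reproduced by grouping services into "waves" whose dependencies are all satisfied. This is an illustrative sketch using Python's `graphlib` (3.9+), not a description of Krill's scheduler; the `deps` mapping is a hand-written stand-in for the recipe above.

```python
from graphlib import TopologicalSorter

def startup_waves(deps):
    """Group services into waves; every service in a wave can start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as satisfied
    return waves

deps = {
    "config": [],
    "service-a": ["config"],
    "service-b": ["config"],
    "aggregator": ["service-a", "service-b"],
}
print(startup_waves(deps))
# [['config'], ['service-a', 'service-b'], ['aggregator']]
```

The wave model is a simplification: an orchestrator can start a service as soon as its own dependencies are satisfied, without waiting for unrelated services in the same wave.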

Layered Architecture¶

Build complex dependency graphs:

services:
  # Layer 1: Hardware
  lidar:
    execute:
      type: ros2
      package: ldlidar_ros2
      launch_file: ldlidar.launch.py
    health_check:
      type: tcp
      port: 4048

  camera:
    execute:
      type: ros2
      package: realsense2_camera
      launch_file: rs_launch.py
    health_check:
      type: tcp
      port: 8554

  # Layer 2: Perception
  slam:
    execute:
      type: pixi
      task: run-slam
    dependencies:
      - lidar: healthy
      - camera: healthy
    health_check:
      type: heartbeat
      timeout: 5s

  object-detection:
    execute:
      type: docker
      image: detection:latest
    dependencies:
      - camera: healthy
    health_check:
      type: http
      port: 8080

  # Layer 3: Planning
  path-planner:
    execute:
      type: ros2
      package: nav2_bringup
      launch_file: navigation_launch.py
    dependencies:
      - slam: healthy
      - object-detection: healthy
    health_check:
      type: tcp
      port: 9090

  # Layer 4: Control
  controller:
    execute:
      type: pixi
      task: run-controller
    dependencies:
      - path-planner: healthy
    critical: true

Failure Handling¶

Cascading Failures¶

When a service fails, Krill automatically stops all dependent services:

services:
  database:
    execute:
      type: docker
      image: postgres:15

  api:
    dependencies:
      - database
    # If database fails, api is automatically stopped

Behavior:

1. Database fails
2. Krill detects failure
3. API is stopped (cascade)
4. System settles into a safe state
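
To predict which services a cascade would reach, one can walk the dependency graph in reverse from the failed service. The sketch below assumes a hypothetical mapping from each service to the names it depends on; it models the cascade described above and is not Krill's own code.

```python
from collections import deque

def dependents_to_stop(deps, failed):
    """Return every service that depends, directly or transitively, on `failed`."""
    # Invert the graph: dependency name -> services that list it as a dependency.
    reverse = {}
    for name, needs in deps.items():
        for dep in needs:
            reverse.setdefault(dep, set()).add(name)

    to_stop, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for child in reverse.get(current, ()):
            if child not in to_stop:
                to_stop.add(child)
                queue.append(child)
    return to_stop

# `web` is a hypothetical third service added to show a transitive cascade.
deps = {"database": [], "api": ["database"], "web": ["api"], "cache": []}
print(sorted(dependents_to_stop(deps, "database")))  # ['api', 'web']
```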

Critical Services¶

Mark a service as critical so that its failure triggers an emergency stop of the whole system:

services:
  safety-monitor:
    execute:
      type: pixi
      task: safety-check
    critical: true  # Failure stops ALL services
    health_check:
      type: heartbeat
      timeout: 1s

  motor-controller:
    dependencies:
      - safety-monitor: healthy

Behavior:

1. Safety-monitor fails
2. Krill triggers emergency stop
3. ALL services are stopped immediately
4. System enters safe state

Restart Policies¶

Control how failures affect the dependency graph:

services:
  flaky-sensor:
    execute:
      type: shell
      command: ./sensor-reader
    policy:
      restart: on-failure
      max_restarts: 3
      restart_delay: 2s
    health_check:
      type: tcp
      port: 5000

  processor:
    dependencies:
      - flaky-sensor: healthy
    # Waits for sensor to restart and become healthy
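
The effect of `max_restarts` and `restart_delay` can be pictured as a bounded retry loop. This is a sketch under stated assumptions: `run_once` is a hypothetical callable that returns True when the service ends up running and False when it fails, and `max_restarts` is assumed to count restarts after the initial attempt. Krill's exact accounting may differ.

```python
import time

def supervise(run_once, max_restarts=3, restart_delay=2.0, sleep=time.sleep):
    """Run a service, restarting on failure; return (succeeded, restarts_used)."""
    restarts = 0
    while True:
        if run_once():
            return True, restarts
        if restarts == max_restarts:
            return False, restarts   # restart budget exhausted; give up
        restarts += 1
        sleep(restart_delay)         # wait before the next attempt

# Simulated sensor that fails twice and then comes up (no real delay here).
attempts = iter([False, False, True])
print(supervise(lambda: next(attempts), sleep=lambda seconds: None))  # (True, 2)
```

Under this model a dependent such as `processor` stays blocked while the loop is retrying and can proceed only once an attempt succeeds and the health check passes.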

Common Patterns¶

Database-Backed Application¶

services:
  postgres:
    execute:
      type: docker
      image: postgres:15
      volumes:
        - "./data:/var/lib/postgresql/data"
    health_check:
      type: tcp
      port: 5432
    policy:
      restart: on-failure

  migrations:
    execute:
      type: shell
      command: alembic upgrade head
    dependencies:
      - postgres: healthy
    # Runs once, exits when done

  backend:
    execute:
      type: pixi
      task: start-backend
    dependencies:
      - migrations  # Shorthand for "started": satisfied when the migration process starts
    health_check:
      type: http
      port: 8000
      path: /health
    policy:
      restart: always

ROS2 Robot Stack¶

services:
  # Hardware drivers
  motors:
    execute:
      type: ros2
      package: motor_driver
      launch_file: motors.launch.py
    health_check:
      type: tcp
      port: 7000

  sensors:
    execute:
      type: ros2
      package: sensor_suite
      launch_file: sensors.launch.py
    health_check:
      type: tcp
      port: 7001

  # Middle layer
  localization:
    execute:
      type: ros2
      package: robot_localization
      launch_file: ekf.launch.py
    dependencies:
      - motors: healthy
      - sensors: healthy

  # High level
  navigation:
    execute:
      type: ros2
      package: nav2_bringup
      launch_file: navigation_launch.py
    dependencies:
      - localization: started
    critical: true

Microservices with Monitoring¶

services:
  # Infrastructure
  prometheus:
    execute:
      type: docker
      image: prom/prometheus:latest
      ports:
        - "9090:9090"
    health_check:
      type: http
      port: 9090

  # Services
  auth-service:
    execute:
      type: docker
      image: auth:v1
    health_check:
      type: http
      port: 8001
      path: /health

  user-service:
    execute:
      type: docker
      image: users:v1
    dependencies:
      - auth-service: healthy
    health_check:
      type: http
      port: 8002

  api-gateway:
    execute:
      type: docker
      image: gateway:v1
      ports:
        - "80:80"
    dependencies:
      - auth-service: healthy
      - user-service: healthy
    health_check:
      type: http
      port: 80

  # Monitoring depends on all services starting
  grafana:
    execute:
      type: docker
      image: grafana/grafana:latest
      ports:
        - "3000:3000"
    dependencies:
      - prometheus: started
      - api-gateway: started

Development Environment¶

services:
  # Start database first
  dev-db:
    execute:
      type: docker
      image: postgres:15
      ports:
        - "5432:5432"
    health_check:
      type: tcp
      port: 5432

  # Run migrations
  dev-migrate:
    execute:
      type: shell
      command: npm run migrate
    dependencies:
      - dev-db: healthy

  # Start backend with hot reload
  dev-backend:
    execute:
      type: shell
      command: npm run dev
      working_dir: ./backend
    dependencies:
      - dev-migrate
    health_check:
      type: http
      port: 3001
    policy:
      restart: on-failure

  # Start frontend with hot reload
  dev-frontend:
    execute:
      type: shell
      command: npm run dev
      working_dir: ./frontend
    dependencies:
      - dev-backend: started
    health_check:
      type: http
      port: 3000

Best Practices¶

1. Use Health Checks for Readiness¶

Use the healthy condition whenever the dependent needs the service to be ready, not just running:

# ❌ Bad: API starts before DB is ready
api:
  dependencies:
    - database  # Just "started", might not be ready

# ✅ Good: API waits for DB to be ready
api:
  dependencies:
    - database: healthy

2. Minimize Dependency Chains¶

Shorter chains start faster and are easier to debug:

# ❌ Bad: Long sequential chain
a: {}
b:
  dependencies: [a]
c:
  dependencies: [b]
d:
  dependencies: [c]

# ✅ Good: Parallel where possible
a: {}
b: {}
c: {}
d:
  dependencies: [a, b, c]
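
The difference between the two layouts can be quantified as the length of the longest dependency chain, which bounds how many sequential startup steps are needed. A small sketch, assuming an acyclic mapping from service name to dependency names; `chain_depth` is a hypothetical helper.

```python
from functools import lru_cache

def chain_depth(deps):
    """Return the number of services on the longest dependency chain."""
    @lru_cache(maxsize=None)
    def depth(name):
        # A service with no dependencies has depth 1.
        return 1 + max((depth(d) for d in deps.get(name, ())), default=0)
    return max((depth(name) for name in deps), default=0)

sequential = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
parallel = {"a": [], "b": [], "c": [], "d": ["a", "b", "c"]}
print(chain_depth(sequential))  # 4
print(chain_depth(parallel))    # 2
```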

3. Use Critical Flag Sparingly¶

Reserve critical for truly safety-critical services:

# ✅ Good: Critical for safety
emergency-stop:
  critical: true

# ❌ Bad: Dashboard isn't safety-critical
dashboard:
  critical: true  # Don't stop everything if dashboard fails

4. Layer Your Architecture¶

Group services into logical layers:

# Layer 1: Infrastructure
# Layer 2: Data/Storage
# Layer 3: Business Logic
# Layer 4: API/Interface

5. Handle Circular Dependencies¶

Krill rejects circular dependencies. If you encounter this:

# ❌ This will fail
service-a:
  dependencies: [service-b]
service-b:
  dependencies: [service-a]

Solutions:

- Redesign to remove circular dependency
- Split into smaller services
- Use message queues for loose coupling
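
A cycle can be located before the recipe is ever run. The sketch below uses `graphlib.CycleError` from the Python standard library (3.9+) on a hypothetical mapping of service name to dependency names; it is an independent check, not the validation Krill itself performs.

```python
from graphlib import CycleError, TopologicalSorter

def find_cycle(deps):
    """Return one cycle as a list of service names, or None if the graph is acyclic."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError as err:
        # The second element of args is the cycle; its first and last nodes are equal.
        return err.args[1]
    return None

deps = {"service-a": ["service-b"], "service-b": ["service-a"]}
cycle = find_cycle(deps)
if cycle:
    print("circular dependency:", " -> ".join(cycle))
```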

6. Test Failure Scenarios¶

Verify your dependency graph handles failures correctly:

# Start system
krill up recipe.yaml

# Kill a service and observe cascades
krill service stop service-name

# Check dependent services stopped correctly

Troubleshooting¶

Services Start Out of Order¶

Check:

- Dependencies are correctly specified
- Health checks are configured for healthy dependencies
- No typos in service names
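
Typos in service names can be caught mechanically by comparing every dependency against the set of defined services. A minimal sketch, assuming the recipe's `services` mapping has been parsed into Python dictionaries; `unknown_dependencies` is a hypothetical helper.

```python
def unknown_dependencies(services):
    """List (service, dependency) pairs whose dependency is not a defined service."""
    problems = []
    for name, spec in services.items():
        for entry in (spec or {}).get("dependencies", []):
            # Entries are either "name" or a one-key mapping {"name": condition}.
            dep = entry if isinstance(entry, str) else next(iter(entry))
            if dep not in services:
                problems.append((name, dep))
    return problems

services = {
    "database": {},
    "api": {"dependencies": [{"databse": "healthy"}]},  # typo: "databse"
}
print(unknown_dependencies(services))  # [('api', 'databse')]
```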

Circular Dependency Error¶

Solution:

- Review your dependency graph
- Look for cycles (A → B → C → A)
- Redesign to break the cycle

Service Waits Forever¶

Possible causes:

1. Dependency never becomes healthy
2. Dependency health check is misconfigured
3. Dependency service is failing

Debug:

# View service status
krill tui

# Check logs
krill logs dependency-name
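
When a `tcp` health check is suspected of never passing, the port can be probed by hand. The sketch below assumes that such a check amounts to a successful TCP connection, which this guide does not spell out; `tcp_port_open` is a hypothetical helper, and the host and port are examples.

```python
import socket

def tcp_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    # Example: check the PostgreSQL port used elsewhere in this guide.
    print("5432 reachable:", tcp_port_open("127.0.0.1", 5432))
```

If the probe fails while the service's logs show it running, the configured port or the service's bind address is a likely culprit.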

Cascading Failures Too Aggressive¶

Solution:

- Review critical flags
- Consider using restart policies
- May need to restructure dependencies