Workloads

Workloads are trainings, workspaces, and inferences that are fully controlled by NVIDIA Run:ai. They can be native workloads, third-party integrations, or typical Kubernetes workload types. For more information, see Workloads overview.

List workloads.

Retrieve a list of active workloads with details.

Security: bearerAuth

Request

query Parameters

deleted	boolean Return only deleted resources when `true`.
offset	integer <int32> The offset of the first item returned in the collection. Example: offset=100
limit	integer <int32> [ 1 .. 500 ] Default: 50 The maximum number of entries to return.
sortOrder	string Default: "asc" Sort results in descending or ascending order. Enum: "asc" "desc"
sortBy	string Sort results by a parameter. Enum: "type" "name" "clusterId" "projectId" "projectName" "departmentId" "departmentName" "createdAt" "deletedAt" "submittedBy" "phase" "completedAt" "nodepool" "distributedFramework" "allocatedGPU" "idleGpus" "idleAllocatedGpus" "phaseUpdatedAt" "category" "priority" "totalPendingTimeSeconds" "totalRunningTimeSeconds"
filterBy	Array of strings Filter results by a parameter. Use the format `field-name operator value`. Operators are `==` Equals, `!=` Not equals, `<=` Less than or equal, `>=` Greater than or equal, `=@` Contains, `!@` Does not contain, `=^` Starts with, and `=$` Ends with. Dates are in ISO 8601 timestamp format and are available for the operators `==`, `!=`, `<=`, and `>=`. Example: filterBy=name!=some-workload-name,allocatedGPU>=2,createdAt>=2021-01-01T00:00:00Z
search	string Filter results by a free text search. Example: search=test project
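The query parameters above can be assembled programmatically. The following is a minimal sketch, not an official client: the helper names `filter_expr` and `list_workloads_query` are hypothetical, and joining several `filterBy` expressions with commas is an assumption based on the example in the parameter description.

```python
from urllib.parse import urlencode

# Operators listed in the filterBy description.
FILTER_OPERATORS = {"==", "!=", "<=", ">=", "=@", "!@", "=^", "=$"}


def filter_expr(field, operator, value):
    """Build one 'field-name operator value' expression, e.g. allocatedGPU>=2."""
    if operator not in FILTER_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f"{field}{operator}{value}"


def list_workloads_query(filters=(), sort_by=None, sort_order="asc", offset=0, limit=50):
    """Return a URL-encoded query string for GET /api/v1/workloads."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    if sort_order not in ("asc", "desc"):
        raise ValueError("sortOrder must be 'asc' or 'desc'")
    params = {"offset": offset, "limit": limit, "sortOrder": sort_order}
    if sort_by:
        params["sortBy"] = sort_by
    if filters:
        # Assumption: expressions are comma-separated, as in the documented example.
        params["filterBy"] = ",".join(filters)
    return urlencode(params)


query = list_workloads_query(
    [filter_expr("name", "!=", "some-workload-name"), filter_expr("allocatedGPU", ">=", 2)],
    sort_by="createdAt",
    sort_order="desc",
    limit=100,
)
print("/api/v1/workloads?" + query)
```

`urlencode` percent-encodes the operator characters, so the expressions survive transport unchanged.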

Responses

200	Executed successfully.
401	Unauthorized
403	Forbidden
500	Unexpected error
503	Unexpected error

GET /api/v1/workloads

Response samples

application/json

{
  "next": 1,
  "workloads": [
    {
      "tenantId": 1001,
      "runningPods": 1,
      "phaseUpdatedAt": "2022-06-08T11:28:24.131Z",
      "k8sPhaseUpdatedAt": "2022-06-08T11:28:24.131Z",
      "updatedAt": "2022-06-08T11:28:24.131Z",
      "source": "CLI",
      "deletedAt": "2022-08-12T19:28:24.131Z",
      "type": "runai-job",
      "name": "very-important-job",
      "id": "497f6eca-6276-4993-bfeb-53cbbbba6f08",
      "priority": 50,
      "priorityClassName": "high-priority",
      "submittedBy": "researcher@run.ai",
      "clusterId": "71f69d83-ba66-4822-adf5-55ce55efd210",
      "projectName": "proj-1",
      "projectId": "1",
      "departmentName": "department-1",
      "departmentId": "1",
      "namespace": "runai-proj-1",
      "createdAt": "2022-01-01T03:49:52.531Z",
      "workloadRequestedResources": {
        "gpuRequestType": "portion",
        "gpu": {"limit": 1.5, "request": 1},
        "gpuMemory": {"limit": "2G", "request": "200M"},
        "cpu": {"limit": 1.5, "request": 1},
        "cpuMemory": {"limit": "2G", "request": "200M"},
        "migProfile": ["1g.5gb"],
        "extendedResources": [
          {"resource": "hardware-vendor.example/foo", "quantity": 2, "exclude": false}
        ]
      },
      "podsRequestedResources": {
        "gpuRequestType": "portion",
        "gpu": {"limit": 1.5, "request": 1},
        "gpuMemory": {"limit": "2G", "request": "200M"},
        "cpu": {"limit": 1.5, "request": 1},
        "cpuMemory": {"limit": "2G", "request": "200M"},
        "migProfile": ["1g.5gb"],
        "extendedResources": [
          {"resource": "hardware-vendor.example/foo", "quantity": 2, "exclude": false}
        ]
      },
      "allocatedResources": {
        "gpu": 1.5,
        "migProfile": ["1g.5gb"],
        "gpuMemory": "200Mi",
        "cpu": 0.5,
        "cpuMemory": "0B",
        "extendedResources": [
          {"resource": "hardware-vendor.example/foo", "quantity": 2, "exclude": false}
        ]
      },
      "actionsSupport": {"delete": true, "suspend": true},
      "phase": "Creating",
      "conditions": [
        {
          "type": "Ready",
          "status": "False",
          "message": "Resource validation failed: ...",
          "reason": "ErrorConfig",
          "lastTransitionTime": "2022-01-01T03:49:52.531Z"
        }
      ],
      "phaseMessage": "Not enough resources in the requested nodepool",
      "k8sPhase": "Pending",
      "requestedPods": {"number": 1, "min": 2, "max": 5, "parallelism": 3, "completions": 5},
      "requestedNodePools": ["default"],
      "currentNodePools": ["default"],
      "completedAt": "2022-01-01T03:49:52.531Z",
      "images": ["alpine:latest"],
      "urls": ["string"],
      "datasources": [
        {"type": "pvc", "name": "my-pvc-datasource-1", "id": "497f6eca-6276-4993-bfeb-53cbbbba6f08"}
      ],
      "environments": [
        {
          "connections": [
            {
              "name": "my-pytorch-env",
              "toolType": "pytorch",
              "connectionType": "ExternalUrl",
              "url": "http://wandb.com/yourproject",
              "authorizationType": "public",
              "authorizedUsers": ["user@company.ai", "another@company.ai"],
              "authorizedGroups": ["group-a", "group-b"],
              "containerPort": 8080
            }
          ],
          "name": "pytorch",
          "id": "497f6eca-6276-4993-bfeb-53cbbbba6f08",
          "replicaType": "Master"
        }
      ],
      "externalConnections": [
        {
          "name": "my-pytorch-env",
          "toolType": "pytorch",
          "connectionType": "ExternalUrl",
          "url": "http://wandb.com/yourproject",
          "authorizationType": "public",
          "authorizedUsers": ["user@company.ai", "another@company.ai"],
          "authorizedGroups": ["group-a", "group-b"],
          "containerPort": 8080
        }
      ],
      "distributedFramework": "Pytorch",
      "additionalFields": {},
      "preemptible": true,
      "environmentVariables": {"property1": "string", "property2": "string"},
      "command": "sleep",
      "arguments": "1000",
      "phaseReason": "NonPreemptibleOverQuota",
      "idleGpus": 3,
      "idleAllocatedGpus": 1,
      "totalPendingTimeSeconds": 60,
      "totalRunningTimeSeconds": 60,
      "category": "Train",
      "guaranteedRuntimeEndsAt": "2025-08-01T03:49:52.531Z"
    }
  ]
}
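Once a list response has been parsed, the `workloads` array can be summarized with a few lines of code. This sketch is illustrative only: `summarize_workloads` is a hypothetical helper, the input is a trimmed, made-up payload that reuses field names from the sample above, and treating a missing `allocatedResources` object as zero GPUs is an assumption.

```python
from collections import Counter


def summarize_workloads(response):
    """Count workloads per phase and sum allocated GPUs in a list response."""
    workloads = response.get("workloads", [])
    phases = Counter(w.get("phase", "Unknown") for w in workloads)
    # Assumption: a workload without allocatedResources holds no GPUs.
    total_gpu = sum((w.get("allocatedResources") or {}).get("gpu", 0) for w in workloads)
    return {"total": len(workloads), "byPhase": dict(phases), "allocatedGpu": total_gpu}


# Trimmed, hypothetical payload using field names from the response sample.
sample = {
    "next": 1,
    "workloads": [
        {"name": "very-important-job", "phase": "Creating", "allocatedResources": {"gpu": 1.5}},
        {"name": "another-job", "phase": "Running", "allocatedResources": {"gpu": 2}},
        {"name": "queued-job", "phase": "Creating"},
    ],
}
summary = summarize_workloads(sample)
print(summary)
```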

Get a workload.

Retrieve workload data using a workloadId.

Security: bearerAuth

Request

path Parameters

workloadId required	string <uuid> The Universally Unique Identifier (UUID) of the workload.

Responses

200	Executed successfully.
401	Unauthorized
403	Forbidden
404	The specified resource was not found
500	Unexpected error
503	Unexpected error

GET /api/v1/workloads/{workloadId}
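A request to this endpoint could be prepared as sketched below. The host `runai.example.com`, the token placeholder, and the helper name `build_get_workload_request` are hypothetical, and the sketch assumes that the `bearerAuth` security scheme corresponds to a standard `Authorization: Bearer <token>` header. The request is built but not sent.

```python
import uuid
import urllib.request


def build_get_workload_request(base_url, token, workload_id):
    """Prepare (but do not send) a GET request for a single workload."""
    # uuid.UUID raises ValueError if workload_id is not a well-formed UUID.
    canonical_id = str(uuid.UUID(workload_id))
    url = f"{base_url.rstrip('/')}/api/v1/workloads/{canonical_id}"
    headers = {
        # Assumption: bearerAuth maps to the usual Bearer token header.
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


request = build_get_workload_request(
    "https://runai.example.com",  # hypothetical host
    "<API_TOKEN>",  # placeholder, replace with a real token
    "497f6eca-6276-4993-bfeb-53cbbbba6f08",
)
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen` would send it. A 404 response then indicates that no workload with that UUID was found.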

Response samples

application/json

{
  "tenantId": 1001,
  "runningPods": 1,
  "phaseUpdatedAt": "2022-06-08T11:28:24.131Z",
  "k8sPhaseUpdatedAt": "2022-06-08T11:28:24.131Z",
  "updatedAt": "2022-06-08T11:28:24.131Z",
  "source": "CLI",
  "deletedAt": "2022-08-12T19:28:24.131Z",
  "type": "runai-job",
  "name": "very-important-job",
  "id": "497f6eca-6276-4993-bfeb-53cbbbba6f08",
  "priority": 50,
  "priorityClassName": "high-priority",
  "submittedBy": "researcher@run.ai",
  "clusterId": "71f69d83-ba66-4822-adf5-55ce55efd210",
  "projectName": "proj-1",
  "projectId": "1",
  "departmentName": "department-1",
  "departmentId": "1",
  "namespace": "runai-proj-1",
  "createdAt": "2022-01-01T03:49:52.531Z",
  "workloadRequestedResources": {
    "gpuRequestType": "portion",
    "gpu": {"limit": 1.5, "request": 1},
    "gpuMemory": {"limit": "2G", "request": "200M"},
    "cpu": {"limit": 1.5, "request": 1},
    "cpuMemory": {"limit": "2G", "request": "200M"},
    "migProfile": ["1g.5gb"],
    "extendedResources": [
      {"resource": "hardware-vendor.example/foo", "quantity": 2, "exclude": false}
    ]
  },
  "podsRequestedResources": {
    "gpuRequestType": "portion",
    "gpu": {"limit": 1.5, "request": 1},
    "gpuMemory": {"limit": "2G", "request": "200M"},
    "cpu": {"limit": 1.5, "request": 1},
    "cpuMemory": {"limit": "2G", "request": "200M"},
    "migProfile": ["1g.5gb"],
    "extendedResources": [
      {"resource": "hardware-vendor.example/foo", "quantity": 2, "exclude": false}
    ]
  },
  "allocatedResources": {
    "gpu": 1.5,
    "migProfile": ["1g.5gb"],
    "gpuMemory": "200Mi",
    "cpu": 0.5,
    "cpuMemory": "0B",
    "extendedResources": [
      {"resource": "hardware-vendor.example/foo", "quantity": 2, "exclude": false}
    ]
  },
  "actionsSupport": {"delete": true, "suspend": true},
  "phase": "Creating",
  "conditions": [
    {
      "type": "Ready",
      "status": "False",
      "message": "Resource validation failed: ...",
      "reason": "ErrorConfig",
      "lastTransitionTime": "2022-01-01T03:49:52.531Z"
    }
  ],
  "phaseMessage": "Not enough resources in the requested nodepool",
  "k8sPhase": "Pending",
  "requestedPods": {"number": 1, "min": 2, "max": 5, "parallelism": 3, "completions": 5},
  "requestedNodePools": ["default"],
  "currentNodePools": ["default"],
  "completedAt": "2022-01-01T03:49:52.531Z",
  "images": ["alpine:latest"],
  "urls": ["string"],
  "datasources": [
    {"type": "pvc", "name": "my-pvc-datasource-1", "id": "497f6eca-6276-4993-bfeb-53cbbbba6f08"}
  ],
  "environments": [
    {
      "connections": [
        {
          "name": "my-pytorch-env",
          "toolType": "pytorch",
          "connectionType": "ExternalUrl",
          "url": "http://wandb.com/yourproject",
          "authorizationType": "public",
          "authorizedUsers": ["user@company.ai", "another@company.ai"],
          "authorizedGroups": ["group-a", "group-b"],
          "containerPort": 8080
        }
      ],
      "name": "pytorch",
      "id": "497f6eca-6276-4993-bfeb-53cbbbba6f08",
      "replicaType": "Master"
    }
  ],
  "externalConnections": [
    {
      "name": "my-pytorch-env",
      "toolType": "pytorch",
      "connectionType": "ExternalUrl",
      "url": "http://wandb.com/yourproject",
      "authorizationType": "public",
      "authorizedUsers": ["user@company.ai", "another@company.ai"],
      "authorizedGroups": ["group-a", "group-b"],
      "containerPort": 8080
    }
  ],
  "distributedFramework": "Pytorch",
  "additionalFields": {},
  "preemptible": true,
  "environmentVariables": {"property1": "string", "property2": "string"},
  "command": "sleep",
  "arguments": "1000",
  "phaseReason": "NonPreemptibleOverQuota",
  "idleGpus": 3,
  "idleAllocatedGpus": 1,
  "totalPendingTimeSeconds": 60,
  "totalRunningTimeSeconds": 60,
  "category": "Train",
  "guaranteedRuntimeEndsAt": "2025-08-01T03:49:52.531Z",
  "pendingSchedulingMessages": [
    {
      "nodePool": "default",
      "phaseReason": "NonPreemptibleOverQuota",
      "reason": "Non-preemptible over quota",
      "orgType": "PROJECT",
      "userMessage": "You have reached the limit of non-preemptible resources"
    }
  ]
}
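The `pendingSchedulingMessages` array in this response can explain why a workload has not been scheduled. The sketch below turns it into readable lines. `pending_reasons` is a hypothetical helper, and falling back from `userMessage` to `reason` to `phaseReason` is an assumption about which fields may be missing.

```python
def pending_reasons(workload):
    """Return one human-readable line per pending scheduling message."""
    lines = []
    for message in workload.get("pendingSchedulingMessages") or []:
        node_pool = message.get("nodePool", "?")
        # Assumption: prefer the user-facing text, then the shorter reason fields.
        text = message.get("userMessage") or message.get("reason") or message.get("phaseReason", "")
        lines.append(f"[{node_pool}] {text}")
    return lines


# Subset of the response sample, limited to the fields this helper reads.
workload = {
    "name": "very-important-job",
    "pendingSchedulingMessages": [
        {
            "nodePool": "default",
            "phaseReason": "NonPreemptibleOverQuota",
            "reason": "Non-preemptible over quota",
            "orgType": "PROJECT",
            "userMessage": "You have reached the limit of non-preemptible resources",
        }
    ],
}
for line in pending_reasons(workload):
    print(line)
```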

Count workloads.

Retrieve the number of workloads.

Security: bearerAuth

Request

query Parameters

deleted	boolean Return only deleted resources when `true`.
filterBy	Array of strings Filter results by a parameter. Use the format `field-name operator value`. Operators are `==` Equals, `!=` Not equals, `<=` Less than or equal, `>=` Greater than or equal, `=@` Contains, `!@` Does not contain, `=^` Starts with, and `=$` Ends with. Dates are in ISO 8601 timestamp format and are available for the operators `==`, `!=`, `<=`, and `>=`. Example: filterBy=name!=some-workload-name,allocatedGPU>=2,createdAt>=2021-01-01T00:00:00Z
search	string Filter results by a free text search. Example: search=test project

Responses

200	Executed successfully.
401	Unauthorized
403	Forbidden
500	Unexpected error
503	Unexpected error

GET /api/v1/workloads/count

Response samples

application/json

{
  "count": 1
}
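One possible use of the count is to plan paging through the list endpoint. The sketch below computes the `offset` values needed to cover `count` items at a given `limit`. It assumes that the count was requested with the same `deleted`, `filterBy`, and `search` values as the list calls. `page_offsets` is a hypothetical helper.

```python
def page_offsets(count, limit=50):
    """Return the offset of each page needed to cover `count` items."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    if count < 0:
        raise ValueError("count must not be negative")
    return list(range(0, count, limit))


offsets = page_offsets(count=120, limit=50)
print(offsets)  # three pages: offsets 0, 50 and 100
```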

Get the workloads telemetry.

Retrieves workload data by telemetry type.

Security: bearerAuth

Request

query Parameters

clusterId	string <uuid> Filter using the Universally Unique Identifier (UUID) of the cluster. Example: clusterId=d73a738f-fab3-430a-8fa3-5241493d7128
nodepoolName	string Filter using the nodepool. Example: nodepoolName=default
departmentId	string Filter using the department id. Example: departmentId=1
groupBy	Array of strings <= 2 items Group workloads by field. Items Enum: "ClusterId" "DepartmentId" "ProjectId" "Type" "CurrentNodepools" "Phase" "Category"
telemetryType required	string (WorkloadTelemetryType) Specifies the telemetry type. Enum: "WORKLOADS_COUNT" "GPU_ALLOCATION"
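The constraints on these parameters (a required `telemetryType` and at most two `groupBy` items from a fixed set) can be checked on the client before a request is sent. This is a minimal sketch: `telemetry_query` is a hypothetical helper, and sending `groupBy` as a comma-separated list is an assumption, since the page does not show how the array is serialized.

```python
from urllib.parse import urlencode

TELEMETRY_TYPES = {"WORKLOADS_COUNT", "GPU_ALLOCATION"}
GROUP_BY_FIELDS = {
    "ClusterId", "DepartmentId", "ProjectId", "Type",
    "CurrentNodepools", "Phase", "Category",
}


def telemetry_query(telemetry_type, group_by=(), cluster_id=None,
                    nodepool_name=None, department_id=None):
    """Validate the parameters and return a URL-encoded query string."""
    if telemetry_type not in TELEMETRY_TYPES:
        raise ValueError(f"unknown telemetryType: {telemetry_type!r}")
    if len(group_by) > 2:
        raise ValueError("groupBy accepts at most 2 items")
    unknown = [field for field in group_by if field not in GROUP_BY_FIELDS]
    if unknown:
        raise ValueError(f"unknown groupBy fields: {unknown}")
    params = {"telemetryType": telemetry_type}
    if group_by:
        # Assumption: array items are sent comma-separated.
        params["groupBy"] = ",".join(group_by)
    optional = {"clusterId": cluster_id, "nodepoolName": nodepool_name,
                "departmentId": department_id}
    params.update({key: value for key, value in optional.items() if value is not None})
    return urlencode(params)


query = telemetry_query("GPU_ALLOCATION", ["ProjectId", "Phase"], department_id="1")
print("/api/v1/workloads/telemetry?" + query)
```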

Responses

200	Executed successfully.
400	Bad request.
401	Unauthorized
403	Forbidden
404	The specified resource was not found
500	Unexpected error
503	Unexpected error

GET /api/v1/workloads/telemetry

Response samples

{
  "type": "ALLOCATION_RATIO",
  "timestamp": "2023-06-06 12:09:18.211",
  "values": [
    {
      "value": "85",
      "groups": [
        {"key": "department", "value": "1", "name": "department-A"}
      ]
    }
  ]
}
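The nested `values` and `groups` structure of this response can be flattened into one row per value, which is often easier to feed into a table or chart. This is a sketch under stated assumptions: `flatten_telemetry` is a hypothetical helper, values are assumed to be numeric strings as in the sample, and a group's `name` is preferred over its `value` when both are present.

```python
def flatten_telemetry(response):
    """Turn each entry of 'values' into a flat dict keyed by group key."""
    rows = []
    for entry in response.get("values", []):
        # Assumption: show the readable group name when the API provides one.
        row = {group["key"]: group.get("name") or group["value"]
               for group in entry.get("groups", [])}
        # Values arrive as strings in the sample, so convert them to numbers.
        row["value"] = float(entry["value"])
        rows.append(row)
    return rows


sample = {
    "type": "ALLOCATION_RATIO",
    "timestamp": "2023-06-06 12:09:18.211",
    "values": [
        {"value": "85", "groups": [{"key": "department", "value": "1", "name": "department-A"}]}
    ],
}
rows = flatten_telemetry(sample)
print(rows)
```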

Get workload metrics data.

Retrieves metrics data for a workload from the metrics database. Use it in reporting and analysis tools.

Security: bearerAuth

Request

path Parameters

workloadId required	string <uuid> The Universally Unique Identifier (UUID) of the workload.

query Parameters

metricType required	Array of strings (WorkloadMetricType) Specify which data to request. Items Enum: "GPU_UTILIZATION" "GPU_MEMORY_USAGE_BYTES" "GPU_MEMORY_REQUEST_BYTES" "CPU_USAGE_CORES" "CPU_REQUEST_CORES" "CPU_LIMIT_CORES" "CPU_MEMORY_USAGE_BYTES" "CPU_MEMORY_REQUEST_BYTES" "CPU_MEMORY_LIMIT_BYTES" "POD_COUNT" "RUNNING_POD_COUNT" "GPU_ALLOCATION"
start required	string <date-time> Start date of time range to fetch data in ISO 8601 timestamp format. Example: start=2023-06-06T12:09:18.211Z
end required	string <date-time> End date of time range to fetch data in ISO 8601 timestamp format. Example: end=2023-06-07T12:09:18.211Z
numberOfSamples	integer [ 0 .. 1000 ] Default: 20 The number of samples to take in the specified time range. Example: numberOfSamples=20
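The `start` and `end` parameters expect ISO 8601 timestamps, which can be produced from timezone-aware `datetime` objects. The sketch below builds a query for a 24-hour window. `iso8601` and `metrics_query` are hypothetical helpers, and joining several `metricType` values with commas is an assumption, since the page does not show how the array is serialized.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def iso8601(moment):
    """Format a timezone-aware datetime as UTC with milliseconds and a 'Z' suffix."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def metrics_query(metric_types, start, end, number_of_samples=20):
    """Return a URL-encoded query string for the workload metrics endpoint."""
    if not metric_types:
        raise ValueError("at least one metricType is required")
    if start >= end:
        raise ValueError("start must be earlier than end")
    if not 0 <= number_of_samples <= 1000:
        raise ValueError("numberOfSamples must be between 0 and 1000")
    params = {
        # Assumption: array items are sent comma-separated.
        "metricType": ",".join(metric_types),
        "start": iso8601(start),
        "end": iso8601(end),
        "numberOfSamples": number_of_samples,
    }
    return urlencode(params)


start = datetime(2023, 6, 6, 12, 9, 18, 211000, tzinfo=timezone.utc)
query = metrics_query(["GPU_UTILIZATION", "GPU_ALLOCATION"], start, start + timedelta(days=1))
print(query)
```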

Responses

200	Executed successfully.
207	Partial success.
400	Bad request.
401	Unauthorized
403	Forbidden
404	The specified resource was not found
500	Unexpected error
503	Unexpected error

GET /api/v1/workloads/{workloadId}/metrics

Response samples

{
  "measurements": [
    {
      "type": "ALLOCATED_GPU",
      "labels": "{'gpu': '3'}",
      "values": [
        {"value": "85", "timestamp": "2023-06-06 12:09:18.211"}
      ]
    }
  ]
}
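In this sample, values and timestamps arrive as strings, so a consumer will usually convert them before plotting or aggregating. The following sketch assumes the timestamp layout shown in the sample (`YYYY-MM-DD HH:MM:SS.mmm`) and leaves `labels` as the raw string. `parse_measurements` is a hypothetical helper.

```python
from datetime import datetime

# Assumption: timestamps always follow the layout used in the response sample.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_measurements(response):
    """Convert each measurement's values into (datetime, float) points."""
    series = []
    for measurement in response.get("measurements", []):
        points = [
            (datetime.strptime(item["timestamp"], TIMESTAMP_FORMAT), float(item["value"]))
            for item in measurement.get("values", [])
        ]
        series.append({
            "type": measurement["type"],
            "labels": measurement.get("labels"),  # kept as the raw string
            "points": points,
        })
    return series


sample = {
    "measurements": [
        {
            "type": "ALLOCATED_GPU",
            "labels": "{'gpu': '3'}",
            "values": [{"value": "85", "timestamp": "2023-06-06 12:09:18.211"}],
        }
    ]
}
series = parse_measurements(sample)
print(series[0]["type"], series[0]["points"])
```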
