Distributed

Distributed training is the ability to split the training of a model among multiple processors. It is often necessary when multi-GPU training on a single node is no longer enough, typically when you require more GPUs than exist on a single node. Each such split is a pod (see definition above). Run:ai spawns an additional launcher process that manages and coordinates the other worker pods. For more information, see Distributed training.

Create a distributed training. [Experimental]

Use this endpoint to create a distributed training.

Security: bearerAuth
Request
Request Body schema: application/json
name
required
string (WorkloadName) non-empty

The name of the workload.

useGivenNameAsPrefix
boolean
Default: false

When true, the requested name will be treated as a prefix. The final name of the workload will be composed of the name followed by a random set of characters.

projectId
required
string (ProjectId2)

The id of the project.

clusterId
required
string <uuid> (ClusterId)

The id of the cluster.

spec
object or null (CommonFlatFields)

The spec of the worker(s).

masterSpecSameAsWorker
boolean or null (MasterSpecSameAsWorker)

Used for distributed workloads to indicate that the master spec should be the same as the worker spec. In this case, masterSpec should not be specified.

masterSpec
object or null (CommonFlatFields)

The spec of the master. Should be provided only if masterSpecSameAsWorker is false.
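The rule linking masterSpecSameAsWorker and masterSpec can be sketched as a small client-side payload builder. This is a minimal illustration under stated assumptions, not part of the Run:ai API: the helper name `build_distributed_payload` is hypothetical, the field names are taken from the request schema and sample in this section, and the `image` key inside the spec is only a placeholder because the contents of CommonFlatFields are not listed here.

```python
from typing import Any, Dict, Optional


def build_distributed_payload(
    name: str,
    project_id: str,  # the schema lists a string; the samples show an integer
    cluster_id: str,
    spec: Optional[Dict[str, Any]] = None,
    master_spec: Optional[Dict[str, Any]] = None,
    master_spec_same_as_worker: Optional[bool] = None,
    use_given_name_as_prefix: bool = False,
) -> Dict[str, Any]:
    """Assemble a request body for creating a distributed training (hypothetical helper)."""
    if not name:
        raise ValueError("name is required and must be non-empty")
    # The schema says masterSpec should be provided only when
    # masterSpecSameAsWorker is false, so reject the conflicting combination.
    if master_spec_same_as_worker and master_spec is not None:
        raise ValueError("omit masterSpec when masterSpecSameAsWorker is true")

    payload: Dict[str, Any] = {
        "name": name,
        "useGivenNameAsPrefix": use_given_name_as_prefix,
        "projectId": project_id,
        "clusterId": cluster_id,
    }
    # Optional fields are left out entirely when they are not supplied.
    if spec is not None:
        payload["spec"] = spec
    if master_spec_same_as_worker is not None:
        payload["masterSpecSameAsWorker"] = master_spec_same_as_worker
    if master_spec is not None:
        payload["masterSpec"] = master_spec
    return payload


payload = build_distributed_payload(
    name="my-workload-name",
    project_id="1",
    cluster_id="71f69d83-ba66-4822-adf5-55ce55efd210",
    spec={"image": "example/image:tag"},  # placeholder spec content (assumption)
    master_spec_same_as_worker=True,
)
```

Rejecting the conflicting combination on the client is a design choice of this sketch; how the server treats a request that sets both fields is not stated here.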

Responses
200

Request completed successfully.

400

Bad submission request.

401

Unauthorized

403

Forbidden

503

Unexpected error

POST /api/v1/workloads/distributed
Request samples
application/json
{
  "name": "my-workload-name",
  "useGivenNameAsPrefix": true,
  "projectId": 1,
  "clusterId": "71f69d83-ba66-4822-adf5-55ce55efd210",
  "spec": { },
  "masterSpecSameAsWorker": true,
  "masterSpec": { }
}
Response samples
application/json
{
  "name": "my-workload-name",
  "requestedName": "string",
  "workloadId": "06d16c5d-4728-42fa-b573-3b11820d999f",
  "projectId": 1,
  "departmentId": 2,
  "clusterId": "71f69d83-ba66-4822-adf5-55ce55efd210",
  "createdBy": "test@lab.com",
  "createdAt": "2022-01-01T03:49:52.531Z",
  "desiredPhase": "Running",
  "actualPhase": "Creating",
  "spec": { },
  "masterSpecSameAsWorker": true,
  "masterSpec": { }
}
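
A create call could be assembled along the following lines. This is a sketch using only the Python standard library. The base URL and token are placeholders, the helper name `make_create_request` is hypothetical, and sending the token as an `Authorization: Bearer` header is an assumption based on the bearerAuth security scheme named above. The request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Any, Dict


def make_create_request(
    base_url: str, token: str, payload: Dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a distributed training."""
    url = base_url.rstrip("/") + "/api/v1/workloads/distributed"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # JSON request body
        method="POST",
        headers={
            "Authorization": "Bearer " + token,  # assumed bearer-token header
            "Content-Type": "application/json",
        },
    )


request = make_create_request(
    "https://runai.example.com",  # placeholder host (assumption)
    "<token>",                    # placeholder token
    {
        "name": "my-workload-name",
        "useGivenNameAsPrefix": True,
        "projectId": "1",
        "clusterId": "71f69d83-ba66-4822-adf5-55ce55efd210",
        "masterSpecSameAsWorker": True,
    },
)
```

Passing `request` to `urllib.request.urlopen` would send it. A 200 response is expected to carry a body shaped like the response sample above, while the 4xx and 503 statuses surface as `urllib.error.HTTPError`.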

Delete a distributed training by id. [Experimental]

Use this endpoint to delete a distributed training by its workload ID.

Security: bearerAuth
Request
Path Parameters
workloadId
required
string <uuid>

Unique identifier of the workload.

Responses
204

No Content.

401

Unauthorized

403

Forbidden

404

The specified resource was not found

500

Unexpected error

503

Unexpected error

DELETE /api/v1/workloads/distributed/{workloadId}
Response samples
application/json
{
  "code": 401,
  "message": "Issuer is not familiar."
}
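
The delete call and its error body can be sketched as follows. The helper names `make_delete_request` and `parse_error_body` are hypothetical, the base URL and token are placeholders, and the bearer header is an assumption based on the bearerAuth scheme. The workload ID is checked locally with `uuid.UUID` because the path parameter is documented as a UUID. Nothing is sent over the network.

```python
import json
import urllib.request
import uuid
from typing import Tuple


def make_delete_request(
    base_url: str, token: str, workload_id: str
) -> urllib.request.Request:
    """Build (but do not send) the DELETE request for one distributed training."""
    # uuid.UUID raises ValueError for a malformed ID, before any request is built.
    canonical_id = str(uuid.UUID(workload_id))
    url = f"{base_url.rstrip('/')}/api/v1/workloads/distributed/{canonical_id}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},  # assumed bearer-token header
    )


def parse_error_body(body: bytes) -> Tuple[int, str]:
    """Extract (code, message) from an error body shaped like the sample above."""
    doc = json.loads(body)
    return doc["code"], doc["message"]


delete_request = make_delete_request(
    "https://runai.example.com",  # placeholder host (assumption)
    "<token>",                    # placeholder token
    "06d16c5d-4728-42fa-b573-3b11820d999f",
)
error = parse_error_body(b'{"code": 401, "message": "Issuer is not familiar."}')
```

A successful delete returns 204 with no content, so only the error statuses are expected to carry a JSON body to parse.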

Get a distributed training's data. [Experimental]

Retrieve the details of a distributed training by workload id.

Security: bearerAuth
Request
Path Parameters
workloadId
required
string <uuid>

Unique identifier of the workload.

Responses
200

Executed successfully.

401

Unauthorized

403

Forbidden

404

The specified resource was not found

500

Unexpected error

503

Unexpected error

GET /api/v1/workloads/distributed/{workloadId}
Response samples
application/json
{
  "name": "my-workload-name",
  "requestedName": "string",
  "workloadId": "06d16c5d-4728-42fa-b573-3b11820d999f",
  "projectId": 1,
  "departmentId": 2,
  "clusterId": "71f69d83-ba66-4822-adf5-55ce55efd210",
  "createdBy": "test@lab.com",
  "createdAt": "2022-01-01T03:49:52.531Z",
  "desiredPhase": "Running",
  "actualPhase": "Creating",
  "spec": { },
  "masterSpecSameAsWorker": true,
  "masterSpec": { }
}
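
One way to use this response is to compare actualPhase with desiredPhase. The sketch below builds the GET request and checks a response body shaped like the sample above. The helper names `make_get_request` and `phase_settled` are hypothetical, the base URL and token are placeholders, and treating matching phases as "settled" is an assumption of this example, not documented behavior.

```python
import json
import urllib.request
from typing import Any, Dict


def make_get_request(
    base_url: str, token: str, workload_id: str
) -> urllib.request.Request:
    """Build (but do not send) the GET request for one distributed training."""
    url = f"{base_url.rstrip('/')}/api/v1/workloads/distributed/{workload_id}"
    return urllib.request.Request(
        url,
        method="GET",
        headers={"Authorization": f"Bearer {token}"},  # assumed bearer-token header
    )


def phase_settled(workload: Dict[str, Any]) -> bool:
    """Return True when the reported actual phase equals the desired phase."""
    # Indexing (not .get) so a body missing either field raises KeyError.
    return workload["actualPhase"] == workload["desiredPhase"]


get_request = make_get_request(
    "https://runai.example.com",  # placeholder host (assumption)
    "<token>",                    # placeholder token
    "06d16c5d-4728-42fa-b573-3b11820d999f",
)

# A trimmed copy of the response sample, standing in for a real response body.
sample_body = """
{
  "name": "my-workload-name",
  "workloadId": "06d16c5d-4728-42fa-b573-3b11820d999f",
  "desiredPhase": "Running",
  "actualPhase": "Creating"
}
"""
workload = json.loads(sample_body)
settled = phase_settled(workload)
```

With the sample values the two phases differ, so a caller would typically request the workload again after a delay before acting on it.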