New resource: aws_sagemaker_training_job #46892
…on, checkpoint config, debug hook, debug rule, experiment config, infra and input config
…package config and output data config block
…urce config block
…rverless job config block
…pping conditions, tensor board output and vpc config block
…up skaff comments + add waiters
…ce identity + testcases
… check point config and env hyper parameters
…ix test cases for retry strategy, algo specification, input/ouput data config
…e debug, session chaining and serverless config + schema changes
aws_sagemaker_training_job
| * `create` - (Default `60m`) | ||
| * `update` - (Default `180m`) | ||
| * `delete` - (Default `90m`) |
| * `create` - (Default `60m`) | |
| * `update` - (Default `180m`) | |
| * `delete` - (Default `90m`) | |
| * `create` - (Default `25m`) | |
| * `update` - (Default `25m`) | |
| * `delete` - (Default `25m`) |
Serverless jobs sometimes take around 35 minutes, so I have updated the timeouts to newer values consistently in the resource code and in the documentation.
Before timeout change :
make testacc PKG=sagemaker TESTS=TestAccSageMakerTrainingJob_serverless
make: Verifying source code with gofmt...
==> Checking that code complies with gofmt requirements...
make: Running acceptance tests on branch: 🌿 f-aws_sagemaker_training_job 🌿...
TF_ACC=1 go1.25.8 test ./internal/service/sagemaker/... -v -count 1 -parallel 20 -run='TestAccSageMakerTrainingJob_serverless' -timeout 360m -vet=off
2026/03/17 12:27:15 Creating Terraform AWS Provider (SDKv2-style)...
2026/03/17 12:27:15 Initializing Terraform AWS Provider (SDKv2-style)...
=== RUN TestAccSageMakerTrainingJob_serverless
training_job_test.go:640: Error running post-test destroy, there may be dangling resources: exit status 1
Error: deleting SageMaker AI Training Job
ID: tf-acc-test-XXXXXXXXXXXXXXXX
Cause: While waiting, timeout while waiting for state to become 'Completed,
Failed, Stopped' (timeout: 35m0s): operation error SageMaker:
DescribeTrainingJob, context deadline exceeded"
--- FAIL: TestAccSageMakerTrainingJob_serverless (2182.86s)
FAIL
FAIL github.com/hashicorp/terraform-provider-aws/internal/service/sagemaker 2188.032s
FAIL
make: *** [testacc] Error 1
After the new timeout (took around 36 mins):
make testacc PKG=sagemaker TESTS=TestAccSageMakerTrainingJob_serverless
make: Verifying source code with gofmt...
==> Checking that code complies with gofmt requirements...
make: Running acceptance tests on branch: 🌿 f-aws_sagemaker_training_job 🌿...
TF_ACC=1 go1.25.8 test ./internal/service/sagemaker/... -v -count 1 -parallel 20 -run='TestAccSageMakerTrainingJob_serverless' -timeout 360m -vet=off
2026/03/17 13:05:53 Creating Terraform AWS Provider (SDKv2-style)...
2026/03/17 13:05:53 Initializing Terraform AWS Provider (SDKv2-style)...
=== RUN TestAccSageMakerTrainingJob_serverless
--- PASS: TestAccSageMakerTrainingJob_serverless (2192.87s)
PASS
ok github.com/hashicorp/terraform-provider-aws/internal/service/sagemaker 2198.048s
So I am setting this to 45 minutes to handle any regional differences.
internal/service/sagemaker/wait.go
| Target: enum.Slice(awstypes.TrainingJobStatusCompleted, awstypes.TrainingJobStatusStopped, awstypes.TrainingJobStatusFailed), | ||
| Refresh: statusTrainingJob(conn, id), | ||
| Timeout: timeout, | ||
| NotFoundChecks: 20, |
This is the built-in default when no value is set, so it can be omitted.
| NotFoundChecks: 20, |
| S3OutputPath types.String `tfsdk:"s3_output_path"` | ||
| } | ||
|
|
||
| func sweepTrainingJobs(ctx context.Context, client *conns.AWSClient) ([]sweep.Sweepable, error) { |
This can move into sweep.go for consistency with the other sweeper functions.
| return out, nil | ||
| } | ||
|
|
||
| type resourceTrainingJobModel struct { |
| This resource exports the following attributes in addition to the arguments above: | ||
|
|
||
| * `arn` - ARN of the Training Job. | ||
| * `id` - Name of the Training Job. |
No id attribute in the schema.
| * `id` - Name of the Training Job. |
| out, err := tfresource.RetryWhen(ctx, propagationTimeout, func(ctx context.Context) (*sagemaker.CreateTrainingJobOutput, error) { | ||
| return conn.CreateTrainingJob(ctx, &input) | ||
| }, func(err error) (bool, error) { | ||
| if tfawserr.ErrMessageContains(err, ErrCodeValidationException, "Could not assume role") { |
The variety of messages tied to a single code here really emphasizes the difficulty of error handling when one error type is overloaded.
It might be worth adding a new helper ErrMessageContainsAny which accepts a code + slice of message strings and returns true if any match. If nothing else it will improve readability here.
| } | ||
| } | ||
|
|
||
| func setZeroAttrValuesToNull(ctx context.Context, target any) diag.Diagnostics { |
What is the reason for nulling zero values here? Is this something that should be accounted for more centrally for all Framework-based resources?
The zero values are being nulled because the list API is reusing the full TrainingJobModel, which has custom Terraform Framework collection types. In the list flow, the model starts off as a zero‑value struct, and flex.Flatten() only fills in the fields that actually come back from AWS. Any optional lists or maps that AWS doesn’t return are left as Go zero values instead of proper Terraform nulls.
When SetResult() tries to serialize the model, the framework calls ToTerraformValue() on these zero‑value collection wrappers, and that’s where the nil pointer panic happens. By explicitly nulling these fields, we make sure they are treated as Terraform null values, not invalid zero‑value framework types, and the panic is avoided.
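The idea can be sketched without the Terraform Plugin Framework. Everything in this example is hypothetical: `listValue` only imitates a collection wrapper whose zero value is invalid, and `setZeroListsToNull` is a simplified analogue of the approach described above, not the PR's function.

```go
package main

import (
	"fmt"
	"reflect"
)

type valueState int

const (
	stateInvalid valueState = iota // zero value: the wrapper was never initialised
	stateNull
	stateKnown
)

// listValue is a hypothetical stand-in for a framework list type.
type listValue struct {
	state valueState
	elems []string
}

func nullList() listValue { return listValue{state: stateNull} }

// jobModel imitates a model whose optional lists may not be filled by flatten.
type jobModel struct {
	Name             string
	SecurityGroupIDs listValue
	Subnets          listValue
}

// setZeroListsToNull walks the exported fields of the struct that target
// points to and replaces every zero-value listValue with an explicit null.
func setZeroListsToNull(target any) error {
	v := reflect.ValueOf(target)
	if v.Kind() != reflect.Pointer || v.IsNil() || v.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("target must be a non-nil pointer to a struct, got %T", target)
	}
	v = v.Elem()

	listType := reflect.TypeOf(listValue{})
	for i := 0; i < v.NumField(); i++ {
		f := v.Field(i)
		if f.Type() == listType && f.CanSet() && f.IsZero() {
			f.Set(reflect.ValueOf(nullList()))
		}
	}
	return nil
}

func main() {
	// Only Subnets came back from the API; SecurityGroupIDs is still zero.
	m := jobModel{Name: "example", Subnets: listValue{state: stateKnown, elems: []string{"subnet-a"}}}
	if err := setZeroListsToNull(&m); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(m.SecurityGroupIDs.state == stateNull, m.Subnets.state == stateKnown)
}
```

Fields that were populated are left untouched; only wrappers still at their zero value are replaced.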
Error I was getting without setZeroAttrValuesToNull:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x30 pc=0x101333ea4]
goroutine 8635 [running]:
github.com/hashicorp/terraform-plugin-framework/types/basetypes.ListValue.ToTerraformValue({{0x0, 0x0, 0x0}, {0x0, 0x0}, 0x0}, {0x11e0bd0f0, 0x140034cb2f0})
| } | ||
| } | ||
|
|
||
| func testAccPreCheck(ctx context.Context, t *testing.T) { |
This should be renamed testAccPreCheckTrainingJobs to avoid implying it is a service-level pre-check that should be used in tests for other SageMaker resources.
|
|
||
| // SageMaker injects metric definitions for some built-in algorithms and supported | ||
| // prebuilt images. This fixes unexpected new value errors during apply | ||
| func normalizeAlgoSpecMetricDefinitions( |
Can this be a plan modifier rather than a function called within each CRUD handler?
I may be misunderstanding plan modifiers, but I do not think they are a good fit here, because the values being handled are injected by the API after create/read (during apply), so that information does not seem to be available at plan time(?).
Based on that, I moved the shared post-flatten normalization into flatten() and reused it from Create and Read, mainly to avoid repeating the same logic in multiple handlers, if that's okay.
If I’m thinking about plan modifiers the wrong way, I’m happy to revisit it and take another look.
|
|
||
| out, err := conn.DescribeTrainingJob(ctx, &input) | ||
| if err != nil { | ||
| if errs.Contains(err, "ResourceNotFound") || errs.Contains(err, "Requested resource not found") { |
Does the API return both of these variations on a "Not Found" message?
This check was repeated in the Delete method and findTrainingJobByName.
I looked into the AWS documentation and added ResourceNotFound as the error to expect.
Describe - https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html
Delete - https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteTrainingJob.html
But it turns out ResourceNotFound is not something we should expect here, so I have made the changes.
I will post the results of the entire acceptance test suite after all these fixes.
…ates + change precheck func name + remove NotFoundChecks in wait funcs
…obs + remove normalize func from CRUD to flatten
Test case which needs an env var set:
make testacc PKG=sagemaker TESTS=TestAccSageMakerTrainingJob_algorithmSpecificationMetrics
make: Verifying source code with gofmt...
==> Checking that code complies with gofmt requirements...
make: Running acceptance tests on branch: 🌿 f-aws_sagemaker_training_job 🌿...
TF_ACC=1 go1.25.8 test ./internal/service/sagemaker/... -v -count 1 -parallel 20 -run='TestAccSageMakerTrainingJob_algorithmSpecificationMetrics' -timeout 360m -vet=off
2026/03/17 14:27:11 Creating Terraform AWS Provider (SDKv2-style)...
2026/03/17 14:27:11 Initializing Terraform AWS Provider (SDKv2-style)...
=== RUN TestAccSageMakerTrainingJob_algorithmSpecificationMetrics
--- PASS: TestAccSageMakerTrainingJob_algorithmSpecificationMetrics (128.14s)
PASS
ok github.com/hashicorp/terraform-provider-aws/internal/service/sagemaker 133.046s
Acceptance tests after addressing review comments:
List acceptance tests:
make testacc PKG=sagemaker TESTS=TestAccSageMakerTrainingJob_List_
make: Verifying source code with gofmt...
==> Checking that code complies with gofmt requirements...
make: Running acceptance tests on branch: 🌿 f-aws_sagemaker_training_job 🌿...
TF_ACC=1 go1.25.8 test ./internal/service/sagemaker/... -v -count 1 -parallel 20 -run='TestAccSageMakerTrainingJob_List_' -timeout 360m -vet=off
2026/03/17 15:30:31 Creating Terraform AWS Provider (SDKv2-style)...
2026/03/17 15:30:31 Initializing Terraform AWS Provider (SDKv2-style)...
=== RUN TestAccSageMakerTrainingJob_List_basic
=== PAUSE TestAccSageMakerTrainingJob_List_basic
=== RUN TestAccSageMakerTrainingJob_List_includeResource
=== PAUSE TestAccSageMakerTrainingJob_List_includeResource
=== RUN TestAccSageMakerTrainingJob_List_regionOverride
=== PAUSE TestAccSageMakerTrainingJob_List_regionOverride
=== CONT TestAccSageMakerTrainingJob_List_basic
=== CONT TestAccSageMakerTrainingJob_List_regionOverride
=== CONT TestAccSageMakerTrainingJob_List_includeResource
--- PASS: TestAccSageMakerTrainingJob_List_includeResource (92.36s)
--- PASS: TestAccSageMakerTrainingJob_List_basic (99.93s)
--- PASS: TestAccSageMakerTrainingJob_List_regionOverride (112.83s)
PASS
ok github.com/hashicorp/terraform-provider-aws/internal/service/sagemaker 118.009s
Identity acceptance tests:
make testacc PKG=sagemaker TESTS=TestAccSageMakerTrainingJob_Identity_
make: Verifying source code with gofmt...
==> Checking that code complies with gofmt requirements...
make: Running acceptance tests on branch: 🌿 f-aws_sagemaker_training_job 🌿...
TF_ACC=1 go1.25.8 test ./internal/service/sagemaker/... -v -count 1 -parallel 20 -run='TestAccSageMakerTrainingJob_Identity_' -timeout 360m -vet=off
2026/03/17 15:57:57 Creating Terraform AWS Provider (SDKv2-style)...
2026/03/17 15:57:57 Initializing Terraform AWS Provider (SDKv2-style)...
=== RUN TestAccSageMakerTrainingJob_Identity_basic
=== PAUSE TestAccSageMakerTrainingJob_Identity_basic
=== RUN TestAccSageMakerTrainingJob_Identity_regionOverride
=== PAUSE TestAccSageMakerTrainingJob_Identity_regionOverride
=== CONT TestAccSageMakerTrainingJob_Identity_basic
=== CONT TestAccSageMakerTrainingJob_Identity_regionOverride
--- PASS: TestAccSageMakerTrainingJob_Identity_regionOverride (61.12s)
--- PASS: TestAccSageMakerTrainingJob_Identity_basic (66.85s)
PASS
ok github.com/hashicorp/terraform-provider-aws/internal/service/sagemaker 72.112s
Rollback Plan
If a change needs to be reverted, we will publish an updated version of the library.
Description
This PR will :
aws_sagemaker_training_job
aws_sagemaker_training_job
Relations
Closes #46049
References
CREATE - https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html
READ - https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html
UPDATE - https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateTrainingJob.html
DELETE - https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteTrainingJob.html
LIST - https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListTrainingJobs.html
Output from Acceptance Testing
TestAccSageMakerTrainingJob_mlflowConfig could not be tested, as we can't push a "nova" image to ECR like we did for TestAccSageMakerTrainingJob_algorithmSpecificationMetrics (for that test, a custom image was pushed to ECR and tested with an env var).
DELETE has extra cleanups
Delete has extra cleanup functions because :
Without ENI cleanup :
Without model package cleanup :
CREATE/READ normalise functions
Why normalizeStoppingCondition
AWS injects a default stopping_condition for serverless jobs when the user omits it. The function suppresses that value only for serverless jobs, so import can still retain explicit stopping_condition values for non-serverless jobs.
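The suppression rule can be sketched as a small pure function. The types and the name `suppressInjectedStoppingCondition` are hypothetical simplifications for illustration, not the PR's code, and the way a job is detected as serverless is reduced to a boolean here.

```go
package main

import "fmt"

// stoppingCondition is a simplified, hypothetical model of the block.
type stoppingCondition struct {
	MaxRuntimeInSeconds int
}

// remoteJob is a hypothetical view of what the API returned for a job.
type remoteJob struct {
	Serverless        bool
	StoppingCondition *stoppingCondition
}

// suppressInjectedStoppingCondition returns the value to keep in state.
// The API default is dropped only when the job is serverless and the
// configuration did not set the block; in every other case the remote value
// is kept, so an import of a non-serverless job retains its explicit value.
func suppressInjectedStoppingCondition(configured *stoppingCondition, remote remoteJob) *stoppingCondition {
	if remote.Serverless && configured == nil {
		return nil
	}
	return remote.StoppingCondition
}

func main() {
	injected := &stoppingCondition{MaxRuntimeInSeconds: 86400}

	// Serverless job, block omitted in config: the injected default is dropped.
	fmt.Println(suppressInjectedStoppingCondition(nil, remoteJob{Serverless: true, StoppingCondition: injected}) == nil)

	// Non-serverless job, nothing in config yet (for example during import):
	// the remote value is retained.
	fmt.Println(suppressInjectedStoppingCondition(nil, remoteJob{StoppingCondition: injected}) == injected)
}
```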
Without this we get this error for serverless jobs :
Just adding the stopping_condition block to the serverless test config fixes this without the normalise function, but we shouldn't expect this block for serverless jobs(?).
For non-serverless jobs, AWS makes this block mandatory and returns an error during apply:
Why normalizeAlgoSpecMetricDefinitions
SageMaker injects metric definitions for some built-in algorithms and supported prebuilt images.
Waiters won't wait for terminal states
Since training jobs can run for up to 30 days, we cannot wait for terminal states.