
gpu support continue #257

Closed
lazarusA wants to merge 6 commits into main from ac_la/gpu_continue

Conversation

@lazarusA (Member) commented Apr 7, 2026

this follows from #256

@lazarusA mentioned this pull request Apr 7, 2026

@gemini-code-assist bot left a comment

Code Review

This pull request introduces GPU support and refactors the data pipeline to separate predictors and forcings, updating the training, evaluation, and checkpointing logic accordingly. Feedback identifies several critical issues, including a typo in the loss computation that will cause an UndefVarError, a statistically incorrect R² formula, and flawed NaN masking logic that could lead to incorrect loss values in multi-target models. Additionally, improvements are suggested for configuration type safety, the removal of leftover comments, and fixing a broken warning check in the data preparation logic.

Comment on lines +108 to +111
y_t = y[target]# _get_target_y(y, target)
ŷ_t = ŷ[target]#_get_target_ŷ(ŷ, y_t, target)
_apply_loss(ŷ_t, y_t, y_nan, loss_spec)
# _apply_loss(ŷ_t, y_t, _get_target_nan(y_nan, target), loss_spec)

critical

There is a typo on line 109: the variable is written as y followed by a combining circumflex (U+0302) instead of the precomposed argument ŷ (U+0177), which will cause an UndefVarError. Additionally, y_nan should be target-specific (e.g., y_nan[target]) so that NaNs in one target do not invalidate observations for other targets during loss calculation.

                y_t = y[target]
                ŷ_t = ŷ[target]
                _apply_loss(ŷ_t, y_t, y_nan[target], loss_spec)

function loss_fn(ŷ, y, y_nan, ::Val{:r2})
r = cor(ŷ[y_nan], y[y_nan])
return r * r
return 1 - sum((y[y_nan] .- ŷ[y_nan]).^2) / sum((y[y_nan] .- mean(ŷ[y_nan])).^2)

high

The $R^2$ formula is statistically incorrect. The denominator (Total Sum of Squares) should be calculated using the mean of the observed values y, not the predicted values ŷ.

    return 1 - sum((y[y_nan] .- ŷ[y_nan]).^2) / sum((y[y_nan] .- mean(y[y_nan])).^2)
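As a quick sanity check, the corrected formula can be exercised in isolation. This is a minimal sketch; `r2_score` and the sample vectors are made up for illustration and are not part of the PR:

```julia
using Statistics

# Coefficient of determination: the total sum of squares in the denominator
# uses the mean of the observed values y, never of the predictions ŷ.
function r2_score(ŷ, y, y_nan)
    yv, ŷv = y[y_nan], ŷ[y_nan]
    return 1 - sum((yv .- ŷv).^2) / sum((yv .- mean(yv)).^2)
end

y = [1.0, 2.0, 3.0, 4.0]
ŷ = [1.1, 1.9, 3.2, 3.8]
mask = trues(length(y))

r2_score(ŷ, y, mask)  # 0.98: close to 1 for a good fit
r2_score(y, y, mask)  # exactly 1.0 for a perfect fit
```

Using `mean(ŷv)` in the denominator instead would bias the score whenever the predictions are offset from the observations.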

Comment on lines +5 to +8
is_no_nan = falses(length(first(y))) |> cfg.gdev
for vec in y
is_no_nan = is_no_nan.|| .!isnan.(vec)
end

high

The current logic computes a global NaN mask by OR-ing all targets. This is problematic for multi-target models: if target A is NaN but target B is valid at index i, the mask will be true, and the loss function will attempt to compute a loss on the NaN value for target A, resulting in a NaN total loss. The mask should be kept per-target (e.g., as a NamedTuple of masks).

        is_no_nan = map(vec -> .!isnan.(vec), y)
        !any(map(any, is_no_nan)) && continue
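To see why per-target masks matter, here is a self-contained sketch; the target names and values are hypothetical, and `mse` stands in for whatever `_apply_loss` computes:

```julia
# Hypothetical two-target batch: :gpp has a NaN at index 2 where :nee is valid.
y = (gpp = [1.0, NaN, 3.0], nee = [0.5, 0.6, 0.7])
ŷ = (gpp = [1.2, 2.0, 2.8], nee = [0.4, 0.7, 0.6])

# Per-target masks, as suggested above: each target keeps its own valid indices.
is_no_nan = map(vec -> .!isnan.(vec), y)

mse(ŷv, yv, m) = sum(abs2, ŷv[m] .- yv[m]) / sum(m)
per_target_loss = sum(values(map(mse, ŷ, y, is_no_nan)))  # finite, ≈ 0.05

# A global OR-mask marks index 2 as valid because :nee is valid there,
# so :gpp's NaN leaks into the loss and poisons the total.
gmask = .!isnan.(y.gpp) .| .!isnan.(y.nee)
global_loss = mse(ŷ.gpp, y.gpp, gmask) + mse(ŷ.nee, y.nee, gmask)  # NaN
```

Because `map` over NamedTuples preserves the target names, the per-target masks can be indexed the same way as `y` and `ŷ` in the loss loop.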

Comment on lines +46 to +53
is_no_nan_t = falses(length(first(y_train))) |> cfg.gdev
for vec in y_train
is_no_nan_t = is_no_nan_t .|| .!isnan.(vec)
end
is_no_nan_v = falses(length(first(y_val))) |> cfg.gdev
for vec in y_val
is_no_nan_v = is_no_nan_v .|| .!isnan.(vec)
end

high

Similar to the training loop, the NaN masks for evaluation should be per-target to avoid cross-contamination of NaN values between different target variables.

    is_no_nan_t = map(vec -> .!isnan.(vec), y_train)
    is_no_nan_v = map(vec -> .!isnan.(vec), y_val)

Comment on lines +61 to +64
gdev = gpu_device()

"Set the `cpu_device`, useful for sending back to the cpu model parameters"
cdev = cpu_device()

medium

The fields gdev and cdev should have explicit type annotations (e.g., Lux.AbstractLuxDevice) to improve code clarity and potentially help with compiler optimizations. Additionally, ensure that gpu_device() is the intended default for all instances of DataConfig, as it may trigger device initialization.

    "Select a gpu_device or default to cpu if none available"
    gdev::Lux.AbstractLuxDevice = gpu_device()

    "Set the `cpu_device`, useful for sending back to the cpu model parameters"
    cdev::Lux.AbstractLuxDevice = cpu_device()

Comment on lines +112 to +113
# predicto
# predictors_forcing = unique(predictors_forcing)

medium

Remove leftover comments and incomplete words.

Comment on lines 115 to 117
if isempty(predictors_forcing)
@warn "Note that you don't have predictors or forcing variables."
end

medium

The warning check uses predictors_forcing, which is an empty Symbol[] initialized at line 89 and never updated. The logic should check if both predictors and forcings are empty instead.

    if isempty(predictors) && isempty(forcings)
        @warn "Note that you don't have predictors or forcing variables."
    end

@lazarusA closed this Apr 8, 2026
