Better support for target models on the ensemble attack #135
Open · lotif wants to merge 46 commits into `main` from `marcelo/support-attack-models`
Changes shown from 43 of 46 commits.
- `1a40fd7` wip (lotif)
- `1d18580` wip (lotif)
- `e42e630` WIP moving forward with the ensemble attack code changes (lotif)
- `a46a010` WIP adding training and synthesizing code (lotif)
- `30c0ed3` More info on readme (lotif)
- `9464962` More ctgan changes (lotif)
- `e5c8fda` Adding the split data code (lotif)
- `8f10678` More config changes and bug fixes (lotif)
- `077d909` Removing ids dynamically (lotif)
- `b711fbd` Working! (lotif)
- `efdde68` Merge branch 'main' into marcelo/ensamble-ctgan (lotif)
- `1a38af2` Fixing indent on config file and adding some more information to the … (lotif)
- `af4f04e` Adding test attack model code (lotif)
- `5afb774` Small bug fixes (lotif)
- `e4ec793` Updates to readme and config file values (lotif)
- `1c13126` Small changes on configs and script bug fixes (lotif)
- `4e9a8c9` Adding the compute attack success script and fixing minor issues (lotif)
- `d83aabf` CR by CodeRabbit and Sara (lotif)
- `a198f e9` Reducing the amount of training samples to 20k (lotif)
- `0416dbc` Merge branch 'main' into marcelo/ensamble-ctgan (lotif)
- `e69b07e` Change function name to avoid pytest thinking it's a test (lotif)
- `579d0f3` Merge remote-tracking branch 'origin/marcelo/ensamble-ctgan' into mar… (lotif)
- `5fa4fef` Fixing test assertions (lotif)
- `8b6bf10` Merge branch 'main' into marcelo/ensamble-ctgan (lotif)
- `a9369f6` Making population_all_with_challenge.csv into a constant and adding a… (lotif)
- `163bba8` Addressing last comments by Fatemeh (lotif)
- `bf805c1` Merge branch 'main' into marcelo/ensamble-ctgan (lotif)
- `ecab1e2` WIP adding model runner class (lotif)
- `dda8c5e` Merge branch 'main' into marcelo/support-attack-models (lotif)
- `38a20b5` working first refactor (lotif)
- `ac1a0bf` train attack model working (lotif)
- `2c3fa1e` Adding changes for the test model script (lotif)
- `ca87ac3` Linter changes (lotif)
- `cfb4ded` Merge branch 'main' into marcelo/support-attack-models (lotif)
- `c42ee6e` Fixing mypy and ruff (lotif)
- `093b0e4` Tests passing (lotif)
- `d50ff39` renaming model to models (lotif)
- `7135924` Small bug fix (lotif)
- `082ea7c` Bringing back the config json saving function against my will (lotif)
- `94da62e` one more bug fix (lotif)
- `cc2cb81` Fixing the test (lotif)
- `fe78e34` One more refactor to make things simpler. (lotif)
- `26f88f6` CR by Coderabbit (lotif)
- `5137a87` Fixing a bug on the amount of shadow model samples to generate (lotif)
- `068d936` CR by David (lotif)
- `e0007e8` Merge branch 'main' into marcelo/support-attack-models (emersodb)
```diff
@@ -1,37 +1,31 @@
 import shutil
 from logging import INFO
 from pathlib import Path
-from typing import cast

 import pandas as pd
 from omegaconf import DictConfig

-from examples.ensemble_attack.real_data_collection import COLLECTED_DATA_FILE_NAME
-from midst_toolkit.attacks.ensemble.data_utils import load_dataframe
-from midst_toolkit.attacks.ensemble.rmia.shadow_model_training import (
-    train_three_sets_of_shadow_models,
-)
-from midst_toolkit.attacks.ensemble.shadow_model_utils import (
-    ModelType,
-    TrainingResult,
-    save_additional_training_config,
-    train_or_fine_tune_and_synthesize_with_ctgan,
-    train_tabddpm_and_synthesize,
-)
-from midst_toolkit.common.config import ClavaDDPMTrainingConfig, CTGANTrainingConfig
+from examples.ensemble_attack.real_data_collection import (
+    COLLECTED_DATA_FILE_NAME,
+)
+from midst_toolkit.attacks.ensemble.data_utils import load_dataframe
+from midst_toolkit.attacks.ensemble.models import EnsembleAttackModelRunner
+from midst_toolkit.attacks.ensemble.rmia.shadow_model_training import train_three_sets_of_shadow_models
+from midst_toolkit.attacks.ensemble.shadow_model_utils import update_and_save_training_config
 from midst_toolkit.common.logger import log

 DEFAULT_TABLE_NAME = "trans"
 DEFAULT_ID_COLUMN_NAME = "trans_id"
-DEFAULT_MODEL_TYPE = ModelType.TABDDPM


-def run_target_model_training(config: DictConfig) -> Path:
+def run_target_model_training(model_runner: EnsembleAttackModelRunner, config: DictConfig) -> Path:
     """
     Function to run the target model training for RMIA attack.

     Args:
+        model_runner: The model runner to be used for training the target model.
+            Should be an instance of a subclass of `EnsembleAttackModelRunner`.
         config: Configuration object set in config.yaml.

     Returns:

@@ -54,11 +48,6 @@
     target_folder = target_model_output_path / "target_model"

-    model_type = DEFAULT_MODEL_TYPE
-    if "model_name" in config.shadow_training:
-        model_type = ModelType(config.shadow_training.model_name)
-    log(INFO, f"Training target model with model type: {model_type.value}")
-
     target_folder.mkdir(parents=True, exist_ok=True)
     shutil.copyfile(
         target_training_json_config_paths.table_domain_file_path,

@@ -68,30 +57,16 @@
         target_training_json_config_paths.dataset_meta_file_path,
         target_folder / "dataset_meta.json",
     )
-    configs, save_dir = save_additional_training_config(
+
+    configs = update_and_save_training_config(
+        config=model_runner.training_config,
         data_dir=target_folder,
         training_config_json_path=Path(target_training_json_config_paths.training_config_path),
         final_config_json_path=target_folder / f"{table_name}.json",  # Path to the new json
         experiment_name="trained_target_model",
-        model_type=model_type,
     )
+    model_runner.training_config = configs

-    train_result: TrainingResult
-    if model_type == ModelType.TABDDPM:
-        train_result = train_tabddpm_and_synthesize(
-            train_set=df_real_data,
-            configs=cast(ClavaDDPMTrainingConfig, configs),
-            save_dir=save_dir,
-            synthesize=True,
-            number_of_points_to_synthesize=config.shadow_training.number_of_points_to_synthesize,
-        )
-    elif model_type == ModelType.CTGAN:
-        train_result = train_or_fine_tune_and_synthesize_with_ctgan(
-            dataset=df_real_data,
-            configs=cast(CTGANTrainingConfig, configs),
-            save_dir=save_dir,
-            synthesize=True,
-        )
+    train_result = model_runner.train_or_fine_tune_and_synthesize(dataset=df_real_data, synthesize=True)
```
> **Collaborator:** This is a much needed change!
```diff
     # To train the attack model (metaclassifier), we only need to save target's synthetic data,
     # and not the entire target model's training result object.

@@ -105,11 +80,17 @@
     return target_model_synthetic_path


-def run_shadow_model_training(config: DictConfig, df_challenge_train: pd.DataFrame) -> list[Path]:
+def run_shadow_model_training(
+    model_runner: EnsembleAttackModelRunner,
+    config: DictConfig,
+    df_challenge_train: pd.DataFrame,
+) -> list[Path]:
     """
     Function to run the shadow model training for RMIA attack.

     Args:
+        model_runner: The model runner to be used for training the shadow models.
+            Should be an instance of `EnsembleAttackModelRunner`.
         config: Configuration object set in config.yaml.
         df_challenge_train: DataFrame containing the data that is used to train RMIA shadow models.

@@ -130,10 +111,7 @@
     # Population data is used to pre-train some of the shadow models.
     df_population_with_challenge = load_dataframe(Path(config.data_paths.population_path), data_file_name)

-    model_type = DEFAULT_MODEL_TYPE
-    if "model_name" in config.shadow_training:
-        model_type = ModelType(config.shadow_training.model_name)
-    log(INFO, f"Training shadow models with model type: {model_type.value}")
+    log(INFO, f"Training shadow models with model runner: {model_runner}")

     # Make sure master challenge train and population data have the id column.
     assert id_column_name in df_challenge_train.columns, (

@@ -146,6 +124,7 @@
     # ``master_challenge_df`` is used for fine-tuning for half of the shadow models.
     # For the other half of the shadow models, only ``master_challenge_df`` is used for training.
     first_set_result_path, second_set_result_path, third_set_result_path = train_three_sets_of_shadow_models(
+        model_runner=model_runner,
         population_data=df_population_with_challenge,
         master_challenge_data=df_challenge_train,
         shadow_models_output_path=Path(config.shadow_training.shadow_models_output_path),

@@ -157,9 +136,7 @@
         # ``4 * n_models_per_set`` total shadow models.
         n_models_per_set=4,  # 4 based on the original code, must be even
         n_reps=12,  # Number of repetitions of challenge points in each shadow model training set. `12` based on the original code
-        number_of_points_to_synthesize=config.shadow_training.number_of_points_to_synthesize,
         random_seed=config.random_seed,
-        model_type=model_type,
     )
     log(
         INFO,
```
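The core of the refactor is visible in the hunks above: the per-model `if/elif` dispatch on `ModelType` collapses into a single polymorphic call on a runner object. A minimal sketch of that pattern follows — only the `EnsembleAttackModelRunner` name and the `train_or_fine_tune_and_synthesize` method come from the diff; the toy internals and `ToyTrainingResult` are illustrative stand-ins, not the toolkit's actual implementation:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyTrainingResult:
    """Stand-in for the toolkit's training result: holds synthetic output."""

    synthetic_data: list[Any] = field(default_factory=list)


class EnsembleAttackModelRunner(ABC):
    """Base runner: each subclass encapsulates one generative model's training."""

    def __init__(self, training_config: dict[str, Any]) -> None:
        self.training_config = training_config

    @abstractmethod
    def train_or_fine_tune_and_synthesize(self, dataset: list[Any], synthesize: bool = True) -> ToyTrainingResult:
        """Train (or fine-tune) the model and optionally synthesize data."""


class ToyTabDDPMRunner(EnsembleAttackModelRunner):
    def train_or_fine_tune_and_synthesize(self, dataset: list[Any], synthesize: bool = True) -> ToyTrainingResult:
        # A real runner would train TabDDPM here; this toy just echoes the data.
        return ToyTrainingResult(synthetic_data=list(dataset) if synthesize else [])


class ToyCTGANRunner(EnsembleAttackModelRunner):
    def train_or_fine_tune_and_synthesize(self, dataset: list[Any], synthesize: bool = True) -> ToyTrainingResult:
        # Different model, same interface: no branching needed at the call site.
        return ToyTrainingResult(synthetic_data=list(reversed(dataset)) if synthesize else [])


def run_target_model_training(model_runner: EnsembleAttackModelRunner, dataset: list[Any]) -> ToyTrainingResult:
    # The caller no longer inspects a model_type enum; the runner decides how to train.
    return model_runner.train_or_fine_tune_and_synthesize(dataset=dataset, synthesize=True)
```

Adding support for a new target model then means writing one new runner subclass, with no edits to the training scripts themselves.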
```diff
@@ -21,6 +21,12 @@
 from examples.ensemble_attack.run_shadow_model_training import run_shadow_model_training
 from midst_toolkit.attacks.ensemble.blending import BlendingPlusPlus, MetaClassifierType
 from midst_toolkit.attacks.ensemble.data_utils import load_dataframe
+from midst_toolkit.attacks.ensemble.models import (
+    EnsembleAttackModelRunner,
+    EnsembleAttackTabDDPMModelRunner,
+    EnsembleAttackTabDDPMTrainingConfig,
+)
+from midst_toolkit.attacks.ensemble.process_split_data import PROCESSED_TRAIN_DATA_FILE_NAME
 from midst_toolkit.common.logger import log
 from midst_toolkit.common.random import set_all_random_seeds
 from midst_toolkit.models.clavaddpm.train import get_df_without_id

@@ -87,7 +93,11 @@
     return data_frame[id_column_name]


-def run_rmia_shadow_training(config: DictConfig, df_challenge: pd.DataFrame) -> list[dict[str, list[Any]]]:
+def run_rmia_shadow_training(
+    model_runner: EnsembleAttackModelRunner,
+    config: DictConfig,
+    df_challenge: pd.DataFrame,
+) -> list[dict[str, list[Any]]]:
     """
     Three sets of shadow models will be trained as a part of this attack.
     Note that shadow models need to be trained on the collection of challenge points once and used

@@ -96,14 +106,16 @@
     of the shadow models, and these shadow models are used to attack all target models.

     Args:
-        config: Configuration object set in ``experiments_config.yaml``.
+        model_runner: The model runner to be used for training the shadow models.
+            Should be an instance of `EnsembleAttackModelRunner`.
+        config: Configuration object set in config.yaml.
         df_challenge: DataFrame containing the challenge data points for shadow model training.

     Return:
         A list containing three dictionaries, each representing a collection of shadow
         models with their training data and generated synthetic outputs.
     """
-    shadow_model_paths = run_shadow_model_training(config, df_challenge_train=df_challenge)
+    shadow_model_paths = run_shadow_model_training(model_runner, config, df_challenge_train=df_challenge)

     assert len(shadow_model_paths) == 3, "For testing, meta classifier needs the path to three sets of shadow models."

@@ -198,7 +210,7 @@
     # Load master challenge train data
     df_master_train = load_dataframe(
         processed_attack_data_path,
-        "master_challenge_train.csv",
+        PROCESSED_TRAIN_DATA_FILE_NAME,
     )
     log(
         INFO,

@@ -254,12 +266,17 @@
     return df_challenge


-def train_rmia_shadows_for_test_phase(config: DictConfig) -> list[dict[str, list[Any]]]:
+def train_rmia_shadows_for_test_phase(
+    model_runner: EnsembleAttackModelRunner,
+    config: DictConfig,
+) -> list[dict[str, list[Any]]]:
     """
     Function to train RMIA shadow models for the testing phase using the dataset containing
     challenge data points.

     Args:
+        model_runner: The model runner to be used for training the shadow models.
+            Should be an instance of `EnsembleAttackModelRunner`.
         config: Configuration object set in ``experiments_config.yaml``.

     Returns:

@@ -279,7 +296,7 @@
         )
         df_master_train = load_dataframe(
             processed_attack_data_path,
-            "master_challenge_train.csv",
+            PROCESSED_TRAIN_DATA_FILE_NAME,
         )
     else:
         # If challenge data does not exist, collect it from the cluster

@@ -292,15 +309,10 @@
     # Load the challenge dataframe for training RMIA shadow models.
     rmia_training_choice = RmiaTrainingDataChoice(config.target_model.attack_rmia_shadow_training_data_choice)
     df_challenge = select_challenge_data_for_training(rmia_training_choice, df_challenge_experiment, df_master_train)
-    return run_rmia_shadow_training(config, df_challenge=df_challenge)
+    return run_rmia_shadow_training(model_runner, config, df_challenge=df_challenge)


-# TODO: Perform inference on all the target models sequentially in a single run instead of running this script
-# multiple times. For more information, refer to https://app.clickup.com/t/868h4xk86
-@hydra.main(config_path="configs", config_name="experiment_config", version_base=None)
-def run_metaclassifier_testing(
-    config: DictConfig,
-) -> None:
+def run_metaclassifier_testing(model_runner: EnsembleAttackModelRunner, config: DictConfig) -> None:
```
```diff
     """
     Function to run the attack on a single target model using a trained metaclassifier.
     Note that RMIA shadow models need to be trained for every new set of target models on

@@ -313,6 +325,8 @@
     Test prediction probabilities are saved to the specified attack result path in the config.

     Args:
+        model_runner: The model runner to be used for testing the metaclassifier.
+            Should be an instance of `EnsembleAttackModelRunner`.
         config: Configuration object set in ``experiments_config.yaml``.
     """
     log(

@@ -382,7 +396,7 @@
     if not models_exists:
         log(INFO, "Shadow models for testing phase do not exist. Training RMIA shadow models...")
-        shadow_data_collection = train_rmia_shadows_for_test_phase(config)
+        shadow_data_collection = train_rmia_shadows_for_test_phase(model_runner, config)
     else:
         log(INFO, "All shadow models for testing phase found. Using existing RMIA shadow models...")

@@ -427,5 +441,32 @@
     save_results(attack_results_path, metaclassifier_model_name, probabilities, pred_score)


+# TODO: Perform inference on all the target models sequentially in a single run instead of running this script
+# multiple times. For more information, refer to https://app.clickup.com/t/868h4xk86
+@hydra.main(config_path="configs", config_name="experiment_config", version_base=None)
+def run_metaclassifier_testing_with_tabddpm(config: DictConfig) -> None:
+    """
+    Run the attack on a single target model using a trained metaclassifier.
+    RMIA shadow models will be trained using the TabDDPM model.
+
+    Args:
+        config: Configuration object set in config.yaml.
+    """
+    log(INFO, "Running metaclassifier testing with TabDDPM...")
+
+    with open(config.shadow_training.training_json_config_paths.training_config_path, "r") as file:
+        training_config = EnsembleAttackTabDDPMTrainingConfig(**json.load(file))
+    training_config.fine_tuning_diffusion_iterations = (
+        config.shadow_training.fine_tuning_config.fine_tune_diffusion_iterations
+    )
+    training_config.fine_tuning_classifier_iterations = (
+        config.shadow_training.fine_tuning_config.fine_tune_classifier_iterations
+    )
+
+    model_runner = EnsembleAttackTabDDPMModelRunner(training_config=training_config)
```
> **Collaborator:** Similar comment here about config processing.
```diff
+
+    run_metaclassifier_testing(model_runner, config)


 if __name__ == "__main__":
-    run_metaclassifier_testing()
+    run_metaclassifier_testing_with_tabddpm()
```
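The new TabDDPM entry point above loads a JSON training config, then overrides the two fine-tuning iteration counts from the Hydra config before constructing the runner. A self-contained sketch of that flow follows — the `TabDDPMTrainingConfig` dataclass and its `diffusion_iterations` field are illustrative stand-ins for `EnsembleAttackTabDDPMTrainingConfig`, not the toolkit's actual class:

```python
import json
from dataclasses import dataclass
from pathlib import Path
from tempfile import TemporaryDirectory


@dataclass
class TabDDPMTrainingConfig:
    # Hypothetical stand-in for EnsembleAttackTabDDPMTrainingConfig.
    diffusion_iterations: int = 1000
    fine_tuning_diffusion_iterations: int = 0
    fine_tuning_classifier_iterations: int = 0


def load_training_config(path: Path, fine_tune_diffusion: int, fine_tune_classifier: int) -> TabDDPMTrainingConfig:
    # Load the base JSON config, then apply the fine-tuning overrides,
    # mirroring the steps in run_metaclassifier_testing_with_tabddpm.
    with open(path, "r") as file:
        config = TabDDPMTrainingConfig(**json.load(file))
    config.fine_tuning_diffusion_iterations = fine_tune_diffusion
    config.fine_tuning_classifier_iterations = fine_tune_classifier
    return config


with TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "trans.json"
    config_path.write_text(json.dumps({"diffusion_iterations": 500}))
    config = load_training_config(config_path, fine_tune_diffusion=100, fine_tune_classifier=20)
    print(config.fine_tuning_diffusion_iterations)  # 100
```

Unspecified fields fall back to the dataclass defaults, so a partial JSON config is enough to construct a full training config.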
> **Reviewer:** Perhaps you've already thought of this, but should the code above be part of the base for the ModelRunner? That is, should lines 87-94 actually happen inside that class rather than in the attack script here?

> **Reviewer:** This would also slightly simplify the process of subbing out the model, since you would just need to sub the runner class instead of both the runner and the config class? I might be missing a complexity though.

> **Author:** Not sure if I understood your idea, but I thought maybe if I pass the config dictionary to the init of the model runner class we would be able to skip making the config. Is that it?

> **Reviewer:** Sort of. My thought was that you could simply have the `EnsembleAttackTabDDPMModelRunner` init take a path to the configuration file. Then you could load the file and do all of the steps to properly construct the `EnsembleAttackTabDDPMTrainingConfig` object within the runner class. That way a user doesn't have to do that themselves. It's possible I'm missing something where that would be a bad idea though 🙂

> **Reviewer:** Let me know if my explanation of what I was trying to suggest isn't clear. We can talk about it together.
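The alternative discussed in this thread can be sketched as follows — the runner's `__init__` accepts the config file path and builds the training config itself, so callers no longer assemble the config by hand. All class and field names here are hypothetical simplifications of the toolkit's API, written only to illustrate the proposed shape:

```python
import json
from dataclasses import dataclass
from pathlib import Path
from tempfile import TemporaryDirectory


@dataclass
class TabDDPMTrainingConfig:
    # Hypothetical stand-in for EnsembleAttackTabDDPMTrainingConfig.
    diffusion_iterations: int = 1000
    fine_tuning_diffusion_iterations: int = 0
    fine_tuning_classifier_iterations: int = 0


class TabDDPMModelRunner:
    """Sketch of the reviewer's suggestion: config loading lives in the runner."""

    def __init__(self, config_path: Path, fine_tune_diffusion: int = 0, fine_tune_classifier: int = 0) -> None:
        # The runner itself loads the JSON config and applies overrides,
        # so the entry point shrinks to constructing the runner.
        with open(config_path, "r") as file:
            self.training_config = TabDDPMTrainingConfig(**json.load(file))
        self.training_config.fine_tuning_diffusion_iterations = fine_tune_diffusion
        self.training_config.fine_tuning_classifier_iterations = fine_tune_classifier


with TemporaryDirectory() as tmp:
    path = Path(tmp) / "trans.json"
    path.write_text(json.dumps({"diffusion_iterations": 500}))
    runner = TabDDPMModelRunner(path, fine_tune_diffusion=100, fine_tune_classifier=20)
```

The trade-off raised in the thread: this hides the config plumbing from users, but couples the runner to one on-disk config format, which may be why the PR keeps construction in the entry point.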