pytorch / xla
The `callbacks` entry in the checkpoint created by the function below raised `TypeError: 'mappingproxy' object does not support item assignment` when running `xm.save(checkpoint, ...)`:
```python
def on_save_checkpoint(self):
    """Called when saving a model checkpoint."""
    callback_states = {}
    for callback in self.callbacks:
        callback_class = type(callback)
        state = callback.on_save_checkpoint(self, self.get_model())
        if state:
            callback_states[callback_class] = state
    return callback_states
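For context (my own analysis, not stated in the thread): `callback_states` uses class objects as dictionary keys, and a class's `__dict__` is a read-only `mappingproxy`, so any save routine that tries to rewrite those keys or their attributes in place will fail. A minimal, framework-free illustration:

```python
# Minimal illustration (no PyTorch required): a class's __dict__ is a
# read-only mappingproxy, so item assignment on it raises TypeError.
class DummyCallback:
    pass

# Mirrors the checkpoint shape: the *class* itself is the dict key.
callback_states = {type(DummyCallback()): {"monitor": "val_loss"}}

key = next(iter(callback_states))    # the class object itself
print(type(key.__dict__).__name__)   # prints "mappingproxy"

try:
    key.__dict__["patched"] = True   # what a naive in-place transform may attempt
except TypeError as e:
    print(e)                         # 'mappingproxy' object does not support item assignment
```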
Would it be possible for you to resolve this bug? It is currently breaking PyTorch Lightning's TPU support for callback states.
Steps to reproduce the behavior:
ananthsub mentioned this issue Mar 11, 2021

tchaton commented Mar 30, 2021

Yes, `state_dict` works fine, but in PyTorch Lightning we also save the trainer state (callbacks, etc.).
Modify this line to trigger the error:

```python
self.save(checkpoint, filepath)
```
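One possible workaround on the Lightning side (a sketch of my own, not an official fix): convert the class-object keys to plain strings before handing the checkpoint to `xm.save`, so the saver only ever sees picklable, immutable keys. The helper name `stringify_callback_keys` is hypothetical:

```python
def stringify_callback_keys(callback_states):
    """Replace class-object keys with their qualified names (hypothetical helper).

    Class objects carry a read-only mappingproxy __dict__, which can trip up
    recursive checkpoint transforms; plain strings are always safe dict keys.
    """
    return {
        f"{cls.__module__}.{cls.__qualname__}": state
        for cls, state in callback_states.items()
    }


class EarlyStopping:  # stand-in for a real Lightning callback class
    pass


states = {EarlyStopping: {"wait_count": 3}}
print(stringify_callback_keys(states))
# e.g. {'__main__.EarlyStopping': {'wait_count': 3}} (module name depends on context)
```

Loading then requires resolving the string back to the class, which is the trade-off of this approach.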
zcain117 (Collaborator) commented Apr 2, 2021
```shell
# Create the VM
gcloud compute instances create zcain-vm --zone=us-central1-a --machine-type=e2-highmem-16 \
  --image-family=torch-xla --image-project=ml-images --boot-disk-size=300GB \
  --scopes=https://www.googleapis.com/auth/cloud-platform

# Create the TPU
gcloud compute tpus create zcain-tpu --zone=us-central1-a --network=default \
  --version=pytorch-nightly --accelerator-type=v3-8

# SSH into the VM, then:
git clone https://github.com/PyTorchLightning/pytorch-lightning.git
conda activate torch-xla-nightly
export TPU_IP_ADDRESS=
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
cd pytorch-lightning
pip install .
pip install -r requirements/test.txt
cd pl_examples/domain_templates/
vim computer_vision_fine_tuning.py
python computer_vision_fine_tuning.py
```
If I use torch-xla-nightly, I get an unrelated crash that seems to be caused by TensorBoard.
If I use torch-xla-1.8 I get a crash like this:
```
File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 116, in new_process
    results = trainer.run_stage()
File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 535, in run_stage
    self.run_train()
File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 599, in run_train
    self.train_loop.run_training_epoch()
File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 646, in _save_top_k_checkpoint
    self._update_best_and_save(current, epoch, step, trainer, monitor_candidates)
File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 708, in _update_best_and_save
    self._save_model(trainer, filepath)
File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 410, in _save_model
    self._do_save(trainer, filepath)
File "/anaconda3/envs/torch-xla-1.8/lib/python3.6/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 422, in _do_save
    self.save_function(filepath, self.save_weights_only)
TypeError: 'mappingproxy' object does not support item assignment
```
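To see how a crash of this shape can surface inside the save path, here is a toy reproduction under my assumption that the saver recursively walks nested objects and rewrites their `__dict__` entries in place (roughly what a move-tensors-to-CPU transform has to do). This is not the real `xm.save` code, only the same failure mode:

```python
def naive_inplace_transform(obj, seen=None):
    # Toy stand-in for a recursive "move everything to CPU" pass;
    # NOT the real xm.save internals, just the same failure shape.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return obj
    seen.add(id(obj))
    if isinstance(obj, dict):
        for k in obj:
            naive_inplace_transform(k, seen)       # keys are visited too...
            naive_inplace_transform(obj[k], seen)
        return obj
    if hasattr(obj, "__dict__"):
        for name, value in list(obj.__dict__.items()):
            # For a *class* key, __dict__ is a read-only mappingproxy:
            obj.__dict__[name] = naive_inplace_transform(value, seen)
        return obj
    return obj


class SomeCallback:
    pass


# The Lightning checkpoint keys the callbacks dict by class objects:
checkpoint = {"callbacks": {SomeCallback: {"best_score": 0.1}}}
try:
    naive_inplace_transform(checkpoint)
except TypeError as e:
    print(e)  # 'mappingproxy' object does not support item assignment
```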
If I use torch-xla-1.7, then I get the same crash.
@tchaton it seems like your CI is using torch-xla-1.7, since I see a line like this: `Downloaded newer image for pytorchlightning/pytorch_lightning:base-xla-py3.6-torch1.7`. It also seems there is a test that runs with checkpoints disabled here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/models/test_tpu.py#L391
Some of my questions: