I am training a masked language model with Horovod on a Databricks GPU cluster. Partway through training, after 13 epochs, the mentioned error arises ...
I found a similar thread here: #846, but none of the fixes there worked for me. In particular, I have verified that I have Java 8, that the environment variables are (to the best of my knowledge) set ...