{"id":645,"date":"2025-02-19T14:46:10","date_gmt":"2025-02-19T13:46:10","guid":{"rendered":"https:\/\/www.wolter.tech\/?p=645"},"modified":"2025-03-03T10:25:42","modified_gmt":"2025-03-03T09:25:42","slug":"chain-jobs-on-jewels-booster","status":"publish","type":"post","link":"https:\/\/www.wolter.tech\/?p=645","title":{"rendered":"Chain jobs with Slurm"},"content":{"rendered":"\n<p>Sometimes 24 hours are not enough to train your model. Yet big compute providers like the J\u00fclich Supercomputing Centre enforce 24h job limits to ensure everyone gets access.<\/p>\n\n\n\n<p>Chaining jobs allows us to split work into parts without having to submit every part independently by hand. Here is a minimal example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># python_train.py\n\nimport pickle\nimport os\nimport subprocess\n\nif __name__ == '__main__':\n    # Resume from the last checkpoint if one exists, otherwise start fresh.\n    if not os.path.exists('model.pkl'):\n        model_var = 0\n    else:\n        with open('model.pkl', 'rb') as f:\n            model_var = pickle.load(f)\n\n    # One unit of work per job; real code would train here.\n    model_var += 1\n    with open('model.pkl', 'wb') as f:\n        pickle.dump(model_var, f)\n    print(f\"saved model: {model_var}\")\n    if model_var &lt; 5:\n        # Submit the next job in the chain. The command must be a list.\n        subprocess.call([\"sbatch\",\n                         \"run_chain.sh\"])\n    else:\n        print(\"Model trained! Done!\")<\/code><\/pre>\n\n\n\n<p>The script above uses pickle to simulate your code&#8217;s results. Store it in a file called <code>python_train.py<\/code>. On every run it loads <code>model_var<\/code>, increments it, saves it again, and resubmits the job until the counter reaches 5. 
A Slurm file like the one below lets you run this example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/bin\/bash\n#\n#SBATCH -A TODO:your-account-name\n#SBATCH --nodes=1\n#SBATCH --job-name=test_chain\n#SBATCH --output=test_chain-%j.out\n#SBATCH --error=test_chain-%j.err\n#SBATCH --time=00:05:00\n#SBATCH --partition=develbooster\n\nmodule load Python\n\npython python_train.py<\/code><\/pre>\n\n\n\n<p>Call the file <code>run_chain.sh<\/code>.<\/p>\n\n\n\n<p>Submit everything with <code>sbatch run_chain.sh<\/code> and voil\u00e0, each job submits its successor until the stop condition is met. This is useful if you are dealing with deep learning code that needs more than 24h to converge. The pickle code in the example is meant to simulate storing a model to disk and loading it again.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes 24 hours are not enough to train your model. Yet big compute providers like the J\u00fclich Supercomputing Centre enforce 24h job limits to ensure everyone gets access. Chaining jobs allows us to split work into parts without having to submit every part independently by hand. 
Here is a minimal example: The script above uses &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.wolter.tech\/?p=645\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Chain jobs with Slurm&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-645","post","type-post","status-publish","format-standard","hentry","category-research-projects","entry"],"_links":{"self":[{"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/posts\/645","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=645"}],"version-history":[{"count":9,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/posts\/645\/revisions"}],"predecessor-version":[{"id":656,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=\/wp\/v2\/posts\/645\/revisions\/656"}],"wp:attachment":[{"href":"https:\/\/www.wolter.tech\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=645"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=645"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wolter.tech\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=645"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}