SLURM Usage.md

SLURM

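  1. Probably because srun is interactive, a job submitted with it is interrupted if the connection drops. sbatch is not interactive, so a job submitted with it keeps running even if the connection drops.
  2. Sample configuration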

    #!/bin/bash

    #SBATCH --job-name=fileTest
    #SBATCH --partition=audace2018
    #SBATCH --mem=1024
    #SBATCH --cpus-per-task=1
    #SBATCH --gres=gpu:1

    #SBATCH --error=../batchLog/error.log
    #SBATCH --output=../batchLog/output.log

    python ../code/file.py
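  3. Among the configuration lines that begin with `#SBATCH`, if any one of them fails to parse, processing appears to jump straight to the end and the job is executed.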

  4. There is no need to use `--get-user-env`.
  5. ```#SBATCH --gres=gpu:1``` means "allocate one GPU", not "take one GPU from a partition named gpu".
  6. `AssertionError: The NVIDIA driver on your system is too old (found version 10000`
    * Download the driver that matches your GPU model from this [page](http://www.nvidia.com/Download/index.aspx).
    * Direct download link: ```http://us.download.nvidia.com/tesla/440.33.01/NVIDIA-Linux-x86_64-440.33.01.run```
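
The `found version 10000` in the error message looks like a CUDA driver version packed into one integer (1000 × major + 10 × minor), which would make 10000 mean CUDA 10.0. Below is a minimal sketch for decoding it, assuming that encoding applies here. `decode_cuda_version` and `query_driver_version` are hypothetical helper names, and the library name `libcuda.so.1` is an assumption about how the driver is installed on the node.

```python
import ctypes


def decode_cuda_version(raw):
    """Split an integer such as 10000 into (major, minor), assuming 1000*major + 10*minor."""
    return raw // 1000, (raw % 1000) // 10


def query_driver_version(lib_name="libcuda.so.1"):
    """Ask the CUDA driver library for its version; return None if it cannot be queried."""
    try:
        libcuda = ctypes.CDLL(lib_name)  # assumed library name, may differ per system
    except OSError:
        return None  # no driver library found on this node
    version = ctypes.c_int(0)
    # cuDriverGetVersion returns 0 (CUDA_SUCCESS) when the call succeeds.
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return version.value


if __name__ == "__main__":
    raw = query_driver_version()
    if raw is None:
        print("CUDA driver library not found")
    else:
        print("Driver supports up to CUDA %d.%d" % decode_cuda_version(raw))
```

Running this on each node of a partition is one way to see which nodes have a driver at all, and which CUDA version it supports, before blaming the Python side.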

  5. It is possible to specify several partitions in the options of your script or srun. In this case SLURM launches your job on the first available partition.

  6. You can also define several steps in a job (and therefore launch several programs in ==parallel or sequentially==) via the srun command
  7. Job arrays provide a very simple way to submit a large number of independent jobs. They can typically be used to apply the same program to different input data.
  8. Strangely, on both HPC and HPC2 CUDA runs fine from C, but not from Python.
  9. I checked on opale (the only node which has CUDA installed)
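  10. To see which machines are in which partition, use the `sinfo -N` command.
  11. With sbatch, all the run settings live in the .sh file. With srun, the command to execute goes directly after srun. This therefore involves the makefile and the setup of the runtime environment.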
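
For the job arrays mentioned above, each array task can pick its own input using the `SLURM_ARRAY_TASK_ID` environment variable that SLURM sets for array jobs. This is a minimal sketch, assuming the inputs are a plain list of paths. `pick_input` and the file names are hypothetical.

```python
import os


def pick_input(inputs, env=None):
    """Return the input assigned to this array task, based on SLURM_ARRAY_TASK_ID."""
    env = os.environ if env is None else env
    # Fall back to task 0 when the script runs outside an array job.
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task_id < len(inputs):
        raise IndexError("task id %d has no matching input" % task_id)
    return inputs[task_id]


if __name__ == "__main__":
    # Hypothetical input files, one per array task.
    files = ["data_0.csv", "data_1.csv", "data_2.csv"]
    print(pick_input(files))
```

Submitted with `#SBATCH --array=0-2` in the batch script, each of the three tasks would then process a different file.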

    Logging in Python

    import logging
    import time

    print("Hello World")

    # Timestamped log file name; the ../codeLog/ directory must already exist.
    fileName = '../codeLog/' + time.strftime("%Y:%m:%d_%I-%M-%S_%p") + '.log'
    logFormat = '%(levelname)s: %(message)s'
    # level=logging.DEBUG records every level; filemode='w' overwrites an existing file.
    logging.basicConfig(filename=fileName, filemode='w', format=logFormat, level=logging.DEBUG)

    logging.debug('This is a debug message')
    logging.info('This is an info message')
    logging.warning('This is a warning message')
    logging.error('This is an error message')
    logging.critical('This is a critical message')

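  12. The default logging level is WARNING, so debug and info messages are not shown unless the level is changed. Set it to DEBUG, the lowest level, to show all messages.
  13. The timestamp cannot use the format `strftime("%Y/%m/%d_%I-%M-%S_%p")`: the slashes are treated as path separators, so the folder 20/2/10 cannot be found.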
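
The two notes above can be folded into a small helper that builds a slash-free timestamped file name and creates the log directory first. This is a minimal sketch: `make_log_path` and `setup_logging` are hypothetical helpers, the default directory follows the `../codeLog/` path used earlier, and the 24-hour `%H` field is a choice of this sketch rather than the original `%I`/`%p` pair.

```python
import logging
import os
import time


def make_log_path(log_dir, now=None):
    """Build <log_dir>/<timestamp>.log with no '/' in the timestamp, creating log_dir if needed."""
    os.makedirs(log_dir, exist_ok=True)
    # Dashes and underscores only, so the stamp is never read as nested folders.
    stamp = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime(now))
    return os.path.join(log_dir, stamp + ".log")


def setup_logging(log_dir="../codeLog"):
    """Configure the root logger to write every level to a fresh timestamped file."""
    path = make_log_path(log_dir)
    # level=DEBUG so that debug and info messages are written too.
    logging.basicConfig(filename=path, filemode="w",
                        format="%(levelname)s: %(message)s", level=logging.DEBUG)
    return path
```

Calling `setup_logging()` once at the start of the script would replace the manual `fileName` and `basicConfig` lines.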