
Notes on training diffusion models with Colossal-AI

Date: 2025-06-14 14:02:18 | Category: IT & Technology

Introduction to GPU programming (OpenAI)

Colossal-AI official documentation

Colossal-AI + ldm overview: Diffusion Pretraining and Hardware Fine-Tuning Can Be Almost 7X Cheaper! Colossal-AI’s Open Source Solution Accelerates AIGC at a Low Cost

ControlNet for hand generation, project page: GitHub

Basic usage

Reference link

train: call the colossalai and pytorch-lightning packages.

# case 1: HybridAdam
from lightning.pytorch import Trainer, LightningModule
from colossalai.nn.optimizer import HybridAdam

class MyDiffuser(LightningModule):
    ...
    def configure_sharded_model(self) -> None:
        # create your model here
        self.model = construct_diffuser_model(...)
    ...
    def configure_optimizers(self):
        # use the specified optimizer
        optimizer = HybridAdam(self.model.parameters(), self.lr)
        ...

model = MyDiffuser()
trainer = Trainer(accelerator="gpu", devices=1, precision=16, strategy="colossalai")
trainer.fit(model)

# case 2: ColossalAIStrategy memory optimization
from lightning.pytorch import Trainer, LightningModule
from lightning.pytorch.strategies import ColossalAIStrategy

Mystrategy = ColossalAIStrategy(use_chunk=True, enable_distributed_storage=True, placement_policy="auto")
trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=Mystrategy)
trainer.fit(model)

finetune: low cost; a model that generates your own style can be trained in a short time with modest resources.

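Fine-tuning starts from the open-source Stable Diffusion weights on HuggingFace.

Steps:

1. Modify the Dataloader to load the fine-tuning dataset.

2. Load the pretrained weights.

3. Adjust the yaml parameter config and run the training script.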

python main.py --logdir /your_log_dir -t -b config/train_colossalai.yaml

model:
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    your_sub_module_config:
      target: your.model.import.path
      params:
        from_pretrained: your_file_path/unet/diffusion_pytorch_model.bin
        ...
lightning:
  trainer:
    strategy:
      target: pytorch_lightning.strategies.ColossalAIStrategy
      params:
        ...

inference: the native Stable Diffusion inference pipeline is supported.

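After training or fine-tuning, just call the diffusers library and load the model parameters you saved.

-> Model inference is not sensitive to numerical precision.

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("your_ColoDiffusion_checkpoint_path").to("cuda")
# Older diffusers releases returned the image under ["sample"][0] instead of .images[0].
image = pipe("your prompt", num_inference_steps=50).images[0]
image.save("file path")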

Colossal-AI+diffusion

Colossal-AI diffusion training project: GitHub

ldm project: GitHub

ldm code tutorial

The repository provides the script train_colossalai.sh to run the training task with colossalai, and train_ddp.sh to run the same task with DDP for comparison.

train_colossalai.sh runs the training task with this command:

python main.py --logdir /tmp/ --train --base configs/train_colossalai.yaml --ckpt 512-base-ema.ckpt

You can change the --logdir to decide where to save the log information and the last checkpoint.

You will find your ckpt in logdir/checkpoints or logdir/diff_tb/version_0/checkpoints

You will find your train config yaml in logdir/configs

You can add the --ckpt if you want to load the pretrained model, for example 512-base-ema.ckpt

You can change the --base to specify the path of config yaml
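The flags above can also be assembled programmatically, for example when launching several runs. This is a minimal sketch that assumes only the flags named in this post (--logdir, --train, --base, --ckpt); the helper names build_train_command and checkpoint_dirs are hypothetical and are not part of the repository.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_train_command(logdir: str, base: str, ckpt: Optional[str] = None) -> List[str]:
    """Assemble the main.py invocation described above as an argv list."""
    cmd = ["python", "main.py", "--logdir", logdir, "--train", "--base", base]
    if ckpt is not None:
        # --ckpt is optional: pass it only when loading pretrained weights.
        cmd += ["--ckpt", ckpt]
    return cmd


def checkpoint_dirs(logdir: str) -> List[Path]:
    """The two locations where, per the notes above, checkpoints may be written."""
    root = Path(logdir)
    return [root / "checkpoints", root / "diff_tb" / "version_0" / "checkpoints"]


cmd = build_train_command("/tmp/", "configs/train_colossalai.yaml", "512-base-ema.ckpt")
print(shlex.join(cmd))
for directory in checkpoint_dirs("/tmp/"):
    print(directory)
```

The argv list can be passed to subprocess.run directly, which avoids shell quoting problems when the log directory contains spaces.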

Training config

Some parameters that can be changed in train_colossalai.yaml:

devices: number of devices used for training, default 8

max_epochs: maximum number of training epochs, default 2

precision: the precision type used in training, default 16 (fp16); you must use fp16 if you want to apply colossalai
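A small sketch of overriding these three options while enforcing the fp16 requirement. It works on a plain dict rather than the real yaml file, and it assumes the options sit under lightning.trainer; that nesting and the helper name override_trainer are assumptions for illustration, so check your own train_colossalai.yaml.

```python
import copy
from typing import Any, Dict

# Defaults quoted in the list above; the nesting under lightning.trainer is assumed.
DEFAULT_CONFIG: Dict[str, Any] = {
    "lightning": {"trainer": {"devices": 8, "max_epochs": 2, "precision": 16}}
}


def override_trainer(config: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Return a copy of config with lightning.trainer entries replaced."""
    new_config = copy.deepcopy(config)
    trainer = new_config["lightning"]["trainer"]
    for key, value in overrides.items():
        if key not in trainer:
            raise KeyError(f"unknown trainer option: {key}")
        trainer[key] = value
    if trainer["precision"] != 16:
        # The notes above state that fp16 is required when colossalai is applied.
        raise ValueError("precision must be 16 (fp16) to use colossalai")
    return new_config


single_gpu = override_trainer(DEFAULT_CONFIG, devices=1, max_epochs=5)
print(single_gpu["lightning"]["trainer"])
```

Returning a deep copy keeps the defaults untouched, so several variants can be derived from the same base config.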

Environment setup

Colossal-AI requires pytorch-lightning >= 1.8.1.

Colossal-AI has been integrated as the official large-model solution for PyTorch Lightning.

Step 2: install lightning

Install Lightning version later than 2022.01.04. We suggest you install lightning from source.

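Note that pytorch-lightning versions that are too old have no strategy argument, let alone ColossalAIStrategy.

In testing, versions 1.4.1 and 1.6.1 both raised errors or warnings; switching to 1.8.1 finally ran successfully.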

pip install pytorch-lightning==1.8.1
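Before launching a long run it can help to check the installed version against this minimum. The sketch below uses only the standard library; the helper names and the simplified version parsing (leading integer components only) are assumptions for illustration, not a full version-spec implementation.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.8.1', '1.8.1+cu117' or '1.8.1.post0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.8.1") -> bool:
    """Compare two version strings by their leading integer components."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package: str = "pytorch-lightning") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("pytorch-lightning is not installed")
else:
    print(found, "OK" if meets_minimum(found) else "older than 1.8.1")
```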

An RTX 3090 needs CUDA >= 11.6.

Otherwise the following message appears:

NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.

The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
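The message lists the architectures the installed PyTorch build was compiled for. The sketch below checks a compute capability against such a list. It uses a simplified rule, assumed here for illustration: a GPU counts as covered when some sm_XY entry shares its major version. The function name is hypothetical; the optional part at the end only runs if torch is importable and a GPU is visible.

```python
from typing import Iterable, Tuple


def is_capability_covered(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Simplified rule: covered if some sm_XY entry has the same major version."""
    major, _minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX-only entries such as compute_37
        digits = "".join(ch for ch in arch[len("sm_"):] if ch.isdigit())
        if digits and int(digits) // 10 == major:
            return True
    return False


# Architectures listed in the error message above.
old_build = ["sm_37", "sm_50", "sm_60", "sm_61", "sm_70", "sm_75", "compute_37"]
print(is_capability_covered((8, 6), old_build))  # RTX 3090 reports capability (8, 6)

try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    # Compare what the installed build supports with what the GPU reports.
    print(torch.version.cuda, torch.cuda.get_arch_list(), torch.cuda.get_device_capability(0))
```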

See the following article:

Fix: NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with …

xformers requires pytorch >= 1.13.1:

pytorch 1.13.1

torchvision 0.14.1

cuda 11.7

The initial environment built from environment.yaml reports the following error:

/home/user/anaconda3/envs/colossalai/lib/python3.7/site-packages/torchvision/io/image.py:11: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")

torchvision is probably the problem. Reinstalling pytorch fixed it; the following versions were tested and run successfully:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
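A mismatched torch/torchvision pair is one plausible cause of the warning above. As a sketch, the pairings used in this post's environments can be put into a small lookup and checked before training. The table only contains the two pairs that appear here, and the helper names are hypothetical.

```python
from typing import Dict, Optional

# Pairings taken from the install commands in this post.
TORCHVISION_FOR_TORCH: Dict[str, str] = {
    "1.13.1": "0.14.1",
    "1.12.1": "0.13.1",
}


def strip_local(version: str) -> str:
    """Drop a local build suffix such as '+cu117'."""
    return version.split("+")[0]


def versions_match(torch_version: str, torchvision_version: str) -> Optional[bool]:
    """True or False for known pairings, None if the torch version is not in the table."""
    expected = TORCHVISION_FOR_TORCH.get(strip_local(torch_version))
    if expected is None:
        return None
    return expected == strip_local(torchvision_version)


print(versions_match("1.13.1+cu117", "0.14.1+cu117"))
print(versions_match("1.12.1", "0.14.1"))
```

In a real environment the two arguments would come from torch.__version__ and torchvision.__version__.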

A senior labmate uses:

pytorch 1.12.1

torchvision 0.13.1

cuda 11.6

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

This setup produces the output below. The build log is dominated by one deprecation warning that repeats for every Tensor.data<T>() call in multi_tensor_l2norm_kernel.cu and multi_tensor_apply.cuh; one instance is shown, and the run ends in a segmentation fault.

...
=========================================================================================
No pre-built kernel is found, build and load the fused_optim kernel during runtime now
=========================================================================================
Detected CUDA files, patching ldflags
Emitting ninja build file /home/fangfei/.cache/colossalai/torch_extensions/torch1.12_cu11.6/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
...
[6/7] /home/fangfei/.conda/envs/tmm/bin/nvcc -DTORCH_EXTENSION_NAME=fused_optim -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/kernels/include -I/home/fangfei/.conda/envs/tmm/include -isystem /home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/torch/include -isystem /home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/torch/include/TH -isystem /home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/torch/include/THC -isystem /home/fangfei/.conda/envs/tmm/include -isystem /home/fangfei/.conda/envs/tmm/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options -fPIC -O3 --use_fast_math -lineinfo -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -std=c++14 -c /home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu -o multi_tensor_l2norm_kernel.cuda.o
/home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu: In function ‘std::tuple<at::Tensor, at::Tensor> multi_tensor_l2norm_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor> >, c10::optional<bool>)’:
/home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu:289:89: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
 tensor_lists[0][0].scalar_type(), 0, "multi_tensor_l2norm_cuda",
 ^
/home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/torch/include/ATen/core/TensorBody.h:235:1: note: declared here
 T * data() const {
 ^ ~~
...
[7/7] c++ colossal_C_frontend.o multi_tensor_sgd_kernel.cuda.o multi_tensor_scale_kernel.cuda.o multi_tensor_adam.cuda.o multi_tensor_l2norm_kernel.cuda.o multi_tensor_lamb.cuda.o -shared -L/home/fangfei/.conda/envs/tmm/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/home/fangfei/.conda/envs/tmm/lib64 -lcudart -o fused_optim.so
Loading extension module fused_optim...
Time to load fused_optim op: 67.05005216598511 seconds
Segmentation fault (core dumped)

triton

A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named triton ...

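This does not stop the program from running, but a quick search suggests xformers is the cause.

Reference article: Fixing xformers installation errors in the Stable Diffusion webUI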
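To confirm whether the optional modules are present in the active environment, a quick standard-library check is enough. This is a minimal sketch; module_available is a hypothetical helper name, and the check only reports whether a module can be found, not whether it works with the installed PyTorch.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


for name in ("triton", "xformers"):
    status = "found" if module_available(name) else "missing"
    print(f"{name}: {status}")
```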

Segmentation fault (core dumped)

...
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
DiffusionWrapper has 865.91 M params.
=========================================================================================
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
=========================================================================================
Emitting ninja build file /home/fangfei/.cache/colossalai/torch_extensions/torch1.13_cu11.7/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.413104295730591 seconds
=========================================================================================
No pre-built kernel is found, build and load the fused_optim kernel during runtime now
=========================================================================================
Detected CUDA files, patching ldflags
Emitting ninja build file /home/fangfei/.cache/colossalai/torch_extensions/torch1.13_cu11.7/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
Time to load fused_optim op: 1.8385069370269775 seconds
Segmentation fault (core dumped)

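Running Stable Diffusion on its own never triggered this problem, so the cause is either the code Colossal-AI adds or a package conflict.

An issue has been filed on the Colossal-AI GitHub and is awaiting a reply… In the meantime, I tried to get more error output to locate the fault.

Segmentation fault (core dumped) is a common failure when running programs. It usually comes from improper memory operations, such as reading or writing through null or wild pointers, out-of-bounds array access, or corrupted constants. In a large program it is hard to pinpoint where the fault occurs, so the core file is used to track it down.

[Problem solving] Finding the real cause of a Segmentation fault (core dumped) error

How to generate and debug the core file produced when a Linux program crashes

According to these references, a core dump can be inspected through its core file.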

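1. ulimit -c unlimited  # do not limit the core file size

After this, a core file is written automatically to the default location whenever the program dumps core.

2. If no core file appears, change the output path to the current directory, for example:

sudo sysctl -w kernel.core_pattern=core.%p

3. Inspect the core file by passing gdb the executable and the core file:

gdb python core.37795

or, from inside gdb: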

core-file core.7416
...
[New LWP 39080]
Core was generated by `python main.py --logdir tmp --train --base configs/train_colossalai_cifar10.yaml.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fba24575540 in ?? ()
[Current thread is 1 (LWP 37795)]
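The core-file limit from step 1 can also be raised from inside Python, which is convenient when the training script is started by a launcher rather than an interactive shell. This is a Unix-only sketch using the standard resource module; unlike ulimit -c unlimited it raises the soft limit only as far as the current hard limit, which is the most an unprivileged process may do. The helper names are hypothetical.

```python
import resource
from pathlib import Path
from typing import Optional, Tuple


def raise_core_limit() -> Tuple[int, int]:
    """Raise the soft core-file size limit to the hard limit for this process."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def core_pattern() -> Optional[str]:
    """Read where the kernel writes core files (Linux only), or None if unreadable."""
    try:
        return Path("/proc/sys/kernel/core_pattern").read_text().strip()
    except OSError:
        return None


soft, hard = raise_core_limit()
print("core limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
print("core pattern:", core_pattern())
```

The limit applies to the current process and its children, so the call belongs at the top of the entry script, before any CUDA extension is loaded.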

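The key line is: Program terminated with signal SIGSEGV

SIGSEGV means the program tried to read or write outside its allocated memory, or to write to a read-only region (null-pointer access, out-of-bounds access, use of freed memory, …).

Frustratingly, that covers too many possible causes to pinpoint the fault.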
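One way to narrow the search, not used in the original notes, is Python's standard faulthandler module. It prints the Python-level stack of every thread when the process receives SIGSEGV, which at least shows which Python call entered the crashing native code. A minimal sketch follows; the log file name and helper name are arbitrary choices for illustration.

```python
import faulthandler
import os
import tempfile
from typing import TextIO


def enable_crash_traceback(log_path: str) -> TextIO:
    """Dump the Python stack of all threads to log_path on SIGSEGV and similar signals."""
    log_file = open(log_path, "w")
    # The file must stay open for as long as the handler is installed.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


crash_log = enable_crash_traceback(os.path.join(tempfile.gettempdir(), "segfault_traceback.log"))
print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be switched on without code changes by starting the interpreter with python -X faulthandler, in which case the traceback goes to stderr.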

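Trying to debug Python with gdb

GDB debugging guide (an introduction; this one article is enough)

Debugging a Python process with gdb

Debugging Python programs with gdb

To be resolved…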
