A lightweight and efficient industrial-grade LLM training and alignment framework: OpenRLHF supports full-parameter, full-process RLHF training of 70B models!
What is OpenRLHF?
Since ChatGPT came out, people have paid close attention to the RLHF alignment technology represented by InstructGPT and have tried to reproduce ChatGPT's training process on that basis; representative RLHF reproduction works such as ColossalChat and DeepSpeed-Chat gradually appeared. At that time, however, everyone's understanding of alignment was basically centered on InstructGPT, and since OpenAI has not been very open recently, it lacked sufficient third-party verification. Fortunately, LLaMA2 was released soon after: it not only fully verified the effectiveness of RLHF but also brought enough innovations (such as rejection sampling and multiple RMs), which immediately set the whole LLM open-source community alight.
In view of the popularity of InstructGPT and LLaMA2, our OpenLLMAI open-source community surveyed the current mainstream alignment training frameworks and found that most of them still lack support for full-process, full-parameter LLaMA2 training, lack sufficient scalability, or are not lightweight and easy to use. Therefore, we resolved to build a truly industrial-grade LLM alignment training framework that reproduces the large-model training processes represented by InstructGPT and LLaMA2, supports mainstream alignment techniques such as RLHF/DPO, and helps everyone quickly realize their own alignment ideas.
So, welcome to OpenRLHF, and quickly start your alignment work!
https://github.com/OpenLLMAI/OpenRLHF
OpenRLHF Project Introduction
Currently, OpenLLMAI is mainly working on two projects: OpenRLHF and OpenLLMWiki.
The rest of the projects will depend on subsequent manpower and interest. In the near future we may launch a KD or SE framework; for the time being we are not very interested in training a general-purpose small model, mainly because time, funding and energy are all very limited, and running on love alone is hard, so most of the time we will be interest-driven. Interest alone does not pay the bills, however, so we have recently put a lot of energy into preparing this introduction (Xianyu was previously too laid-back/obsessive/busy, hence many rough edges). Admittedly, OpenLLMAI is still very young and OpenRLHF is not yet perfect, but we have tried our best and hope to gain wider recognition and support from the community. A group of people can go further!
OpenRLHF Design Idea
1. Design goal: a lightweight, efficient, industrial-grade LLM training and alignment framework
Since the industry currently lacks a truly industrial-grade LLM alignment framework, most manufacturers may choose to implement it themselves (thanks to OpenAI for making a good start). This is understandable in the short term, but in the long run, the problem of reinventing the wheel is inevitable.
Therefore, our goal is to create a lightweight, efficient, industrial-grade LLM training and alignment framework. To achieve this, on the one hand we carried out relatively careful development and testing of the first version, striving to make it usable from the start; on the other hand, we officially open-sourced it to attract more like-minded people to build it together. As for the framework itself, we firmly believe that open source is the only way for it to stay alive!
2. Design concept: easy to use, high performance, scalable, exploratory
Easy to use: ease of use is the first guiding principle in the design of OpenRLHF. High performance is a must for any qualified framework, so we do not emphasize it separately; on the premise of ensuring high performance, improving ease of use is our primary goal.
Scalable: starting from 7B, the framework is backward compatible with training 1-2B small models and progressively supports ever larger models, such as 34B/70B/170B training.
Exploratory: beyond the basic framework functions, we will keep up with the cutting edge of alignment technology, track the latest progress and implement it quickly, and also provide the newest alignment algorithms developed by our own team. In the future we will also develop an LLMPipeline module to enable quick practice and fair comparison of mainstream alignment algorithms and mainstream model-training techniques.
3. Implementation ideas
Ease of use: for the base large-model training framework, we surveyed DeepSpeed, Megatron-LM and other LLM training frameworks and chose the more concise and easy-to-use DeepSpeed for the first version; for the model library, we chose Hugging Face Transformers without hesitation; for distributed scaling, we chose Ray, which is mainly used for resource scheduling (don't ask; the answer is Ray!).
Scalable and high performance: Ray is used for sensible GPU resource scheduling, assigning the Actor, Reward, Reference and Critic models to separate GPUs and separating training from inference so that the excellent tools of the inference community can be fully reused; combined with GPU memory-saving techniques such as offload and PEFT, this enables scaling and efficient training of large models. (A minimal Ray placement sketch is shown after this list.)
Exploratory: in the first version we fully reproduced the training processes of InstructGPT and LLaMA2 and supported newer alignment techniques such as DPO. Going forward, we will keep exploring and develop pipeline modules, supporting pipelines for mainstream models such as an InstructGPT Pipeline and a LLaMA2 Pipeline, to help the community make more scientific comparisons and studies.
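To make the scheduling idea concrete, here is a minimal, hypothetical sketch of how Ray can pin each RLHF model to its own GPU. The `ModelWorker` class and role names are our own illustration, not OpenRLHF's actual code, and it assumes at least four visible GPUs.

```python
# Minimal sketch: one Ray actor per RLHF model, each pinned to its own GPU.
# Hypothetical illustration only; OpenRLHF's real scheduling is more involved.
import ray

ray.init()  # assumes a Ray cluster (or local machine) with >= 4 GPUs

@ray.remote(num_gpus=1)
class ModelWorker:
    """Holds one model (actor / critic / reward / reference) on one GPU."""
    def __init__(self, role: str):
        self.role = role
        # In a real system the model weights would be loaded onto this GPU here.

    def ping(self) -> str:
        return f"{self.role} ready"

# One dedicated GPU per model; vLLM rollout engines could be scheduled
# as additional GPU actors in the same way.
workers = {role: ModelWorker.remote(role)
           for role in ("actor", "critic", "reward", "reference")}
print(ray.get([w.ping.remote() for w in workers.values()]))
```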
OpenRLHF Key Highlights
OpenRLHF Key Features
The first open-source RLHF alignment framework to fully reproduce LLaMA2 and InstructGPT;
Supports full-process SFT/RM/PPO training;
Supports rejection sampling and multiple RMs;
Easy to use: OpenRLHF is one of the simplest high-performance RLHF libraries currently available; RLHF training of a 34B model can be launched with one-click scripts on a single 8-card DGX A100 node;
Separation of training and inference, distributed and scalable RLHF:
Separation of training and inference: training and inference are decoupled so that the community's excellent inference tools can be reused (we eventually adopted vLLM), reducing generation latency;
Distributed and scalable: with the support of Ray/DeepSpeed/vLLM and reasonable resource scheduling, we achieve efficient and scalable training. Two examples:
Full-process training of a 7B model on multiple 24GB RTX 4090 GPUs;
Full-process training of a 70B+ model on multiple A100 80G GPUs with vLLM;
High performance: thanks to Ray/DeepSpeed, various GPU memory-saving techniques and inference acceleration frameworks, our training performance on the 13B LLaMA2 model is more than 4x that of DeepSpeedChat;
GPU memory-saving techniques:
ZeRO series
FlashAttention2
LoRA, QLoRA
Offload
Gradient checkpointing
Inference acceleration: vLLM
Cutting-edge: Keep up with cutting-edge progress, currently supporting mainstream alignment technologies and mainstream large models;
Cutting-edge models:
LLaMA
Baichuan
Qwen
Mixtral 8x7B
State-of-the-art alignment techniques:
Standard RLHF: SFT/RM/PPO;
Rejection Sampling;
DPO (Direct Preference Optimization)/IPO/cDPO (the DPO objective is sketched below);
Kahneman-Tversky Optimization (KTO);
Conditional SFT (https://arxiv.org/abs/2308.12050);
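For reference, DPO optimizes the policy directly on preference pairs without training a separate reward model. Its objective, as given in the original DPO paper (not a framework-specific formulation), is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

where $y_w$/$y_l$ are the chosen/rejected responses in the preference dataset $\mathcal{D}$, $\pi_{\mathrm{ref}}$ is the frozen SFT reference policy, and $\beta$ controls how far the trained policy may drift from the reference.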
Reinforcement learning tricks: We integrated the implementation tricks for PPO to improve the training stability, referencing Implementation Matters in Deep Policy Gradients and ppo-implementation-details.
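As an illustration of what such tricks look like in practice, below is a short, generic sketch of two of them (advantage normalization and PPO-style value clipping). The function names and defaults are our own; this is not OpenRLHF's actual implementation.

```python
# Generic sketches of two common PPO implementation tricks; illustrative only.
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten advantages per batch to keep the policy-gradient scale stable."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

def clipped_value_loss(values: torch.Tensor,
                       old_values: torch.Tensor,
                       returns: torch.Tensor,
                       clip_range: float = 0.2) -> torch.Tensor:
    """PPO-style value clipping: bound how far the new value estimate may move
    away from the old one within a single update."""
    values_clipped = old_values + (values - old_values).clamp(-clip_range, clip_range)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```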
OpenRLHF Performance Demonstration
Support Matrix:
The following support matrix shows the comparison between OpenRLHF and the mainstream LLM alignment training framework (there may be delays in the survey, please contact us for corrections if there are any errors or omissions):
Framework | PPO Tricks | 34B full params / 4 A100 | 70B+ full params / 16 A100 | 7B full params / 4 RTX 4090 | QLoRA | Mixtral MoE 8x7B |
---|---|---|---|---|---|---|
OpenRLHF | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
DeepSpeedChat | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
ColossalAIChat | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ | ✖️ |
TRL | ✔ | ✖️ | ✖️ | ✖️ | ✔ | ✖️ |
LLaMA-Factory | ✖️ | ✖️ | ✖️ | ✖️ | ✔ | ✔ (QLoRA) |
The main advantages of OpenRLHF are good scalability and efficient performance: it supports efficient full-process, full-parameter training of 70B models and can also handle even larger-scale expansion in the future. Frameworks such as LLaMA-Factory/trl/trlx share a common problem: they do not support 70B full-parameter RLHF training. Some of them focus on LoRA fine-tuning of roughly 13B-level models and generally adopt the merged actor-critic scheme, which is a stopgap to save GPU memory in small-scale RLHF; it does not conform to the standard RLHF implementation, scales poorly, and will sooner or later fail to fit the models. Of course, OpenRLHF also has disadvantages, such as incomplete documentation and benchmarks and ease of use that still needs improvement. Specifically, we offer the following comparison between OpenRLHF and other popular RLHF frameworks (please feel free to point out any errors or omissions); a more detailed and comprehensive comparison can be found later in our official technical report.
LLaMA-Factory: its advantages are efficient fine-tuning and ease of use (worth learning from; it even has a web UI), but it uses a merged actor-critic, cannot support 70B full-parameter PPO training, and does not scale easily to larger models;
Colossal-Chat: uses single-step RL, while our framework uses step-wise RL. See OpenRLHF vs Colossal-Chat for details;
trl/trlx: the advantage is excellent compatibility with the Hugging Face ecosystem, but it can be too deeply encapsulated and hard to modify; likewise, it currently does not support 70B full-parameter PPO training, and it uses a merged actor-critic to save GPU memory, which is inconsistent with the standard implementation;
NeMo-Aligner: The generation based on Megatron is currently inefficient, which affects the overall training efficiency. It is not very compatible with the Hugging Face ecosystem, and the model may need to be specially modified.
Performance data:
According to existing tests, the training efficiency of our OpenRLHF framework on the 13B model is about 4 times that of DeepSpeedChat (due to manpower limitations, there may be delays in testing. You can report the performance data of other frameworks to us for correction).
Framework | 7B llama2 RLHF | 13B llama2 RLHF (50k samples) |
---|---|---|
OpenRLHF | - | 17 hours with 8 A100 |
DeepSpeedChat | - | 48 hours with 16 A100 |
Training throughput:
Default configuration:
4 A100 80G for Actor, 2 A100 80G for Critic, 1 A100 80G for RM, and 1 A100 80G for InitPolicy
ZeRO2 with Adam Offload
Max Sequence Length: 2048
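For readers who want to reproduce this setup, a DeepSpeed configuration fragment in the same spirit might look like the sketch below; the exact values and the bf16 choice are assumptions, not OpenRLHF's shipped configuration.

```python
# Illustrative DeepSpeed config matching "ZeRO2 with Adam Offload" above;
# the numbers and bf16 choice are assumptions, not OpenRLHF's exact settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,        # assumed; see the rollout/train sizes below
    "zero_optimization": {
        "stage": 2,                               # ZeRO-2: shard optimizer states and gradients
        "offload_optimizer": {"device": "cpu"},   # offload Adam states to CPU memory
    },
    "bf16": {"enabled": True},                    # assumed mixed precision
    "gradient_clipping": 1.0,
}
```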
Performance throughput (samples/s in the default configuration, which will be changed to tokens/s later):
7B llama2: 0.105 samples/gpu/sec; micro_batch_size = 16/8 (rollout/train), generation_length = 100~300
13B llama2: 0.04 samples/gpu/sec; micro_batch_size = 8/4 (rollout/train), generation_length = 200~400
34B codellama: 0.007 samples/gpu/sec; micro_batch_size = 2/1 (rollout/train), generation_length = 300~800
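Since throughput will later be reported in tokens/s, a rough back-of-the-envelope conversion (our own illustration, using the midpoint of the reported generation-length range and counting only generated tokens) is:

```python
# Rough conversion from samples/gpu/sec to generated tokens/gpu/sec.
# Illustrative only: uses the midpoint of the reported generation lengths
# and ignores prompt tokens.
def samples_to_gen_tokens_per_gpu(samples_per_gpu_sec: float,
                                  gen_len_min: int,
                                  gen_len_max: int) -> float:
    avg_gen_len = (gen_len_min + gen_len_max) / 2
    return samples_per_gpu_sec * avg_gen_len

# 7B llama2: 0.105 samples/gpu/sec, generation_length = 100~300
print(samples_to_gen_tokens_per_gpu(0.105, 100, 300))  # ~21 generated tokens/gpu/sec
```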
Performance data for mainstream models (due to limited manpower we have not had time to retest; the figures below were measured when each model was first supported, the current PPO version should be much faster, and more models and updated numbers will appear in the official technical report):
model | SFT | RM | PPO | Notes |
---|---|---|---|---|
Baichuan2-7B | 1h | 4h | 71h | |
Qwen-7B | - | - | - |
How to use OpenRLHF
Official documentation:
The official documents including this article will be maintained on Github. Improving the quality of the documents to improve usability is also one of the key directions of our subsequent work (due to manpower reasons, the documents are currently rough, and everyone is welcome to participate and contribute):
Project Homepage
Official Documentation
OpenRLHF Installation
We support nvidia-docker (recommended, to avoid potential environment issues) or installation in a conda environment (a pre-configured conda environment or image may be provided later):
First, clone the repository:
git clone https://github.com/openllmai/OpenRLHF.git
Then, install the nv-docker or conda environment:
# Option 1: nvidia-docker
cd examples/scripts
# install nvidia-docker (optional)
./nvidia_docker_install.sh
# launch the nvidia container
./docker_run.sh

# Option 2: conda environment
conda create -n openrlhf python=3.10
# some packages need to be installed manually; when installing torch, match your CUDA version
pip install packaging ninja
pip3 install torch
# check ninja
ninja --version
echo $?  # output: 0
# install flash-attn (may take some time)
# on network errors, you can download the specified version from https://github.com/Dao-AILab/flash-attention/releases
pip install flash-attn==2.4.2
./build_openrlhf.sh
# enjoy it!
conda activate openrlhf
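After installation, a quick sanity check from Python (our own suggestion, not an official OpenRLHF script) can confirm that the CUDA stack and flash-attn are usable:

```python
# Quick environment sanity check; not an official OpenRLHF script.
import torch

print(torch.__version__)            # installed torch version
print(torch.cuda.is_available())    # True if the CUDA driver/runtime are visible
print(torch.cuda.device_count())    # number of visible GPUs

try:
    import flash_attn               # installed above via `pip install flash-attn==2.4.2`
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn not importable:", err)
```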
OpenRLHF Training
Training script:
After configuring the environment, go to the /openrlhf/examples/scripts directory, modify the training scripts according to your needs, and start training with one click. Both single-machine and multi-machine training are supported, as is full-parameter, full-process training of 7B-70B+ models. The following are some important parameters that users can modify as needed to train their own models:
-pretrain: pretrained model path, in Hugging Face format
-dataset: dataset path, in Hugging Face format
-dataset_probs: sampling probabilities for mixing multiple datasets, for example: 0.5,0.4,0.1 (a toy illustration of this mixing appears after this parameter list)
-save_path: model save path, in Hugging Face format
-max_epochs: number of training epochs
-micro_train_batch_size: single GPU batch_size
-train_batch_size: global batch_size
-learning_rate: learning rate
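To make the dataset_probs semantics concrete, here is a toy, framework-independent sketch of what mixing datasets with probabilities 0.5/0.4/0.1 means; the dataset names and contents are made up, and this is not OpenRLHF's internal sampling code.

```python
# Toy illustration of dataset mixing with sampling probabilities, as implied by
# the dataset_probs parameter. Not OpenRLHF's internal implementation.
import random

datasets = {
    "dataset_a": ["a-sample-0", "a-sample-1"],
    "dataset_b": ["b-sample-0", "b-sample-1"],
    "dataset_c": ["c-sample-0", "c-sample-1"],
}
dataset_probs = [0.5, 0.4, 0.1]  # one weight per dataset, summing to 1.0

def sample_mixed() -> str:
    """Pick a dataset according to dataset_probs, then a sample from it."""
    name = random.choices(list(datasets), weights=dataset_probs, k=1)[0]
    return random.choice(datasets[name])

print([sample_mixed() for _ in range(5)])
```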
Single-machine training scripts:
cd examples/scripts
# install nvidia-docker (optional)
./nvidia_docker_install.sh
# launch the nvidia container
./docker_run.sh
# cd in container
cd /openrlhf/examples/scripts
# build OpenRLHF (i.e., pip install)
./build_openrlhf.sh
# huggingface login
~/.local/bin/huggingface-cli login
# continued pretraining
./train_continue_pretrain_llama.sh
# train SFT model
./train_sft_llama.sh
# train RM model
./train_rm_llama.sh
# train PPO model
./train_ppo_llama.sh
# train DPO model
./train_dpo_llama.sh
# train KTO model
./train_kto_llama.sh
# train Rejection Sampling model
./train_rejection_sampling_llama.sh
# train Conditional SFT model
./train_conditional_llama.sh
Multi-machine training script: full-parameter RLHF training of a 70B model on 16 A100 GPUs:
cd examples/scripts
# launch the nvidia container
./docker_run.sh
# cd in container
cd /openrlhf/examples/scripts
# build OpenRLHF (i.e., pip install)
./build_openrlhf.sh
# due to the compatibility of the NVIDIA PyTorch image
pip uninstall xgboost transformer_engine -y
# huggingface login
~/.local/bin/huggingface-cli login
# launch the master node of ray in the container
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
# if you want to launch ray on more nodes, use
ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8
# train ray PPO model, requires 8 gpus in the default config
./train_ppo_llama_ray.sh
# for 70B models and vLLM-based RLHF (important!)
pip install vllm==0.3.2
# due to the compatibility of vLLM
pip uninstall flash_attn -y
./train_ppo_llama_ray_70b.sh
Inference
For inference and evaluation, we recommend reusing open-source tools or code from the community. You can refer to the following scripts:
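As one example of reusing community inference tooling, a minimal batch-generation snippet with vLLM's Python API (our own illustration, not one of the repository's scripts; the model path is a placeholder for your trained checkpoint) looks like:

```python
# Minimal vLLM batch-generation example; illustrative, not an official script.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/your-rlhf-model")          # any Hugging Face-format checkpoint
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["What is OpenRLHF?"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)                       # generated continuation
```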
Future Work
The future development of OpenRLHF will focus on ease of use and practicality (documentation, tutorials, practical experience, etc.), cutting-edge (new algorithms, model pipelines, etc.) and stability. Specifically, there are the following potential tasks, and we hope that everyone can participate:
Documentation: Chinese and English versions
Tutorials: provide good tutorials;
Environment: Provide a configured image or conda environment;
Performance testing, benchmark;
Testing of basic functions
Comparison with other frameworks
Support model testing
Testing of the alignment algorithm
Further performance optimization;
Stability improvement: regular code review;
New functions and new algorithms;
New model support: Google's new models, etc.
Evaluation module: provide more comprehensive evaluation capabilities;
About the Organization
OpenLLMAI: Open AI for everyone.
We may be far from OpenAI, but we are very close to Open. Therefore, we have only two requirements for our members: be open enough and be confident enough. Our attitude is to give off whatever light and heat we have. We are willing to walk the road of AI with you; one must be broad-minded and resolute, for the burden is heavy and the road is long!
We are all here for the love of LLM, and we want to do two things: 1. Exchange LLM technology (technology sharing, knowledge dissemination); 2. Develop LLM tools (training frameworks, models, data engineering, etc.). Interested students are welcome to join us! For a detailed introduction to the organization, please see the old Zhihu article OpenLLMAI Organization Introduction.
Developers
Along the way, the OpenRLHF project has attracted 20+ contributors, 130+ commits, and 800+ stars. We would like to thank all contributors, especially hijkzzz, wuxibin and Xianyu, for their outstanding contributions to the project. hijkzzz and Xianyu are the project's initiators; hijkzzz, as a project administrator, submitted the first version of the code and has devoted a great deal of energy to long-term maintenance, making an irreplaceable contribution. wuxibin, as a core developer, is mainly responsible for scaling the framework with Ray and has long handled day-to-day maintenance. Xianyu, as a project administrator, is responsible for the NLP side of development and some project planning. In addition, pikaqqqqqq, li-plus, wwxFromTju, jovany-wang, xffxff, dabney777, suc16, Dylancer1998 and others have also made important contributions to the project (we cannot list everyone here; all contributors will be acknowledged in the formal technical report/paper, and many students and teachers who did not directly contribute code also offered valuable suggestions; thank you all). We welcome more like-minded friends to join us and hope OpenLLMAI grows together with everyone!
Students who are interested in contributing can participate directly in development on GitHub, or contact the relevant person in charge or the official email address.
RL: hijkzzz
Ray:wuxibin
NLP: Xianyu
Official email: xianyuai@openllmai.top
Sponsor Us
Currently, OpenLLMAI is a purely open-source organization: OpenRLHF/OpenLLMWiki and our other projects, as well as OpenLLM Talk and the technical exchange groups, are all completely open. In the long run, however, it is unsustainable without financial support; it has not been easy to get this far on love alone, and we thank everyone who has supported us along the way. Finally, we welcome your sponsorship, whether with money (computing power!!!) or with people (participating in development or contributing in other ways)! For sponsorship or cooperation, please contact xianyuai@openllmai.top.
References
https://github.com/OpenLLMAI/OpenRLHF
https://github.com/NVIDIA/Megatron-LM
InstructGPT
LLaMA2
https://github.com/facebookresearch/llama
Hugging Face Transformers
DeepSpeed
https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat
Ray
https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat
https://github.com/CarperAI/trlx
https://github.com/NVIDIA/NeMo-Aligner
https://github.com/hiyouga/LLaMA-Factory
https://github.com/OpenLLMAI/OpenLLMWiki
【OpenLLMAI】Believe in the power of open source: We have our own organization! The road ahead is long and arduous, but we will reach our goal if we keep going! - Article by OpenLLMAI - Zhihu https://zhuanlan.zhihu.com/p/647882819
How to correctly reproduce Instruct GPT / RLHF? - Article by Snail in the Garden Parkour - Zhihu https://zhuanlan.zhihu.com/p/622134699
Start the training journey: Building an open-source RLHF full-scale training framework for 70B+ models based on Ray and vLLM - Article by Snail in the Garden Parkour - Zhihu https://zhuanlan.zhihu.com/p/678828949
【OpenLLM 006】LoRA: Low-rank adaptation of large models - What is the recently popular LoRA? Why do stable diffusion and open-source ChatGPT reproductions both use it? - OpenLLMAI article - Zhihu https://zhuanlan.zhihu.com/p/620327907
https://arxiv.org/abs/2005.12729
https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
https://github.com/baichuan-inc/Baichuan2
https://github.com/QwenLM/Qwen
https://mistral.ai/news/mixtral-of-experts/
https://github.com/OpenLLMAI/OpenRLHF/issues/221