OpenRLHF is a lightweight, efficient, industrial-grade LLM training and alignment framework that supports full-parameter, full-process RLHF training of 70B models!

After ChatGPT emerged, everyone began to pay attention to the RLHF alignment technology represented by InstructGPT and, based on it, tried to reproduce ChatGPT's training process. Representative RLHF reproduction works such as ColossalChat and DeepSpeed-Chat gradually appeared. But at that time, the community's understanding of alignment technology basically revolved around InstructGPT, and since OpenAI has not been very open lately, that understanding lacked full third-party verification. Fortunately, LLaMA2 soon came out: it not only fully verified the effectiveness of RLHF technology, but also contained enough innovations (such as rejection sampling and multiple RMs) that it immediately set the open source community alight.

In view of the popularity of InstructGPT and LLaMA2, our OpenLLMAI open source community investigated the current mainstream alignment training frameworks and found that most frameworks still lack support for the full-process and full-parameter training of LLaMA2, lack sufficient scalability, or are not lightweight and easy to use. Therefore, we are determined to build a truly industrial-level LLM alignment training framework, reproduce the large model training process represented by InstructGPT and LLaMA2, support mainstream alignment technologies such as RLHF/DPO, and help everyone quickly realize their own alignment ideas.

So, welcome to OpenRLHF: quickly start your alignment work!

OpenRLHF project introduction

Currently, OpenLLMAI is mainly working on two projects:

Any further projects will depend on follow-up manpower and interest; in the near future we may launch a KD or SE framework. We are not planning to train a general-purpose small model for now, mainly because time, funds, and energy are all very limited: running on passion alone is somewhat unsustainable, so most of the time we are interest-driven. But interest cannot put food on the table, so we have recently spent a lot of energy preparing this promotional material (in the past, Xianyu was too laid-back/perfectionist/busy, which caused many problems). OpenLLMAI is still young and OpenRLHF is not perfect yet, but we have done our best, and we sincerely hope to win wider recognition and support from the community, so that a group of people can go further together!


OpenRLHF design ideas

1. Design goal: a lightweight, efficient, industrial-grade LLM training and alignment framework

Since the industry currently lacks a truly industrial-level LLM alignment framework, most manufacturers may choose to implement it themselves (thanks to OpenAI for a good start). This is understandable in the short term, but in the long run it is inevitable to reinvent the wheel.

Therefore, our goal is to build a lightweight, efficient, industrial-grade LLM training and alignment framework. To achieve this, on the one hand we did more careful development and testing to make the first version as usable as possible; on the other hand, we are officially open sourcing it here to attract more like-minded people to build it together. As for frameworks, we firmly believe that only open source has vitality!

2. Design philosophy: easy to use, high performance, scalable, exploratory

  • Simple and easy to use: Ease of use is our first guiding principle in designing the OpenRLHF framework. High performance is a given for any qualified framework, so we will not over-emphasize it; on the premise of ensuring high performance, improving ease of use is our first goal.

  • Scalable: Starting from 7B models, the framework is backward compatible with training 1-2B small models and progressively supports ever-larger model scales, such as 34B/70B/170B training.

  • Exploratory: Beyond guaranteeing basic framework functionality, we will keep pace with the frontier of alignment technology, track the latest progress and implement it quickly, and also provide the latest alignment algorithms developed by our own team. In the future we will develop an LLMPipeline module to provide quick practice with, and fair comparison of, mainstream alignment algorithms and mainstream model training techniques.

3. Implementation approach

  • Ease of use: For the base large-model framework, we investigated LLM training frameworks such as DeepSpeed/Megatron-LM and chose the simpler, easier-to-use DeepSpeed for the first version. For the model library, we chose Hugging Face Transformers without hesitation. For distributed scaling, we chose Ray (don't ask; if you ask, the answer is Ray!), mainly used for resource scheduling.

  • Scalable and high performance: Use Ray for sensible GPU resource scheduling, allocating the Actor, Reward, Reference, and Critic models to separate GPUs, and separate training from inference to make full use of the inference community's excellent tools, combined with offload, PEFT, and other memory-saving techniques to achieve scale-up and efficient training of large models.

  • Exploratory: The first version fully reproduces the training processes of InstructGPT and LLaMA2 and supports newer alignment techniques such as DPO. Going forward, we will stay exploratory and develop pipeline modules, such as an InstructGPTPipeline and a LLaMA2Pipeline, to help the community conduct more scientific comparisons and research on mainstream methods.
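As a reference for the DPO support mentioned above, the core pairwise objective can be sketched as follows. This is a minimal illustration based on the published DPO loss, not OpenRLHF's actual code; the function name and arguments are our own.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Pairwise DPO loss for one preference pair.

    Arguments are total log-probabilities of the chosen and rejected
    responses under the trained policy and the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), computed stably via log1p
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits

# The loss shrinks as the policy prefers the chosen response more
# strongly than the reference model does.
loss_good = dpo_loss(-10.0, -12.0, -11.0, -11.0)  # policy favors chosen
loss_bad = dpo_loss(-12.0, -10.0, -11.0, -11.0)   # policy favors rejected
```

The appeal of DPO here is that no separate reward model or PPO loop is needed: the loss is computed directly from log-probabilities of preference pairs.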

OpenRLHF main highlights

OpenRLHF main features

  • The first open source RLHF alignment framework to comprehensively reproduce LLaMA2 and InstructGPT;

    • Supports full-process SFT/RM/PPO training;

    • Supports rejection sampling and multiple RMs;

  • Easy to use: OpenRLHF is one of the simplest high-performance RLHF libraries available; RLHF training of a 34B model needs only a single 8-card DGX A100 node, and training can be launched with one click via the provided scripts;

  • Training/inference separation, distributed and scalable RLHF;

    • Training/inference separation: separating training from inference lets us reuse good inference tools from the community (we finally settled on vLLM) to reduce inference latency;

    • Distributed and scalable: with the support of ray/deepspeed/vLLM and reasonable resource scheduling, we have achieved efficient and scalable training. Two examples:

      • Full-process training of a 7B model on multiple 24GB RTX 4090 GPUs

      • Full-process training of 70B+ models on multiple A100 80G GPUs with vLLM

  • High performance: Thanks to ray/deepspeed, the memory-saving techniques below, and an inference acceleration framework, our training performance on the 13B LLaMA2 model is more than 4 times that of DeepSpeedChat;

    • Memory-saving techniques:

      • ZeRO series

      • FlashAttention2

      • LoRA, QLoRA

      • offload

      • gradient checkpointing

    • Inference acceleration: vLLM

  • Cutting edge: keeping up with the frontier, currently supporting mainstream alignment techniques and mainstream large models;

    • Cutting-edge models:

      • LLaMA

      • baichuan

      • qwen

      • Mixtral 8*7b

    • Cutting-edge alignment techniques:

      • Standard RLHF: SFT/RM/PPO;

      • Rejection Sampling;

      • DPO (direct-preference-optimization)/IPO/cDPO;

      • Kahneman-Tversky Optimization (KTO);

      • Conditional SFT;

  • Reinforcement learning tricks: We integrated the implementation tricks for PPO to improve training stability, referencing Implementation Matters in Deep Policy Gradients and ppo-implementation-details.
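Two of the most commonly cited PPO implementation tricks, batch advantage normalization and the clipped surrogate objective, can be sketched as follows. This is a generic illustration of the techniques those references describe, not OpenRLHF's implementation; function names are our own.

```python
import math

def normalize_advantages(advs, eps=1e-8):
    """Batch advantage normalization, a standard PPO stability trick."""
    mean = sum(advs) / len(advs)
    var = sum((a - mean) ** 2 for a in advs) / len(advs)
    return [(a - mean) / (math.sqrt(var) + eps) for a in advs]

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); clipping removes the incentive
    to move the policy far outside [1 - clip_eps, 1 + clip_eps].
    """
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# A ratio outside the clip range earns no extra objective:
# ppo_clip_objective(1.5, 1.0) is capped at 1.2 * 1.0.
```

The pessimistic `min` is what makes the objective a lower bound: large policy updates are never rewarded, whether the advantage is positive or negative.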


OpenRLHF performance demonstration

Support matrix:

The following support matrix shows the comparison between OpenRLHF and mainstream LLM alignment training frameworks (there may be delays in the research, please contact us for corrections if there are any errors or omissions):

The matrix compares frameworks along these dimensions: PPO tricks; 34B full parameters on 4 A100s; 70B+ full parameters on 16 A100s; 7B full training on 4 RTX 4090s; QLoRA; Mixtral MoE 8*7b.

The main advantages of OpenRLHF are good scalability and efficient performance: it supports efficient full-process, full-parameter training of 70B models and can handle even larger-scale expansion in the future. Frameworks such as LLaMA-Factory/trl/trlx share similar problems: they do not support 70B full-parameter RLHF training, and some focus on LoRA fine-tuning of ~13B-class models, generally by merging the actor and critic (a stop-gap for doing small-scale RLHF with limited GPU memory, but it does not match the standard RLHF implementation and scales very poorly; sooner or later the model no longer fits). Of course, OpenRLHF also has disadvantages, such as incomplete documentation and benchmarks, and ease of use that still needs improvement. Specifically, we compare OpenRLHF with various popular RLHF frameworks as follows (corrections of errors or omissions are welcome); a more detailed and comprehensive comparison can be found in our official technical report.

  • LLaMA-Factory: its advantage is efficient fine-tuning and ease of use (very much worth learning from; it even has a web UI). But it uses a merged actor-critic, cannot support 70B full-parameter PPO training, and is not easy to scale up in model size;

  • Colossal-Chat: uses single-step RL, while our framework uses step-wise RL. See OpenRLHF vs Colossal-Chat for details;

  • trl/trlx: the advantage is very good compatibility with the Hugging Face ecosystem, but the abstraction may be wrapped too deeply, making modification difficult. Similarly, 70B full-parameter PPO training is not currently supported, and a merged actor-critic is used to save GPU memory, which is inconsistent with the standard implementation;

  • NeMo-Aligner: Megatron-based generation is currently inefficient, which hurts overall training efficiency, and its compatibility with the Hugging Face ecosystem is not great; models may need dedicated modification;
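To make the scalability argument above concrete, here is a rough back-of-the-envelope estimate (our own illustration, with our own function name, not a figure from any framework's documentation) of why full-parameter 70B training forces distributed sharding rather than merged-model tricks:

```python
def full_training_memory_gb(n_params_billion):
    """Very rough per-model memory for full-parameter training with
    bf16 weights/gradients and fp32 Adam states (master weights plus
    two moments). Activations, ZeRO sharding, and offload are all
    ignored; this is an illustration only.
    """
    n = n_params_billion * 1e9
    weights = 2 * n    # bf16 parameters
    grads = 2 * n      # bf16 gradients
    adam = 12 * n      # fp32 master copy + exp_avg + exp_avg_sq
    return (weights + grads + adam) / 1e9

# A 70B actor alone lands around 1120 GB of training state before any
# sharding, far beyond one GPU, which is why full-parameter 70B RLHF
# needs ZeRO/offload across many GPUs rather than merged-model tricks.
```

Under these assumptions a 7B model already needs ~112 GB of training state, so even "small" full-parameter PPO benefits from ZeRO sharding and offload.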

Performance data:

According to existing tests, the training efficiency of our OpenRLHF framework on the 13B model is about 4 times that of DeepSpeedChat (limited by manpower, tests may lag behind; you are welcome to report performance data for other frameworks to us for correction).

|               | 7B llama2 RLHF | 13B llama2 RLHF (50k samples) |
|---------------|----------------|-------------------------------|
| OpenRLHF      |                | 17 hours with 8 A100          |
| DeepSpeedChat |                | 48 hours with 16 A100         |

Training throughput:

  • Default allocation:

    • 4 A100 80G for Actor, 2 A100 80G for Critic, 1 A100 80G for RM, and 1 A100 80G for InitPolicy

    • ZeRO2 with Adam Offload

    • Max Sequence Length: 2048

  • Performance throughput (samples/s in the default configuration; will be replaced by tokens/s later):

    • 7B llama2: 0.105 samples/gpu/sec, micro_batch_size = 16/8 (rollout/train), generation_length = 100~300

    • 13B llama2: 0.04 samples/gpu/sec, micro_batch_size = 8/4 (rollout/train), generation_length = 200~400

    • 34B codellama: 0.007 samples/gpu/sec, micro_batch_size = 2/1 (rollout/train), generation_length = 300~800
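As a sanity check on these figures, aggregate throughput follows from the per-GPU rate, assuming ideal linear scaling across GPUs (an idealization; function names are ours):

```python
def cluster_samples_per_sec(samples_per_gpu_sec, n_gpus):
    """Aggregate throughput implied by a per-GPU rate, assuming ideal
    linear scaling across GPUs."""
    return samples_per_gpu_sec * n_gpus

def hours_for_samples(n_samples, samples_per_sec):
    """Wall-clock hours to process n_samples at a fixed rate."""
    return n_samples / samples_per_sec / 3600.0

# Using the reported 7B figure on 8 GPUs:
rate = cluster_samples_per_sec(0.105, 8)   # 0.84 samples/sec
hours = hours_for_samples(50_000, rate)    # roughly 16-17 hours
```

Real runs deviate from this because generation length varies and rollout and training phases overlap differently, so treat it only as an order-of-magnitude check.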

Mainstream model performance data (due to limited manpower we have not had time to re-test everything; the figures reported here are from when each model was first supported, and the current version of PPO should be much faster. More models and updated performance data will be added in the official technical report):


How to use OpenRLHF

Official documentation:

All official documentation, including this article, will be maintained on GitHub. Improving documentation quality to improve ease of use is one of the key directions of our follow-up work (the documentation is currently rough due to limited manpower; contributions are welcome):

  • Project home page

  • Official documentation

OpenRLHF installation

We support nvidia-docker (recommended, to avoid potential environment issues) or installation in a conda environment (a pre-configured conda environment or image may be provided later):

First, clone the repository:

```shell
git clone
```


Then, install nv-docker or conda environment:

```shell
# install nv-docker
cd examples/scripts
# install nvidia-docker (Optional)
./
# launch nvidia container
./

# we need conda
conda create -n openrlhf python=3.10
# so, we need to install some packages manually: when installing torch, you may need to match the corresponding cuda version.
pip install packaging ninja
pip3 install torch
# check ninja
ninja --version
echo $?
# output: 0
# install flash-attn: may take some time.
# For network errors: you can download the specified version from pip
pip install flash-attn==2.4.2
./
# enjoy it!
conda activate openrlhf
```



OpenRLHF training

Training script:

After configuring the environment, enter the /openrlhf/examples/scripts directory, modify the training script according to your needs, and launch training with one click. Single-machine and multi-machine training are supported, as well as full-parameter, full-process training of 7B-70B+ models. Below are some important parameters that users can modify to support training their own models:

  • --pretrain: pretrained model path, Hugging Face format

  • --dataset: dataset path, Hugging Face format

  • --dataset_probs: sampling probabilities for mixing multiple datasets, e.g.: 0.5,0.4,0.1

  • --save_path: model save path, Hugging Face format

  • --max_epochs: number of training epochs

  • --micro_train_batch_size: batch size per GPU

  • --train_batch_size: global batch size

  • --learning_rate: learning rate
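To illustrate what dataset-probability mixing means, here is a minimal sketch. The sampler below is hypothetical (our own function, not OpenRLHF's actual data loader): each training example is drawn from dataset i with the given probability.

```python
import random

def sample_dataset_indices(probs, n, seed=0):
    """Hypothetical sketch of dataset-probability mixing: each of the
    n training examples is drawn from dataset i with probability
    probs[i]."""
    assert abs(sum(probs) - 1.0) < 1e-6, "probabilities must sum to 1"
    rng = random.Random(seed)
    return [rng.choices(range(len(probs)), weights=probs)[0]
            for _ in range(n)]

# With probabilities 0.5,0.4,0.1 the empirical mix over many draws
# approaches 50% / 40% / 10%.
picks = sample_dataset_indices([0.5, 0.4, 0.1], 10_000)
```

This kind of probabilistic mixing lets you over- or under-weight datasets without physically concatenating or duplicating them.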

Single machine training script:

```shell
cd examples/scripts
# install nvidia-docker (Optional)
./
# launch nvidia container
./
# cd in container
cd /openrlhf/examples/scripts
# build OpenRLHF (i.e., pip install)
./build_openrlhf.sh
# huggingface login
~/.local/bin/huggingface-cli login
# continue pretrain
./
# train SFT model
./
# train RM model
./
# train PPO model
./
# train DPO model
./
# train KTO model
./
# train Rejection Sampling model
./
# train Conditional SFT model
./
```


Multi-machine training script, full-parameter RLHF training of a 70B model on 16 A100 cards:

```shell
cd examples/scripts
# launch nvidia container
./
# cd in container
cd /openrlhf/examples/scripts
# build OpenRLHF (i.e., pip install)
./
# due to the compatibility of the NVIDIA PyTorch image
pip uninstall xgboost transformer_engine -y
# huggingface login
~/.local/bin/huggingface-cli login
# launch the master node of ray in container
ray start --head --node-ip-address --num-gpus 8
# if you want to launch ray on more nodes, use
ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8
# train ray PPO model, requires 8 gpus in default config
./
# for 70B models and vLLM-based RLHF (important!)
pip install vllm==0.3.2
# due to the compatibility of vLLM
pip uninstall flash_attn -y
./
```



For inference and evaluation, we recommend reusing open source tools or code from the community. You can refer to the following script:

Future work

OpenRLHF's future development will focus on ease of use and practicality (documentation, tutorials, practical experience, etc.), cutting-edge features (new algorithms, model pipelines, etc.), and stability. Specifically, there is the following potential work, and we hope everyone can participate:

  • Documentation: Chinese and English versions

    • Tutorials: provide good tutorials

    • Environment: provide configured images or conda environments;

  • Performance testing, benchmark;

    • Testing of basic functions

    • Comparison with other frameworks

    • Support model testing

    • Testing of Alignment Algorithms

  • Further performance optimization;

  • Improved stability: regular code reviews;

  • New functions, new algorithms;

  • New model support: Google’s new model, etc.;

  • The evaluation module provides more comprehensive evaluation capabilities;

Organization introduction

Organization profile

OpenLLMAI: Open AI for everyone.

We may be far away from OpenAI, but we are very close to Open. So we have only two requirements for members of the organization: we hope everyone is open enough and confident enough. Our attitude: "Give off whatever light the heat you have allows." We are willing to walk the path of AI together with everyone. A scholar must be broad-minded and resolute, for the burden is heavy and the road is long!

Everyone gathers out of shared passion, and there are two main things we want to do: 1. exchange LLM technology (technology sharing, knowledge dissemination); 2. develop LLM tools (training frameworks, models, data engineering, etc.). Interested students are welcome to join us! For a detailed introduction to the organization, see the older Zhihu article, OpenLLMAI Organization Introduction.


Along the way, the OpenRLHF project has attracted 20+ contributors, 130+ commits, and 800+ stars. Thanks to all contributors, especially hijkzzz, wuxibin, and Xianyu, who have made outstanding contributions to the project's development. hijkzzz and Xianyu are the project's initiators: hijkzzz, as a project administrator, submitted the first version of the code and has invested a great deal of energy in long-term maintenance, making an irreplaceable contribution; wuxibin, as a core developer, is mainly responsible for the Ray-based large-scale scaling of the framework and carries out daily maintenance over the long term; Xianyu, as a project administrator, is responsible for NLP-side development and some project planning. In addition, students such as pikaqqqqqqq, li-plus, wwxFromTju, jovany-wang, xffxff, dabney777, suc16, and Dylancer1998 have also made important contributions to the project's development (and thank you very much to everyone who offered opinions). We welcome more like-minded friends to join us, and we hope OpenLLMAI grows together with everyone!

Students interested in contributing can participate directly in development on Git, or contact the relevant person in charge or the official email.

  • RL:hijkzzz

  • Ray:wuxibin

  • NLP:Xianyu

  • Official email:

Sponsor us

At present, OpenLLMAI is a purely open source organization; whether it is OpenRLHF/OpenLLMWiki and the other projects, or OpenLLM Talk and the technical exchange groups, everything is completely open source and open. But in the long run, it is doomed to be unsustainable without financial support; running on passion alone is not easy these days. Thank you for your support along the way, and finally, yes, we are asking for sponsorship: if you have money, contribute money (compute power!!!), and if not, contribute effort (participate in development or make other contributions)! For sponsorship or cooperation, please contact







[OpenLLMAI] Believe in the power of open source: we have our own organization! There is a long way to go, but the journey is about to begin! – OpenLLMAI’s article – Zhihu

How to correctly reproduce Instruct GPT / RLHF? – Article about snail parkour in the garden – Zhihu

Start the training journey: Open source RLHF full training framework based on Ray and vLLM to build 70B+ models – article about snail parkour in the garden – Zhihu

[OpenLLM 006] LoRA: Low-rank adaptation of large models - What exactly is lora, which has been so popular recently? Why are stable diffusion and open source ChatGPT used for recurrence? – OpenLLMAI’s article – Zhihu


