Changes in short:
- Context doubled (x8)
- Keeping training secret
“Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report
contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”
3. Performance improvement (better marks in well-known exams, better in languages)
4. Interesting capabilities: ”Certain capabilities remain hard to predict. For example, the Inverse Scaling Prize  proposed several tasks for which model performance decreases as a function of scale. Similarly to a recent result by Wei et al. , we find that GPT-4 reverses this trend, as shown on one of the tasks called Hindsight Neglect  in Figure We believe that accurately predicting future capabilities is important for safety. Going forward we plan to refine these methods and register performance predictions across various capabilities before large model training begins, and we hope this becomes a common goal in the field.”
5. 70% prefer GPT-4 output
”Additionally, GPT-4-launch substantially improves over previous models in the ability to follow user intent . On a dataset of prompts submitted to ChatGPT  and the OpenAI API , the responses generated by GPT-4-launch were preferred over the responses generated by GPT-3.5 RLHF on 70.2% of prompts and GPT-3.5 Turbo RLHF on 61.1% of prompts.1129"
6. Image inputs are still in the research preview
7. Data cuts off in September 2021
"GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its pre-training data cuts off in September 2021, and does not learn from its experience. It can sometimes make simple reasoning errors which do not seem to comport with competence across so many domains, or be overly gullible in accepting obviously false statements from a user. It can fail at hard problems the same way humans do, such as introducing security vulnerabilities into code it produces.”
8. Model-Assisted Safety Pipeline — our models can still be brittle on unsafe inputs as well as sometimes exhibit undesired behaviours on both safe and unsafe inputs.
”Model-Assisted Safety Pipeline: As with prior GPT models, we fine-tune the model’s behavior using reinforcement learning with human feedback (RLHF) [34, 57] to produce responses better aligned with the user’s intent. However, after RLHF, our models can still be brittle on unsafe inputs as well as sometimes exhibit undesired behaviours on both safe and unsafe inputs. These undesired behaviours can arise when instructions to labelers were underspecified during reward model data collection portion of the RLHF pipeline. When given unsafe inputs, the model may generate undesirable content, such as giving advice on committing crimes. Furthermore, the model may also become overly cautious on safe inputs, refusing innocuous requests or excessively hedging. To steer our models towards appropriate behaviour at a more fine-grained level, we rely heavily on our models themselves as tools. Our approach to safety consists of two main components, an additional set of safety-relevant RLHF training prompts, and rule-based reward models (RBRMs). Our rule-based reward models (RBRMs) are a set of zero-shot GPT-4 classifiers. These classifiers provide an additional reward signal to the GPT-4 policy model during RLHF fine-tuning that targets correct behavior, such as refusing to generate harmful content or not refusing innocuous requests. The RBRM takes three inputs: the prompt (optional), the output from the policy model, and a human-written rubric (e.g., a set of rules in multiple-choice style) for how this output should be evaluated. Then, the RBRM classifies the output based on the rubric. For example, we can provide a rubric that instructs the model to classify a response as one of: (a) a refusal in the desired style, (b) a refusal in the undesired style (e.g., evasive or rambling), © containing disallowed content, or (d) …”
9. Disinformation and Influence Operations
”GPT-4 can generate plausibly realistic and targeted content, including news articles, tweets, dialogue, and emails. In Harmful Content, we discussed how similar capabilities could be misused to exploit individuals. Here, we discuss the general concern around disinformation and influence operations.Based on our general capability evaluations, we expect GPT-4 to be better than GPT-3 at producing realistic, targeted content. As such, there is risk of GPT-4 being used for generating content that is intended to mislead.”
10. Red teaming, but nothing concrete
”Participation in this red teaming process is not an endorsement of the deployment plans””We refer to these adversarial testing processes informally as “red teaming” in line with the definition given in , namely“a structured effort to find flaws and vulnerabilities in a plan, organization, or technical system, often performed by dedicated ’red teams’ that seek to adopt an attacker’s mindset and methods.” Red teaming has been applied to language models in various ways: to reduce harmful outputs; and to leverage external expertise for domain-specific adversarial testing. Some have explored red teamed language models using language models. Red teaming in general, and the type of red teaming we call ’expert red teaming,’is just one of the mechanisms we use to inform our work identifying, measuring, and testing AI systems. Our approach is to red team iteratively, starting with an initial hypothesis of which areas may be the highest risk, testing these areas, and adjusting as we go. It is also iterative in the sense that we use multiple rounds of red teaming as we incorporate new layers of mitigation and control, conduct testing and refining, and repeat this process. We reached out to researchers and industry professionals — primarily with expertise in bias and fairness, alignment research, industry trust and safety, dis/misinformation, chemistry, biorisk, cybersecurity, nuclear risks, economics, human-computer interaction, law, education, and healthcare — to help us gain a more robust understanding of the GPT-4 model and potential deployment risks. We selected these areas based on a number of factors including but not limited to: prior observed risks in language models and AI systems;[6, 30] and domains where we have observed increased user interest in the application of language models. Participants in this red team process were chosen based on prior research or experience in these risk areas, and therefore reflect a bias towards groups with specific educational and professional backgrounds (e.g., people with significant higher education or industry experience). Participants also typically have ties to English-speaking, Western countries (such as the US, Canada, and the UK). Our selection of red teamers introduces some biases, and likely influenced both how red teamers interpreted particular risks as well as how they probed politics, values, and the default behavior of the model. It is also likely that our approach to sourcing researchers privileges the kinds of risks that are top of mind in academic communities and at AI firms. These experts had access to early versions of GPT-4 (including GPT-4-early) and to the model with in-development mitigations (precursors to GPT-4-launch). They identified initial risks that motivated safety research and further iterative testing in key areas. We reduced risk in many of the identified areas with a combination of technical mitigations, and policy and enforcement levers; however, many risks still remain. We expect to continue to learn more about these and other categories of risk over time. While this early qualitative red teaming exercise is very useful for gaining insights into complex, novel models like GPT-4, it is not a comprehensive evaluation of all possible risks. We note further context, examples, and findings for some of the domains evaluated in the remainder in the subcategories listed in this section.”
11. Potential for Risky Emergent Behaviours — Some evidence already exists of such emergent behavior in models.
”We granted the Alignment Research Center (ARC) early access to the model. We provided them with early access to multiple versions of the GPT-4 model, but they did not have the ability to fine-tune it. They also did not have access to the final version of the model that we deployed.””To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself. ARC then investigated whether a version of this program running on a cloud computing service, with a small amount of money and an account with a language model API, would be able to make more money, set up copies of itself, and increase its own robustness.”