Instruction Following without Instruction Tuning

Abstract

Instruction tuning commonly means finetuning a language model on instruction-response pairs. We discover two forms of adaptation (tuning) that are deficient compared to instruction tuning, yet still yield instruction following; we call this implicit instruction tuning. We first find that instruction-response pairs are not necessary: training solely on responses, without any corresponding instructions, yields instruction following. This suggests that pretrained models have an instruction-response mapping which is revealed by teaching the model the desired distribution of responses. However, we then find that it is not necessary to teach the desired distribution of responses: instruction-response training on narrow-domain data like poetry still leads to broad instruction-following behavior like recipe generation. In particular, when instructions are very different from those in the narrow finetuning domain, models' responses do not adhere to the style of the finetuning domain. To begin to explain implicit instruction tuning, we hypothesize that very simple changes to a language model's distribution yield instruction following. We support this by hand-writing a rule-based language model which yields instruction following in a product-of-experts with a pretrained model. The rules are to slowly increase the probability of ending the sequence, penalize repetition, and uniformly change 15 words' probabilities. In summary, adaptations made without being designed to yield instruction following can do so implicitly.
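To make the product-of-experts idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the pretrained model supplies next-token log-probabilities as a plain list, and that the three rules can be written as additive log-scores that are summed with those log-probabilities and renormalized. The function names (`rule_scores`, `product_of_experts`), the constants (`eos_slope`, `repeat_penalty`), and the reading of "uniformly change" as a fixed per-token bias applied at every position are all illustrative assumptions; the abstract does not specify them.

```python
import math


def log_softmax(scores):
    """Normalize unnormalized log-scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def rule_scores(vocab_size, generated, eos_id, word_bias,
                eos_slope=0.1, repeat_penalty=1.5):
    """Hand-written 'expert': unnormalized log-scores over the vocabulary.

    All constants are hypothetical, chosen only for illustration.
    """
    scores = [0.0] * vocab_size
    # Rule 1: the end-of-sequence score grows slowly with response length.
    scores[eos_id] += eos_slope * len(generated)
    # Rule 2: penalize each token that already appears in the response.
    for tok in set(generated):
        scores[tok] -= repeat_penalty
    # Rule 3: apply a fixed bias to a small, fixed set of tokens
    # (the abstract mentions 15 words; here any dict of token id -> bias).
    for tok, bias in word_bias.items():
        scores[tok] += bias
    return scores


def product_of_experts(base_logprobs, generated, eos_id, word_bias):
    """Multiply the two distributions (add in log space) and renormalize."""
    rules = rule_scores(len(base_logprobs), generated, eos_id, word_bias)
    return log_softmax([b + r for b, r in zip(base_logprobs, rules)])


# Toy usage: a 4-token vocabulary where the base model is uniform.
base = [math.log(0.25)] * 4
combined = product_of_experts(base, generated=[1, 1, 2], eos_id=0,
                              word_bias={3: -1.0})
probs = [math.exp(lp) for lp in combined]
```

In this toy setting the base model is uniform, so any preference in `probs` comes entirely from the rules: the end-of-sequence token gains mass as the response grows, and tokens already generated lose mass.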
