
Best Practices for Optimizing LLMs (Prompt Engineering, RAG and Fine-tuning)




1. The Challenges of Optimizing LLMs

  • Extracting signal (meaningful information) from the noise (irrelevant information) is not easy
  • Performance can be abstract and difficult to measure
  • It is not always obvious which optimization method to use, and when.

2. The Optimization Strategies

Optimizing LLMs can be thought of as a two-axis problem:

[Figure: The Optimization Strategies]
  • Start by optimizing the prompt
  • Context optimization (what does the model need to know?) => RAG
  • LLM optimization (how does the model need to act?) => Fine-tuning

3. Typical Optimization Pipeline

[Figure: Typical Optimization Pipeline]
  • 1. Start with a prompt
  • 2. Set up proper evaluation metrics
  • 3. Once evaluation metrics are set, establish a baseline
  • 4. Once the baseline is known, add few-shot examples
    • These guide the model on how the customer wants it to behave
  • 5. If adding few-shot examples improves performance, the model benefits from more context, so follow the RAG process
  • 6. If, after RAG, the model is getting the context right but is not producing output in the required format, take the fine-tuning approach
  • 7. It is possible that retrieval is still not as good as desired; in that case, go back and optimize the RAG step again
    • For example, add Hypothetical Document Embeddings (HyDE) retrieval plus a fact-checking step (HyDE is sketched after this list)
    • Then fine-tune the model again, with these new examples added as context via RAG
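
To make the HyDE idea in step 7 concrete, here is a minimal sketch assuming the OpenAI Python SDK (v1) and a small in-memory corpus. The model names, the `embed` helper, and the `docs`/`doc_embs` arguments are illustrative placeholders, not the exact method from the talk.

```python
# Sketch: HyDE retrieval — embed a *hypothetical* answer instead of the raw
# question, then search the real corpus with it.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def hyde_retrieve(question: str, docs: list[str],
                  doc_embs: np.ndarray, k: int = 3) -> list[str]:
    # 1. Ask the model to write a plausible (possibly wrong) answer passage.
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage answering: {question}"}],
    ).choices[0].message.content
    # 2. Embed the hypothetical passage rather than the question itself.
    q = embed(hypo)
    # 3. Cosine similarity against the real document embeddings.
    sims = doc_embs @ q / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q))
    # A fact-checking step would verify the final answer against the top docs.
    return [docs[i] for i in np.argsort(-sims)[:k]]
```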

4. Comparison of Optimization Approaches

Each approach is suited to different goals (✓ = well suited, – = not the primary tool):

| | Prompt Engineering | RAG | Fine-tuning |
| --- | --- | --- | --- |
| Reducing token usage | – | – | ✓ |
| Introducing new information | – | ✓ | – |
| Testing and learning early | ✓ | – | – |
| Reducing hallucinations | – | ✓ | – |
| Improving efficiency | – | – | ✓ |

5. Optimization via Prompt Engineering

  • Write clear instructions
  • Split complex tasks into simpler subtasks
  • Give the LLM time to “think”
  • Test changes systematically
  • Provide examples
  • Use external tools
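
The sketch below combines several of these practices in one chat request: clear instructions, explicit subtasks, a fixed output format, and a few-shot example. The triage task, labels, and model name are invented for illustration.

```python
# Sketch: several prompt-engineering practices in a single request.
from openai import OpenAI

client = OpenAI()

messages = [
    # Clear instructions, split into explicit subtasks, with a fixed format.
    {"role": "system", "content": (
        "You are a support-ticket triager.\n"
        "Step 1: summarize the ticket in one sentence.\n"
        "Step 2: assign exactly one label from {billing, bug, feature}.\n"
        'Reply as JSON: {"summary": "...", "label": "..."}'
    )},
    # One few-shot example guiding the expected behavior.
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant",
     "content": '{"summary": "Customer reports a duplicate charge.", "label": "billing"}'},
    # The actual input.
    {"role": "user", "content": "The export button crashes the app."},
]

reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(reply.choices[0].message.content)
```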

6. Benefits of Using RAG

  • Can reduce hallucinations
  • No need to retrain or fine-tune LLM
  • Solves knowledge-intensive tasks by referencing existing resources
  • Data security: knowledge lives in an external knowledge base rather than inside the model, so no private data needs to be baked into the model's weights, and access rights can be set on the knowledge base
  • Updating knowledge: answers are not limited to the model's training-data cutoff
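
A minimal sketch of the grounding step that delivers these benefits, assuming documents have already been retrieved from the external knowledge base (the retriever itself is out of scope here); the model name is a placeholder. Instructing the model to answer only from the supplied context is what reduces hallucinations.

```python
# Sketch: grounding an answer in retrieved context.
from openai import OpenAI

client = OpenAI()

def answer_with_context(question: str, retrieved: list[str]) -> str:
    context = "\n\n".join(retrieved)
    prompt = (
        "Answer ONLY from the context below. If the answer is not in the "
        "context, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```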

7. Optimization via Fine-Tuning

Fine-tuning is the process of continuing a model's training on a smaller, domain-specific dataset to optimize it for a specific task.

Why Fine-Tune?

  • Reducing token usage
    • The model internalizes far more data than could ever fit in a context window, relaxing context-window limitations
  • Improving model efficiency
    • Once a model is fine-tuned, complex prompting techniques are typically no longer required to reach the desired level of performance
    • No need to provide a complex set of instructions, explicit schemas, or in-context examples
    • Fewer prompt tokens are needed per request, making each interaction cheaper and the response quicker
    • Knowledge distillation: fine-tune a smaller model such as GPT-3.5 on the outputs of a very large model such as GPT-4 Turbo, which is much cheaper in both cost and latency (a data-prep sketch follows this list)
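
As a rough illustration of the distillation point, the sketch below captures teacher outputs and writes them in OpenAI's chat fine-tuning JSONL format. The prompt list and model names are placeholders.

```python
# Sketch: distillation data prep — one training example per JSONL line.
import json
from openai import OpenAI

client = OpenAI()
prompts = ["domain question 1", "domain question 2"]  # placeholder prompts

with open("distill.jsonl", "w") as f:
    for p in prompts:
        teacher_answer = client.chat.completions.create(
            model="gpt-4-turbo",  # large, expensive teacher
            messages=[{"role": "user", "content": p}],
        ).choices[0].message.content
        f.write(json.dumps({"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": teacher_answer},
        ]}) + "\n")
# distill.jsonl can then be uploaded to fine-tune a cheaper student model.
```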

8. Best Practices in Fine-Tuning

  • 1. Start with prompt engineering and few-shot learning (FSL)
    • These are low-investment techniques for quickly iterating and validating a use case
    • They build intuition about how LLMs operate and how they behave on a specific problem
  • 2. Establish a baseline
    • Ensure there is a performance baseline to compare the fine-tuned model against
    • Understand the failure cases of the baseline model
    • Understand the exact target to be achieved via fine-tuning
  • 3. Start small, focus on quality
    • Datasets are difficult to build, so start small and invest intentionally
    • Fine-tune a model on a smaller dataset, look at its output, see where it struggles, and then target those areas with new data
    • Optimize for fewer, high-quality training examples
  • 4. Devise proper evaluation strategies
    • Have expert humans look at the outputs and rank them on a scale
    • Generate outputs from different models and have another model rank them for you; for example, use GPT-4 to rank the output of an open-source model (see the sketch after this list)
    • Train the model, evaluate it, and deploy it to production; then collect production samples, build a new dataset from them, downsample and curate it, and fine-tune again on that dataset
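
A hedged LLM-as-judge sketch for the ranking strategy above: a stronger model picks the better of two candidate answers. The prompting and parsing are deliberately simplified.

```python
# Sketch: LLM-as-judge pairwise comparison of two model outputs.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are grading two answers to the same question.\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with exactly one of: A, B, tie."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```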

9. Best Practices in Fine-Tuning + RAG

  • 1. Fine-tune the model to understand complex instructions
    • This eliminates the need to provide complex few-shot examples at sampling time
    • It also minimizes prompt-engineering tokens, leaving more room for retrieved context
  • 2. Then use RAG to inject the relevant knowledge into the context; maximize the signal in the retrieved context, but do not over-saturate the context window (see the packing sketch below)
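
One way to read "do not over-saturate the context" is as a token budget on retrieved chunks. A minimal sketch, assuming tiktoken's cl100k_base encoding and an illustrative budget value:

```python
# Sketch: pack ranked chunks into the prompt until a token budget is hit,
# so the context window is used but not saturated.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pack_context(chunks_ranked: list[str], budget: int = 3000) -> str:
    picked, used = [], 0
    for chunk in chunks_ranked:          # assumed sorted best-first
        n = len(enc.encode(chunk))
        if used + n > budget:
            break                        # stop before over-saturating
        picked.append(chunk)
        used += n
    return "\n\n".join(picked)
```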

10. OpenAI RAG Use Case

One OpenAI customer had a pipeline with two knowledge bases and an LLM. The job was to take a user question, decide which knowledge base to use, fire a query against it, and answer the question from the retrieved content.

[Figure: OpenAI RAG Use Case]

10.1. Experiments that didn’t work

  • Retrieval with cosine similarity alone gave 45% accuracy
  • HyDE retrieval was tried but did not make it to production
  • Fine-tuning the embeddings worked well from an accuracy perspective, but it was slow and expensive, so it was discarded for non-functional reasons

10.2. Experiments which worked

  • Chunk / embedding experiments
    • Trying different chunk sizes and embedding different bits of content gave a 20-point accuracy bump, to 65%
  • Reranking
    • Applied a cross-encoder or rules-based ranking to rerank results (a cross-encoder reranking sketch follows this list)
    • Reranking tools:
      • Cohere Rerank (good): https://docs.cohere.com/reference/rerank
      • BGE reranker (open source): https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d
      • Jina reranker
    • Cross-encoder background: https://www.cnblogs.com/huggingface/p/18010292
  • Classification step
    • Classify which of the two domains (knowledge bases) a question belongs to, and add extra metadata to the prompt to help the model decide further
  • To reach accuracy levels of 98%, the following trials were successful:
    • Further prompt engineering to sharpen the prompt
    • Looking at the categories of questions that gave wrong answers, and introducing tools, such as access to SQL databases, to pull answers from
    • Query expansion: when someone asks three questions in one prompt, parse them into a list of queries, execute them in parallel, bring back the results, and synthesize them into one answer (sketched after this list)
    • Fine-tuning was not used here, which showed the problem was context-related
    • Fine-tuning would likely have wasted computational resources in a context-related setting
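
A minimal cross-encoder reranking sketch for the reranking bullet above, using a public MS MARCO cross-encoder from the sentence-transformers library; the candidate list would come from a first-stage retriever.

```python
# Sketch: cross-encoder reranking of first-stage retrieval results.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    # Score every (query, document) pair jointly — slower than a bi-encoder,
    # but more precise, which is why it runs on a short candidate list.
    scores = reranker.predict([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order[:k]]
```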
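
And a hedged query-expansion sketch: split a multi-question prompt into separate queries, run them in parallel, and synthesize one answer. Here `search` is a placeholder for an async retriever, and the model names are illustrative.

```python
# Sketch: query expansion with parallel retrieval and a synthesis step.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def expand(prompt: str) -> list[str]:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"List each distinct question in this prompt, one per line:\n{prompt}"}],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

async def answer(prompt: str, search) -> str:
    queries = await expand(prompt)
    results = await asyncio.gather(*(search(q) for q in queries))  # in parallel
    merged = "\n".join(r for group in results for r in group)
    final = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Using these results:\n{merged}\n\nAnswer the original prompt: {prompt}"}],
    )
    return final.choices[0].message.content
```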

11. OpenAI Fine-Tuning + RAG Use Case

Use case description: Given a natural language question and a database schema, can the model produce a semantically correct SQL query?

[Figure: OpenAI Fine-Tuning + RAG Use Case]

The approach was first tried with the GPT-3.5 Turbo model.

[Figures: Overall Score and Accuracy Score]
  • 1. First, performance was squeezed from the baseline model via prompt engineering (69%). Few-shot examples were also added, which led to some improvement and motivated the use of RAG
  • 2. With RAG retrieval on the question only, there was a 3% performance bump
  • 3. Then, using hypothetical document embeddings built from the answers, there was a further 5% improvement
    • Here, searching with these hypothetical embeddings rather than the actual input questions is what led to the improvement
  • 4. Increasing the number of retrieved examples from n=3 to n=5 led to a further improvement, to 80%
  • 5. However, the target was 84%
  • 6. Fine-tuning was employed (done at Scale AI)
    • Scale AI fine-tuned GPT-4, and a simple prompt-engineering technique of reducing the schema gave almost 82%
    • They used RAG along with fine-tuning to dynamically inject a few examples into the context window, which led to an 83.5% accuracy score (sketched below)
  • 7. Thus, simple fine-tuning plus RAG, combined with simple prompt engineering, brought the model accuracy to 83.5%
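
To illustrate step 6's dynamic few-shot injection, here is a sketch for the text-to-SQL task. `retrieve_similar` is a placeholder returning (question, SQL) pairs from an example store searchable by similarity, and the model name stands in for the actual fine-tuned model.

```python
# Sketch: dynamically injecting retrieved few-shot examples for text-to-SQL.
from openai import OpenAI

client = OpenAI()

def nl_to_sql(question: str, schema: str, retrieve_similar) -> str:
    # Pull the most similar solved examples into the context window.
    examples = retrieve_similar(question, n=3)  # [(question, sql), ...]
    shots = "\n\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in examples)
    prompt = (f"Schema:\n{schema}\n\n"
              f"Examples:\n{shots}\n\n"
              f"Q: {question}\nSQL:")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder for the fine-tuned model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```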

12. Reference