We are evaluating DSPy, LMQL, and Jaclang's MTLLM feature on a problem set whose problems vary in difficulty and in the technique required to solve them. There are 3 difficulty levels of problems:

- Easy
  - Problems that can be solved using simple prompt-based techniques, e.g. Translation, Summarization
- Medium
  - Problems that require some level of understanding of the problem and the data, e.g. Text Classification, Named Entity Recognition, Question Answering
- Hard
  - Problems that require a deep understanding of the problem and the data, e.g. Agents, Tool Usage, ReACT
We are evaluating DSPy, LMQL, and Jaclang's MTLLM feature on the following metrics:
- Overall Accuracy - The overall accuracy of each approach on the problem set.
- Time Taken to Solve the Problem - The runtime for solving the problem.
- Readability of the Code - The readability of the code (human evaluation).
- Number of Lines - The number of lines of code written (after formatting).
- Token Usage - The number of tokens used by the LLMs.
- LLM Compatibility - How well each approach works with different LLMs:
  - OpenAI's GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o
  - Anthropic's Claude 3 (Sonnet, Opus)
  - Google's Gemini models
  - Open source: Llama 3 (8B, 70B), Mistral, Mixtral, Phi-3
Easy Problems:

| Problem Name | Methodology | Runtime (s) | Readability | Number of Lines |
|---|---|---|---|---|
| Odd Word Out | DSPy | 8.89 | | 49 |
| | LMQL | 1.86 | | 23 |
| | Jaclang | 1.25 | | 15 |
| Joke Generation | DSPy | 3.21 | | 37 |
| | LMQL | 2.84 | | 19 |
| | Jaclang | 1.62 | | 16 |
| Translation | DSPy | 3.39 | | 40 |
| | LMQL | 1.38 | | 18 |
| | Jaclang | 0.97 | | 16 |
| Expert Answer | DSPy | 4.15 | | 25 |
| | LMQL | 6.65 | | 20 |
| | Jaclang | 8.04 | | 15 |
| Essay Reviewer | DSPy | 17.80 | | 66 |
| | LMQL | | | |
| | Jaclang | 8.72 | | 36 |

(Readability scores come from human evaluation and are not filled in here.)
Medium Problems:

| Problem Name | Methodology | Runtime (s) | Readability | Number of Lines |
|---|---|---|---|---|
| MCQ Reasoning | DSPy | 4.55 | | 20 |
| | LMQL | 3.56 | | 21 |
| | Jaclang | 2.05 | | 20 |
| Text to Type | DSPy | 2.70 | | 29 |
| | LMQL | 3.01 | | 30 |
| | Jaclang | 1.73 | | 23 |
| Template | DSPy | 1.04 | | 26 |
| | LMQL | - | | - |
| | Jaclang | 1.36 | | 17 |
| Personality Finder | DSPy | 1.06 | | 31 |
| | LMQL | - | | - |
| | Jaclang | 1.77 | | 25 |
Hard Problems:

| Problem Name | Methodology | Runtime (s) | Readability | Number of Lines |
|---|---|---|---|---|
| Wikipedia (ReACT) | DSPy | 10.76 | | 31 |
| | LMQL | - | | - |
| | Jaclang | 8.73 | | 61 |
| RPG Level Generation | DSPy | - | | 120 |
| | LMQL | | | |
| | Jaclang | 31.99 | | 110 |
Issues encountered with DSPy:

- Handling examples in DSPy is very difficult because everything has to be created as a class, which makes even simple problems cumbersome, and the examples always have to be structured so that they fit into a class (see the sketch after this list).
- It is hard to provide additional context, since everything has to go into the docstring, and the docstring is not flexible enough to carry extra context.
- Typed outputs are handled as JSON strings, with the type information provided as a JSON schema; complex types are therefore very difficult to handle (e.g. level_generation).
- It does not support some LLMs, such as Anthropic's Claude 3 and Google's Gemini models.
- Chat models can be used, but prompt generation is not optimized for them; it works well with completion models, which are currently being deprecated.
- Even small problems require a lot of code, and a new class has to be created for every problem.
- Hard to debug, as the error messages are not very clear.
Issues encountered with LMQL:

- The LMQL code is very hard to write, as the syntax is difficult to understand.
- It feels more like prompting than writing code (see the sketch after this list).
- Complex types are difficult to handle, as there is no mechanism for them.
- It uses a logit-based approach for text generation, which is expensive and slow, and most models do not support it.
- Hard to debug, as the error messages are not very clear.
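For a sense of that prompt-like flavor, here is a rough sketch in the style of the LMQL documentation; the query, variable names, and constraints are illustrative assumptions and are not taken from the benchmark code.

```python
import lmql

@lmql.query
def odd_word_out(words):
    '''lmql
    # top-level strings are sent to the model as prompt text
    "Which of these words does not belong: {words}?\n"
    # [ANSWER] is a hole the model fills in, subject to the where-constraints
    "Answer: [ANSWER]" where STOPS_AT(ANSWER, "\n") and len(TOKENS(ANSWER)) < 20
    return ANSWER
    '''

# Decorated query functions are called like regular Python functions
# (assuming a default model is configured).
print(odd_word_out("apple, banana, carrot, bicycle"))
```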
To run the evaluation:

- Clone the repository.
- Set the following environment variables based on the LLMs you are using:

  ```bash
  export OPENAI_API_KEY=<API_KEY>
  ```

- Run the following command, choosing either profiler:

  ```bash
  python eval.py --profiler "cProfile"   # or --profiler "pyinstrument"
  ```

- Evaluation results will be saved in the `results` folder.
- To view the profiling results:
  - If you used pyinstrument, open the `profile.html` file in the `results` folder.
  - If you used cProfile, run one of the following and open the generated dot file in a dot viewer of your choice:

    ```bash
    snakeviz <path_to_cprofile_output(profile.prof)>
    gprof2dot -f pstats <path_to_cprofile_output(profile.prof)> -o <output_dot_file>
    ```

A sketch of how the `--profiler` option might be wired inside `eval.py` is shown below.
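The `--profiler` switch presumably selects between the two profilers roughly as in the following sketch; the function and output file names here are assumptions, not taken from the actual script.

```python
import argparse
import cProfile

from pyinstrument import Profiler


def run_benchmarks() -> None:
    """Placeholder for the actual evaluation loop (assumed name)."""


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--profiler", choices=["cProfile", "pyinstrument"], default=None)
    args = parser.parse_args()

    if args.profiler == "pyinstrument":
        profiler = Profiler()
        profiler.start()
        run_benchmarks()
        profiler.stop()
        # HTML report that can be opened directly in a browser.
        with open("results/profile.html", "w") as fh:
            fh.write(profiler.output_html())
    elif args.profiler == "cProfile":
        pr = cProfile.Profile()
        pr.enable()
        run_benchmarks()
        pr.disable()
        # Stats file that snakeviz / gprof2dot can read.
        pr.dump_stats("results/profile.prof")
    else:
        run_benchmarks()


if __name__ == "__main__":
    main()
```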
If you need to track token usage, you can make the following changes in the relevant files:
```python
# file: dsp/modules/gpt3.py
# Replace the line `response = self.request(prompt, **kwargs)` with the following:
print("Input Prompt:\n", prompt)
response = self.request(prompt, **kwargs)
print("Output:\n", response["choices"][0]["message"]["content"])
print(response.get("usage"))
```
```python
# file: jaclang/core/llms/openai.py
# Replace the line `out = self.client ....` with the following:
print("Input Prompt:\n", meaning_in)
output = self.client.chat.completions.create(
    model=kwargs.get("model_name", self.model_name),
    temperature=kwargs.get("temperature", self.temperature),
    max_tokens=kwargs.get("max_tokens", self.max_tokens),
    messages=messages,
)
print("Output:\n", output.choices[0].message.content)
print(output.usage)
```
Token usage will then be recorded in the relevant `results.txt` files.
To clear the cached LLM responses (the joblib cache) between runs:

```bash
rm -rf ~/cachedir_joblib
```