This repo contains a list of sources, web links, blogs, and YouTube channels for learning about LLMs.
- History of NLP
https://arxiv.org/pdf/2306.08302.pdf
It has been quite a journey to arrive at a ChatGPT model! It took some time before we thought about modeling language as a probabilistic generative process. NLP studies the interactions between computers and human language, and it is as old as computers themselves.
Warren Weaver was the first to suggest an algorithmic approach to machine translation (MT) in 1949, and this led to the Georgetown experiment, the first computer application to MT, in 1955. In 1957, Chomsky established the first grammar theory. ELIZA (1964) and SHRDLU (1968) can be considered to be the first natural-language understanding computer programs.
The 60s and early 70s marked the era of grammar theories. During the 70s, the concept of conceptual ontologies became quite fashionable. Conceptual ontologies are similar to knowledge graphs, where concepts are linked to each other by how they are associated. The famous ones are MARGIE (1975), TaleSpin (1976), QUALM (1977), SAM (1978), PAM (1978), Politics (1979) and Plot Units (1981).
The 80s were a period of great success for symbolic methods. In 1983, Charniak proposed Passing Markers, a mechanism for resolving ambiguities in language comprehension by indicating the relationship between adjacent words. In 1986, Riesbeck and Martin proposed Uniform Parsing, a new approach to natural language processing that combines parsing and inferencing in a uniform framework for language learning. In 1987, Hirst proposed a new approach to resolving ambiguity: Semantic Interpretation.
The 90s saw the advent of statistical models. It was the beginning of thinking about language as a probabilistic process. In 1989, Bahl proposed a tree-based method to predict the next word in a sentence. IBM presented a series of models for statistical machine translation. In 1990, Chitrao and Grishman demonstrated the potential of statistical parsing techniques for processing messages, and Brill et al. introduced a method for automatically inducing a part-of-speech tagger by training on a large corpus of text. In 1991, Brown proposed a method for aligning sentences in parallel corpora for machine translation applications.
In 2003, Bengio proposed the first neural language model, a simple feed-forward model. In 2008, Collobert and Weston applied multi-task learning with ConvNet. In 2011, Hinton built a generative text model with Recurrent Neural Networks. In 2013, Mikolov introduced Word2Vec. In 2014, Sutskever suggested a model for sequence-to-sequence learning. In 2017, Vaswani gave us the Transformer architecture that led to a revolution in model performance. In 2018, Devlin presented BERT, which popularized Transformers. And in 2022, we finally got to experience ChatGPT, which completely changed the way the public perceived AI!
-
How Large Language Models Work, https://www.youtube.com/watch?v=5sLYAQS9sWQ&ab_channel=IBMTechnology
-
Andrej Karpathy
-
1hr Talk Intro to Large Language Models Lecture by Andrej Karpathy, https://www.youtube.com/watch?v=zjkBMFhNj_g&ab_channel=AndrejKarpathy
Slide PDF: https://drive.google.com/file/d/1pxx_ZI7O-Nwl7ZLNk5hI3WzAsTLwvNU7/view
Slide PPT Keynote: https://drive.google.com/file/d/1FPUpFMiCkMRKPFjhi9MAhby68MHVqe8u/view
Makemore implementation from Andrej Karpathy https://github.com/karpathy/makemore
-
Neural Networks: Zero to Hero Lecture by Andrej Karpathy
A course on neural networks that starts all the way at the basics. The course is a series of YouTube videos where we code and train neural networks together. The Jupyter notebooks we build in the videos are then captured here inside the lectures directory [https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures]. Every lecture also has a set of exercises included in the video description. (This may grow into something more respectable).
-
Let's build GPT: from scratch, in code, spelled out.
https://www.youtube.com/watch?v=kCc8FmEb1nY&ab_channel=AndrejKarpathy
https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing
-
Let's build the GPT Tokenizer, https://www.youtube.com/watch?v=zduSFxRajkE&ab_channel=AndrejKarpathy
https://colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing
-
Let's reproduce GPT-2 (124M)
https://www.youtube.com/watch?v=l8pRSuU81PU&ab_channel=AndrejKarpathy
https://github.com/karpathy/build-nanogpt
-
Storyteller AI Large Language Model (LLM)
LLM101n: a course on building an LLM from scratch, by hand: https://github.com/karpathy/LLM101n
-
-
A Survey of Techniques for Maximizing LLM Performance
-
Create a Large Language Model from Scratch with Python – Tutorial, https://www.youtube.com/watch?v=UU1WVnMk4E8&t=24s&ab_channel=freeCodeCamp.org
-
Train a Llama model from scratch, https://huggingface.co/blog/nroggendorff/train-with-llama-architecture
-
[Build a Large Language Model (From Scratch)](https://www.manning.com/books/build-a-large-language-model-from-scratch) by Sebastian Raschka, Manning Publications (MEAP), 2025.
-
How to Build an LLM from Scratch | An Overview, https://www.youtube.com/watch?v=ZLbVdvOoTKM&pp=ygUdQ3JlYXRlIGEgTGFyZ2UgTGFuZ3VhZ2UgTW9kZWw%3D
-
Train your own language model with nanoGPT | Let’s build a songwriter, https://www.youtube.com/watch?v=XS8eRtlcCGU&ab_channel=SophiaYang
-
A Hackers' Guide to Language Models, https://www.youtube.com/watch?v=jkrNMKz9pWU&ab_channel=JeremyHoward
-
Create your own Local Chatgpt for FREE, Full Guide: PDF, Image, & Audiochat (Langchain, Streamlit), https://www.youtube.com/watch?v=CUjO8b6_ZuM&t=452s&ab_channel=LeonExplainsAI
-
Fine Tuning and Evaluating LLMs with Anyscale and Arize, https://www.youtube.com/watch?v=b-MfkFz-A2E&ab_channel=ArizeAI
-
Building And Troubleshooting An Advanced LLM Query Engine, https://www.youtube.com/watch?v=_zDDErOaUqc&ab_channel=ArizeAI
-
Model Monitoring for LLMs, https://www.youtube.com/watch?v=zR1X5R_1TUw&ab_channel=SethJuarez
-
Let's pretrain a 3B LLM from scratch: on 16+ H100 GPUs, no detail skipped. https://youtu.be/aPzbR1s1O_8?si=2VEoUt9FFRUftctv
-
A simple generative ML model with just KNN, https://www.youtube.com/watch?v=aFuHPiJu0QA
-
The N Implementation Details of RLHF with PPO, https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo
-
Optimizing your LLM in production
-
LLM Tutorial, https://www.youtube.com/watch?v=JvLiEdTKKgk&list=PLpqh-PUKX-i4TT-vZXhFwI8Jdqr7J742n&pp=iAQB
-
Serve a custom LLM for over 100 customers
-
State of GPT | BRK216HFS, https://www.youtube.com/watch?v=bZQun8Y4L2A&ab_channel=MicrosoftDeveloper
-
Building Systems with the ChatGPT API, https://www.deeplearning.ai/short-courses/building-systems-with-chatgpt/
-
Flash Attention 2.0 with Tri Dao (author)! | Discord server talks, https://www.youtube.com/watch?v=IoMSGuiwV3g&ab_channel=AleksaGordi%C4%87-TheAIEpiphany
-
Train a language model from scratch
-
Outcome-based reward model (ORM)
Meet Stepwise ORMs (SORMs)
https://arxiv.org/abs/2402.10963
-
Datasets for Large Language Models: A Comprehensive Survey
https://arxiv.org/abs/2402.18041v1
LLM datasets from five perspectives:
- (1) Pre-training Corpora;
- (2) Instruction Fine-tuning Datasets;
- (3) Preference Datasets;
- (4) Evaluation Datasets;
- (5) Traditional Natural Language Processing (NLP) Datasets.
A timeline of some representative LLM datasets. Orange represents pre-training corpora, yellow represents instruction fine-tuning datasets, green represents preference datasets, and pink represents evaluation datasets
-
GPT-Fast - blazingly fast inference with PyTorch (w/ Horace He)
https://www.youtube.com/watch?v=18YupYsH5vY&ab_channel=AleksaGordi%C4%87-TheAIEpiphany
-
Genie: Generative Interactive Environments
A whole new world: Genie is capable of converting a variety of different prompts into interactive, playable environments that can be easily created, stepped into, and explored. This is made possible via a latent action interface, learned fully unsupervised from Internet videos. On the right we see a few generated steps for taking two latent actions.
https://sites.google.com/view/genie-2024/
https://arxiv.org/abs/2402.15391
https://www.youtube.com/watch?v=lhg7DOCGqtU&ab_channel=code_your_own_AI
-
3 ways to train LLMs
Transformers can be used for many learning tasks, and the only difference comes from the way we prepare the data, the modeling head we choose, and the loss function we use to optimize the model.
With Causal Language Modeling, the model learns the language statistics by focusing on predicting the next word in a sequence. This is the most common way to perform language modeling nowadays, and it was the approach taken in GPT-1, GPT-2, and GPT-3. Causality is ensured by applying a mask to the attention matrices computed within the attention layers: to avoid paying attention to words later in the sequence, we simply set their attention weights to 0. To train this model, we just need to shift the inputs by removing the first word to create the labels.
For text classification, we want to associate the input text data with some category. For example, in the context of sentiment analysis, we may want to categorize the input sentence into the following three categories: [POSITIVE], [NEGATIVE] and [NEUTRAL]. In the context of text classification, we only need one prediction vector, and the typical strategy is usually to choose one of the hidden states and project it into the prediction space. This works because, although there are as many hidden states as there are input tokens, after passing through multiple transformer blocks, they all represent an entangled representation of the entire sentence. To train that model, we only need to compare the prediction vectors to the categorical labels by using a loss function such as cross-entropy.
The token classification learning task is often used for applications such as Named Entity Recognition (NER). We want to categorize each of the tokens in the input sentence. For example, we may want to associate each of the words with their grammatical categories: [NOUN], [VERB], and [ADJECTIVE]. For each of the inputs in the sequence, we need a prediction vector of the size of the number of categories we want to predict. At training time, we compare that prediction matrix for all the tokens to their categories in the labels with a cross-entropy loss function and update the model weights.
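As a rough illustration of how the same backbone can be paired with those three heads, here is a minimal sketch assuming the Hugging Face transformers library (the model name and label counts are only illustrative):
```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,                 # next-token prediction head
    AutoModelForSequenceClassification,   # one prediction vector per sequence
    AutoModelForTokenClassification,      # one prediction vector per token
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Machine learning is fun", return_tensors="pt")

# Causal language modeling: the labels are just the inputs (the library shifts them internally).
causal = AutoModelForCausalLM.from_pretrained("gpt2")
print(causal(**inputs, labels=inputs["input_ids"]).loss)   # cross-entropy on next-token prediction

# Text classification: a single prediction vector per sentence (e.g. 3 sentiment classes).
clf = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=3)
clf.config.pad_token_id = tokenizer.eos_token_id            # gpt2 has no [PAD] token by default
print(clf(**inputs).logits.shape)                           # (batch, 3)

# Token classification (e.g. NER): one prediction vector per input token (e.g. 5 entity tags).
ner = AutoModelForTokenClassification.from_pretrained("gpt2", num_labels=5)
print(ner(**inputs).logits.shape)                           # (batch, sequence length, 5)
```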
-
Transformers
https://huggingface.co/blog/Esmail-AGumaan/attention-is-all-you-need
-
How do LLMs generate text?
Generating text is by no means a trivial task! LLMs are optimized to predict the probability of the next token, but how do we generate text with that?
The naive approach is to use the probability vector generated by the model, choose the word with the highest probability, and autoregress. This is the greedy approach, but it tends to generate repetitive sentences that degenerate when they are too long. Another approach is to use the probabilities generated by the model and sample the next word from that distribution. Typically, we use a temperature parameter to adjust the level of randomness of this process. This allows the model to generate less repetitive and more creative sentences.
But those 2 techniques have a problem. When we generate a sentence, we want to maximize the probability of the whole output sequence and not just the next token:
P(Output sequence | Prompt)
Fortunately, we can express this probability as a product of the probabilities to predict the next token:
P(token 1, ..., token N | Prompt) = P(token 1 | Prompt) x ... x P(token N | Prompt, token 1, ..., token N - 1)
But solving this problem exactly is NP-hard. So, instead, we approximate it by choosing k candidate tokens at each iteration, testing them, and keeping the k sequences that maximize the probability of the whole sequence. In the end, we just choose the sequence with the highest probability. This is called beam search, and it can be combined with the greedy and multinomial sampling approaches.
Another approach is the contrastive search, where we take into account additional metrics like fluency or diversity. At each iteration, we choose candidate tokens, penalize the probabilities with a similarity metric of the tokens that were previously generated, and choose the tokens that maximize the new score.
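The generation strategies above map directly onto decoding options in common libraries. A hedged sketch, assuming the Hugging Face transformers generate API (model name illustrative):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = tokenizer("The meaning of life is", return_tensors="pt")

greedy   = model.generate(**prompt, max_new_tokens=30)                                   # most probable token each step
sampled  = model.generate(**prompt, max_new_tokens=30, do_sample=True, temperature=0.8)  # multinomial sampling
beam     = model.generate(**prompt, max_new_tokens=30, num_beams=5)                      # beam search over 5 candidate sequences
contrast = model.generate(**prompt, max_new_tokens=30, penalty_alpha=0.6, top_k=4)       # contrastive search

for out in (greedy, sampled, beam, contrast):
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```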
-
Self-attention vs cross-attention
What is the difference between Self-Attention and Cross-Attention? They are actually very similar! Self-attention computes the interactions between the different elements of an input sequence (for example, the different words in a sentence), and cross-attention computes the interactions between the elements of 2 different input sequences (for example, how the words in one sentence influence the words of another sentence).
Both of those attentions can be computed by the same process. We have 3 matrices, Wk, Wq, and Wv, and they project the input vectors into Keys, Queries, and Values vectors. The self-attentions are computed by using the same input vectors, whereas the cross-attentions are computed by using vectors coming from 2 different sources. Those input vectors in the case of self-attention can be internal hidden states within a Transformer, for example, and they can be the encoder output and the internal hidden states of a decoder in the case of an encoder-decoder Transformer for the cross-attentions. For the cross-attentions, the encoder output gets projected as Keys and Values, whereas the decoder hidden states get projected as Queries.
Then, the softmax transformation of the matrix multiplication between Keys and Queries creates the attentions, self, or cross depending on the input vectors. The output of the attention layer is just the matrix multiplication between the attention matrix and the Values vectors.
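A minimal sketch of that shared process (shapes and names are illustrative): the only difference between the two is whether the Keys and Values come from the same source as the Queries.
```python
import torch
import torch.nn.functional as F

d_model = 64
Wq = torch.nn.Linear(d_model, d_model, bias=False)   # projects inputs to Queries
Wk = torch.nn.Linear(d_model, d_model, bias=False)   # projects inputs to Keys
Wv = torch.nn.Linear(d_model, d_model, bias=False)   # projects inputs to Values

def attention(query_source, key_value_source):
    """Same tensor twice -> self-attention; decoder states + encoder output -> cross-attention."""
    Q, K, V = Wq(query_source), Wk(key_value_source), Wv(key_value_source)
    scores = Q @ K.transpose(-2, -1) / d_model ** 0.5   # (query length, key length)
    attn = F.softmax(scores, dim=-1)                    # the attention matrix
    return attn @ V                                     # weighted sum of the Values

decoder_states = torch.randn(5, d_model)   # e.g. decoder hidden states
encoder_output = torch.randn(7, d_model)   # e.g. encoder output

self_attention  = attention(decoder_states, decoder_states)
cross_attention = attention(decoder_states, encoder_output)
print(self_attention.shape, cross_attention.shape)   # (5, 64) (5, 64)
```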
It is much easier to train Language Models now than it used to be! The amount of text processing needed to obtain usable models was pretty intense. I remember spending many hours testing all the tricks like stemming or lemmatization in Spacy or NLTK!
Now, LLMs can take text pretty much as such. We just need to tokenize it! Tokenizing means we break down the text into sub-word units, but it also means that we need to add special tokens like the beginning or end of sentence tokens ([BOS], [EOS]). One particular token is the Padding token [PAD].
When we train LLMs, we apply the batched backpropagation algorithm. To parallelize the computations, we need the input sentences to all have the same length so we can treat the whole batch as one tensor. The [PAD] token allows us to pad the shorter sentences in the batch.
Those [PAD] tokens are semantically meaningless, and they should not contribute to the computed attentions within the transformer architecture. The trick is to add a padding mask to the attention computations by setting the elements related to the [PAD] tokens within the attention matrix to zero. This way, they don't contribute to the overall prediction process and text generation. We just need to make sure not to use the hidden states related to those [PAD] tokens for anything other than getting a tensor of the right size!
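To make this concrete, here is a small sketch assuming the Hugging Face transformers tokenizer API: `padding=True` pads the shorter sentence, and the returned attention_mask marks the [PAD] positions the attention layers should ignore.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence that needs no extra padding."],
    padding=True,          # pad the shorter sentence to the length of the longest in the batch
    return_tensors="pt",
)
print(batch["input_ids"])       # the short sentence ends with [PAD] token ids
print(batch["attention_mask"])  # 1 for real tokens, 0 for [PAD] tokens: the padding mask
```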
-
BERT explained: Training, Inference, BERT vs GPT/LLamA, Fine tuning, [CLS] token
https://www.youtube.com/watch?v=90mGPxR2GgY&ab_channel=UmarJamil
-
Introduction to State Space Models (SSM), https://huggingface.co/blog/lbourdois/get-on-the-ssm-train
-
Variational Autoencoder - Model, ELBO, loss function and maths explained easily!
https://www.youtube.com/watch?v=iwEzwTTalbg&ab_channel=UmarJamil
-
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
https://www.youtube.com/watch?v=Mn_9W1nCFLo&ab_channel=UmarJamil
-
Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer
https://www.youtube.com/watch?v=UiX8K-xBUpE&ab_channel=UmarJamil
https://github.com/hkproj/mistral-src-commented
https://github.com/hkproj/mistral-llm-notes
- Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math
https://www.youtube.com/watch?v=8Q_tqwpTpVU&ab_channel=UmarJamil
https://github.com/hkproj/mamba-notes
-
Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm
https://www.youtube.com/watch?v=oM4VmoabDAI&ab_channel=UmarJamil
-
How to create tokens from words in LLMs?
Why do we keep talking about "tokens" in LLMs instead of words? It happens to be much more efficient to break the words into sub-words (tokens) for model performance!
The typical strategy used in most modern LLMs (GPT-1, GPT-2, GPT-3, ChatGPT, Llama 2, etc.) is the Byte Pair Encoding (BPE) strategy. The idea is to use as tokens sub-word units that appear often in the training data. The algorithm works as follows:
- We start with a character-level tokenization
- We count the pair frequencies
- We merge the most frequent pair
- We repeat the process until the dictionary is as big as we want it to be
The size of the dictionary becomes a hyperparameter that we can adjust based on our training data. For example, GPT-1 has a dictionary size of ~40K merges, GPT-2, GPT-3, ChatGPT have a dictionary size of ~50K, and Llama 2 only 32K.
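A toy sketch of that merge loop (illustrative only, not a production tokenizer; real vocabularies use tens of thousands of merges):
```python
from collections import Counter

corpus = ["low", "lower", "lowest", "newest", "widest"]
words = [list(w) for w in corpus]          # start with character-level tokens

def count_pairs(words):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))        # count adjacent token pairs
    return pairs

def merge_pair(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])   # merge the two symbols into one token
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

num_merges = 6   # the vocabulary-size hyperparameter discussed above
for _ in range(num_merges):
    pairs = count_pairs(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)       # most frequent pair in the corpus
    words = [merge_pair(w, best) for w in words]
    print(best, "->", words)
```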
-
How does masked language modeling work?
What is Language Modeling? It is the modeling task of learning the distribution of words in text data. One typical approach is Masked Language Modeling: we mask some tokens of the input data, and we want to predict what those masked tokens were. This was the original way to train Transformers, starting with BERT.
We want to train the model to learn the probabilities of the words in the sequence. The prediction matrix for each sample in a batch has a dimension [Sequence size, Vocabulary size]. For each position in the token sequence, we have a probability for each token in the vocabulary. Of course, what interests us the most are the positions where the words are masked in the input data.
To get the prediction matrix with this dimension, we need to be careful about the prediction head we are using. For each input in the sequence, we get a hidden state coming out of the LLM. For each sample within a batch, the resulting tensor coming out of the LLM has a dimension [Sequence size, Hidden state size]. Therefore, the Language modeling head is a simple linear layer with the number of input features to be [Hidden state size] and the number of output features to be [Vocabulary size]. Think about the linear layer as a projection matrix of size [Hidden state size, Vocabulary size] that will resize the hidden state to the vocabulary size.
To train the model, we simply need to compare the predictions for the words that are masked and all the other words are ignored. Typically, we use the cross-entropy loss function for the LLM to learn to predict the masked words.
To generate a sequence at inference time, there are multiple strategies. The simplest one is to choose the word with the highest predicted probability and to auto-regress. Let’s say we have the first word being “Machine“ as input. Using this as input, we choose the second word in the sequence with the highest probability. Let’s say it is “learning“; now the sequence becomes “Machine learning“. Using those two words as input, we choose the word with the highest probability for the 3rd word in the sequence. We iterate this process until we meet an ending condition, such as reaching the maximum number of tokens or generating an end-of-sequence ([EOS]) token.
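A small sketch of masked-token prediction, assuming the Hugging Face transformers library (model name illustrative): the head projects each hidden state to vocabulary size, and we read off the prediction at the masked position.
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("Machine [MASK] is a subfield of AI.", return_tensors="pt")
logits = model(**inputs).logits                      # shape: (batch, sequence size, vocabulary size)

# Find the masked position and take the most probable token for it.
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_position].argmax()
print(tokenizer.decode(predicted_id))                # most likely "learning"
```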
-
The RNN Encoder-Decoder Architecture
-
Attention mechanisms before transformers
The Attention Mechanism didn't start with Transformers! It was designed to alleviate typical weaknesses of RNNs. The idea was to be able to predict the next word in a sentence by taking into account the signal of all the words in the input sentence.
It was proposed in 2014 by Bahdanau and later improved by Luong in 2015, and it addressed some weaknesses of the RNN encoder-decoder architecture. Recurrent networks generate two types of output vectors: the output vectors at the last layer for each of the input words, and the hidden states at the last time step for each layer in the network. Because we may want to generate an output sequence with a different size than the input sequence, it was considered a better idea to feed the decoder with the encoder's hidden states, whose size is independent of the input sequence length. But the hidden states are a compressed tensor representation of the input sequence: they lose the information about the individual words and their order. The attention mechanism was a way to use the output vectors instead, which depend on the input sequence length and provide more refined information about the input sequence.
-
Attention is all you need
Transformers are taking every domain of ML by storm! I think it is becoming more and more important to understand the basics, so pay attention because Attention is there to stay!
At the center of Transformers is the self-attention mechanism, and once you get the intuition, it is not too difficult to understand. Let me try to break it down:
As inputs to a transformer, we have a series of contiguous inputs, for example, words (or tokens) in a sentence. When it comes to contiguous inputs, it is not too difficult to see why time series, images, or sound data could fit the bill as well.
Each input has its vector representation in an embedding matrix. As part of the attention mechanism, we have 3 matrices, Wq, Wk, and Wv, that project each of the input embedding vectors into 3 different vectors: the Query, the Key, and the Value. This jargon comes from retrieval systems, but I don't find it particularly intuitive!
For each word, we take its Query vector and compute its dot products with the Key vectors of all the words. This gives us a sense of how similar the Queries and the Keys are, and that is the basis behind the concept of "attention": how much attention should a word pay to another word in the input sequence for the specific learning task? A Softmax transform normalizes the resulting scores and further accentuates the high similarities. The resulting matrix is called the self-attention matrix!
This gives us one vector of attention weights for each word. For each word, we then compute the weighted sum of the Value vectors of all the words, using those attention weights. We have now computed the hidden states, or context vectors!
Repeat this process multiple times with multiple attention layers, and this gives you a multi-head attention layer. This helps diversify the learning of the possible relationships between the words. The resulting hidden states are combined into final hidden states by using a linear layer.
The original Transformer block is just an attention layer followed by a set of feed-forward layers with a couple of residual units and layer normalizations. A "Transformer" model is usually multiple Transformer blocks, one after the other. Most language models follow this basic architecture. I hope this explanation helps people trying to get into the field!
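A compact sketch of such a block in PyTorch (dimensions are illustrative, and details like pre- vs post-normalization are glossed over):
```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Queries, Keys, Values from the same input
        x = self.norm1(x + attn_out)       # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))     # feed-forward layers + residual + layer normalization
        return x

tokens = torch.randn(1, 10, 64)            # (batch, sequence length, embedding size)
print(TransformerBlock()(tokens).shape)    # (1, 10, 64): one hidden state per input token
```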
-
Diffusion Models
What is a Diffusion model in Machine Learning? Conceptually, it is very simple! You add some noise to an image, and you learn to remove it. Train a machine learning model that takes as input a noisy image and as output a denoised image, and you have a denoising model.
The typical way to do it is to assume a normal distribution of the noise and to parametrize the distribution mean and standard deviation matrix. Effectively, we can actually reduce the problem to just learning the mean matrix. The process can be divided into the forward process, where white noise (Gaussian distributed) is progressively added to a clean image, and the reverse process, where a learner progressively learns to denoise the noisy image until it is back to being clean: https://lnkd.in/gJ7gRJij.
Why is that called a diffusion model? What does that have to do with the diffusive process of particles in a fluid with a gradient of concentration (https://lnkd.in/gn_FR_Ua)? This is due to the way mathematicians have abused the jargon of the physical process to formalize a mathematical concept. It happens that physical phenomena like Fick diffusion (https://lnkd.in/gKRreTpn), heat diffusion (https://lnkd.in/gB5tWpp6), and Brownian motion (https://lnkd.in/gpKRbkak) are all well described by the diffusion equation: https://lnkd.in/gB5tWpp6, the first time derivative of a state function is equal to the second space derivative of that state function. That diffusion equation has an equivalent stochastic formulation known as the Langevin equation: https://lnkd.in/g9Fjwtxx. At the core of the Langevin equation is a mathematical object called the Wiener process: https://lnkd.in/gmf54HPX. Interestingly enough, this process is also called a Brownian motion (not to be confused with the physical process). It can be thought of as a Random Walk with infinitely small steps: https://lnkd.in/gh6ef5RB. The key feature of the Wiener process is that a time increment of that object is Normal distributed. That is why the concept of "diffusion" is intertwined with the white noise generation process, and that is why those ML models are called diffusion models.
Those diffusion models are generative models as data is generated using a Gaussian prior, and they are the core of the text-to-image generative models such as Stable Diffusion, DALL-E 2, and MidJourney.
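A tiny sketch of the forward (noising) process under the usual closed-form Gaussian assumption (schedule values illustrative); the denoiser is then trained to predict the injected noise:
```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative products of (1 - beta_t)

def add_noise(x0, t):
    """Jump straight to step t of the forward process for a clean image x0."""
    noise = torch.randn_like(x0)                 # white (Gaussian) noise
    xt = alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise
    return xt, noise                             # a denoiser learns to recover `noise` from (xt, t)

x0 = torch.rand(1, 3, 32, 32)                    # a stand-in "clean image"
xt, target_noise = add_noise(x0, t=500)
print(xt.shape)                                  # (1, 3, 32, 32), now heavily noised
```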
-
How To Train an LLM With Diffusion From Scratch
https://www.youtube.com/watch?v=jMizUI8Ki1I&ab_channel=Oxen
https://www.oxen.ai/blog/arxiv-dives-text-diffusion-with-sedd
-
How to summarize texts with LLMs
With LangChain, it is not difficult to summarize text of any length. To summarize text with an LLM, there are a few strategies.
If the whole text fits in the context window, then you can simply feed the raw data and get the result. LangChain refers to that strategy as the “stuff“ chain type. Often, the number of tokens contained in the text is larger than the LLM's maximum capacity. A typical strategy is to break down the data into multiple chunks, summarize each chunk, and summarize the concatenated summaries in a final "combine" step. LangChain refers to this strategy as “map-reduce“.
Another strategy is to begin the summary with the first chunk and refine it little by little with each of the following chunks. LangChain refers to this as “refine“. For example, here is the prompt template used by LangChain for the refine step:
""" Your job is to produce a final summary We have provided an existing summary up to a certain point: {existing_answer} We have the opportunity to refine the existing summary (only if needed) with some more context below.
Given the new context, refine the original summary If the context isn't useful, return the original summary. """
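For reference, a hedged sketch of the three chain types, assuming a classic LangChain release (the summarization API has moved around between versions) and an OpenAI key; the file and model names are illustrative:
```python
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI   # assumes the langchain-openai package is installed

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
docs = splitter.create_documents([open("long_report.txt").read()])

# "stuff": one prompt with everything; "map_reduce": summarize chunks then combine;
# "refine": update a running summary chunk by chunk.
for chain_type in ("stuff", "map_reduce", "refine"):
    chain = load_summarize_chain(llm, chain_type=chain_type)
    print(chain_type, chain.run(docs))
```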
- How to 16x Llama 2's context window size?
Did you know that LLama 2 is probably the best choice if you need a large context window? At first glance, LLama 2 has a context window size of 4096, which seems smaller than ChatGPT's 16K, GPT-4's 32K, and Claude 2's 100K, but the magic is in the open source!
4096 tokens, that is about 3000 words. Not bad, but it limits the possible applications. The typical Transformer architecture is composed of Embeddings to encode the text input, multiple transformer blocks, and a prediction head specific to the learning task the LLM is used for. To encode the text, we use a text embedding matrix T that has the size of the token vocabulary and a positional embedding P that encodes the position of the token in the input sequence. That positional embedding size defines the context size. The embedding can be learned, or it can be a simple sine function of the position index. Typically, they are added together as T + P such that the same word is encoded differently at positions i and j.
The great thing about LLama 2 is that it uses Rotary Positional Embeddings (RoPE) as opposed to the typical sine-function encoding. Each attention layer is modified using that embedding, and it ensures that the computed attention between input tokens depends only on the distance between those tokens. If token T1 is at position i and token T2 is at position j, the attention A(T1, T2) = f(j - i) is a function of j - i. The attention does not depend on the tokens' absolute positions, only on their relative positions.
The technique they use at Meta to extend the context window is to interpolate at non-integer positions. Basically, if the original window size is L, you can extend it to L' (with L' > L) by rescaling the integer positions
i' = i * L / L'
As an example, if you wanted to have a text input of 16,384 tokens (so 4x the window size of LLama 2) into LLama 2, you would just need to divide every integer position by 4: i' = i / 4. To be clear, if you look at the implementation of LLama 2 available on GitHub (line 50 in model.py today https://lnkd.in/gGvUye6K), you would just need to replace the following line of code
t = torch.arange(end, device=freqs.device) by t = torch.arange(end, device=freqs.device) / 4
How simple is that? Because the model was not trained for that position embedding, you would need to fine-tune the model a bit to adapt it to the new context window and position embedding. When we consider that LLama 2 will most likely be fine-tuned on private data, being able to dynamically adapt the context window to our needs as we fine-tune it is the icing on the cake.
You can look at the method here: https://lnkd.in/gPUzdBPi. They were able to extend LLama's context window by 16 times while keeping the performance at the same level!
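A rough sketch of that position-interpolation trick, loosely following the Llama-style precomputation of rotary frequencies (function and parameter names mirror Meta's reference code but are only illustrative here; a short fine-tune is still needed afterwards):
```python
import torch

def precompute_freqs_cis(dim, end, theta=10000.0, scale=1.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(end) / scale                       # the only change: rescale the integer positions
    freqs = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(freqs), freqs)   # complex rotations cos(theta) + i*sin(theta)

freqs_4k  = precompute_freqs_cis(dim=128, end=4096)                  # original 4K context, scale = 1
freqs_16k = precompute_freqs_cis(dim=128, end=16384, scale=4.0)      # 16K tokens squeezed into the trained range
```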
-
Aligning Open Language Models, https://docs.google.com/presentation/d/1quMyI4BAx4rvcDfk8jjv063bmHg4RxZd9mhQloXpMn0/edit#slide=id.g2ca00c5c0f9_0_0 https://www.youtube.com/watch?v=AdLgPmcrXwQ&ab_channel=StanfordOnline
-
seemore: Implement a Vision Language Model from Scratch
https://huggingface.co/blog/AviSoori1x/seemore-vision-language-model
-
Vision Language Models Explained
-
1-Bit LLM INSTALLATION| 7B LOCAL LLMs in 1-Bit + Test Demo
https://www.youtube.com/watch?v=InMicVYVw-I&ab_channel=DataInsightEdge
https://colab.research.google.com/drive/1GODyuOcrj5ADkxqIN-F5L2FnTSO5C3cf?usp=sharing
https://mobiusml.github.io/1bit_blog/
https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq
-
V-JEPA: Video Joint Embedding Predictive Architecture
https://www.youtube.com/watch?v=4X_26j5Z43Y&ab_channel=AIPapersAcademy
https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
-
Generate Summaries with Topic Focus using CPU-friendly Model SLIM
https://www.youtube.com/watch?v=yNg_KH5cPSk&ab_channel=llmware
https://huggingface.co/llmware/slim-summary
-
Get Yes/No Answers from Text using Small Language Model (SLIM Boolean, CPU-friendly)
https://www.youtube.com/watch?v=jZQZMMqAJXs&ab_channel=llmware
-
Introduction to ggml
-
Gorilla
Gorilla is an LLM that can provide appropriate API calls. It is trained on three massive machine learning hub datasets: Torch Hub, TensorFlow Hub, and HuggingFace. We are rapidly adding new domains, including Kubernetes, GCP, AWS, OpenAPI, and more. Zero-shot Gorilla outperforms GPT-4, Chat-GPT and Claude.
Gorilla is extremely reliable and significantly reduces hallucination errors. Gorilla enables LLMs to use tools by invoking APIs. Given a natural language query, Gorilla comes up with the semantically and syntactically correct API to invoke. With Gorilla, we are the first to demonstrate how to use LLMs to invoke 1,600+ (and growing) API calls accurately while reducing hallucination. We also release APIBench, the largest collection of APIs, curated and easy to train on! Join us as we try to expand the largest API store and teach LLMs how to write them! Hop on our Discord, open a PR, or email us if you would like to have your API incorporated as well.
https://gorilla.cs.berkeley.edu/
https://github.com/ShishirPatil/gorilla
https://colab.research.google.com/drive/1DEBPsccVLF_aUnmD0FwPeHFrtdC0QIUP?usp=sharing
-
Benchmarking LLMs and what is the best LLM?
https://msandbu.org/benchmarking-llms-and-what-is-the-best-llm/
-
Multimodal LLMs
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
-
Mixture of Experts (MoEs)
* What is a Mixture-of-Experts (MoE)? ![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/536aecab-1e37-46d2-b2c8-82711b7f03cd)
* Towards Understanding Mixture of Experts in Deep Learning, https://arxiv.org/abs/2208.02813
* Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models, https://arxiv.org/abs/2305.14705
* Mixture of Experts Explained, https://huggingface.co/blog/moe
* Welcome Mixtral - a SOTA Mixture of Experts on Hugging Face, https://huggingface.co/blog/mixtral
* SegMoE: Segmind Diffusion Mixture of Experts (MoEs) Model, https://www.youtube.com/watch?v=gIz7Td6WfEo
* Mixtral Fine tuning and Inference, https://www.youtube.com/watch?v=EXFbZfp8xCI&ab_channel=TrelisResearch
* Understanding Mixture of Experts, https://www.youtube.com/watch?v=0U_65fLoTq0&ab_channel=TrelisResearch
* How To Install Uncensored Mixtral Locally For FREE! (EASY), https://www.youtube.com/watch?v=DC2te4CZXeM&ab_channel=WorldofAI
* Fully Uncensored MIXTRAL Is Here 🚨 Use With EXTREME Caution, https://www.youtube.com/watch?v=q2KpPUOsBCs&ab_channel=MatthewBerman
* Deploy your AI Streamlit App, https://youtu.be/74c3KaAXPvk?si=mHuW18-fvW1sJswn
* **makemore**: It takes one text file as input, where each line is assumed to be one training thing, and generates more things like it. Under the hood, it is an autoregressive character-level language model, with a wide choice of models from bigrams all the way to a Transformer (exactly as seen in GPT). For example, we can feed it a database of names, and makemore will generate cool baby name ideas that all sound name-like, but are not already existing names. Or if we feed it a database of company names, then we can generate new ideas for a company name. Or we can just feed it valid Scrabble words and generate English-like babble. https://github.com/karpathy/makemore
* makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch ![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/a359ba90-3bd1-4dbb-a9b0-b6fa8c586759) https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch ![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/b49caf89-b5bd-4d85-8724-696c776444ea) Top-k Gating Intuition through an Example ![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/32c05293-402b-4cd4-9a3f-c5f56f9b3101) Router Noisy Top-k Gating ![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/946a63cf-067e-41b7-9a88-b4afb22ce245) https://github.com/AviSoori1x/makeMoE/tree/main
* Evolving New Foundation Models: Unleashing the Power of Automating Model Development ![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/48d436f3-5a71-4d81-a049-c603faf9a4c5) https://sakana.ai/evolutionary-model-merge/
* Orchestration of Experts: The First-Principle Multi-Model System ![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/c89c118e-0003-48b0-b576-c169e8e5b61f) https://huggingface.co/blog/alirezamsh/leeroo-multi-model-system
* Mergoo: Efficiently Build Your Own MoE LLM, https://huggingface.co/blog/alirezamsh/mergoo
-
How to play a Chess Game ChatGPT and Llama 2
It is not tomorrow that LLama 2 is going to replace ChatGPT, and it is not tomorrow that those LLMs are going to take over the world! In my opinion, LLama 2 only makes sense if you need to fine-tune your model with your own data. The biggest LLama 2 model has 70B parameters. With 4 bytes per parameter, that is a 280 GB model, so count ~400 GB of GPU hardware to have one standing model for inference. Using AWS GPU pricing, that is $4 / hr on the low end. With ChatGPT, on the other hand, the cost is $0.0015 / 1K tokens. If you count 4 tokens per word, to get to $4 / hr you need to send 700K words / hr to the API. That is about 10 books with 300 pages each. If your model consumes less input than that, don't bother with LLama 2.
A fine-tuned model is another story. For both models, you need to swallow the training cost, but LLama's inference cost remains the same, whereas inference on a fine-tuned GPT-3 is $0.12 / 1K tokens (~100 times the cost of the non-fine-tuned model), as OpenAI charges very differently for hosting custom models.
In terms of performance evaluation, what about a little chess tournament? I used the [Replicate API for LLama](https://replicate.com/meta/llama-2-70b-chat) and the OpenAI API for ChatGPT and GPT-4. The AiEdge used the [Python Chess package for the game structure](https://python-chess.readthedocs.io/en/latest/). The AiEdge fed the current state of the board, the history of the past moves, and the current available legal moves within the prompt to guide the LLMs. After multiple rounds, ChatGPT destroyed LLama; it was a tie between GPT-4 and LLama, and a tie between GPT-4 and ChatGPT (for some reason!). GPT-4 was not the greatest at chess, but it was great at making a big hole in my bank account due to its cost! LLama seemed to play like a bored goldfish, moving the same pieces back and forth, not being really clear on what it was supposed to do.
The AiEdge tried to use the non-official Bard API (https://lnkd.in/gJUGA4fV), but that model is about as good as a 3-year-old toddler at listening to commands within the prompts. Whatever way I would engineer my prompts, Bard could not follow the basic instructions to get my code to work and would ramble like a drunk Woody Allen, so The AiEdge gave up. Painful experience!
The AiEdge would have loved to get Claude 2 to participate but Anthropic keeps "forgetting" to provide API access to their customers. The AiEdge used a chess engine (https://lnkd.in/dG8TvhBQ) to compare and it crushed any of the LLMs in a few moves every time. It seems that LLMs are unable to form coherent strategies to solve these kinds of problems. LLMs are not ready to replace us anytime soon!
-
Merge Large Language Models with mergekit
Classification of model merging methods. We currently support the model merging methods outlined on the left, and we are actively working to incorporate additional merging techniques such as ZipIt, OT Fusion, and Git Rebasin.
MergeKit structure with key modules for adding new merge methods. The diagram depicts the workflow for introducing new merge methods in the MergeKit repository. Initially, tensors are extracted from two models, A and B, and processed by the ‘Architecture’ module to ensure their structural compatibility. Next, the ‘Plan’ component formulates a strategy detailing the merge process. This plan is then relayed to a ‘Graph’, outlining the necessary operations for merging. During ‘Graph Execution’, these operations are performed, resulting in the ‘Merged Model’—the integration of Models A and B via a specified merging technique within the system’s framework.
https://huggingface.co/blog/mlabonne/merge-models
https://colab.research.google.com/drive/1_JS7JKJAQozD48-LhYdegcuuZ2ddgXfr?usp=sharing
-
Deep dive: model merging using Mergekit
https://www.youtube.com/watch?v=cvOpX75Kz4M&ab_channel=JulienSimon
-
Merge LLMs with Mergekit: create your own medical mixture of experts
-
How to Merge LLMs Locally with MergeKit
https://www.youtube.com/watch?v=W5ep1oJb3ME&ab_channel=FahdMirza
-
A brief analysis of automerger data, feat. SLERP and DARE-TIES LLM merging
https://huggingface.co/blog/kgourgou/a-first-look-at-automerger-data
-
Create Mixtures of Experts with MergeKit
-
Merge LLMs with No Code Mergekit GUI
https://www.youtube.com/watch?v=TkKAmwO8oOY&ab_channel=AIAnytime
-
Token Merging for fast LLM inference : Background and first trials with Mistral
https://huggingface.co/blog/samchain/token-merging-fast-inference
-
LLM-as-a-judge
LLM-evaluators, also known as “LLM-as-a-Judge”, are large language models (LLMs) that evaluate the quality of another LLM’s response to an instruction or query.
Their growing adoption is partly driven by necessity. LLMs can now solve increasingly complex and open-ended tasks such as long-form summarization, translation, and multi-turn dialogue. As a result, conventional evals that rely on n-grams, semantic similarity, or a gold reference have become less effective at distinguishing good responses from the bad. And while we can rely on human evaluation or finetuned task-specific evaluators, they require significant effort and high-quality labeled data, making them difficult to scale.
https://eugeneyan.com/writing/llm-evaluators/
-
Intro to Large Language Models by Andrej Karpathy
https://twitter.com/karpathy/status/1723140519554105733?lang=en
https://www.youtube.com/watch?v=zjkBMFhNj_g&ab_channel=AndrejKarpathy @ 42:
-
AIOS: LLM Agent Operating System
AIOS embeds large language models into Operating Systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI. AIOS is designed to optimize resource allocation, facilitate context switching across agents, enable concurrent execution of agents, provide tool services for agents, maintain access control for agents, and provide a rich set of toolkits for LLM Agent developers.
https://github.com/agiresearch/AIOS
- LLM agent operating system (AIOS) and the future of LLM-powered agents, https://medium.com/@simeon.emanuilov/llm-agent-operating-system-aios-and-the-future-of-llm-powered-agents-3d08b4e91c34
-
MemGPT: Towards LLMs as Operating Systems, https://arxiv.org/abs/2310.08560
Create LLM agents with long-term memory and custom tools.
-
TinyChatEngine: On-Device LLM Inference Library, https://github.com/mit-han-lab/TinyChatEngine
-
Build the LLM OS | Autonomous LLMs as the new Operating System, https://www.youtube.com/watch?v=YMZm7LdGQp8&ab_channel=Phidata
https://github.com/phidatahq/phidata/tree/main/cookbook/llm_os
-
LLM OS with gpt-4o, https://www.youtube.com/watch?v=6g2KLvwHZlU&ab_channel=Phidata
-
LLM OS Blogs
- Part 1 : Introduction to LLM OS, https://medium.com/@protegeigdtuw/part-1-introduction-to-llm-os-1cfec39689f7
- Part 2 : Challenges and Solutions to LLM OS, https://medium.com/@protegeigdtuw/part-2-challenges-and-solutions-to-llm-os-1cc0fec2ac57
- Part 3 : Implementation and technology of LLM OS, https://medium.com/@protegeigdtuw/part3-implementation-and-technology-of-llm-os-a3d296a2ab73
- Part 4 : Use cases and User experiences -LLM OS, https://medium.com/@protegeigdtuw/part-4-use-cases-and-user-experiences-llm-os-71f6d0763773
-
Illustrated LLM OS: An Implementational Perspective, https://huggingface.co/blog/shivance/illustrated-llm-os
-
Want to understand the Transformer architecture?
- the encoder
- the decoder
- the position embedding
- the encoder block
- the self-attention layer
- the layer-normalization
- the position-wise feed-forward network
- the decoder block
- the cross-attention layer
- the predicting head
-
How to feed data to a Transformer
If you think about Transformers, chances are you are thinking about NLP applications, but how can we use Transformers for data types other than text? Actually, you can use Transformers on any data that you are able to express as a sequence of vectors, which is what Transformers feed on! Typically, any sequence or time series of data points should be able to fit the bill.
Let's consider image data, for example. An image is not per se a sequence of data, but the local correlation of the pixels sure resembles the concept. For the Vision Transformer (ViT: https://lnkd.in/gPC_iFaV), the guys at Google simply created patches of an image that were flattened through linear transformations into a vector format. By feeding images to Transformers through this process, they realized that typical CNNs were performing better on a small amount of data, but Transformers were getting better than CNNs if the scale of the data was very high.
Time series are obviously good candidates for Transformers. For example, for the Temporal Fusion Transformer (https://lnkd.in/gfMTHYBc), they transform the time series into right-sized vectors through LSTM layers, as they say, to capture the short-term correlations of the data, while the multi-head attention layers take care of capturing the long-term correlations. They beat all the time series benchmarks with this model, but I wonder how scalable it is with those LSTM layers! You can use it in PyTorch: https://lnkd.in/gzisFCUF
Sequencing proteins seems to be an obvious application of Transformers, considering the language analogy of amino acid sequences. Here, you just need to have an amino acid embedding to capture the semantic representation of protein unit tokens. Here is a Nature article on generating new proteins with Transformers: https://lnkd.in/gzeiuZ8w, and here is its BioaRXiv version: https://lnkd.in/gQgHg-sm.
Reinforcement Learning, expressed as a Markov chain of states, actions, and rewards, is another good one. For the Decision Transformer (https://lnkd.in/giJCnXJX), they encoded each state, action, and reward as a vector and concatenated them into one final vector. For example, in the case of video games, a state can simply be the image on the screen at time t, and you extract the latent features with a CNN. An action can be encoded with an embedding, and a scalar reward can be seen as a vector with 1 dimension. Apparently, they beat all the benchmarks as well! You can find the code here: https://lnkd.in/gwFdrZHX.
Looking forward to seeing what Transformers are going to achieve in the coming years!
-
What are Transformers and GPTs?, https://www.youtube.com/watch?v=ucityipiNtA&ab_channel=RicardoCalix
-
High overview of the original Transformer architecture for Large Language Models, https://www.youtube.com/watch?v=zxVhAYkSYcY&ab_channel=RicardoCalix
-
Coding a Transformer from scratch on Pytorch with full explanation training and Inference, https://youtu.be/ISNdQcPhsts?si=EA3BSRVo1Tr4Z4NC
- GPTs, BERTs, Full Transformers, in PyTorch (Part 1), https://www.youtube.com/watch?v=s6gys0iozLk&ab_channel=RicardoCalix
- GPTs, BERTs, Full Transformers, in PyTorch (Part 2), https://www.youtube.com/watch?v=a1qomZy_yfo&ab_channel=RicardoCalix
- GPU Scholar cloud, GPTs, BERTs, Full Transformers, in PyTorch (Part 3), https://www.youtube.com/watch?v=klQnQMoy9zI&ab_channel=RicardoCalix
- Embeddings, GPTs, BERTs, Full Transformers, in PyTorch (Part 4), https://www.youtube.com/watch?v=yNZCcF6a7a4&ab_channel=RicardoCalix
- The simple linear algebra for Attention, GPTs, BERTs, and Full Transformers in PyTorch (part 5), https://www.youtube.com/watch?v=zgH69JoAB_k&ab_channel=RicardoCalix
-
Implementing a simple GPT in PyTorch, https://www.youtube.com/watch?v=RsQxg913eXY&ab_channel=RicardoCalix
-
Implementing a simple GPT in PyTorch (Take Two), https://www.youtube.com/watch?v=zyDzpVu9lyA&ab_channel=RicardoCalix
-
Starting with GPTs (A Hello World Example), https://www.youtube.com/watch?v=oPcJg3QrKf4&ab_channel=RicardoCalix
-
Intro to Reinforcement Learning through Human Feedbacks (RLHF), https://www.youtube.com/watch?v=A8YqZKGRTAM&ab_channel=RicardoCalix
-
What is an instruct model? - Instruction and Chat Fine-Tuning
As you browse the ever-growing global catalogue of generative AI models, you will see some of the Large Language Models (LLMs) listed with the suffix 'instruct' or 'chat'. What does this mean?
TL;DR: The 'instruct' version of a model has been fine-tuned to be able to follow prompted instructions. These models 'expect' to be asked to do something. Models with the 'chat' suffix have been fine-tuned to work in chatbots. These models 'expect' to be involved in a conversation with different actors. In contrast, non-instruct-tuned models will simply generate an output that follows on from the prompt. If you are making a chatbot, implementing RAG, or using agents, use instruct or chat models. If in doubt, use an instruct model.
https://community.aws/content/2ZVa61RxToXUFzcuY8Hbut6L150/what-is
-
Stanford CS25 - Transformers United Course
https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM
When we think about Transformers, we tend to think about LLMs, but they revolutionized the world of Computer Vision as well! The Vision Transformer has slowly been replacing typical convolutional networks for image analysis tasks.
Nothing in the Transformer architecture is intrinsically bound to NLP applications! As long as you can format your data into a sequence of vectors, you can feed it to a Transformer. It might feel odd to think about an image as a sequence of vectors, though.
The idea is to build visual tokens by breaking down the image into patches of pixels and flattening them down into vectors through a linear transformation. With a convolutional layer, we can transform an image into a sequence of vectors in one shot. As soon as we have vectors, we can pass them into a Transformer, as you would any textual tokens.
Image Credit: https://www.coursera.org/learn/generative-ai-with-llms/lecture/18SPI/generative-configuration
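A minimal sketch of that patching trick (patch size and dimensions illustrative): one convolution turns an image into a sequence of visual tokens ready to be fed to a Transformer.
```python
import torch
import torch.nn as nn

patch_size, d_model = 16, 768
to_patches = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patches = to_patches(image)                   # (1, 768, 14, 14): one vector per 16x16 patch
tokens = patches.flatten(2).transpose(1, 2)   # (1, 196, 768): a sequence of 196 visual tokens
print(tokens.shape)                           # ready to pass to a Transformer, like textual tokens
```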
-
Max tokens
The "max token" setting serves as a cap on the number of tokens (words or subwords, depending on the tokenizer) that the model will produce. For example, setting "max tokens" to 100 means the model's output will not exceed 100 tokens in length. Remember that it is a cap on new tokens, not an exact count of new tokens to generate.
- A smaller "max token" value might lead to more focused and relevant outputs, as the model is constrained to express ideas concisely.
- A larger "max token" value allows for more extensive exploration of ideas and concepts, potentially leading to more detailed and expansive outputs. However, it also increases the risk of the model veering off-topic or generating repetitive or irrelevant content.
-
Greedy Decoding
Most large language models by default will operate with so-called greedy decoding. This is the simplest form of next-word prediction, where the model will always choose the word with the highest probability. This method can work very well for short generation but is susceptible to repeated words or repeated sequences of words.
-
Random Sampling
If you want to generate text that is more natural and more creative and avoids repeating words, you need to use some other controls. Random sampling is the easiest way to introduce some variability. Instead of selecting the most probable word every time, with random sampling the model chooses an output word at random, using the probability distribution to weight the selection. For example, in the illustration, the word banana has a probability score of 0.02. With random sampling, this equates to a 2% chance that this word will be selected. By using this sampling technique, we reduce the likelihood that words will be repeated. However, depending on the settings, there is a possibility that the output may be too creative, producing words that cause the generation to wander off into other topics or words that just don't make sense. Note that in some implementations, you may need to disable greedy decoding and enable random sampling explicitly. For example, the Hugging Face transformers implementation that we use in the lab requires that we set `do_sample` to `True`.
One more parameter that you can use to control the randomness of the model output is known as temperature. This parameter influences the shape of the probability distribution that the model calculates for the next token. Broadly speaking, the higher the temperature, the higher the randomness, and the lower the temperature, the lower the randomness. The temperature value is a scaling factor that's applied within the final softmax layer of the model that impacts the shape of the probability distribution of the next token.
In contrast to the top k and top p parameters, changing the temperature actually alters the predictions that the model will make. If you choose a low value of temperature, say less than one, the resulting probability distribution from the softmax layer is more strongly peaked with the probability being concentrated in a smaller number of words.
![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/2785ed16-6385-40fb-a1d0-e4b7af75f745)
![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/52d7cbdf-d666-4429-9706-865fd96a117f)
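A hedged example of those knobs, assuming the Hugging Face transformers generate API mentioned above (model name illustrative): max_new_tokens caps the output length, do_sample switches from greedy decoding to random sampling, and temperature, top_k, and top_p shape the sampling distribution.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("I love bananas because", return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=50,   # a cap on generated tokens, not an exact count
    do_sample=True,      # disable greedy decoding, enable random sampling
    temperature=0.8,     # <1 sharpens the next-token distribution, >1 flattens it
    top_k=50,            # sample only among the 50 most probable tokens
    top_p=0.9,           # ...restricted to the smallest set covering 90% of the probability mass
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```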
-
A Gentle Introduction to LLM Evaluation, https://www.confident-ai.com/blog/a-gentle-introduction-to-llm-evaluation
-
LLM Evaluation Essentials: Statistical Analysis of Hallucination LLM Evaluations, https://www.youtube.com/watch?v=IH45ltIMC3k&ab_channel=ArizeAI
https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/hallucinations
-
Advanced LLM Evaluation: Synthetic Data Generation, https://www.youtube.com/watch?v=AYehm7q6Oks&ab_channel=ArizeAI
-
Constructing an Evaluation Approach for Generative AI Models with Hugging Face's Rajiv Shah, https://www.youtube.com/watch?v=PtXOQDHPddE&ab_channel=ArizeAI
-
LLM Evaluation Essentials: Benchmarking and Analyzing Retrieval Approaches, https://www.youtube.com/watch?v=ExO3U0M3y_0&ab_channel=ArizeAI
-
The Ultimate Guide to Fine-Tune LLaMA 2, With LLM Evaluations, https://www.confident-ai.com/blog/the-ultimate-guide-to-fine-tune-llama-2-with-llm-evaluations
-
LLM Evaluation Metrics: Everything You Need for LLM Evaluation, https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
A simplified taxonomy of different metrics used in LLM evaluation
-
LLM Benchmarks: MMLU, HellaSwag, BBH, and Beyond, https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond
An LLM Benchmark Architecture
LLM Benchmark Categories
A data synthesizer architecture
-
LLM Evaluation Metrics: Everything You Need for LLM Evaluation, https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
-
How to Evaluate LLM Applications: The Complete Guide, https://www.confident-ai.com/blog/how-to-evaluate-llm-applications
-
A Step-By-Step Guide to Evaluating an LLM Text Summarization Task, https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task
-
LLM Testing in 2024: Top Methods and Strategies, https://www.confident-ai.com/blog/llm-testing-in-2024-top-methods-and-strategies
Unit tests make up functional, performance, and responsibility tests, which in turn make up a regression test
-
Evaluate LLMs with Prometheus LLM and Prometheus-Eval Locally
https://www.youtube.com/watch?v=YJ_jDZPj4V4&ab_channel=FahdMirza
-
MMLU-Pro : A New LLM Benchmark
There are two types of leaderboards for all competitions:
- Public Leaderboard: This leaderboard is calculated on X% of the test dataset, and is what you see on the competition page all the time. The value of X will be mentioned in the problem statement by the organizers.
- Private Leaderboard: This leaderboard is calculated on the remaining (100-X)% of the test dataset, and is made public only after the competition is over. Your final ranking is based on the private leaderboard.
-
Leaderboard Explorer, https://huggingface.co/spaces/leaderboards/LeaderboardsExplorer
-
Leaderboard on Huggingface, https://huggingface.co/spaces?sort=trending&search=leaderboard
-
Open LLM Leaderboard, https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
-
LMSys Chatbot Arena Leaderboard, https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
-
Open Medical-LLM Leaderboard, https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard
-
LiveCodeBench Leaderboard, https://huggingface.co/spaces/livecodebench/leaderboard
-
OpenVLM Leaderboard, https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
-
Big Code Models Leaderboard, https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
-
Open ASR Leaderboard, https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
-
LLM-Perf Leaderboard, https://huggingface.co/spaces/optimum/llm-perf-leaderboard
-
LLM Safety Leaderboard, https://huggingface.co/spaces/AI-Secure/llm-trustworthy-leaderboard
-
Hallucinations Leaderboard, https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard
-
Powered-by-Intel LLM Leaderboard, https://huggingface.co/spaces/Intel/powered_by_intel_llm_leaderboard
-
Deep Reinforcement Learning Leaderboard, https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
-
Artificial Analysis LLM Performance Leaderboard, https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard
Many more...
-
Ollama, https://github.com/ollama/ollama
-
Importing Open Source Models to Ollama, https://www.youtube.com/watch?v=fnvZJU5Fj3Q&ab_channel=Decoder
-
Installing Ollama to Customize My Own LLM, https://www.youtube.com/watch?v=xa8pTD16SnM&ab_channel=Decoder
-
Use Your Self-Hosted LLM Anywhere with Ollama Web UI, https://www.youtube.com/watch?v=syR0fT0rkgY&ab_channel=Decoder
-
Ollama has a Python library!, https://www.youtube.com/watch?v=JwYwPiOh72w&ab_channel=LearnDatawithMark
-
Building a local ChatGPT with Chainlit, Mixtral, and Ollama, https://www.youtube.com/watch?v=MiJQ_zlnBeo&ab_channel=LearnDatawithMark
-
Langroid: Chat to a CSV file using Mixtral (via Ollama), https://www.youtube.com/watch?v=XFTFEKYLxyU
-
Few Shot Prompting with Llama2 and Ollama, https://www.youtube.com/watch?v=ocfzGBnhhDE
-
Hugging Face GGUF Models locally with Ollama, https://www.youtube.com/watch?v=7BH4C6-HP14&ab_channel=LearnDatawithMark
-
Autogen: Ollama integration 🤯 Step by Step Tutorial. Mind-blowing!, https://www.youtube.com/watch?v=UQw04VW60U0&ab_channel=MervinPraison
-
Writing Better Code with Ollama, https://www.youtube.com/watch?v=NNBWmIve3fQ&ab_channel=MattWilliams
-
Ollama meets LangChain, https://www.youtube.com/watch?v=k_1pOF1mj8k&ab_channel=SamWitteveen
-
Instructor: The Best Way to get Typed Data from Ollama
-
Running Mixtral on your machine with Ollama, https://www.youtube.com/watch?v=rfr4p0srlqs&ab_channel=LearnDatawithMark
-
Running Mistral AI on your machine with Ollama, https://www.youtube.com/watch?v=NFgEgqua-fg&ab_channel=LearnDatawithMark
-
Ollama Python Library Released! How to implement Ollama RAG? https://www.youtube.com/watch?v=4HfSfFvLn9Q&ab_channel=MervinPraison
-
Ollama Web UI 🤯 How to run LLMs 100% LOCAL in EASY web interface? CRAZY!!🚀 (Step-by-Step Tutorial), https://www.youtube.com/watch?v=84vGNkW1A8s&ab_channel=MervinPraison
-
How TO Install Ollama Web UI | ChatGPT LIKE Interface, https://www.youtube.com/watch?v=bt4AR7sK9tk&ab_channel=DataScienceBasics
-
Ollama: The Easiest Way to Run Uncensored Llama 2 on a Mac, https://www.youtube.com/watch?v=tIRx-Sm3xDQ&ab_channel=IanWootten
-
Using Ollama To Build a FULLY LOCAL "ChatGPT Clone", https://www.youtube.com/watch?v=rIRkxZSn-A8&ab_channel=MatthewBerman
-
Build a RAG app in Python with Ollama in minutes, https://www.youtube.com/watch?v=GxLoMquHynY&ab_channel=MattWilliams
-
Gollama - Manage Ollama Models Locally with Go
https://github.com/sammcj/gollama
https://www.youtube.com/watch?v=24yqFrQV-4Q&ab_channel=FahdMirza
-
Hugging Face GGUF Models locally with Ollama, https://www.youtube.com/watch?v=7BH4C6-HP14&t=8s&ab_channel=LearnDatawithMark
-
Using the Chat Endpoint in the Ollama API, https://www.youtube.com/watch?v=QUJHEvCqhdw&ab_channel=MattWilliams
-
Adding Custom Models to Ollama, https://www.youtube.com/watch?v=0ou51l-MLCo&t=211s&ab_channel=MattWilliams
-
Finally Ollama has an OpenAI compatible API, https://www.youtube.com/watch?v=38jlvmBdBrU&ab_channel=MattWilliams
-
Hosting Ollama Starts With Environment Variables, https://www.youtube.com/watch?v=H_cqBjDVinw&ab_channel=MattWilliams
-
Understanding How Ollama Stores Models, https://www.youtube.com/watch?v=6bF1uCHTFyk&ab_channel=MattWilliams
-
Run any AI model remotely for free on google colab, https://www.youtube.com/watch?v=Qa1h7ygwQq8&ab_channel=TechwithMarco
https://github.com/marcogreiveldinger/videos/tree/main/ollama-ai/run-on-colab
-
Run Mixtral 8x7B MoE in Google Colab, https://www.youtube.com/watch?v=Zo3CTapKJ4I&ab_channel=PromptEngineering
https://github.com/dvmazur/mixtral-offloading?tab=readme-ov-file
https://huggingface.co/lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo
-
Run Mixtral 8x7B Hands On Google Colab for FREE | End to End GenAI Hands-on Project
https://www.youtube.com/watch?v=vzUJ-yjA8Bw&ab_channel=AnalyticsVidhya
https://drive.google.com/drive/folders/1Bo4sJu9vEnjzV_h4FmBNb6dSZ8BxZxpa
https://drive.google.com/drive/folders/1AuReI63WzKRSdzRIlCxl6WuBkNMryPv9
-
Unleash the power of Local LLM's with Ollama x AnythingLLM, https://www.youtube.com/watch?v=IJYC6zf86lU&ab_channel=TimCarambat
Any LLM, unlimited documents, and fully private. All on your desktop. https://useanything.com/download
-
Ollama: How To Create Custom Models From HuggingFace ( GGUF ), https://www.youtube.com/watch?v=TFwYvHZV6j0&t=72s&ab_channel=DataScienceBasics
-
How to run Ollama on Docker, https://www.youtube.com/watch?v=ZoxJcPkjirs&t=127s&ab_channel=MattWilliams
-
Easy Access to GPUs for Ollama, https://www.youtube.com/watch?v=QRot1WtivqI&ab_channel=MattWilliams
Fine-tune, train, or deploy. Use your own notebook, or one of ours. SSH too. CUDA, Python, Jupyter Lab, all set up.
Tailscale is a zero config VPN for building secure networks. Install on any device in minutes. Remote access from any network or physical location.
-
Using Ollama as a local LLM for chat apps
https://www.youtube.com/watch?v=zEN_oKrttK0&ab_channel=PamelaFox
- How to Access Ollama Model with Public IP Remotely
https://www.youtube.com/watch?v=QSfvLWaJc2s&t=20s&ab_channel=FahdMirza
https://www.youtube.com/watch?v=6QAIbThWomc&ab_channel=MattWilliams
https://github.com/technovangelist/videoprojects
-
RAG from the Ground Up with Python and Ollama
https://www.youtube.com/watch?v=V1Mz8gMBDMo&ab_channel=Decoder
https://decoder.sh/videos/rag-from-the-ground-up-with-python-and-ollama
-
FREE Local RAG Chatbot with Ollama - Streamlit and Langchain. Build with open-source Mistral ai
https://www.youtube.com/watch?v=kfbTZFAikcE&ab_channel=AIProductBuilders
-
Ollama v0.1.27 AI benchmark
https://jasonchuang.substack.com/p/ollama-v0127-ai-benchmark
-
Design Your Own Ollama Model Now!
https://www.youtube.com/watch?v=bXf2Cxf3Wk0&t=55s&ab_channel=MattWilliams
-
Is Open Webui The Ultimate Ollama Frontend Choice?
https://www.youtube.com/watch?v=16fWf0VVeIo&ab_channel=MattWilliams
-
Ask Ollama Many Questions at the SAME TIME! Concurrency
https://www.youtube.com/watch?v=MDbdb-W4x4w&ab_channel=MattWilliams
-
Building AI Apps in Python with Ollama
https://www.youtube.com/watch?v=_4K20tOsXK8&ab_channel=MattWilliams
-
Easy 100% Local RAG Tutorial (Ollama) + Full Code
https://www.youtube.com/watch?v=Oe-7dGDyzPM&ab_channel=AllAboutAI
-
Create a New Ollama Model
https://www.youtube.com/watch?v=bXf2Cxf3Wk0&ab_channel=MattWilliams
-
Chat with multiple PDFs, using Ollama and LlamaIndex
https://youtu.be/BRHfHDXlk1U?si=KnVNoCejy70BELlm
-
How I built a Multi-PDF Chat App with FASTEST Inference using LLAMA3+OLLAMA+Groq|FULLY LOCAL Option
https://www.youtube.com/watch?v=WjoTAzuf1Dg&ab_channel=DataInsightEdge
-
Build Your Own RAG for Unstructured PDF, Website via chatgpt & LangChain
https://www.youtube.com/watch?v=5ModxAjKI3w&ab_channel=WhisperingAI
-
Marker: This Open-Source Tool will make your PDFs LLM Ready
https://www.youtube.com/watch?v=mdLBr9IMmgI&ab_channel=PromptEngineering
-
LlamaParse: Convert PDF (with tables) to Markdown
https://www.youtube.com/watch?v=7DJzHncUlpI&ab_channel=AlejandroAO-Software%26Ai
https://colab.research.google.com/drive/18KB9yXxDUeQGrEZEP1eCrXQ0dNB-Oazm?usp=sharing
-
Marker:Get Your PDFs Ready for RAG & LLMs|High Accuracy Open-Source Tool
https://www.youtube.com/watch?v=QmTtU-qbjUA&ab_channel=DataInsightEdge
-
How I built a Multiple CSV Chat App using LLAMA 3+OLLAMA+PANDASAI|FULLY LOCAL RAG #ai #llm
https://www.youtube.com/watch?v=QmTtU-qbjUA&ab_channel=DataInsightEdge
-
Ollama can run LLMs in parallel! (Concurrent) 0.1.33 version
https://www.youtube.com/watch?v=Cd6f86zsAyg&ab_channel=LearnDatawithMark
https://github.com/mneedham/LearnDataWithMark/blob/main/ollama-parallel/app.py
https://www.markhneedham.com/blog/2024/05/11/side-by-side-local-llms-ollama-streamlit/
-
Run multiple instances of Ollama in Parallel (Concurrent) 0.1.33 version
https://www.youtube.com/watch?v=8r_8CZqt5yk&ab_channel=PromptEngineer
-
Hugging Face SafeTensors LLMs in Ollama
https://www.youtube.com/watch?v=DSLwboFJJK4&ab_channel=LearnDatawithMark
Code: https://github.com/mneedham/LearnDataWithMark/tree/main/ollama-own-model
-
ChatTTS - Best Quality Open Source Text-to-Speech Model? | Tutorial + Ollama Setup
https://www.youtube.com/watch?v=MpVNZA6__3o&ab_channel=AllAboutAI
-
A code repository that contains all the code for fine-tuning some of the popular LLMs on medical data
This repository contains all the code necessary to fine-tune (PEFT using LoRA/QLoRA) the most popular 7B-parameter instruct LLMs (Mistral, Llama, Gemma) specifically on medical data. The repository consists of two parts (a minimal dataset-preparation sketch follows the list):
- preparing the instruct medical datasets
- fine-tuning the instruct LLMs on the prepared datasets
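As a rough illustration of the dataset-preparation step, here is a minimal sketch of converting raw question/answer records into instruction-formatted JSONL for supervised fine-tuning. The field names, prompt template, and file name are illustrative assumptions, not the repository's actual schema.
```python
# Minimal sketch: turn raw medical Q&A records into instruction-formatted JSONL
# for SFT. The "question"/"answer" fields and the prompt template are assumptions
# for illustration, not the schema used by the repository above.
import json

PROMPT_TEMPLATE = "### Instruction:\n{question}\n\n### Response:\n{answer}"

records = [
    {"question": "What are common symptoms of anemia?",
     "answer": "Fatigue, pale skin, shortness of breath, and dizziness."},
]

with open("medical_instruct.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps({"text": PROMPT_TEMPLATE.format(**rec)}) + "\n")
```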
-
Top Ten Fine Tuning Tips
-
Fine-tuning Mistral on Your Dataset https://huggingface.co/blog/nroggendorff/finetune-mistral
-
Fine Tuning LLaVA
https://www.youtube.com/watch?v=cambXXq9mrs&ab_channel=FourthBrainAI
https://colab.research.google.com/drive/1LFcri1CHxNWXG6W4DnTXorUvRT_xY0kd?usp=sharing
-
Finetuning Open-Source LLMs, https://www.youtube.com/watch?v=gs-IDg-FoIQ&ab_channel=SebastianRaschka
-
Efficient DiT Fine-Tuning with PixART for Text to Image Generation
-
Fine-tuning Tiny LLM on Your Data | Sentiment Analysis with TinyLlama and LoRA on a Single GPU, https://www.youtube.com/watch?v=_KPEoCSKHcU&ab_channel=VenelinValkov
-
Make LLM Fine Tuning 5x Faster with Unsloth, https://www.youtube.com/watch?v=sIFokbuATX4&ab_channel=AIAnytime
-
Fine-Tuning Your Own Llama 2 Model, https://www.youtube.com/watch?v=Pb_RGAl75VE&ab_channel=DataCamp
-
Fine Tune a Multimodal LLM "IDEFICS 9B" for Visual Question Answering, https://www.youtube.com/watch?v=usoTCfyQxjU&ab_channel=AIAnytime
-
Anyone can Fine Tune LLMs using LLaMA Factory: End-to-End Tutorial, https://www.youtube.com/watch?v=iMD7ba1hHgw&t=15s&ab_channel=AIAnytime
-
Fine Tune Phi-2 Model on Your Dataset, https://www.youtube.com/watch?v=eLy74j0KCrY&ab_channel=AIAnytime
-
LLM Fine Tuning Crash Course: 1 Hour End-to-End Guide, https://www.youtube.com/watch?v=mrKuDK9dGlg
-
Fine-tuning LLMs with PEFT and LoRA, https://www.youtube.com/watch?v=Us5ZFp16PaU&ab_channel=SamWitteveen
-
🤗 PEFT welcomes new merging methods
https://huggingface.co/blog/peft_merging
-
Prompt Tuning With PEFT
-
Very Few Parameter Fine tuning with ReFT and LoRA https://www.youtube.com/watch?v=TIUkONyNAb4&ab_channel=TrelisResearch
-
LoRA & QLoRA Fine-tuning Explained In-Depth
https://www.youtube.com/watch?v=t1caDsMzWBk&ab_channel=EntryPointAI
-
RLHF, DPO, & KTO Explained
https://www.youtube.com/watch?v=nSrj1J6ODoM&ab_channel=EntryPointAI
-
Fine Tuning TinyLlama for Text Generation with TRL, https://huggingface.co/blog/nroggendorff/finetune-tinyllama
-
Finetuning PaliGemma with AutoTrain, https://huggingface.co/blog/abhishek/paligemma-finetuning-autotrain
-
Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth, https://huggingface.co/blog/mlabonne/sft-llama3
-
Train a Small Language Model for Disease Symptoms | Step-by-Step Tutorial, https://www.youtube.com/watch?v=1ILVm4IeNY8&ab_channel=AIAnytime
-
Fine tuning Whisper for Speech Transcription, https://www.youtube.com/watch?v=anplUNnkM68&ab_channel=TrelisResearch
-
Efficient Fine-Tuning for Llama-v2-7b on a Single GPU, https://www.youtube.com/watch?v=g68qlo9Izf0&t=17s&ab_channel=DeepLearningAI
-
How To Implement SetFit For Few-Shot Fine-Tuning of Sentence Transformers
https://www.youtube.com/watch?v=pTGr643h5VE&ab_channel=FahdMirza
-
Preference Optimization for Vision Language Models with TRL, https://huggingface.co/blog/dpo_vlm
-
Direct Preference Optimization (DPO), https://www.youtube.com/watch?v=E5kzAbD8D0w&ab_channel=TrelisResearch
-
Fine Tune LLaMA 2 In FIVE MINUTES! - "Perform 10x Better For My Use Case", https://www.youtube.com/watch?v=74NSDMvYZ9Y&ab_channel=MatthewBerman
-
How to Fine-Tune Mistral 7B on Your Own Data, https://www.youtube.com/watch?v=kmkcNVvEz-k&ab_channel=brev
-
Fine-Tune Your Own Tiny-Llama on Custom Dataset, https://www.youtube.com/watch?v=OVqe6GTrDFM&ab_channel=PromptEngineering
-
Fine-tune Mixtral 8x7B (MoE) on Custom Data - Step by Step Guide, https://www.youtube.com/watch?v=RzSDdosu_y8&ab_channel=PromptEngineering
-
Mistral: Easiest Way to Fine-Tune on Custom Data, https://www.youtube.com/watch?v=lCZRwrRvrWg&ab_channel=PromptEngineering
-
Self-Play Fine-Tuning (SPIN), https://www.youtube.com/watch?v=khPq69GgPAo&ab_channel=FahdMirza
The official implementation of Self-Play Fine-Tuning (SPIN), https://github.com/uclaml/SPIN
-
Fastest finetuning of Phi3 with LlaMa-Factory in 15 mins
https://www.youtube.com/watch?v=gcZ1EBNNi3E&ab_channel=SuperLazyCoder
https://huggingface.co/spaces/hiyouga/LLaMA-Board
https://colab.research.google.com/drive/1hNp9_ibG1qf7PT8NyW3TJiQ5I9IwXYOG?usp=sharing
-
LLAMA-3 🦙: The Best Open Source LLM - Easiest Finetuning For Custom Usecases with DPO.
https://www.youtube.com/watch?v=XFudZy11FJI&t=1s&ab_channel=WhisperingAI
-
Building Production-Ready RAG Applications: Jerry Liu, https://www.youtube.com/watch?v=TRjq7t2Ms5I&t=10s&ab_channel=AIEngineer
-
Custom Fine-tuning 30x Faster on T4 GPUs with UnSloth AI, https://www.youtube.com/watch?v=R4CUKAHShyE&ab_channel=PromptEngineering
-
To Fine Tune or not Fine Tune? That is the question, https://www.youtube.com/watch?v=XPU8PH0_d6g&ab_channel=SethJuarez
-
Get your own custom Phi-3-mini for your use cases, https://www.youtube.com/watch?v=U5jU4YJodJo&ab_channel=PromptEngineering
https://colab.research.google.com/drive/1zral6IXIwSd3nQGQSE_5WM_4RyqLFKYA?usp=sharing
-
Fine-tune TinyLlama 1.1B locally on own custom dataset, https://youtu.be/VoDHpnCN6PA?si=Aq7soXO6k83mJJVs
-
Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models https://huggingface.co/blog/finetune-florence2
-
Llama Factory: How to Fine-Tune LLMs easily?, https://youtu.be/G5ENOwfPHFE?si=2BZ6Zh5x55TDr2dl
-
How to create custom datasets to train Llama-2? https://youtu.be/z2QE12p3kMM?si=j52ptrx0GMnj9OSy
-
LocalGPT: Convert your chats with Docs to Fine-Tuning datasets, https://youtu.be/2_o6epQToVY?si=CZMdu1u2IU0wXUz8
-
D2SLM (Doc to Dataset to Fine-Tune Small Language Model), https://www.youtube.com/watch?v=khIDeJwBf4k&ab_channel=AIMakerspace
-
LLAMA2 🦙: EASIET WAY To FINE-TUNE ON YOUR DATA 🙌, https://www.youtube.com/watch?v=LslC2nKEEGU&t=2s&ab_channel=PromptEngineering
-
LLAMA3 🦙: EASIET WAY To FINE-TUNE ON YOUR DATA 🙌, https://www.youtube.com/watch?v=aQmoog_s8HE&t=0s&ab_channel=PromptEngineering
https://colab.research.google.com/drive/1mPw6P52cERr93w3CMBiJjocdTnyPiKTX#scrollTo=6bZsfBuZDeCL
-
Llama 3 Fine Tuning for Dummies
https://www.youtube.com/watch?v=3eq84KrdTWY&ab_channel=NodematicTutorials
-
The EASIEST way to finetune LLAMA-v2 on local machine!, https://www.youtube.com/watch?v=3fsn19OI_C8&ab_channel=AbhishekThakur
-
Stable Diffusion XL (SDXL) DreamBooth: Easy, Fast & Free | Beginner Friendly, https://www.youtube.com/watch?v=3fsn19OI_C8&ab_channel=AbhishekThakur
-
Fine-tuning Notebook on how to fine-tune MPT-7B on a free Google Colab instance to turn the model into a Chatbot. MPT7b sharded version + LoRA adapter
https://colab.research.google.com/drive/1HCpQkLL7UXW8xJUJJ29X7QAeNJKO0frZ?usp=sharing
Dataset: https://huggingface.co/datasets/timdettmers/openassistant-guanaco
-
How to Fine Tune Llama 3 for Better Instruction Following?
https://www.youtube.com/watch?v=WxQbWTRNTxY&ab_channel=MervinPraison
-
Fine-Tune Llama 3 Model on Custom Dataset - Step-by-step Tutorial
https://www.youtube.com/watch?v=BA9kcVofRNI&ab_channel=FahdMirza
https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing
-
Fine-tune Llama 3 with ORPO
https://colab.research.google.com/drive/1eHNWg9gnaXErdAa8_mcvjMupbSS6rDvi?usp=sharing
-
Fine tuning with LlamaIndex
https://docs.llamaindex.ai/en/stable/optimizing/fine-tuning/fine-tuning.html
-
Fine tuning Google Colab notebook - This notebook shows how to fine-tune a 4bit model on a downstream task using the Hugging Face ecosystem. We show that it is possible to fine tune GPT-neo-X 20B on a Google Colab instance!
https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing
-
Fine Tune pre-trained GPT and BERT models with the Huggingface library, https://www.youtube.com/watch?v=g1dAsgibRcw&ab_channel=RicardoCalix
-
Fine-Tuning HF examples on GPU Scholar, scratch disk space, https://www.youtube.com/watch?v=_S01y-JY8k4&ab_channel=RicardoCalix
-
Fine-tune Multi-modal Vision and Language Models, https://www.youtube.com/watch?v=eIziN2QUt8U&ab_channel=TrelisResearch
-
Fine-Tuning Gemma Models in Hugging Face
-
Your Ultimate Guide to Instinct Fine-Tuning and Optimizing Google’s Gemma 2B Using LoRA https://medium.com/@mohammed97ashraf/your-ultimate-guide-to-instinct-fine-tuning-and-optimizing-googles-gemma-2b-using-lora-51ac81467ad2
-
Part 2: Fine Tune — Gemma 2b-it model
https://aashi-dutt3.medium.com/part-2-fine-tune-gemma-2b-it-model-a26246c530e7
-
Instruction Fine-Tuning Gemma-2B on Medical Reasoning and Convert the finetuned model into GGUF format using Llama.cpp
-
Fine Tune Vision Model LlaVa on Custom Dataset
https://www.youtube.com/watch?v=rbof1eYekvA&ab_channel=FahdMirza
https://colab.research.google.com/drive/10NLrfBKgt9ntPoQYQ24rEVWU-2rr1xf1
-
Tiny Text + Vision Models - Fine tuning and API Setup on Server using Moondream
https://www.youtube.com/watch?v=5rH_VjKXuzg&ab_channel=TrelisResearch
https://github.com/TrelisResearch/one-click-llms
https://docs.google.com/presentation/d/1LTF8PLe2kwLaddeqwgCRWnTabt7b5EEviFq29x3zlyw/edit
-
Vision Language Models Explained and fine tuning
Fine-tuning Vision Language Models with TRL: We are excited to announce that TRL's SFTTrainer now includes experimental support for Vision Language Models! We provide an example here of how to perform SFT on a [Llava 1.5 VLM](https://huggingface.co/llava-hf/llava-1.5-7b-hf) using the llava-instruct dataset which contains 260k image-conversation pairs. The dataset contains user-assistant interactions formatted as a sequence of messages. For example, each conversation is paired with an image that the user asks questions about.
-
The AiEdge+: How to fine-tune Large Language Models with Intermediary models
https://newsletter.theaiedge.io/p/the-aiedge-how-to-fine-tune-large
-
Train, Fine-Tune Models for Free on Lightning AI
https://www.youtube.com/watch?v=b6Pzgj9EQv8&ab_channel=FahdMirza
-
Fine-tune Idefics2 Multimodal LLM https://www.youtube.com/watch?v=4MzCpZLEQJs&ab_channel=DLExplorers
-
ReFT: Reasoning with Reinforced Fine-Tuning
Aligning LLMs: ReFT
https://www.youtube.com/watch?v=K_8a056X4ys&ab_channel=AIMakerspace
-
Fine Tune Multimodal LLM "Idefics 2" using QLoRA
https://www.youtube.com/watch?v=8GWmu99-sjA&ab_channel=AIAnytime
https://github.com/AIAnytime/Fine-Tune-Multimodal-LLM-Idefics-2
-
Fine-tune Multi-modal Video + Text Models, IDEFICS 2 https://www.youtube.com/watch?v=0cgCFRrPHtY&ab_channel=TrelisResearch
-
Combined Preference and Supervised Fine Tuning with ORPO
https://www.youtube.com/watch?v=OWMJ0rBUj04&ab_channel=TrelisResearch
-
Dickens: an LLM that writes Great Expectations
https://colab.research.google.com/drive/1MdZvYtm3xrkPrxzD71SZ6H9GTkG46VRF?usp=sharing
- Question Answering on FAQs of GST (Goods and Services Tax) in India
https://medium.com/analytics-vidhya/how-to-fine-tune-llms-without-coding-41cf8d4b5d23
https://colab.research.google.com/drive/1RQc035W1_7CTEViYrsnRwYvOtObvXo-B?usp=sharing
- Intent Classification with LLMs: Fine-Tuning on Support Call Transcripts using Ludwig
https://colab.research.google.com/drive/17fmNaq-2KwqJLHt4ZZ0X6FbmMlssq_vR?usp=sharing
- Democratize and Automate the Feature Engineering of Tabular Data using fine-tuned LLMs
https://colab.research.google.com/drive/1NLmQqbiXc-dU9C0ulNsUuubB3vbhaJbi?usp=sharing
-
Mistral-7B : EASIET WAY To FINE-TUNE ON YOUR DATA Using Direct Preference Optimization(DPO)
https://www.youtube.com/watch?v=XFudZy11FJI&ab_channel=WhisperingAI
-
DreamBooth - Fine Tuning Text-to-Image Diffusion Models
https://www.youtube.com/watch?v=_bFPL3ZD4Ko&ab_channel=FahdMirza
https://huggingface.co/papers/2208.12242
https://huggingface.co/docs/diffusers/v0.27.2/training/dreambooth
-
Fine-Tuning PaliGemma With Custom Data
https://www.youtube.com/watch?v=XKBxZk48Le8&ab_channel=MoslehMahamud
-
Track autotrain finetuning in real time with WANDB
https://www.youtube.com/watch?v=NfY28WXlHOs&ab_channel=SuperLazyCoder
-
Assessing Health Data with ML and Becoming More Aware
https://colab.research.google.com/drive/16Ofyeg2wse1UFEMwROCN5qqWHKgWZNIR?usp=sharing
-
NODES 2023 - Fine-Tuning an Open-Source LLM for Text-to-Cypher Translation https://www.youtube.com/watch?v=TB6URe5f3MA&ab_channel=Neo4j
-
Fine-tuning a Code LLM on Custom Code on a single GPU
https://github.com/huggingface/cookbook/tree/main/notebooks/en
-
Fine-tuning Zephyr-7B to analyze customer support call logs
https://youtu.be/cwT5JAqtTM4?si=x5NZgXKzgNx6xlt-
https://pbase.ai/ZephyrWebinarSlides
https://pbase.ai/ZephyrCustomerSupport
- Building an LLM fine-tuning dataset,
https://youtu.be/pCX_3p40Efc?si=UKvB7DSVb366Zzbe
https://github.com/Sentdex/LLM-Finetuning
-
Fine tuning LLMs for Memorization
https://www.youtube.com/watch?v=_GkHZQYFOGM&ab_channel=TrelisResearch
https://docs.google.com/presentation/d/1Un-H9d3ghlR23VddD3aR8aSWHHg9vjIwvYC45o0Vn7g/edit?usp=sharing
https://huggingface.co/datasets/Trelis/touch-rugby-rules-memorisation
-
Fine-tuning a large language model on Kaggle Notebooks (or even on your own computer) for solving real-world tasks
https://huggingface.co/blog/lmassaron/fine-tuning-llms-on-kaggle-notebooks
Code references: Fine-tune Llama-2 for Sentiment Analysis: https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis Fine-tune Mistral v0.2 for Sentiment Analysis: https://www.kaggle.com/code/lucamassaron/fine-tune-mistral-v0-2-for-sentiment-analysis Fine-tune Phi 2 for Sentiment Analysis: https://www.kaggle.com/code/lucamassaron/fine-tune-phi-2-for-sentiment-analysis LSTM Baseline for Sentiment Analysis): https://www.kaggle.com/code/lucamassaron/lstm-baseline-for-sentiment-analysis
-
How to hack a LLM using PyReft (using your own data for Fine Tuning!)
https://www.youtube.com/watch?v=iy9Z4DyHxvE&ab_channel=NicholasRenotte
-
Phinetuning 2.0
Meet Phi-2, Microsoft’s newly released small model, remarkably powerful yet compact. This tutorial will guide you through fine-tuning Phi-2, demonstrating how to build a unique dataset and fine-tune the model using QLoRA.
-
Fine-tuning Language Models for Structured Responses with QLoRa, https://www.youtube.com/watch?v=OQdp-OeG1as&ab_channel=TrelisResearch
-
Fine-tuning Llama 2 on Your Own Dataset | Train an LLM for Your Use Case with QLoRA on a Single GPU, https://www.youtube.com/watch?v=MDA3LUKNl1E&ab_channel=VenelinValkov
https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain
-
Easily Train a Specialized LLM: PEFT, LoRA, QLoRA, LLaMA-Adapter, and More, https://cameronrwolfe.substack.com/p/easily-train-a-specialized-llm-peft#:~:text=LoRA%3A%20Low%2DRank%20Adaptation%20of%20Large%20Language%20Models%20%5B1%5D&text=LoRA%20leaves%20the%20pretrained%20layers,of%20the%20model%3B%20see%20below.
-
Efficient Fine-Tuning for Llama 2 on Custom Dataset with QLoRA on a Single GPU in Google Colab, https://www.youtube.com/watch?v=YyZqcNo4hdo&pp=ygUQZmluZSB0dW5pbmcgTExNXA%3D%3D
-
QLoRA—How to Fine-tune an LLM on a Single GPU (w/ Python Code), https://www.youtube.com/watch?v=XpoKB3usmKc&ab_channel=ShawTalebi
https://colab.research.google.com/drive/1AErkPgDderPW0dgE230OOjEysd0QV1sR?usp=sharing
-
Fine-Tuning GPT for Automatic Email Responses: A Python Tutorial
https://www.youtube.com/watch?v=M46KvnbhkFI&ab_channel=AIFORDEVS
-
Preference Tuning LLMs with Direct Preference Optimization Methods
-
Fine-tune Llama 2 with DPO
-
Practical Fine-Tuning of LLMs
https://www.youtube.com/watch?v=Jp-6hyf_CoE&ab_channel=AIMakerspace
https://colab.research.google.com/drive/1Jw9jthx_S62MPwKH9lqb6xPRwec4OiI6?usp=sharing
-
How to Train a Multi Modal Large Language Model with Images?
https://huggingface.co/HuggingFaceM4/idefics-9b
https://www.youtube.com/watch?v=ojjIYAbWP6U&ab_channel=MervinPraison
-
Fine-tuning Llama 2 70B using PyTorch FSDP
-
Fine-Tune or Train LLMs on Intel GPUs Locally on Custom Dataset - IPEX-LLM
https://www.youtube.com/watch?v=MTK4UBccmLQ&ab_channel=FahdMirza
https://ipex-llm.readthedocs.io/en/latest/index.html
-
How to fine tune a model locally on mistralai/Mistral-7B-Instruct-v0.2 using HuggingFaceTB/cosmopedia-20k or Elriggs/openwebtext-100k dataset
-
Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
https://huggingface.co/blog/fine-tune-whisper
https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb
- Fine-tuning on Wikipedia Datasets
https://www.youtube.com/watch?v=bo49U3iC7qY&ab_channel=TrelisResearch
-
Fine-Tune W2V2-Bert for low-resource ASR with 🤗 Transformers
-
Google Gemma Finetuning: how to teach a large language model?, https://youtu.be/RevZAM9taFk?si=QuNJAVrLdqs7SUgE
-
Steps to Master Fine Tuning LLMs To Ultimate AI Proficiency : A Definitive Guide
https://www.youtube.com/watch?v=GK860luUyEk&ab_channel=KamalrajMM
-
Fine-tuning optimization: DoRA, NEFT, LoRA+, Unsloth
https://youtu.be/ae2lbmtTY5A?si=0NXaw8tOXqh800x2
supervised fine tuning https://huggingface.co/docs/trl/main/en/sft_trainer
-
Building with Instruction-Tuned LLMs: A Step-by-Step Guide
https://www.youtube.com/watch?v=eTieetk2dSw&ab_channel=DeepLearningAI
-
Fine Tune Large Language Model (LLM) on a Custom Dataset with QLoRA
-
Unsloth: How to Train LLM 5x Faster and with Less Memory Usage?
https://www.youtube.com/watch?v=Gpyukc6c0w8&t=16s&ab_channel=MervinPraison
-
Mistral Fine Tuning for Dummies (with 16k, 32k, 128k+ Context)
https://www.youtube.com/watch?v=rANv5BVcR5k&ab_channel=NodematicTutorials
-
Fine-Tuning Gemma (Easiest Method with Unsloth & Colab)
https://www.youtube.com/watch?v=pWZfufhF45o&ab_channel=NodematicTutorials
https://github.com/nodematiclabs/gemma-fine-tune
https://github.com/unslothai/unsloth?tab=readme-ov-file#-finetune-for-free
-
Fine-Tune and Align LLMs Easily with NVIDIA NeMo Customizer
https://developer.nvidia.com/blog/fine-tune-and-align-llms-easily-with-nvidia-nemo-customizer/
-
Direct Preference Optimization (DPO)
- Direct Preference Optimization (DPO)
- [Identity Preference Optimisation (IPO)](https://huggingface.co/papers/2310.12036)
- [Kahneman-Tversky Optimisation (KTO)](https://github.com/ContextualAI/HALOs)
-
Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
https://www.youtube.com/watch?v=hvGa5Mba4c8&ab_channel=UmarJamil
-
Reinforcement Learning with AI Feedback - RLAIF Github Link
Reinforcement Learning from AI Feedback (RLAIF) is a concept that describes a type of machine learning approach where an AI agent learns by receiving feedback or guidance from another AI system. This concept is closely related to the field of Reinforcement Learning (RL), which is a type of machine learning where an agent learns to make a sequence of decisions in an environment to maximize a cumulative reward.
-
Reasoning with Reinforced Fine-Tuning (ReFT)
-
Illustrating Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning from Human Feedback (also referenced as RL from human preferences) is a challenging concept because it involves a multiple-model training process and different stages of deployment. In this blog post, we’ll break down the training process into three core steps:
-
Pretraining a language model (LM),
-
gathering data and training a reward model, and
-
fine-tuning the LM with reinforcement learning.
Technical detail note: diagrams of this process often make it look like both models generate different responses for the same prompt, but what really happens is that the RL policy generates text, and that text is fed into the initial model to produce its relative probabilities for the KL penalty. This initial model is untouched by gradient updates during training. A small sketch of this KL-penalized reward follows below.
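A minimal PyTorch sketch of the KL-penalized reward described in the note above; the tensor names and the kl_coef value are illustrative assumptions, not any specific library's API.
```python
# Sketch of RLHF reward shaping: the reward-model score is combined with a KL
# penalty between the RL policy and the frozen initial model, so the policy
# cannot drift too far from its starting point. Shapes and names are assumptions.
import torch

def shaped_reward(rm_score: torch.Tensor,        # scalar score from the reward model
                  policy_logprobs: torch.Tensor,  # log p_RL(token) for each generated token
                  ref_logprobs: torch.Tensor,     # log p_init(token) from the frozen initial model
                  kl_coef: float = 0.1) -> torch.Tensor:
    kl = policy_logprobs - ref_logprobs           # per-token KL estimate
    return rm_score - kl_coef * kl.sum()          # penalize divergence from the initial model
```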
-
Reinforcement Learning from Human Feedback (RLHF) explained with math derivations and the PyTorch code
https://www.youtube.com/watch?v=qGyFrqc34yc&ab_channel=UmarJamil
Open-source tools for RLHF
The first code released to perform RLHF on LMs was from OpenAI in TensorFlow in 2019.
Today, there are already a few active repositories for RLHF in PyTorch that grew out of this. The primary repositories are Transformers Reinforcement Learning (TRL), TRLX which originated as a fork of TRL, and Reinforcement Learning for Language models (RL4LMs).
TRL is designed to fine-tune pretrained LMs in the Hugging Face ecosystem with PPO. TRLX is an expanded fork of TRL built by CarperAI to handle larger models for online and offline training. At the moment, TRLX has an API capable of production-ready RLHF with PPO and Implicit Language Q-Learning (ILQL) at the scales required for LLM deployment (e.g. 33 billion parameters). Future versions of TRLX will allow for language models up to 200B parameters. As such, interfacing with TRLX is optimized for machine learning engineers with experience at this scale.
RL4LMs offers building blocks for fine-tuning and evaluating LLMs with a wide variety of RL algorithms (PPO, NLPO, A2C and TRPO), reward functions and metrics. Moreover, the library is easily customizable, which allows training of any encoder-decoder or encoder transformer-based LM on any arbitrary user-specified reward function. Notably, it is well-tested and benchmarked on a broad range of tasks in recent work amounting up to 2000 experiments highlighting several practical insights on data budget comparison (expert demonstrations vs. reward modeling), handling reward hacking and training instabilities, etc. RL4LMs current plans include distributed training of larger models and new RL algorithms.
Both TRLX and RL4LMs are under heavy further development, so expect more features beyond these soon.
There is a [large dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf) created by Anthropic available on the Hub.
- ORPO: Odds Ratio Preference Optimization
Monolithic Preference Optimization without Reference Model.
Comparison of model alignment techniques: ORPO aligns the language model without a reference model in a single-step manner by assigning a weak penalty to the rejected responses and a strong adaptation signal to the chosen responses, with a simple log odds ratio term appended to the negative log-likelihood loss. A short sketch of this loss follows the links below.
https://github.com/xfactlab/orpo
https://youtu.be/6kkJGkPZP88?si=CJf02_4Ub91Zz75I
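As a rough, hedged sketch of the loss described above (not the official xfactlab implementation): the usual negative log-likelihood on the chosen response plus a log odds-ratio term that rewards chosen responses and penalizes rejected ones. Inputs are assumed to be per-sequence log-probabilities under the model being trained.
```python
# Sketch of an ORPO-style objective: NLL on the chosen response plus a log
# odds-ratio penalty. logp_* are (average) log P(response | prompt) under the
# model being trained; lam is an illustrative weighting, not a recommended value.
import torch
import torch.nn.functional as F

def orpo_style_loss(logp_chosen: torch.Tensor,
                    logp_rejected: torch.Tensor,
                    lam: float = 0.1) -> torch.Tensor:
    # odds(y|x) = P(y|x) / (1 - P(y|x)), computed in log space for stability
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    ratio = log_odds_chosen - log_odds_rejected
    nll = -logp_chosen                          # standard SFT term on the chosen response
    odds_ratio_term = -F.logsigmoid(ratio)      # pushes chosen up, rejected down
    return (nll + lam * odds_ratio_term).mean()
```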
- How to fine tune LLMs?
Fine-tuning an LLM may not be as trivial as we may think! Depending on your data, it may lead to the model forgetting what it learned in the pretraining phase! You want to fine-tune it but you also may want to retain its coding or chatting abilities. Because you most likely don't have the right benchmark data to validate it on different learning tasks, it might be difficult to understand the abilities it lost in the process!
Why would we want to fine-tune an LLM in the first place? There are two main reasons! First, we may want to augment the model's data bank with private data, and second, we may want the model to specialize in specific learning tasks. A full fine-tuning takes time and money and generates a very large resulting model file. The typical way to go about it is to use Low-Rank Adapters (LoRA) to minimize the fine-tuning cost.
The idea is to replace within the model some of the large matrices with smaller ones for the gradient computation. Let's call W0 the weights of the pre-trained model for a specific layer matrix. After a gradient update ΔW, the weights will be
W = W0 + ΔW
and, if x is the input to that layer, the output of that layer will be
W . x = W0 . x + ΔW . x
If we use Llama 2 with 70B parameters, we need to update all the parameters for each backward pass: computationally very expensive! Instead, with LoRA, we insert, next to each layer matrix of the pre-trained model, two matrices A and B such that the update is approximated by a lower-rank decomposition: ΔW ~ B . A
The trick is that if ΔW has dimensions (R, C), we can create B with dimensions (R, r) and A with dimensions (r, C) such that r << R, C. For example if R = 10K, C = 20K and r = 4, then
ΔW has R x C = 10K x 20K = 200M elements, B has R x r = 10K x 4 = 40K elements, and A has r x C = 4 x 20K = 80K elements.
Therefore A and B combined have 120K elements, which is roughly 1,666 times fewer elements than ΔW. When we fine-tune, we only update the weights of those newly inserted matrices. The gradient matrices are much smaller and therefore require much less GPU memory space. Because the pre-trained weights are frozen, we don't need to compute the gradients for the vast majority of the parameters.
To gain even more space, we may want to quantize the float parameters into integers while applying LoRA (QLoRA). Now, the number of fine-tuned weights is just a fraction of the original model size and we can more easily store those weights for each of the learning tasks we needed fine-tuning for. When we need to deploy an inference server, we can use the original pre-trained model and combine it with the fine-tuned LoRA adapters for the specific learning task needed on that server.
That is worth a read: https://lnkd.in/d8sXWD_X
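A minimal PyTorch sketch of the LoRA idea described above: W0 stays frozen and only the low-rank factors A and B are trained, so the trainable parameter count is r·(R + C) instead of R·C. The dimensions below are small for illustration; the 10K x 20K example scales the same way.
```python
# Minimal LoRA layer sketch: the pretrained weight W0 is frozen; only the
# low-rank factors A (r x C) and B (R x r) are trained, approximating ΔW ≈ B·A
# without ever materializing the full R x C update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # (r, C)
        self.B = nn.Parameter(torch.zeros(out_features, r))         # (R, r); zero init so ΔW starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W·x = W0·x + B·A·x, computed as two small matmuls
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(in_features=512, out_features=256, r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 4*512 + 256*4 = 3072 trainable parameters vs 512*256 = 131072 frozen ones
```
In practice, libraries such as Hugging Face PEFT wrap this pattern and inject the adapters into the attention and MLP layers for you.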
-
How to fine-tune LLMs for text encoding ?
Being able to encode text of any size into an embedding is one of the superpowers of LLMs! Do you remember when Word2Vec was the best we could do?!
Transformers are great candidates to project the text representation of a sentence into its latent space. The latent space is represented by vector representations of the text representation. This vector representation encodes the text into a shorter format. This text encoding can be used as input for other models or as an index for vector databases. A simple way to extract a text encoding is to pick one of the hidden states. Each of them captures a vector representation of the whole input sentence. Different pre-training tasks (language modeling, sentence classification, etc.) may lead to different vector representations that can be more or less useful depending on how they are used.
It is possible that the size of the hidden states is not adapted to the applications we may want to use the text encoding for, in which case, we want to resize the text encoding by using a linear layer to project the vectors onto the desired dimension. To train that projection layer, we need to plug a specific modeling head and fine-tune the model on the related learning task.
In the context of RAG, we want the text encoding a question to be similar to its answer. The text encodings described above will capture semantic similarity, but a question is not always semantically similar to its answer. We can enforce similarity in the vector representations of questions and their respective answers by using contrastive learning. The idea is to train the model such that the dot product (or the cosine similarity) computed on the questions and their related answers is ~1:
Vector(question) x Vector(answer) ~ 1
To do that, we need to construct a data set where pairs of related (Question, answer) are labeled 1 (similar) and 0 otherwise (dissimilar). We can train the model using contrastive learning where the weights are updated, such that the vector representations of the related (Question, answer) are similar.
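A minimal sketch of the contrastive idea above, using in-batch negatives: each question should be most similar to its own answer and dissimilar to the other answers in the batch. The encoder producing the vectors is left abstract.
```python
# In-batch contrastive loss sketch for (question, answer) pairs: row i of q_vecs
# and row i of a_vecs are a related pair; every other row in the batch acts as a
# negative. Any text encoder can produce the input vectors.
import torch
import torch.nn.functional as F

def qa_contrastive_loss(q_vecs: torch.Tensor, a_vecs: torch.Tensor,
                        temperature: float = 0.05) -> torch.Tensor:
    q = F.normalize(q_vecs, dim=-1)
    a = F.normalize(a_vecs, dim=-1)
    sims = q @ a.T / temperature                        # cosine similarity matrix (batch x batch)
    targets = torch.arange(q.size(0), device=q.device)  # the matching answer sits on the diagonal
    return F.cross_entropy(sims, targets)
```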
-
Fine-tuning large language models (LLMs) in 2024
Fine Tuning
Fine-tuning methods
- Instruction fine-tuning: It's about training the machine learning model using examples that demonstrate how the model should respond to the query. The dataset you use for fine-tuning large language models has to serve the purpose of your instruction. ![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/4cd9d6f7-9808-4463-a912-32a122f11a64)
- Full fine-tuning: Instruction fine-tuning, where all of the model's weights are updated, is known as full fine-tuning.
- Parameter-efficient fine-tuning: PEFT methods only update a small set of parameters.
Other types of fine-tuning
- Transfer learning: Transfer learning is about taking a model that has learned on general-purpose, massive datasets and training it on distinct, task-specific data. This dataset may include labeled examples related to that domain. Transfer learning is used when there is not enough data or a lack of time to train; its main advantage is that it offers a higher learning rate and accuracy after training. You can take existing LLMs that are pre-trained on vast amounts of data, like GPT-3/4 and BERT, and customize them for your own use case.
- Task-specific fine-tuning: Task-specific fine-tuning is a method where the pre-trained model is fine-tuned on a specific task or domain using a dataset designed for that domain. This method requires more data and time than transfer learning but can result in higher performance on the specific task.
- Multi-task learning: Multi-task fine-tuning is an extension of single-task fine-tuning, where the training dataset consists of example inputs and outputs for multiple tasks.
- Sequential fine-tuning: Sequential fine-tuning is about sequentially adapting a pre-trained model on several related tasks. After the initial transfer to a general domain, the LLM might be fine-tuned on a more specific subset.
-
Benefits of Fine Tuning
-
RAG Vs Fine-Tuning: How to Optimize LLM Performance
https://www.e2enetworks.com/blog/rag-vs-fine-tuning-how-to-optimize-llm-performance#:~:text=Trade%2Doffs%3A%20Fine%2Dtuning%20may%20provide%20more%20control%20over,reliability%20of%20the%20knowledge%20base.
-
Full-model Fine-tuning vs. LoRA vs. RAG
https://www.blog.dailydoseofds.com/p/full-model-fine-tuning-vs-lora-vs
-
Trade-Offs
The decision to employ fine-tuning or RAG depends on the specific goals of a task and the nature of the knowledge required. Here are some considerations and trade-offs:
Fine-tuning Considerations: Fine-tuning is suitable for tasks where specific, task-oriented improvements are needed. It is effective for refining a model's performance in a particular domain. However, fine-tuning may exhibit instability and might not be the optimal choice for addressing broad knowledge deficits.
RAG Considerations: RAG excels in knowledge-intensive tasks where external information is valuable, supplied by feeding data into the knowledge base. It can address both knowledge deficits and factual errors by incorporating diverse knowledge from external sources. RAG's effectiveness relies on the quality and coverage of the knowledge base.
Trade-offs: Fine-tuning may provide more control over specific task-related improvements, but it might struggle with broader knowledge adaptation. RAG, while powerful in leveraging external knowledge, depends on the availability and reliability of the knowledge base.
-
H2O LLM DataStudio: Streamlining Data Curation and Data Preparation for LLMs related tasks https://h2o.ai/blog/2023/streamlining-data-preparation-for-fine-tuning-of-large-language-models/
-
H2O LLM DataStudio Part II: Convert Documents to QA Pairs for fine tuning of LLMs https://h2o.ai/blog/2023/h2o-llm-datastudio-part-ii-convert-documents-to-qa-pairs-for-fine-tuning-of-llms/
Various Types of RAGs
RAG = Dense vector Retrieval (R) + In-Context learning (AG)
-
3 Ways to build multimodal RAG pipeline
Text is not the only data type we use in RAG pipelines! We are still in the infancy of Generative AI, and text is now the primary information that we feed to LLMs, but that is going to change quickly! There is a lot more information contained in the different documents we use on a daily basis beyond just text data.
For example, GPT-4, Bard, and LlaVA are multimodal LLMs that can ingest images as well as text. The images are passed through a Vision Transformer, resulting in visual tokens. The visual tokens are then passed through a projection layer that specializes in aligning visual tokens with text tokens. The visual and text tokens are then provided to the LLM, which cannot make the difference between the different data modes.
In the context of RAG, the LLM plays a role at indexing time, where it generates a vector representation of the data to index it in a vector database. It is also used at retrieval time, where it uses the retrieved documents to provide an answer to a user question. A multimodal LLM can generate embedding representations of images and text and answer questions using those same data types. If we want to answer questions that involve information in different data modes, using a multimodal LLM at indexing and retrieval time is the best option.
If you want to build your RAG pipeline using API providers like OpenAI, there are currently no available options for multimodal LLMs. However, OpenAI is likely to release its API to ingest images with GPT-4 pretty soon, so it will be available for question-answering using multimodal prompts. Even if it is available for text generation, it might not be available for embedding generation. That leaves the question of how to create embeddings for images. This can be achieved by prompting a multimodal LLM to describe in text the images we need to index. We can then index the images using the text descriptions and their vector representations.
The complexity of generating a text description of an image is not the same as answering questions using a large context of different data types. With a small multimodal LLM, we might get satisfactory results in describing images but subpar results in answering questions. For example, it is pretty simple to build an image description pipeline with LlaVA models and Llama.cpp as LLM backbone. Those descriptions can be used for indexing as well as for answering questions that may involve those images. The LLM answering questions would use the text description of images instead of the images themselves. Today that might be the simplest option to build a multimodal RAG pipeline. It might not be as performant, but the technology is going to improve very fast!
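A small sketch of the "describe, then index" approach discussed above. `describe_image` and `embed` are placeholder callables standing in for a multimodal captioning model (e.g. a LLaVA pipeline) and a text embedding model; they are assumptions, not a specific library's API.
```python
# Multimodal RAG via text descriptions: caption each image with a multimodal
# model, then embed and index only the caption. describe_image() and embed()
# are placeholders, not a real library API.
from typing import Callable

def index_images(image_paths: list[str],
                 describe_image: Callable[[str], str],
                 embed: Callable[[str], list[float]]) -> list[dict]:
    index = []
    for path in image_paths:
        description = describe_image(path)         # text stands in for the image
        index.append({
            "vector": embed(description),           # searched at retrieval time
            "payload": {"type": "image", "path": path, "description": description},
        })
    return index
```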
-
How to optimize your RAG pipelines
In RAG, the data you retrieve doesn't have to be the data you used to index it! Typically, when we talk about RAG, we assume that the data is stored in its vector representation in a vector database. When we query the database, we then retrieve the most similar data to the query vector. But it doesn't have to be the case!
In a typical RAG (Retrieval Augmented Generation), we have a document, we convert the document into its vector representation, and when a query vector is similar to the vector, we retrieve the document. However, the vector that is used to index the document doesn't have to be its direct vector representation.
For example, the document could be quite large and could contain multiple conflicting information about different concepts. The query vector usually comes from a question about a single concept, so it is unlikely that the vector representation of the question will be similar to the large document. Instead, we could break down the large document into smaller chunks, convert those into their vector representations, and index the large document multiple times using the child documents' vectors. The small child documents have more chance to contain a unique concept, so they are great for indexing the data for similarity search, but they don't contain a lot of context to answer the question, so it is better to retrieve the larger document.
We can also index the document by the questions that the document answers. As part of the indexing pipeline, we can have an LLM prompted with the task of generating the questions that the document could answer. We then get the embeddings of the questions and index the document by those embeddings. When we have a question, the resulting query vector will be much more similar to the questions about the document than the document itself. However, the data retrieved should be the document so that the LLM has all the context necessary to answer the question.
We could also index the document by its summary. Again, as part of the indexing pipeline, we could have an LLM tasked to summarize the incoming documents. The resulting text will be more concise and "semantically purer", so it could be a better option for a similarity search. This is a great option when your document contains tables (like .csv). Tables contain numbers, and it might be difficult to get a question whose vector representation could be similar to the table's. However, if, as part of the indexing pipeline, we have an LLM tasked to provide a text description of the table data, we can then index the table data using its text description. This will make it much easier on the similarity search! The retrieved data will be the original table data as it contains more information to answer the question.
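A small sketch of one of the strategies above: indexing a document by the questions it answers, while returning the full parent document at retrieval time. `generate_questions` and `embed` are placeholder callables (an LLM prompt and a text embedding model), not a specific framework's API.
```python
# Index a document by LLM-generated questions: the question embeddings are what
# gets searched, but the payload returned to the LLM is the full parent document.
# generate_questions() and embed() are placeholders for an LLM call and an
# embedding model.
from typing import Callable

def index_by_questions(doc_id: str, doc_text: str,
                       generate_questions: Callable[[str], list[str]],
                       embed: Callable[[str], list[float]]) -> list[dict]:
    entries = []
    for question in generate_questions(doc_text):
        entries.append({
            "vector": embed(question),   # similar to future user questions
            "doc_id": doc_id,
            "document": doc_text,        # retrieved payload is the parent document
        })
    return entries
```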
-
Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW)
https://www.youtube.com/watch?v=rhZgXNdhWDY&ab_channel=UmarJamil
https://github.com/hkproj/retrieval-augmented-generation-notes
-
Problems with RAG
Augmenting LLMs with databases is great, but there are major flaws in that approach! We see a lot of debates around fine-tuning versus Retrieval Augmented Generation (RAG) with LLMs these days. Augmenting LLMs with small additional data is better served by RAG, but it is important to understand the shortcomings of that approach!
The idea with RAG is to encode the data you want to expose to your LLM into embeddings and index that data into a vector database. When a user asks a question, it is converted to an embedding, and we can use it to search for similar embeddings in the database. Once we find similar embeddings, we construct a prompt with the related data to provide context for an LLM to answer the question. Similarity here is usually measured using the cosine similarity metric.
The first problem is that a question is usually not semantically similar to its answers. At least, it is possible for the search to retrieve documents containing the same words as the question or that are used in the same context without providing relevant information to answer the question. Because the search retrieves the most similar documents to the question, depending on the data, too many irrelevant documents may show higher cosine similarity than the documents actually containing the answer.
To be fair, high cosine similarity does not exactly translate to semantic similarity with Transformers. High cosine similarity can also capture the high co-occurrence of 2 different terms within the same sub-text of the training data, which often happens for a specific question and its related answer.
Another problem may be related to the way the data has been indexed. If the data have been broken down into big chunks of text, then it is likely to contain multiple different and unrelated information within each chunk. If you perform a similarity search on that data, the pertinent information may be diluted, and the search may return irrelevant documents instead. It is important to break down the data so that each chunk contains no more than a few paragraphs to ensure more "uniqueness" in the concepts developed in each text.
With the RAG approach, it is very important to limit the type of questions we ask the LLM. If we ask questions that require aggregating data all over the database, the answers are most likely going to be wrong, but the LLM won't be able to know that. If the right information is local to one or a few documents, a similarity search may find it. However, if the information requires scanning all the documents to find the answer, a similarity search won't find it. Imagine each document is dated, and we ask, "What is the earliest document?". In that case, we can only know the answer if we scan the entire database, and a similarity search won't be helpful.
-
Vector Database vs Graph Database for RAG
Graph Databases should be the better choice for Retrieval Augmented Generation (RAG)! We have seen the debate RAG vs fine-tuning, but what about Vector databases vs Graph databases?
In both cases, we maintain a database of information that an LLM can query to answer a specific question. In the case of vector databases, we partition the data into chunks, encode the chunks into vector representations using an LLM, and index the data by their vector representations. Once we have a question, we retrieve the nearest neighbors to the vector representation of the question. The advantage is the fuzzy matching of the question to chunks of data. We don't need to query a specific word or concept; we simply retrieve semantically similar vectors. The problem is that the retrieved data may contain a lot of irrelevant information, which might confuse the LLM.
In the context of graphs, we extract the relationships between the different entities in the text, and we construct a knowledge base of the information contained within the text. An LLM is good at extracting that kind of triplet information:
[ENTITY A] -> [RELATIONSHIP] -> [ENTITY B]
For example:
- A [cow] IS an [animal]
- A [cow] EATS [plants]
- An [animal] IS a [living thing]
- A [plant] IS a [living thing]
Once the information is parsed, we can store it in a graph database. The information stored is the knowledge base, not the original text. For information retrieval, the LLM needs to come up with an Entity query related to the question to retrieve the related entities and relationships. The retrieved information is much more concise and to the point than in the case of vector databases. This context should provide much more useful information for the LLM to answer the question. The problem is that the query matching needs to be exact, and if the entities captured in the database are slightly semantically or lexically different, the query will not return the right information.
I wonder if there is a possibility to merge the advantages of vector and graph databases. We could parse the entities and relationships, but we index them by their vector representations in a graph database. This way, the information retrieval could be performed using approximate nearest neighbor search instead of exact matching. Does that exist already?
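As a rough sketch of the hybrid idea raised in the last paragraph: store the extracted triplets, but index each one by an embedding of its textual form so retrieval can be an approximate nearest-neighbor search rather than an exact entity match. `embed` is a placeholder for any text embedding model.
```python
# Hybrid graph/vector sketch: triplets are kept as structured facts, but each is
# indexed by the embedding of its textual form, allowing fuzzy retrieval instead
# of exact entity matching. embed() is a placeholder embedding function.
from typing import Callable

triplets = [
    ("cow", "IS", "animal"),
    ("cow", "EATS", "plants"),
    ("animal", "IS", "living thing"),
    ("plant", "IS", "living thing"),
]

def build_triplet_index(triplets: list[tuple[str, str, str]],
                        embed: Callable[[str], list[float]]) -> list[dict]:
    return [{"vector": embed(f"{s} {r} {o}"), "triplet": (s, r, o)}
            for s, r, o in triplets]
```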
- Semantic Chunking for RAG
https://www.youtube.com/watch?v=TcRRfcbsApw&ab_channel=JamesBriggs
https://www.aurelio.ai/learn/semantic-chunkers-intro
- Semantic Chunking - 3 Methods for Better RAG
https://www.youtube.com/watch?v=7JS0pqXvha8&ab_channel=JamesBriggs
https://github.com/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb
-
Realtime Powerful RAG Pipeline using Neo4j(Knowledge Graph Db) and Langchain
https://www.youtube.com/watch?v=Ik8gNjJ-13I&ab_channel=SunnySavita
https://github.com/sunnysavita10/Indepth-GENAI/blob/main/RAG_With_Knowledge_graph(Neo4j).ipynb
-
Powerful RAG Using Hybrid Search (Keyword + Vector search) | Ensemble Retrieval
https://www.youtube.com/watch?v=Wwuz1l-PCgQ&ab_channel=SunnySavita
https://github.com/sunnysavita10/Indepth-GENAI/blob/main/Hybrid_Search_in_RAG.ipynb
-
RAG using Hybrid Search (Keyword + Vector search) And Reranking With Cohere API | Ensemble Retrieval
https://www.youtube.com/watch?v=ZzJoD9JzQcc&ab_channel=SunnySavita
https://github.com/sunnysavita10/Indepth-GENAI/blob/main/Hybrid_Search_and_reranking_in_RAG.ipynb
-
Chat and RAG with Tabular Databases Using Knowledge Graph and LLM Agents
https://www.youtube.com/watch?v=3NP1llvtrbI&ab_channel=AIRoundTable
-
Chat With Knowledge Graph Data | Improved RAG
-
SUPERHUMAN RAG
https://www.youtube.com/watch?v=bek8AGvt7dg&ab_channel=code_your_own_AI
-
Real Time RAG with Haystack 2 0 and Bytewax
https://www.youtube.com/watch?v=_Rx7cnarGWE&ab_channel=AIMakerspace
-
What is Retrieval-Augmented Generation (RAG)?, https://www.youtube.com/watch?v=T-D1OfcDW1M&t=265s&ab_channel=IBMTechnology
-
Community Paper Reading: RAG vs Fine-tuning, https://www.youtube.com/watch?v=EbEPHOABgSY&ab_channel=ArizeAI
-
Build a RAG Evaluation Tool and Python Library
https://www.youtube.com/watch?v=IMN_bDVRZ1M&ab_channel=AIAnytime
-
Elevated RAG with LangChain
https://www.youtube.com/watch?v=XXnc55zypU0&ab_channel=FourthBrainAI
-
End-to-end Prototyping with Llama 3
https://www.youtube.com/watch?v=anIBtQNn1G0&ab_channel=AIMakerspace
https://colab.research.google.com/drive/1TX-N9E7lESNkxIrFEC6sn0rMMfYRCmKg#scrollTo=iRGG0QCNwT6J
-
Building corrective RAG from scratch with open source, local LLMs, https://youtu.be/E2shqsYwxck?si=LEeA5KXOQ6idzDd2
-
RAG from scratch, https://youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&si=BtJ6KCTMfqBzIGya
-
Production RAG Must-have: Corrective RAG (CRAG)
https://www.youtube.com/watch?v=Har-Pzwx_8I&ab_channel=TwoSetAI
-
Meta Llama 3 Fine tuning, RAG, and Prompt Engineering for Drug Discovery https://www.youtube.com/watch?v=CS1O2ZCHsbY&ab_channel=ChemicalQDevice
-
User-Selected metadata in RAG Applications with Qdrant, https://www.youtube.com/watch?v=qcn7YAJfCeE&ab_channel=LearnDatawithMark
-
Ollama Python Library Released! How to implement Ollama RAG? https://www.youtube.com/watch?v=4HfSfFvLn9Q&ab_channel=MervinPraison
-
Building a Multimodal RAG App for Medical Applications, https://www.youtube.com/watch?v=fbbFrCfaF0w&ab_channel=AIAnytime
-
Track and Monitor RAG Pipelines using Weights & Biases (wandb), https://www.youtube.com/watch?v=8-exaASey6o&ab_channel=AIAnytime
-
Unlocking RAG Potential with LLMWare's CPU-Friendly Smaller Models, https://www.youtube.com/watch?v=qXEUqhqjHdg&ab_channel=AIAnytime
-
RAG Implementation using Zephyr 7B Beta LLM: Is this the best 7B LLM? https://www.youtube.com/watch?v=btuN-rrPhsM&ab_channel=AIAnytime
-
Better RAG with Merger Retriever (LOTR) and Re-ranking Retriever (Long Context Reorder), https://www.youtube.com/watch?v=uYZftCq2efg&ab_channel=AIAnytime
-
Build an End-to-End RAG API with AWS Bedrock & Azure OpenAI
https://www.youtube.com/watch?v=r6AeD-CH1Uw&ab_channel=AIAnytime
https://github.com/AIAnytime/RAG-using-AWS-Bedrock-and-Azure-OpenAI
-
Building a RAG application from scratch using Python, LangChain, and the OpenAI API, https://www.youtube.com/watch?v=BrsocJb-fAo&ab_channel=Underfitted
-
How to build Multimodal Retrieval-Augmented Generation (RAG) with Gemini
https://www.youtube.com/watch?v=LF7I6raAIL4&ab_channel=GoogleforDevelopers
-
Pinecone + LlamaIndex on Retrieval Augmented Generation (RAG) Systems, https://www.youtube.com/watch?v=FgLf5HjxI8w&ab_channel=ArizeAI
-
Optimizing RAG With LLMS: Exploring Chunking Techniques and Reranking for Enhanced Results, https://youtube.com/watch?v=QpRTdZDR4tE&ab_channel=ArizeAI
-
Check Hallucination of LLMs and RAGs using Open Source Evaluation Model by Vectara, https://www.youtube.com/watch?v=O-VYDADgc68&ab_channel=AIAnytime
-
How to Evaluate RAG Applications in CI/CD Pipelines with DeepEval, https://www.confident-ai.com/blog/how-to-evaluate-rag-applications-in-ci-cd-pipelines-with-deepeval
-
Learn to Evaluate LLMs and RAG Approaches, https://www.youtube.com/watch?v=97ftVtITKfo&ab_channel=AIAnytime
-
Evaluating Biases in LLMs using WEAT and Demographic Diversity Analysis, https://www.youtube.com/watch?v=eTenkUPsjko&ab_channel=AIAnytime
-
RAG with LlamaIndex - Qdrant and Azure OpenAI in 9 minutes, https://www.youtube.com/watch?v=h4F09fWhyhg&ab_channel=AmbarishGangulyAcademy
-
How to evaluate an LLM-powered RAG application automatically.
https://www.youtube.com/watch?v=ZPX3W77h_1E&t=492s&ab_channel=Underfitted
-
Search-in-the-Chain with LlamaIndex
This LlamaPack implements SearChain, a framework that models the interaction between the LLM and information retrieval as a global reasoning chain called Chain-of-Query (CoQ).
https://llamahub.ai/l/llama-packs/llama-index-packs-searchain?from=
-
How I Built the Fastest FULLY LOCAL RAG PDF Chatbot Using GroqChat|Chainlit|Ollama
https://www.youtube.com/watch?v=UwKGBvFldho&ab_channel=DataInsightEdge
https://github.com/InsightEdge01/GroqPDFFastChatbot/tree/main
-
LLM Search & Retrieval Systems with Arize and LlamaIndex: Powering LLMs on Your Proprietary Data, https://www.youtube.com/watch?v=hbQYDpJayFw&ab_channel=ArizeAI
-
Building A RAG System With OpenAI Latest Embeddings, https://www.youtube.com/watch?v=OvvgaR1S4Xc&ab_channel=RichmondAlake
-
Transform RAG and Search with Azure AI Document Intelligence, https://www.youtube.com/watch?v=SOBdR-xxE04&ab_channel=SethJuarez
-
Best retrieval strategies for Generative AI applications: Semantic Search Benchmarking, https://www.youtube.com/watch?v=BvnOln6YZ_8&ab_channel=SethJuarez
-
RAG Evaluation Using Synthetic data and LLM-As-A-Judge, https://github.com/huggingface/cookbook/tree/main/notebooks/en
-
Bert Score for Contextual Similarity for RAG Evaluation, https://youtube.com/watch?v=7AVjk2k8Mbs&ab_channel=AIAnytime
-
Testing Framework Giskard for LLM and RAG Evaluation (Bias, Hallucination, and More), https://www.youtube.com/watch?v=KeY6qPAvyq0&ab_channel=AIAnytime
-
RAG Evaluation
-
Evaluate RAG pipeline using HuggingFace Open Source Models
-
Build an On-Device RAG App using Open Source AI Stack, https://www.youtube.com/watch?v=-ACc-NVjI5g&ab_channel=AIAnytime
-
How I Built a Medical RAG Chatbot Using BioMistral|Langchain | FREE Colab|ALL OPENSOURCE
https://www.youtube.com/watch?v=E53hc-jcUeE&ab_channel=DataInsightEdge
https://colab.research.google.com/drive/1Jk7M4N8O4kUEhHQSk5-J7bKFg0nod43k?usp=sharing
-
Advanced RAG Techniques by Pinecone
| Feature | Self RAG | Corrective RAG | RAG Fusion |
|---|---|---|---|
| Overview | Enhances LM's quality and factuality through retrieval and self-reflection. Uses special tokens for adaptive retrieval and critique of its own generations. | Improves robustness of generation with a lightweight retrieval evaluator and a decompose-then-recompose algorithm for document refinement. Utilizes large-scale web searches for augmentation. | Combines RAG and Reciprocal Rank Fusion (RRF) by generating multiple queries, reranking with reciprocal scores, and fusing documents and scores for more accurate and comprehensive answers. |
| Key Mechanism | Fine-tuned with reflection tokens and critique tokens for on-demand retrieval and generation quality assessment. | Employs a retrieval evaluator to assess document quality and triggers actions (Correct, Incorrect, Ambiguous) based on confidence scores. | Generates multiple queries from the original query, reranks documents using RRF, and fuses them for the final output. |
| Advantages | Increases factuality and versatility of responses. Adaptable to diverse task requirements. | Significantly improves the performance of RAG-based approaches in both short- and long-form generation tasks. | Provides more accurate and comprehensive answers by contextualizing the original query from various perspectives. |
-
RAG Time! Evaluate RAG with LLM Evals and Benchmarking
https://www.youtube.com/watch?v=LrMguHcbpO8&ab_channel=ArizeAI
-
Gemma with transformers: how to teach structured English quotes to LLM https://youtu.be/qeJgBkPLCxo?si=YzFFkJop1ptC_YBM
-
Chat with documents with Chainlit, Langchain, Ollama & Mistral, https://youtu.be/2IL0Sd3neWc?si=eXSH7WZa_bczTfTv
-
How I created an AI Research Assistant and it costs $0 to run: Ollama + Qdrant + GPT4All + LangChain, https://youtu.be/f1ihg20fQiU?si=VjaYv9yr9g-Ujvdk
-
Langchain + Qdrant Local | Server (Docker) | Cloud | Groq | Tutorial
https://www.youtube.com/watch?v=JSKZYgARffg&ab_channel=DataScienceBasics
-
Semantic Chunking for RAG
https://www.youtube.com/watch?v=dt1Iobn_Hw0&ab_channel=AIMakerspace
-
Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain
https://www.mongodb.com/developer/products/atlas/advanced-rag-langchain-mongodb/
-
I didn't know RAG could be this easy using Gradient
https://www.youtube.com/watch?v=Hkgz1ysv9Fk&ab_channel=GregKamradt%28DataIndy%29
-
Question Answer Generator App using Mistral LLM, Langchain, and FastAPI, https://www.youtube.com/watch?v=Hcqmhhx30Pg&ab_channel=AIAnytime
-
RAG with LlamaParse, Qdrant and groq
https://youtu.be/w7Ap6gZFXl0?si=liBk9uDsOm9DbSi4
-
Better Retrieval Augmented Generation (RAG) with LangChain Parent-Child Retriever, https://www.youtube.com/watch?v=wSi0fxkH6e0
-
Advanced RAG on HuggingFace documentation using langchain, https://huggingface.co/learn/cookbook/advanced_rag
https://github.com/huggingface/cookbook/tree/main/notebooks/en
-
Advance RAG: LlamaParse + Reranker = Better RAG
https://www.youtube.com/watch?v=wCFXae8hiYA&ab_channel=DataScienceBasics
-
LangChain RAG featuring Shopify's Madhav Thaker, https://www.youtube.com/watch?v=IUEny5cbys8&ab_channel=ArizeAI
-
RAG-VectorDB-Embedings-LlamaIndex-Langchain, https://github.com/lucifertrj/Awesome-RAG
-
Q&A with RAG, https://python.langchain.com/docs/use_cases/question_answering/
Table of contents:
- Quickstart: We recommend starting here. Many of the following guides assume you fully understand the architecture shown in the Quickstart.
- Returning sources: How to return the source documents used in a particular generation.
- Streaming: How to stream final answers as well as intermediate steps.
- Adding chat history: How to add chat history to a Q&A app.
- Per-user retrieval: How to do retrieval when each user has their own private data.
- Using agents: How to use agents for Q&A.
- Using local models: How to use local models for Q&A.
-
Google Gemma Fully LOCAL RAG ChatBot using Ollama|LangChain|Chainlit|Chat with Docs #ai #ollama #llm, https://www.youtube.com/watch?v=8uo-GCIKim8&ab_channel=DataInsightEdge
-
Beyond RAG:How to Build an App with LOCAL LLMs to Generate Custom Datasets to Fine-tune Your LLMs, https://www.youtube.com/watch?v=vBC6Ym0cb0Y&ab_channel=DataInsightEdge
-
How to use MongoDB as a vector store for RAG - Atlas vector search index,
https://youtu.be/IPbv5Fs3mis?si=5_frUdnXNLoVJEpM
-
Multi Needle in a Haystack, https://youtu.be/UlmyyYQGhzc?ref=blog.langchain.dev
https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main?ref=blog.langchain.dev
-
LangGraph for Code Generation, https://www.youtube.com/watch?v=MvNdgmM7uyc&ref=blog.langchain.dev
Flow for AlphaCodium
The recent AlphaCodium work showed that code generation can be improved by using a flow paradigm rather than a naive prompt:answer paradigm: answers can be iteratively constructed by (1) testing answers and (2) reflecting on the results of these tests in order to improve the solution.
-
How to use LangChain with multimodal AI to analyze images in financial reports with GPT-4
https://youtu.be/Rcqy92Ik6Uo?si=PPeKxtD5GHArV9iN
https://docs.google.com/presentation/d/1EJqIvYGbF5IGHX7orXaUSKVN3PVbQh7kOP7m5BEoyKQ/edit?usp=sharing
https://github.com/langchain-ai/langchain/blob/master/cookbook/Multi_modal_RAG.ipynb
-
How to analyze tables in large financial reports using GPT-4 with LlamaIndex
https://docs.google.com/presentation/d/1ug9jHtMFsGjNV7zp85hPUNjiiCGKz53wQb9mZh0B_ZI/edit?usp=sharing
https://colab.research.google.com/drive/1DldMhszgSI4KKI2UziNHHM4w8Cb5OxEL#scrollTo=Ht4oSN2PvzUJ
- A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://www.youtube.com/live/uVqrZhNdUAI?si=58gCEN7BW613l43a
https://github.com/Azure-Samples/azure-search-openai-demo
- Going Meta - ep 22: RAG with knowledge graph, neo4j
https://www.youtube.com/live/9DxwgIKVSHY?si=nXqLEDVbcWwfmzqf
https://github.com/jbarrasa/goingmeta
- Building RAG with knowledge graphs: workshop with LlamaIndex,
https://youtu.be/VEvFPRlCcvI?si=rz_TMnuNrQuncusa
https://www.youtube.com/watch?v=ztBJqzBU5kc&ab_channel=TonyKipkemboi
-
RAGArch: Building a No-Code RAG Pipeline Configuration & One-Click RAG Code Generation Tool Powered by LlamaIndex
-
MultiModal RAG for Advanced Video Processing with LlamaIndex & LanceDB
-
Llama3 local RAG | Step by step chat with websites and PDFs
https://www.youtube.com/watch?v=-8NVHaKKNkM&ab_channel=Phidata
https://github.com/phidatahq/phidata/tree/main/cookbook/llms/ollama/rag
-
Introducing LlamaCloud and LlamaParse for production-grade context-augmentation to LLM and RAG applications
https://github.com/run-llama/llama_parse
https://github.com/run-llama/llama_parse/blob/main/examples/demo_basic.ipynb
https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
-
Chunking Strategies in RAG: Optimising Data for Advanced AI Responses
https://www.youtube.com/watch?v=pIGRwMjhMaQ&ab_channel=MervinPraison
-
Getting started with RAG in DSPy!, https://youtu.be/CEuUG4Umfxs?si=Dz_S5uOXSlo3yiIN
-
Building RAG with Command R+ from Cohere, DSPy, and Weaviate!
https://www.youtube.com/watch?v=6dgXALb_5Ag&ab_channel=ConnorShorten
https://github.com/weaviate/recipes/blob/main/integrations/dspy/llms/Command-R-Plus.ipynb
-
Llama 3 RAG Demo with DSPy Optimization, Ollama, and Weaviate!
https://www.youtube.com/watch?v=1h3_h8t3L14&ab_channel=ConnorShorten
-
Llama 3 RAG: Create Chat with PDF App using PhiData, Here is how..
https://www.youtube.com/watch?v=ucGvz7y-QPw&ab_channel=MervinPraison
https://github.com/phidatahq/phidata/tree/main/cookbook/llms/groq/rag
-
Building a RAG system with Google Gemma, Huggingface and MongoDB
-
Building A RAG System with Gemma, MongoDB and Open Source Models
https://huggingface.co/learn/cookbook/rag_with_hugging_face_gemma_mongodb
-
Claude 3 Opus RAG Chatbot (Full Walkthrough)
https://www.youtube.com/watch?v=rbzYZLfQbAM&ab_channel=JamesBriggs
-
But, How is Chunking Done ? Splitting Basics Using LangChain
https://www.youtube.com/watch?v=tMwdl9hFPns&ab_channel=DataScienceBasics
-
Build a RAG Based LLM App in 20 Minutes! | Full Langflow Tutorial
https://www.youtube.com/watch?v=rz40ukZ3krQ&ab_channel=TechWithTim
-
How to Improve LLMs with RAG (Overview + Python Code)
https://www.youtube.com/watch?v=Ylz779Op9Pw&ab_channel=ShawTalebi
https://colab.research.google.com/drive/1peJukr-9E1zCo1iAalbgDPJmNMydvQms?usp=sharing
https://www.youtube.com/watch?v=m6P1Rp91AzM&t=63s&ab_channel=TwoSetAI
-
Command-R
C4AI Command-R is a research release of a 35 billion parameter highly performant generative model. Command-R is a large language model with open weights optimized for a variety of use cases including reasoning, summarization, and question answering. Command-R has the capability for multilingual generation evaluated in 10 languages and highly performant RAG capabilities.
https://huggingface.co/CohereForAI/c4ai-command-r-v01
https://www.youtube.com/watch?v=YQFLdE3osws&ab_channel=FahdMirza
-
Transforming business process automation with retrieval-augmented generation and LLMs
- Retrieval-augmented generation in practice
- RAG in supply chain
- RAG in retail
- RAG in finance and insurance
- Case study: RFP processing with RAG
- Assembling RAG flows: From basic building blocks to valuable use cases
- Architecture of retrieval-augmented generation
- Orchestrating RAG processes
- Constructing the RAG pipeline: Essential building blocks
- Conclusion: The benefits of retrieval-augmented generation and large language models
https://www.griddynamics.com/blog/retrieval-augmented-generation-llm
-
Implementing semantic cache to improve a RAG system with FAISS.
In this notebook, a typical RAG solution is explored that uses an open-source model and the Chroma DB vector database, augmented with a semantic cache system that stores previous user queries and decides whether to build the prompt from the vector database or from the cache.
A semantic caching system aims to identify similar or identical user requests. When a matching request is found, the system retrieves the corresponding information from the cache, reducing the need to fetch it from the original source.
As the comparison takes into account the semantic meaning of the requests, they don’t have to be identical for the system to recognize them as the same question. They can be formulated differently or contain inaccuracies, be they typographical or in the sentence structure, and we can identify that the user is actually requesting the same information.
https://huggingface.co/learn/cookbook/semantic_cache_chroma_vector_database
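A minimal sketch of the idea, assuming hypothetical `embed` and `rag_answer` callables that stand in for the embedding model and the full RAG chain:

```python
import numpy as np

# Semantic cache in front of a RAG pipeline (sketch).
# `embed` and `rag_answer` are hypothetical stand-ins for your embedding
# model and your retrieval + generation chain.
cache = []  # list of (query_embedding, answer) pairs


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def answer(query, embed, rag_answer, threshold=0.9):
    q_emb = np.asarray(embed(query))
    # Cache hit: a previously answered query is semantically close enough.
    for cached_emb, cached_answer in cache:
        if cosine(q_emb, cached_emb) >= threshold:
            return cached_answer  # skip retrieval and the LLM call
    # Cache miss: run the full RAG pipeline and remember the result.
    result = rag_answer(query)
    cache.append((q_emb, result))
    return result
```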
-
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
https://github.com/parthsarthi03/raptor
https://colab.research.google.com/drive/1jbjC4Sh2YVZkpyUE4EB6y8wnZgO7uPUV?usp=sharing
https://www.youtube.com/watch?v=37JSz9SvECI&t=80s&ab_channel=TwoSetAI
-
GraphRAG
Violent Incident Information from News Articles (VIINA) https://github.com/zhukovyuri/VIINA
Base repositories https://github.com/microsoft/graspologic
Comparison, https://arxiv.org/pdf/2303.08896.pdf
-
Decentralized Knowledge RAG
Examples of different categories’ knowledge graphs, i.e., encyclopedic KGs, commonsense KGs, domain-specific KGs, and multi-modal KGs.
The general roadmap of unifying KGs and LLMs. (a.) KG-enhanced LLMs. (b.) LLM-augmented KGs. (c.) Synergized LLMs + KGs.
The general framework of the Synergized LLMs + KGs, which contains four layers: 1) Data, 2) Synergized Model, 3) Technique, and 4) Application.
-
AI RAG Chat App Evaluation,
https://www.youtube.com/watch?v=mM8pZAI2C5w&ab_channel=PamelaFox
https://github.com/Azure-Samples/ai-rag-chat-evaluator
developed by using https://github.com/Azure-Samples/azure-search-openai-demo/
-
AI RAG Chat App: CI/CD Deployment,
https://www.youtube.com/watch?v=GMy3v6UXkYs&ab_channel=PamelaFox
-
Building A RAG Ebook “Librarian” Using LlamaIndex https://huggingface.co/learn/cookbook/rag_llamaindex_librarian
-
Metadata Extraction & Chunking Using Unstructured | ChromaDB
https://www.youtube.com/watch?v=JjSCezpZbI0&ab_channel=DataScienceBasics
-
Build Your Own RAG Using Unstructured, Llama3 via Groq, Qdrant & LangChain
-
Extract Image & Image Info From PDF & Use LlaVa via Ollama To Explain Image | LangChain
-
Extract Table Info From PDF & Summarise It Using Llama3 via Ollama | LangChain
-
Build applications with LLMs: LangChain
-
Universal Document Loader with langchain-airbyte, https://www.youtube.com/watch?v=zQU_1sCLSMU&ab_channel=LangChain
-
Build with LangChain, https://youtube.com/playlist?list=PLfaIDFEXuae06tclDATrMYY0idsTdLg9v&si=0ypsn2axHsDSMs6b
-
LangGraph python, https://youtube.com/playlist?list=PLfaIDFEXuae16n2TWUkKq5PgJ0w6Pkwtg&si=haMafIbDjtLZ9hFU
-
RAG from Scratch
https://www.youtube.com/watch?v=wd7TZ4w1mSw&list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&pp=iAQB
-
LangGraph (Python)
https://www.youtube.com/watch?v=5h-JBkySK34&list=PLfaIDFEXuae16n2TWUkKq5PgJ0w6Pkwtg&pp=iAQB
-
AutoPrompt Builder
https://www.youtube.com/watch?v=mmBo8nlu2j0&list=PLfaIDFEXuae06tclDATrMYY0idsTdLg9v&pp=iAQB
-
LangSmith Platform Overview
https://www.youtube.com/watch?v=3wAON0Lqviw&list=PLfaIDFEXuae2WCZ63usrRoriORSGmDQsg&pp=iAQB
-
LangChain Expression Language
https://www.youtube.com/watch?v=9M8x485j_lU&list=PLfaIDFEXuae1Ed60mXaLZRXC_jv0IwxPl&pp=iAQB
-
Deep Dive: How to Build a Smart Chatbot in 10 mins with LangChain
https://newsletter.theaiedge.io/p/deep-dive-building-a-smart-chatbot
-
Building long context RAG with RAPTOR from scratch
-
Super Easy Way To Parse PDF | LlamaParse From LlamaIndex | LlamaCloud
https://www.youtube.com/watch?v=wRMnHbiz5ck&ab_channel=DataScienceBasics
https://www.llamaindex.ai/blog/introducing-llamacloud-and-llamaparse-af8cedf9006b
https://github.com/run-llama/llama_parse
Getting Started: https://github.com/run-llama/llama_parse/blob/main/examples/demo_basic.ipynb
Advanced: https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
Advanced RAG Example: https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
LlamaParse Examples: https://github.com/run-llama/llama_parse/tree/main/examples
Raw API usage: https://github.com/run-llama/llama_parse/blob/main/examples/demo_api.ipynb
- LlamaCloud: https://cloud.llamaindex.ai/
- Ollama: https://ollama.ai/
- GitHub repo for code: https://github.com/sudarshan-koirala/llamaparser-example
-
Superior RAGs for complex PDFs with LlamaParse
https://www.youtube.com/live/7qsxz2rURG4?si=GbuRI1hfqrwpA6XU
-
Stanford CS25: V3 I Retrieval Augmented Language Models
https://www.youtube.com/watch?v=mE7IDf2SmJg&t=22s&ab_channel=StanfordOnline
-
RAG over your code: a project by Akshay on creating a local code assistant using LlamaIndex, MistralAI, and Streamlit to index and query GitHub repositories, offering a foundational guide for advanced code QA
https://www.youtube.com/watch?v=3V-rpBofej8&ab_channel=AkshayPachaar
-
Build a real-time RAG chatbot using Google Drive and Sharepoint
Keep your chatbot’s knowledge base up-to-date with Pathway and LlamaIndex
In this post, they explore how to build a real-time RAG app with up-to-date information from your files stored in Google Drive or Sharepoint. This means that your chatbot will always have access to the most recent version of your knowledge base—no manual pipeline reruns needed. By the end of this tutorial, you’ll use Pathway and LlamaIndex to build a RAG chatbot that instantly updates.
https://www.youtube.com/watch?v=JLVsFIXtvKE&ab_channel=Streamlit
-
Build an AI Browser Copilot
LaVague is designed to automate menial tasks on behalf of its users. Many of these tasks are repetitive, time-consuming, and require little to no cognitive effort. By automating these tasks, LaVague aims to free up time for more meaningful endeavors, allowing users to focus on what truly matters to them.
By providing an engine that turns natural language queries into Selenium code, LaVague is designed to make it easy for users or other AIs to express web workflows and execute them in a browser.
One of the key usages we see is to automate tasks that are personal to users and require them to be logged in, for instance automating the process of paying bills, filling out forms or pulling data from specific websites.
LaVague is built on open-source projects and leverages open-source models, either local or remote, to ensure the transparency of the agent and that it stays aligned with users' interests.
Large Action Model framework to automate browser interaction
A project by Daniel Huynh that demonstrates how to create a browser agent using RAG, local embeddings, and Mixtral to execute browser tasks from a Colab notebook, showcased with a video on navigating HuggingFace datasets
![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/a176c50a-7a1c-47fb-8b84-73f6c6cdda01)
LaVague interacting with Hugging Face's website.
Features:
- Natural Language Processing: Understands instructions in natural language to perform browser interactions.
- Selenium Integration: Seamlessly integrates with Selenium for automating web browsers.
- Open-Source: Built on open-source projects such as transformers and llama-index, and leverages open-source models, either local or remote, to ensure the transparency of the agent and that it stays aligned with users' interests.
- Local models for privacy and control: Supports local models like Gemma-7b so that users can fully control their AI assistant and have privacy guarantees.
- Advanced AI techniques: Uses a local embedding model (bge-small-en-v1.5) to perform RAG and extract the most relevant HTML pieces to feed the LLM answering the query, since dropping the full HTML code into the prompt would not fit in context. It then leverages few-shot learning and Chain of Thought to elicit the most relevant Selenium code for the action without having to fine-tune the LLM (Nous-Hermes-2-Mixtral-8x7B-DPO) for code generation.
https://github.com/lavague-ai/LaVague
https://colab.research.google.com/github/dhuynh95/LaVague/blob/main/LaVague.ipynb
-
LlamaIndex and Anthropic Cookbooks for RAG
LlamaIndex is a data framework for LLM-based applications that benefit from context augmentation.
Here they provide cookbooks for building LLM applications using Anthropic and LlamaIndex.
- [Basic_RAG_With_LlamaIndex.ipynb](https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/LlamaIndex/Basic_RAG_With_LlamaIndex.ipynb) - Notebook to help you build RAG pipelines with LlamaIndex.
- [Router_Query_Engine.ipynb](https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/LlamaIndex/Router_Query_Engine.ipynb) - Notebook to help you use RouterQueryEngine to route user queries to different indices.
- [SubQuestion_Query_Engine.ipynb](https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/LlamaIndex/SubQuestion_Query_Engine.ipynb) - Notebook to help you use SubQuestionQueryEngine to answer complex user queries spanning multiple documents.
- [ReAct_Agent.ipynb](https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/LlamaIndex/ReAct_Agent.ipynb) - Notebook to help you use ReActAgent with Tools and QueryEngine Tools.
- [Multi_Document_Agents.ipynb](https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/LlamaIndex/Multi_Document_Agents.ipynb) - Notebook to help you build an efficient RAG pipeline for a large number of documents.
- [Multi_Modal.ipynb](https://github.com/anthropics/anthropic-cookbook/blob/main/third_party/LlamaIndex/Multi_Modal.ipynb) - Notebook to help you build multi-modal applications using LlamaIndex.
https://github.com/anthropics/anthropic-cookbook/tree/main/third_party/LlamaIndex
-
VideoDB Retriever from LlamaIndex: RAG: Instantly Search and Stream Video Results 📺
VideoDB is a serverless database designed to streamline the storage, search, editing, and streaming of video content. VideoDB offers random access to sequential video data by building indexes and developing interfaces for querying and browsing video content. Learn more at docs.videodb.io.
Constructing a RAG pipeline for text is relatively straightforward, thanks to the tools developed for parsing, indexing, and retrieving text data. However, adapting RAG models for video content presents a greater challenge. Videos combine visual, auditory, and textual elements, requiring more processing power and sophisticated video pipelines.
While Large Language Models (LLMs) excel with text, they fall short in helping you consume or create video clips. VideoDB provides a sophisticated database abstraction for your MP4 files, enabling the use of LLMs on your video data. With VideoDB, you can not only analyze but also instantly watch video streams of your search results.
In this notebook, we introduce VideoDBRetriever, a tool specifically designed to simplify the creation of RAG pipelines for video content, without any hassle of dealing with complex video infrastructure.
-
StreamRAG: GPT-Powered Video Retrieval & Streaming 🚀
Video Search Agent for ChatGPT
What does it do? 🤔
It enables developers to:
- 📚 Upload multiple videos to create a library or collection.
- 🔍 Search across these videos and get real-time video responses or compilations.
- 🛒 Publish your searchable collection on the ChatGPT store.
- 📝 Receive summarized text answers (RAG).
- 🌟 Gain key insights from specific videos (e.g. "Top points from episode 31").
-
Semi-structured RAG with LangChain using Mistral 7B, Qdrant, and FastEmbed on PDF text with tabular data, https://colab.research.google.com/drive/1rLWrDwePwgtZAOUTL7RNpsS7tTQ3oWWQ?usp=sharing
https://youtu.be/2Id2KTrES2s?si=44IA8s3qHQYEUTkR
-
Improved Retrieval Augmented Generation with ALL-SORT (Assisted Large Language Sorting)
Smaug 34B Model: https://huggingface.co/abacusai/Smaug-34B-v0.1
E5 Embedding Model: https://huggingface.co/intfloat/e5-large-v2
Structured Text Generation: https://github.com/outlines-dev/outlines, https://outlines-dev.github.io/outlines/
https://www.youtube.com/watch?v=biJmRQF8bmY&ab_channel=TrelisResearch
-
Building STORM from scratch with LangGraph, https://www.youtube.com/watch?v=1uUORSZwTz4&ab_channel=LangChain
https://github.com/langchain-ai/langgraph/blob/main/examples/storm/storm.ipynb
-
Reverse Image Search/Similarity App (Milvus and K8s) - Full AI Tutorial
https://www.youtube.com/watch?v=rscLNbnk53Y&ab_channel=NodematicTutorials
-
Create Medical Chatbot with Google Gemma 7B LLM LlamaIndex Colab Demo Qdrant FastEmbed Gradio
https://www.youtube.com/watch?v=23BU5Csi_3w&ab_channel=RitheshSreenivasan
https://colab.research.google.com/drive/1XBohRbAQchvxXVMi1Nap7JuRihjX-N9e?usp=sharing
-
Elevate Responses: RAG with LlamaIndex & MongoDB
https://huggingface.co/blog/Andyrasika/mongodb-llamaindex-rag
-
Retrieval Augmented Fine Tuning (RAFT)
🦍 RAFT: Adapting Language Model to Domain Specific RAG
How to prepare an LLM for an exam? Closed-Book vs. Open-Book vs. RAFT
Train and Test Configuration for RAFT
https://gorilla.cs.berkeley.edu/blogs/9_raft.html
RAFT is a recipe for adapting LLMs to domain-specific RAG. You can learn more in the release blog posts. RAFT takes an input document from the user and creates a dataset from that document, consisting of synthetically generated { question, answer, documents } triplets. The dataset can then be used to fine-tune models for improved question-answering and retrieval.
The input data from the user can be either a general text document (pdf, json, or txt) for general QA, or API documentation in the API Zoo JSONL format for API calling.
-
Langchain: Chat with PDF | ChatGPT for Your PDF FILES | PDF Question Answering | Python
https://www.youtube.com/watch?v=WmuSEfgzcJo&ab_channel=PythonCodeCamp
-
Python RAG Tutorial (with Local LLMs): AI For Your PDFs
https://www.youtube.com/watch?v=2TJxpyO3ei4&ab_channel=pixegami
-
- How to add Trustworthiness Score To RAG Answers
Overcoming Hallucinations with the Trustworthy Language Model
https://cleanlab.ai/blog/trustworthy-language-model/
![image](https://github.com/ParthaPRay/LLM-Learning-Sources/assets/1689639/61d46e84-dcb6-4132-97a3-dd364f0e3b95)
-
LLM Datasets
-
FineWeb and FineWeb-Edu - All Time Best Quality Dataset for Free
15T tokens, taking 44 TB of disk space. Best for pre-training LLMs.
https://www.youtube.com/watch?v=UUwilmud5v8&ab_channel=FahdMirza
https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
-
🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets
https://huggingface.co/blog/dvilasuero/synthetic-data-with-llama3-distilabel
-
⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2
https://huggingface.co/blog/burtenshaw/distilabel-prometheus-2
-
Create a Diffusers-compatible Dataset for Stable Diffusion Fine-tuning, https://huggingface.co/blog/nroggendorff/create-diffusers-dataset
-
Convert Any Text to LLM Dataset Locally - Demo with Example
https://www.youtube.com/watch?v=ZiyCe_dRksM&ab_channel=FahdMirza
NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO
togetherai: The fastest cloud platform for building and running generative AI.
-
Install Genstruct 7B Locally - Best Model to Create Datasets of Any Domain
Genstruct 7B is an instruction-generation model, designed to create valid instructions given a raw text corpus. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus.
Feature | ChatGPT | Few-shot prompting | RAG | Ada-Instruct | Genstruct |
---|---|---|---|---|---|
Open models | ❌ | ☑️ | ☑️ | ✅ | ✅ |
Grounded generation | ❌ | ❌ | ✅ | ❌ | ✅ |
Complex questions | ❌ | ❌ | ❌ | ☑️ | ✅ |
Complex responses | ✅ | ☑️ | ❌ | ☑️ | ✅ |
-
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
-
Augmentoolkit
Convert Compute And Books Into Instruct-Tuning Datasets.
Turn any raw text into a high-quality dataset using local models. Make data gathering a painless step of the model creation process. Augmentoolkit is the easy-to-use, customizable, open-source, and cost-effective data generation solution. No OpenAI needed.
-
H2O LLM DataStudio Part II: Convert Documents to QA Pairs for fine tuning of LLMs
- H2O LLM DataStudio: Streamlining Data Curation and Data Preparation for LLMs related tasks
https://h2o.ai/blog/2023/streamlining-data-preparation-for-fine-tuning-of-large-language-models/
-
How to Create Synthetic Dataset with LLM Locally
Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators"
https://github.com/microsoft/llm-data-creation
https://www.youtube.com/watch?v=kWooqJKJO7k&t=14s&ab_channel=FahdMirza
-
Part 1: Step-by-Step Dataset Creation- Unstructured to Structured
-
Fine-Tuned Q&A - create Q&A (Some Part Deprecated)
https://cookbook.openai.com/examples/fine-tuned_qa/olympics-2-create-qa
-
Create your own fine tuning datasets in @HuggingFace
https://www.youtube.com/watch?v=PGSkyUDzqx8&ab_channel=SuperLazyCoder
-
Detecting Issues in a Text Dataset with Cleanlab
In this 5-minute quickstart tutorial, Cleanlab is used to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. The tutorial considers a subset of the Banking77-OOS Dataset containing 1,000 customer service requests classified into 10 categories based on their intent (you can run the same code on any text classification dataset). Cleanlab automatically identifies bad examples in the dataset, including mislabeled data, out-of-scope examples (outliers), and otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!
https://huggingface.co/learn/cookbook/issues_in_text_dataset
Dataset: [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564)
CleanLab: https://github.com/cleanlab/cleanlab https://cleanlab.ai/ The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
-
From screenshots to HTML code: Introducing the WebSight dataset
https://huggingface.co/blog/websight
Dataset: https://huggingface.co/datasets/HuggingFaceM4/WebSight
Examples of web pages included in WebSight.
Comparison of an original web page (input) on the left, and the rendering of the code generated by our model, Sightseer, (output) on the right.
https://colab.research.google.com/drive/1LdamGKR2oacrDk-kYwz_Wfc1-RBUdzcO?usp=sharing
-
Cosmopedia: how to create large-scale synthetic data for pre-training
https://huggingface.co/blog/cosmopedia
Cosmopedia is a dataset of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.
https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
The distribution of data sources for building Cosmopedia prompts (left plot) and the distribution of sources inside the Curated sources category (right plot).
-
Generating synthetic data with LLMs - Part 1
https://www.confident-ai.com/blog/how-to-generate-synthetic-data-using-llms-part-1
-
Lavita's Collections Medical QA Datasets
https://huggingface.co/collections/lavita/medical-qa-datasets-6540b9b1992b1c560eda935c
Medical Foundation Models Evaluation: A collection of work on evaluating LLMs and Foundation Models on medical tasks (e.g., Medical Question Answering, etc.)
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine, https://huggingface.co/papers/2311.16452
- Towards Generalist Biomedical AI, https://huggingface.co/papers/2307.14334
- Almanac: Retrieval-Augmented Language Models for Clinical Medicine, https://huggingface.co/papers/2303.01229
-
How to Convert Any Dataset to DPO Dataset
https://www.youtube.com/watch?v=GGnBgpe1UiQ&ab_channel=FahdMirza
-
Preparing Fineweb - A Finely Cleaned Common Crawl Dataset
https://www.youtube.com/watch?v=20wbA_ijjmg&ab_channel=TrelisResearch
-
Vector databases and HNSW indexing
We have recently seen a surge in vector databases in this era of generative AI. The idea behind vector databases is to index the data with vectors that relate to that data. Hierarchical Navigable Small World (HNSW) is one of the most efficient ways to build indexes for vector databases. The idea is to build a similarity graph and traverse that graph to find the nodes that are the closest to a query vector.
Navigable Small World (NSW) is a process to build efficient graphs for search. We build a graph by adding vectors one after the other and connecting each new node to the most similar neighbors.
When building the graph, we need to decide on a metric for similarity such that the search is optimized for the specific metric used to query items. Initially, when adding nodes, the density is low, and the edges will tend to capture nodes that are far apart in similarity. Little by little, the density increases, and the edges start to be shorter and shorter. As a consequence, the graph is composed of long edges that allow us to traverse long distances in the graph and short edges that capture closer neighbors. Because of it, we can quickly traverse the graph from one side to the other and look for nodes at a specific location in the vector space.
When we want to find the nearest neighbor to a query vector, we initiate the search by starting at one node (i.e., node A in that case). Among its neighbors (D, G, C), we look for the closest node to the query (D). We iterate over that process until there is no closer neighbor to the query. Once we cannot move anymore, we have found a close neighbor to the query. The search is approximate, and the found node may not be the closest, as the algorithm may get stuck in a local minimum.
The problem with NSW is that we spend a lot of iterations traversing the graph to arrive at the right node. The idea behind Hierarchical Navigable Small World is to build multiple graph layers where each layer is less dense compared to the next. Each layer represents the same vector space, but not all vectors are added to the graph. Basically, we include a node in the graph at layer L with a probability P(L). We include all the nodes in the final layer (if we have N layers, we have P(N) = 1), and the probability gets smaller as we get toward the first layers. We have a higher chance of including a node in the following layer: P(L) < P(L + 1).
The first layer allows us to traverse longer distances at each iteration, whereas in the last layer, each iteration will tend to capture shorter distances. When we search for a node, we start first in layer 1 and go to the next layer if the NSW algorithm finds the closest neighbor in that layer. This allows us to find the approximate nearest neighbor in fewer iterations on average.
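A small sketch of building and querying an HNSW index with the hnswlib library (assuming `pip install hnswlib`); the data and parameters are illustrative only:

```python
import numpy as np
import hnswlib

dim, n = 64, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M controls how many edges each node gets per layer,
# ef_construction controls how hard we search for neighbors while building.
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

index.set_ef(50)  # search-time breadth: higher = more accurate, slower
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```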
Vector databases are often used for recommender engines, where we learn vector representations of users and items we want to recommend. This allows us to quickly find similar items by using an approximate nearest neighbor search. As long as we can learn a vector representation of a piece of data, we can index it in a vector database. With the recent advent of LLMs, it became easier to compute vector representations of text documents that capture the semantic meaning of the text, and vector databases make it easier to find semantically similar text documents.
When looking for the nearest neighbors, it is often not important to be perfectly accurate. Product Quantization (PQ) is a way to quantize the vector space to represent vectors with less precision. The idea is to cluster vectors and index the cluster centroids instead of the vectors themselves. When looking for the nearest neighbors to a query vector, we just need to pull the vectors from the closest clusters. It is a faster search, and indexing the vectors takes much less memory space.
We first need to partition each vector into smaller vectors and run a K-means algorithm on each partition. Instead of indexing the vectors, we index the centroid of the clusters they belong to. If we use 2 clusters per partition and have 6 vectors, that's 3X data compression. Obviously, compression would be much higher with more vectors. Each vector now maps to a set of clusters and their related centroids.
If we want to find the nearest neighbors from a query vector, we measure the squared Euclidean distance for each cluster in each partition and return the vectors with the lowest summed squared Euclidean distances.
Instead of having to iterate through each vector, we just need to iterate through the clusters' centroids. There is a balance between search latency and accuracy: the more clusters we use, the finer the quantization and the more accurate the returned nearest neighbors, but the search latency increases because we need to iterate through more clusters.
This is still a brute force approach as the algorithm scales with the number of clusters, but it can be used in combination with other algorithms to have blasting fast retrieval.
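A toy sketch of this process with NumPy and scikit-learn: partition the vectors into sub-vectors, run K-means per partition, keep only the centroid ids, and score a query with summed squared distances (all sizes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64)).astype(np.float32)  # 1000 vectors of dim 64
n_parts, n_clusters = 8, 16                          # 8 sub-vectors of dim 8

subdims = np.split(np.arange(64), n_parts)
codebooks, codes = [], []
for cols in subdims:
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X[:, cols])
    codebooks.append(km.cluster_centers_)  # (16, 8) centroids for this partition
    codes.append(km.labels_)               # one small integer per vector
codes = np.stack(codes, axis=1)            # (1000, 8): the compressed representation


def approx_distances(query):
    # Distance from each query sub-vector to every centroid, then sum the
    # looked-up values per stored code (asymmetric distance computation).
    tables = [((codebooks[p] - query[cols]) ** 2).sum(axis=1)
              for p, cols in enumerate(subdims)]
    return sum(tables[p][codes[:, p]] for p in range(n_parts))


q = rng.normal(size=64).astype(np.float32)
print(np.argsort(approx_distances(q))[:5])  # approximate nearest neighbor ids
```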
There are tons of vector database providers: Pinecone, Deep Lake, Milvus, Qdrant, Weaviate, ... They all tend to provide similar capabilities with efficient similarity search, optimized storage formats for AI applications, unstructured data accessibility, and cloud-native infrastructure. Most of the game is about how to index billions of vectors for fast retrieval. One such indexing algorithm is Locality-sensitive hashing (LSH).
LSH aims to group vectors together based on similarity. For example, we could partition the vector space into multiple buckets, and we could call "nearest neighbors" whatever vectors belong to the same bucket. In practice, it is done a bit differently. An efficient way to partition the space is to project the vectors onto a space of a specific dimensionality and "binarize" each component. The projection is done using a random matrix M of dimension (C, R), where C is the dimension of the original vector V and R is the dimension of the space we want to project the vectors into:
V' = V · M
For example, if C = 2 and R = 3, we would project from a plane to a 3D space. We can now partition the space with regions above and below the hyperplanes passing by the origin. If we have, for example, a vector A = [0.5, -1.5, 0.3], we look at each of the components and assign a 1 if it is positive and 0 otherwise. The vector A would be hashed to [1, 0, 1] under that process. Every vector assigned the same hash will be close in the vector space and can be labelled “nearest neighbors”. The time complexity to hash a vector V is O(R x C + R) = O(R x C), and retrieving the vectors with the same hash can be done in constant time.
The hash of a vector under the LSH hashing process is a binary vector. To measure how different 2 binary vectors are, we use the Hamming Distance. The Hamming distance counts the number of times 2 strings have different characters. When we have strings of binary numbers, the Hamming distance can be computed using the XOR operation, and the number of resulting 1s can be counted.
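A toy NumPy sketch of the random-projection hashing and the XOR-based Hamming distance described above:

```python
import numpy as np

rng = np.random.default_rng(0)
C, R = 2, 3                     # original dimension C, projected dimension R
M = rng.normal(size=(C, R))     # random projection matrix


def lsh_hash(v):
    # V' = V · M, then keep 1 for positive components and 0 otherwise,
    # e.g. a projected vector [0.5, -1.5, 0.3] hashes to [1, 0, 1].
    return (np.asarray(v) @ M > 0).astype(np.uint8)


def hamming(h1, h2):
    # XOR the binary hashes and count the resulting 1s.
    return int(np.count_nonzero(np.bitwise_xor(h1, h2)))


a, b = np.array([1.0, 0.2]), np.array([0.9, 0.3])
print(lsh_hash(a), lsh_hash(b), hamming(lsh_hash(a), lsh_hash(b)))
```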
- Embeddings: the superpower of deep learning
Deep Learning finds its strength in its ability to model efficiently with different types of data at once. It is trivial to build models from multimodal datasets nowadays. It is not a new concept, though, nor was it impossible to do it prior to the advent of DL, but the level of complexity of feature processing and modeling was much higher with much lower performance levels!
One key aspect of this success is the concept of Embedding: a lower dimensionality representation of the data. This makes it possible to perform efficient computations while minimizing the effect of the curse of dimensionality and providing more robust representations when it comes to overfitting. In practice, this is just a vector living in a "latent" or "semantic" space.
The first great success of embedding for word encoding was Word2Vec back in 2013 and later GloVe in 2014. Since AlexNet back in 2012, many Convolutional network architectures (VGG16 (2014), ResNet (2015), Inception (2014), …) have been used as feature extractors for images. As of 2018, starting with BERT, Transformer architectures have been used quite a bit to extract semantic representations from sentences.
One domain where embeddings changed everything is recommender engines. It all started with Latent Matrix Factorization methods made popular during the Netflix competition in 2009. The idea is to have a vector representation for each user and product and use that as base features. In fact, any sparse feature could be encoded within an embedding vector, and modern rec engines typically use hundreds of embedding matrices for different categorical variables.
Dimensionality reduction is by all accounts not a new concept in Unsupervised Learning! PCA, for example, dates back to 1901; the concept of Autoencoder was introduced in 1986, and the variational Autoencoders (VAE) were introduced in 2013. For example, VAE is a key component of Stable Diffusion. The typical difficulty with Machine Learning is the ability to have labeled data. Self-supervised learning techniques like Word2Vec, Autoencoders, and generative language models allow us to build powerful latent representations of the data at a low cost. Meta came out with Data2Vec 2.0 to learn latent representations of any data modality using self-supervised learning.
Besides learning latent representations, a lot of work is being done to learn aligned representations between different modalities. For example, CLIP is a recent contrastive learning method to learn semantically aligned representations between text and image data.
-
How LLMs answer questions with databases
How does an LLM ask a question to a database? The typical process is to use another LLM to encode the question into a vector representation and use this vector to query a vector database. By finding "similar" vectors in that database, we assume that the related documents should contain the answer to the original question. By feeding those documents into a prompt, we hope the LLM will have enough context to answer that question.
This process is called Retrieval Augmented Generation (RAG), and it suffers a simple problem: there is no reason for a question to be semantically similar to its answer. RAG can lead to many irrelevant documents being fed to the LLM without being provided the right context for an answer.
One solution to that is to use the Hypothetical Document Embeddings (HyDE) technique. The idea is to use the LLM to generate a hypothetical answer, embed that answer, and use this embedding to query the vector database. The hypothetical answer will be wrong, but it has more chance to be semantically similar to the right answer.
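A minimal HyDE sketch; `llm`, `embed`, and `vector_db` are hypothetical stand-ins for an LLM client, an embedding model, and a vector store:

```python
def hyde_answer(question, llm, embed, vector_db, k=5):
    # 1. Let the LLM write a hypothetical (possibly wrong) answer.
    hypothetical = llm(f"Write a short passage that answers: {question}")
    # 2. Embed the hypothetical answer instead of the question itself.
    query_vector = embed(hypothetical)
    # 3. Retrieve documents that are semantically close to that answer.
    documents = vector_db.search(query_vector, k=k)
    # 4. Answer the original question grounded in the retrieved context.
    context = "\n\n".join(documents)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```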
-
How to build Google image search engine
We can frame this problem as a ranking problem. We need a model that takes as input two images and returns a similarity score. Using that model, we can then rank the images based on that similarity score. A typical modeling approach is to utilize models that can learn a vectorial representation (embedding) of the images and compute a similarity metric on those vectors. We need a model that can extract the image features to learn a vector representation of images, and we need a model that can extract the text features to learn a vector representation of text inputs. We need to co-train the image and text models so the vector representations are semantically aligned.
To ensure fast retrieval, we need a way to store the existing images and quickly search for similar images. Since we are encoding the images into their vector representations, it seems logical to index the images into a vector database. The indexing pipeline converts the original images into their vector representations and indexes them into a vector database.
When a user inputs a text or image query, we need to return a list of images. The embedding generation service generates an embedding encoding of the input query. The embedding query is sent to the vector database that returns the nearest neighbors of the query. The reranking service is mainly used to rerank the nearest neighbors using a better model than the embedding generation model. It could be used to personalize the ranking to the specific user by using user-specific data. The resulting list is a list of image IDs, and it is then sent to the image store to retrieve the actual images to return to the user.
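A small sketch of the retrieval side using a jointly trained text/image encoder (CLIP via the sentence-transformers library); the image files are hypothetical, and in production the image embeddings would be indexed in a vector database rather than held in memory:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["cat.jpg", "beach.jpg", "city.jpg"]  # hypothetical files
image_embeddings = model.encode([Image.open(p) for p in image_paths])

query_embedding = model.encode("a cat sleeping on a sofa")
scores = util.cos_sim(query_embedding, image_embeddings)[0]  # one score per image
best = int(scores.argmax())
print(image_paths[best], float(scores[best]))
```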
-
LanceDB, a free, open-source, serverless vector DB that requires no setup. It integrates into the Python data ecosystem, so you can simply start using it in your existing data pipelines with pandas, Arrow, Pydantic, etc. LanceDB has a native TypeScript SDK with which you can run vector search in serverless functions!
-
Building Multi-Modal Search with Vector Databases
https://www.youtube.com/watch?v=3WUobZryyok&t=6s&ab_channel=DeepLearningAI
https://docs.google.com/presentation/d/1sS-bxJ-k9EuESH2VhpwnybY3QyV_9FdxHLmZLCSpuSM/edit?usp=sharing
-
How to select embedding model?
https://www.rungalileo.io/blog/mastering-rag-how-to-select-an-embedding-model
-
FastEmbed, a lightweight, fast Python library built for embedding generation. It supports popular text models; open a GitHub issue if you want a new model added.
https://www.youtube.com/watch?v=1mMLVQE11Io&ab_channel=LearnDatawithMark
https://github.com/qdrant/fastembed
-
Embedding multimodal data for similarity search using 🤗 transformers, 🤗 datasets and FAISS
https://github.com/huggingface/cookbook/tree/main/notebooks/en
-
Introduction to Matryoshka Embedding Models
-
Protein similarity and Matryoshka embeddings https://huggingface.co/blog/monsoon-nlp/proteins-matryoshka-embeddings
-
Ollama 0.1.26 Makes Embedding 100x Better
https://www.youtube.com/watch?v=Ml179HQoy9o&ab_channel=MattWilliams
nomic-embed-text is currently much faster than llama2 for embeddings.
-
Visualising embeddings with t-SNE
https://www.youtube.com/watch?v=MgayYUdI4is&ab_channel=LearnDatawithMark
https://github.com/mneedham/LearnDataWithMark/blob/main/tsne-playground/app.py
-
From HuggingFace dataset to Qdrant vector database in 12 minutes flat
https://www.gptechblog.com/from-huggingface-dataset-to-qdrant-vector-database-in-12-minutes-flat/
-
Transformers and Qdrant: Revolutionizing Data Integration for NLP Tasks
-
Ollama Embedding: How to Feed Data to AI for Better Response?
https://www.youtube.com/watch?v=jENqvjpkwmw&t=17s&ab_channel=MervinPraison
-
Cohere Embed v3 int8 & binary Embeddings 4X and 32X memory reduction 40x faster search
https://www.youtube.com/watch?v=P2dTCp-lGaE&ab_channel=RitheshSreenivasan
https://txt.cohere.com/int8-binary-embeddings/
-
Nomic's new embedding model : nomic-embed-text, https://youtu.be/LpcaeQZDVB8?si=VrJzmRSrwJRxHwzv
-
Crazy fast RAG, Ollama, Nomic embedding model, groq
-
Mixedbread mxbai-embed-large-v1 embedding model
This is a base sentence embedding model. It was trained using AnglE loss on our high-quality large scale data. It achieves SOTA performance on BERT-large scale. Find out more in our blog post
https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
https://youtu.be/CXDOkHFboAU?si=m8OoaiPa0JHMDs1e
Model | Avg (56 datasets) | Classification (12 datasets) | Clustering (11 datasets) | PairClassification (3 datasets) | Reranking (4 datasets) | Retrieval (15 datasets) | STS (10 datasets) | Summarization (1 dataset) |
---|---|---|---|---|---|---|---|---|
mxbai-embed-large-v1 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85.00 | 32.71 |
bge-large-en-v1.5 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
mxbai-embed-2d-large-v1 | 63.25 | 74.14 | 46.07 | 85.89 | 58.94 | 51.42 | 84.90 | 31.55 |
nomic-embed-text-v1 | 62.39 | 74.12 | 43.91 | 85.15 | 55.69 | 52.81 | 82.06 | 30.08 |
jina-embeddings-v2-base-en | 60.38 | 73.45 | 41.73 | 85.38 | 56.98 | 47.87 | 80.70 | 31.60 |
Proprietary Models | ||||||||
OpenAI text-embedding-3-large | 64.58 | 75.45 | 49.01 | 85.72 | 59.16 | 55.44 | 81.73 | 29.92 |
Cohere embed-english-v3.0 | 64.47 | 76.49 | 47.43 | 85.84 | 58.01 | 55.00 | 82.62 | 30.18 |
OpenAI text-embedding-ada-002 | 60.99 | 70.93 | 45.90 | 84.89 | 56.32 | 49.25 | 80.97 | 30.80 |
-
RAG Databases with Johannes Jolkkonen: When to Choose a Graph Database vs Alternatives
https://www.youtube.com/watch?v=1Iuuvk6yJME&ab_channel=Neo4j
-
PDF reader built with the genai-stack: LangChain + Docker + Neo4j + Ollama
-
NODES 2023 - Using LLMs to Convert Unstructured Data to Knowledge Graphs
https://www.youtube.com/watch?v=qLdkRReMPvM&ab_channel=Neo4j
-
Implementing RAG: How to Write a Graph Retrieval Query in LangChain
https://neo4j.com/developer-blog/rag-graph-retrieval-query-langchain/
-
Index Guide
- Guidelines to choose an FAISS index
Selecting the appropriate FAISS index is crucial for optimizing performance and depends on the specific requirements of your project, such as dataset size, query frequency, and latency constraints. Here's a guide to selecting different indexes based on these criteria:
- For Small Datasets:
* FlatL2 or FlatIP: Ideal for smaller datasets due to their simplicity and moderate memory consumption. They perform exhaustive searches across all vectors and provide precise results.
* LSH (Locality-Sensitive Hashing): Suitable for small to medium datasets and recommended for vectors up to 128 dimensions. LSH is faster than exhaustive search but may trade off a bit of accuracy for speed.
- For Medium to Large Datasets:
* HNSW (Hierarchical Navigable Small World): Extremely fast for both indexing and querying and supports higher-dimensional data. However, it requires more memory, making it suitable for medium-sized datasets.
* IVF (Inverted File Indexing): Ideal for large datasets. It segments the search space into a predefined number of clusters and only searches within the most relevant clusters. IVF indexes balance between memory usage and search speed, making them efficient for large-scale applications.
- For Very Large Datasets:
* Advanced versions of IVF, such as IVFADC (Inverted File with Asymmetric Distance Computation) or IVFPQ (Product Quantization), can be used. These indexes further compress the dataset and reduce the search space, optimizing both memory usage and search speed at the scale of millions of vectors.
When integrating a semantic cache with a FAISS-based RAG system, it's essential to:
- Choose the right index type based on your dataset size and query characteristics.
- Consider the trade-offs between accuracy and speed, as some indexes may offer faster retrieval at the expense of precision.
- Test and evaluate different indexes to find the best configuration for your specific use case.
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
https://github.com/facebookresearch/faiss
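A small sketch of instantiating the index families mentioned in the guide above with the faiss Python package; the dataset sizes and parameters are illustrative only:

```python
import numpy as np
import faiss

d = 128
xb = np.random.rand(10_000, d).astype(np.float32)  # database vectors
xq = np.random.rand(5, d).astype(np.float32)       # query vectors

# Small dataset: exact, exhaustive search.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# Medium dataset: HNSW graph index (32 neighbors per node).
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

# Large dataset: IVF + product quantization (must be trained on a sample).
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 100, 8, 8)  # 100 clusters, 8 sub-quantizers, 8 bits
ivfpq.train(xb)
ivfpq.add(xb)
ivfpq.nprobe = 10                                  # clusters visited per query

for index in (flat, hnsw, ivfpq):
    distances, ids = index.search(xq, 5)
    print(type(index).__name__, ids[0])
```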
-
LlamaIndex Indexing Guide
- VectorStoreIndex
- Summary Index
- Tree Index
- Keyword Table Index
- Knowledge Graph Index
- Knowledge Graph Query Engine
- Knowledge Graph RAG Query Engine
- REBEL + Knowledge Graph Index
- REBEL + Wikipedia Filtering
- SQL Index
- SQL Query Engine with LlamaIndex + DuckDB
- Document Summary Index
- The ObjectIndex Class
https://docs.llamaindex.ai/en/stable/module_guides/indexing/index_guide.html
-
FlagEmbedding
FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently:
- Long-Context LLM: Activation Beacon
- Fine-tuning of LM : LM-Cocktail
- Dense Retrieval: BGE-M3, LLM Embedder, BGE Embedding
- Reranker Model: BGE Reranker
- Benchmark: C-MTEB
-
CPU Optimized Embeddings with 🤗 Optimum Intel and fastRAG
SFR-Embedding by Salesforce Research
Should dense vectors always be used for information retrieval? The two dominant approaches have trade-offs:
* Sparse retrieval matches n-grams, phrases, or metadata to search large collections efficiently and at scale. However, it may miss relevant documents due to lexical gaps between the query and the document.
* Semantic retrieval encodes text into dense vectors, capturing context and meaning better than bag-of-words. It can retrieve semantically related documents despite lexical mismatches. However, it's computationally intensive, has higher latency, and requires sophisticated encoding models compared to lexical matching like BM25.
Optimum Intel is an open-source library that accelerates end-to-end pipelines built with Hugging Face libraries on Intel Hardware. Optimum Intel includes several techniques to accelerate models such as low-bit quantization, model weight pruning, distillation, and an accelerated runtime.
The runtime and optimizations included in Optimum Intel take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs to accelerate models. Specifically, it has built-in BFloat16 (bf16) and int8 GEMM accelerators in every core to accelerate deep learning training and inference workloads. AMX accelerated inference is introduced in PyTorch 2.0 and Intel Extension for PyTorch (IPEX) in addition to other optimizations for various common operators.
Optimizing pre-trained models can be done easily with Optimum Intel; many simple examples can be found here.
https://huggingface.co/blog/intel-fast-embedding
-
LoRA
-
What is LoRA?
Edward Hu, https://edwardjhu.com/
https://lightning.ai/lightning-ai/studios/code-lora-from-scratch
-
LoRA training scripts of the world, unite!
-
Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA)
-
Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments
-
LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch
https://www.youtube.com/watch?v=PXWYUTMt-AU&ab_channel=UmarJamil
-
Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)
https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed. https://github.com/Lightning-AI/lit-gpt
-
Optimize LLM utilization with LoRA
How can we optimize machine utilization for multiple fine-tuned LLMs? Let's consider OpenAI as an example and its API to fine-tune models. In the case of OpenAI, "fine-tuning" means that the model is specialized by using some proprietary data, and it is then deployed on GPU hardware for API access. Naively, we could think that for each new customer wanting to fine-tune their model, we would need to deploy a new model on a new GPU cluster. However, it is unlikely that OpenAI proceeds this way!
GPU hardware is really expensive, and they would need to allocate a GPU cluster for each new customer. OpenAI's pricing model is based on model usage, meaning customers only pay when they use the model, but for OpenAI, the cost of serving the model never stops! It is very likely that there have been thousands of customers who just wanted to test OpenAI's fine-tuning capabilities, and the resulting fine-tuned models were never actually used. Would OpenAI just absorb the serving cost for each of those models?
One strategy to fine-tune LLMs is to use adapters that can be “plugged“ into the base model. The idea is to avoid updating the weights of the base model and have the adapters capture the information about the fine-tuning tasks. We can plug in and out different adapters that specialize the model on different tasks. The most common and efficient adapter type is the Low-Rank Adapter (LoRA). The idea is to replace some of the large matrices within the model with smaller ones for the gradient computation.
Because of the small size of those adapters and their simple additive logic, it is easy to add multiple adapters at once for different fine-tuning tasks. Those adapters can be trained separately and plugged together at serving time. We just need a logic to route the inputs to their respective task.
This is extremely beneficial when we have a low request volume for some of the tasks. In the case of OpenAI, with multiple LoRA adapters, it becomes easy for them to deploy multiple fine-tuned models on the same GPU cluster. After the LoRA weights have been trained during a fine-tuning process, we just store those in a model registry. The cost of storing those weights instead of a full fine-tuned model is going to be much lower! At serving time, we can plug multiple adapters into the same base model and route the customer’s request to its own adapter.
OpenAI can easily measure the adapter utilization and the customers’ request volume for the different fine-tuned models. If the volume is low, it can be deployed along with other low-utilization adapters on the same base model, and if it is high, the adapter can be allocated its own base model such that the users don’t wait too long for their requests to be completed.
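A minimal sketch of this multi-adapter idea with Hugging Face peft; the base model id is real, but the adapter paths and names are hypothetical placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Plug several LoRA adapters into the same shared base model.
model = PeftModel.from_pretrained(base, "adapters/customer-a", adapter_name="customer-a")
model.load_adapter("adapters/customer-b", adapter_name="customer-b")


def generate(prompt, adapter):
    model.set_adapter(adapter)  # route this request to its own fine-tuned adapter
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)


print(generate("Summarize our Q3 report.", adapter="customer-a"))
```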
-
Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch
https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch
-
Rank-Stabilized LoRA: Unlocking the Potential of LoRA Fine-Tuning
-
Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments
-
A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
-
SDXL in 4 steps with Latent Consistency LoRAs
-
Deploy hundreds of open source models on one GPU using LoRAX
https://huggingface.co/blog/macadeliccc/deploy-hundreds-of-models-on-one-gpu
Quantization
* Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training
https://www.youtube.com/watch?v=0VdNflU08yA&ab_channel=UmarJamil
https://github.com/hkproj/quantization-notes
The Two Types of LLM Quantization: PTQ and QAT
While there are several quantization techniques, the most notable of which we detail later in this guide, generally speaking, LLM quantization falls into two categories:
Post-Training Quantization (PTQ): this refers to techniques that quantize an LLM after it has already been trained. PTQ is easier to implement than QAT, as it requires less training data and is faster. However, it can also result in reduced model accuracy from lost precision in the value of the weights.
Quantization-Aware Training (QAT): this refers to methods of fine-tuning on data with quantization in mind. In contrast to PTQ techniques, QAT integrates the weight conversion process, i.e., calibration, range estimation, clipping, rounding, etc., during the training stage. This often results in superior model performance, but is more computationally demanding.
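As a concrete example of PTQ, here is a hedged sketch using PyTorch's built-in dynamic quantization, which converts the weights of linear layers to int8 after training (no QAT involved); the toy model is just for illustration.
```python
# Sketch: post-training dynamic quantization of a toy model with PyTorch.
# nn.Linear weights are converted to int8 after training is complete.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, smaller int8 weights
```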
Not too long ago, the largest Machine Learning models most people would deal with merely reached a few GB in memory size. Now, every new generative model coming out is between 100B and 1T parameters! To get a sense of the scale, a float parameter is 32 bits, or 4 bytes, so those new models require between 400 GB and 4 TB of memory, each running on expensive hardware. Because of the massive scale increase, there has been quite a bit of research to reduce the model size while keeping performance up. There are 5 main techniques to compress the model size.
-
Model pruning is about removing unimportant weights from the network. The game is to understand what "important" means in that context. A typical approach is to measure the impact on the loss function of each weight. This can be done easily by looking at the gradient and second-order derivative of the loss. Another way to do it is to use L1 or L2 regularization and get rid of the low-magnitude weights. Removing whole neurons, layers or filters is called "structured pruning" and is more efficient when it comes to inference speed.
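For illustration, here is a small sketch of unstructured magnitude pruning with PyTorch's torch.nn.utils.prune, which zeroes out the lowest-magnitude (L1) weights of a layer; the 30% pruning amount is an arbitrary example value.
```python
# Sketch of unstructured magnitude pruning: zero out the 30% lowest-magnitude
# (L1) weights of a linear layer.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity after pruning: {sparsity:.2f}")

prune.remove(layer, "weight")  # make the pruning permanent
```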
-
Model quantization is about decreasing parameter precision, typically by moving from float (32 bits) to integer (8 bits). That's 4X model compression. Quantizing parameters tends to cause the model to deviate from its convergence point, so it is typical to fine-tune it with additional training data to keep model performance high. We call this "Quantization-aware training". When we avoid this last step, it is called "Post training quantization", and additional heuristic modifications to the weights can be performed to help performance.
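The float32 -> int8 mapping itself boils down to a scale and a zero-point. A toy sketch of that arithmetic on a single random weight tensor (values and tensor size are illustrative):
```python
# Toy post-training quantization of one weight tensor: map float32 values to
# int8 with a scale and zero-point, then dequantize to see the rounding error.
import torch

w = torch.randn(4, 4)

qmin, qmax = -128, 127
scale = (w.max() - w.min()) / (qmax - qmin)
zero_point = qmin - torch.round(w.min() / scale)

w_int8 = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax).to(torch.int8)
w_dequant = (w_int8.float() - zero_point) * scale

print("max abs error:", (w - w_dequant).abs().max().item())
# 32-bit -> 8-bit storage is the 4x compression mentioned above.
```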
-
Low-rank decomposition comes from the fact that neural network weight matrices can be approximated by products of low-dimension matrices. An N x N matrix can be approximately decomposed into a product of two N x 1 matrices (a rank-1 approximation). That's an O(N^2) -> O(N) space complexity gain!
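A quick sketch of the idea with a rank-r truncated SVD in PyTorch; the matrix size and rank here are illustrative, and a random matrix compresses much worse than a real, approximately low-rank weight matrix would.
```python
# Sketch: approximate an N x N weight matrix with a rank-r factorization via SVD.
# Storage drops from N*N to 2*N*r values (O(N^2) -> O(N) for small fixed r).
import torch

N, r = 1024, 8
W = torch.randn(N, N)

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]        # N x r
B = Vh[:r, :]               # r x N
W_approx = A @ B

print("relative error:", (torch.norm(W - W_approx) / torch.norm(W)).item())
```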
-
Knowledge distillation is about transferring knowledge from one model to another, typically from a large model to a smaller one. When the student model learns to produce similar output responses, that is response-based distillation. When the student model learns to reproduce similar intermediate layers, it is called feature-based distillation. When the student model learns to reproduce the interaction between layers, it is called relation-based distillation.
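Response-based distillation, for instance, typically mixes a KL-divergence term on softened teacher/student logits with the usual cross-entropy loss. A minimal sketch (the temperature T and mixing weight alpha are illustrative hyperparameters, and the random logits stand in for real model outputs):
```python
# Sketch of response-based distillation: the student matches the teacher's
# softened output distribution via KL divergence plus the usual CE loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```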
-
Lightweight model design is about using knowledge from empirical results to design more efficient architectures. That is probably one of the most used methods in LLM research.
-
Quantization
https://huggingface.co/docs/optimum/concept_guides/quantization
-
A Guide to Quantization in LLMs
https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/
-
Quantization in LLMs: Why Does It Matter?
https://blog.dataiku.com/quantization-in-llms-why-does-it-matter
-
What are Quantized LLMs?
-
The LLM Revolution: Boosting Computing Capacity with Quantization Methods
-
Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ) https://www.maartengrootendorst.com/blog/quantization/
-
Quantization and LLMs - Condensing Models to Manageable Sizes https://www.exxactcorp.com/blog/deep-learning/what-is-quantization-and-llms
-
Best LLM quantization (accuracy and speed)
https://scifilogic.com/best-llm-quantization-accuracy-and-speed/
-
Serving Quantized LLMs on NVIDIA H100 Tensor Core GPUs
https://www.databricks.com/blog/serving-quantized-llms-nvidia-h100-tensor-core-gpus
-
New Tutorial on LLM Quantization w/ QLoRA, GPTQ and Llamacpp, LLama 2
-
How to make your LLMs lighter with GPTQ quantization
https://bdtechtalks.com/2023/11/08/llm-quantization-gptq/
-
Model Quantization with 🤗 Hugging Face Transformers and Bitsandbytes Integration
-
How to Quantize an LLM with GGUF or AWQ
-
Effective Post-Training Quantization for Large Language Models
-
Overview of natively supported quantization schemes in 🤗 Transformers
https://huggingface.co/blog/overview-quantization-transformers
-
Making LLMs lighter with AutoGPTQ and transformers
GPTQ blogpost – gives an overview on what is the GPTQ quantization method and how to use it.
https://huggingface.co/blog/gptq-integration
https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing
-
bitsandbytes 4-bit quantization blogpost - This blogpost introduces 4-bit quantization and QLoRA, an efficient finetuning approach.
-
A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes
bitsandbytes 8-bit quantization blogpost - This blogpost explains how 8-bit quantization works with bitsandbytes.
https://huggingface.co/blog/hf-bitsandbytes-integration
Basic usage Google Colab notebook for bitsandbytes - This notebook shows how to use 4-bit models in inference with all their variants, and how to run GPT-neo-X (a 20B parameter model) on a free Google Colab instance.
https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing
-
Comparing the Performance of LLMs: A Deep Dive into Roberta, Llama 2, and Mistral for Disaster Tweets Analysis with Lora https://huggingface.co/blog/Lora-for-sequence-classification-with-Roberta-Llama-Mistral
-
Introduction to Quantization cooked in 🤗 with 💗🧑🍳
Merve's blogpost on quantization - This blogpost provides a gentle introduction to quantization and the quantization methods supported natively in transformers.
-
Democratizing LLMs: 4-bit Quantization for Optimal LLM Inference
-
Quantize any LLM with GGUF and Llama.cpp, https://www.youtube.com/watch?v=wxQgGK5K0rE&ab_channel=AIAnytime
-
Quanto: a pytorch quantization toolkit https://huggingface.co/blog/quanto-introduction
-
Quantize LLMs with AWQ: Faster and Smaller Llama 3
https://www.youtube.com/watch?v=OMkyocVyEpQ&ab_channel=AIAnytime
-
Half Quadratic Quantization (HQQ)
HQQ is a fast and accurate model quantizer that skips the need for calibration data. It's super simple to implement (just a few lines of code for the optimizer). It can crunch through quantizing the Llama2-70B model in only 4 minutes!
Supported Models
LLMs: Llama (Hugging Face + vLLM) 🦙, Mistral (Hugging Face), Mixtral-8x7B (Hugging Face), Phi + Phi_opt (Hugging Face)
Vision: ViT-CLIP (timm) 🖼️
https://huggingface.co/posts/macadeliccc/282259361762056
AutoHQQ: https://colab.research.google.com/drive/1cG_5R_u9q53Uond7F0JEdliwvoeeaXVN?usp=sharing
https://huggingface.co/macadeliccc/Nous-Hermes-2-Mixtral-8x7B-DPO-HQQ https://mobiusml.github.io/hqq_blog/
https://github.com/mobiusml/hqq
-
Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval
https://huggingface.co/blog/embedding-quantization
Example from the blog post: encode with mixedbread-ai/mxbai-embed-large-v1 via SentenceTransformer.encode, then compress with quantize_embeddings (a minimal sketch follows the table below).
Vector Databases | Support |
---|---|
Faiss | Yes |
USearch | Yes |
Vespa AI | Yes |
Milvus | Yes |
Qdrant | Through Binary Quantization |
Weaviate | Through Binary Quantization |
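A minimal usage sketch, assuming sentence-transformers >= 2.6 (which ships the quantize_embeddings helper referenced in the blog post); exact argument names and defaults may differ across versions.
```python
# Hedged sketch of embedding quantization with sentence-transformers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embeddings = model.encode(["quantize me", "retrieval gets cheaper"])

# int8 and binary precisions trade a little accuracy for much cheaper storage
int8_embeddings = quantize_embeddings(embeddings, precision="int8")
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
print(embeddings.shape, int8_embeddings.dtype, binary_embeddings.dtype)
```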
-
GaLore: Advancing Large Model Training on Consumer-grade Hardware
https://huggingface.co/blog/galore
Authors' Reporting: https://x.com/AnimaAnandkumar/status/1765613815146893348?s=20
-
A Chatbot on your Laptop: Phi-2 on Intel Meteor Lake
https://huggingface.co/blog/phi2-intel-meteor-lake
OpenVINO: https://github.com/openvinotoolkit/openvino, documentation
optimum-intel: https://github.com/huggingface/optimum-intel, documentation
Developer resources from Intel and Hugging Face
Sharding
How to shard LLMs locally, https://youtu.be/F0pkj2trRcI?si=zAsZmmbhsp1wqlBe
-
Right on Track: NVIDIA Open-Source Software Helps Developers Add Guardrails to AI Chatbots (NeMo)
NeMo: https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/
-
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
-
Introducing the Chatbot Guardrails Arena
-
Learn RAG From Scratch – Python AI Tutorial from a LangChain Engineer
https://www.youtube.com/watch?v=sVcwVQRHIc8&ab_channel=freeCodeCamp.org
-
LLM Chat App in Python w/ Ollama-py and Streamlit, https://www.youtube.com/watch?v=ZHZKPmzlBUY&ab_channel=Decoder
-
Claude 3 Opus in ML Pipelines (Python & Kubeflow Example)
https://www.youtube.com/watch?v=VEjlxzvEV88&ab_channel=NodematicTutorials
-
Visual Question Answering with IDEFICS 9B Multimodal LLM, https://www.youtube.com/watch?v=hyP1ekLKtiI&ab_channel=AIAnytime
-
Outfit Anyone: A Diffusion Project for Virtual Try On, https://www.youtube.com/watch?v=V21GfgSFuGk&ab_channel=AIAnytime
-
Oncology RAG App - Powered by Meditron 7B Medical LLM, https://www.youtube.com/watch?v=kvbjB-q5Dss&ab_channel=AIAnytime
-
Investment Banker RAG Chatbot using Intel's Neural Chat LLM, https://www.youtube.com/watch?v=d9wCHH3iknM&ab_channel=AIAnytime
-
Build A Talking AI with LLAMA 3 (Python tutorial) using AssemblyAI, using LLAMA 3 as the language model with Ollama, and ElevenLabs for text-to-speech
https://www.youtube.com/watch?v=6ghNRkVNODo&ab_channel=AssemblyAI
-
How to Build a RAG Application for Multi-Speaker Audio Data AssemblyAI, https://www.youtube.com/watch?v=Rh9Jl0hJSws&ab_channel=AssemblyAI
-
Extract Table Info From PDF & Summarise It Using Llama3 via Ollama | LangChain, https://www.youtube.com/watch?v=hQu8WN8NuVg&ab_channel=DataScienceBasics
-
Extract Image & Image Info From PDF & Use LlaVa via Ollama To Explain Image | LangChain, https://www.youtube.com/watch?v=Ad-87wzJouk&ab_channel=DataScienceBasics
-
Deploy RAG App built using Create Llama on Vercel: Free and Easy Method, https://www.youtube.com/watch?v=euYa4iesOm8&ab_channel=AIAnytime
-
Create a LlamaIndex App with Create Llama: No Code tool for RAG, https://www.youtube.com/watch?v=JkGU3d8IM1c&ab_channel=AIAnytime
-
AutoLLM: Ship RAG based LLM Apps and API in Seconds, https://www.youtube.com/watch?v=iTGbwD-sSxM&ab_channel=AIAnytime
-
Query Your CSV using LIDA: Automatic Generation of Visualizations with LLMs, https://www.youtube.com/watch?v=U9K1Cu45nMQ&ab_channel=AIAnytime
-
Chat with Data App: RAG using Mistral 7B, Haystack, and Chainlit, https://www.youtube.com/watch?v=01_2-Dy57ys&ab_channel=AIAnytime
-
Building LLM Applications with Langchain, https://www.youtube.com/watch?v=HmZzbhL8Tf8&list=PLfaIDFEXuae2Zb0phFLWAxgrJT7f416xq&pp=iAQB
-
RAG Implementation using Mistral 7B, Haystack, Weaviate, and FastAPI, https://www.youtube.com/watch?v=C5mqILmVUEo&ab_channel=AIAnytime
-
Let's Build an AI News Anchor Generator App using Generative AI, https://www.youtube.com/watch?v=cddahTnCo10&ab_channel=AIAnytime
-
Screenshot to Code Generation: 10x Faster Frontend/UI Development, https://www.youtube.com/watch?v=52Xq6AaRnT4&ab_channel=AIAnytime
-
ComfyUI GUI for Image and Video Generation: Google Colab Setup, https://www.youtube.com/watch?v=PYEnK_iQeZU&ab_channel=AIAnytime
-
Build a Containerized Transcription API using Whisper Model and FastAPI, https://www.youtube.com/watch?v=NU406wZz1eU&ab_channel=AIAnytime
-
Build Your RAG-based ChatGPT Web App with Azure: LawGPT Use Case Tutorial, https://www.youtube.com/watch?v=wmfAJWwyaQA&ab_channel=AIAnytime
-
Creating a Veterinary Chatbot using Llama 2: Harnessing Gen AI for Pet Care, https://www.youtube.com/watch?v=Iyzvka711pc&ab_channel=AIAnytime
-
Build Your API for Llama 2 on AWS: Lambda Function and API Gateway, https://www.youtube.com/watch?v=Nlo7WclRBXc&t=512s&pp=ygUGb2xsYW1h
-
Deploy Llama 2 for your Entire Organisation, https://www.youtube.com/watch?v=Ror2xOOA-VE&ab_channel=TrelisResearch
-
Install and Run Mistral 7B on AWS, https://www.youtube.com/watch?v=aSh66tG1B5o&pp=ygUNb2xsYW1hIG9uIEFXUw%3D%3D
-
Deploy Llama 2 on AWS SageMaker using DLC (Deep Learning Containers), https://www.youtube.com/watch?v=rQq1m2aJ_fk&ab_channel=AIAnytime
-
Enterprise Chat App using Azure Cognitive Search and Azure OpenAI: End-to-End Tutorial, https://www.youtube.com/watch?v=hkSnPhhjm1Y&ab_channel=AIAnytime
-
Containerizing LLM-Powered Apps: Part 1 of the Chatbot Deployment, https://www.youtube.com/watch?v=7CeAJ0EbzDA&ab_channel=AIAnytime
-
Deploy LLM Powered Apps on Azure App Service: Part 2 of the Chatbot Deployment, https://www.youtube.com/watch?v=vYIlhgVHAls&ab_channel=AIAnytime
-
Serve a Custom LLM for Over 100 Customers, https://www.youtube.com/watch?v=1TU9ZrZhqw0&ab_channel=TrelisResearch
-
Long Context Summarization, https://www.youtube.com/watch?v=I83TH4x9keo&ab_channel=TrelisResearch
-
Install OpenUI Locally on Windows - Create User Interface Using Text or image with AI, https://www.youtube.com/watch?v=6S57NYqaO4g&ab_channel=FahdMirza
-
Function Calling Datasets, Training and Inference, https://www.youtube.com/watch?v=hHn_cV5WUDI&ab_channel=TrelisResearch
-
How to Build an OpenAI LLM on a Private Network with AWS, https://www.youtube.com/watch?v=6LGGQERxrQo&ab_channel=SingleStore
-
Amazon Bedrock: Generative AI on AWS without the Headaches, https://www.youtube.com/watch?v=Yj_7FuFgPyI
-
Chat With Documents | Fully Managed RAG on Amazon Bedrock | NO-CODE, https://www.youtube.com/watch?v=Qtn_LDVW6Ww&ab_channel=DataScienceBasics
-
FULLY LOCAL Mistral AI PDF Processing Hands-on Tutorial, https://www.youtube.com/watch?v=wZDVgy_14PE&pp=ygUNb2xsYW1hIG9uIEFXUw%3D%3D
-
PrivateGPT 2.0 - FULLY LOCAL Chat With Docs (PDF, TXT, HTML, PPTX, DOCX, and more), https://www.youtube.com/watch?v=XFiof0V3nhA&ab_channel=MatthewBerman
-
AutoLLM: Create RAG Based LLM Web Apps in SECONDS!, https://www.youtube.com/watch?v=kPaiZe_qD34&ab_channel=WorldofAI
-
Use OpenChat and LM Studio with LLMWare, https://www.youtube.com/watch?v=h2FDjUyvsKE&ab_channel=llmware
-
Compare Embedding Models for Side by Side Queries Using Postgres with LLMWare, https://www.youtube.com/watch?v=Bncvggy6m5Q&ab_channel=llmware
-
AutoGen Studio with 100% Local LLMs (LM Studio), https://www.youtube.com/watch?v=ob45YmYD2KI&ab_channel=PromptEngineering
-
This is a lightweight app using the Web Research Retriever. It uses LangChain to search and chat over web data in a Streamlit app.
-
Your LLM Powered Financial Analyst, https://www.youtube.com/watch?v=JeruKKuMxCg&ab_channel=PromptEngineering
-
How I built the FASTEST Multiple CSV Chat App using LLAMA3+GROQ+PANDASAI
https://www.youtube.com/watch?v=FiCsuN7aPF8&ab_channel=DataInsightEdge
-
How to Create a Web UI for AutoGen by Using Panel
-
Build a Full Stack AI Web App: AI Website Reviewer With Python, Django, Voiceflow, JS & Tailwind
https://www.youtube.com/watch?v=tN9iVDppx2A&ab_channel=CodeWithTomi
-
Create Full Function UI for AutoGen Powered by Panel (Human Input Enabled)
-
AutoGen + Function Calling + Open Source LLMs, Here is How
https://www.youtube.com/watch?v=UIBerUGqHjc&ab_channel=YeyuLab
-
Use Open Source LLMs in AutoGen powered by Fireworks AI, without GPU/CPU
https://www.youtube.com/watch?v=HN96PTdiseo&ab_channel=YeyuLab
-
Speech-to-Code - The Future of Programming with AI? | feat Claude 3 Haiku
https://www.youtube.com/watch?v=gW0RmrhoSyA&ab_channel=AllAboutAI
-
Make an offline GPT voice assistant in Python
-
Build and Run a Medical Chatbot using Llama 2 on CPU Machine: All Open Source, https://www.youtube.com/watch?v=kXuHxI5ZcG0&ab_channel=AIAnytime
-
Chat With Websites Using ChainLit / Streamlit, LangChain, Ollama & Mistral 🧠, https://www.youtube.com/watch?v=FZrkm0vaYYQ&ab_channel=DataScienceBasics
-
LocalGPT API: Serve Multiple Users At the Same time, https://www.youtube.com/watch?v=z9wDKwgQojM&ab_channel=PromptEngineering
-
Deploy and Use any Open Source LLMs using RunPod, https://www.youtube.com/watch?v=nHuHGoLSXb0&ab_channel=AIAnytime
-
Function Calling using Open Source LLM (Mistral 7B), https://www.youtube.com/watch?v=MQmfSBdIfno&t=337s&ab_channel=AIAnytime
-
4 LLM frameworks to build AI apps with voice data
- LeMUR: https://www.assemblyai.com/docs/getting-started/apply-llms-to-audio-files
- LangChain: https://www.langchain.com/langchain
- LlamaIndex: https://www.llamaindex.ai/
- Haystack: https://haystack.deepset.ai/
https://www.youtube.com/watch?v=wdF-0CGkoeQ&ab_channel=AssemblyAI
-
Unmatched Accuracy and Lightning Speed in Python for Speech Recognition by AssemblyAI
https://www.youtube.com/watch?v=5Uw-r36XQYk&ab_channel=AssemblyAI
-
vector search, RAG, and Azure AI search,
https://www.youtube.com/live/vuOA13Y_Qzk?si=bT6zY4piPt_yUn_Q
https://github.com/pamelafox/vector-search-demos
-
META LLAMA 3 8B INSTRUCT LLM – How to Create Medical Chatbot with LlamaIndex FastEmbed Colab Demo
https://www.youtube.com/watch?v=yGk_eVQdjSU&ab_channel=RitheshSreenivasan https://colab.research.google.com/drive/1LgYtDgJlseOe78fauU8DXMawShL8YiQg?usp=sharing
-
Manage vector databases and long term memory in Flowise, AI vector tools Review part 1
https://youtu.be/d7nAcshOe4w?si=kArGQ_Ua8pFdvzFy
-
Learn how to use LlamaIndex with LangChain in Flowise, LlamaIndex vs LangChain part 2,
-
Create a Web Interface for your LLM in Python
https://huggingface.co/blog/Alex1337/create-a-web-interface-for-your-llm-in-python
Turns Data and AI algorithms into production-ready web applications in no time.
-
I made AI to auto categorise 10000 comments on Google Sheet with 0$
-
Build a medical RAG app using Biomistral, Qdrant and Llama.cpp
-
AnythingLLM - Chat with Any Docs with full Privacy|Runs Offline|FREE LOCAL LLMs + NO Code
https://www.youtube.com/watch?v=J6NJCg-hI9c&ab_channel=DataInsightEdge
https://github.com/Mintplex-Labs/anything-llm
-
Steerable AI with Pinecone + Semantic router, https://youtu.be/qjRrMxT20T0?si=hQj7YxUJAj2Y2unV
-
Constitutional AI with Open LLMs
https://huggingface.co/blog/constitutional_ai
https://github.com/huggingface/alignment-handbook/tree/main/recipes/constitutional-ai
-
Stop paying for ChatGPT with these two tools | LMStudio x AnythingLLM
https://www.youtube.com/watch?v=-Rs8-M-xBFI&ab_channel=TimCarambat
-
Create Chat UI Using ChainLit, LangChain, Ollama & Gemma 🧠 https://www.youtube.com/watch?v=n9AMtXLveMs&t=11s&ab_channel=DataScienceBasics
-
LangSmith For Beginners | Must know LLM Evaluation Platform 🔥 https://www.youtube.com/watch?v=FgG-trkAMwU&ab_channel=DataScienceBasics
-
Create-Llama: deploy LlamaIndex RAG App to Vercel
-
PhiData: How to Seamlessly Integrate AI into Your Application
https://www.youtube.com/watch?v=fLGj63fiYfM&ab_channel=MervinPraison
-
Taipy: Creating Production-Grade Apps with Taipy vs Streamlit, https://www.youtube.com/watch?v=MgAIrGxnN-8&ab_channel=WorldofAI
-
Anthropic Claude API: Supercharge Your AI App with Large Context, https://www.youtube.com/watch?v=Wtt9tuO8UPY&ab_channel=MervinPraison
-
Build an AI Applicant Tracking System(ATS) for Your Resumes with LLMs|Get JOBS 100X FASTER, https://www.youtube.com/watch?v=7lP7fune0Gw&ab_channel=DataInsightEdge
-
Build & Chat with Invoices using Google’s Gemini Pro Vision | Streamlit + Use Case | Tutorial, https://www.youtube.com/watch?v=7_926xGDbDY&ab_channel=DataInsightEdge
-
Chat with Multiple Documents using Gemini Pro with LangChain| Step-by-Step Tutorial #ai #llm, https://www.youtube.com/watch?v=UXLWLFOB0Xo&ab_channel=DataInsightEdge
-
GEMINI Pro with LangChain | Chat, MultiModal and Chat with your Documents, https://www.youtube.com/watch?v=7h8ZHSkAkas&ab_channel=PromptEngineering
-
Gemini Pro + LangChain - Chains, Mini RAG, PAL + Multimodal, https://www.youtube.com/watch?v=G3-YOEVg-xc&ab_channel=SamWitteveen
-
AutoVector | Chat With Your Both SQL and Free Text Data Simultaneously, https://www.youtube.com/watch?v=KWidGdSi2AE&ab_channel=MG
-
Visual Question Answering with Google Deplot #huggingface
https://www.youtube.com/watch?v=n_h_XWM2vzg&ab_channel=SuperLazyCoder
-
Google’s Med-Gemini Multimodal LLM: The Best Medical AI Model https://www.youtube.com/watch?v=GA5i5M_Bh50&ab_channel=AIAnytime
https://arxiv.org/pdf/2404.18416
-
Build an LLM powered chrome extension, https://youtu.be/9RKXffJsJhs?si=Ly_ocxdSttphdhKk
-
Claude 3 Function Calling: How to Integrate your own Software?, https://www.youtube.com/watch?v=LuBROahHvfo&ab_channel=MervinPraison
-
Anthropic Tools for Seamless Automation: 3 Use Cases Explained
https://www.youtube.com/watch?v=nv_Ghb5i1jU&ab_channel=MervinPraison
https://mer.vin/2024/04/anthropic-tools-stock-price-integration/
-
Building Production-Grade LLM Apps
https://www.youtube.com/watch?v=fo0F-DAum7E&ab_channel=DeepLearningAI
-
Images Interpolation with Stable Diffusion
This notebook shows how to use Stable Diffusion to interpolate between images. Image interpolation using Stable Diffusion is the process of creating intermediate images that smoothly transition from one given image to another, using a generative model based on diffusion.
https://huggingface.co/learn/cookbook/stable_diffusion_interpolation
-
Building Google's Dramatron with LangGraph JS & Anthropic's Claude 3
https://www.youtube.com/watch?v=alHnQjyn7hg&ab_channel=LangChain
-
AutoGen + Knowledge Graph + GPT-4 = Graph Chatbot
-
AutoGen + LangChain + SQLite + Schema Function = Super SQL Chatbot
-
Microsoft PHI-2 + Hugging Face + Langchain = Super Tiny Chatbot
-
Real-Time Car Speed Tracking & Object Classification Revealed (Not LLM but Yolo v8.1) https://www.youtube.com/watch?v=-Q81tuPB0Ok&ab_channel=MervinPraison
-
How to Build AI Chatbot with Hugging Face Quickly and Easily using blenderbot-400M-distill
https://www.youtube.com/watch?v=FXbSdspVtNE&ab_channel=FahdMirza
-
Automatically extract phone call insights with LLMs and Python | Full tutorial
https://www.youtube.com/watch?v=5ZII6vvRFes&ab_channel=AssemblyAI
https://github.com/AssemblyAI-Examples/extract-call-insights
https://www.assemblyai.com/blog/extract-call-insights-llms-python/
-
Create AI News Channel for FREE in Minutes | Make Money with AI News (CAPCUT 2024)
https://www.youtube.com/watch?v=K6w3vU2_i3U&ab_channel=SkillCurb
-
Voice to Text Transcription with CPU-Friendly AI (Whisper CPP)
https://www.youtube.com/watch?v=YG5u5AOU9MQ&ab_channel=llmware
-
Voice Transcription with CPU Friendly AI Models Example (Greatest Speeches of 20th Century)
https://www.youtube.com/watch?v=5y0ez5ZBpPE&ab_channel=llmware
-
How does Groq LPU work? (w/ Head of Silicon Igor Arsovski!), https://www.youtube.com/watch?v=WQDMKTEgQnY&ab_channel=AleksaGordi%C4%87-TheAIEpiphany
-
Getting Started with Groq API | Making Near Real Time Chatting with LLMs Possible
https://www.youtube.com/watch?v=S53BanCP14c&ab_channel=PromptEngineering
-
Groq API: Make your AI Applications Lighting Speed, https://www.youtube.com/watch?v=vKWtFVqr6Wc&t=96s&ab_channel=MervinPraison
-
Build the fastest AI chatbot with memory using Groq, gradio, Langchain
https://github.com/InsightEdge01/GroqchatbotwithMemory/tree/main
- Fastest talking AI I could build deepgram + groq*
https://youtu.be/J2sbC8X5Pp8?si=6L4sqm2izVXkDgR7
https://aura-tts-demo.deepgram.com
Code: https://github.com/gkamradt/QuickAgent
-
Create table question answering with Gen AI LLMs @HuggingFace
https://www.youtube.com/watch?v=qZCmXY_-on8&ab_channel=SuperLazyCoder
https://colab.research.google.com/drive/1Iz_aoskOMYqdFWfpwk5YJWuBPfJkGxao?usp=sharing
-
Build a real AI model that can try any cloth
https://www.youtube.com/watch?v=C94pTaKoLbU&ab_channel=AIJason
-
Huggingface docs, https://huggingface.co/docs
-
Hugging Face Text Generation Inference available for AWS Inferentia2
https://huggingface.co/blog/text-generation-inference-on-inferentia2
This tutorial shows how easy it is to deploy a state-of-the-art LLM, such as Zephyr 7B, on AWS Inferentia2 using Amazon SageMaker. Zephyr is a 7B fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available and synthetic datasets using Direct Preference Optimization (DPO), as described in detail in the technical report. The model is released under the Apache 2.0 license, ensuring wide accessibility and use.
The following steps are performed: 1. Set up the development environment 2. Retrieve the TGI Neuronx image 3. Deploy Zephyr 7B to Amazon SageMaker 4. Run inference and chat with the model
-
Custom architectures with HuggingFace 🤗
https://huggingface.co/blog/not-lain/custom-architectures-with-huggingface
-
Train Custom Models on Hugging Face Spaces with AutoTrain SpaceRunner
-
Pushing Models and Adapters to HuggingFace | Free Notebook,
https://www.youtube.com/watch?v=Kd4JL7GnR8Y&ab_channel=TrelisResearch
https://github.com/TrelisResearch/install-guides/blob/main/Pushing_to_Hub.ipynb
-
Deep Dive: Hugging Face models on AWS AI Accelerators
https://www.youtube.com/watch?v=66JUlAA8nOU&ab_channel=JulienSimon
-
A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard
-
The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models https://huggingface.co/blog/leaderboards-on-the-hub-hallucinations
-
Creating open machine learning datasets? Share them on the Hugging Face Hub! https://huggingface.co/blog/researcher-dataset-sharing
-
Deploy Embedding Models with Hugging Face Inference Endpoints https://huggingface.co/blog/inference-endpoints-embeddings
-
Building a self-corrective coding assistant from scratch https://youtu.be/MvNdgmM7uyc?si=b78VIhFapFo2U8NV
-
Pollen-Vision: Unified interface for Zero-Shot vision models in robotics
https://huggingface.co/blog/pollen-vision
-
Experiments with Bitnet 1.5
https://arxiv.org/abs/2402.17764
https://github.com/joey00072/ohara/tree/master/experiments/bitnet
-
Powerful ASR + diarization + speculative decoding with Hugging Face Inference Endpoints
- ML pipeline with Pandas and Sklearn, https://www.youtube.com/watch?v=Zpy9npXnW00&ab_channel=RicardoCalix
- LangChain for LLM Application Development, https://www.deeplearning.ai/short-courses/langchain-for-llm-application-development/
- How would you build an enterprise solution for AutoML?
Here are the different components to consider (a minimal end-to-end sketch follows the list):
-
Frontend client: we need to allow the user to input parameters to set up the model training and start the process. The user should be able to visualize the results of a specific run along with its related metrics. We could also provide a way to compare training runs for a better model selection process.
-
A backend server: this is where the logic behind the frontend is implemented. It connects to a Run Metadata database that captures the different run parameters and metrics. This database should contain all the information necessary to restart identical training runs. MLflow is an excellent example of a training run management system.
-
A message queue for training requests: Because we may have multiple users submitting training requests simultaneously, we need to buffer those requests. If we have a cap on the number of training servers we can use simultaneously, it is better to buffer requests until enough machines are available for the next requests.
-
An orchestration scheduler: The orchestration system can plan the various stages and restart one in case of failure. Airflow and Kubeflow are examples of such a system. The scheduler will monitor the message queue and trigger a training pipeline once a user request is received.
-
A training pipeline: The different steps are captured in a DAG and are handled by the orchestration workers.
-
The Data pull module: we need to establish a logic to pull the correct data from the feature store. Once the data is pulled, it must be validated to ensure that it follows the requirements for the particular training run and is consistent with features metadata.
-
The Data processing module: once the data is ready, we need, at the very least, to carve out a validation set for model performance evaluation.
-
The Model selection module: this is where most of the processing time will be spent. That module handles the model selection process, including choosing the ML model, the hyperparameters, the model architecture, and performing the feature selection. The result of this module is a trained optimal model.
-
The model validation module: after training the model, we need to capture the different validation metrics that will help the user make an educated decision about the resulting model. Beyond ML metrics, we must capture information about hardware utilization, such as memory and CPU usage. We need to send the resulting metadata to the Run Metadata database.
-
The model push module: the resulting model needs to be pushed to a model registry along with its version number.
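Here is a minimal, self-contained sketch of the data processing, model selection, validation, and push steps from the list above, using scikit-learn; the synthetic dataset, search space, and the local model file are stand-ins for a real feature store, Run Metadata database, and model registry.
```python
# Sketch of the core training-pipeline modules (data split, model selection,
# validation, model push) under toy assumptions.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Data pull + processing modules: load features and carve out a validation set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection module: hyperparameter search producing a trained optimal model
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100], "max_depth": [None, 10]}, cv=3)
search.fit(X_train, y_train)

# Model validation module: metrics destined for the Run Metadata database
run_metadata = {"params": search.best_params_,
                "val_accuracy": accuracy_score(y_val, search.best_estimator_.predict(X_val))}
print(run_metadata)

# Model push module: persist the artifact (stand-in for a real model registry)
joblib.dump(search.best_estimator_, "model-v1.joblib")
```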
-
What is CI/CD/CT for machine learning
If you are working in a big tech company on ML projects, chances are you are working on some version of Continuous Integration / Continuous Deployment (CI/CD). It represents a high level of maturity in MLOps with Continuous Training (CT) at the top. This level of automation really helps ML engineers to solely focus on experimenting with new ideas while delegating repetitive tasks to engineering pipelines and minimizing human errors.
On a side note, when I was working at Meta, the level of automation was of the highest degree. That was simultaneously fascinating and quite frustrating! I had spent so many years learning how to deal with ML deployment and management that I had learned to like it. I was becoming good at it, and suddenly all that work seemed meaningless as it was abstracted away in some automation. I think this is what many people are feeling when it comes to AutoML: a simple call to a "fit" function seems to replace what took years of work and experience for some people to learn.
There are many ways to implement CI/CD/CT for Machine Learning but here is a typical process:
-
The experimental phase - The ML Engineer wants to test a new idea (let's say a new feature transformation). He modifies the code base to implement the new transformation, trains a model, and validates that the new transformation indeed yields higher performance. The resulting outcome at this point is just a piece of code that needs to be included in the master repo.
-
Continuous integration - The engineer then creates a Pull Request (PR) that automatically triggers unit testing (like a typical CI process) but also triggers the instantiation of the automated training pipeline to retrain the model, potentially test it through integration tests or test cases and push it to a model registry. There is a manual process for another engineer to validate the PR and performance reading of the new model.
-
Continuous deployment - Activating a deployment triggers a canary deployment to make sure the model fits in a serving pipeline and runs an A/B test experiment to test it against the production model. After satisfactory results, we can propose the new model as a replacement for the production one.
-
Continuous training - as soon as the model enters the model registry, it starts to go stale, so you might want to activate recurring training right away. For example, each day the model can be further fine-tuned with the new training data of the day, deployed, and the serving pipeline is rerouted to the updated model.
The Google Cloud documentation is a good read on the subject:
-
Machine Learning Engineering for Production (MLOps)
-
Let's Learn LangChain! // Applied AI Workshops March 2024
https://github.com/justingrammens/LetsLearnLangChain
https://www.youtube.com/watch?v=QT3wALFDZBo&ab_channel=AppliedAI
-
Build Real-World Machine Learning Project: Step-by-Step Guide using FastAPI, DVC & Poetry
https://www.youtube.com/watch?v=ug1FA7qzWSc&ab_channel=VenelinValkov
https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain
-
- Have You Picked the Wrong AI Agent Framework?
https://www.youtube.com/watch?v=jLVl5V8roMU&ab_channel=MattWilliams
-
"Make Agent 10x cheaper, faster & better?" - LLM System Evaluation 101
https://www.youtube.com/watch?v=MGkByeDm-90&ab_channel=AIJason
-
“Wait, this Agent can Scrape ANYTHING?!” - Build universal web scraping agent
https://www.youtube.com/watch?v=dSX5eoD4-u4&ab_channel=AIJason
-
Boost Gmail Efficiency with AI: Python Tutorial (CrewAI, LangChain, LangGraph)
https://www.youtube.com/watch?v=o4-4NvrcOvs&ab_channel=AIFORDEVS
https://github.com/joaomdmoura/crewAI
-
How to do Web Scraping in 2024 with LLM Agents
https://www.youtube.com/watch?v=od6AaKhKYmg&ab_channel=PromptEngineering
-
OpenAGI aims to make human-like agents accessible to everyone, thereby paving the way towards open agents and, eventually, AGI for everyone.
-
Qwen-Agent: Build Autonomous Agents with The Best Open Weight Model
-
How to Implement a Simple UI for CrewAI applications
https://www.youtube.com/watch?v=gWrqfnTGtl8&ab_channel=YeyuLab
-
Build Anything with Llama 3 Agents CrewAI, Ollama
https://www.youtube.com/watch?v=i-txsBoTJtI&ab_channel=DavidOndrej
-
CrewAI RAG: How I Created AI Assistants to Run My News Agency, https://www.youtube.com/watch?v=77xSbC-9yn4&ab_channel=MervinPraison
-
I built a AI Investment Property Bot in 15 Minutes (to automate deals)
https://www.youtube.com/watch?v=2IGbkWm0cNo&ab_channel=NicholasRenotte
-
Automate Recruitment Process using AI Agents: Crew AI
https://www.youtube.com/watch?v=OQJ4gp70Zg0&ab_channel=MervinPraison
-
Crew AI 2.0: How AI Agents Perform Healthcare Research Simultaneously
https://www.youtube.com/watch?v=3siqEzMeWdM&ab_channel=MervinPraison
-
CrewAI: AI-Powered LinkedIn Posts Creator Agents using Azure, Gemini & Mistral
https://www.youtube.com/watch?v=U5TAI_SGllA&ab_channel=MervinPraison
-
CrewAI agents for stock analysis (works with local Ollama LLMs), https://youtu.be/U_Sg3Odf1vk?si=gzDboL0gLYTPn7Q6
-
CrewAI + Claude 3 Haiku, https://www.youtube.com/watch?v=K0mb-pXdqsI&ab_channel=SamWitteveen
Sequential Colab: https://colab.research.google.com/drive/1npc4TpcqC_LxKaU8Nv9HQUvtn02QG7pv?usp=sharing
Hierarchical Colab: https://colab.research.google.com/drive/1hn6XJwnGUJHlkHAYzrtys2-i7Eo3Tq3j?usp=sharing
-
How to Create an Interactive Web UI for CrewAI Applications By Panel
https://www.youtube.com/watch?v=pODI1SWTVeo&ab_channel=YeyuLab
-
CrewAI - A Smartly Designed Multi-agent Framework for LLM App Development https://www.youtube.com/watch?v=tKYr0fgkSPo&ab_channel=YeyuLab
-
Creating an AI Agent with LangGraph Llama 3 & Groq
https://www.youtube.com/watch?v=lvQ96Ssesfk&ab_channel=SamWitteveen
https://colab.research.google.com/drive/1WemHvycYcoNTDr33w7p2HL3FF72Nj88i?usp=sharing
-
Llama3 + CrewAI + Groq = Email AI Agent
https://www.youtube.com/watch?v=1D4YoAUpjlg&ab_channel=SamWitteveen
https://colab.research.google.com/drive/1eT82D9g3bp1-uf4HDv_PPWaK2keZKS4K?usp=sharing
-
CrewAI - Building a Custom Crew
https://www.youtube.com/watch?v=2Uar1fDKYpQ&ab_channel=SamWitteveen
-
AI Agents with LangChain, CrewAI and Llama 3: Build an AI Tweet Writing App | Step-by-Step Tutorial
https://www.youtube.com/watch?v=fzqSP3vTZ4E&ab_channel=VenelinValkov
-
CrewAI with Open LLM (Llama 3) using Groq API: AI Agents for Data Analysis with Custom Tools
https://www.youtube.com/watch?v=N5sos1X30Rw&ab_channel=VenelinValkov
-
Crew AI: The Future of Travel Planning with AI Agents!
https://www.youtube.com/watch?v=zAjcPRjnORg&list=PLYQsp-tXX9w6dCJBsgZfddg5pAuWXapU4&ab_channel=MervinPraison https://mer.vin/2024/01/crew-ai-travel-agents/
-
Create Illustrated Storybooks Instantly with Crew AI Agents! (Groq)
https://www.youtube.com/watch?v=vWukuS48RbY&ab_channel=MervinPraison https://mer.vin/2024/03/crewai-groq-create-story-books/
-
Crew AI: The Future of AI-Powered Research and Writing Agents!
https://www.youtube.com/watch?v=qFNge4IrERk&ab_channel=MervinPraison https://mer.vin/2024/01/crewai-example-code/
-
CrewAI Code Interpreter: How I Made AI Agents to Generate Execute Code (Vs AutoGen)
https://www.youtube.com/watch?v=DDDXO_Y_YAI&ab_channel=MervinPraison
https://mer.vin/2024/02/crewai-open-interpreter/
-
Build a Web App (GUI) for your CrewAI Automation (Easy with Python)
https://www.youtube.com/watch?v=vhbfs38XmKk&ab_channel=AlejandroAO-Software%26Ai
-
Python: Automating a Marketing Team with AI Agents | Planning and Implementing CrewAI
https://www.youtube.com/watch?v=LHXujQ19Euo&ab_channel=AlejandroAO-Software%26Ai
-
Build a Research Assistant AI Agent using CrewAI
-
CrewAI + Exa: Generate a Newsletter with Research Agents (Part 1)
https://www.youtube.com/watch?v=gXET04dJ66A&ab_channel=AlejandroAO-Software%26Ai
-
Llama Index AI Agents: How to Get Started for Beginners?
-
AutoGen + Custom Model + Gemma (or Any Model), Here is the Ultimate Solution https://www.youtube.com/watch?v=H0h78EBzz0o&ab_channel=YeyuLab
-
AutoGen + Ollama + Gemma: How to Create LLM Agents Locally
https://www.youtube.com/watch?v=bkBOuBxsxeM&t=1s&ab_channel=YeyuLab
-
Is Gemma Capable of Building Multi-agent Applications in AutoGen?
https://www.youtube.com/watch?v=L7ABsqsPN_A&ab_channel=YeyuLab
-
AutoGen + Function Calling + Open Source LLMs, Here is How
https://www.youtube.com/watch?v=UIBerUGqHjc&t=5s&ab_channel=YeyuLab
https://colab.research.google.com/drive/18p6j0R4fj9q7DnuiIIxEIGl_6fT4FkKV?usp=sharing
-
AutoGen Technique - Use Description Field to Manage the Conversation Between Multiple Agents
https://www.youtube.com/watch?v=2YACB_N2bI8&ab_channel=YeyuLab
-
How to Use Open Source LLMs in AutoGen Powered by vLLM
https://www.youtube.com/watch?v=ds032PYcpgs&ab_channel=YeyuLab
Colab notebook for AutoGen w/ GPT-4 https://colab.research.google.com/drive/1nq20eu_T3vFklHIv8781zfnW2d3BgISZ?usp=sharing
Colab notebook for AutoGen w/ Phi-2 https://colab.research.google.com/drive/1xee3xdatViM4pespvLVVOrHJ8sB1MgO5?usp=sharing
-
AI Agents with GPT-4 Turbo and CrewAI | Cryptocurrency Market Report with News
https://www.youtube.com/watch?v=Ev0uzdzesjU&ab_channel=VenelinValkov
https://github.com/curiousily/AI-Bootcamp
-
AutoGen + Panel Ep.3 - Web UI for Multi-agent with Document Retrieval
https://www.youtube.com/watch?v=98Ri4VVBP_8&t=432s&ab_channel=YeyuLab
-
How to Create a Web UI for AutoGen by Using Panel
https://www.youtube.com/watch?v=mFmPDyLlj1E&t=1s&ab_channel=YeyuLab
https://github.com/yeyu2/Youtube_demos
https://levelup.gitconnected.com/how-to-create-a-web-ui-for-autogen-132df43fb2ed
-
Building Agents: Copilot Streaming an Agentic Workflow w/ Fast Inference (Llama 3, Groq, LangGraph)
https://www.youtube.com/watch?v=YIdvcKHovjo&ab_channel=DeployingAI
-
Create AI Chatbot from Tabular Data using VectorShift AI Agent
https://www.youtube.com/watch?v=gFn2tINuKIU&ab_channel=MervinPraison
-
Build Generative AI Agents using Dialogflow CX and Vertex AI on GCP, https://www.youtube.com/watch?v=cDY8lm6vg7w&ab_channel=AIAnytime
-
AutoGen Studio UI 2.0: Easiest Way to Create Custom Agents, https://www.youtube.com/watch?v=KIvl-VY8H0Y&ab_channel=PromptEngineering
-
Development with Large Language Models Tutorial – OpenAI, Langchain, Agents, Chroma
-
CPU-based SLMs for AI Agents and Function Calling by LLMWare, https://www.youtube.com/watch?v=0MOMBJjytkQ&ab_channel=AIAnytime
-
AutoGen + Ollama + Gemma: How to Create LLM Agents Locally
https://www.youtube.com/watch?v=bkBOuBxsxeM&ab_channel=YeyuLab
-
JSON agents with Ollama & LangChain
Learn to implement an open-source Mixtral agent that interacts with a graph database Neo4j through a semantic layer
https://blog.langchain.dev/json-based-agents-with-ollama-and-langchain/
-
"I want Llama3 to perform 10x with my private knowledge" - Local Agentic RAG w/ llama3, LlamaParse, Firecrawl, Corrective RAG agent
https://www.youtube.com/watch?v=u5Vcrwpzoz8&t=982s&ab_channel=AIJason
-
How to augment LLMs with Agents and Tools
Here is how to augment LLMs with tools!
We build a prompt with the following items:
- a list of the available tools, with a description of what they are and how to use them
- the template of the Reasoning-Act (ReAct) prompt technique
- the scratchpad showing the results of the previous steps
- the output indicator to guide the LLM in formatting its output correctly
The ReAct technique forces the LLM to think about the next step to solve the question and choose a tool and a tool input to get more information based on that thought. We then extract the tool name and input with Regex and programmatically call the tool with the input and get the response. For example, one tool could be the Python package of the Wikipedia search engine.
We use the tool response to help further the LLM investigation to find the right answer. An agent is a wrapper around an LLM that is augmented with a bunch of tools. The agent iterates until the answer is found:
agent -> prompt with past steps -> LLM -> next steps -> tool -> response -> agent -> ...
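A minimal sketch of that loop, with a stubbed llm() returning canned ReAct-format text (in practice you would call a real model API, and the search tool could wrap the Wikipedia package); the prompt, tool, and answers are illustrative.
```python
# Sketch of a ReAct agent loop: parse the model's Action / Action Input,
# call the tool, append the Observation to the scratchpad, and iterate.
import re

def llm(prompt: str) -> str:
    # Stubbed model output in ReAct format (hypothetical, for illustration)
    if "Observation:" not in prompt:
        return "Thought: I should look this up.\nAction: search\nAction Input: LoRA adapters"
    return "Thought: I have enough information.\nFinal Answer: LoRA adds low-rank adapters to a frozen model."

def search(query: str) -> str:
    # Stand-in tool; a real agent might call a Wikipedia or web search API here
    return f"Search results for '{query}' ..."

TOOLS = {"search": search}
prompt = "Tools: search. Use the ReAct format.\nQuestion: What is LoRA?"

for _ in range(5):                      # the agent iterates until an answer is found
    output = llm(prompt)
    if "Final Answer:" in output:
        print(output.split("Final Answer:")[-1].strip())
        break
    action = re.search(r"Action: (.*)", output).group(1).strip()
    action_input = re.search(r"Action Input: (.*)", output).group(1).strip()
    observation = TOOLS[action](action_input)           # call the tool programmatically
    prompt += f"\n{output}\nObservation: {observation}"  # scratchpad of past steps
```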
-
Adding RAG to LangGraph Agents
https://www.youtube.com/watch?v=WyIWaopiUEo&ab_channel=SamWitteveen
https://colab.research.google.com/drive/1TSke71zmtkmwv83JOmaplNWXDisf8jHG?usp=sharing
-
AI Agent Evaluation with RAGAS (Retrieval Augmented Generation Assessment)
https://www.youtube.com/watch?v=-_52DIIOsCE&ab_channel=JamesBriggs
-
Jina Reader API: Build better AI Agents and RAG systems with Reader, https://www.youtube.com/watch?v=GllAqZE6uws&ab_channel=AIAnytime
-
Unlock AI Agents, Function Calls and Multi-Step RAG with LLMWare https://www.youtube.com/watch?v=cQfdaTcmBpY&ab_channel=llmware
-
RAG ipynb: CRAG, LlamaIndex, Ollama, ReAct Agent
https://www.youtube.com/watch?v=qPsmRk14BNM&ab_channel=code_your_own_AI
-
Hands on with LangGraph Agent Workflows: Build a LangChain Coding Agent with Custom Tools
https://www.youtube.com/watch?v=oMRJ--GJCKQ&ab_channel=DeployingAI
-
Create Complex Research Analysis with AI Agents using SLIM models on CPU with LLMWare
https://www.youtube.com/watch?v=y4WvwHqRR60&ab_channel=llmware
https://huggingface.co/llmware
https://github.com/llmware-ai/llmware
https://github.com/llmware-ai/llmware/tree/main/examples/SLIM-Agents/
-
LangGraph + function call + Yahoofinance = Multi-agent application, https://youtu.be/r2PvHdkaXWc?si=alEiCMZwy0xAwNwG
-
LangGraph and OpenGPTs: building agent forward applications with Langchain , https://www.youtube.com/live/NdF609kO8FY?si=OLcaLpy3ALBUeOUF
-
VectorShift + Pipelines + System Prompt = Ai Agent Chatbot
https://youtu.be/0HxHkNT4_EU?si=qeOsaRbRC6gt-rtA
- Build an Agent with Long-Term personalized memory, https://youtu.be/oPCKB9MUP6c?si=FGDDaDm1KuXVazhP
-
CodeHierarchyAgentPack from LlamaIndex
The CodeHierarchyAgentPack is useful to split long code files into more reasonable chunks, while creating an agent on top to navigate the code. What this will do is create a "Hierarchy" of sorts, where sections of the code are made more reasonable by replacing the scope body with short comments telling the LLM to search for a referenced node if it wants to read that context body.
Nodes in this hierarchy will be split based on scope, like function, class, or method scope, and will have links to their children and parents so the LLM can traverse the tree.
https://llamahub.ai/l/llama-packs/llama-index-packs-code-hierarchy?from=llama-packs
-
Llama Index AI Agents: How to Get Started for Beginners?
https://www.youtube.com/watch?v=jnWaUtS2Fr8&ab_channel=MervinPraison
- Navigating LLM Threats: Detecting Prompt Injections and Jailbreaks
https://www.youtube.com/watch?v=kH4ZoZSvddM&ab_channel=DeepLearningAI
-
Many-shot Jailbreaking
-
CryptGPT: Privacy-Preserving Language Models Using Vigenere Cipher (Part 1)
-
Uncensor any LLM with abliteration
-
Beware of Contaminated AI Models - Contaminated Proof 7B
https://www.youtube.com/watch?v=VABZSdFXtD4&ab_channel=FahdMirza
https://huggingface.co/Contamination/contaminated_proof_7b_v1.0
-
Avoid Attacks to Your LLM Generative AI Apps with PyRIT
-
Python Risk Identification Tool for generative AI (PyRIT)
It is an open access automation framework to empower security professionals and machine learning engineers to proactively find risks in their generative AI systems.
-
Attacks on LLMs
Ref: [Peter Gostev](https://www.linkedin.com/in/peter-gostev)
A Taxonomical Ontology of Prompt Hacking techniques. Black lines are hypernyms (i.e., typos are an instance of obfuscation), while grey arrows are meronyms (i.e., Special Case attacks usually contain a Simple Instruction). Purple nodes are not attacks themselves but can be a part of attacks. Red nodes are specific examples. Ref: [Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition](https://arxiv.org/pdf/2311.16119)
-
Anonymizing Sensitive Data in LLM Prompts
https://www.youtube.com/watch?v=7NrzPuICLtg&ab_channel=TrelisResearch
-
How to Un-censor Any LLM Locally - LLM Abliteration
https://www.youtube.com/watch?v=ojtWa-mU1qI&ab_channel=FahdMirza
-
Using Ollama to Run Local LLMs on the Raspberry Pi 5, https://www.youtube.com/watch?v=ewXANEIC8pY&ab_channel=IanWootten
-
Private AI Revolution: Setting Up Ollama with WebUI on Raspberry Pi 5!, https://www.youtube.com/watch?v=jJKbYj8mIy8&ab_channel=KevinMcAleer
-
I Ran Advanced LLMs on the Raspberry Pi 5!, https://www.youtube.com/watch?v=Y2ldwg8xsgE&ab_channel=DataSlayer
-
How to Run a ChatGPT-like AI on Your Raspberry Pi, https://www.youtube.com/watch?v=idZctq7WIq4&ab_channel=GaryExplains
-
Local AI Just Got Easy (and Cheap), https://www.youtube.com/watch?v=mdOEaNV8NXw&ab_channel=DataSlayer
The following boards are needed:
- Zima Board
- Coral USB TPU
- Coral PCie TPU
- M.2 Adapter
- Raspberry Pi 5
-
Power of Generative AI + Common-Sense of Reasoning AI = All-Pervasive Conversational Ux, https://www.youtube.com/watch?v=j1uZ1NpC_4M&ab_channel=Travellingwave
Paper Link: www.isca-speech.org/archive/pdfs/interspeech_2023/rao23_interspeech.pdf or www.travellingwave.com/TwIS2023.pdf
-
Running SDXL on the Raspberry Pi 5 is now POSSIBLE!, https://www.youtube.com/watch?v=XVS8oiuU6sA&ab_channel=AiFlux
-
World's Easiest GPT-like Voice Assistant https://github.com/nickbild/local_llm_assistant?tab=readme-ov-file
-
Run LLMs Locally on Raspberry Pi Using Ollama AI
-
Run Llama on your Raspberry Pi 5 without using Ollama
-
How to Run Multi-LLM Agents on Raspberry Pi with CrewAI
https://fleetstack.io/blog/run-multi-llm-agents-raspberry-pi-crewai-guide
-
Ollama benchmark on Raspberry Pi 5 RAM 8GB
https://aidatatools.com/2024/01/ollama-benchmark-on-raspberry-pi-5-ram-8gb/
-
Comparing recent smaller large language models (LLMs) locally on an OrangePi5b
-
Practical AI - Local LLM and machine learning for plant care with OrangePi5
https://www.viam.com/post/practical-ai-local-llm-and-machine-learning-for-plant-care
-
OpenAI Whisper C++ Raspberry Pi 5 Voice Assistant
https://www.youtube.com/watch?v=jpW9foRIwv0&ab_channel=SamWechsler
-
LLM-ollama-webui-Raspberry-Pi5 using Docker + Ollama + WebUI
https://github.com/adijayainc/LLM-ollama-webui-Raspberry-Pi5/
-
GenAI on the Edge Forum: Running an LLM on a Raspberry Pi https://www.youtube.com/watch?v=bU5F0bVOMIA
-
Running Large Language Models on Raspberry Pi at the Edge
https://www.hackster.io/mjrobot/running-large-language-models-on-raspberry-pi-at-the-edge-63bb11
-
Harnessing AI at the Edge: Building a RAG System with Ollama, Qdrant and Raspberry Pi https://blog.gopenai.com/harnessing-ai-at-the-edge-building-a-rag-system-with-ollama-qdrant-and-raspberry-pi-45ac3212cf75
-
SmolLM - blazingly fast and remarkably powerful
-
Build your own LLM AI on a Raspberry Pi
https://www.youtube.com/watch?v=6NCjTof5snk&ab_channel=WiseCat https://github.com/adamjenkins/pillama
-
Best ChatGPT AI on Raspberry Pi 5 - Qwen2 1.5b
https://www.youtube.com/watch?v=y2HRaBlzQCM&ab_channel=alby13
===========================================================================================
-
Raspberry Pi Forum Discussions
https://forums.raspberrypi.com/viewtopic.php?t=366146
- Use bitnet: Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in pytorch, https://github.com/kyegomez/BitNet
- OpenVINO: https://docs.openvino.ai/2024/home.html; https://docs.openvino.ai/2024/learn-openvino/interactive-tutorials-python.html
======================================================================================
Philippe Charrière's Blog https://k33g.hashnode.dev/series/ai-experiments
-
Run Ollama on a Pi5 : Host Ollama and TinyDolphin LLM on a Pi5 with Docker Compose
-
Ollama on my Pi5: The Python dev environment : First Steps with LangChain and the Python toolkit
-
Let's talk with a GenAI French cook: How to use RAG with LangChain, Chroma, Ollama and Gemma (on a Pi 5)
https://k33g.hashnode.dev/lets-talk-with-a-genai-french-cook
-
Prompts and Chains with Ollama and LangChain And, of course, it runs on my 🥰 Pi5.
https://k33g.hashnode.dev/prompts-and-chains-with-ollama-and-langchain
-
Make a GenAI Web app in less than 40 lines of code. With Ollama, LangChain & StreamLit. And, again, it runs on my 🥰 Pi5.
https://k33g.hashnode.dev/make-a-genai-web-app-in-less-than-40-lines-of-code
-
Make a GenAI Conversational Chatbot with memory. With Ollama, LangChain & StreamLit. And, again, it runs on my 🥰 Pi5.
https://k33g.hashnode.dev/make-a-genai-conversational-chatbot-with-memory
-
Create a GenAI Rust Teacher: How to learn Rust with Ollama and DeepSeek Coder Instruct
-
Let's chat about programming with LangChainJS and Ollama And this is still happening on a Pi 5 (and propelled by 🐳 Docker Compose)
https://k33g.hashnode.dev/lets-chat-about-programming-with-langchainjs-and-ollama
-
GenAI streaming API with LangChainJS, Ollama and Fastify And this is still happening on a Pi 5 (and propelled by 🐳 Docker Compose)
https://k33g.hashnode.dev/genai-streaming-api-with-langchainjs-ollama-and-fastify
-
Create a Web UI to use the GenAI streaming API With LangChainJS, Ollama and Fastify, still on a Pi 5 (and propelled by 🐳 Docker Compose)
https://k33g.hashnode.dev/create-a-web-ui-to-use-the-genai-streaming-api
-
Add memory to our GenAI application With LangChainJS, Ollama and Fastify, still on a Pi 5 (and propelled by 🐳 Docker Compose)
https://k33g.hashnode.dev/add-memory-to-our-genai-application
-
Connect your LLM to the world with the Ollama functions With LangChainJS, Ollama, still on a Pi 5 (and propelled by 🐳 Docker Compose)
https://k33g.hashnode.dev/connect-your-llm-to-the-world-with-the-ollama-functions
https://fullstackdeeplearning.com/cloud-gpus/
By Sergey Karayev and Charles Frye. Updated October 30, 2023.
Discussion of this page on Hacker News [https://news.ycombinator.com/item?id=36025099] May 21, 2023.
-
GPU Cloud Server Comparison
- The table below does not include all possible configurations for all providers, as providers differ in their configuration strategy.
- Most providers, including AWS, Azure, and Lambda, provide instances with pre-set configurations.
- On GCP, any suitable machine can be connected to a configuration of GPUs.
- On other providers, like Oblivus Cloud, Cudo Compute, and RunPod, users have precise control over the resources they request. Note that RunPod's Community Cloud, Oblivus, and Cudo are all "open clouds", meaning compute is provided by third parties.
- For providers without pre-set instance configurations, we have selected configurations that are roughly equivalent to AWS's options. Generally, these configurations are good for workloads that require heavy inter-GPU communication.
- Where possible, regions were set to be the west or central parts of the United States. GPU availability depends on the region.
- Raw data can be found in a csv on GitHub, https://github.com/full-stack-deep-learning/website/blob/main/docs/cloud-gpus/cloud-gpus.csv.
- Costs can be substantially reduced via preemption recovery and failover across clouds. If you don't want to roll your own, consider a tool like SkyPilot - https://github.com/skypilot-org/skypilot. See discussion of their launch on Hacker News - https://news.ycombinator.com/item?id=33964285, December 13, 2022.
-
How do I choose a GPU?
-
This page is intended to track and make explorable the current state of pricing and hardware for cloud GPUs.
-
If you want advice on which machines and cards are best for your use case, we recommend Tim Dettmers's blog post on GPUs for deep learning.
-
The whole post is a tutorial and FAQ on GPUs for DNNs, but if you just want the resulting heuristics for decision-making, see the "GPU Recommendations" section, which is the source of the chart below.
-
GPU Raw Performance Numbers and Datasheets
Model | Arch | FP32 (TFLOPS) | Mixed-precision (TFLOPS) | FP16 (TFLOPS) | Source |
---|---|---|---|---|---|
A100 | Ampere | 19.5 | 156 | 312 | Datasheet |
A10G | Ampere | 35 | 35 | 70 | Datasheet |
A6000 | Ampere | 38 | ? | ? | Datasheet |
V100 | Volta | 14 | 112 | 28 | Datasheet |
T4 | Turing | 8.1 | 65 | ? | Datasheet |
P4 | Pascal | 5.5 | N/A | N/A | Datasheet |
P100 | Pascal | 9.3 | N/A | 18.7 | Datasheet |
K80 | Kepler | 8.73 | N/A | N/A | Datasheet |
A40 | Ampere | 37 | 150 | 150 | Datasheet |
-
GPU Performance Benchmarks
Below are some basic benchmarks for GPUs on common deep learning tasks.
Benchmark of different GPUs on a single ImageNet epoch, by AIME
Benchmark of different GPUs on a mix of tasks, by Lambda Labs
- OpenAI-backed "AGI ROBOT" SHOCKED The ENTIRE Industry, https://www.youtube.com/watch?v=yauNW4C-Tfo&ab_channel=MatthewBerman
-
Explainable machine learning: LIME
It is so intuitive that I couldn't believe that nobody really thought about it before. Well, it is easy to be surprised after the fact! It is very reminiscent of Partial Dependence plots or ICE plots, but instead of looking at the global contributions of the different features, it provides local explanations for each prediction.
LIME (Local Interpretable Model-agnostic Explanations) looks at an ML model as a black box, and it tries to estimate the local variations of a prediction by perturbing the feature values of the specific data instance. The process is as follows:
- Choose a data instance x with the prediction y you want to explain
- Sample multiple data points around the initial data point by perturbing the values of the features
- Take those new samples and get the related inferences from our ML model
- We now have data points with features X' and predictions y' => Train a simple linear model on those data points and weigh the samples by how far they are from the original data point x in the feature space (low weights for high distance and high weights for low distance).
Linear models are readily interpretable. For example, if we have
y = w_1 * x_1 + w_2 * x_2 + w_3 * x_3
w_1 * x_1 is the contribution of the feature x_1 to the prediction for this specific data instance, and a high value means a high contribution. With this linear model, we can rank and quantify, in an additive manner, the contribution of each feature to each prediction, and this is what we call the "explanations" for the predictions. A minimal code sketch of the full procedure is included at the end of this section.
LIME works a bit differently for different data types:
-
For tabular data, we can perturb a continuous feature by simply adding some small noise to it. For categorical variables, it is more delicate as the concept of distance is more subjective; a common approach is to replace the value with another value of that feature sampled from the dataset.
-
For text data, the features are usually the words or the tokens. The typical way to perturb the features is to remove at random a few words from the original sentence. It is intuitive to think that if we remove an important word, the predictions should change quite a bit.
-
For image data, pixels are not really representative of what "matters" in an image. "Super-pixels" are created by segmenting the image (clustering similar close pixels) and then serve as the main features. We can turn on and off those new features by zeroing their values. By turning off a few super-pixels, we effectively perturb the feature set enough to estimate which segments contribute the most to the predictions.
Here is the original paper: “Why Should I Trust You?” Explaining the Predictions of Any Classifier, and the Python package.
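Below is a minimal, illustrative sketch of that procedure for tabular data (not the official `lime` package): perturb the instance with Gaussian noise, query the black-box model, and fit a distance-weighted linear surrogate. The `model` object, noise scale, and kernel width are assumptions for the example.

```python
# Minimal LIME-style local surrogate for tabular data (illustrative sketch only).
# Assumes a fitted black-box `model` with a scikit-learn-like predict() and a
# 1-D numpy array `x` (the instance to explain).
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(model, x, n_samples=5000, noise_scale=0.1, kernel_width=1.0):
    rng = np.random.default_rng(0)
    # 1) Sample points around x by adding small Gaussian noise to the features
    X_prime = x + rng.normal(scale=noise_scale, size=(n_samples, x.shape[0]))
    # 2) Get the black-box model's predictions for those perturbed samples
    y_prime = model.predict(X_prime)
    # 3) Weight samples by proximity to x (close = high weight, far = low weight)
    dist = np.linalg.norm(X_prime - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4) Fit a simple, interpretable linear surrogate on the weighted samples
    surrogate = Ridge(alpha=1.0).fit(X_prime, y_prime, sample_weight=weights)
    # w_i * x_i is the local contribution of feature i to this prediction
    return surrogate.coef_ * x

# contributions = lime_explain(model, x)  # rank features by |contribution|
```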
-
Explainable AI: SHAP
SHAP is certainly one of the most used techniques for explainable AI these days, but I think many people don't know where it comes from. Some researchers had a huge impact on the history of ML, and most people will never know about them.
SHAP (SHapley Additive exPlanations) is a framework that explains predictions as a sum of contributions from the underlying features used in the model. We have known about the Shapley value since 1951 (https://lnkd.in/e6jBm8YD), and since then people have tried to use it to measure feature attributions in Machine Learning models, but it was not until 2017 that a team from the University of Washington proposed a unified framework to apply it to any ML model.
-
Kernel SHAP is a black-box method that builds on top of LIME (https://lnkd.in/gpjdUNxw). Let's say you want to explain a specific prediction p with the related feature values x. The idea is to create many new samples around x by replacing some of the values with others pulled at random from the dataset, and to get the model's predictions for those new samples. We can then use those samples and predictions to train a linear model and use the fitted weights to understand the local contributions of the different features. The difference between LIME and SHAP is the way the samples are weighted in the MSE loss function: LIME uses a Gaussian kernel, whereas SHAP uses the Shapley weights.
-
Tree SHAP is an exact and much faster way to compute those numbers by exploiting the structure of tree-based algorithms. In a tree, we can compute exact predictions with a subset of the features by skipping the removed features and averaging the predictions of the resulting subtrees. We understand the contribution of a feature by measuring the variation of the predictions with and without it. In 2019, the same team proposed an algorithm to explore all the feature contributions of the feature power-set at once: https://lnkd.in/gDhHeQJP.
-
Linear SHAP is the exact analytic simplification of the original formula for linear models. For a model f(x) = w_1 * x_1 + w_2 * x_2 + …, the contribution of the feature x_1 is simply w_1 * ( x_1 - E[x_1]).
-
Deep SHAP is an application of DeepLIFT (https://lnkd.in/gtRtxhZq) using the Shapley values as a measure of contribution. DeepLIFT is a way to decompose the predictions of Neural Networks as a linear combination of contributions of the underlying features. The idea is that we can backpropagate the contributions as we do the gradient.
You can find the original SHAP papers here: https://lnkd.in/gWfEGkHt, https://lnkd.in/gDhHeQJP. For most people, though, SHAP is first and foremost a Python package; make sure to check it out if you haven't.
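As a quick starting point, here is a minimal, illustrative sketch of the Tree SHAP path with the `shap` package; the random-forest model and synthetic dataset are stand-ins for the example, not anything from the original papers.

```python
# Tree SHAP sketch with the shap Python package (illustrative example).
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Tree SHAP: exact, fast Shapley values that exploit the tree structure
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # shape: (n_samples, n_features)

# The contributions are additive: for each row, the SHAP values plus the
# expected value sum back to the model's prediction.
# shap.summary_plot(shap_values, X)      # global view of feature impact
```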
-
ML Interpretability: feature visualization, adversarial example, interp. for language models
https://youtube.com/playlist?list=PL8P_Z6C4GcuVMxhwT9JO_nKuW0QMSJ-cZ&si=vtxnKLMZwB8SGz6y
https://github.com/aws-samples/aws-machine-learning-university-responsible-ai/
-
How to convert any problem into a machine learning problem
https://www.youtube.com/watch?v=-MTW39At8F0&ab_channel=RicardoCalix
-
Intro to Reinforcement Learning through Human Feedbacks (RLHF)
https://www.youtube.com/watch?v=A8YqZKGRTAM&ab_channel=RicardoCalix
-
A Simple Generative Adversarial Network (GAN) in PyTorch
https://www.youtube.com/watch?v=BGtSw0XNthY&ab_channel=RicardoCalix
-
Learn More about ML and AI and Gen AI on https://www.youtube.com/@ricardocalix188/videos
-
Super VIP Cheatsheet: Deep Learning
-
Kolmogorov-Arnold Networks: MLP vs KAN, Math, B-Splines, Universal Approximation Theorem
-
Full Stack Deep Learning Course for Free
- [FSDL 2022 (Online)](https://fullstackdeeplearning.com/course/2022/): A fully online course, taught via YouTube, Crowdcast, and Discord.
- [FSDL 2021 (Online)](https://fullstackdeeplearning.com/spring2021/): Contemporaneous with the Berkeley course, we taught an online cohort course.
- [FSDL 2021 (Berkeley)](https://bit.ly/berkeleyfsdl): Taught as UC Berkeley undergrad course CS194-080 in Spring 2021.
- [FSDL 2020 (UW)](https://bit.ly/uwfsdl): Taught as University of Washington Professional Master's Program course CSEP 590C in Spring 2020.
- [FSDL 2019 (Online)](https://fall2019.fullstackdeeplearning.com/): Materials from the November 2019 bootcamp held on Berkeley campus, organized in a nice online format.
- [FSDL 2019 (Bootcamp)](https://fullstackdeeplearning.com/course/): Raw materials from the March 2019 bootcamp, held on Berkeley campus.
- [FSDL 2018 (Bootcamp)](https://fullstackdeeplearning.com/course/): Our first bootcamp, held on Berkeley campus in August 2018.
* **Deep Learning Fundamentals (Full Stack Deep Learning - Spring 2021)** https://www.youtube.com/watch?v=fGxWfEuUu0w&list=PL1T8fO7ArWlcWg04OgNiJy91PywMKT2lv&ab_channel=TheFullStack
* **Full Stack Deep Learning - 2022** https://www.youtube.com/watch?v=-Iob-FW5jVM&list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur&ab_channel=TheFullStack
-
What is the difference between the model parameters and the model hyperparameters?
The hyperparameters are the parameters we cannot co-train with the other parameters through the statistical learning optimization used to learn from the data. So we need to alternate between learning the parameters by minimizing the loss function and tuning the hyperparameters with separate optimization techniques, and that can be computationally very expensive! Neural Architecture Search treats the network architecture itself as a set of hyperparameters, and the search space dimension can be as big as 10^40!
One technique that gave me something to think about is DARTS. Instead of tuning the architecture through typical optimization techniques like Reinforcement Learning or Bayesian optimization, we jointly learn the architecture and the model parameters through the gradient descent process. That's AutoML taken to the next level!
The idea is to first establish a SuperNet of all the possible operations you may want to evaluate within your network. For example, you may want to test different convolution strides or kernel sizes, and you may want to discover new useful ways to connect them. Typically, we fix the skeleton of the network (the number of computational blocks - for example, ResNet-50 contains 16 residual blocks), and we search within each block. You put all the operations you want to test in each of the blocks and you create all the possible connections you may want to exist between those operations. Those connections contain parameters you can learn through gradient descent and they parametrize the connection probabilities. To make sure the model generalizes well, the model parameters are learned by minimizing the loss function measured on training data batches while the architecture parameters are learned by minimizing the loss function measured on the validation dataset (as you would in typical hyperparameter optimization).
Once trained, you just keep the connections with the highest probabilities and remove the unused operations. This allows you to discover the optimal sub-network. You can then retrain this sub-network from scratch.
DARTS is the seminal work on differential architecture search and has seen a lot of improvement since then. You can read more about it here: https://lnkd.in/ggwr9afT. If you are interested in learning more about Neural Architecture Search, I would advise reading this review: https://lnkd.in/geAA-c8f.
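To make the idea concrete, here is a toy PyTorch sketch of a DARTS-style mixed operation (my own illustrative simplification, not the official implementation); the candidate operations and channel sizes are arbitrary choices.

```python
# DARTS-style "mixed operation" sketch: candidate ops are blended with
# softmax-weighted architecture parameters (alpha), making the architecture
# choice itself differentiable. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Candidate operations to search over within a block
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Identity(),
        ])
        # Architecture parameters: one logit per candidate op
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        # Softmax over alpha gives connection probabilities; the output is the
        # probability-weighted sum over all candidate ops (the SuperNet view).
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# During search, model weights are updated on training batches and the alphas
# on validation batches (bilevel optimization). Afterwards, keep the op with
# the highest probability in each block and retrain that sub-network.
```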
-
ML model optimization
Do we need to train a model to understand how good it would be? Can't we "guess" its potential predictive power just based on its architecture or training parameters? That's the idea behind Meta-Learning: learn the patterns that make a model better than another one for some learning task!
The concepts are simple: featurize the learning meta-data, train a model to predict performance metrics with those features, and use that meta-model to search the optimization space when tuning another model.
Featurizing the learning meta-data means that we create features from the training settings. We can capture the architecture of a network as a one-hot encoded feature vector. We can capture the different hyperparameter values and the training parameters such as the number of epochs or the hardware (CPU / GPU). We can extend the meta-feature space to the dataset used for training. For example, we can include a one-hot encoded representation of the features used and the number of samples that were used (this allows you to perform feature selection as well). We could capture anything that could influence the learning and the resulting performance metrics. The more meta-features you include, the greater the space you will be able to optimize over, but also the more difficult it will be to correctly learn the target variable.
Now that you can featurize training experiments, you can train a meta-learner to learn the relationship between the training parameters and a performance metric. Because you will most likely have very few samples, your meta-learner should be a simple model such as a linear regression or a shallow neural network.
Now that you have a model that understands the relationship between the learning meta-data and the performance metrics, you can search for the learning meta-data that maximizes the performance metric. Because you have a model, you can assess billions of different learning meta-data configurations in seconds and converge to the optimal meta-features quickly. The typical approach is to use Reinforcement Learning or supervised fine-tuning. Fine-tuning means that if you have specific training data or if you want to focus on a subset of the search space, you can train a couple of new models on that data and get the resulting performance metrics. This allows you to fine-tune the meta-learner and make the optimization search more effective.
This is a good read to get started on the subject: https://lnkd.in/e9VafpST
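A minimal, illustrative sketch of this loop follows; the meta-features, past runs, and metric values are entirely hypothetical.

```python
# Meta-learning sketch: featurize past training runs, fit a simple meta-model
# that predicts the resulting metric, then score many candidate configurations
# cheaply. All numbers below are made up for illustration.
import numpy as np
from sklearn.linear_model import Ridge

# Each row: [learning_rate, num_layers, num_epochs, num_train_samples]
past_configs = np.array([
    [1e-3, 2, 10, 5_000],
    [1e-2, 4, 20, 5_000],
    [1e-4, 8, 30, 50_000],
    [1e-3, 4, 20, 50_000],
])
past_metrics = np.array([0.81, 0.84, 0.88, 0.90])  # e.g. validation accuracy

# Few samples -> keep the meta-learner simple (a linear model here)
meta_model = Ridge(alpha=1.0).fit(past_configs, past_metrics)

# Search: score a huge number of candidate configurations in seconds
rng = np.random.default_rng(0)
n = 100_000
candidates = np.column_stack([
    10 ** rng.uniform(-4, -2, n),        # learning rate
    rng.integers(1, 16, n),              # number of layers
    rng.integers(5, 50, n),              # number of epochs
    rng.integers(1_000, 100_000, n),     # training set size
])
best = candidates[np.argmax(meta_model.predict(candidates))]
print("promising configuration to actually train next:", best)
```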
-
What happens when your Machine Learning model breaks?
What happens when your Machine Learning model breaks? Imagine if the Netflix movie ranking model, the Facebook feed ranking model, or the Google search engine model suddenly stopped working. Nothing would show on those websites! Would that be an acceptable user experience?
In reality, those websites are extremely reliable! To run any of them, thousands of microservices or databases are always running in the background, and some of them are doomed to crash from time to time. In many cases, we can make the systems fault tolerant by adding redundancy.
This doesn't always work for ML pipelines! Suddenly your model can start to output unusable predictions or errors. Those predictions may be wildly inaccurate or simply non-numerical values. If a prediction request fails, it may be due to some hardware failure, in which case redundancy could solve the problem. It could also be due to bugs introduced in the way a specific feature is computed, which would cause any redundant model to fail as well. It is often important to have fallback strategies in place to handle this kind of situation. A fallback model could be a previous version of the production model, a simpler model, or a simple heuristic rule that outputs sub-optimal predictions, but predictions nonetheless. If a request fails, you can have a retry step with exception handling that reroutes the request to a fallback model.
It is quite easy to detect failures when a model throws errors or non-numerical values, but it is much harder when the model seemingly predicts meaningful values. That is why it is always important to monitor input features and model outputs. If some feature statistics start to drastically change over time, you may want to temporarily disable any model feeding on that feature and re-route requests to simpler models not using the feature, or you could simply replace the feature value with a constant while you investigate. Similarly, your prediction statistics, the model calibration, or the online model performance could start shifting, in which case you need to make sure your monitoring system automatically enables re-routing of the requests to a different model.
Fallback mechanisms become critical in big tech companies. You may have hundreds of engineers working on separate aspects of the ML pipelines, testing different techniques to improve those pipelines. Multiple engineers may deploy a new model, a new feature, a new feature transformation, or a new optimization technique that may lead to the pipelines suddenly failing. The monitoring system may detect outlier behavior but it may take days to debug the problem, and it is often easier to revert to a previous state of the pipelines until the problem is resolved.
Reliability for ML systems can be tricky and it is important to adopt ML specific strategies to handle it!
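A minimal, illustrative sketch of such a fallback strategy (the models and the sanity check are hypothetical placeholders):

```python
# Fallback sketch: try the primary model, sanity-check its output, and re-route
# to a simpler fallback model or a heuristic constant when anything fails.
import math

def predict_with_fallback(primary_model, fallback_model, features,
                          default_value=0.0):
    try:
        pred = primary_model.predict([features])[0]
        if math.isfinite(pred):          # guard against NaN/inf "silent" failures
            return pred, "primary"
    except Exception:
        pass                             # hardware failure, feature bug, ...
    try:
        return fallback_model.predict([features])[0], "fallback"
    except Exception:
        # Last resort: a simple heuristic, sub-optimal but always available
        return default_value, "heuristic"
```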
There definitely is no shortage of ways data can go wrong when it comes to Machine Learning! There are no magic tricks to avoid these problems, but there are ways to mitigate them to some degree.
-
Leaky variables are when you are using information you could not have known at the time of prediction in your training data. In a sense, you are including what you are trying to predict as part of your feature set which leads to seemingly overperforming models.
-
Concept drift is when the distribution of the underlying input variables remains the same but their relationships to the target variable change. That is why it is important to have periodic retraining or continuous training strategies in place.
-
Feedback loops are when the current model's predictions are used to accumulate future training data. This creates selection bias: future models are trained on data that does not represent production data well. That happens a lot in recommender engines! It can sometimes lead to better models, but it can also reinforce mistakes made by previous models.
-
Stationarity is a fundamental assumption in statistical learning as we assume that samples are identically distributed. If their probability distribution evolves over time (non-stationarity), the identical distribution assumption is violated. That is why it is critical to build features that are as stationary as possible. For example, a dollar amount is not a good feature (because of inflation), but relative dollar changes (Δ$ / $) may be better.
-
Population shift is a typical problem leading to concept drift and non-stationarity. The underlying population the model runs inference on changes over time, and the original training data is no longer representative of the current population. Again, periodic retraining is a good remedy for this problem.
-
Regulatory changes are a difficult one! One day, a new data law is passed or the App Store changes its privacy policies, making it impossible to capture a specific feature. Whole companies have gone bankrupt because they relied on data that Google Play or the App Store allowed them to capture one day but prohibited the next.
-
Overfitting is obviously the most well-known one and it is fortunately the one that every ML engineer is well prepared for! This is when the model does not generalize well to test data because it captured too much of the statistical noise within the training data.
-
Training data bias is when the sample distribution during training does not well represent the production data distribution, leading to biased models. It is crucial to understand how the bias will affect the inferences.
-
Covariate shift is when the input feature distribution P(X) changes but not its relation to the target P(Y|X). This may introduce biases in the training data selection process that result in inaccurate models; a simple drift check is sketched below.
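As mentioned above, several of these failure modes (concept drift, non-stationarity, population and covariate shift) can be caught by monitoring feature distributions. Here is a minimal, illustrative sketch with a two-sample Kolmogorov-Smirnov test; the threshold and example data are arbitrary.

```python
# Feature drift check sketch: compare a feature's distribution in a recent
# production window against the training data. Illustrative only; real
# monitoring systems are usually more sophisticated.
import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(train_col, live_col, p_threshold=0.01):
    """Return True if the live distribution differs significantly from training."""
    result = ks_2samp(train_col, live_col)
    return result.pvalue < p_threshold

# Example: inflation makes raw dollar amounts drift; relative changes less so
rng = np.random.default_rng(0)
train_amounts = rng.normal(100, 10, 10_000)
live_amounts = rng.normal(120, 10, 10_000)   # prices have shifted upward
print(feature_has_drifted(train_amounts, live_amounts))  # likely True
```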
-
Simplify AI Model Training With AutoGluon
https://www.youtube.com/watch?v=H90z30dO6hM&ab_channel=FahdMirza
https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html
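For reference, a minimal TabularPredictor usage sketch based on the API documentation linked above; the file paths and label column name are hypothetical.

```python
# AutoGluon TabularPredictor sketch (illustrative; file/column names assumed).
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")                 # hypothetical file
predictor = TabularPredictor(label="label").fit(train_data)

test_data = TabularDataset("test.csv")                   # hypothetical file
predictions = predictor.predict(test_data)
print(predictor.leaderboard(test_data))                  # compare trained models
```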
-
All metrics on the Hugging Face Hub
-
Deploy (Tiny) LLM to Production: Merge Lora Adapter, Push to HF Hub, Rest API with FastAPI & Docker
https://www.youtube.com/watch?v=c10rsQkczu0&ab_channel=VenelinValkov
-
Hamel Husain - Building LLM Apps in Production
https://www.youtube.com/watch?v=MFSd-_pMExI&ab_channel=AleksaGordi%C4%87-TheAIEpiphany
-
Building Julius AI to 500.000 users w/ Rahul (founder)
https://www.youtube.com/watch?v=CX2VLowIOXE&ab_channel=AleksaGordi%C4%87-TheAIEpiphany
-
DeepMind's TacticAI: an AI assistant for football tactics | Petar Veličković
https://www.youtube.com/watch?v=BRbUikWdXhI&ab_channel=AleksaGordi%C4%87-TheAIEpiphany
-
Ishan Misra (Meta) - Emu Video Generation
https://www.youtube.com/watch?v=dLcsreHRF1s&ab_channel=AleksaGordi%C4%87-TheAIEpiphany
-
OpenAI Cook Book
-
Mervin Praison https://www.youtube.com/@MervinPraison
-
James Briggs https://www.youtube.com/@jamesbriggs ****************
-
AI Anytime https://www.youtube.com/@AIAnytime ****************
-
All About AI https://www.youtube.com/@AllAboutAI ****************
-
Sam Witteveen https://www.youtube.com/@samwitteveenai ****************
-
AutoGPT Tutorials ManuAGI - AutoGPT Tutorials ****************
-
AI Makerspace https://www.youtube.com/@AI-Makerspace ****************
-
AssemblyAI https://www.youtube.com/@AssemblyAI ****************
-
Venelin Valkov https://www.youtube.com/@venelin_valkov ****************
-
Trelis Research https://youtube.com/@TrelisResearch?si=We9ORBTjY3teMpq4 ****************
-
Connor Shorten https://youtube.com/@connorshorten6311?si=YA9lHWPqWaAdOtSy ****************
-
Julien Simon https://www.youtube.com/@juliensimonfr ****************
-
Matthew Berman https://www.youtube.com/@matthew_berman ****************
-
DataScience Basics https://youtube.com/@datasciencebasics
-
Aleksa Gordić - The AI Epiphany https://www.youtube.com/@TheAIEpiphany **************** https://github.com/gordicaleksa
-
Jeff Heaton https://youtube.com/@HeatonResearch?si=hfcA9vNxWsk05Uws **************** www.heatonresearch.com
-
Prompt Engineering https://www.youtube.com/@engineerprompt
-
Umar Jamil https://www.youtube.com/@umarjamilai/videos ***********
-
WorldofAI https://www.youtube.com/@intheworldofai
-
AlejandroAO Software and AI, https://youtube.com/@alejandro_ao?si=1TRHMqnIpQGUjJG6
-
Learn Data With Mark https://youtube.com/@learndatawithmark?si=Sf7QWUJd6Jn2K5CR
-
SkillCurb https://www.youtube.com/@skillcurb
-
Seth Juarez https://www.youtube.com/@sethjuarez
-
Nicholas Renotte https://www.youtube.com/@NicholasRenotte/
-
Mat Williams https://youtube.com/@technovangelist?si=UiLCumC6anKxbzB-
-
Ian Wootten https://youtube.com/@IanWootten?si=4xbHzdFIIX7n9SMS
-
AI for Devs https://youtube.com/@ai-for-devs?si=4TrsM8CP7VBO-2a_
-
code_your_own_AI https://www.youtube.com/@code4AI
-
Sebastian Raschka https://www.youtube.com/@SebastianRaschka
-
Jeremy Howard https://www.youtube.com/@howardjeremyp *************
-
Leon Explains AI https://www.youtube.com/@leonsaiagency
-
Skill Leap AI https://www.youtube.com/@SkillLeapAI/
-
AI Jason https://www.youtube.com/@AIJasonZ
-
Abhishek Thakur https://www.youtube.com/@abhishekkrthakur
-
Fahd Mirza https://www.youtube.com/@fahdmirza *****
-
Yeyu Lab https://www.youtube.com/@yeyulab
-
Steve (Builder.io) https://www.youtube.com/@Steve8708
-
Andrej Karpathy https://youtu.be/VMj-3S1tku0?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
-
AI Engineer https://www.youtube.com/@aiDotEngineer
-
Whispering AI https://www.youtube.com/@WhisperingAI/videos
-
Greg Kamradt (Data Indy) https://www.youtube.com/@DataIndependent
-
Rob Mulla https://www.youtube.com/@robmulla
-
DataInsightEdge https://www.youtube.com/@DataInsightEdge01
-
AI Papers Academy https://www.youtube.com/@aipapersacademy ***
-
Mosleh Mahamud https://www.youtube.com/@moslehmahamud9574 ***
-
FourthBrainAI https://www.youtube.com/@FourthBrainAI ***
-
Underfitted, https://www.youtube.com/@underfitted
-
Nodematic Tutorials, https://www.youtube.com/@nodematic/ ****
-
Super Lazy Coder, https://www.youtube.com/@superlazycoder1984/
-
DataMListic, https://www.youtube.com/@datamlistic/videos ****
-
Entry Point AI https://www.youtube.com/@EntryPointAI
-
Shaw Talebi https://www.youtube.com/@ShawhinTalebi
-
Deploying AI https://youtube.com/@deployingai?si=pXZDOETUDdqiB_9I
-
Predibase https://youtube.com/@Predibase?si=HbdO89yPruuKJp6I
-
GaoDalie_AI https://www.youtube.com/@GaoDalie_AI ***
-
Sunny Savita https://www.youtube.com/@sunnysavita10 ****
-
Stanford Online, https://www.youtube.com/@stanfordonline/videos **********
-
Ricardo Calix, https://www.youtube.com/@ricardocalix188 ****************
-
3Blue1Brown https://www.youtube.com/@3blue1brown
If few-shot learning is not enough, then fine-tuning is required.
-
Token Cost Reduction through LLMLingua's Prompt Compression, https://www.youtube.com/watch?v=xLNL6hSCPhc&ab_channel=AIAnytime
-
Prompting Guide, https://www.promptingguide.ai/research/rag
-
Prompt Engineering, RAG, and Fine-tuning: Benefits and When to Use, https://www.youtube.com/watch?v=YVWxbHJakgg&ab_channel=EntryPointAI
-
Text to Speech Tortoise versus Openvoice Comparison | How to Clone Any Voice for FREE !!, https://www.youtube.com/watch?v=E9jWEmUSxyo&ab_channel=SkillCurb
-
ChatGPT Vision API End to End Project with Zapier and MindStudio, https://www.youtube.com/watch?v=4UsQxuhxB7c&ab_channel=SkillCurb
-
Vibe-Based Prompt Engineering with PromptLayer's Jared Zoneraich, https://www.youtube.com/watch?v=SEgwj6SVWyQ&ab_channel=ArizeAI
-
Prompt Templates, Functions and Prompt Window Management, https://www.youtube.com/watch?v=YaYaZu6NbS0&ab_channel=ArizeAI
-
ChatGPT Prompt Engineering for Developers, https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/
* **Free Course** (https://course.fast.ai/) by Jeremy Howard's fast.ai
**Practical Deep Learning:** A free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems.
Book PDF: https://dl.ebooksworld.ir/books/Deep.Learning.for.Coders.with.fastai.and.PyTorch.Howard.Gugger.OReilly.9781492045526.EBooksWorld.ir.pdf
-
Learn from Huggingface
- Open-Source AI Cookbook: https://huggingface.co/learn/cookbook
- NLP Course: https://huggingface.co/learn/nlp-course
- Deep RL Course: https://huggingface.co/learn/deep-rl-course
- Audio Course: https://huggingface.co/learn/audio-course
-
LLM University
LLM University by Cohere
https://docs.cohere.com/docs/llmu
-
🚀 Full Stack LLM Bootcamp 🚀
https://fullstackdeeplearning.com/llm-bootcamp/
https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/
The Full Stack (https://www.youtube.com/@The_Full_Stack/videos)
https://github.com/the-full-stack/website
Lectures https://www.youtube.com/watch?v=twHxmU9OxDU&list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ&pp=iAQB
- Learn to Spell: Prompt Engineering https://youtu.be/JnBHR_yL2w8?list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ
- LLMOps https://youtu.be/Fquj2u7ay40?list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ
- UX for Language User Interfaces https://youtu.be/l5mG4z343qg?list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ
- Augmented Language Models https://youtu.be/YdeuQhlHmCA?list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ
- Launch an LLM App in One Hour https://youtu.be/twHxmU9OxDU?list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ
- LLM Foundations https://youtu.be/MyFrMFab6bo?list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ
- Project Walkthrough: askFSDL https://www.youtube.com/watch?v=pUKs4xM1r5U&list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ&index=5&pp=iAQB
- What's Next? https://youtu.be/ax_R4yz1WwM?list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ
- UX for Language user interfaces (LLM Bootcamp) https://www.youtube.com/watch?v=l5mG4z343qg&list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ&index=6&pp=iAQB
- Invited Talks
- Fireside Chat with Peter Welinder https://www.youtube.com/watch?v=54UThDl00qI&list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ&index=9&pp=iAQB
- Harrison Chase: Agents https://www.youtube.com/watch?v=DWUdGhRrv2c&list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ&index=10&pp=iAQB
- Reza Shabani: How To Train Your Own LLM https://www.youtube.com/watch?v=roEKOzxilq4&list=PL1T8fO7ArWleyIqOy37OVXsP4hFXymdOZ&index=11&pp=iAQB
-
Machine Learning University by AWS, https://youtube.com/@machinelearninguniversity1942?si=pD5dszE0HTiOclcu
https://github.com/aws-samples/aws-machine-learning-university-accelerated-nlp
https://github.com/aws-samples/aws-machine-learning-university-accelerated-cv
https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab
https://github.com/aws-samples/aws-machine-learning-university-dte
https://github.com/aws-samples/aws-machine-learning-university-responsible-ai
-
PyTorch
Real-World PyTorch: From Zero to Hero in Deep Learning & LLMs | Tensors, Operations, Model Training
Explore PyTorch from basics to advanced model training. Through hands-on examples, learn tensor manipulation, GPU utilization, and model optimization. Ideal for anyone eager to master deep learning with PyTorch, this video ensures you're equipped for the AI revolution.
https://www.youtube.com/watch?v=dgs_9quxZXk&ab_channel=VenelinValkov
https://github.com/curiousily/Get-Things-Done-with-Prompt-Engineering-and-LangChain
-
CS50
This is CS50, Harvard University's introduction to the intellectual enterprises of computer science and the art of programming. Demanding, but definitely doable. Social, but educational. A focused topic, but broadly applicable skills. CS50 is the quintessential Harvard (and Yale!) course.
-
Ahead of AI magazine by Sebastian Raschka https://magazine.sebastianraschka.com/archive
-
Edx: cs50.edx.org
-
FreeCodeCamp https://www.youtube.com/@freecodecamp
-
Generative AI Full course - Gemini Pro, openAI, Llama, Langchain, Pinecone, vector databases and more, https://youtu.be/mEsleV16qdo?si=K4ZFHW2ZRG7EtL3Q
-
The AiEdge
https://www.linkedin.com/company/the-aiedge-newsletter/posts/?feedView=all
-
Create a Large Language Model from Scratch with Python – Tutorial https://www.youtube.com/watch?v=UU1WVnMk4E8&t=24s
-
Prompt Engineering for Web Devs - ChatGPT and Bard Tutorial https://youtu.be/ScKCy2udln8
-
Deep Learning for Computer Vision with Python and TensorFlow – Complete Course https://youtu.be/IA3WxTTPXqQ
-
Machine Learning with Python and Scikit-Learn – Full Course https://youtu.be/hDKCxebp88A
-
MLOps Course – Build Machine Learning Production Grade Projects https://youtu.be/-dJPoLm_gtE
-
code_your_own_AI https://www.youtube.com/@code4AI
-
The Ethics of AI & Machine Learning - Full Course https://youtu.be/qpp1G0iEL_c
-
Google
Google Cloud Skills Boost https://www.cloudskillsboost.google/paths/118 Google Cloud Generative AI Learning Path
- Introduction to Generative AI https://www.cloudskillsboost.google/course_templates/536
- Introduction to Large Language Models https://www.cloudskillsboost.google/course_templates/539
- Generative AI Fundamentals https://www.cloudskillsboost.google/course_templates/556
- Encoder-Decoder Architecture https://www.cloudskillsboost.google/course_templates/543
- Attention Mechanism https://www.cloudskillsboost.google/course_templates/537
- Transformer Models and BERT Model https://www.cloudskillsboost.google/course_templates/538
- Generative AI Explorer - Vertex AI https://www.cloudskillsboost.google/quests/299
- Blogs
- Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/
- Philippe Charrière's Blog, https://k33g.hashnode.dev/series/ai-experiments
- Confident AI blog, https://www.confident-ai.com/blog
- Huggingface blog, https://huggingface.co/blog
- Langchain blog, https://blog.langchain.dev/
- LlamaIndex blog, https://www.llamaindex.ai/blog
- Pinecone, https://www.pinecone.io/learn/
- Maxime Labonne, https://mlabonne.github.io/blog/
- LLM Watch, https://www.llmwatch.com/
- AI by Hand ✍️, https://aibyhand.substack.com/