Day 1
Despite the role self-supervised learning plays in training Large Language Models (LLMs), even the largest models, as well as the vast majority of smaller ones solving various NLP tasks, still require labeled data to produce meaningful results. More often than not, the problems start at the very beginning, when gathering the data. Then another set of impediments arises: who should annotate these texts, do we have to hire experts, how many of them, what tools should we use, and so on. Or maybe we can use an LLM to do our bidding and be done with the problem quicker and cheaper? Today we will address some of those issues and show you what the data annotation process can look like; a short illustrative sketch of LLM-assisted pre-annotation follows this day's description.
Difficulty: Beginner
This workshop is considered beginner-friendly, as it mainly focuses on showcasing tools for data annotation and explaining the terms connected with it. On the programming side, we don't anticipate any difficulties for the participants.
This day is dedicated to those who seek to streamline the data gathering process, especially the annotation part. However, as the topic is broad, we expect that everyone will find something of interest here.
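To make the LLM-assisted option concrete, here is a minimal sketch of pre-annotation with an LLM. It assumes the openai>=1.0 Python client and an API key in the environment; the model name, label set, and prompt are illustrative placeholders, not workshop material. The point is that the model produces draft labels for human annotators to verify, not that it replaces them.

```python
# Minimal sketch of LLM-assisted pre-annotation. Assumes the openai>=1.0
# client and an OPENAI_API_KEY environment variable; the model name and
# label set below are placeholders for whatever your task requires.
from openai import OpenAI

client = OpenAI()
LABELS = ["positive", "negative", "neutral"]  # hypothetical label set

def pre_annotate(text: str) -> str:
    """Ask the LLM for a draft label; a human annotator reviews it later."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the user's text. "
                        f"Answer with exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": text},
        ],
        temperature=0,  # deterministic drafts are easier to audit
    )
    return response.choices[0].message.content.strip().lower()

draft = pre_annotate("The new release fixed every bug I reported. Great job!")
print(draft)  # e.g. "positive" -- still needs human verification
```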
Day 2

This day focuses on tools and techniques for monitoring and optimizing the training process of machine learning models. We introduce MLflow, an open-source tool that lets us monitor fine-tuning experiments. We present the experiment card as the basic means of documenting progress in the search for a better model. Then we move on to methods for training optimization and present Optuna as one of the tools that automate this procedure; a short sketch combining the two follows this day's description. We conclude the day with an introduction to the model registry.
Difficulty: Intermediate
New day, new tools, and once again the main challenge of the workshop is to understand them and leverage them to the greatest extent possible. The perceived difficulty will therefore vary with each participant's previous experience with such tools; on average, we expect a medium level of difficulty.
Anyone interested in neural networks will find some interesting bits here, but the main beneficiaries of the workshop should be those directly involved in the process of training or fine-tuning models.
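As a taste of how these tools compose, below is a compressed sketch that wraps an Optuna hyperparameter search in MLflow runs. The evaluate function is a made-up stand-in for an actual fine-tuning loop, and the experiment name is arbitrary; the mlflow and optuna calls themselves are the public APIs covered during the day.

```python
# Sketch of MLflow experiment tracking combined with Optuna search.
# evaluate() is a stand-in for a real fine-tuning run.
import mlflow
import optuna

def evaluate(learning_rate: float, batch_size: int) -> float:
    """Placeholder: fine-tune a model and return a validation score."""
    return 1.0 - abs(learning_rate - 3e-4) - 0.001 * batch_size  # fake metric

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    bs = trial.suggest_categorical("batch_size", [8, 16, 32])
    with mlflow.start_run(nested=True):      # one MLflow run per trial
        mlflow.log_params({"learning_rate": lr, "batch_size": bs})
        score = evaluate(lr, bs)
        mlflow.log_metric("val_score", score)
    return score

mlflow.set_experiment("finetuning-demo")     # experiment name is illustrative
with mlflow.start_run(run_name="optuna-search"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    mlflow.log_params({f"best_{k}": v for k, v in study.best_params.items()})
    mlflow.log_metric("best_val_score", study.best_value)
```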
Day 3

In the entire NLP pipeline, we have to process and clean the data before modeling, and after training we should analyze and understand our models. This workshop focuses on data-centric AI and data quality testing, as well as testing and analyzing the resulting models. Understanding data quality, problems, and errors is crucial to designing proper preprocessing. These factors also affect the models themselves, lowering quality and stability, often in ways related to bias and unfairness. Spotting those problems is particularly challenging in large text corpora, which cannot be analyzed manually; a short sketch of automated label-error detection follows this day's description.
Difficulty: Intermediate
Like the previous days, this workshop introduces many new concepts and tools, such as CleanLab and Giskard, so participants should focus on exploring them: learning how to use them and how to recognize scenarios in which leveraging them would be beneficial. The coding side isn't particularly demanding, but due to the multitude of novelties the difficulty level is estimated at medium.
This workshop was created to optimize the data mining phase, as well as the later quality assurance processes that follow model training. All tools and techniques introduced on this day could benefit anyone dedicated to obtaining the best possible data and understanding how it affects model quality and interpretability, including data engineers and researchers on R&D teams.
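For a flavor of automated data quality testing, here is a minimal sketch built on cleanlab's find_label_issues. The tiny arrays are toy stand-ins; in real use, pred_probs should be out-of-sample predicted probabilities, for example from cross-validation.

```python
# Sketch of label-error detection with cleanlab: rank examples whose
# annotator-assigned label disagrees with the model's predicted
# probabilities. The arrays below are toy stand-ins.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 0, 0, 1, 1, 1])       # annotator-assigned classes
pred_probs = np.array([                      # model's class probabilities
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.95, 0.05],                            # label says 1, model says 0
])

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)  # likely flags index 5 for human re-annotation
```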
Day 4

After gathering the data and creating and deploying a model, a whole new phase begins: post-deployment. It covers versioning, monitoring, retraining, updating, and more. Here we focus on monitoring, which helps us detect changes in data and model behavior. This is also known as drift monitoring, and it can be divided into data drift and concept (model) drift. Both are particularly challenging to get right in NLP applications, where we deal with text data, embeddings, multimodal datasets, and LLMs; a simple illustrative drift check follows this day's description.
Difficulty: Advanced
This workshop is expected to be the most challenging of the series in terms of the concepts and solutions introduced, mainly because of the new tools, but also because data drift and concept drift are subtle phenomena in their own right.
This part of the course aims to answer the following question: what should we do after the model is trained and deployed? The conclusions should be of value to everyone involved in the training and deployment processes, but especially to data scientists responsible for data mining and cleansing, and to engineers responsible for model instantiation, training, or fine-tuning, since data and concept drift primarily affect those areas once the deployed solution is in ongoing support.
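Dedicated monitoring tools cover this ground end to end, but the core idea of data-drift detection can be sketched without them: compare feature distributions between a reference window and a current window using a two-sample test. Everything below, including the synthetic embeddings, is purely illustrative.

```python
# Library-free sketch of data-drift detection: for each embedding
# dimension, compare the reference (training-time) distribution with the
# current (production) distribution using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(1000, 16))  # training-time embeddings
current = rng.normal(0.3, 1.0, size=(1000, 16))    # shifted production data

ALPHA = 0.01  # per-dimension significance threshold (tune for your traffic)
drifted = [
    dim for dim in range(reference.shape[1])
    if ks_2samp(reference[:, dim], current[:, dim]).pvalue < ALPHA
]
print(f"{len(drifted)}/{reference.shape[1]} embedding dimensions drifted")
```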