Day 1
Despite the role self-supervised learning plays in training Large Language Models (LLMs), even the largest models, as well as the vast majority of smaller ones solving various NLP tasks, still require labeled data to produce meaningful results. More often than not, the problems start at the very beginning, when gathering the data. Then another set of impediments arises: who should annotate these texts, do we have to hire experts, how many of them, what tools should we use, and so on. Or maybe we can use an LLM to do our bidding and be done with the problem quicker and cheaper? Today we will address some of those issues and show you what the data annotation process can look like; a short illustrative sketch of LLM-assisted pre-annotation follows this day's description.
Difficulty: Beginner
This workshop is considered beginner-friendly, as it mainly focuses on showcasing tools for data annotation and explaining the terms connected with it. On the programming side, we don't anticipate any difficulties for the participants.
This day is dedicated to those who seek to streamline the data gathering process, especially the annotation part. However, as the topic is broad, we expect that everyone will find something of interest here.
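To make the LLM-assisted option concrete, here is a minimal sketch of pre-annotation with an LLM. It assumes the openai>=1.0 Python client and an API key in the environment; the model name, label set, and prompt are illustrative placeholders, not workshop material. The point is that the model produces draft labels for human annotators to verify, not that it replaces them.

```python
# Minimal sketch of LLM-assisted pre-annotation. Assumes the openai>=1.0
# client and an OPENAI_API_KEY environment variable; the model name and
# label set below are placeholders for whatever your task requires.
from openai import OpenAI

client = OpenAI()
LABELS = ["positive", "negative", "neutral"]  # hypothetical label set

def pre_annotate(text: str) -> str:
    """Ask the LLM for a draft label; a human annotator reviews it later."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the user's text. "
                        f"Answer with exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": text},
        ],
        temperature=0,  # deterministic drafts are easier to audit
    )
    return response.choices[0].message.content.strip().lower()

draft = pre_annotate("The new release fixed every bug I reported. Great job!")
print(draft)  # e.g. "positive" -- still needs human verification
```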
Day 2

This day focuses on tools and techniques for monitoring and optimizing the training process of machine learning models. We introduce MLflow, an open-source tool that lets us monitor fine-tuning experiments. We present the experiment card as the basic means of documenting progress in the search for a better model. Then we move on to methods for training optimization and present Optuna as one of the tools that automate this procedure; a short sketch combining the two follows this day's description. We conclude the day with an introduction to the model registry.
Difficulty: Intermediate
New day, new tools, and once again the main challenge of the workshop is to understand them and leverage them to the greatest extent possible. The perceived difficulty will therefore vary with each participant's previous experience with such tools; on average, we expect a medium level of difficulty.
Anyone interested in neural networks will find some interesting bits here, but the main beneficiaries of the workshop should be those directly involved in the process of training or fine-tuning models.
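As a taste of how these tools compose, below is a compressed sketch that wraps an Optuna hyperparameter search in MLflow runs. The evaluate function is a made-up stand-in for an actual fine-tuning loop, and the experiment name is arbitrary; the mlflow and optuna calls themselves are the public APIs covered during the day.

```python
# Sketch of MLflow experiment tracking combined with Optuna search.
# evaluate() is a stand-in for a real fine-tuning run.
import mlflow
import optuna

def evaluate(learning_rate: float, batch_size: int) -> float:
    """Placeholder: fine-tune a model and return a validation score."""
    return 1.0 - abs(learning_rate - 3e-4) - 0.001 * batch_size  # fake metric

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    bs = trial.suggest_categorical("batch_size", [8, 16, 32])
    with mlflow.start_run(nested=True):      # one MLflow run per trial
        mlflow.log_params({"learning_rate": lr, "batch_size": bs})
        score = evaluate(lr, bs)
        mlflow.log_metric("val_score", score)
    return score

mlflow.set_experiment("finetuning-demo")     # experiment name is illustrative
with mlflow.start_run(run_name="optuna-search"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    mlflow.log_params({f"best_{k}": v for k, v in study.best_params.items()})
    mlflow.log_metric("best_val_score", study.best_value)
```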
Day 3

In the entire NLP pipeline, we have to process and clean the data before modeling, and after training we should analyze and understand our models. This workshop focuses on data-centric AI and data quality testing, as well as testing and analyzing the resulting models. Understanding data quality, problems, and errors is crucial to designing proper preprocessing. These factors also affect the models themselves, lowering quality and stability, often in ways related to bias and unfairness. Spotting those problems is particularly challenging in large text corpora, which cannot be analyzed manually; a short sketch of automated label-error detection follows this day's description.
Difficulty: Intermediate
Like the previous days, this workshop introduces many new concepts and tools, such as CleanLab and Giskard, so participants should focus on exploring them: learning how to use them and how to recognize scenarios in which leveraging them would be beneficial. The coding side isn't particularly demanding, but due to the multitude of novelties the difficulty level is estimated at medium.
This workshop was created to optimize the data mining phase, as well as the later quality assurance processes that follow model training. All tools and techniques introduced on this day could benefit anyone dedicated to obtaining the best possible data and understanding how it affects model quality and interpretability, including data engineers and researchers on R&D teams.
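For a flavor of automated data quality testing, here is a minimal sketch built on cleanlab's find_label_issues. The tiny arrays are toy stand-ins; in real use, pred_probs should be out-of-sample predicted probabilities, for example from cross-validation.

```python
# Sketch of label-error detection with cleanlab: rank examples whose
# annotator-assigned label disagrees with the model's predicted
# probabilities. The arrays below are toy stand-ins.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 0, 0, 1, 1, 1])       # annotator-assigned classes
pred_probs = np.array([                      # model's class probabilities
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.95, 0.05],                            # label says 1, model says 0
])

issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)  # likely flags index 5 for human re-annotation
```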
Day 4

After gathering the data and creating and deploying a model, a whole new phase begins: post-deployment. It covers versioning, monitoring, retraining, updating, and more. Here we focus on monitoring, which helps us detect changes in data and model behavior. This is also known as drift monitoring, and it can be divided into data drift and concept (model) drift. Both are particularly challenging to get right in NLP applications, where we deal with text data, embeddings, multimodal datasets, and LLMs; a simple illustrative drift check follows this day's description.
Difficulty: Advanced
This workshop is expected to be the most challenging of the series in terms of the concepts and solutions introduced, mainly because of the new tools, but also because data drift and concept drift are subtle phenomena in their own right.
This part of the course aims to answer the following question: what should we do after the model is trained and deployed? The conclusions should be of value to everyone involved in the training and deployment processes, but especially to data scientists responsible for data mining and cleansing, and to engineers responsible for model instantiation, training, or fine-tuning, since data and concept drift primarily affect those areas once the deployed solution is in ongoing support.
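Dedicated monitoring tools cover this ground end to end, but the core idea of data-drift detection can be sketched without them: compare feature distributions between a reference window and a current window using a two-sample test. Everything below, including the synthetic embeddings, is purely illustrative.

```python
# Library-free sketch of data-drift detection: for each embedding
# dimension, compare the reference (training-time) distribution with the
# current (production) distribution using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(1000, 16))  # training-time embeddings
current = rng.normal(0.3, 1.0, size=(1000, 16))    # shifted production data

ALPHA = 0.01  # per-dimension significance threshold (tune for your traffic)
drifted = [
    dim for dim in range(reference.shape[1])
    if ks_2samp(reference[:, dim], current[:, dim]).pvalue < ALPHA
]
print(f"{len(drifted)}/{reference.shape[1]} embedding dimensions drifted")
```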