
Contact us

Our friendly team would love to hear from you.




    Or contact us directly at office@cognitum.au


    Annotation, Data and MLOps


    Day 1 - Data Annotation
    Agenda
    • Overview of open-source annotation tools
    • Data annotation
      – Annotating data for Named Entity Recognition (NER) task with Argilla
      – Annotation of Question-Answering (QA) pairs with custom tool
      – Survey of approaches to resolving annotation conflicts
    • Evaluation
      – Comparison of annotation results with LLMs performance on the same tasks
      – Reflection on the legitimacy of synthetically generated datasets
    Description

    Although self-supervised learning now plays a major role in training Large Language Models (LLMs), the largest models, as well as the vast majority of smaller ones that solve various NLP tasks, still require labeled data to produce meaningful results. More often than not, the problems start at the very beginning, when gathering the data. Then another set of impediments arises: who should annotate these texts, do we have to hire experts, how many of them, which tools should we use, and so on. Or can we simply ask an LLM to do the work and be done with the problem faster and cheaper? Today we address some of these questions and show you what the data annotation process can look like.
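
    To give a flavor of the tooling, the sketch below shows how a pre-annotated NER record could be pushed to Argilla for human review. It is a minimal illustration based on the Argilla 1.x Python client; the server URL, API key, dataset name, and example sentence are placeholders rather than the workshop's actual setup.

        import argilla as rg

        # Connect to a running Argilla server (URL and API key are placeholders).
        rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

        # One token-classification (NER) record with a model pre-annotation
        # that annotators can accept, correct, or discard in the UI.
        record = rg.TokenClassificationRecord(
            text="Cognitum is based in Warsaw",
            tokens=["Cognitum", "is", "based", "in", "Warsaw"],
            prediction=[("ORG", 0, 8), ("LOC", 21, 27)],  # (label, char_start, char_end)
            prediction_agent="baseline-ner-model",
        )

        # Push the record to the dataset the annotation team works on.
        rg.log(records=[record], name="ner-annotation-demo")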

    Level

    Beginner

    This workshop is considered beginner-friendly, as it mainly showcases tools for data annotation and explains the related terminology. As for the programming aspects, we don't anticipate any difficulties for the participants.

    Target Participants

    This day is dedicated to those who seek to streamline the data gathering process, especially the annotation part. However, as the topic is broad, we expect that everyone will find something of interest here.

    Day 2 - Training Monitoring
    Agenda
    • Introduction to MLflow
      – Model/Experiment card
    • Training monitoring
      – Optimizing classification metrics
      – Optimizing training speed
    • Hyperparameter tuning with Optuna
    • Model registry
    Description

    This day focuses on tools and techniques for monitoring and optimizing the training process of machine learning models. We introduce MLflow, an open-source tool that lets us track fine-tuning experiments, and present the experiment card as the basic means of documenting the progress of the search for a better model. We then move on to methods of training optimization and present Optuna as one of the tools that automate this procedure. We conclude the day with an introduction to the model registry.
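
    As a rough illustration of how these two tools fit together, an Optuna study can log each trial to MLflow along the lines of the sketch below. The experiment name, search space, and the stand-in train_and_evaluate function are invented for the example and are not the workshop's exact code.

        import mlflow
        import optuna

        def train_and_evaluate(learning_rate: float, batch_size: int) -> float:
            # Stand-in for a real fine-tuning run; returns a fake validation score.
            return 1.0 / (1.0 + abs(learning_rate - 1e-4) * 1e3)

        mlflow.set_experiment("finetuning-demo")  # hypothetical experiment name

        def objective(trial: optuna.Trial) -> float:
            # Sample hyperparameters from an illustrative search space.
            learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
            batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])

            # Record each trial as a separate MLflow run.
            with mlflow.start_run():
                mlflow.log_params({"learning_rate": learning_rate, "batch_size": batch_size})
                val_f1 = train_and_evaluate(learning_rate, batch_size)
                mlflow.log_metric("val_f1", val_f1)
            return val_f1

        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=20)
        print("Best parameters:", study.best_params)

    The best trial's model can then be promoted through the MLflow model registry, which is where the day concludes.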

    Level

    Intermediate

    New day, new tools, and once again the main challenge of the workshop is to understand and leverage them to the greatest extent possible. Thus, the difficulty will be perceived differently by the participants based on their previous experiences with such tools. On average, we expect a medium level of difficulty.

    Target Participants

    Anyone interested in neural networks will find some interesting bits here, but the main beneficiaries of the workshop will be those directly involved in training or fine-tuning models.

    Day 3 - Data-centric AI and Model Testing
    Agenda
    • Text Data Quality Assurance
      – Data preprocessing and initial quality checking
      – CleanLab blackbox testing
    • Model Quality Testing
      – Common metrics
      – Behavioral testing with Giskard
    • Interpretability
      – Global vs local explainability
      – Local explainability with Captum
    Description

    In the entire NLP pipeline, we have to process and clean the data before modeling, and analyze and understand our models after training. This workshop focuses on data-centric AI and data quality testing, as well as testing and analyzing the resulting models. Understanding data quality, problems, and errors is crucial for designing proper preprocessing. These factors also affect the models themselves, lowering their quality and stability, often in ways related to bias and unfairness. Spotting such problems is particularly challenging for large text corpora, which cannot be analyzed manually.
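
    One example of the kind of black-box data quality check covered here is CleanLab's label-issue detection, sketched below on a toy dataset. The labels and predicted probabilities are invented for the illustration; in practice, pred_probs would come from out-of-sample (e.g. cross-validated) model predictions.

        import numpy as np
        from cleanlab.filter import find_label_issues

        # Toy example: five texts, two classes. In a real pipeline the labels come
        # from annotation and pred_probs from cross-validated model predictions.
        labels = np.array([0, 0, 1, 1, 0])
        pred_probs = np.array([
            [0.9, 0.1],
            [0.8, 0.2],
            [0.2, 0.8],
            [0.3, 0.7],
            [0.1, 0.9],  # suspicious: labeled 0, but the model is confident it is 1
        ])

        # Indices of examples that most likely carry a wrong label,
        # ranked from most to least suspicious.
        issues = find_label_issues(
            labels=labels,
            pred_probs=pred_probs,
            return_indices_ranked_by="self_confidence",
        )
        print(issues)  # expected to flag index 4 in this toy setup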

    Level

    Intermediate

    Like the previous days, this workshop introduces many new concepts and tools, such as CleanLab and Giskard, so participants should focus on exploring them, learning how to use them, and recognizing the scenarios in which leveraging them would be beneficial. The coding side isn't particularly demanding, but due to the number of novelties the difficulty level is estimated at medium.

    Target Participants

    This workshop was created to help optimize the data mining phase, as well as the subsequent quality assurance processes that come after the models are trained. All tools and techniques introduced on this day can benefit anyone dedicated to obtaining the best possible data and to understanding how it affects model quality and interpretability, including data engineers and researchers on R&D teams.

    Day 4 - Monitoring
    Agenda
    • Introduction to Monitoring
      – Introduction to Evidently
      – Data quality testing
      – Data drift detection
    • NLP Data Monitoring
      – Text and embeddings monitoring
      – Multimodal NLP monitoring
      – Customized metrics
    • NLP Model Monitoring and Concept Drift
      – General approaches and label availability
      – Label-available concept drift detection
      – Label-free model performance estimation
      – Introduction to NannyML
    Description

    After gathering data, creating and deploying a model, a whole new phase begins: post-deployment. It covers versioning, monitoring, retraining, updating, and more. Here we focus on monitoring, which helps us to detect changes in data and model behavior. This is also known as drift monitoring, and can be divided into data drift and concept (model) drift. These are particularly challenging to get right in NLP applications, where we have text data, embeddings, multimodal datasets, and LLMs.
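
    As a taste of the tooling, the sketch below runs a basic data drift check with Evidently's report API (as available in the 0.4.x releases); the reference and current samples and their columns are invented placeholders standing in for training-time and production data.

        import pandas as pd
        from evidently.report import Report
        from evidently.metric_preset import DataDriftPreset

        # Placeholder reference (training-time) and current (production) samples of
        # simple text statistics; real monitoring would use richer features or embeddings.
        reference = pd.DataFrame({"text_length": [120, 95, 140, 80, 110],
                                  "num_tokens": [25, 18, 30, 16, 22]})
        current = pd.DataFrame({"text_length": [300, 280, 260, 310, 295],
                                "num_tokens": [60, 55, 52, 64, 58]})

        # Compare the two samples column by column and flag drifting features.
        report = Report(metrics=[DataDriftPreset()])
        report.run(reference_data=reference, current_data=current)
        report.save_html("data_drift_report.html")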

    Level

    Advanced

    This workshop is expected to be the most challenging of the series in terms of the concepts and solutions introduced, mainly due to the new tools, but also due to the phenomena of data drift and concept drift themselves.

    Target Participants

    This part of the course aims to answer the following question: what should we do after the model is trained and deployed? The conclusions should be of value to everyone involved in the training and deployment processes, but especially to data scientists responsible for data collection and cleansing and to engineers responsible for building, training, or fine-tuning models, since data drift and concept drift primarily affect those areas when supporting the deployed solution.

    Get Your Personalized LLMs Course Quote

    Interested in mastering Large Language Models? Fill out the form below to receive a tailored quotation for a course designed to meet your specific needs and objectives.






      Your certified partner!

      Empower your projects with Cognitum, backed by the assurance of our ISO 27001 and ISO 9001 certifications, symbolizing elite data security and quality standards.