Tutorial 1: Implicit Image Compression: Encoding Pictures with Implicit Neural Representations
Organizers: Lorenzo Catania, Dario Allegra, University of Catania, Italy

Implicit Neural Representations (INRs) are a recent paradigm for information representation in which discrete data are interpreted as continuous functions from coordinates to samples. In the case of images, this function maps each pixel’s coordinates to that pixel’s colour. A neural network is over-fitted to this function, and the image is then reconstructed by running inference with the network. In this workflow, the image data are encoded as network parameters.
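The workflow described above can be sketched in a few lines. The following is a minimal illustration, not the codec presented in the tutorial: it assumes a tiny synthetic greyscale image, a one-hidden-layer tanh network, and plain full-batch gradient descent written in NumPy. The function names (`make_coords`, `forward`, `fit_inr`, `decode`) and all hyperparameters are hypothetical choices made for this example.

```python
import numpy as np


def make_coords(h, w):
    """Return an (h*w, 2) array of (row, col) coordinates scaled to [-1, 1]."""
    rows, cols = np.meshgrid(
        np.linspace(-1.0, 1.0, h), np.linspace(-1.0, 1.0, w), indexing="ij"
    )
    return np.stack([rows.ravel(), cols.ravel()], axis=1)


def forward(params, coords):
    """Evaluate the network: coordinates -> hidden activations -> grey level."""
    hidden = np.tanh(coords @ params["W1"] + params["b1"])
    return hidden, hidden @ params["W2"] + params["b2"]


def fit_inr(image, hidden=32, steps=2000, lr=0.01, seed=0):
    """Over-fit a small MLP to one image; its parameters become the code."""
    h, w = image.shape
    coords = make_coords(h, w)
    target = image.reshape(-1, 1).astype(float)
    rng = np.random.default_rng(seed)
    params = {
        "W1": rng.normal(0.0, 1.0, (2, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, 1)),
        "b2": np.zeros(1),
    }
    losses = []
    for _ in range(steps):
        act, pred = forward(params, coords)
        err = pred - target
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean-squared error by hand.
        g_pred = 2.0 * err / err.shape[0]
        g_act = g_pred @ params["W2"].T
        g_pre = g_act * (1.0 - act ** 2)  # derivative of tanh
        grads = {
            "W2": act.T @ g_pred,
            "b2": g_pred.sum(axis=0),
            "W1": coords.T @ g_pre,
            "b1": g_pre.sum(axis=0),
        }
        for name in params:
            params[name] -= lr * grads[name]
    return params, losses


def decode(params, h, w):
    """Reconstruct the image by querying the network at every pixel coordinate."""
    _, pred = forward(params, make_coords(h, w))
    return pred.reshape(h, w)


# Hypothetical 8x8 test image: a diagonal brightness ramp in [0, 1].
image = np.add.outer(np.arange(8), np.arange(8)) / 14.0
params, losses = fit_inr(image)
reconstruction = decode(params, 8, 8)
print(f"MSE at first step: {losses[0]:.4f}, at last step: {losses[-1]:.4f}")
```

In this sketch the "compressed file" would be the contents of `params`; a practical codec would use a more expressive architecture and would also compress the parameters themselves.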

Recent research refers to this paradigm as Implicit Image Compression and has demonstrated that codecs based on this emerging approach obtain good visual and quantitative results and can outperform well-established codecs, while not suffering from long-known defects such as block artifacts. In addition, the network can be fitted directly to specific metrics during training and, in contrast with other learned methods such as autoencoders, no pre-trained models are needed to encode images.

This tutorial will begin with a brief introduction to the concepts behind this technique. A Python implementation of a basic yet complete INR-based image codec will then be presented, with a focus on modularity and ease of comparison between results, in order to speed up research in the field. An in-depth overview of the contribution of each network architecture choice and of common model compression techniques, such as quantization, will follow, so that the audience gains solid knowledge of the field and can immediately start experimenting with state-of-the-art implicit image compression.
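As an illustration of the quantization step mentioned above, the sketch below applies uniform (affine) quantization to an array of network weights and measures the reconstruction error. It is a generic scheme assumed here for illustration only; the abstract does not state which quantizer the tutorial uses, and the helper names and the 8-bit setting are hypothetical.

```python
import numpy as np


def quantize_uniform(weights, bits=8):
    """Map float weights to integer codes on a uniform grid over [min, max].

    Supports up to 16 bits in this sketch.
    """
    lo, hi = float(weights.min()), float(weights.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    dtype = np.uint8 if bits <= 8 else np.uint16
    codes = np.round((weights - lo) / scale).astype(dtype)
    return codes, lo, scale


def dequantize_uniform(codes, lo, scale):
    """Map integer codes back to approximate float weights."""
    return codes.astype(np.float64) * scale + lo


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=(32, 32))  # stand-in for trained parameters
codes, lo, scale = quantize_uniform(weights, bits=8)
restored = dequantize_uniform(codes, lo, scale)
max_error = float(np.abs(weights - restored).max())
print(f"max abs error: {max_error:.5f} (half a step is {scale / 2:.5f})")
```

Storing the 8-bit codes together with `lo` and `scale` takes roughly a quarter of the space of 32-bit floats, at the cost of an error of at most half a quantization step per weight.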

Tutorial 2: Point Cloud Coding, Enhancement and Analysis: Towards Perception and Reliability
Organizers: Wei Gao and Ge Li, Peking University, China

The technologies and applications of 3D point clouds have attracted much attention from both academia and industry. Point clouds can effectively model 3D scenes and objects with a high-precision representation of geometry and associated attributes, such as colors and reflectances, and can improve both the immersive visual experience and machine vision analysis performance. As with large-scale image and video data, the huge volume of point cloud data requires more efficient compression algorithms to obtain a desirable rate-distortion tradeoff. Deep learning-based end-to-end compression methods have been successfully used for image and video compression, and attempts have also been made at deep learning-based point cloud compression. Because of differences in density and application scenarios, different data structures and organization approaches, as well as different neural network architectures, have been devised to optimize for different utilities. Both human and machine perception can be effectively optimized in deep learning-based frameworks. Moreover, large-scale datasets are being constructed for point clouds in different application scenarios, and quality assessment methods are being comprehensively studied through subjective experiments and objective models. Additionally, deep learning-based enhancement and restoration methods for degraded point clouds have been extensively explored; these can effectively deal with samples with compression artifacts as well as low-resolution, noisy and incomplete samples. Such quality improvements play a critical role in boosting the utility of point clouds in the wild. Furthermore, degraded point clouds harm the performance of machine vision tasks, so research on reliable analysis has also attracted much interest in the era of trustworthy AI. Both enhancement for machines and anti-degradation analysis can effectively improve the robustness of practical point cloud analysis systems. In this tutorial, we will provide an overview of these technologies and of the progress made over the past few years. We will also discuss recent efforts in the MPEG and AVS standardization groups on deep learning-based point cloud compression, our OpenPointCloud project, which we established as the first open-source project for deep learning-based point cloud compression and processing, and the advances in trustworthy AI for point cloud technologies and applications. The tutorial will introduce the basic knowledge in the field of 3D point cloud technologies, including data acquisition and assessment, compression and processing algorithms, standardization progress, open-source efforts, and diverse practical applications.

Tutorial 3: Tensor Regression for Visual Data Processing
Organizers: Yipeng Liu, Jiani Liu, University of Electronic Science and Technology of China, China

Regression analysis is a key area of interest in data analysis and pattern recognition, devoted to exploring the dependencies between variables. For example, one can predict the future climate state from previous recordings or infer a person’s age from facial images. However, traditional modeling methods rely on representation and computation in the form of vectors and matrices, so a multidimensional signal needs to be unfolded before subsequent processing. The multilinear structure is lost in such vectorization or matricization, which leads to sub-optimal performance.

Tensors, as higher-order generalizations of vectors and matrices, are considered natural representations of high-dimensional data. Driven by recent advances in applied mathematics, it is natural to move from classical matrix-based methods to tensor-based methods for better performance and dimensionality reduction. In many fields, such as sociology, climatology, geography, economics, computer vision, and neuroscience, tensor regression has been widely employed and proven useful. This tutorial will provide a thorough overview of tensor-based regression methods and their applications. We hope it will help participants build a good understanding of why tensor-based learning is important for regression analysis, what the main ideas and techniques are, and what applications it is suitable for.
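To make the point made above about structure and dimensionality concrete, here is a small sketch for matrix-valued inputs. It assumes a rank-1 coefficient matrix B = u v^T, which is only one of several low-rank forms used in tensor regression and is chosen here purely for illustration; the dimensions and variable names are hypothetical. The structured model gives the same prediction as the vectorized one while using far fewer parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols = 32, 32
X = rng.normal(size=(rows, cols))  # one matrix-valued sample, e.g. an image

# Structured model: coefficient matrix constrained to rank 1, B = u v^T.
u = rng.normal(size=rows)
v = rng.normal(size=cols)
B = np.outer(u, v)

# The prediction <X, B> can be computed without ever unfolding X ...
y_structured = u @ X @ v
# ... and matches the vectorized form vec(X) . vec(B).
y_vectorized = X.ravel() @ B.ravel()

params_vectorized = B.size      # coefficients of an unconstrained vectorized model
params_rank1 = u.size + v.size  # coefficients of the rank-1 model
print(np.isclose(y_structured, y_vectorized), params_vectorized, params_rank1)
```

For a 32x32 input the unconstrained model has 1024 coefficients while the rank-1 model has 64, and the gap widens quickly for higher-order tensors.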

Tutorial 4: Imagining Beyond Pixels: Bridging Modalities for Multimodal Representation and Learning
Organizers: Muhammad Haroon Yousaf, University of Engineering and Technology Taxila, Pakistan; Muhammad Saad Saeed, National Centre of Robotics and Automation, Pakistan; Shah Nawaz, Johannes Kepler University Linz, Austria

Our perception of the environment is multimodal, encompassing visual observations, auditory stimuli, tactile and olfactory sensations, and more. Modality denotes the way the world is perceived and encountered. From the perspective of machine learning, a modality is a specific category of data that a model can handle, such as audio, images, or text. Each modality has distinct characteristics and attributes, requiring different processing and analysis approaches to extract valuable information. Multimodal representation and learning processes and analyzes data from multiple sources or modalities simultaneously, such as video, audio, and sensor signals. This approach closely resembles natural perception and is essential for handling the heterogeneity and complexity of real-world data; it has therefore been applied in various areas, including sentiment analysis, natural language processing, and computer vision. In machine learning, multimodal learning is a paradigm that combines multiple modalities of data, as in audio-image or image-text learning, to improve the performance of a model. The idea is that different modalities can provide complementary cues that help a model make more accurate predictions or decisions. For example, a model that can process both images and text can better understand the context of an image and make more accurate predictions.

Recognizing the significance of multimodal representation and learning, this tutorial has been crafted to familiarize participants with cutting-edge research trends, applications, and hands-on experience in the field. It delivers a comprehensive introduction to multimodal representation and learning and emphasizes practical aspects. Furthermore, the tutorial explores applications and research challenges that participants can engage with and address throughout the course.

At the end of this tutorial, the audience will be able to:

  • Explain the basic concepts and rationale of multimodal learning: its functionality, applications, and challenges.
  • Understand different applications and state-of-the-art research in the domain of multimodality.
  • Explore applications to carry out research in multimodal fusion, face-voice association, and image-text joint representation learning.
  • Gain theoretical knowledge of, and practical insight into, various challenges (feature fusion, missing modalities, trustworthy AI, etc.).
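
As a rough illustration of the feature fusion and missing-modality challenges listed above, the sketch below concatenates per-modality feature vectors and zero-fills any modality that is absent, appending a presence mask so that a downstream model can tell the difference. This is a generic fusion baseline assumed for this example, not a method taken from the tutorial; the function name and feature sizes are hypothetical.

```python
import numpy as np


def fuse_features(image_feat, audio_feat, image_dim=4, audio_dim=3):
    """Concatenate modality features; pass None for a missing modality."""
    present = [image_feat is not None, audio_feat is not None]
    img = np.zeros(image_dim) if image_feat is None else np.asarray(image_feat, dtype=float)
    aud = np.zeros(audio_dim) if audio_feat is None else np.asarray(audio_feat, dtype=float)
    if img.shape != (image_dim,) or aud.shape != (audio_dim,):
        raise ValueError("feature vector has an unexpected size")
    # Layout: [image features | audio features | presence mask]
    return np.concatenate([img, aud, np.array(present, dtype=float)])


both = fuse_features([0.2, 0.1, 0.9, 0.4], [0.5, 0.3, 0.7])
image_only = fuse_features([0.2, 0.1, 0.9, 0.4], None)
print(both.shape, image_only[-2:])
```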

Tutorial 5: An Evaluation Perspective in Visual Object Tracking: from Task Design to Benchmark Construction and Algorithm Analysis
Organizers: Xin Zhao, University of Science and Technology, Beijing; Shiyu Hu, Institute of Automation Chinese Academy of Sciences, China

The Visual Object Tracking (VOT) task, a foundational element in computer vision, seeks to emulate the dynamic vision system of humans and attain human-like object tracking proficiency in intricate environments. This task has found widespread application in practical scenarios such as self-driving, video surveillance, and robot vision. Over the past decade, the surge in deep learning has spurred various research groups to devise diverse tracking frameworks, contributing to the advancement of VOT research. However, challenges persist in natural application scenes, with factors like target deformation, fast motion, and illumination changes posing obstacles for VOT trackers. Instances of suboptimal performance in authentic environments underscore a significant disparity between the capabilities of state-of-the-art trackers and human expectations. This observation underscores the imperative to scrutinize and enhance evaluation aspects in VOT research.

Therefore, in this tutorial, we aim to introduce the basic knowledge of dynamic visual tasks represented by VOT to the audience, starting from task definition and incorporating interdisciplinary research perspectives of evaluation techniques. The tutorial includes four parts: First, we discuss the evolution of task definition in research, which has transitioned from perceptual to cognitive intelligence. Second, we introduce the principal experimental environments utilized in VOT evaluations. Third, we present the executors responsible for executing VOT tasks, including tracking algorithms as well as interdisciplinary experiments involving human visual tracking. Finally, we introduce the evaluation mechanism and metrics, comprising traditional machine-machine comparisons and novel human-machine comparisons.

This tutorial aims to guide researchers in focusing on the emerging evaluation technique, improving their understanding of capability bottlenecks, facilitating a more thorough examination of disparities between current methods and human capabilities, and ultimately advancing towards the goal of algorithmic intelligence.

Tutorial 6: Exploration and standardization of deep learning-based video compression technologies
Organizers: Iole Moccagatta, Intel; Yan Ye, Alibaba Group US, USA

Video compression technologies have a broad impact on many facets of modern society, including entertainment, collaboration and communication, education, and digital commerce, to name just a few. Generations of video compression standards have delivered increased bandwidth efficiency to satisfy the demands of a wide range of video applications. In recent years, the arrival of deep learning technologies has not only powered new applications but also shown significant capability to improve the efficiency of existing ones. Among the latter, deep learning-based video compression has become a very active research area and has seen compression efficiency improve significantly within just a few years. Standards development organizations have taken notice of this fast-developing technology area and have been investigating and exploring deep learning-based video compression technologies for standardization. Some of this investigation has reached the standardization stage: for example, the Joint Video Experts Team (JVET) recently published a new edition of Versatile Supplemental Enhancement Information (H.274/VSEI) that adds a neural network-based post-filter SEI message. Other technologies, including neural network-based video coding and generative face video coding, are being actively explored and could be considered for standardization in the near future. The Moving Picture Experts Group (MPEG) is also developing a new standard, video coding for machines, a closely related topic that applies deep learning-based video compression and processing technologies to achieve higher compression efficiency for learning-based machine tasks such as video analysis and video understanding. In this tutorial, we will present an overview of the deep learning-based video coding activities in JVET and MPEG, including the timelines associated with the various exploration and standardization activities. We will focus on several important topics, e.g., neural network-based video coding, generative face video coding, and video coding for machines, and provide a technical deep dive into them.

The adoption of deep learning-based video compression in video standards is expected to have a disruptive impact on hardware and software implementations and to bring new video accelerators, such as NPUs and GPUs, into consideration. In such a fast-moving and disruptive technical area, this tutorial is of interest to a wide audience, starting with attendees from industry and practitioners who want to learn about the latest developments and specifications in standards organizations. Students in the field will gain an understanding of how this new technology could affect the industry landscape, and deep learning researchers and experts will get insights into how they can influence and contribute to the standardization of deep learning-based video coding technologies.

Tutorial 7: A Journey to Volumetric Video – the Past, the Present and the Future
Organizers: Oliver Schreer and Ingo Feldmann, Fraunhofer Heinrich Hertz Institute, Berlin, Germany

Volumetric video is considered one of the major technological breakthroughs for highly realistic representation of humans in eXtended Reality applications, future video communication services and collaboration tools. The technology becomes especially important in use cases where a convincing and natural representation of humans and their emotions is required, and it paves the way to overcoming the uncanny valley in the representation of digitized humans. This tutorial will focus on the state of the art of volumetric video as well as potential future directions. We present details of various volumetric capture systems of differing complexity and give insights into overall volumetric video production and processing workflows. Novel 3D content generation and rendering concepts, such as AI-based dynamic surface reconstruction, neural rendering and Gaussian Splatting, will be discussed. An overview of interactive volumetric video solutions, volumetric video encoding and streaming will be given. While keeping a strong technical focus, the presenters will enrich the tutorial with their many years of scientific and practical experience and give examples from recent productions.

Tutorial 8: Energy efficiency and sustainability of Broadcast and Streaming Technology. Measuring, understanding, and raising the awareness for proposing new technologies
Organizers: Olivier Le Meur, InterDigital; Christian Herglotz, Brandenburgisch-Technische Universität Cottbus-Senftenberg; Angeliki Katsenou, University of Bristol, UK

Significant changes in social behavior and consumer demand have triggered exponential growth in video consumption, with video now accounting for more than 82% of online data traffic. This rapid growth will continue in the coming years, requiring massive amounts of energy to support production, delivery and end-user consumption. This is not sustainable given the amount of energy required and the ecological emergency we face. Given the complexity of the challenges to overcome, every industry, regardless of size and scope, has a duty of care not only to measure the environmental impact of its sector but also to invest actively in sustainability efforts. Everyone is accountable when it comes to combating climate change.

This tutorial will not only present the magnitude of the energy consumption problem in video services, but will also elaborate on key initiatives for reducing the energy consumption.

Tutorial 9: JPEG AI – The First Learning Based Image Coding Standard
Organizers: João Ascenso, Instituto Superior Técnico; Elena Alshina, Huawei Technologies, Portugal

In the last decade, the multimedia arena has been shaken by the impact of deep learning (DL)-based technologies, notably for computer vision tasks, e.g., classification, detection, and recognition, where above-human performance levels are often achieved. In this context, it was just a question of time before DL-based tools entered the image and video coding arena, since it is impossible to ignore their potential benefits over conventional coding approaches. DL-based coding solutions create a so-called latent representation, containing the most important learned content features to describe the input data, following a training process in which a loss function controls the optimization of the DL-based model. The training process is at the heart of the DL-based media representation paradigm, especially when the goal is a single compressed representation that is efficient both for fidelity decoding and for computer vision tasks, e.g., classification and recognition, since both goals are important for an increasing number of application scenarios. Following evidence of competitive image compression performance, the idea of targeting a single, efficient DL-based compressed representation for coding, processing and computer vision tasks has recently been embraced by JPEG for image coding with the JPEG AI standard. In the video coding domain, the Joint Video Experts Team (JVET) of ITU-T VCEG and ISO/IEC MPEG has started an activity on neural network (NN)-based video coding technology. Both the JPEG and JVET/MPEG initiatives highlight that DL-based image and video coding solutions are ready for prime time in the multimedia coding arena.

This tutorial will address the JPEG AI Learning-based Image Coding System, an ongoing joint standardization effort between ISO, IEC and ITU-T to develop the first image coding standard based on machine learning, offering a single-stream, compact compressed-domain representation that targets both human visualisation and machine consumption. The tutorial presents and discusses the rationale behind the JPEG AI vision, notably how this new standardization initiative aims to shape the future of image coding. Moreover, it will present the JPEG AI Verification Model (VM) and characterize its coding efficiency and complexity, especially on the decoder side. From several well-known neural network-based compression algorithms, a set of elements was carefully selected based on the following principles: one tool per functionality and an optimal complexity-performance trade-off, while ensuring device interoperability. The JPEG AI VM has several unique characteristics, such as a parallelizable context model to perform latent prediction, decoupling of prediction and sample reconstruction, and rate adaptation, among others.

Tutorial 10: Visual Data Processing for Drone AI Technology: A Practical Perspective
Organizers: Muhammad Haroon Yousaf, University of Engineering and Technology Taxila Pakistan; Muhammad Saad Saeed, Muhammad Naeem Mumtaz, Swarm Robotics Lab-National Centre of Robotics and Automation, Pakistan

Advancements in drone technology have led to a surge in applications ranging from surveillance and agriculture to cinematography and search-and-rescue missions. Drones also play a vital role in healthcare by enabling the aerial delivery of essential medical supplies, such as blood, vaccines, drugs, and laboratory samples, to remote areas in developing nations during critical health emergencies. When drone technology was in its infancy, operators had to control the drones manually and actively monitor their live video feed. In the past few years, AI has evolved significantly, enabling a transition from rudimentary methods to highly sophisticated, efficient, and automated processes. Modern drone surveillance is characterized by advanced automation features such as waypoint navigation, GPS tracking, and obstacle avoidance. Moreover, drones are equipped with high-resolution cameras that capture everything from high-resolution images to multispectral imagery. However, a critical aspect of maximizing the potential of drones lies in harnessing visual data efficiently through AI technologies. Recently, the fusion of edge computing with drone vision has heralded a transformative era of real-time aerial intelligence. In view of the rapidly evolving landscape of drone technology and the pressing need for advanced real-time AI-based data processing, we have designed this tutorial to acquaint the audience with the latest tools, methods, and research trends in visual data processing for drone AI technology. The tutorial navigates through the intricacies of computer vision, visual data processing, drone technology, and the integration of edge computing, along with a trustworthiness perspective. At the end of this tutorial, the participants will have:

  • An extensive understanding of the complete visual data processing pipeline for drone AI technology, encompassing data acquisition and preprocessing.
  • Familiarity with the latest tools and methods for implementing techniques for diverse drone applications.
  • Knowledge of real-time processing constraints and of optimization strategies for deploying models on resource-constrained drone platforms.
  • An understanding of the trustworthiness perspective and ethical considerations with respect to visual data processing for drone AI technology.

Tutorial 11: Emerging inverse problems in computational imaging
Organizers: Shirin Jalali, Rutgers University, USA; Xin Yuan, Westlake University, China

In recent years, underdetermined linear inverse problems, closely related to computational imaging applications such as MRI, have witnessed substantial advancements in both theoretical understanding and algorithmic solutions. However, a diverse set of inverse problems associated with emerging computational imaging applications remains less explored compared to their linear counterparts. This tutorial aims to bridge this knowledge gap by highlighting these vital yet less-studied categories of inverse problems, providing a comprehensive overview of existing solutions and their inherent limitations. Importantly, we emphasize not only the pivotal role of AI/ML in addressing these complexities but also underscore the significance of the underlying mathematical frameworks and the integration of classical and novel signal processing techniques.
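For readers new to the topic, the following sketch shows what "underdetermined" means in practice for the linear case. It assumes a random 3x6 measurement matrix and a sparse ground-truth signal, both hypothetical, and uses the minimum-norm solution from `numpy.linalg.pinv` as a baseline: that solution reproduces the measurements yet differs from the true signal, which is why additional structure or priors are needed.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 6                      # fewer measurements than unknowns
A = rng.normal(size=(m, n))      # hypothetical measurement matrix
x_true = np.zeros(n)
x_true[1] = 1.0                  # hypothetical sparse signal
y = A @ x_true                   # noiseless measurements

# Minimum-norm solution among all x that satisfy A x = y.
x_min_norm = np.linalg.pinv(A) @ y

residual = float(np.linalg.norm(A @ x_min_norm - y))
recovery_error = float(np.linalg.norm(x_min_norm - x_true))
print(f"measurement residual: {residual:.2e}, recovery error: {recovery_error:.3f}")
```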

Tutorial 12: Quantum Image Processing: Unveiling the Computational Realm
Organizers: Monika Aggarwal, IIT Delhi, India

The interplay between machine learning and quantum physics has the potential to revolutionize various aspects of modern society. Recent advancements in deep learning have led to intriguing commercial applications, but many unsolved problems, mainly those for which classical resources are insufficient, still pose a challenge. Quantum computing, an emerging technology, holds promise for addressing computationally challenging problems that classical resources may not handle effectively.

Motivated by the success of classical deep learning and the rapid development of quantum computing, quantum neural networks (QNNs) have gained significant attention. QNNs share similarities with classical neural networks and contain variational parameters. Developing quantum versions of neural networks serves multiple purposes: quantum computers can outperform classical ones in various algorithms, quantum resources offer advantages in certain computational problems, and QNNs are naturally suited to handling quantum datasets. Quantum-enhanced machine learning models have also emerged, with notable examples including quantum support vector machines, quantum generative models, and QNNs, with the potential for exponential speedups. Rigorous quantum speedups have been proven in supervised learning tasks, further motivating the exploration of advanced quantum learning algorithms.

The tutorial begins with the fundamental concepts of quantum computing, the representation of signals and images in qubits, and the quantum principles that make these representations powerful and unique; a broad categorization of quantum classifiers will also be discussed. The tutorial then provides an overview of early QNN models, including quantum convolutional neural networks, continuous-variable quantum neural networks, tree tensor network classifiers, and multi-scale entanglement renormalization ansatz classifiers. Various QNN models proposed in recent years and theoretical works analyzing the expressive power of QNNs will also be covered.
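One common way to represent an image in qubits is amplitude encoding, in which the normalized pixel values become the amplitudes of a quantum state, so that 2^n pixels fit in n qubits. The abstract does not say which encoding the tutorial covers, so the sketch below is only an assumed example, simulated classically with NumPy; `amplitude_encode` is a hypothetical helper name.

```python
import numpy as np


def amplitude_encode(image):
    """Return (state, n_qubits): pixel values as amplitudes of a unit-norm state."""
    flat = np.asarray(image, dtype=float).ravel()
    # Smallest number of qubits whose 2**n amplitudes can hold every pixel.
    n_qubits = max(1, (flat.size - 1).bit_length())
    state = np.zeros(2 ** n_qubits)
    state[: flat.size] = flat  # unused amplitudes stay at zero
    norm = np.linalg.norm(state)
    if norm == 0.0:
        raise ValueError("an all-zero image cannot be normalized")
    return state / norm, n_qubits


image = np.arange(16, dtype=float).reshape(4, 4)  # hypothetical 4x4 image
state, n_qubits = amplitude_encode(image)
probabilities = state ** 2  # measurement probabilities of each basis state
print(n_qubits, float(probabilities.sum()))
```

Note that the normalization discards the overall brightness scale, so it has to be stored separately if the original pixel values are to be recovered.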

The tutorial’s focus lies on the classification of relatively high-dimensional data, particularly for block-encoding strategies, with benchmarks and an open-source repository providing valuable guidance for future explorations as datasets scale up. We also plan to provide hands-on sessions on building quantum circuits and experimenting with quantum models for applications such as image matching and segmentation.

Tutorial 13: JPEG Trust, a framework for establishing trust in media
Organizers: Deepayan Bhowmik, Newcastle University (UK); Touradj Ebrahimi, Swiss Federal Institute of Technology Lausanne, Switzerland; Jaime Delgado, Universitat Politècnica de Catalunya, Spain

Thanks to the success of technologies such as Generative AI and mobile computing, recent years have seen the emergence of large-scale media content creation and consumption. While progress in this space opens up new opportunities, especially in the creative industries, it also gives rise to problems in cyberspace, including cyber attacks, piracy, fake media distribution, and concerns about trust and privacy. In the recent past, manipulated media have been misused to cause social unrest, spread rumours for political gain, and encourage hate crimes.

Media modifications are not always negative, as they are increasingly a normal and legal component of many production pipelines, and even new knowledge production. However, in many application domains, creators need or want to declare the type of modifications that were performed on the media asset. A lack of such declarations in these situations may reveal the lack of trustworthiness of media assets or worse, the intention to hide the existence of manipulations.

Such a need triggered an initiative by the JPEG committee (https://jpeg.org/) to standardize a way to annotate media assets (regardless of the intent) and securely link the assets and annotations together. The JPEG Trust standard, to be released in 2024, ensures interoperability between a wide range of applications dealing with media asset creation and modification, providing a set of standard mechanisms to describe and embed information about the creation and modification of media assets.

This tutorial aims to present an in-depth description of the upcoming JPEG Trust international standard, including a limited but core hands-on coding exercise on the implementation and several use case scenarios.

Tutorial 14: Generative Modeling with Limited Data, Few Shots, and Zero Shot
Organizers: Ngai-Man Cheung, Milad Abdollahzadeh, Singapore University of Technology and Design, Singapore

Generative modeling is a field of machine learning that focuses on learning the underlying distribution of the training samples, enabling the generation of new samples that exhibit statistical properties similar to those of the training data. Over the years, significant advances have been made through innovative approaches such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models (DMs), which have played a pivotal role in enhancing the quality and diversity of generated samples. Research on generative modeling has mainly focused on setups with sizeable training datasets. For example, StyleGAN learns to generate realistic and diverse face images using FFHQ, a high-quality dataset of 70k human face images, and more recent text-to-image generative models are trained on millions of image-text pairs, e.g., the Latent Diffusion Model is trained on LAION-400M, with 400 million samples. However, in many domains (e.g., medical), collecting data samples is challenging and expensive. This has given rise to a new research direction, Generative Modeling under Data Constraints (GM-DC). Advances in GM-DC enable high-quality and diverse sample generation with fewer training samples (e.g., 0 to 5000 samples).

This tutorial discusses different aspects of GM-DC based on our thorough literature review on learning generative models under limited data, few shots, and zero shot. More specifically, the tutorial provides a comprehensive overview and detailed analysis of all types of generative models, tasks, and approaches studied in GM-DC, offering an accessible guide to the research landscape. We cover the essential background, provide a detailed analysis of the unique challenges of GM-DC, discuss current trends, and present the latest advancements in GM-DC. We propose two important taxonomies in GM-DC: a task taxonomy for the 8 generative tasks studied in GM-DC, and a taxonomy of 7 different approaches to address GM-DC. Finally, we analyze the research landscape and discuss potential future research directions in GM-DC.