publications | German Barquero

2025

arXiv

SneakPeek: Future-Guided Instructional Streaming Video Generation

Cheeun Hong , German Barquero , Fadime Sener , and 6 more authors

arXiv preprint arXiv:2512.13019, 2025

Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.

CVPR’25

From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models

German Barquero , Nadine Bertsch , Manojkumar Marramreddy , and 8 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2025

In extended reality (XR), generating full-body motion of the users is important to understand their actions, drive their virtual avatars for social interaction, and convey a realistic sense of presence. While prior works focused on spatially sparse and always-on input signals from motion controllers, many XR applications opt for vision-based hand tracking for reduced user friction and better immersion. Compared to controllers, hand tracking signals are less accurate and can even be missing for an extended period of time. To handle such unreliable inputs, we present Rolling Prediction Model (RPM), an online and real-time approach that generates smooth full-body motion from temporally and spatially sparse input signals. Our model generates 1) accurate motion that matches the inputs (i.e., tracking mode) and 2) plausible motion when inputs are missing (i.e., synthesis mode). More importantly, RPM generates seamless transitions from tracking to synthesis, and vice versa. To demonstrate the practical importance of handling noisy and missing inputs, we present GORP, the first dataset of realistic sparse inputs from a commercial virtual reality (VR) headset with paired high quality body motion ground truth. GORP provides >14 hours of VR gameplay data from 28 people using motion controllers (spatially sparse) and hand tracking (spatially and temporally sparse). We benchmark RPM against the state of the art on both synthetic data and GORP to highlight how we can bridge the gap for real-world applications with a realistic dataset and by handling unreliable input signals. Our code, pretrained models, and GORP dataset are available on the project webpage.

CVPR’25

MixerMDM: Learnable Mixing of Human Motion Diffusion Models

Pablo Ruiz-Ponce , German Barquero , Cristina Palmero , and 2 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2025

Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality by computing the alignment between the mixed generated motions and their conditions as well as the capabilities of MixerMDM to adapt the mixing throughout the denoising process depending on the motions to mix.

2024

CVPR’24

Seamless Human Motion Composition with Blended Positional Encodings

German Barquero , Sergio Escalera , and Cristina Palmero

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024

Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, these typically result in isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this, we introduce the Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt transitions.
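
The abstract names the two jerk-based metrics without defining them. As a rough illustration only, the sketch below assumes that jerk is the third time derivative of a joint position, approximates it with finite differences on a 1-D trajectory, and reports the peak of its magnitude and the area under it. The function names and toy trajectories are hypothetical and are not taken from the FlowMDM code; the paper's exact definitions may differ.

```python
def jerk_magnitudes(positions, dt):
    """Absolute third finite difference of a 1-D trajectory, divided by dt**3."""
    return [
        abs(positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i])
        / dt**3
        for i in range(len(positions) - 3)
    ]


def peak_jerk(positions, dt):
    """Largest jerk magnitude along the trajectory (assumed reading of 'Peak Jerk')."""
    return max(jerk_magnitudes(positions, dt))


def area_under_jerk(positions, dt):
    """Rectangle-rule area under the jerk-magnitude curve (assumed reading)."""
    return sum(jerk_magnitudes(positions, dt)) * dt


# Toy data: constant-velocity motion versus the same motion with a sudden jump.
smooth = [float(t) for t in range(8)]
abrupt = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]

print(peak_jerk(smooth, 1.0), area_under_jerk(smooth, 1.0))
print(peak_jerk(abrupt, 1.0), area_under_jerk(abrupt, 1.0))
```

Under these assumptions the constant-velocity trajectory scores zero on both quantities, while the jump raises both, which is the behavior a transition-smoothness metric is meant to capture.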

2023

ICCV’23

BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction

German Barquero , Sergio Escalera , and Cristina Palmero

In Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023

Stochastic human motion prediction (HMP) has generally been tackled with generative adversarial networks and variational autoencoders. Most prior works aim at predicting highly diverse movements in terms of the skeleton joints’ dispersion. This has led to methods predicting fast and motion-divergent movements, which are often unrealistic and incoherent with past motion. Such methods also neglect contexts that need to anticipate diverse low-range behaviors, or actions, with subtle joint displacements. To address these issues, we present BeLFusion, a model that, for the first time, leverages latent diffusion models in HMP to sample from a latent space where behavior is disentangled from pose and motion. As a result, diversity is encouraged from a behavioral perspective. Thanks to our behavior coupler’s ability to transfer sampled behavior to ongoing motion, BeLFusion’s predictions display a variety of behaviors that are significantly more realistic than the state of the art. To support it, we introduce two metrics, the Area of the Cumulative Motion Distribution, and the Average Pairwise Distance Error, which are correlated to our definition of realism according to a qualitative study with 126 participants. Finally, we prove BeLFusion’s generalization power in a new cross-dataset scenario for stochastic HMP.

2022

PMLR

Chalearn LAP challenges on self-reported personality recognition and non-verbal behavior forecasting during social dyadic interactions: Dataset, design, and results

Cristina Palmero , German Barquero , Julio CS Jacques Junior , and 8 more authors

In Understanding Social Behavior in Dyadic and Small Group Interactions , 2022

This paper summarizes the 2021 ChaLearn Looking at People Challenge on Understanding Social Behavior in Dyadic and Small Group Interactions (DYAD), which featured two tracks, self-reported personality recognition and behavior forecasting, both on the UDIVA v0.5 dataset. We review important aspects of this multimodal and multiview dataset consisting of 145 interaction sessions where 134 participants converse, collaborate, and compete in a series of dyadic tasks. We also detail the transcripts and body landmark annotations for UDIVA v0.5 that are newly introduced for this occasion. We briefly comment on organizational aspects of the challenge before describing each track and presenting the proposed baselines. The results obtained by the participants are extensively analyzed to bring interesting insights about the tracks' tasks and the nature of the dataset. We wrap up with a discussion on challenge outcomes, and pose several questions that we expect will motivate further scientific research to better understand social cues in human-human and human-machine interaction scenarios and help build future AI applications for good.

PMLR

Didn’t see that coming: a survey on non-verbal social human behavior forecasting

German Barquero , Johnny Núñez , Sergio Escalera , and 4 more authors

In Understanding Social Behavior in Dyadic and Small Group Interactions , 2022

Non-verbal social human behavior forecasting has increasingly attracted the interest of the research community in recent years. Its direct applications to human-robot interaction and socially-aware human motion generation make it a very attractive field. In this survey, we define the behavior forecasting problem for multiple interactive agents in a generic way that aims at unifying the traditionally separate fields of social signal prediction and human motion forecasting. We hold that both problem formulations refer to the same conceptual problem, and identify many shared fundamental challenges: future stochasticity, context awareness, history exploitation, etc. We also propose a taxonomy that covers methods published in the last 5 years in an informative way and describes the current main concerns of the community with regard to this problem. In order to promote further research in this field, we also provide a summarised and friendly overview of audiovisual datasets featuring non-acted social interactions. Finally, we describe the most common metrics used in this task and their particular issues.

PMLR

Comparison of Spatio-Temporal Models for Human Motion and Pose Forecasting in Face-to-Face Interaction Scenarios

German Barquero , Johnny Núñez , Zhen Xu , and 4 more authors

In Understanding Social Behavior in Dyadic and Small Group Interactions , 2022

Human behavior forecasting during human-human interactions is of utmost importance to provide robotic or virtual agents with social intelligence. This problem is especially challenging for scenarios that are highly driven by interpersonal dynamics. In this work, we present the first systematic comparison of state-of-the-art approaches for behavior forecasting. To do so, we leverage whole-body annotations (face, body, and hands) from the very recently released UDIVA v0.5, which features face-to-face dyadic interactions. Our best attention-based approaches achieve state-of-the-art performance in UDIVA v0.5. We show that by autoregressively predicting the future with methods trained for the short-term future (<400ms), we outperform the baselines even for a considerably longer-term future (up to 2s). We also show that this finding holds when highly noisy annotations are used, which opens new horizons towards the use of weakly-supervised learning. Combined with large-scale datasets, this may help boost the advances in this field.

2021

T-BIOM

Rank-based verification for long-term face tracking in crowded scenes

German Barquero , Isabelle Hupont , and Carles Fernandez Tena

IEEE Transactions on Biometrics, Behavior, and Identity Science, 2021

Most current multi-object trackers focus on short-term tracking, and are based on deep and complex systems that often cannot operate in real-time, making them impractical for video-surveillance. In this paper we present a long-term, multi-face tracking architecture conceived for working in crowded contexts where faces are often the only visible part of a person. Our system benefits from advances in the fields of face detection and face recognition to achieve long-term tracking, and is particularly unconstrained with respect to the motion and occlusions of people. It follows a tracking-by-detection approach, combining a fast short-term visual tracker with a novel online tracklet reconnection strategy grounded on rank-based face verification. The proposed rank-based constraint favours higher inter-class distance among tracklets, and reduces the propagation of errors due to wrong reconnections. Additionally, a correction module is included to correct past assignments with no extra computational cost. We present a series of experiments introducing novel specialized metrics for the evaluation of long-term tracking capabilities, and publicly release a video dataset with 10 manually annotated videos and a total length of 8 min 54 s. Our findings validate the robustness of each of the proposed modules, and demonstrate that, in these challenging contexts, our approach yields up to 50% longer tracks than state-of-the-art deep learning trackers.

2020

NeuroImage

RimNet: A deep 3D multimodal MRI architecture for paramagnetic rim lesion assessment in multiple sclerosis

G. Barquero , F. La Rosa , H. Kebiri , and 13 more authors

NeuroImage: Clinical, 2020

Objectives
In multiple sclerosis (MS), the presence of a paramagnetic rim at the edge of non-gadolinium-enhancing lesions indicates perilesional chronic inflammation. Patients featuring a higher paramagnetic rim lesion burden tend to have more aggressive disease. The objective of this study was to develop and evaluate a convolutional neural network (CNN) architecture (RimNet) for automated detection of paramagnetic rim lesions in MS employing multiple magnetic resonance (MR) imaging contrasts.
Materials and methods
Imaging data were acquired at 3 Tesla on three different scanners from two different centers, totaling 124 MS patients, and studied retrospectively. Paramagnetic rim lesion detection was independently assessed by two expert raters on T2*-phase images, yielding 462 rim-positive (rim+) and 4857 rim-negative (rim-) lesions. RimNet was designed using 3D patches centered on candidate lesions in 3D-EPI phase and 3D FLAIR as input to two network branches. The interconnection of branches at both the first network blocks and the last fully connected layers favors the extraction of low and high-level multimodal features, respectively. RimNet’s performance was quantitatively evaluated against experts’ evaluation from both lesion-wise and patient-wise perspectives. For the latter, patients were categorized based on a clinically relevant threshold of 4 rim+ lesions per patient. The individual prediction capabilities of the images were also explored and compared (DeLong test) by testing a CNN trained with one image as input (unimodal).
Results
The unimodal exploration showed the superior performance of 3D-EPI phase and 3D-EPI magnitude images in the rim+/- classification task (AUC = 0.913 and 0.901), compared to the 3D FLAIR (AUC = 0.855, Ps < 0.0001). The proposed multimodal RimNet prototype clearly outperformed the best unimodal approach (AUC = 0.943, P < 0.0001). The sensitivity and specificity achieved by RimNet (70.6% and 94.9%, respectively) are comparable to those of experts at the lesion level. In the patient-wise analysis, RimNet performed with an accuracy of 89.5% and a Dice coefficient (or F1 score) of 83.5%.
Conclusions
The proposed prototype showed promising performance, supporting the usage of RimNet for speeding up and standardizing the paramagnetic rim lesions analysis in MS.
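
The Results paragraph reports sensitivity, specificity, and a "Dice coefficient (or F1 score)". As a small worked check of why the last two names are interchangeable for binary classification, the sketch below computes each from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, Dice, and F1 from binary confusion counts."""
    sensitivity = tp / (tp + fn)          # true-positive rate (recall)
    specificity = tn / (tn + fp)          # true-negative rate
    precision = tp / (tp + fp)
    dice = 2 * tp / (2 * tp + fp + fn)    # overlap form
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "dice": dice,
        "f1": f1,
    }


# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=88)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The Dice and F1 values coincide because substituting precision = TP / (TP + FP) and recall = TP / (TP + FN) into the harmonic mean simplifies to 2·TP / (2·TP + FP + FN).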

IJCB

Long-term face tracking for crowded video-surveillance scenarios

German Barquero , Carles Fernandez , and Isabelle Hupont

In IEEE International Joint Conference on Biometrics , 2020

Most current multi-object trackers focus on short-term tracking, and are based on deep and complex systems that do not operate in real-time, often making them impractical for video-surveillance. In this paper, we present a long-term multi-face tracking architecture conceived for working in crowded contexts, particularly unconstrained in terms of movement and occlusions, and where the face is often the only visible part of the person. Our system benefits from advances in the fields of face detection and face recognition to achieve long-term tracking. It follows a tracking-by-detection approach, combining a fast short-term visual tracker with a novel online tracklet reconnection strategy grounded on face verification. Additionally, a correction module is included to correct past track assignments with no extra computational cost. We present a series of experiments introducing novel, specialized metrics for the evaluation of long-term tracking capabilities and a video dataset that we publicly release. Findings demonstrate that, in this context, our approach yields up to 50% longer tracks than state-of-the-art deep learning trackers.