Abstract
Introduction: As Artificial Intelligence (AI) continues to make remarkable strides across medical imaging, high-quality training data has become integral to advancing AI in clinical diagnostics. However, the acquisition and annotation of real medical images are notably time-intensive, requiring considerable manpower and specific clinical expertise. To alleviate the annotation workload, the generation of realistic medical images for the development of AI algorithms has emerged as a valuable yet challenging research area. This is especially pertinent for PET images with tumors, where accurate tumor segmentation and quantitative analysis are critical for patient diagnosis and treatment. Existing methods for simulating tumors typically either modify the activity values in selected regions through image processing or incorporate tumor-related measurement data into list-mode data. Such methods often produce overly simplistic tumor shapes, markedly distinct from those observed in actual images. As a class of generative models, diffusion models have shown impressive potential in the conditional generation of natural images (e.g., Midjourney, DALL-E). In this work, we introduce a novel approach that employs a latent diffusion model (LDM) to directly generate PET images with tumors, using tumor masks as the conditional input. With masks of customizable shapes and positions, the proposed method can generate realistic tumors at any specified location and with any specified size and shape.
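As a minimal illustration of how such conditional masks can be constructed, the sketch below builds a binary ellipsoidal tumor mask with a user-chosen center and radii inside a 128×128×128 volume. The function name and parameters are our own placeholders rather than code from this work; in practice, masks may equally be drawn from real segmentations or edited by hand.

```python
# Hypothetical sketch: build a customizable binary tumor mask for conditioning.
import numpy as np

def ellipsoid_mask(shape=(128, 128, 128), center=(64, 64, 64), radii=(10, 8, 12)):
    """Return a boolean 3D mask that is True inside the specified ellipsoid."""
    zz, yy, xx = np.ogrid[:shape[0], :shape[1], :shape[2]]
    dist = (((zz - center[0]) / radii[0]) ** 2
            + ((yy - center[1]) / radii[1]) ** 2
            + ((xx - center[2]) / radii[2]) ** 2)
    return dist <= 1.0

# Example: an off-center tumor with anisotropic radii (arbitrary values).
mask = ellipsoid_mask(center=(60, 70, 54), radii=(9, 12, 7))
```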
Methods: The LDM employed in this study comprises three main components, as shown in Fig. 1. The first is a Vector Quantized Variational Autoencoder (VQVAE), which encodes images into a latent space with 8-fold spatial down-sampling. During VQVAE training, the input images and the decoder outputs are compared at multiple resolution levels to compute the losses and update the parameters. In the second training phase, we train the parameters of the diffusion model and the condition encoder: the latent representation produced by the pre-trained VQVAE is diffused into Gaussian noise, and a UNet model is trained to denoise the latent variable concatenated with the condition latent variable produced by the condition encoder. During sampling, we use denoising diffusion implicit model (DDIM) acceleration with a total of 150 steps. The training data are the public Head-Neck 3D PET dataset; each sample contains at least one tumor with an associated segmentation mask. Two hundred 3D samples are used to train the VQVAE and the diffusion model, and another twenty samples are used for testing. All networks are implemented in the PyTorch framework; the training batch size is 1, the image size is 128×128×128, and the latent variable size is 16×16×16.
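To make the two-stage procedure concrete, the condensed PyTorch sketch below shows (i) one training step of the conditional diffusion model in the frozen VQVAE's latent space and (ii) deterministic DDIM sampling over 150 steps. All module interfaces (vqvae.encode/decode, cond_encoder, unet), the noise schedule, and the latent channel count are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

T = 1000                                   # training diffusion steps (assumed)
LATENT_CH = 4                              # latent channels (assumed; not stated)
betas = torch.linspace(1e-4, 2e-2, T)      # standard linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def train_step(vqvae, cond_encoder, unet, optimizer, pet, mask):
    """One denoising step on a (128^3 PET volume, tumor mask) pair."""
    with torch.no_grad():
        z0 = vqvae.encode(pet)             # frozen VQVAE: (B, C, 16, 16, 16)
    c = cond_encoder(mask)                 # condition latent, same spatial size
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alphas_bar.to(z0.device)[t].view(-1, 1, 1, 1, 1)
    zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps          # forward diffusion
    eps_hat = unet(torch.cat([zt, c], dim=1), t)         # condition by concat
    loss = F.mse_loss(eps_hat, eps)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

@torch.no_grad()
def ddim_sample(vqvae, cond_encoder, unet, mask, steps=150, device="cpu"):
    """Deterministic DDIM sampling (eta = 0) over a 150-step sub-sequence."""
    c = cond_encoder(mask.to(device))
    z = torch.randn(mask.shape[0], LATENT_CH, 16, 16, 16, device=device)
    taus = torch.linspace(T - 1, 0, steps).long().to(device)  # strided timesteps
    ab = alphas_bar.to(device)
    for i, t in enumerate(taus):
        eps_hat = unet(torch.cat([z, c], dim=1), t.expand(z.shape[0]))
        z0_hat = (z - (1 - ab[t]).sqrt() * eps_hat) / ab[t].sqrt()
        ab_prev = ab[taus[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        z = ab_prev.sqrt() * z0_hat + (1 - ab_prev).sqrt() * eps_hat
    return vqvae.decode(z)                               # decode to 128^3 PET
```

Conditioning by channel-wise concatenation in latent space, as described above, leaves the UNet architecture unchanged apart from its input channel count; under these assumptions, cross-attention conditioning would be a straightforward substitute.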
Results: Figure 2 presents a comparison between authentic PET images and images generated with our LDM. The generated images show notable fidelity to the real images, particularly in replicating the metabolic activity of tumors; the contours and internal structures of the tumors bear a striking resemblance to those in the original images. In terms of macroscopic shape and size, the tumors closely match the specifications provided by the tumor mask. While some anatomical details show minor discrepancies compared to the real images, the generated images still serve as valuable assets for the development of medical AI algorithms, especially when training samples are scarce.
Conclusions: This research presents a novel method for generating realistic tumor PET images using an LDM. The results indicate that the method can generate tumor images whose edges and internal structures are nearly indistinguishable from those of real tumors. A significant advantage is that it lets users specify the tumor's location, size, heterogeneity, and shape. In future work, we aim to extend the approach to text-guided generation using a multi-modality LDM.