Food Detection to Estimate Calories Using Detection Transformer
Accurately estimating calorie intake remains a common challenge, as many individuals have limited understanding of portion sizes and the caloric content of foods. This lack of nutritional knowledge is a major cause of both over- and under-consumption of calories and contributes to significant public health problems, including obesity, cardiovascular disease, and chronic metabolic disorders. Although computer vision–based approaches to dietary assessment have advanced, many methods still rely on handcrafted features, anchor-based CNN detectors, or controlled geometric assumptions. This leaves a practical gap: a fully functional system that operates on plain RGB images captured under everyday conditions. This study aims to develop an end-to-end food detection and calorie estimation system using the Detection Transformer (DETR) to predict calorie values directly from food images. The main contributions of this study are: (1) employing DETR, whose set-based prediction removes the need for non-maximum suppression and improves the stability of multi-food recognition; (2) using a bounding box area-to-weight ratio as a low-complexity alternative to segmentation-based food portion estimation; and (3) developing a user-friendly interface that visualizes detected food items and their estimated calorie values in real-world scenarios involving irregular food shapes and varying focal lengths. A DETR-based detector was trained on 2,228 COCO-formatted images spanning six food classes. Calorie values were estimated by predicting food weight from bounding box measurements and then converting weight to calories using standardized reference weights. Robustness was assessed by evaluating the method on both controlled and real-life food images. Experimental results showed moderate performance, with a mean Average Precision (mAP) of 0.617 and a mean Average Recall (mAR) of 0.656. The weight prediction module, the primary estimation component, achieved a mean absolute residual of 8.7. These findings suggest that bounding box area is a reliable estimator of serving size. This study serves as a proof of concept for monitoring individual food intake and provides a foundation for further work on sub-item recognition, three-dimensional volume estimation, and the inclusion of broader food classes.
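To make the described pipeline concrete, the sketch below chains the two stages the abstract names: DETR detection, then a bounding box area-to-weight conversion followed by a calorie lookup against reference values. It is a minimal illustration, not the authors' implementation: the publicly available facebook/detr-resnet-50 COCO checkpoint from the Hugging Face transformers library stands in for the fine-tuned six-class model, and the AREA_TO_WEIGHT and CAL_PER_GRAM tables hold made-up calibration values chosen only to show the shape of the computation.

```python
# Sketch of the detect -> weight -> calorie pipeline described in the abstract.
# Assumptions (not from the paper): the facebook/detr-resnet-50 COCO checkpoint
# stands in for the authors' fine-tuned six-class model, and AREA_TO_WEIGHT /
# CAL_PER_GRAM hold illustrative, made-up calibration values.
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Hypothetical calibration tables: grams per pixel^2 of bounding box area,
# and kcal per gram taken from standardized reference values.
AREA_TO_WEIGHT = {"apple": 0.0030, "banana": 0.0025, "pizza": 0.0050}  # g / px^2
CAL_PER_GRAM   = {"apple": 0.52,   "banana": 0.89,   "pizza": 2.70}    # kcal / g

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
model.eval()

def estimate_calories(image_path: str, score_threshold: float = 0.7):
    """Detect food items and estimate calories from bounding box area."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert DETR's normalized predictions to absolute (x0, y0, x1, y1) boxes.
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=score_threshold, target_sizes=target_sizes
    )[0]

    estimates = []
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        name = model.config.id2label[label.item()]
        if name not in AREA_TO_WEIGHT:
            continue  # skip classes outside the (hypothetical) calibration tables
        x0, y0, x1, y1 = box.tolist()
        area = (x1 - x0) * (y1 - y0)            # bounding box area in px^2
        weight_g = area * AREA_TO_WEIGHT[name]  # area-to-weight ratio
        kcal = weight_g * CAL_PER_GRAM[name]    # reference calorie density
        estimates.append((name, round(score.item(), 2),
                          round(weight_g, 1), round(kcal, 1)))
    return estimates

if __name__ == "__main__":
    for name, score, weight_g, kcal in estimate_calories("meal.jpg"):
        print(f"{name}: score={score}, ~{weight_g} g, ~{kcal} kcal")
```

In the paper's setting, the per-class area-to-weight ratios and reference weights would be calibrated from the training data rather than hand-set, and the detector would be the fine-tuned six-class model; the structure of the computation, however, follows the abstract's description.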
Copyright (c) 2025 Joshua Putra Fesha Kristanto, Dedy Agung Prabowo, Yohani Setiya Rafika Nur (Author)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).