Food Detection to Estimate Calories Using Detection Transformer

Keywords: Computer Vision, Deep Learning, Food Detection, Detection Transformer, Calories

Accurately estimating calorie intake remains a common challenge because many individuals have a limited understanding of portion sizes and the caloric content of foods. This lack of nutritional knowledge is a major cause of both over- and under-consumption of calories and contributes to significant public health problems, including obesity, cardiovascular disease, and chronic metabolic disorders. Although computer vision-based approaches to dietary assessment have advanced, many methods still rely on handcrafted features, anchor-based CNN detectors, or controlled geometric assumptions. A practical gap therefore remains: there is no fully functional system that operates on ordinary RGB images captured under everyday conditions. This study develops an end-to-end food detection and calorie estimation system that uses the Detection Transformer (DETR) to predict calorie values directly from food images. The main contributions are: (1) employing DETR to avoid the limitations of non-maximum suppression and improve the stability of multi-food recognition; (2) using a bounding box area-to-weight ratio as a low-complexity alternative to segmentation-based food portion estimation; and (3) developing a user-friendly interface for output visualization that displays detected food items and their estimated calorie values in real-world scenarios involving irregular food shapes and varying focal lengths. A DETR-based detector was trained on 2,228 COCO-formatted images covering six distinct food classes. Calorie values were estimated by predicting food weight from bounding box measurements and then calculating calories using standardized reference weights. Robustness was assessed by evaluating the method on both controlled and real-life food images. Experimental results demonstrated moderate performance, with a mean Average Precision (mAP) of 0.617 and a mean Average Recall (mAR) of 0.656. The weight prediction module served as the primary estimation component, achieving a mean absolute residual of 8.7. These findings suggest that bounding box area is a reliable estimator of serving size. This study serves as a proof of concept for monitoring individual food intake and provides a foundation for further improvement in sub-item recognition, three-dimensional volume estimation, and the inclusion of broader food classes.
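
The area-to-weight step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simple linear scaling between bounding box area and weight, and the class names, reference areas, reference weights, and energy densities in `REFERENCES` are hypothetical placeholders rather than values from the study. The helper `mean_absolute_residual` shows one plausible reading of the reported error metric, namely the average absolute difference between predicted and measured weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoodReference:
    """Hypothetical per-class reference values (not taken from the study)."""
    ref_weight_g: float      # weight of a reference serving, in grams
    ref_box_area_px: float   # bounding box area of that serving, in pixels^2
    kcal_per_100g: float     # energy density, in kcal per 100 g


# Placeholder reference table; a real system would calibrate these values.
REFERENCES = {
    "rice": FoodReference(ref_weight_g=150.0, ref_box_area_px=30000.0, kcal_per_100g=130.0),
    "egg": FoodReference(ref_weight_g=50.0, ref_box_area_px=10000.0, kcal_per_100g=155.0),
}


def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; degenerate boxes give 0."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def estimate_weight_g(label, box):
    """Scale the reference weight by the ratio of detected to reference area."""
    ref = REFERENCES[label]
    return ref.ref_weight_g * box_area(box) / ref.ref_box_area_px


def estimate_calories(label, box):
    """Convert the estimated weight to kcal using the class energy density."""
    ref = REFERENCES[label]
    return estimate_weight_g(label, box) * ref.kcal_per_100g / 100.0


def mean_absolute_residual(predicted, actual):
    """Average absolute difference between predicted and measured values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    # Detections as (label, box) pairs, e.g. post-processed detector output.
    detections = [("rice", (0, 0, 200, 150)), ("egg", (10, 10, 110, 60))]
    for label, box in detections:
        print(label, estimate_weight_g(label, box), estimate_calories(label, box))
```

Because the scaling depends only on box area, this sketch ignores food height and camera distance, which is consistent with the abstract naming three-dimensional volume estimation as future work.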