Abstract
Closed-set object detectors are effective, but recent advances have introduced zero-shot detectors that can recognize a wide range of object categories across different environments. These detectors rely on text prompts, such as object tags. This study explores using multimodal large language models (MLLMs) to gather object information from NeRF scenes and refine it into tags. We propose a training-free pipeline that extracts object-specific details, such as category, color, material, and functionality, from 3D scenes via prompting. We then investigate how object tagging can be applied to NeRF-reconstructed scenes, particularly in a manufacturing context. The pipeline is evaluated on object recognition in manufacturing environments, with the resulting categories serving as inputs for zero-shot object detection and other tasks.
| Original language | English |
|---|---|
| Title of host publication | Advances in Artificial Intelligence in Manufacturing II - Proceedings of the 2nd European Symposium on Artificial Intelligence in Manufacturing, 2024 |
| Publisher | Springer |
| Pages | 242-250 |
| ISBN (Print) | 9783031864889 |
| Publication status | Published - 2025 |
| MoE publication type | A4 Article in a conference publication |
| Event | 2nd European Symposium on Artificial Intelligence in Manufacturing, ESAIM 2024, Athens, Greece (16 Oct 2024) |
Publication series
| Series | Lecture Notes in Mechanical Engineering |
|---|---|
| ISSN | 2195-4356 |
Conference
| Conference | 2nd European Symposium on Artificial Intelligence in Manufacturing, ESAIM 2024 |
|---|---|
| Country/Territory | Greece |
| City | Athens |
| Period | 16/10/24 → 16/10/24 |
Funding
This research was funded by the VTT Technical Research Centre of Finland.
Keywords
- 3D Scene Understanding
- Multimodal Large Language Models
- NeRF
- Object Recognition
- Prompting