How multimodal AI surpasses traditional industrial sensors

The limits of traditional sensors in manual operations

Industrial sensors are now mature at monitoring machine health: vibration, temperature and telemetry data make it possible to predict a failure before it causes downtime on the line (Vibration Sensors for Condition Monitoring, 2024). But when the analysis shifts from the asset to the workforce, this same infrastructure shows its limits. Inertial sensors and telemetry record that something has happened, not how it was done or whether it was done correctly, and they fail to capture the nuances of manual labour.

This opens an information gap: the distance between what sensors measure and what one actually needs to know about a manual operation. This is where multimodal AI comes into play. Vision-Language Models (VLMs) transform the video feed into structured descriptions, combining the semantics of language with the precision of computer vision.

When the sensor fails to understand the context

Automatic Human Activity Recognition (HAR) typically relies on a network of accelerometers, gyroscopes and pressure sensors integrated into tools or worn by the workforce. The limitation is structural: these systems struggle to interpret the context of the action. Scientific reviews on the subject (HAR: Review, Taxonomy and Open Challenges, 2022) highlight two recurring problems:

Different operators perform the same movement in different ways, and this makes it difficult for the model to generalise.
False positives can become so frequent as to render the systems unusable in practice.

A sensor detects a signal but cannot say whether that signal truly belongs to the correct activity.

Consider an assembly line that runs tens of thousands of manual cycles per month. At that volume, the noise generated by false alarms buries useful information and makes the data unusable for process optimisation. Understanding a manual operation requires a reading of the environment that point sensors cannot offer. Without the support of vision, the system remains blind to decisive variables such as component position, material wear or the presence of bottlenecks at the workstation.
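
A back-of-the-envelope calculation shows why false alarms can bury the useful signal at this volume. The sketch below is illustrative only: the cycle count, deviation rate, sensitivity and false-positive rate are assumed figures, not measurements from the cited review, and alert_precision is a hypothetical helper.

```python
def alert_precision(cycles, deviation_rate, sensitivity, false_positive_rate):
    """Return (true_alerts, false_alerts, precision) for a batch of cycles.

    precision = share of raised alerts that point at a real deviation.
    """
    deviations = cycles * deviation_rate          # cycles that really deviate
    normal = cycles - deviations                  # cycles performed correctly
    true_alerts = deviations * sensitivity        # deviations the sensor flags
    false_alerts = normal * false_positive_rate   # correct cycles flagged anyway
    total = true_alerts + false_alerts
    precision = true_alerts / total if total else 0.0
    return true_alerts, false_alerts, precision


# Assumed, purely illustrative figures for one month on one line.
true_alerts, false_alerts, precision = alert_precision(
    cycles=30_000,
    deviation_rate=0.01,
    sensitivity=0.9,
    false_positive_rate=0.05,
)
print(f"true alerts: {true_alerts:.0f}, false alerts: {false_alerts:.0f}, "
      f"precision: {precision:.1%}")
```

Under these assumptions only about 15% of the alerts correspond to a real deviation, even though the false-positive rate looks small in isolation. Because real deviations are rare, false alarms raised on correct cycles far outnumber the true alerts.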


Vision-Language Models as a bridge between image and meaning

Bridging that gap requires models capable of processing multimodal input. For decades the Industrial IoT has focused on collecting quantitative data; the frontier today is the qualitative understanding of the action. The MIT Technology Review describes multimodal AI as the new frontier of artificial intelligence, capable of fusing multiple senses like sight and sound into a coherent picture of reality, exactly as the human brain does (Multimodal: AI's new frontier, MIT Technology Review, 2024). Applied to the shop floor, this capability marks the leap from "what" happened to "how" it was done.

Vision-Language Models map video footage onto textual descriptions of the procedures. The system does not merely see that there is activity at a workstation for two minutes: it grasps its meaning, distinguishes the individual phases and recognises when a sequence deviates from the documented Standard Operating Procedures. It can detect, for example, that a quality control phase does not appear in the footage because the part is never oriented towards the inspection point. It is a detail useful for quality and documentation that no traditional sensor could capture.
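
Once the footage has been turned into a list of labelled phases, checking it against a Standard Operating Procedure becomes a plain data comparison. The sketch below assumes a hypothetical output format (ObservedPhase, with a label and timestamps) and hypothetical phase names. It does not represent the output of any specific model, only the kind of check that structured descriptions make possible.

```python
from dataclasses import dataclass


@dataclass
class ObservedPhase:
    """One phase as described by the model (hypothetical schema)."""
    label: str
    start_s: float
    end_s: float


def check_against_sop(sop, observed):
    """Compare observed phases with the documented SOP steps."""
    seen = [phase.label for phase in observed]
    missing = [step for step in sop if step not in seen]
    unexpected = [label for label in seen if label not in sop]
    # SOP positions of the recognised steps, in order of first appearance.
    positions = []
    for label in seen:
        if label in sop and sop.index(label) not in positions:
            positions.append(sop.index(label))
    return {
        "missing": missing,
        "unexpected": unexpected,
        "in_order": positions == sorted(positions),
    }


# Hypothetical SOP and one observed cycle where the inspection step is absent.
sop = ["pick part", "fasten screws", "orient part to inspection point", "place in tray"]
observed = [
    ObservedPhase("pick part", 0.0, 4.1),
    ObservedPhase("fasten screws", 4.1, 17.8),
    ObservedPhase("place in tray", 17.8, 21.0),
]
report = check_against_sop(sop, observed)
print(report)
```

In this example the report lists "orient part to inspection point" as missing, which mirrors the quality control case described above.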

From video to a reusable asset for documentation, optimisation and training

The value does not lie in "surveilling" the workforce in real time but in transforming the footage into structured data that remains and can be reused. Once the video becomes a readable phase-by-phase description, the same asset feeds three concrete directions.

Documentation. Standard Operating Procedures are generated and kept updated based on what actually happens on the line instead of remaining in a manual that no one rereads.
Optimisation. Comparing the real variability of cycle times brings out the bottlenecks of manual processes with the same precision with which CNC machines are monitored today.
Training. New hires learn from a structured recording of the correct movement, accelerating onboarding without taking time away from the experienced workforce.
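
The optimisation case can be made concrete with a minimal sketch, assuming phase durations have already been extracted from the structured descriptions. The phase names and durations are invented for illustration. The coefficient of variation (standard deviation divided by mean) is one reasonable way to rank phases by variability, not necessarily the metric a production system would use.

```python
from statistics import mean, stdev


def phase_stats(durations_by_phase):
    """Rank phases by coefficient of variation, most variable first."""
    stats = []
    for phase, durations in durations_by_phase.items():
        avg = mean(durations)
        # Sample standard deviation needs at least two observations.
        spread = stdev(durations) if len(durations) > 1 else 0.0
        stats.append({
            "phase": phase,
            "mean_s": avg,
            "stdev_s": spread,
            "cv": spread / avg if avg else 0.0,
        })
    return sorted(stats, key=lambda s: s["cv"], reverse=True)


# Hypothetical durations in seconds, one value per observed cycle.
durations_by_phase = {
    "pick part": [4.0, 4.2, 3.9, 4.1],
    "fasten screws": [12.0, 18.5, 11.0, 21.0],
    "visual check": [6.0, 6.3, 5.8, 6.1],
}
ranked = phase_stats(durations_by_phase)
for row in ranked:
    print(f"{row['phase']}: mean {row['mean_s']:.1f}s, cv {row['cv']:.2f}")
```

With these invented figures, "fasten screws" ranks first: it is both the longest phase and by far the least consistent, which makes it the natural candidate for a closer look.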

Where needed, this data can then flow into factory management systems like the MES that governs production execution and the ERP that plans resources and costs, closing the loop between what physically happens on the line and the systems that plan it. However, this remains an optional downstream phase: the primary value is already created by transforming a video into structured tacit knowledge.

Privacy by design: the focus is on the movement, not the person

A frequent concern when adopting video analysis on the shop floor relates to data confidentiality. Current systems respond by integrating anonymisation during the acquisition phase: faces and personal identifiers (so-called personally identifiable information, PII) are obscured before the video is processed. The movements of hands and tools remain readable, which is what is truly needed to document and validate the procedure. The result is consistent with the logic of the entire methodology. The objective is not to observe the workforce but to understand how the operation is performed, in compliance with privacy regulations and without constraining model performance.
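
The masking step itself is simple once the regions to hide are known. The sketch below assumes that a separate upstream detector (not shown) has already produced bounding boxes for faces or badges. It represents a frame as nested lists of grayscale values purely to stay self-contained, whereas a real pipeline would use an image library. mask_regions and the box format are hypothetical.

```python
def mask_regions(frame, boxes, fill=0):
    """Return a copy of frame with each (top, left, bottom, right) box filled.

    Boxes use half-open pixel ranges and are clipped to the frame bounds.
    The input frame is left untouched.
    """
    height = len(frame)
    width = len(frame[0]) if frame else 0
    masked = [row[:] for row in frame]
    for top, left, bottom, right in boxes:
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                masked[y][x] = fill
    return masked


# A 4x6 all-white toy frame and one assumed face box from an upstream detector.
frame = [[255] * 6 for _ in range(4)]
face_boxes = [(0, 1, 2, 3)]
anonymised = mask_regions(frame, face_boxes)
for row in anonymised:
    print(row)
```

Only the pixels inside the boxes are overwritten, so hands and tools elsewhere in the frame remain readable for the downstream analysis.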

From data to meaning

Moving from sensors to vision does not mean collecting more data but collecting data that finally has meaning. Transforming video into structured tacit knowledge (documentation that updates itself, more transparent processes, faster training) is the first phase. Once this new method of capturing what happens on the line is established, the subsequent question becomes more ambitious: how can AI understand not only the single action but the logical reasoning that links the phases of a complex industrial procedure together? This will be the subject of our next insight.

Sources

Human Activity Recognition: Review, Taxonomy and Open Challenges (Sensors / PMC, 2022). Limitations of sensor-based HAR systems: difficulty in generalising across the workforce and false positives that constrain their practical use.
An In-Depth Study of Vibration Sensors for Condition Monitoring (Sensors / MDPI, 2024). Effectiveness of sensors in monitoring machine health.
Multimodal: AI's new frontier (MIT Technology Review, 2024). Multimodal AI as the fusion of multiple senses into a coherent understanding of reality.
