
Why the Hormuz blockade threatens Italian manufacturing EBITDA
The disruption of flows in the Strait of Hormuz dictates a drastic revision of procurement strategies for Italian manufacturing.
Industrial sensors are now mature at monitoring machine health: vibration, temperature and telemetry make it possible to predict a failure before it causes downtime on the line (Vibration Sensors for Condition Monitoring, 2024). But when the analysis shifts from the asset to the workforce, the same infrastructure shows its limits. Inertial sensors and telemetry record that something has happened, not how it was done or whether it was done correctly, and they fail to capture the nuances of manual labour.
This opens an information gap: the distance between what sensors measure and what one actually needs to know about a manual operation. This is where multimodal AI comes into play. Vision-Language Models (VLMs) transform the video feed into structured descriptions, combining the semantics of language with the precision of computer vision.
Automatic Human Activity Recognition (HAR) typically relies on a network of accelerometers, gyroscopes and pressure sensors integrated into tools or worn by the workforce. The limitation is structural because these systems struggle to interpret the context of the action. Scientific reviews on the subject (HAR: Review, Taxonomy and Open Challenges, 2022) highlight two recurring problems:
A sensor detects a signal but cannot say whether that signal truly belongs to the correct activity.
Consider an assembly line with tens of thousands of manual cycles per month. At that volume, the noise generated by false alarms buries the useful information and renders the data unusable for process optimisation. Understanding a manual operation requires a reading of the environment that point sensors cannot offer. Without the support of vision, the system remains blind to decisive variables such as component position, material wear or the presence of bottlenecks at the workstation.
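The effect of volume on false alarms can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the function name and every figure in it (50,000 cycles, a 1% deviation rate, 90% sensitivity, a 5% false-alarm rate) are assumptions chosen for the example, not measurements from any plant or from the cited review.

```python
def expected_alerts(cycles, deviation_rate, sensitivity, false_alarm_rate):
    """Return (true_alerts, false_alerts, precision) as expected values.

    cycles            -- manual cycles observed in the period
    deviation_rate    -- share of cycles that really deviate from the procedure
    sensitivity       -- share of real deviations the sensors flag
    false_alarm_rate  -- share of normal cycles the sensors flag by mistake
    """
    real_deviations = cycles * deviation_rate
    normal_cycles = cycles - real_deviations
    true_alerts = real_deviations * sensitivity
    false_alerts = normal_cycles * false_alarm_rate
    total = true_alerts + false_alerts
    # Precision: the share of alerts that point at a genuine deviation.
    precision = true_alerts / total if total else 0.0
    return true_alerts, false_alerts, precision


# Hypothetical monthly figures, for illustration only.
true_alerts, false_alerts, precision = expected_alerts(
    cycles=50_000, deviation_rate=0.01, sensitivity=0.90, false_alarm_rate=0.05
)
print(f"{true_alerts:.0f} genuine alerts, {false_alerts:.0f} false alarms, "
      f"precision {precision:.1%}")
```

Under these assumed rates, false alarms outnumber genuine alerts by more than five to one, so only about 15% of the alerts are worth acting on. The point is not the specific numbers but the mechanism: when real deviations are rare, even a modest false-alarm rate dominates the alert stream.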

Bridging that gap requires models capable of processing multimodal input. For decades the Industrial IoT has focused on collecting quantitative data; the frontier today is the qualitative understanding of the action. The MIT Technology Review describes multimodal AI as the new frontier of artificial intelligence, capable of fusing multiple senses like sight and sound into a coherent picture of reality, exactly as the human brain does (Multimodal: AI's new frontier, MIT Technology Review, 2024). Applied to the shop floor, this capability marks the leap from "what" happened to "how" it was done.
Vision-Language Models map video footage onto textual descriptions of the procedures. The system does not merely see that there is activity at a workstation for two minutes: it grasps its meaning, distinguishes the individual phases and recognises when a sequence deviates from the documented Standard Operating Procedures. It can detect, for example, that a quality control phase does not appear in the footage because the part is never oriented towards the inspection point. It is a detail useful for quality and documentation that no traditional sensor could capture.
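Once a model has turned the footage into a list of named phases, checking it against the documented procedure becomes a plain sequence comparison. The following is a minimal sketch under stated assumptions: the phase labels, the `SopCheck` type and the `check_against_sop` function are hypothetical, and the observed list is written by hand here rather than produced by an actual VLM.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SopCheck:
    missing: list[str]        # SOP phases never seen in the footage
    out_of_order: list[str]   # phases seen after a later SOP phase

    @property
    def compliant(self) -> bool:
        return not self.missing and not self.out_of_order


def check_against_sop(observed: list[str], sop: list[str]) -> SopCheck:
    """Compare observed phase labels with the documented SOP order."""
    position = {phase: i for i, phase in enumerate(sop)}
    missing = [phase for phase in sop if phase not in observed]

    out_of_order = []
    highest_seen = -1
    for phase in observed:
        idx = position.get(phase)
        if idx is None:
            continue  # label not in the SOP; ignored in this sketch
        if idx < highest_seen:
            out_of_order.append(phase)
        else:
            highest_seen = idx
    return SopCheck(missing=missing, out_of_order=out_of_order)


# Hypothetical SOP and hand-written "observed" phases for one cycle.
sop = ["pick_part", "fasten_screws", "orient_to_inspection",
       "visual_check", "place_in_tray"]
observed = ["pick_part", "fasten_screws", "place_in_tray"]

result = check_against_sop(observed, sop)
print(result.missing, result.compliant)
```

In this example the part is never oriented towards the inspection point, so both the orientation phase and the visual check are reported as missing. A real deployment would also need to handle repeated phases and uncertain labels, which this sketch deliberately leaves out.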
The value does not lie in "surveilling" the workforce in real time but in transforming the footage into structured data that persists and can be reused. Once the video becomes a readable phase-by-phase description, the same asset feeds three concrete directions: documentation that updates itself, more transparent processes and faster training.
Where needed, this data can then flow into factory management systems such as the MES, which governs production execution, and the ERP, which plans resources and costs, closing the loop between what physically happens on the line and the systems that plan it. This remains an optional downstream phase, however: the primary value is already created by transforming a video into structured tacit knowledge.
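What "structured data" might look like when handed to a downstream system can be sketched as a simple JSON payload. Everything here is an assumption made for illustration: the field names, the `PhaseRecord` type and the `to_mes_payload` function do not correspond to any specific MES or ERP interface, which in practice would dictate its own schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PhaseRecord:
    phase: str      # phase label extracted from the video
    start_s: float  # start time within the cycle, in seconds
    end_s: float    # end time within the cycle, in seconds


def to_mes_payload(station_id, cycle_id, phases):
    """Serialise one observed cycle as JSON (hypothetical schema)."""
    phase_rows = [
        dict(asdict(p), duration_s=round(p.end_s - p.start_s, 3))
        for p in phases
    ]
    payload = {
        "station_id": station_id,
        "cycle_id": cycle_id,
        # Cycle time is taken here as the sum of the phase durations.
        "cycle_time_s": round(sum(row["duration_s"] for row in phase_rows), 3),
        "phases": phase_rows,
    }
    return json.dumps(payload, indent=2)


# Hypothetical cycle with two phases.
print(to_mes_payload(
    station_id="ST-07",
    cycle_id="2024-05-14-000123",
    phases=[
        PhaseRecord("fasten_screws", 0.0, 12.5),
        PhaseRecord("visual_check", 12.5, 42.5),
    ],
))
```

Keeping the record flat and phase-based means the same payload can serve documentation, training material and, where it is wanted, the planning systems.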

A frequent concern when adopting video analysis on the shop floor relates to data confidentiality. Current systems respond by integrating anonymisation during the acquisition phase: faces and personal identifiers (so-called personally identifiable information, PII) are obscured before the video is processed. The movements of hands and tools remain readable, which is what is truly needed to document and validate the procedure. The result is consistent with the logic of the entire methodology. The objective is not to observe the workforce but to understand how the operation is performed, in compliance with privacy regulations and without constraining model performance.
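The idea of obscuring identifiers before any analysis runs can be illustrated with a toy redaction step. This is a minimal sketch, not how any particular product implements anonymisation: the frame is a plain grid of grey values, the bounding boxes are assumed to come from a separate face detector that is not shown, and `redact_regions` is a hypothetical helper.

```python
def redact_regions(frame, boxes):
    """Return a copy of `frame` with every box blacked out.

    frame -- list of rows, each a list of grey values (0-255)
    boxes -- iterable of (x, y, width, height) regions to obscure,
             assumed to come from an upstream face detector
    """
    redacted = [row[:] for row in frame]  # leave the original frame untouched
    height = len(redacted)
    width = len(redacted[0]) if height else 0
    for x, y, w, h in boxes:
        # Clamp the box to the frame so partial overlaps are handled safely.
        for r in range(max(0, y), min(height, y + h)):
            for c in range(max(0, x), min(width, x + w)):
                redacted[r][c] = 0
    return redacted


# Hypothetical 4x4 frame with one detected face region.
frame = [[9] * 4 for _ in range(4)]
safe_frame = redact_regions(frame, boxes=[(1, 1, 2, 2)])
for row in safe_frame:
    print(row)
```

Only the redacted copy would be passed on to the model, so pixels inside the boxes never reach the analysis stage while hands and tools elsewhere in the frame remain visible.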
Moving from sensors to vision does not mean collecting more data but collecting data that finally has meaning. Transforming video into structured tacit knowledge (documentation that updates itself, more transparent processes, faster training) is the first phase. Once this new method of capturing what happens on the line is established, the subsequent question becomes more ambitious: how can AI understand not only the single action but the logical reasoning that links the phases of a complex industrial procedure together? This will be the subject of our next insight.

