Visual perception is essential for intelligent systems to understand and interact with the real world [1], and it is widely applied in fields such as autonomous driving [2], remote sensing [3], and security monitoring [4]. Image sensors [5] are the core of visual systems, and their dynamic range determines how well a system can perceive scenes with complex lighting. The luminance range captured by conventional image sensors is substantially narrower than that of natural scenes [6]. This limitation makes it difficult for sensors to maintain sufficient sensing precision in high-dynamic-range (HDR) scenes [7], which seriously compromises image discriminability. For example, in scenes where highlights and shadows coexist, sensors overexpose the bright regions and underexpose the dark ones, which leads to information loss. Such loss can cause serious errors in downstream image recognition algorithms, threatening the safety and reliability of intelligent visual systems in real-world tasks. To reduce the image quality degradation caused by limited sensor dynamic range, conventional methods fuse multi-exposure images [8] to reconstruct the scene. This approach is prone to artifacts and imposes significant burdens on data storage, transmission, and computation [9]. High-bit-depth sensors can extend the single-frame dynamic range to a certain degree, but their massive data redundancy and high energy consumption hinder their application in real-time and edge visual systems [10]. A visual scheme that overcomes these sensing limitations and delivers high-fidelity imaging in complex lighting environments is therefore needed.
As the boundary between sensing and computing blurs, visual systems are shifting from passive image acquisition toward intelligent information processing, and in-sensor computing (ISC) is becoming a research focus [11-14]. ISC architectures, which typically integrate image acquisition and computation at the physical layer, shift part of the information processing to the sensing end to reduce backend latency and energy consumption. Recent studies have demonstrated the merits of ISC in contact-based modalities, particularly for tactile intent recognition and bio-inspired mechanosensory perception [15-17]. Beyond these contact-dependent approaches, non-contact ISC requires direct modulation of optical signals so that spatial information can be processed from a distance. Various devices have been employed for non-contact ISC architectures, including two-dimensional (2D) material photodetectors [18], optoelectronic memristors [19], integrated photonic chips [20], and metasurfaces [21-25]. Among them, metasurfaces, which are planar arrays of sub-wavelength structures, stand out as a particularly promising platform. By precisely modulating the amplitude, phase, and polarization of light, metasurfaces can process information directly in the optical domain, which enables large-scale parallel processing with ultra-low latency. Moreover, optical neural networks (ONNs) [26,27] based on metasurface devices, which combine light-field modulation with intelligent algorithms, show great potential for efficient front-end feature extraction and image recognition. However, metasurface-based ONNs still rely on multi-exposure preprocessing to acquire complete scene information in HDR scenes. A strategy that supports direct in-sensor processing of HDR information is therefore required for AI-based visual recognition.
In this work, a graphene metasurface is proposed for an ISC architecture. By combining polarization sensitivity with voltage tunability, the architecture supports single-exposure HDR image recognition. The polarization sensitivity originates from the asymmetry of the double split-ring resonator (DSRR) structure, which allows parallel luminance channels to provide complementary scene information. The voltage tunability arises from dynamic modulation of the graphene Fermi level, which is employed to emulate synaptic weights. Because intensity-spanning visual features extracted from distinct luminance channels are integrated within a single exposure, multi-exposure image fusion can be avoided, reducing data redundancy and latency. On the NWPU Remote Sensing Image Scene Classification 45 (NWPU-RESISC45) classification task, an accuracy of 95.14% is achieved. This method contributes to efficient recognition in complex real-world scenes and delivers a compact and energy-efficient solution for intelligent visual perception.
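The idea of complementary luminance channels can be illustrated with a minimal numerical sketch. It is a toy model, not the proposed device: the function names, the two channel gains (standing in for polarization-dependent attenuation), the 8-bit saturating sensor, and the equal weights (standing in for voltage-tuned synaptic weights) are all assumptions chosen for illustration.

```python
import numpy as np

def sense(radiance, gain, full_well=1.0, bits=8):
    """Toy sensor (assumed model): scale by a channel gain, clip at saturation, quantize."""
    levels = 2 ** bits - 1
    signal = np.clip(radiance * gain, 0.0, full_well)
    return np.round(signal / full_well * levels) / levels

def two_channel_features(radiance, gains=(1.0, 0.01), weights=(0.5, 0.5)):
    """Weighted sum of two luminance channels captured in the same exposure.

    gains   -- hypothetical per-channel attenuation (bright and dark channel)
    weights -- hypothetical stand-ins for tunable synaptic weights
    """
    channels = [sense(radiance, g) for g in gains]
    fused = sum(w * c for w, c in zip(weights, channels))
    return fused, channels

# Relative radiance of an HDR scene: two shadow pixels and two highlight pixels.
scene = np.array([0.02, 0.05, 20.0, 50.0])
fused, (bright_ch, dark_ch) = two_channel_features(scene)

print("bright channel:", bright_ch)  # highlights saturate to the same value
print("dark channel:  ", dark_ch)    # shadows quantize to the same value
print("fused features:", fused)      # all four pixels remain distinguishable
```

In this toy setting each channel alone loses either the highlights or the shadows, while the weighted combination keeps all four pixels distinct without a second exposure.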