MIT AI Model to Improve Image Quality for Autonomous Vehicles


Many people are unaware of machine learning, even though they may use or encounter it daily. By definition, machine learning is a branch of computer science and artificial intelligence. It uses algorithms and data to imitate the way humans learn, gradually improving its accuracy.

As the applications of machine learning (ML) expand and the technology continues to advance, ML is projected to transform the future of many fields, including healthcare, automation, natural language processing, entertainment, science, transportation, and cybersecurity.

However, the current process is long and repetitive. Humans have to collect data from various sources, filter it, and select only the most appropriate material to teach the machine, drawing on their domain expertise. The information they feed the machine includes numerical, text, time-series, and categorical data. Each piece of information must be labeled to ensure proper learning.
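To make the data types above concrete, here is a purely illustrative example of a single labeled training record (the field names are invented for this sketch, not taken from any real dataset):

```python
# Illustrative only: one labeled training example mixing the data types
# mentioned above (numerical, text, time series, categorical), plus the
# human-provided label that supervised learning depends on.
example = {
    "speed_kmh": 42.5,                       # numerical
    "driver_note": "clear road ahead",       # text
    "lidar_readings": [1.2, 1.1, 0.9, 0.8],  # time series
    "weather": "sunny",                      # categorical
    "label": "safe_to_proceed",              # label assigned by a human
}
print(example["label"])  # safe_to_proceed
```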

Speeding up the process

As the demand for AI and machine learning escalates, there should be a better method to process the data faster. This is what the MIT researchers are doing. They recently introduced EfficientViT, a computer vision model that MIT and MIT-IBM Watson AI Lab researchers developed to speed up real-time semantic segmentation of high-resolution images. The model is optimized for devices with limited hardware, such as those found in autonomous vehicles.

They cited the autonomous vehicle as an example because it needs to react quickly and accurately to the objects around it. Thus, it needs a powerful computer vision model that can immediately categorize each pixel in a high-resolution image of a scene, so the vehicle will not miss objects that a lower-quality image might hide. This task is called semantic segmentation, a complex process that requires a lot of computation when handling high-resolution images.
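The "categorize each pixel" idea can be sketched in a few lines. Assuming a model outputs per-pixel class scores (logits), the segmentation map is just the highest-scoring class at every pixel; the class names and toy sizes here are illustrative:

```python
import numpy as np

# Semantic segmentation assigns a class label to every pixel. Given
# per-pixel class scores of shape (height, width, num_classes), the
# predicted label map is the argmax over the class axis.
CLASSES = ["road", "car", "pedestrian"]  # illustrative label set

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 6, len(CLASSES)))  # toy 4x6 "image"

label_map = logits.argmax(axis=-1)  # shape (4, 6): one class id per pixel
print(label_map.shape)  # (4, 6)
```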

The EfficientViT, the more efficient computer vision model from MIT and the MIT-IBM Watson AI Lab, considerably reduces the computational complexity of the job. The model performs semantic segmentation accurately in real time.


Optimizing semantic segmentation for real-time processing

Many new state-of-the-art semantic segmentation models directly learn the interaction between each pair of pixels in an image, which makes them more accurate. However, their computational cost grows rapidly as the image resolution increases, so they are too slow to process high-resolution images in real time on devices such as smartphones or sensors.
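A toy calculation shows why pairwise pixel interactions become so expensive at high resolution: if every pixel is compared with every other pixel, the number of comparisons grows with the square of the pixel count. This is a back-of-the-envelope sketch, not a measurement of any specific model:

```python
# Treating each pixel as a token, pairwise attention compares every pixel
# with every other pixel, so the attention map has N * N entries for
# N = height * width pixels.
def attention_entries(height, width):
    n = height * width
    return n * n

print(attention_entries(256, 256))  # 4_294_967_296 entries
# Doubling the resolution quadruples N, so the pairwise cost grows 16x:
print(attention_entries(512, 512) // attention_entries(256, 256))  # 16
```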

The MIT researchers created a new building block specifically for semantic segmentation models. Their building block has similar abilities to the state-of-the-art models, but with only linear computational complexity and hardware-efficient operations.

As such, their new model for high-resolution computer vision can work up to nine times faster than previous models when applied on a mobile device. Further, the new model series shows the same or better accuracy than the other models.

A better solution

The team sees this technique being applied to autonomous vehicles to help them make decisions in real time. They also believe it will improve the efficiency of other high-resolution computer vision tasks, such as medical image segmentation.

The senior author of the MIT paper, associate professor Song Han, says they want people to see the model's efficiency: because it reduces computation, real-time image segmentation can take place directly on a device.

Simpler procedure

In developing EfficientViT, the researchers employed a simpler method to build the attention map, using a linear similarity function instead of a nonlinear one. This allowed them to rearrange the order of operations to minimize the total number of calculations without changing the functionality of the system or losing the global receptive field. With their method, the amount of computation needed for a prediction grows only linearly as the image's resolution becomes higher.
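The reordering trick can be sketched with plain matrix algebra. This is an assumption-level illustration of linear attention in general, not EfficientViT's exact kernel: with a linear similarity, the attention output Q(KᵀV) can be computed by multiplying K and V first, producing only a small d×d intermediate instead of an N×N similarity matrix:

```python
import numpy as np

# With a linear similarity, attention can be computed as Q @ (K.T @ V),
# costing O(N * d^2), instead of (Q @ K.T) @ V, which builds an N x N
# matrix and costs O(N^2 * d). Matrix multiplication is associative, so
# both orderings give the same result.
rng = np.random.default_rng(0)
N, d = 1024, 32                      # N tokens (pixels), feature dim d
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

quadratic = (Q @ K.T) @ V            # N x N intermediate
linear = Q @ (K.T @ V)               # only a d x d intermediate

assert np.allclose(quadratic, linear)  # same output, far less computation
```

The nonlinear softmax in standard attention prevents this reordering, which is why swapping in a linear similarity is what unlocks the linear cost.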

However, linear attention captures only the global context of the image; it loses local information, which lowers its accuracy, according to Prof. Han. Therefore, to compensate for the loss of accuracy, the team included two additional components: one that helps the model capture local feature interactions and a module that enables multiscale learning. Together, these components help the model recognize both small and large objects.
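The two ideas can be sketched on a toy 2-D feature map. The function names and kernel sizes below are hypothetical, chosen only to illustrate the concepts, not EfficientViT's actual modules: a small neighborhood filter captures local interactions, while pooling at several scales captures objects of different sizes:

```python
import numpy as np

def local_branch(feat):
    """3x3 box filter: each pixel mixes with its immediate neighbors,
    restoring the local information that pure linear attention misses."""
    padded = np.pad(feat, 1, mode="edge")
    h, w = feat.shape
    out = np.zeros_like(feat)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def multiscale_branch(feat, scales=(1, 2, 4)):
    """Average-pool at several scales, upsample back, and average, so
    both small and large structures contribute to the features."""
    h, w = feat.shape
    out = np.zeros_like(feat)
    for s in scales:
        pooled = feat.reshape(h // s, s, w // s, s).mean(axis=(1, 3))
        out += np.repeat(np.repeat(pooled, s, axis=0), s, axis=1)
    return out / len(scales)

feat = np.arange(64, dtype=float).reshape(8, 8)  # toy feature map
combined = local_branch(feat) + multiscale_branch(feat)
print(combined.shape)  # (8, 8)
```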

The team envisions various applications of their ML model and is working to make it run on both cloud and mobile devices.