Real-Time Video Stitching Method Based on Dense Viewpoint Interpolation


Objective  A video stitching method based on dense viewpoint interpolation is proposed to eliminate the artifacts and defects caused by parallax when stitching in wide-baseline scenes. Video stitching provides access to a broader field of view and plays a vital role in security surveillance, intelligent driving, virtual reality, and video conferencing. One of the biggest challenges of the stitching task is parallax. When the cameras' optical centers coincide perfectly, the inputs are unaffected by parallax and a seamless panorama is easy to synthesize. In practice, however, complete coincidence of the optical centers is difficult to achieve, and in some scenes, such as vehicle-mounted panoramic systems and wide-area security surveillance systems, the cameras are deliberately placed far apart. It is therefore important to study stitching in wide-baseline scenes. A standard approach aligns the inputs with a single global homography, but a homography has no parallax-handling capability, which results in obvious flaws in wide-baseline, large-parallax scenes. To address this problem, many researchers have proposed solutions based on multiple homographies or mesh optimization; however, mesh deformation can introduce significant shape distortion. Some deep learning methods combine optical flow, semantic alignment, image fusion, and image reconstruction to tackle stitching, but they do not fully exploit camera parameters, so their results still show defects at times. We therefore aim to make full use of the camera parameters and to synthesize a smooth interpolated view by supplementing intermediate viewpoints between the cameras, achieving better visual quality.

Methods  The present study proposes a real-time video stitching method based on dense viewpoint interpolation.
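To make the limitation of global-homography alignment concrete: a homography applies one uniform mapping to every pixel, while the shift a scene point actually needs between two separated cameras grows with inverse depth. A minimal sketch (illustrative values and function names, not from the paper):

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of pixel coordinates."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def disparity(f, b, Z):
    """Shift (in px) that a point at depth Z needs between two parallel
    cameras with focal length f (px) and baseline b (same unit as Z)."""
    return f * b / Z

# One homography gives every pixel the same mapping, but the required
# shift depends on depth, so near and far points cannot both be aligned.
print(disparity(800.0, 0.5, 2.0))    # near point: 200.0 px
print(disparity(800.0, 0.5, 10.0))   # far point:   40.0 px
```

Only when the baseline b is zero (coincident optical centers) does the required shift become depth-independent, which is why narrow-baseline stitching is easy and wide-baseline stitching is not.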
The method focuses on the overlapping regions and synthesizes a smooth interpolated view by supplementing dense intermediate viewpoints on the baseline between the cameras, which aligns the multiple inputs more accurately. First, binocular camera calibration is performed to obtain the intrinsic parameters and the transformation matrix between the cameras. The original views are undistorted and adjusted to the same horizontal plane for stitching in the horizontal direction. The maximum possible overlapping regions are extracted and, through stereo rectification, made coplanar and row-aligned, so that the image data can be processed along a single dimension. Next, pixel-level displacement fields, which give sampling locations in the original views, are predicted for the overlapping regions using a cost volume as in stereo matching. Since no ground truth exists for the interpolated view, the network is guided to learn view generation rules through the spatial transformation relationships between viewpoints. With the predicted displacement fields, two images are sampled from the input views and fused with linear weights to generate the interpolated view of the overlapping regions. Finally, the interpolated view is combined with the non-overlapping regions of the two views, and cylindrical projection is applied to align the fusion boundaries of the three regions and produce the final stitching result.

Results and Discussions  In this paper, the stitching results of the proposed method are compared with mainstream stitching methods. Multiband blending may show artifacts under parallax, while methods based on multiple homographies and mesh optimization may introduce significant shape distortion in the non-overlapping regions after mesh deformation. The proposed method eliminates artifacts and smoothly aligns the inputs with little shape distortion, yielding better visual quality (Fig. 9 and Fig. 10).
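The core synthesis step, sampling both rectified inputs through per-pixel horizontal displacement fields and fusing them with linear weights, can be sketched as follows. This is a simplified 1-D backward-warping illustration on single-channel images; the function and variable names are ours, not the paper's:

```python
import numpy as np

def warp_rows(img, disp):
    """Backward-warp each row of a rectified (H, W) image by a per-pixel
    horizontal displacement field, with linear interpolation."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for y in range(h):
        x = np.arange(w) + disp[y]              # sampling locations in the source row
        x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
        a = np.clip(x - x0, 0.0, 1.0)           # sub-pixel interpolation weight
        out[y] = (1 - a) * img[y, x0] + a * img[y, x0 + 1]
    return out

def interpolated_view(left, right, disp_l, disp_r, t=0.5):
    """Sample both rectified inputs via their predicted displacement fields
    and fuse the samples with linear weights to synthesize the view at
    normalized baseline position t (0 = left camera, 1 = right camera)."""
    return (1 - t) * warp_rows(left, disp_l) + t * warp_rows(right, disp_r)
```

Because stereo rectification row-aligns the inputs, a single horizontal displacement per pixel suffices, which is what lets the method stay lightweight.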
Furthermore, we evaluate the alignment quality of the overlapping regions. Traditional methods handle stitching purely from the perspective of image features, and their alignment quality is relatively low under large parallax variation. The proposed method incorporates camera calibration information for preprocessing and deals explicitly with parallax, obtaining better alignment quality (Table 1). Regarding model size and speed, the proposed method has an advantage because the images are initially aligned after camera calibration and the cost volume is built with a lightweight construction. The processing frame rate for 720p video exceeds 30 frame/s, meeting the demand for online video stitching (Table 2). In the analysis of baseline width variation, the proposed method aligns well under different baseline widths (Fig. 12), and the quantitative indicators improve markedly in every case (Table 3), showing robustness to baseline variation. In conclusion, the proposed method improves visual quality after stitching, eliminates artifacts, and smoothly aligns the inputs. It offers high alignment quality, little shape distortion, and great application value thanks to its lightweight design and fast processing speed.

Conclusions  Applying the proposed video stitching method based on dense viewpoint interpolation can effectively handle stitching in wide-baseline, large-parallax scenes. An interpolated view with a smooth transition is synthesized for the overlapping regions by supplementing dense intermediate viewpoints on the baseline of the left and right cameras. A network for generating the interpolated view is proposed, divided into feature extraction, correlation calculation, and high-resolution optimization modules, which predicts the sampling locations in the original views.
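The lightweight cost-volume construction mentioned above exploits rectification: since the inputs are row-aligned, correlation only needs to be computed over horizontal shifts. A sketch of such a 1-D correlation volume over feature maps (shapes and names are illustrative, not the paper's exact design):

```python
import numpy as np

def cost_volume_1d(feat_l, feat_r, max_disp):
    """Correlation cost volume over horizontal shifts only.

    feat_l, feat_r: (C, H, W) feature maps of the rectified pair.
    Returns (max_disp + 1, H, W), where slice d correlates the left
    features with the right features shifted d pixels to the right.
    Restricting the search to one dimension keeps the volume small
    compared with a full 2-D displacement search.
    """
    c, h, w = feat_l.shape
    vol = np.zeros((max_disp + 1, h, w))
    for d in range(max_disp + 1):
        if d == 0:
            vol[d] = (feat_l * feat_r).mean(axis=0)
        else:
            vol[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :-d]).mean(axis=0)
    return vol
```

A network head can then turn the best-matching shift per pixel into the displacement fields used for sampling the original views.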
The generated interpolated view is combined with the non-overlapping regions to obtain the stitching result. Moreover, without ground truth for the interpolated view, the proposed method computes the three-dimensional information at the original viewpoints in a virtual environment, locates the spatial region corresponding to the interpolated viewpoint by binary search, and transforms the interpolated view back to the original viewpoints under the constructed loss function, which guides the network to learn the view generation rules. Extensive experiments demonstrate that the proposed method improves the visual quality of video frames after stitching. It adapts to different baseline widths, generalizes well, and achieves real-time performance, meeting the online stitching requirements of practical applications.
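The self-supervised signal described above, reprojecting the synthesized view back to the original viewpoints and comparing it with the captured images, can be sketched as a photometric loss. The warp below is a crude 1-D shift by the scaled displacement fields, an illustrative stand-in for the paper's reprojection through the recovered 3-D geometry; all names and the sign conventions are our assumptions:

```python
import numpy as np

def photometric_loss(pred, target):
    """Mean absolute photometric error between a reprojected view and a
    captured original view."""
    return np.abs(pred - target).mean()

def self_supervised_loss(interp, left, right, disp_l, disp_r, t=0.5):
    """Without ground truth for the interpolated view at position t, warp
    it toward each original viewpoint and penalize disagreement with the
    captured left/right images."""
    def shift(img, disp):
        # Nearest-neighbor 1-D backward warp along rectified rows.
        h, w = img.shape
        out = np.empty_like(img, dtype=float)
        for y in range(h):
            x = np.clip(np.round(np.arange(w) + disp[y]).astype(int), 0, w - 1)
            out[y] = img[y, x]
        return out
    to_left = shift(interp, -t * disp_l)         # reproject toward left camera
    to_right = shift(interp, (1 - t) * disp_r)   # reproject toward right camera
    return photometric_loss(to_left, left) + photometric_loss(to_right, right)
```

The key property is that both terms are computed against real captured frames, so the network can be trained without ever seeing a ground-truth intermediate view.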
Keywords: machine vision, video stitching, wide baseline, deep learning, view interpolation