Pallet Localization Techniques of Forklift Robot: A Review of Recent Progress
Author(s): Yongyao Li 1,2, Xiaohe Chen 1,2, Guanyu Ding 3, Chao Li 3, Sen Wang 1,2, Qinglei Zhao 4 and Qi Song 2,3
1 School of Electronic and Information Engineering, Changchun University of Science and Technology, China
2 Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, China
3 Pilot AI Company, China
4 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, China
J Robot Mech Eng
Article Type: Review Article
Pallets are used intensively in warehouses and retail stores, and the automation of pallet localization and detection is highly desired for forklift robots and pallet-picking instruments. Because pallet types vary widely in practice, it is extremely difficult to develop a single solution that detects all of them. This article presents a general review of pallet identification and localization techniques for industrial forklift robots and pallet-picking instruments. Several modern computer-vision techniques are reviewed and compared. In particular, Deep Neural Network (DNN) methods are usually applied to detect and locate the pallet in RGB images. The Point Cloud method labels a region of interest (RoI) in 2D range data, from which the pallet's features are extracted; this method can provide precise localization of the pallets. Here, the pallet identification and localization algorithm (PILA) strategy is introduced. This approach delivers a highly precise orientation angle and centric location of the pallet without any artificial assistance, utilizing RGB images and Point Cloud data to balance localization precision and running time on low-cost hardware. The experimental results show that the pallet can be located with a 3D localization accuracy of 1 cm and an angle resolution of 0.4 degrees at a distance of 3 m, with a running time of less than 700 ms. PILA is thus a promising solution for autonomous pallet-picking instruments and self-driving forklift applications.
Keywords: Pallet Recognition; Pallet Localization; Deep Neural Network; RGBD camera
In recent decades, unmanned industrial automation techniques have attracted considerable attention, especially in logistics applications. As the COVID-19 pandemic continues to spread all over the world, pallet picking by unmanned forklift robots and AGVs has become even more desirable [1]. The major challenges of pallet detection and localization are:
1. Obtaining the x, y and z values of the pallet center and the orientation angle.
2. Meeting the "real-time" operation requirement to guarantee pallet picking.
3. Handling pallet types and sizes that vary dramatically in practice, while typical model-based pallet localization methods are not capable of handling all the cases properly.
In this paper, Section 2 introduces visual detection techniques for pallets and some neural network models such as R-CNN, Fast R-CNN, SSD and YOLO; the pallet training dataset is also described in detail. Section 3 explains the Point Cloud approach with some examples. Section 4 presents the pipeline strategy of the PILA model, and experimental results show that PILA outperforms the other two approaches in several aspects (Figure 1).
The pallet detection and localization problem has been investigated since the 1980s. In the very early stage, infrared sensors or RFID were used to provide the distance to the pallet, and only point-to-point measurement could be implemented on a forklift [2]. As vision techniques developed along with embedded hardware, object detection algorithms were used to locate artificial features for more precise pallet identification and positioning. However, this method is difficult to implement in the warehouse, since it requires significant modification and increases the cost dramatically [3, 4]. A 2D laser rangefinder or 3D depth camera is another approach to locating pallets; however, it can be challenging to capture enough features from 2D depth information [5-8]. Alternatively, plane segmentation on 3D Point Cloud data can deliver more precise results with the template matching method [9]. Unfortunately, this method is limited by its detection speed, and the recognition accuracy is seriously affected by the pallet type and the Point Cloud data quality, which may impose strict requirements on the depth-imaging hardware and the computing unit. To the best of our knowledge, all existing pallet recognition and localization methods that use a single data source, such as RGB images or Point Cloud data alone, either lead to a high probability of false positioning or consume a great deal of computing power and raise the cost dramatically. In the last part, we introduce a third approach, based on both 2D image object detection and 3D Point Cloud processing, that can deliver precise location data. This pipeline strategy, which we call the pallet identification and localization algorithm (PILA), uses low-cost hardware and requires only minor computing resources. In this pipeline, deep neural network (DNN) methods [10, 11] are used to detect the pallet in RGB images, and the Point Cloud data are aligned to the region of interest (RoI) in the RGB image. The pallet's location and angle are then derived from geometric features extracted from the Point Cloud data. To sum up, the DNN method is designed to recognize pallets at a high rate, while the Point Cloud data deliver precise localization results with less computing time and fewer resources. The results show excellent performance on pallet recognition, with a 3D localization accuracy within 1 cm and a pose estimation error below 0.4 degrees.
Figure 1: The front surface view of a normal pallet used in warehouse.
As shown in Figure 1, pallets used in warehouses normally include ISO pallets, European pallets and North American pallets, with sizes ranging from 800×600 to 2240×2740 mm. Plastic and wood pallets are the most common in practice. As Figure 2 shows, more than five types of pallets are collected in the dataset under different scenarios and conditions, including cases on the rack, on the ground, with a card box, or at small angles. Furthermore, pallet images under different lighting conditions, floor conditions and partial occlusion are also included to match the real warehouse environment. Generally, there are two-way-pocket and four-way-pocket pallets, which allow the forklift to pick from two or four directions; both types are collected in the dataset to make the model more generic. Pallet assembly information is shown in Table 1. The pallet dataset contains more than 1000 pictures.
Figure 2: Various types of pallets used in the training dataset, including pallets on the ground, pallets with card boxes, pallets on racks, and tilted pallets.
Table 1: Information about some pallet types in the dataset.

Pallet material   Color    Dimensions (W×L×H) mm
Wood              Wooden   700×1400×130
Plastic           White    914×1200×150
Plastic           Blue     1000×1200×150
As one of the most popular topics in computer vision, object detection has attracted intense research activity for decades. Traditional model-based target detection techniques require manually designing a strategy to segment the pallet geometry and recognize each block. This process involves considerable human effort, such as picking feature descriptors like Haar-like features and using the AdaBoost algorithm to cascade multiple classifiers [12]. In contrast, hierarchical feature generation by deep-learning-based target detectors is an automatic process, which shows great potential in recognition and classification compared with other methods.

There are two major object detection architectures in general. One is the one-stage detector, such as the Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO). The other is the two-stage detector, such as the Region Proposal Network (RPN) method, R-CNN and Faster R-CNN [13]. A one-stage detector contains a single feed-forward fully convolutional network that directly provides the object classification together with the detected area. A two-stage detector normally consists of separate region-proposal and classification stages: reference anchors are used to locate proposed regions of interest (RoI) for multiple object candidates, and in the second stage the content of each RoI is categorized and its location fine-tuned. The two-stage architecture can deliver more precise results but needs longer running time [14].

SSD receives a whole image as input and passes it through multiple convolution layers, and the convolutional feature maps are utilized to predict the bounding boxes. The model generates a vector of object-class probabilities for the predicted bounding boxes. The architecture used in this case is shown in Figure 3; its backbone is a VGG-16 model pre-trained on ImageNet for image classification. A feed-forward convolutional network generates a fixed-size set of bounding boxes, and scores are given for the object-class instances present in these boxes. Instead of predicting a single score for the potential object, the SSD model directly gives the likelihood of each class being present in the bounding box.

Faster R-CNN is a two-stage architecture that utilizes a multi-task learning process to address the detection issue by combining classification and bounding-box regression. The system comprises two stages, a region proposal network and a Fast R-CNN header network, which employ a convolutional backbone to extract high-level features from input pictures. Faster R-CNN replaces the Selective Search method [15] of the original algorithm with the RPN. In the first stage, the RPN employs a sliding window over the feature maps generated by the backbone network to produce proposals. Multi-scale anchors on the feature map predict multiple candidate boxes; the anchors are defined with various scales and aspect ratios to identify arbitrary objects. A judge function decides whether each anchor is foreground or background and then adjusts it using boundary regression to obtain a precise region proposal. The top-ranked object candidates are then cropped using an RoI pooling layer fed from the feature extractor's intermediate layer, which copes with the issue of feature maps of varied sizes being input to the fully connected layers of the network. In the second stage, each proposal undergoes a final classification and box-refinement procedure [16, 17].
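As a brief illustration (not the authors' implementation), the following sketch shows how a pre-trained SSD300 detector with a VGG-16 backbone can be run with torchvision. The weights, the score threshold of 0.5 and the image file name are illustrative assumptions; in practice the network would first be fine-tuned on the pallet dataset described above.

import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# SSD300 with a VGG-16 backbone, as described in the text.
model = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT")
model.eval()

# Load an image as a float tensor in [0, 1] (CHW layout); file name is hypothetical.
image = convert_image_dtype(read_image("pallet.jpg"), torch.float)

with torch.no_grad():
    prediction = model([image])[0]  # dict with "boxes", "labels", "scores"

# Keep detections above an assumed confidence threshold of 0.5.
keep = prediction["scores"] > 0.5
pallet_boxes = prediction["boxes"][keep]  # one (x1, y1, x2, y2) box per detection

A Faster R-CNN model can be substituted by swapping the constructor (e.g., torchvision.models.detection.fasterrcnn_resnet50_fpn) without changing the rest of the inference code.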
Broad results have shown that Faster R-CNN and SSD can deliver better detection accuracy than YOLO; however, YOLO is faster than both SSD and Faster R-CNN.
Neural Network Architecture for Pallet Recognition
As a typical single-stage detector, YOLO treats the whole object detection problem as a regression problem. The input image is divided into a set of grid cells, and each grid cell predicts a fixed number of bounding boxes with a confidence score, calculated by multiplying the object detection probability by the intersection over union (IoU), where IoU is the overlap ratio between the predicted bounding box and the ground-truth bounding box; the class probability of a bounding box is finally derived from the IoU score. As shown in Equation (1), if the IoU score is greater than 0.5, the matching value m is 1, which denotes a positive match; otherwise m is 0, a negative match, meaning the object is not detected:

m = 1 if IoU > 0.5, otherwise m = 0    (1)
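A minimal sketch of the IoU computation and the matching rule of Equation (1) is given below (an illustration, not code from any of the reviewed systems); boxes are assumed to be axis-aligned (x1, y1, x2, y2) tuples.

def iou(box_a, box_b):
    # Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match(pred_box, gt_box, threshold=0.5):
    # Equation (1): m = 1 (positive match) if IoU > threshold, else 0.
    return 1 if iou(pred_box, gt_box) > threshold else 0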
Figure 3: The diagram of SSD architecture.
The recognition rates for the three types of pallets are shown in Table 2. The average rate is above 98 percent, which is sufficiently robust for warehouse operation. The pallet detection results with labeled RoIs are presented in Figure 4: multiple pallets in the scene, as well as tilted pallets, are identified well regardless of the presence of a card box.
Table 2: Pallet Recognition Results of SSD Model.

Pallet material   Color    Recognition rate (%)
Wood              Wooden   98
Plastic           White    99
Plastic           Blue     98.5
Figure 4: Pallet images: (a) a scene with multiple pallets in the field of view during detection; (b) a tilted wooden pallet; (c) a tilted plastic pallet.
Usually, 2D laser rangefinders (LRF) are mostly used for mobile robot SLAM. With their extensive use in unmanned robot navigation, several methods have emerged to detect and localize pallets based on LRF devices. In contrast to visual solutions, this approach does not suffer from imaging distortion, illumination conditions or scaling issues, which can lead to false detections or feature misdetection. In early work, laser scan data was used for scene segmentation, object detection and object recognition, and a method was presented for detecting and classifying 3D objects based on 3D Point Cloud data [18]. However, a 3D solution imposes more stringent conditions on hardware and algorithms and increases the cost dramatically. In order to utilize well-developed object detection techniques and also obtain fast processing, the 2D range data can be converted to a 2D image so that DNN techniques can be employed [19, 20]. The pallet detection pipeline with 2D range data [21] is depicted in Figure 5. It consists of three phases: data preparation, training and testing, and pallet tracking. The data preparation phase converts the 2D laser scanner data into 2D images. The training and testing phase then takes these 2D images as input. Once the model is fine-tuned and verified, the tracking phase is executed to detect and keep tracking all possible pallets in the scene. The 2D laser scanner used to acquire the range data is shown in Figure 6 (a), and the RoI of pallet tracking with the range data is shown in Figure 6 (b). The range data is converted to a bitmap after acquisition and fed to the trained model; if the detection score is greater than a certain threshold, the region is identified as a pallet. Algorithm 1 describes the on-line image creation: 2D range data is read, the X and Y range data is converted to a 2D image for image processing, and the 2D image dataset is collected and trained to identify possible pallet patterns in the 2D range data.

Algorithm 1: On-line Image Creation
1: function Read frame
2:   Subscribe to ROS Laser topic
3:   Receive range data as a ROS message
4:   Convert laser scan ranges to Cartesian coordinates
5:   Convert the X and Y point cloud into a 2D image
6:   if Training Phase
7:     Define RoIs in the image
8:     Generate artificial data by rotating the image by 90° and -90°
9:   else
10:    Break
11:  end if
12: end function
Figure 5: The pallet detection pipeline with 2D range data.
Figure 6: (a) S3000 industrial 2D Laser scanner and (b) 2D range data contains pallet pattern.
Algorithm 1: On-line Image Creation
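The following is a minimal sketch, under ROS 1, of the on-line image creation in Algorithm 1 (not the paper's implementation); the topic name, image size and map resolution are illustrative assumptions.

import numpy as np
import rospy
from sensor_msgs.msg import LaserScan

IMG_SIZE = 500      # image side length in pixels (assumption)
RESOLUTION = 0.02   # metres per pixel (assumption)

def scan_to_image(scan):
    # Angles corresponding to each range reading (step 3 of Algorithm 1).
    angles = scan.angle_min + np.arange(len(scan.ranges)) * scan.angle_increment
    ranges = np.asarray(scan.ranges)
    valid = np.isfinite(ranges) & (ranges > scan.range_min) & (ranges < scan.range_max)
    # Polar to Cartesian coordinates (step 4).
    x = ranges[valid] * np.cos(angles[valid])
    y = ranges[valid] * np.sin(angles[valid])
    # Rasterize the X/Y points into a binary bitmap centred on the sensor (step 5).
    img = np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.uint8)
    u = (x / RESOLUTION + IMG_SIZE / 2).astype(int)
    v = (y / RESOLUTION + IMG_SIZE / 2).astype(int)
    inside = (u >= 0) & (u < IMG_SIZE) & (v >= 0) & (v < IMG_SIZE)
    img[v[inside], u[inside]] = 255
    return img

def callback(scan):
    image = scan_to_image(scan)  # this bitmap is fed to the trained detector

if __name__ == "__main__":
    rospy.init_node("laser_to_image")
    rospy.Subscriber("/scan", LaserScan, callback)  # topic name is an assumption
    rospy.spin()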
PILA Description
The pipeline strategy of PILA is introduced in this part, and Figure 7 shows its flowchart. A deep neural network is utilized to recognize possible pallets in the RGB images of the available scenes; the model is generated by off-line training, and the transferred model is used for on-line detection from the camera. The algorithm is divided into three functional stages. In the first stage, the pallet is detected and a detection confidence score is given. In the second stage, the RGB and depth images are used to align the pallet RoI in the RGB image with the depth image. In the third stage, the Point Cloud data is used to extract the pallet's front-face plane, and line segments are extracted to locate the "T-shape" at the pallet center. In particular, the horizontal (x) and vertical (y) line segments at the pallet's edge are detected according to the pallet shape, which may vary across pallet types; the decision rule used here is designed to find the "T-section" of the pallet center as a more universal and tolerant solution. Finally, the x, y and z values of the centric location and the orientation angle of the pallet face are obtained.
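As an illustration of the third stage, the sketch below takes the Point Cloud points aligned to the detected RoI and fits the pallet front face by RANSAC plane segmentation with Open3D (an assumed dependency, not the released PILA code). For brevity, the "T-shape" line-segment rule is simplified to the centroid of the plane inliers, and all thresholds are illustrative.

import numpy as np
import open3d as o3d

def pallet_pose(points_in_roi):
    # points_in_roi: (N, 3) array of camera-frame points inside the pallet RoI.
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points_in_roi)
    # RANSAC plane segmentation for the pallet front face; 1 cm inlier threshold.
    plane, inliers = pcd.segment_plane(distance_threshold=0.01,
                                       ransac_n=3,
                                       num_iterations=500)
    a, b, c, d = plane
    face = np.asarray(pcd.points)[inliers]
    # Simplification: take the inlier centroid as the pallet center (x, y, z);
    # PILA instead locates the "T-section" from horizontal/vertical line segments.
    center = face.mean(axis=0)
    # Yaw of the face normal relative to the camera's optical (z) axis, in degrees.
    yaw = np.degrees(np.arctan2(a, c))
    return center, yaw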