End-to-End Domain Adaptation Network for Cross-Domain Image Retrieval

To verify the effectiveness of the proposed method, we conducted experiments on three tasks: USPS to MNIST, Amazon to Webcam, and Real-World Reconstruction to ModelNet10. In each task, the first dataset serves as the source domain and the second as the target domain. We evaluate the model's image retrieval performance with several criteria and report the results for the three tasks separately.

Evaluation Criteria

In our work, we used Mean Average Precision (MAP) and First Tier (FT) to evaluate the retrieval performance.

The metric MAP provides an overall measurement of the retrieval performance. For the whole query set, MAP is the mean of the AP (average precision) for each query image, where AP is defined as:

$$\begin{aligned} AP=\frac{1}{R}\sum _{k=1}^{n} \frac{R_k}{k}\cdot rel_k, \end{aligned}$$

(12)

where n is the size of the database set, R is the number of relevant images in the database set, \(R_k\) is the number of relevant images in the top k returns, and \(rel_k=1\) if the image ranked at the kth position is relevant and 0 otherwise.
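As an illustration of Eq. (12), the following is a minimal sketch in plain Python. The function names and the convention of returning 0 for a query without relevant images are assumptions made for this example, not part of the original experiments. The input `rel` is the relevance list of the full ranked database set for one query.

```python
from typing import Sequence


def average_precision(rel: Sequence[int]) -> float:
    """AP of Eq. (12); rel[k-1] is 1 if the k-th ranked image is relevant, else 0."""
    total_relevant = sum(rel)  # R, since rel covers the whole database set
    if total_relevant == 0:
        return 0.0  # assumption: a query with no relevant images scores 0
    hits = 0  # R_k, the number of relevant images in the top k returns
    ap = 0.0
    for k, r in enumerate(rel, start=1):
        hits += r
        ap += (hits / k) * r  # (R_k / k) * rel_k
    return ap / total_relevant


def mean_average_precision(all_rel: Sequence[Sequence[int]]) -> float:
    """MAP: the mean of AP over all queries in the query set."""
    return sum(average_precision(rel) for rel in all_rel) / len(all_rel)
```

For the ranking `[1, 0, 1, 0]`, R = 2 and the relevant images sit at positions 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.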

The metric FT records the recall of the returned top R results. For each query image, FT is defined as:

$$\begin{aligned} FT=\frac{R_r}{R}, \end{aligned}$$

(13)

where \(R_r\) is the number of relevant images in the returned top R results. By calculating the mean of the FT for each query image, we can measure the retrieval performance effectively.
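Eq. (13) can be sketched in a few lines of plain Python. The function name and the handling of a query without relevant images are assumptions made for this example. As before, `rel` is the relevance list of the full ranked database set for one query.

```python
from typing import Sequence


def first_tier(rel: Sequence[int]) -> float:
    """FT of Eq. (13): fraction of the R relevant images found in the top R results."""
    total_relevant = sum(rel)  # R
    if total_relevant == 0:
        return 0.0  # assumption: a query with no relevant images scores 0
    hits_in_top_r = sum(rel[:total_relevant])  # R_r
    return hits_in_top_r / total_relevant


def mean_first_tier(all_rel: Sequence[Sequence[int]]) -> float:
    """Mean of FT over all queries in the query set."""
    return sum(first_tier(rel) for rel in all_rel) / len(all_rel)
```

For the ranking `[1, 0, 1, 0]`, R = 2 and only one relevant image appears in the top two results, so FT = 0.5.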

Experimental Setup

We conducted the experiments on an NVIDIA GeForce GTX 1080Ti GPU with about 11 GB of memory, using Python 3.6 and PyTorch 1.2. We optimized the model with the SGD optimizer and set the learning rate separately for each task. In all experiments, the Euclidean distance is used to measure the similarity of the deep features.

Datasets

MNIST (Modified National Institute of Standards and Technology database) is one of the most commonly used datasets in deep learning. It contains a total of 70,000 grayscale images of handwritten digits with 10 labels. The dataset is split in advance into a training set of 60,000 images and a test set of 10,000 images.

The USPS dataset contains the same 10 labels as the MNIST dataset, with more than 7,000 images in the training set and about 2,000 images in the test set. Each image is \(16 \times 16\) pixels, which is smaller than the \(28 \times 28\) images of MNIST.

Both Amazon and Webcam are part of the OFFICE dataset [33], a color image dataset widely used in visual domain adaptation.

The images in the Amazon (A) domain are obtained from the shopping site Amazon. They show products at medium resolution, photographed under ideal lighting conditions. This domain contains 31 classes with an average of 90 images each. The Webcam (W) domain consists of images of the same 31 classes, recorded with a simple webcam. These images are of low resolution (\(640 \times 480\)) and show significant noise as well as color and white balance artifacts. The Webcam domain contains 795 images in total.

ModelNet10 is one of the most commonly used datasets in the field of 3D object recognition. It is synthesized with CAD software and consists of 10 classes such as bathtub, bed, chair, desk and sofa. The dataset is split in advance into a training set and a test set, and the number of objects differs across categories.

Unlike ModelNet10, Real-World Reconstruction [32] is a benchmark of real-world scans, comprising 243 objects of 12 classes such as bathtub, bed, chair and cup. The geometry is captured with an ASUS Xtion Pro, and a dense reconstruction is obtained using the publicly available Voxel Hashing framework [31]. Compared with ModelNet10, this dataset has more occlusions and is more difficult to recognize.

Experimental Results

Experimental Results on USPS to MNIST Task

In this experiment, we used the USPS dataset as the labeled source domain and the MNIST dataset as the unlabeled target domain. We trained the network on the original training sets of MNIST and USPS. After training, 100 images from each class in the MNIST test set were selected as the query set, and the original USPS training set was used as the database set for retrieval. For each query image, the images in the database set were sorted by the Euclidean distance between deep features and returned in order of decreasing similarity.
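The ranking step can be sketched as follows. This is a minimal illustration in plain Python, assuming the deep features have already been extracted as numeric vectors; the function names and the toy features are hypothetical and are not taken from the original implementation.

```python
import math
from typing import List, Sequence


def rank_database(query: Sequence[float],
                  database: Sequence[Sequence[float]]) -> List[int]:
    """Return database indices sorted by Euclidean distance to the query (closest first)."""
    distances = [math.dist(query, feature) for feature in database]
    return sorted(range(len(database)), key=lambda i: distances[i])


def relevance_list(ranking: Sequence[int], query_label: int,
                   database_labels: Sequence[int]) -> List[int]:
    """1 where the ranked database item shares the query's label, else 0."""
    return [int(database_labels[i] == query_label) for i in ranking]


# Toy 2-D features; real deep features would have many more dimensions.
database = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]
labels = [7, 3, 3]
ranking = rank_database((0.0, 0.0), database)
rel = relevance_list(ranking, query_label=3, database_labels=labels)
```

In this toy case the distances to the query are 5, 1 and 2, so the ranking is `[1, 2, 0]` and the relevance list is `[1, 1, 0]`. A relevance list of this form is what the MAP and FT criteria operate on.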

Our feature extractor F consisted of two convolutional layers and a fully connected layer. The number of convolution kernels in the two convolutional layers was 20 and the kernel size was 5. The batch size was set to 64, the initial learning rate to 0.01, the weight decay to 0.0005, and the momentum to 0.9.

To demonstrate the effectiveness of the proposed method, we compared it with several other methods. Table 1 reports the results. CNN denotes applying the model trained on the labeled USPS dataset directly to the MNIST dataset for feature extraction; its retrieval performance is poor. DANN adds an adversarial loss to CNN to improve the feature extraction ability of the model on unlabeled images, and its performance is slightly better than that of CNN. The CDAN method is equivalent to adding the conditional adversarial loss to CNN. Compared with the traditional adversarial loss, the conditional adversarial loss improves the retrieval performance more significantly. Adding the center loss to CDAN improves the retrieval performance further. The last row represents the complete method proposed in this paper, which achieves very good performance. Compared with the model without the nuclear norm loss, it improves MAP by about 0.04 and FT by more than 0.035.

Table 1 Retrieval performance on the USPS to MNIST task. The best results are highlighted in boldface

Experimental Results on Amazon to Webcam Task

In this experiment, we used the Amazon (A) domain as the labeled source domain and the Webcam (W) domain as the unlabeled target domain. We trained the network on all images in both domains. After training, we used the Amazon domain as the database set and constructed the query set from the Webcam domain following the method of [27]. For each query image, the images in the database set were sorted by the Euclidean distance between deep features and returned in order of decreasing similarity.

We used ResNet50 pre-trained on the ImageNet dataset as the feature extractor F. Taking the characteristics of the dataset into account, we set different learning rates for different layers. For the convolutional part of the network, the learning rate was set to 0.001. For the discriminator D, the classifier C and the fully connected layer of the feature extractor F, we increased the learning rate to 0.01. The weight decay was 0.0005 and the momentum was 0.9.
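Layer-wise learning rates of this kind can be expressed with optimizer parameter groups in PyTorch. The sketch below is illustrative only: the small modules are hypothetical stand-ins for the ResNet50 convolutional part, the fully connected layer of F, the classifier C and the discriminator D, and their sizes are arbitrary. Only the learning rates, weight decay and momentum come from the text.

```python
import torch
from torch import nn

# Hypothetical stand-ins; the actual networks are much larger.
conv_part = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU())
fc_layer = nn.Linear(8, 4)        # fully connected layer of F
classifier = nn.Linear(4, 31)     # classifier C (31 classes in Amazon/Webcam)
discriminator = nn.Linear(4, 1)   # discriminator D

# Parameters that are trained with the larger learning rate.
head_params = (list(fc_layer.parameters())
               + list(classifier.parameters())
               + list(discriminator.parameters()))

optimizer = torch.optim.SGD(
    [
        {"params": conv_part.parameters(), "lr": 0.001},  # convolutional part
        {"params": head_params, "lr": 0.01},              # D, C and the FC layer of F
    ],
    lr=0.01,              # default, overridden by the per-group values above
    momentum=0.9,
    weight_decay=0.0005,
)
```

Each dictionary defines one parameter group, and options not given in a group (here momentum and weight decay) fall back to the keyword arguments of the optimizer.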

To demonstrate the effectiveness of our approach, we compared it with several other methods. Table 2 reports the results. ResNet50 denotes applying the model trained on the labeled Amazon (A) domain directly to the Webcam (W) domain. The DANN method is equivalent to adding an adversarial loss to the ResNet50 network. The CDAN method replaces the adversarial loss used in DANN with the conditional adversarial loss. The results show that the network with the conditional adversarial loss has better feature extraction ability than the network with the traditional adversarial loss. The last row represents the complete method proposed in our paper. Compared with CDAN, it improves MAP by about 0.14 and FT by more than 0.13.

Table 2 Retrieval performance on the Amazon to Webcam task. The best results are highlighted in boldface

Experimental Results on Real-World Reconstruction to ModelNet10 Task

In this experiment, the Real-World Reconstruction dataset was used as the labeled source domain and the ModelNet10 dataset as the unlabeled target domain. We used the whole Real-World Reconstruction dataset and part of the ModelNet10 training set to train our network. After training, we selected 20 objects of each class from the ModelNet10 test set as the query set. The objects in the database set were then ranked by the similarity measure and returned in order of similarity.

Following the experimental settings of [20], we used a multi-view convolutional neural network (MVCNN) [42] as the feature extractor F. To transform each 3D object into a set of views, the Phong reflection model was used to render multiple views of the 3D models. As in most related works, we created 12 views by placing 12 virtual cameras around the model, one every 30 degrees. Figure 4 shows the 12 views of some 3D objects rendered with the Phong reflection model. The top two objects are from the Real-World Reconstruction dataset, and the bottom two are from the ModelNet10 dataset. Fig. 4 also shows that the styles of the two datasets differ.

Fig. 4 Examples of the used 3D dataset. The top two objects are from the Real-World Reconstruction dataset, and the bottom two are from the ModelNet10 dataset
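The camera placement for the 12 views can be sketched as below. Only the number of views and the 30 degree spacing come from the text. The camera radius, the 30 degree elevation above the ground plane and the choice of z as the up axis are assumptions made for this illustration, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def azimuth_angles(num_views: int = 12) -> List[float]:
    """Azimuth of each virtual camera in degrees, evenly spaced around the model."""
    step = 360.0 / num_views  # 30 degrees for 12 views
    return [i * step for i in range(num_views)]


def camera_positions(num_views: int = 12, radius: float = 1.0,
                     elevation_deg: float = 30.0) -> List[Tuple[float, float, float]]:
    """Camera centers on a circle around the origin, all looking at the model.

    radius and elevation_deg are assumed values, not taken from the paper.
    """
    elevation = math.radians(elevation_deg)
    positions = []
    for azimuth_deg in azimuth_angles(num_views):
        azimuth = math.radians(azimuth_deg)
        x = radius * math.cos(elevation) * math.cos(azimuth)
        y = radius * math.cos(elevation) * math.sin(azimuth)
        z = radius * math.sin(elevation)  # z is taken as the up axis
        positions.append((x, y, z))
    return positions
```

Each position would then be passed to a renderer together with a look-at target at the model center to produce one view.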

Figure 5 shows the structure of MVCNN. Each image in the multi-view representation of a 3D object was passed through the first part of the network (CNN1) separately, aggregated at a view-pooling layer, and then sent through the remaining part of the network (CNN2). All branches in the first part of the network share the same CNN1 parameters. For the view-pooling layer, we used an element-wise maximum operation over all views. Compared with previous methods that generate a feature descriptor for each view, MVCNN extracts a single deep feature from all views, which greatly accelerates retrieval.

Fig. 5 The structure of MVCNN. A 3D image is rendered from 12 different views and is passed through CNN1 to extract view-based features. These are then pooled across views and passed through CNN2 to obtain a compact feature
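The element-wise maximum used in the view-pooling layer can be illustrated with a small plain-Python sketch. The function name and the three toy feature vectors are hypothetical; in the actual network the operation is applied to the CNN1 feature maps of all 12 views.

```python
from typing import List, Sequence


def view_pool(view_features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over the per-view feature vectors."""
    if not view_features:
        raise ValueError("at least one view is required")
    # zip(*...) groups the i-th entry of every view; max keeps the strongest response.
    return [max(values) for values in zip(*view_features)]


# Three toy views with three features each.
views = [
    [0.1, 0.9, 0.3],
    [0.4, 0.2, 0.8],
    [0.0, 0.5, 0.6],
]
pooled = view_pool(views)  # [0.4, 0.9, 0.8]
```

The pooled vector has the same length as a single view's feature, so the cost of the later layers and of the distance computation does not grow with the number of views.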

We used AlexNet pre-trained on the ImageNet dataset as the CNN1 of MVCNN. After CNN2, we added a fully connected layer to adjust the dimension of the deep features. During training, each view was resized to \(224 \times 224\). For the convolutional part of the network, the learning rate was set to 0.0001. For the discriminator D, the classifier C and the fully connected layer of the feature extractor F, we increased the learning rate to 0.001.

Table 3 shows the performance on the Real-World Reconstruction to ModelNet10 task. MVCNN denotes the model obtained by fine-tuning the pre-trained MVCNN network using only the labeled Real-World Reconstruction dataset. The last row gives the retrieval performance of the method proposed in this paper, which achieves very good performance. Compared with the model without the center loss, it improves MAP by more than 0.09 and FT by about 0.02.

Table 3 Retrieval performance on the Real-World Reconstruction to ModelNet10 task. The best results are highlighted in boldface

Analysis of the Effect of \(\lambda _1\)

Here we further analyze the influence of \(\lambda _1\) through experiments. Specifically, we train our model with different values of \(\lambda _1\) on the USPS to MNIST and Amazon to Webcam tasks and compare the image retrieval performance. \(\lambda _1\) starts at 0 and is increased by 0.005 after each experiment until it reaches 0.02. Figure 6 shows the results for the different values of \(\lambda _1\). The best performance is achieved at \(\lambda _1=0.005\), and both smaller and larger values of \(\lambda _1\) yield worse performance.

Fig. 6 The performance of our method trained with different \(\lambda _1\) on USPS to MNIST (a) and Amazon to Webcam task (b)
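The sweep protocol (start at 0, step 0.005, stop at 0.02) can be sketched as follows. The function names are hypothetical, and `evaluate` is a placeholder for a full train-and-retrieve run that returns a score such as MAP. The toy scoring function in the usage lines is invented purely to make the example runnable and does not reproduce any reported result.

```python
from typing import Callable, List, Sequence, Tuple


def sweep_values(start: float = 0.0, step: float = 0.005,
                 stop: float = 0.02) -> List[float]:
    """Evenly spaced weights from start to stop inclusive."""
    num_steps = round((stop - start) / step)
    # Rounding removes floating-point drift such as 0.015000000000000001.
    return [round(start + i * step, 6) for i in range(num_steps + 1)]


def best_value(values: Sequence[float],
               evaluate: Callable[[float], float]) -> Tuple[float, float]:
    """Run evaluate once per weight and return (best weight, best score)."""
    scored = [(evaluate(value), value) for value in values]
    best_score, best_weight = max(scored)
    return best_weight, best_score


values = sweep_values()  # [0.0, 0.005, 0.01, 0.015, 0.02]

# Toy score with a peak at 0.01, standing in for a real training run.
weight, score = best_value(values, lambda v: 1.0 - abs(v - 0.01))
```

With five values per weight, each single-parameter analysis requires five training runs per task.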

Analysis of the Effect of \(\lambda _2\)

Here we further analyze the influence of \(\lambda _2\) through experiments. Specifically, we train our model with different values of \(\lambda _2\) on the USPS to MNIST and Amazon to Webcam tasks and compare the image retrieval performance. \(\lambda _2\) starts at 0 and is increased by 0.005 after each experiment until it reaches 0.02. Figure 7 shows the results for the different values of \(\lambda _2\). The best performance is achieved at \(\lambda _2=0.01\), and both smaller and larger values of \(\lambda _2\) yield worse performance.

Fig. 7 The performance of our method trained with different \(\lambda _2\) on USPS to MNIST (a) and Amazon to Webcam task (b)

Analysis of the Effect of \(\lambda _3\)

Here we further analyze the influence of \(\lambda _3\) through experiments. Specifically, we train our model with different values of \(\lambda _3\) on the USPS to MNIST and Amazon to Webcam tasks and compare the image retrieval performance. \(\lambda _3\) starts at 0 and is increased by 0.005 after each experiment until it reaches 0.02. Figure 8 shows the results for the different values of \(\lambda _3\). The best performance is achieved at \(\lambda _3=0.01\), and both smaller and larger values of \(\lambda _3\) yield worse performance.

Fig. 8 The performance of our method trained with different \(\lambda _3\) on USPS to MNIST (a) and Amazon to Webcam task (b)

Analysis of the Effect of Different Combinations of \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\)

Here we further analyze the effect of different combinations of \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) through experiments. Specifically, we train our model with different combinations of \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) on the USPS to MNIST task and compare the image retrieval performance. Figure 9 shows the results. The best performance is achieved at \(\lambda _1=0.005\), \(\lambda _2=0.015\) and \(\lambda _3=0.01\).

Fig. 9 The performance of our method trained with different combinations of \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) on USPS to MNIST
