Multi-task Learning for Outfit Recommendation

Author: Pengjie Ren.

Outfit recommendation systems recommend fashionable products to people. The task is attracting a lot of attention and is widely used in fashion-oriented online communities such as Polyvore and Chictopia, and in online retail markets such as Amazon, Tmall and JD.COM. As shown in Figure 1, given a top (i.e., an upper garment), outfit recommendation tries to recommend a list of bottoms (e.g., trousers or skirts) from a large collection that best match the top, and vice versa.

Figure 1: Outfit recommendation.

Several studies have tried to improve the performance of outfit recommendation, most of which focus on designing a single model that is specific to the recommendation task. Multi-task learning, which aims to leverage useful information contained in related tasks to improve the generalization performance of the target task, has shown promising performance in many applications. In this blog post, I will introduce two recent attempts by our team to use multi-task learning for outfit recommendation.

Outfit Recommendation + Text Generation

User comments provide valuable information about users’ preferences and attitudes toward products that can help improve recommendation performance. Existing studies in the outfit recommendation literature mostly neglect user comments. Figure 2 shows an outfit with its user comments collected from Polyvore, a well-known online fashion community that was recently acquired by SSENSE.

Figure 2: Outfits with user comments from Polyvore. Users share their outfit compositions with a broad public (left) and others express their comments about the outfit compositions (right).

We can see that some user comments (e.g., the first one) are helpful for better understanding the outfit. For example, the words “black” and “yellow” describe the colors of the outfit. Such comments also help to explain why the outfit is a good combination: the words “office” and “color” point to two different aspects that make the recommendation interpretable.

Neural Outfit Recommendation

To leverage the valuable information in user comments and at the same time provide explainable outfit recommendations, we propose a neural multi-task learning framework, called neural outfit recommendation (NOR), as shown in Figure 3. NOR consists of two core ingredients: outfit matching and comment generation. For outfit matching, we employ a convolutional neural network (CNN) with a mutual attention mechanism to extract visual features of outfits.

Figure 3: Overview of NOR.

Specifically, we first use CNNs to model tops and bottoms as latent vectors. We then propose a mutual attention mechanism that extracts better visual features of both tops and bottoms by using the top vectors to match the bottom vectors, and vice versa. The visual features are then decoded into a rating score that serves as the matching prediction. For abstractive comment generation, we employ a gated recurrent neural network that transforms the visual features into a concise sentence. To generate each word, NOR learns a mapping between the visual and the textual space through a cross-modality attention mechanism. The two parts are trained jointly in a multi-task learning framework, using end-to-end back-propagation.
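
To make the matching side of this description more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the NOR implementation: the dot-product attention, the mean-pooled query vectors, the sigmoid scorer and the weighted-sum joint loss (with the hypothetical parameter `weight`) are all simplifications chosen for readability, and the region vectors stand in for CNN outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def attend(regions, query):
    """Weight each region vector by its softmax-normalised similarity to `query`."""
    weights = softmax([dot(r, query) for r in regions])
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return attended, weights


def mutual_attention(top_regions, bottom_regions):
    """Attend over the top using a summary of the bottom, and vice versa.

    Assumption: each item is summarised by the mean of its region vectors.
    """
    top_global = mean_vector(top_regions)
    bottom_global = mean_vector(bottom_regions)
    top_feat, top_w = attend(top_regions, bottom_global)
    bottom_feat, bottom_w = attend(bottom_regions, top_global)
    return top_feat, bottom_feat, top_w, bottom_w


def matching_score(top_feat, bottom_feat):
    """Decode the two attended features into a score in (0, 1) (assumed scorer)."""
    return 1.0 / (1.0 + math.exp(-dot(top_feat, bottom_feat)))


def joint_loss(matching_loss, generation_loss, weight=1.0):
    """Multi-task objective as a weighted sum of the two task losses (assumed form)."""
    return matching_loss + weight * generation_loss


# Toy example: two region vectors per item instead of real CNN feature maps.
top_regions = [[1.0, 0.0], [0.0, 1.0]]
bottom_regions = [[2.0, 0.0], [2.0, 0.0]]
top_feat, bottom_feat, top_w, bottom_w = mutual_attention(top_regions, bottom_regions)
score = matching_score(top_feat, bottom_feat)
```

In the toy example the first top region receives the larger attention weight because it is the one aligned with the bottom. In a trained model, both task losses would be back-propagated through the shared visual encoder, which is how the comment generation task can influence the matching features.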

Experiments with NOR

Figure 4 compares NOR with other fashion recommendation methods on the ExpFashion dataset, which was collected from Polyvore. Bottom recommendation is the task of recommending bottoms for a given top, and top recommendation is the task of recommending tops for a given bottom. NOR achieves the best performance on both recommendation tasks and on both metrics, AUC (area under the ROC curve) and MRR (mean reciprocal rank). Comparing NOR with NOR-CG, which is NOR without the comment generation part, we see that NOR performs better; combining outfit recommendation with comment generation therefore improves the recommendation.

Figure 4: Experimental results of NOR.
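
For readers unfamiliar with the two metrics, the following sketch shows how AUC and MRR are commonly computed for a ranking task. The function names and the toy inputs are illustrative assumptions; the exact evaluation protocol used in the experiments (for example, how negative items are sampled) is not described in this post.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    correct = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positive_scores) * len(negative_scores))


def mean_reciprocal_rank(ranked_lists):
    """Average of 1/rank of the first matching item per query.

    `ranked_lists` holds, per query, the candidate labels sorted by descending
    model score (1 = matching item, 0 = non-matching item). A query without
    any matching item contributes 0.
    """
    total = 0.0
    for labels in ranked_lists:
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Three of the four (positive, negative) pairs are ordered correctly.
print(auc([0.9, 0.4], [0.5, 0.1]))  # 0.75
# The matching bottom is ranked 2nd for the first top and 1st for the second.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # 0.75
```

AUC rewards a model for scoring matching items above non-matching ones across all pairs, whereas MRR focuses on how high the first matching item appears in each ranked list.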

Figure 5 shows some instances from our test set, where we generate comments for each combination. We find that the generated comments are mostly grammatical and well-formed. Most of them express feelings and opinions about the combinations, which can be treated as explanations of why the top and the bottom match. For example, “Love the color combination” points out directly that color matching is the reason for the recommendation, and “great denim look” expresses that the material of the outfit is nice, which is a good explanation for recommending this particular combination.

Figure 5: Recommendation and generated comments with NOR.

Outfit Recommendation + Image Generation

Most studies on outfit recommendation only try to recommend outfits by exploring existing items in a database. But how about directly generating a bottom that matches the given top? Will bottom generation help improve the recommendation performance? Motivated by these questions, we have designed a new pipeline for outfit recommendation, as shown in Figure 6. For a given top, we aim not only to recommend bottoms but also to generate them. To promote personalization, we also allow users to provide descriptions as conditions that the recommended items should match as closely as possible.


Figure 6: Joint outfit recommendation and generation.

We address this new task with a neural co-supervision learning framework, called FAshion Recommendation Machine (FARM), shown in Figure 7. By incorporating the generation process as a supervision signal, FARM is able to encode more aesthetic characteristics, based on which we can directly generate the output items. A novel layer-to-layer matching mechanism evaluates the matching score of generated and candidate items at different neural layers, so that FARM fuses generation features from different visual levels to improve the recommendation performance. This mechanism also keeps FARM from paying too much attention to the generation quality at the expense of the recommendation performance.

Figure 7: Overview of FARM.
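
The idea of layer-to-layer matching can be sketched as follows. This is a simplified illustration, not FARM itself: comparing features with cosine similarity, fusing the per-layer scores with fixed weights, and the names `layer_to_layer_score` and `rank_candidates` are assumptions made for this example, whereas the actual model learns the matching inside the network.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def layer_to_layer_score(generated_layers, candidate_layers, layer_weights=None):
    """Compare a generated item and a candidate item layer by layer.

    Each argument is a list of feature vectors, one per neural layer
    (for example, from low-level texture to high-level shape features).
    Returns the fused score and the individual per-layer scores.
    """
    if len(generated_layers) != len(candidate_layers):
        raise ValueError("both items need features from the same number of layers")
    per_layer = [cosine(g, c) for g, c in zip(generated_layers, candidate_layers)]
    if layer_weights is None:
        # Assumption: uniform fusion weights when none are given.
        layer_weights = [1.0 / len(per_layer)] * len(per_layer)
    fused = sum(w * s for w, s in zip(layer_weights, per_layer))
    return fused, per_layer


def rank_candidates(generated_layers, candidates):
    """Order candidate names by their fused similarity to the generated item."""
    scored = [
        (layer_to_layer_score(generated_layers, feats)[0], name)
        for name, feats in candidates.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy features at two layers; real features would come from the network.
generated = [[1.0, 0.0], [0.0, 1.0, 0.0]]
candidates = {
    "jeans": [[1.0, 0.0], [0.0, 2.0, 0.0]],
    "skirt": [[0.0, 1.0], [1.0, 0.0, 0.0]],
}
print(rank_candidates(generated, candidates))  # ['jeans', 'skirt']
```

Because the score depends on how well a candidate agrees with the generated item at every layer, the generated item acts as a reference for ranking real candidates rather than as an end in itself.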

Experiments with FARM

As shown in Figure 8, we again compare FARM with other fashion recommendation methods on the ExpFashion dataset. We find that FARM can outperform state-of-the-art baselines. We also analyze the impact of the generation part and find that it helps improve the recommendation performance.

Figure 8: Experimental results of FARM.

Figure 9 shows some instances from our test set, where we generate a top (or bottom) given a bottom (or top) and a description of the imaginary target. Overall, the generated items match the given input, which means they can be good references for recommendation. For example, in the sixth case of the top generation, the generated navy blouse with the yellow knee-length skirt looks beautiful and elegant. From these samples we can see that FARM is able to generate fashion items based on the relation between the visual features of different fashion items. The generated items also accord with the given descriptions, whatever those descriptions are about. For example, in the second case of the top generation, the description is “grey wool coats,” so the generated top is a grey coat that also looks like wool. In the bottom generation, the first and second cases show that FARM is able to distinguish between skinny jeans and bootcut jeans.

Figure 9: Outfit generation with FARM.

Want to know more?

NOR was introduced in “Explainable Outfit Recommendation with Joint Outfit Matching and Comment Generation,” a paper recently published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 2019. Code and datasets for NOR can be found here. FARM was introduced in “Improving Outfit Recommendation with Co-supervision of Fashion Generation,” a paper that is about to be presented at The Web Conference (WWW), 2019. Code and datasets for FARM can be found here.

Pengjie Ren is a postdoc within ILPS. His research interests include dialogue systems, recommender systems and text summarization.