Even in the era of large language models (LLMs), which are claimed to solve many tasks, finetuning language models remains a core methodology in deployment for a variety of reasons, computational efficiency and performance maximization among them [1]. Finetuning can be single-task or multi-task joint learning (MTL), in which the tasks support each other and thereby boost one another's performance.
MTL learns shared representations across tasks and jointly optimizes the losses of all included tasks, which reduces the risk of overfitting [2]. Compared to single-task learning (STL), MTL has been shown to improve performance and generalization in many natural language processing (NLP) tasks [3–6]. However, empirical results also suggest that MTL is not always effective and that naively grouping tasks can cause negative transfer, where joint training degrades performance on some tasks [7–12]. The space of possible task combinations can be massive, and naively searching that space to find the best joint learning models is inefficient [10, 13].
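To give a sense of scale, the short Python sketch below counts the candidate task groups (subsets of at least two tasks) available for a given number of tasks. It is only an illustration: the function name `count_candidate_groups` is hypothetical, and it counts individual groups rather than full partitions of the task set, which are more numerous still.

```python
from math import comb


def count_candidate_groups(n_tasks: int, min_size: int = 2) -> int:
    """Count the subsets of n_tasks tasks that contain at least min_size tasks."""
    return sum(comb(n_tasks, k) for k in range(min_size, n_tasks + 1))


# With 15 tasks there are 2**15 - 1 - 15 groups of two or more tasks.
print(count_candidate_groups(15))  # 32752
```

Training one joint model per candidate group quickly becomes impractical, which is why exhaustive search over groupings is rarely an option.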
To find the best task combinations, some recent studies have developed new optimization methods that focus on measuring the relatedness among tasks [10, 12–15]. For example, Vu et al. [16] applied task embeddings to predict the transferability of source tasks to a target task. Fifty et al. [10] measured inter-task affinity by examining how one task's gradient updates on the shared parameters influence the objective of another task. Song et al. [13] leveraged a meta-learning framework on task combinations. Li et al. [12] applied surrogate models to identify negative transfer among different groupings during MTL and to select the best combinations for joint learning. However, finding the optimal task grouping usually involves combining many, if not all, tasks for training and optimization, which becomes computationally intensive as the number of tasks increases. To better understand task-relatedness in MTL with neural networks, researchers have also offered initial clues toward formalizing its definition through measurable variables. Specifically, some work suggests that auxiliary tasks with compact and more uniform label distributions are preferable for semantic sequence prediction problems [8]. Others found that gains are more likely for main tasks that plateau quickly when paired with non-plateauing auxiliary tasks [17]. In certain domains, such as financial NLP, results show that MTL works well when the tasks are related and draw on diverse skills [14]. Nevertheless, we still lack a shared definition of task-relatedness, or a metric that measures the amount of cross-task usable information for a given model in the joint learning context.
This work studies the use of pointwise V-usable information (PVI) [18] to measure the usable information of different datasets and to jointly train tasks with similar information gains for a given model. PVI, recently introduced by Ethayarajh et al. [18], estimates the difficulty of data instances for a given model in supervised learning. It builds on the predictive V-information framework [19], which encompasses notions such as mutual information and the coefficient of determination, and extends it to quantify the difficulty of individual data instances. The metric uses instance-level predictions to quantify how much information a given model can extract from a dataset. The higher the PVI estimate, the easier it is for the model to represent a given data point. In this context, we cast PVI as an estimate of task-relatedness to guide MTL. We hypothesize that grouping tasks with similar PVI distributions, in other words tasks of comparable difficulty, promotes model generalization across the targeted tasks in MTL.
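As a minimal sketch of the quantity discussed here, PVI as defined by Ethayarajh et al. [18] is the log-probability (in bits) that a model finetuned on the real inputs assigns to the gold label, minus the log-probability assigned by the same model finetuned on null (empty) inputs. The Python below computes this from two probabilities. The function names and the toy probabilities are hypothetical and serve only as illustration; in practice the probabilities would come from two finetuned models evaluated on held-out data.

```python
import math
from statistics import mean


def pvi(p_given_input: float, p_given_null: float) -> float:
    """Pointwise V-usable information, in bits, for one instance.

    p_given_input: probability the input-finetuned model gives the gold label.
    p_given_null:  probability the null-input model gives the same label.
    """
    return math.log2(p_given_input) - math.log2(p_given_null)


def mean_pvi(prob_pairs):
    """Average PVI over a dataset, an estimate of its V-usable information."""
    return mean(pvi(p_x, p_null) for p_x, p_null in prob_pairs)


# Hypothetical gold-label probabilities: (with input, with null input).
toy_task = [(0.5, 0.25), (0.25, 0.5), (1.0, 0.25)]
print([pvi(p_x, p_null) for p_x, p_null in toy_task])  # [1.0, -1.0, 2.0]
print(mean_pvi(toy_task))
```

A positive value means the input made the gold label easier to predict than the label prior alone; a negative value means the model did worse with the input than without it. Comparing the per-instance PVI values of two datasets is one way to compare their difficulty for a given model.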
To investigate the feasibility of identifying the best task groupings for MTL using PVI, we conducted experiments with 15 NLP datasets in the general, biomedical and clinical domains. We compared the MTL results for task groupings selected by PVI estimate distributions against the best-performing finetuned single-learner models. We also compared performance against recent LLMs, including Llama 2 [20], Llama 3 [21], and GPT-4 [22], which have demonstrated their ability as general-purpose solvers across a wide range of NLP tasks, with or without downstream data adaptation [23–26]. Finally, we provide a comparison to two baseline task grouping methods, task embedding [16] and surrogate models [12], both considered state of the art.