Classic task parallelism is not a direct fit for SYCL (or GPUs in general), as tasks are often chosen to be tiny and contain no further nested parallelism. A task that we deploy to a GPU, however, has to be reasonably large and has to exploit the accelerator’s hardware concurrency. Within the ExaHyPE project, we work with tiny tasks, but we do not deploy all of them directly into the tasking runtime. Instead, we buffer them in application-specific queues. If many appropriate tasks “assemble” within such a queue and if we identify that they do not have any side effects, i.e. do not alter the global simulator state, we merge them into one large meta-task and hand this meta-task over to one specialised compute kernel, which can batch the computations over multiple patches, vectorise aggressively, and even deploy whole task assemblies to an accelerator. To support multiple vendors and to allow for a smooth transition into the OneAPI era, we implement this task merging layer on top of OpenMP, TBB and C++ threading, and design the actual compute kernels such that they are interoperable with all three “backends”.
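The buffering and merging idea can be illustrated with a minimal, host-only C++ sketch. It is not the ExaHyPE implementation: the names `PatchTask`, `TaskBuffer`, `updatePatches` and the merge threshold are hypothetical, the "kernel" merely doubles every cell as a placeholder, and everything runs sequentially. In a real code base, the single call to `updatePatches` over the whole batch is where one would presumably hand the meta-task over to OpenMP, TBB, C++ threads or an accelerator.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one tiny task: the update of a single patch.
struct PatchTask {
  std::vector<double> cells;  // patch-local data
  bool hasSideEffects;        // true if the task alters global simulator state
};

// Hypothetical compute kernel that works on one patch or on a whole batch.
// The doubling is a placeholder for the actual numerical update.
inline void updatePatches(PatchTask* tasks, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    for (double& cell : tasks[i].cells) {
      cell *= 2.0;
    }
  }
}

// Application-specific queue: buffers side-effect-free tasks and merges
// them into one meta-task once enough of them have assembled.
class TaskBuffer {
public:
  explicit TaskBuffer(std::size_t mergeThreshold)
      : mergeThreshold_(mergeThreshold) {
    assert(mergeThreshold_ > 0);
  }

  void submit(PatchTask task) {
    if (task.hasSideEffects) {
      // Tasks with side effects are not merged; they run on their own.
      updatePatches(&task, 1);
      finished_.push_back(std::move(task));
      ++singleTasksLaunched_;
      return;
    }
    pending_.push_back(std::move(task));
    if (pending_.size() >= mergeThreshold_) {
      flush();
    }
  }

  // Merge all buffered tasks into one meta-task and run it with one kernel call.
  void flush() {
    if (pending_.empty()) return;
    updatePatches(pending_.data(), pending_.size());
    for (auto& task : pending_) {
      finished_.push_back(std::move(task));
    }
    pending_.clear();
    ++metaTasksLaunched_;
  }

  std::size_t pending() const { return pending_.size(); }
  std::size_t metaTasksLaunched() const { return metaTasksLaunched_; }
  std::size_t singleTasksLaunched() const { return singleTasksLaunched_; }
  const std::vector<PatchTask>& finished() const { return finished_; }

private:
  std::size_t mergeThreshold_;
  std::size_t metaTasksLaunched_ = 0;
  std::size_t singleTasksLaunched_ = 0;
  std::vector<PatchTask> pending_;
  std::vector<PatchTask> finished_;
};
```

With a threshold of three, submitting three side-effect-free tasks triggers exactly one kernel launch over all three patches, whereas a task flagged with side effects bypasses the buffer and runs on its own. Using the same `updatePatches` routine for both paths mirrors the requirement that the compute kernels remain usable regardless of how a task is scheduled.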