• arxiv preprint - MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

  • Jun 27 2024
  • Length: 6 mins
  • Podcast


  • Summary

  • In this episode, we discuss MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning by Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang. The study presents MG-LLaVA, a multi-modal large language model (MLLM) that processes low-resolution and high-resolution images together with object-centric features to improve perception tasks. It adds a high-resolution visual encoder and a Conv-Gate fusion network that fuses fine-grained details with the base features. To improve object recognition, it also incorporates object-level features derived from bounding boxes produced by offline detectors. The authors instantiate MG-LLaVA with language encoders ranging from 3.8B to 34B parameters and report that, across extensive benchmarks, it outperforms MLLMs of comparable parameter size.
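
To make the idea of gated fusion concrete, here is a minimal sketch of how fine-grained features could be blended into base features through a learned gate. It is not the authors' Conv-Gate network: the convolution is swapped for a simple per-channel gate, and the function name `gated_fusion` and the parameters `w_base`, `w_fine`, and `bias` are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(base, fine, w_base, w_fine, bias):
    """Blend fine-grained features into base features with a per-channel gate.

    base, fine: lists of feature vectors (one vector per spatial position),
        assumed to be already aligned to the same shape.
    w_base, w_fine, bias: per-channel gate parameters (hypothetical; a real
        model would learn these, typically inside a convolution).
    """
    fused = []
    for b_vec, f_vec in zip(base, fine):
        out = []
        for c, (b, f) in enumerate(zip(b_vec, f_vec)):
            # The gate decides how much fine-grained detail to let through.
            gate = sigmoid(w_base[c] * b + w_fine[c] * f + bias[c])
            # Residual form: base features are kept, detail is added on top.
            out.append(b + gate * f)
        fused.append(out)
    return fused


if __name__ == "__main__":
    base = [[1.0, 2.0]]   # one position, two channels
    fine = [[4.0, -2.0]]
    zeros = [0.0, 0.0]
    # With all gate parameters at zero the gate is exactly 0.5 everywhere.
    print(gated_fusion(base, fine, zeros, zeros, zeros))
```

The residual form means that a gate close to zero leaves the base features essentially unchanged, so the high-resolution branch can only add detail where the gate opens.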

