MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

October 17, 2024
Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh
cs.AI

Abstract

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any real-world benchmark designed to optimize and standardize evaluations across input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions, and the model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.

AI-Generated Summary

Paper Overview

LLaMo is a large language model-based molecular graph assistant that excels at molecular tasks by integrating a graph encoder, a multi-level graph projector, and a large language model. It outperforms existing models in molecular description generation, property prediction, and IUPAC name prediction, demonstrating its strength in both generalist and specialist settings.

Core Contribution

  • Integration of a graph encoder, multi-level graph projector, and a large language model for instruction-following responses in the molecular domain.
  • Novel multi-level graph projector capturing multi-hop graph information by leveraging node representations from all layers of a GNN.
  • Two-stage training pipeline involving graph encoder training and LLM fine-tuning using LoRA.
  • Superior performance in molecular tasks like molecule description generation, property prediction, and IUPAC name prediction compared to existing LLM-based models.

Research Context

LLaMo addresses the need for enhanced instruction-following capabilities in molecular tasks by leveraging a multi-level graph projector and GPT-generated instruction-following data. It builds upon existing research in molecular modeling and language models, offering a comprehensive solution for accurate and informative molecule descriptions.

Keywords

Large Language Model, Molecular Graph, Graph Encoder, Multi-level Graph Projector, Graph Neural Networks, Instruction-following Responses, Molecular Description Generation, Property Prediction, IUPAC Name Prediction

Background

The research background of LLaMo involves the necessity for improved molecular modeling through language models. The study aims to bridge the gap in existing literature by introducing a novel approach that combines molecular graphs, text tokens, and SMILES representation in a large language model for enhanced instruction-following responses.

Research Gap

  • Lack of efficient instruction-following models in the molecular domain.
  • Limited integration of graph encoders and large language models for molecular tasks.
  • Insufficient exploration of multi-level graph projectors for capturing detailed molecular information.

Technical Challenges

  • Possible data leakage, since it is uncertain whether test data was excluded from LLM pretraining.
  • Memory and computational costs associated with LLM-based models.
  • Hallucination issues inherited from LLMs affecting model performance.

Prior Approaches

Existing solutions lack the comprehensive integration of graph encoders, multi-level graph projectors, and large language models for instruction-following responses in molecular tasks, and place limited emphasis on leveraging GPT-generated data for instruction tuning.

Methodology

The methodology of LLaMo involves a graph encoder, multi-level graph projector, and large language model for instruction-following responses in molecular tasks. The model undergoes two-stage training, focusing on graph encoder training and LLM fine-tuning using LoRA.
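In the second stage, the LLM is adapted with LoRA, which freezes the pretrained weights and trains only a low-rank update per adapted matrix. A minimal numpy sketch of the LoRA parameterization (toy dimensions; an illustration of the general technique, not LLaMo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                         # hidden size and LoRA rank (illustrative)
W = rng.normal(size=(d, d))         # frozen pretrained weight matrix
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                # B starts at zero, so the update starts at zero

def lora_forward(x, scale=1.0):
    # Effective weight is W + scale * (B @ A); only A and B would receive gradients.
    return x @ (W + scale * (B @ A)).T

x = rng.normal(size=(1, d))
assert np.allclose(lora_forward(x), x @ W.T)  # at initialization, LoRA is a no-op
```

Because only A and B (2·d·r parameters per adapted matrix) are trained, fine-tuning cost and memory stay far below full fine-tuning, which is the point of using LoRA in the second training stage.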

Theoretical Foundation

Utilization of Graph Neural Networks for updating node representations and a multi-level graph projector for capturing multi-hop graph information. Integration of large language models for instruction-following capabilities.
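These two components can be pictured concretely. The sketch below (plain numpy, toy sizes; the single linear layer standing in for the MLP and the projection dimension are simplifications, not the paper's implementation) shows a GIN-style node update and a multi-level projector that concatenates node representations from every GNN layer before projecting them into the language model's embedding space:

```python
import numpy as np

def gin_layer(H, A, W, eps=0.0):
    # GIN update: h_v' = MLP((1 + eps) * h_v + sum of neighbor features);
    # a single linear layer + ReLU stands in for the MLP here.
    agg = (1 + eps) * H + A @ H
    return np.maximum(agg @ W, 0.0)

def multi_level_projector(layer_reps, P):
    # Concatenate node representations from ALL GNN layers (multi-hop info)
    # and project them into the language model's embedding space.
    stacked = np.concatenate(layer_reps, axis=-1)  # (nodes, d * n_levels)
    return stacked @ P                             # (nodes, llm_dim)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)  # toy 3-atom graph
H = rng.normal(size=(3, 4))                             # initial node features, d=4
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

h1 = gin_layer(H, A, W1)
h2 = gin_layer(h1, A, W2)
P = rng.normal(size=(4 * 3, 8))          # project 3 levels into llm_dim=8
tokens = multi_level_projector([H, h1, h2], P)
print(tokens.shape)  # -> (3, 8): one "graph token" per node for the LLM
```

By drawing on the input features and every layer's output rather than only the final layer, the projector exposes both local (low-hop) and global (high-hop) structure to the language model.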

Technical Architecture

  • Graph encoder utilizing GNNs for iterative node representation updates.
  • Multi-level graph projector aligning node representations with the language model.
  • Backbone large language model for generating instruction-following responses.

Implementation Details

  • Usage of PyTorch, PyTorch Geometric, Huggingface transformers, and GIN for implementation.
  • Specific optimization parameters and training schedules for model training.
  • Leveraging GPT-4 for generating multi-turn conversation datasets for instruction-tuning.
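For the GPT-4-generated multi-turn instruction data, a single record might look like the following hypothetical schema (the summary does not give the paper's actual format; the field names and the `<graph>` placeholder are invented for illustration):

```python
# A hypothetical instruction-tuning record; the paper's actual schema may differ.
record = {
    "smiles": "CCO",  # ethanol, as a SMILES string
    "conversations": [
        {"role": "user",
         "content": "Describe this molecule: <graph> CCO"},
        {"role": "assistant",
         "content": "Ethanol is a simple primary alcohol ..."},
    ],
}

def to_prompt(rec):
    # Flatten the multi-turn conversation into a single training prompt.
    return "\n".join(f"{t['role']}: {t['content']}" for t in rec["conversations"])

print(to_prompt(record).splitlines()[0])  # first line is the user turn
```

At training time, the `<graph>`-style placeholder would be replaced by the projected graph tokens, so the LLM conditions jointly on the molecular graph, the SMILES string, and the text.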

Innovation Points

  • Introduction of a multi-level graph projector for capturing detailed molecular information.
  • Effective instruction-tuning using GPT-generated data for enhancing model performance.
  • Superior performance in molecular tasks due to the comprehensive integration of graph encoders and large language models.

Experimental Validation

LLaMo is experimentally validated for tasks like molecule description generation, IUPAC name prediction, and property prediction, showcasing its superior performance compared to existing models. The evaluation involves specific configurations, metrics, and comparative analyses.

Setup

  • Training the multi-level graph projector with molecule-description pairs from datasets like PubChem.
  • Fine-tuning the language model using various datasets and GPT-generated instruction-following data.
  • Evaluation on tasks like molecular description generation, IUPAC name prediction, and property prediction.

Metrics

  • Evaluation metrics include BLEU and METEOR for text generation tasks.
  • MAE is used for property question answering tasks.
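BLEU and METEOR are standard n-gram- and alignment-based text-generation scores; MAE for property question answering is simple enough to state directly. A tiny sketch with made-up property values, purely to illustrate the metric:

```python
def mae(preds, targets):
    # Mean absolute error over predicted vs. reference property values.
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

print(mae([1.5, 2.0], [1.0, 3.0]))  # -> 0.75
```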

Results

  • Superior performance of LLaMo on molecular tasks compared to baselines.
  • Results reported under detailed experimental settings, with specific implementation details and optimization parameters.

Comparative Analysis

  • Outperformance of LLaMo in chemical reaction tasks compared to existing models.
  • Benchmarking against LLM-based generalist models, molecule instruction-tuned models, and specialist models like MolCA.

Impact and Implications

LLaMo's impact lies in its superior performance in molecular tasks, although it faces limitations such as data leakage and computational costs. The model's broader implications include its wide applicability to various molecule-related tasks and potential biases in output.

Key Findings

  • Enhanced performance in molecular description generation, IUPAC name prediction, and property prediction.
  • Effective instruction-tuning with GPT-generated data for improved instruction-following capabilities.
  • Superiority over existing models in both generalist and specialist settings.

Limitations

  • Data leakage concerns, since it is uncertain whether test data was excluded from pretraining.
  • Computational costs and memory requirements.
  • Potential biases in model output and environmental impact due to CO2 emissions during LLM training.

Future Directions

  • Addressing data leakage issues through more stringent data handling protocols.
  • Mitigating computational costs through optimization strategies.
  • Exploring methods to reduce biases in model output and environmental impact.

Practical Significance

  • LLaMo's applications in accurate and informative molecule description generation.
  • Potential for advancements in property prediction and IUPAC name generation in chemistry and biology fields.

References

The paper acknowledges related works in the fields of molecular modeling and language models.
