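First of all, thank you for your work. This feels like a genuinely impressive multimodal interaction solution. I still have a few questions, though, and would like to know whether there are solutions for them.

Is it possible to include the encodings of images, audio, and text all together in the input? That would make it easier to bring in RAG techniques.

Second, regarding the current model's comprehension ability: the results returned with T1A2 are worse than with pure text input and text output. Adding audio output seems to make the results worse. What is the reason for this?

Could the training code be open-sourced? Thank you.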
With the same text input, the T1A2 output is wrong, while the T1T2 output is correct.