Current Vacancies

Intern - Surgical Video Comprehension and Multimodal Basic Modeling

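The Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, is a research centre working at the forefront of artificial intelligence and robotics. We are recruiting postdoctoral fellows, research assistants, and interns worldwide, and invite candidates who are passionate about research and eager to innovate to join us in building foundation models for surgical video understanding. The project focuses on developing advanced multimodal foundation models, aiming for breakthroughs in key tasks such as surgical action understanding, surgical workflow recognition, surgical quality assessment, and intelligent decision-making, and on advancing the use of AI in healthcare.

Research Direction

Develop and apply multimodal foundation models for comprehensive, in-depth analysis of surgical videos, in order to optimize surgical workflow design, improve surgical training, and ultimately raise surgical success rates and patient recovery outcomes.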


Responsibilities

Model R&D Support: Assist the team in developing a foundation model for surgical video understanding, including using annotation tools to label key actions, instrument usage, and other behaviors in surgical videos; based on the annotated data, take part in algorithm tuning and model training for the surgical behavior analysis and surgical phase recognition modules, and record key parameters and experimental results during training.
Data Processing and Dataset Construction: Visit partner hospitals and other real clinical settings to collect surgical video data according to established standards, and preprocess the data (format conversion, denoising, etc.); help build surgical video datasets and organize video metadata (e.g., surgery type, lead surgeon, surgery duration); assist in building vision-language pre-training datasets by matching video clips with their text descriptions and proofreading the pairs.
Technology Exploration and Experimentation: Survey recent papers on multimodal pre-training and write technical survey reports; under the team's guidance, run small-scale experiments on surgical video data using existing open-source multimodal pre-trained models; explore ways to improve the model's learning of contextual knowledge about the surgical environment, for example by tuning hyperparameters and modifying the network architecture.
Clinical Application Support: Assist doctors in using the developed surgical video analysis models in clinical settings and collect their feedback; organize model analysis results and produce visual reports that give clinicians an intuitive data reference for decision-making.
Results Documentation and Output: Organize experimental data and figures in accordance with academic standards and write technical documentation; assist the team in preparing patent application materials and identifying the novel contributions in the research results; take part in writing academic papers, including drafting selected sections and collecting literature.

Requirements

Algorithm Knowledge: Familiarity with pre-training methods such as MAE, DINO, self-supervised learning, and vision-language alignment, with an understanding of their principles and their role in model training.
Programming Skills: Proficiency in Python and the ability to use deep learning frameworks such as PyTorch and TensorFlow to build and train simple models.
General Skills: Basic technical writing skills and the ability to document experiments clearly; awareness of good code management practices; strong learning ability and willingness to explore new technologies; good communication skills and the ability to collaborate with team members.

Preferred Qualifications

Academic background in computer vision, natural language processing, or a related field, with an interdisciplinary knowledge base.
Experience developing with large multimodal models (e.g., LLaVA, BLIP, Qwen-VL) and an understanding of model pre-training and fine-tuning workflows.
Familiarity with RAG or agent development (e.g., LangChain, LlamaIndex) and relevant project experience.
Project experience in video processing and analysis, and familiarity with common methods for handling video data.
Understanding of in-context learning, prompt engineering, and prompt tuning, with relevant practical experience.
Experience writing academic papers or participating in research projects, demonstrating research ability.

By joining us, you will work alongside leading experts in a top research team, gain access to cutting-edge research resources and advanced technologies, and grow personally and professionally while helping to advance medical artificial intelligence. We look forward to receiving your application and to having you join us in research and innovation!

Application Method

Please send your CV to hr02@cair-cas.org.hk. The email subject should read: Application for [Intern - Surgical Video Comprehension and Multimodal Basic Modeling]-[Name].
