Towards Collaborative Generative AI for Vision-and-Language Studies