Image-Caption Alignment And Object Naming Variability As Supervision For Multi-Modal Object Detection