Grounded-Knowledge-Enhanced Instruction Understanding For Multimodal Assistant Applications