As the use of generative AI rapidly expands, addressing its inherent flaws (fabricated facts, out-of-date knowledge, biases in the training data and the inability to cite sources, to name a few) becomes increasingly crucial. Retrieval Augmented Generation (RAG) effectively addresses many of these issues and forms the foundation of practically all AI applications focused on cognitive search, question answering and similar tasks. Nevertheless, even augmented models are not perfect, and a naive implementation can perform worse than a top-of-the-line vanilla model. It is therefore critical to quantify the reliability of an AI system and to have trackable metrics that guide the work of the agile teams implementing it.

The talk will present an overview of the key AI quality metrics a team should track, explain how they are evaluated in practice, introduce available tools and frameworks, and outline common strategies for enhancing AI reliability. It will conclude with a case study of an AI chatbot developed for the University of Vaasa to survey and instruct citizens and businesses about region-specific climate change risks, showcasing how rigorous evaluation practices identified issues in the AI's output and how each development sprint improved the reliability metrics.