Enhancing LLM quality and interpretability with the Vertex AI Gen AI Evaluation Service

Developers harnessing the power of large language models (LLMs) often encounter two key hurdles: managing the inherent randomness of their output and addressing their occasional tendency to generate factually incorrect information. Somewhat like rolling dice, LLMs offer a touch of unpredictability, generating different responses even when given the same prompt. While this randomness can fuel creativity, it can also be a stumbling block when consistency or factual accuracy is crucial. Moreover, the occasional "hallucinations" – where the LLM confidently presents misinformation – can undermine trust in its capabilities. The challenge intensifies when we consider that many real-world tasks lack a single, definitive answer. Whether it's summarizing complex information, crafting compelling marketing copy, brainstorming innovative product ideas, or drafting persuasive emails, there's often room for multiple valid solutions.

In this blog post and the accompanying notebook, we introduce a workflow that tackles these challenges: generate a diverse set of LLM responses, then use the Vertex Gen AI Evaluation Service to automatically select the best one and return its quality metrics and an explanation. The workflow also extends to multimodal input and output, and it can benefit a wide range of use cases across industries and LLMs.

Picture this: a financial institution wants to summarize customer conversations with banking advisors. The hurdle? Ensuring these summaries are grounded in the conversation, helpful, concise, and well-written. Because there are numerous ways to craft a summary, quality varied greatly. Here is how they leveraged the probabilistic nature of LLMs and the Vertex Gen AI Evaluation Service to elevate the performance of the LLM-generated summaries.

Step 1: Generate Diverse Responses

The core idea here was to think beyond the first response. Causal decoder-based LLMs have a touch of randomness built in: they sample each token probabilistically rather than always picking the most likely one. So, by generating multiple, slightly different responses, we boost the odds that at least one of them is a strong fit. It's like exploring multiple paths, knowing that even if one leads to a dead end, another might reveal a hidden gem.

For example, imagine asking an LLM, "What is the capital of Japan?" You might get a mix of responses like "Kyoto was the capital city of Japan," "Tokyo is the current capital of Japan," or even "Tokyo was the capital of Japan." By generating multiple options, we increase our chances of getting the most accurate and relevant answer.

To put this into action, the financial institution used an LLM to generate five different summaries for each transcript. They adjusted the LLM's "temperature," which controls the randomness of output, to a range of 0.2 to 0.4, to encourage just the right amount of diversity without straying too far from the topic. This ensured a range of options, increasing the likelihood of finding an ideal, high-quality summary.
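The generation step can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the notebook's implementation: `generate_fn` is a hypothetical callable standing in for whatever model client you use, and spreading the five temperatures evenly across 0.2 to 0.4 is our assumption (the post only states the range).

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (prompt, temperature) -> generated text.
GenerateFn = Callable[[str, float], str]


def temperature_schedule(n: int, low: float = 0.2, high: float = 0.4) -> List[float]:
    """Return n temperatures spread evenly across [low, high]."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [low]
    step = (high - low) / (n - 1)
    # Round to hide floating-point noise such as 0.30000000000000004.
    return [round(low + i * step, 4) for i in range(n)]


def generate_candidates(
    generate_fn: GenerateFn, prompt: str, n: int = 5
) -> List[Tuple[float, str]]:
    """Call the model once per temperature and keep (temperature, text) pairs."""
    return [(t, generate_fn(prompt, t)) for t in temperature_schedule(n)]


def stub_llm(prompt: str, temperature: float) -> str:
    # Stand-in for a real model call (assumption); it only echoes its inputs.
    return f"[T={temperature}] summary of: {prompt}"


candidates = generate_candidates(stub_llm, "customer/advisor transcript", n=5)
for temperature, text in candidates:
    print(temperature, text)
```

Swapping `stub_llm` for a function that calls your model is the only change needed; keeping the temperature alongside each candidate makes it easy to see later which settings tend to produce winning summaries.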

Step 2: Find the Best Response

Next came the need to search through the set of diverse responses and pinpoint the best one. To do this automatically, the financial institution applied the pairwise evaluation approach available in the Vertex Gen AI Evaluation Service. Think of it as a head-to-head showdown between responses. We pit response pairs against each other, judging them based on the original instructions and context to identify the response that aligns most closely with the user's intent.

Continuing the example above to illustrate, let's say we have those three responses about Japan's capital. We want to find the best one using pairwise comparisons:

Response 1 vs Response 2: The API favors Response 2, potentially explaining, "While Response 1 is technically correct, it doesn't directly answer the question about the current capital of Japan."
Response 2 (best response so far) vs Response 3: Response 2 wins again! Response 3 stumbles by using the past tense.
After these two rounds of comparison, we conclude that Response 2 is the best answer.

In the financial institution's case, they compared their five generated summaries in pairs to select the best one.
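The head-to-head selection described above amounts to a sequential tournament: the current champion meets each remaining candidate once, so N responses need N-1 comparisons. Below is a minimal sketch of that loop. The `judge` callable is hypothetical; in practice it would wrap a pairwise evaluation call, which is not shown here. The toy judge is only a stand-in that prefers present-tense answers about Tokyo so the example can run offline.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def select_best(
    prompt: str, responses: Sequence[str], judge: Judge
) -> Tuple[int, List[Tuple[int, int, str]]]:
    """Return the index of the winning response and a log of each round."""
    if not responses:
        raise ValueError("need at least one response")
    best = 0
    rounds: List[Tuple[int, int, str]] = []
    for challenger in range(1, len(responses)):
        verdict = judge(prompt, responses[best], responses[challenger])
        if verdict not in ("A", "B"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        rounds.append((best, challenger, verdict))
        if verdict == "B":
            best = challenger  # the challenger becomes the new champion
    return best, rounds


def toy_judge(prompt: str, a: str, b: str) -> str:
    # Stand-in scoring rule (assumption): reward "Tokyo", and more so "Tokyo is".
    def score(text: str) -> int:
        return 2 if "Tokyo is" in text else 1 if "Tokyo" in text else 0

    return "B" if score(b) > score(a) else "A"


responses = [
    "Kyoto was the capital city of Japan",
    "Tokyo is the current capital of Japan",
    "Tokyo was the capital of Japan",
]
winner, rounds = select_best("What is the capital of Japan?", responses, toy_judge)
print(responses[winner])
```

With five summaries, as in the financial institution's case, the same loop runs four comparisons. The round log keeps the verdicts so they can be shown alongside the final answer.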

Step 3: Assess if the Response is Good Enough

The workflow then takes the top-performing response (Response 2) from the previous step and uses the pointwise evaluation service to assess it. This evaluation assigns quality scores and generates human-readable explanations for those scores across various dimensions, such as accuracy, groundedness, and helpfulness. This process not only highlights the best response but also provides insight into why it scored as it did and why it is considered superior to the other responses, fostering trust and transparency in the system's decision-making. In the case of the financial institution, they then applied the summarization-related pointwise metrics to the winning response to obtain an explanation of how this answer is grounded, helpful, and high-quality. We can choose to return just the best response or include its associated quality metrics and explanation for greater transparency.
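One way to turn pointwise scores into a "good enough" decision is to compare each metric against a minimum. The sketch below assumes a 1 to 5 score scale, illustrative metric names, and thresholds of our own choosing; none of these come from the post, and the scores are hard-coded rather than fetched from the evaluation service.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MetricResult:
    score: float      # assumed scale: 1 (worst) to 5 (best)
    explanation: str  # human-readable rationale returned with the score


def is_good_enough(
    results: Dict[str, MetricResult], minimums: Dict[str, float]
) -> Tuple[bool, List[str]]:
    """Return (passed, failing_metrics); a missing metric counts as failing."""
    failing = [
        name
        for name, minimum in minimums.items()
        if name not in results or results[name].score < minimum
    ]
    return (not failing, failing)


# Hard-coded example values (assumption), standing in for a real evaluation.
results = {
    "groundedness": MetricResult(5, "Every claim appears in the transcript."),
    "helpfulness": MetricResult(4, "Covers the customer's main request."),
    "conciseness": MetricResult(2, "Repeats the account details twice."),
}
minimums = {"groundedness": 4, "helpfulness": 3, "conciseness": 3}

passed, failing = is_good_enough(results, minimums)
for name in failing:
    print(f"{name}: {results[name].score} ({results[name].explanation})")
```

When the check fails, the explanations for the failing metrics indicate what to change, for example regenerating with a revised prompt or routing the summary to a human reviewer.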

In essence, the workflow (as illustrated in this blog's banner) generates a variety of LLM responses, systematically evaluates them, and selects the most suitable one, all while providing insights into why that particular response is deemed optimal. Get started by exploring our sample notebook and adapting it to your use case. You can also reverse the order of the pairwise and pointwise evaluations: rank individual responses by their pointwise scores, then run pairwise comparisons only on the top candidates. Further, while this example focuses on text, the approach can be applied to other modalities and use cases, including but not limited to question answering and summarization as illustrated in this blog. Finally, if you need to minimize latency, both workflows can benefit greatly from parallelizing the various API calls.
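The reversed ordering and the parallelization tip combine naturally: score every candidate concurrently, then keep only the top few for pairwise comparison. Here is a minimal sketch using Python's standard `concurrent.futures`. The `score_fn` callable is hypothetical and stands in for a pointwise evaluation call; the toy scorer simply uses length so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def shortlist(
    responses: Sequence[str],
    score_fn: Callable[[str], float],
    k: int = 2,
    max_workers: int = 4,
) -> List[Tuple[str, float]]:
    """Score all responses in parallel and return the k highest-scoring ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, so zip stays aligned.
        scores = list(pool.map(score_fn, responses))
    ranked = sorted(zip(responses, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def toy_score(response: str) -> float:
    # Stand-in for a pointwise evaluation call (assumption): longer is "better".
    return float(len(response))


top = shortlist(["a", "bbb", "cc"], toy_score, k=2)
print(top)
```

Threads suit this case because each evaluation call spends most of its time waiting on the network. The shortlisted candidates can then go through the pairwise comparisons, which reduces the number of head-to-head rounds.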

Take the next step

By embracing the inherent variability of LLMs and utilizing the Vertex Gen AI Evaluation Service, we can transform challenges into opportunities. Generating diverse responses, systematically evaluating them, and selecting the best option with clear explanations empowers us to unlock the full potential of LLMs. This approach not only enhances the quality and reliability of LLM outputs but also fosters trust and transparency. Start exploring this approach in our sample notebook and check out the documentation for the Vertex Gen AI Evaluation Service.
