Enhancing LLM quality and interpretability with the Vertex AI Gen AI Evaluation Service

Developers harnessing the power of large language models (LLMs) often encounter two key hurdles: managing the inherent randomness of their output and addressing their occasional tendency to generate factually incorrect information. Somewhat like rolling dice, LLMs offer a touch of unpredictability, generating different responses even when given the same prompt. While this randomness can fuel creativity, it can also be a stumbling block when consistency or factual accuracy is crucial. Moreover, the occasional "hallucinations" – where the LLM confidently presents misinformation – can undermine trust in its capabilities. The challenge intensifies when we consider that many real-world tasks lack a single, definitive answer. Whether it's summarizing complex information, crafting compelling marketing copy, brainstorming innovative product ideas, or drafting persuasive emails, there's often room for multiple valid solutions.

In this blog post and the accompanying notebook, we explore how to tackle these challenges with a new workflow: generate a diverse set of LLM responses, then use the Vertex Gen AI Evaluation Service to automatically select the best one and return its quality metrics and an explanation. The process also extends to multimodal input and output, and it stands to benefit almost all use cases across industries and LLMs.

Picture this: a financial institution striving to summarize customer conversations with banking advisors. The hurdle? Ensuring these summaries are grounded in reality, helpful, concise, and well-written. With numerous ways to craft a summary, the quality varied greatly. Here is how they leveraged the probabilistic nature of LLMs and the Vertex Gen AI Evaluation Service to elevate the performance of the LLM-generated summaries.

Step 1: Generate Diverse Responses

The core idea here was to think beyond the first response. Causal decoder-based LLMs have a touch of randomness built in: they typically sample each next token from a probability distribution rather than always picking the single most likely one. So, by generating multiple, slightly different responses, we boost the odds of finding a strong fit. It's like exploring multiple paths, knowing that even if one leads to a dead end, another might reveal a hidden gem.

For example, imagine asking an LLM, "What is the capital of Japan?" You might get a mix of responses like "Kyoto was the capital city of Japan," "Tokyo is the current capital of Japan," or even "Tokyo was the capital of Japan." By generating multiple options, we increase our chances of getting the most accurate and relevant answer.
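To make the sampling randomness concrete, here is a minimal sketch of temperature-scaled sampling over a toy vocabulary. The tokens and logit values are made up for illustration, and the function names are hypothetical; a real LLM does this over a vocabulary of many thousands of tokens and usually combines it with other decoding settings.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Illustrative values only: not taken from any real model.
tokens = ["Tokyo", "Kyoto", "Osaka"]
logits = [2.0, 1.0, 0.1]

cool = softmax_with_temperature(logits, 0.2)  # concentrated on the top token
warm = softmax_with_temperature(logits, 1.0)  # more spread across tokens
print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
print(sample_token(tokens, logits, 1.0, random.Random(0)))
```

At the lower temperature nearly all of the probability mass sits on the highest-scoring token, which is why low temperatures give more repeatable output and higher ones give more variety.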

To put this into action, the financial institution used an LLM to generate five different summaries for each transcript. They adjusted the LLM's "temperature," which controls the randomness of output, to a range of 0.2 to 0.4, to encourage just the right amount of diversity without straying too far from the topic. This ensured a range of options, increasing the likelihood of finding an ideal, high-quality summary.
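As a rough sketch of this step, the snippet below requests five candidates while spreading the temperature evenly across the 0.2 to 0.4 range. Spreading the values evenly is an assumption on our part (the same range could also be covered by picking one value and sampling repeatedly), and `generate` is a hypothetical callable standing in for whatever model client you use; `fake_generate` exists only so the example runs on its own.

```python
from typing import Callable, List


def temperature_schedule(n: int, low: float = 0.2, high: float = 0.4) -> List[float]:
    """Return n temperatures evenly spaced between low and high (inclusive)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return [low]
    step = (high - low) / (n - 1)
    return [round(low + i * step, 3) for i in range(n)]


def generate_candidates(
    generate: Callable[[str, float], str],
    prompt: str,
    n: int = 5,
    low: float = 0.2,
    high: float = 0.4,
) -> List[str]:
    """Call the model once per temperature and collect the responses."""
    return [generate(prompt, t) for t in temperature_schedule(n, low, high)]


def fake_generate(prompt: str, temperature: float) -> str:
    """Placeholder for a real model call; replace with your own client."""
    return f"summary drafted at temperature {temperature}"


candidates = generate_candidates(fake_generate, "Summarize this transcript: ...")
for candidate in candidates:
    print(candidate)
```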

Step 2: Find the Best Response

Next came the need to search through the set of diverse responses and pinpoint the best one. To do this automatically, the financial institution applied the pairwise evaluation approach available in the Vertex Gen AI Evaluation Service. Think of it as a head-to-head showdown between responses. We pit response pairs against each other, judging them based on the original instructions and context to identify the response that aligns most closely with the user's intent.

Continuing the example above to illustrate, let's say we have those three responses about Japan's capital. We want to find the best one using pairwise comparisons:

  • Response 1 vs Response 2: The API favors Response 2, potentially explaining, "While Response 1 is technically correct, it doesn't directly answer the question about the current capital of Japan."
  • Response 2 (best response so far) vs Response 3: Response 2 wins again! Response 3 stumbles by using the past tense.
  • After these two rounds of comparison, we conclude that Response 2 is the best answer.

In the financial institution's case, they compared their five generated summaries in pairs to select the best one.
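The champion-versus-challenger rounds described above can be sketched as a simple loop. Here `judge` is a hypothetical callable that stands in for a pairwise evaluation call and returns the winning response plus an explanation; `toy_judge` is a hard-coded rule used only to make the example runnable, not how the service decides.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str, str], Tuple[str, str]]


def select_best(candidates: List[str], judge: Judge) -> Tuple[str, List[str]]:
    """Keep a running champion and compare it against each remaining candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = candidates[0]
    explanations = []
    for challenger in candidates[1:]:
        best, explanation = judge(best, challenger)
        explanations.append(explanation)
    return best, explanations


def toy_judge(a: str, b: str) -> Tuple[str, str]:
    """Stand-in for a real pairwise evaluator: prefers present-tense 'Tokyo is'."""
    def score(text: str) -> int:
        if "Tokyo is" in text:
            return 2
        return 1 if "Tokyo" in text else 0

    winner = a if score(a) >= score(b) else b
    return winner, f"preferred: {winner!r}"


responses = [
    "Kyoto was the capital city of Japan",
    "Tokyo is the current capital of Japan",
    "Tokyo was the capital of Japan",
]
best, explanations = select_best(responses, toy_judge)
print(best)
```

With this scheme, n candidates need n - 1 comparisons, so five summaries take four pairwise calls.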

Step 3: Assess if the Response is Good Enough

The workflow then takes the top-performing response (Response 2) from the previous step and uses the pointwise evaluation service to assess it. This evaluation assigns quality scores and generates human-readable explanations for those scores across various dimensions, such as accuracy, groundedness, and helpfulness. This process not only highlights the best response but also provides insights into why the model generated this response and why it is considered superior to the other responses, fostering trust and transparency in the system's decision-making. In the case of the financial institution, they then applied the summarization-related metrics in pointwise evaluation to the winning response to obtain an explanation of how it is grounded, helpful, and high-quality. We can choose to return just the best response or include its associated quality metrics and explanation for greater transparency.
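One way to turn pointwise scores into a "good enough" decision is to compare each metric against a minimum you choose. The sketch below assumes the evaluation results have already been collected into a dictionary of metric name to (score, explanation); the metric names, the score scale, and the thresholds are all illustrative assumptions, not values defined by the service.

```python
from typing import Dict, List, Tuple


def check_quality(
    results: Dict[str, Tuple[float, str]],
    minimums: Dict[str, float],
) -> Tuple[bool, List[str]]:
    """Return whether every metric meets its minimum, plus the reasons it did not."""
    failures = []
    for metric, minimum in minimums.items():
        if metric not in results:
            failures.append(f"{metric}: no score returned")
            continue
        score, explanation = results[metric]
        if score < minimum:
            failures.append(f"{metric}: {score} < {minimum} ({explanation})")
    return (not failures, failures)


# Hypothetical scores and explanations for the winning summary.
example_results = {
    "accuracy": (5.0, "All stated facts match the transcript."),
    "groundedness": (5.0, "Every claim appears in the transcript."),
    "helpfulness": (3.0, "Omits the agreed follow-up action."),
}
# Hypothetical minimums chosen by the application owner.
minimums = {"accuracy": 4.0, "groundedness": 4.0, "helpfulness": 4.0}

ok, failures = check_quality(example_results, minimums)
print(ok)
for failure in failures:
    print(failure)
```

If the check fails, the application could regenerate candidates, fall back to a human reviewer, or return the response together with its explanations so the reader can judge for themselves.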

In essence, the workflow (as illustrated in this blog's banner) encompasses generating a variety of LLM responses, systematically evaluating them, and selecting the most suitable one, all while providing insights into why that particular response is deemed optimal. Get started by exploring our sample notebook and adapting it to your use case. You can also reverse the order of the pairwise and pointwise evaluations: rank individual responses by their pointwise scores first, then run pairwise comparisons only on the top candidates. Further, while this example focuses on text, the approach can be applied to any modality and any use case, including but not limited to the question answering and summarization illustrated in this blog. Finally, if you need to minimize latency, both workflows can benefit greatly from parallelizing the various API calls.
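The reversed ordering and the parallelization tip can be combined in a short sketch: score every candidate pointwise in parallel, then keep only the top few for pairwise comparison. `score` is a hypothetical callable standing in for a pointwise evaluation call, and `toy_score` is a hard-coded rule that exists only so the example runs. Threads are used on the assumption that each call is network-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def score_in_parallel(
    candidates: List[str],
    score: Callable[[str], float],
    max_workers: int = 4,
) -> List[float]:
    """Score all candidates concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score, candidates))


def shortlist(candidates: List[str], scores: List[float], k: int = 2) -> List[str]:
    """Return the k highest-scoring candidates, best first."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:k]]


def toy_score(text: str) -> float:
    """Stand-in for a real pointwise evaluator."""
    if "Tokyo is" in text:
        return 2.0
    return 1.0 if "Tokyo" in text else 0.0


responses = [
    "Kyoto was the capital city of Japan",
    "Tokyo is the current capital of Japan",
    "Tokyo was the capital of Japan",
]
scores = score_in_parallel(responses, toy_score)
finalists = shortlist(responses, scores, k=2)
print(finalists)
```

Only the finalists then go through pairwise comparison, which reduces the number of head-to-head calls when the candidate pool is large.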

Take the next step

By embracing the inherent variability of LLMs and utilizing the Vertex Gen AI Evaluation Service, we can transform challenges into opportunities. Generating diverse responses, systematically evaluating them, and selecting the best option with clear explanations empowers us to unlock the full potential of LLMs. This approach not only enhances the quality and reliability of LLM outputs but also fosters trust and transparency. Start exploring this approach in our sample notebook and check out the documentation for the Vertex Gen AI Evaluation Service.
