Large Language Models (LLMs) have proven effective in tasks like natural language processing, sentiment analysis, data extraction, and answering questions.
According to ChatGPT, complex mathematical operations include advanced techniques such as calculus, matrix operations, differential equations, optimization, and probability, which are often used to solve real-world problems in fields like engineering, physics, and economics. Complex data analysis, in turn, involves multivariate analysis, statistical modeling, time series analysis, machine learning, and big data handling, which are used to discover patterns, make predictions, and draw insights from large datasets.
Recently, we explored several LLMs’ ability to handle basic mathematical and analytical operations, including addition, subtraction, multiplication, division, percentage calculations, and financial data analysis. While LLMs can manage basic arithmetic, we tested their ability to solve more complex tasks using a restaurant bill: converting the bill into a table, splitting the total amount, and calculating each person's percentage share.
First, let’s take the restaurant bill and try to understand how LLMs can answer complex questions through prompt engineering. We start with basic math and data problems.
We started by using ChatGPT 4.0 to transform the restaurant bill into a structured table, and it handled the task flawlessly.
Our second question was to compute the total of the bill, excluding and including taxes. The result is as follows:
As seen here, ChatGPT accurately performed all the calculations based on the data from the table.
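For reference, the totals in this step can be reproduced with a few lines of Python. This is only a sketch: it uses the subtotals quoted later in this article (₹9,825 for alcoholic drinks, ₹3,720 for food and non-alcoholic items), and the rates (3% service charge, 14.5% VAT, 2.5% each for CGST and SGST) are assumptions inferred from the tax figures quoted in the article, not values read from the bill itself. The `money` helper is a hypothetical name introduced for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def money(value: Decimal) -> Decimal:
    """Round to two decimal places, rounding halves up."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Subtotals quoted in the article
alcohol_subtotal = Decimal("9825.00")  # alcoholic drinks (VAT applies)
food_subtotal = Decimal("3720.00")     # food and non-alcoholic drinks (GST applies)

# Assumed rates, inferred from the tax amounts quoted in the article
SERVICE_RATE = Decimal("0.03")
VAT_RATE = Decimal("0.145")
CGST_RATE = Decimal("0.025")
SGST_RATE = Decimal("0.025")

subtotal = alcohol_subtotal + food_subtotal
service_charge = money(subtotal * SERVICE_RATE)  # charged on the whole subtotal
vat = money(alcohol_subtotal * VAT_RATE)         # alcohol only
cgst = money(food_subtotal * CGST_RATE)          # food and non-alcoholic only
sgst = money(food_subtotal * SGST_RATE)
total_with_tax = subtotal + service_charge + vat + cgst + sgst

print(f"Total without tax and service charge: {subtotal}")
print(f"Total with tax and service charge: {total_with_tax}")
```

`Decimal` with explicit half-up rounding is used here because Python's built-in `round` on floats rounds exact ties to the nearest even digit, which can shift a tax amount by one paisa.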
Next, we asked ChatGPT: _“What would be the bill total including and excluding taxes if Adana Kebab was not purchased?”_
Here, ChatGPT successfully calculated the total amount excluding service charges and taxes. However, it struggled to provide the correct total amount when taxes were included. This was due to its inability to recognize that VAT applies only to alcoholic products, while GST is applied to food and non-alcoholic items. After we provided the model with this information, it generated the following result:
In this case, the LLM confused food and beverages, leading to inaccuracies in calculating their respective costs. To clarify, we provided the total cost of food, excluding taxes and service charges.
This prompt enabled the LLM to accurately calculate the costs of alcoholic items and determine the new totals, both with and without taxes and service charges. However, one issue remained: it did not proportionally reduce the service charge after removing the Adana Kebab. To address this, we provided two additional prompts, which resulted in the correct calculation.
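The recalculation the model struggled with can be sketched as follows. The point is that removing a food item changes the GST and the service charge but leaves the VAT untouched. The Adana Kebab price of ₹449 is inferred from the difference between the two food subtotals quoted in this article (₹3,720 and ₹3,271), and the rates are assumptions inferred from the quoted tax figures. `bill_totals` and `money` are hypothetical helpers, not part of any model's output.

```python
from decimal import Decimal, ROUND_HALF_UP


def money(value: Decimal) -> Decimal:
    """Round to two decimal places, rounding halves up."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def bill_totals(alcohol_subtotal: Decimal, food_subtotal: Decimal):
    """Return (subtotal, charges, total) under the assumed rates."""
    subtotal = alcohol_subtotal + food_subtotal
    charges = {
        "service_charge": money(subtotal * Decimal("0.03")),  # on everything
        "vat": money(alcohol_subtotal * Decimal("0.145")),    # alcohol only
        "cgst": money(food_subtotal * Decimal("0.025")),      # food only
        "sgst": money(food_subtotal * Decimal("0.025")),      # food only
    }
    return subtotal, charges, subtotal + sum(charges.values())


alcohol = Decimal("9825.00")
food = Decimal("3720.00")
adana_kebab = Decimal("449.00")  # inferred: 3720.00 - 3271.00

before_subtotal, before_charges, before_total = bill_totals(alcohol, food)
after_subtotal, after_charges, after_total = bill_totals(alcohol, food - adana_kebab)

print(f"Subtotal: {before_subtotal} -> {after_subtotal}")
for name in before_charges:
    print(f"{name}: {before_charges[name]} -> {after_charges[name]}")
print(f"Total: {before_total} -> {after_total}")
```

Under these assumptions the service charge drops to ₹392.88 and the VAT stays at ₹1,424.63, which matches the behaviour the prompts were trying to elicit.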
Our next query was to calculate each person’s total and share in percentage. For this, we gave each person’s share. The following is the query: _“Jaya-2328.60, Vijay-2127.60, Jithin-2628.60, Gautham-3077.60, Kiran-3382.60. Calculate total and each persons share in percentage.”_
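As a reference point for this query, the expected figures can be worked out directly. The snippet below is a minimal sketch that takes the five amounts from the prompt, sums them, and expresses each as a percentage of that sum, rounded to two decimal places.

```python
from decimal import Decimal, ROUND_HALF_UP

# Amounts exactly as given in the prompt
amounts = {
    "Jaya": Decimal("2328.60"),
    "Vijay": Decimal("2127.60"),
    "Jithin": Decimal("2628.60"),
    "Gautham": Decimal("3077.60"),
    "Kiran": Decimal("3382.60"),
}

total = sum(amounts.values())

# Each person's share of the total, as a percentage with two decimals
percentages = {
    name: (amount / total * 100).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    for name, amount in amounts.items()
}

print(f"Total: {total}")
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```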
ChatGPT's response was close but not entirely accurate. For the next query, we required the individual split for food and drinks. To facilitate this, we provided a list of the drinks each person had, as the food was shared equally among them. The prompt is given below.
_Drinks are charged on the person and the food is shared commonly. Here are the drinks and who had it: Jaya-Cupid, Cosmopolitan, Don Azus (2). Jithin-Mango Meltdown, Mojito (1), Don Azus (2). Gautham-Jagger Bomb, Mojito (1), Don Azus (1). Kiran-Jhonnie Walker Red Label, Monkey Shoulder, Don Azus (2). Vijay-Beefeater, tonic water, Don Azus (1). Show me each persons split._
```python
# Initial amounts based on the previous breakdown
total_food_cost = 3271.00  # New food subtotal after removing Adana Kebab
total_service_charge = 392.94  # New service charge after removing Adana Kebab
total_food_cost_with_tax = total_food_cost + (81.78 + 81.78)  # Adding CGST and SGST

# Drinks cost per person
drinks_cost = {
    "Jaya": 499.00 + 599.00 + (249.00 * 2),
    "Jithin": 799.00 + 599.00 + (249.00 * 2),
    "Gautham": 499.00 + 599.00 + 249.00,
    "Kiran": 299.00 + 599.00 + (249.00 * 2),
    "Vijay": 349.00 + 99.00 + 249.00,
}

# Calculate each person's total cost including their drinks and their share of the food cost
total_cost = drinks_cost.copy()
num_people = len(total_cost)

# Each person's share of the food and service charge
shared_cost_per_person = (total_food_cost_with_tax + total_service_charge) / num_people

for person in total_cost:
    total_cost[person] += shared_cost_per_person

total_cost
```

Result:

```
{'Jaya': 2361.5,
 'Jithin': 2661.5,
 'Gautham': 2112.5,
 'Kiran': 2161.5,
 'Vijay': 1462.5}
```
In this instance, we requested the total split of the bill, excluding taxes and service charges. However, ChatGPT mistakenly excluded the cost of the Adana Kebab from its calculations. When we asked for a breakdown that included the Adana Kebab, it again provided an incorrect answer. After giving a second prompt, the LLM finally delivered the correct calculation.
```text
Apologies for the confusion. Let's calculate the correct split for the total bill amount of ₹13,545 based on the drinks each person had and the food cost being shared equally.

Step-by-Step Breakdown:

Total Bill Amount: ₹13,545
Drinks Total (Alcohol): ₹9,825 (VAT: ₹1,424.63)
Food Total (including Adana Kebab): ₹3,720 (CGST & SGST: ₹186.00)

Drinks Cost Per Person:

Jaya: Cupid (₹499) + Cosmopolitan (₹599) + Don Azus (2 × ₹249) = ₹1,596
Jithin: Mango Meltdown (₹799) + Mojito (₹599) + Don Azus (2 × ₹249) = ₹1,895
Gautham: Jagger Bomb (₹499) + Mojito (₹599) + Don Azus (₹249) = ₹1,347
Kiran: Jhonnie Walker Red Label (₹299) + Monkey Shoulder (₹599) + Don Azus (2 × ₹249) = ₹1,395
Vijay: Beefeater (₹349) + tonic water (₹99) + Don Azus (₹249) = ₹697

Shared Food and Service Charge Cost:

Total Food Cost: ₹3,720
Service Charge: ₹406.35
CGST & SGST: ₹186.00

\text{Shared Cost Per Person} = \frac{\text{Food Cost + Service Charge + CGST \& SGST}}{5} = \frac{3,720 + 406.35 + 186.00}{5} = ₹862.07

Final Calculation:

Jaya's Total: ₹1,596 (Drinks) + ₹862.07 (Shared Food) = ₹2,458.07
Jithin's Total: ₹1,895 (Drinks) + ₹862.07 (Shared Food) = ₹2,757.07
Gautham's Total: ₹1,347 (Drinks) + ₹862.07 (Shared Food) = ₹2,209.07
Kiran's Total: ₹1,395 (Drinks) + ₹862.07 (Shared Food) = ₹2,257.07
Vijay's Total: ₹697 (Drinks) + ₹862.07 (Shared Food) = ₹1,559.07

Summary of Each Person’s Total:

Jaya: ₹2,458.07
Jithin: ₹2,757.07
Gautham: ₹2,209.07
Kiran: ₹2,257.07
Vijay: ₹1,559.07

These amounts correctly split the total bill of ₹13,545 among everyone based on their drinks and an equal share of the food and service charges.
```
Next, we requested a list of food and non-alcoholic drinks from the LLM. However, it hallucinated and provided an inaccurate response, including items that were not on the bill.
The next query was, _“Due to discount assume the bill is 12877 and not 15562. Now what is each persons share?”_
```python
# Re-defining the amounts and total to correct the issue
amounts = {
    "Jaya": 2328.60,
    "Vijay": 2127.60,
    "Jithin": 2628.60,
    "Gautham": 3077.60,
    "Kiran": 3382.60,
}

# Total of the initial amounts
total_initial_amount = sum(amounts.values())

# New total bill after discount
new_total_bill = 12877

# Calculate the new percentage share and total amount for each person
new_total_amounts = {}
for person, amount in amounts.items():
    percentage_share = amount / total_initial_amount
    new_total_amounts[person] = percentage_share * new_total_bill

new_total_amounts
```
ChatGPT's response was nearly correct, reflecting each person's proportional share of the discounted bill. However, it required multiple prompts and follow-up queries to grasp the logic of adding and subtracting taxes and service charges when removing an item and splitting each person’s bill by percentage. Additionally, it began to hallucinate when we asked it to list the names of the items on the bill, providing incorrect answers. To streamline the process, we consolidated all the prompts and follow-up queries into a single prompt and posed the same question in a separate chat. Here’s the prompt:
_Convert the above bill into a table format and answer the below questions:_
_1. Compute the total bill cost with and without - tax and service charge._
_2. What would be the bill total including and excluding taxes and service charge if Adana Kebab was not purchased?_
_3. Jaya-2328.60, Vijay-2127.60, Jithin-2628.60, Gautham-3077.60, Kiran-3382.60. Calculate each persons share in percentage. The total amount should be taken without the tax and service charge._
_4. Jaya-2328.60, Vijay-2127.60, Jithin-2628.60, Gautham-3077.60, Kiran-3382.60. Calculate each persons share in percentage. The total amount should be taken with the tax and service charge._
_Considerations:_
_1. VAT is calculated only for alcohol and CGST and SGST are calculated for non-alcoholic drinks and food._
_2. Below is a list of food, alcoholic and non-alcoholic answers._
_Alcoholic drinks: Cupid, Beefeater, Jagger Bomb, Don Azus, Mango meltdown, Water Melon Jalapeno Tini, Panda Orange, Dice N Nice, Cosmopolitan, Monkey shoulders, Johnnie Walker Red label, Vodka Prawn_
_Non-Alcoholic drinks: Tonic Water, Mojito_
_Food: Crispy Corn, Jalapeno pops, Chermoula Paneer, Adana Kebab, Muttai Paniyaram, Tiramisu, Ice cream, Kunafeh_
_3. Do no make approximations, give the correct answer upto 2 decimal points_
_4. Do not make assumptions_
_5. Do not provide code_
_6. Give the exact calculations for every step_
ChatGPT converted it into a table and gave us a response which is given below.
For the second question, the service charge calculated after removing the Adana Kebab differed slightly from the correct value of Rs. 392.88, which shifted the final result by two points. In response to the third question about percentage calculations, the LLM provided an answer that was approximately correct. However, it faltered on the fourth question, failing to calculate the split of the total amount with and without taxes and service charges. We followed up by asking for each person's amount after including their percentage share of tax and service charges, and it generated an answer that was approximately correct. To further clarify, we asked it to split the tax and service charges by each person's percentage share, which produced the correct response. We then asked it to add these charges to the individual amounts, and it provided an accurate response.
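The follow-up calculation described here, splitting the tax and service charge in proportion to each person's share and adding it to their amount, can be sketched as below. The charge figures (₹406.35 service charge, ₹1,424.63 VAT, ₹186.00 CGST and SGST) are the ones quoted elsewhere in this article; `money` is a hypothetical helper introduced for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def money(value: Decimal) -> Decimal:
    """Round to two decimal places, rounding halves up."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Pre-tax amounts from the prompt
amounts = {
    "Jaya": Decimal("2328.60"),
    "Vijay": Decimal("2127.60"),
    "Jithin": Decimal("2628.60"),
    "Gautham": Decimal("3077.60"),
    "Kiran": Decimal("3382.60"),
}

# Service charge + VAT + CGST and SGST, as quoted in the article
tax_and_service = Decimal("406.35") + Decimal("1424.63") + Decimal("186.00")

subtotal = sum(amounts.values())

final_amounts = {}
for name, amount in amounts.items():
    share = amount / subtotal                       # fraction of the pre-tax bill
    extra = money(share * tax_and_service)          # proportional tax and service
    final_amounts[name] = amount + extra
    print(f"{name}: {amount} + {extra} = {final_amounts[name]}")

# Rounding each share separately can leave the sum a paisa or two
# away from the bill total, so compare with a small tolerance.
grand_total = sum(final_amounts.values())
print(f"Sum of individual totals: {grand_total}")
```

Because every share is rounded on its own, the individual totals may not add up to the bill total exactly; a real splitter would need a rule for assigning the leftover paisa.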
It gave the correct answer after the two follow-up questions. We also asked about the discounted bill: _“Due to discount assume the bill is 12877 and not 15562. Now what is each persons share”_. The LLM’s response was accurate.
Thus, we concluded that, given accurate prompts and instructions, ChatGPT yields approximately accurate responses to math and financial data analysis questions, requiring only four follow-up inquiries. This highlights the potential benefits of fine-tuning LLMs for specific use cases. In this scenario, fine-tuning ChatGPT on a dataset of bill-related queries and arithmetic calculations could improve its ability to accurately handle tasks like splitting bills, calculating taxes, and processing multiple related prompts in one go. Adapting the model to better understand financial and tabular data would likely improve its performance in similar contexts, reducing the need for multiple clarifications and minimizing hallucinations.
We then tested the same prompt with various LLMs and VLMs, and the accuracy of their responses is summarized below:
Claude AI
Claude AI was successful in generating the table from the bill and calculating the basic total including and excluding taxes and service charge. However, it struggled to accurately calculate the query involving the removal of the Adana Kebab, incorrectly reducing the VAT despite our instruction that VAT applies only to alcoholic items. For the fourth question, which asked for each person’s amount including taxes and service charges, it failed to provide an accurate result. After prompting it again for the calculation with tax and service charge, it returned a nearly accurate figure, though it still differed slightly from the correct answer. In the subsequent query, we requested a split of the tax and service charge amounts in both percentage and absolute values for each person, to which it responded with an almost accurate number. When we inquired about the discount, it provided an approximately correct answer. Thus, similar to ChatGPT, Claude AI was also able to deliver nearly correct responses to the mathematical and data analysis queries following effective prompt engineering.
```python
def calculate_bill(items, exclude_item=None):
    subtotal = sum(item['price'] * item['quantity'] for item in items if item['name'] != exclude_item)
    alcohol_total = sum(item['price'] * item['quantity'] for item in items if item['category'] == 'Alcoholic' and item['name'] != exclude_item)
    non_alcohol_total = subtotal - alcohol_total

    service_charge = subtotal * 0.03
    vat = alcohol_total * 0.145
    cgst = non_alcohol_total * 0.025  # restored: used in the total below but missing from the listing
    sgst = non_alcohol_total * 0.025

    total = subtotal + service_charge + vat + cgst + sgst
    return subtotal, total

def calculate_share(amount, total):
    return (amount / total) * 100

# Define items
items = [
    {'name': 'Cupid', 'price': 499, 'quantity': 1, 'category': 'Alcoholic'},
    # ... Add all other items here
]

# Calculate totals
subtotal, total_with_tax = calculate_bill(items)
subtotal_without_kebab, total_without_kebab = calculate_bill(items, 'Adana Kebab')

# Calculate shares
shares = [2328.60, 2127.60, 2628.60, 3077.60, 3382.60]
names = ['Jaya', 'Vijay', 'Jithin', 'Gautham', 'Kiran']

shares_without_tax = [calculate_share(share, subtotal) for share in shares]
shares_with_tax = [calculate_share(share, total_with_tax) for share in shares]

tax_and_service = total_with_tax - subtotal
shares_with_tax_amount = [share + (percentage / 100 * tax_and_service) for share, percentage in zip(shares, shares_with_tax)]

# Print results
print(f"Total without tax: {subtotal:.2f}")
print(f"Total with tax: {total_with_tax:.2f}")
print(f"Total without Adana Kebab (without tax): {subtotal_without_kebab:.2f}")
print(f"Total without Adana Kebab (with tax): {total_without_kebab:.2f}")

for name, share_without, share_with, amount in zip(names, shares_without_tax, shares_with_tax, shares_with_tax_amount):
    print(f"{name}: {share_without:.2f}% (without tax), {share_with:.2f}% (with tax), ₹{amount:.2f}")
```
Mistral AI
Mistral AI initially failed to present the bill as a complete table. Its basic calculations for VAT, GST, the total without the Adana Kebab, and the service charge were incorrect, despite clear instructions. However, it managed to provide an approximate calculation of each person's percentage share. It also calculated the split amounts based on these percentages with reasonable accuracy and produced an approximate split amount after applying the discount.
Copilot
Copilot's answers were largely inaccurate. The data it converted into a table was incorrect, and it failed to accurately transcribe the GST amount from the bill into text. Additionally, it hallucinated figures for the service charge and VAT. While it could calculate the percentage share, it was unable to accurately determine the amount for each person after including taxes and service charges.
Gemini
Gemini successfully converted the bill into a table, but the data was inaccurate. It provided incorrect answers when calculating the amounts with taxes and service charges. However, it was able to give an approximately correct answer when calculating the percentage share and splitting each person’s amount. When we asked about the split after applying the discount, it hallucinated and initially failed to provide a response. After repeating the question, it delivered approximate figures, though they still varied slightly from the correct amounts.
```python
import pandas as pd

# Create a list of items from the bill
items = [
    "Cupid", "Beefeater", "Tonic Water", "Mojito", "Crispy Corn",
    "Water Melon", "Jalapeno Tini", "Panda Orange", "Dice 'n' Nice",
    "Vodka Prawn", "Chermoula Paneer", "Adana Kebab", "Don Azus",
    "Mango Meltdown", "Cosmopolitan", "Monkey Shoulder", "Jalapeno Pops",
    "Muttai Paniyaram", "Johnnie Walker Red Label", "Tiramisu", "Kunafeh",
    "Choice Of Ice Cream"
]

# Create a list of quantities and prices
# NOTE: the listing has 21 quantities for 22 items and prices, so the
# DataFrame call below fails as written.
quantities = [1, 1, 2, 2, 3, 1, 1, 1, 1, 1, 1, 1, 8, 1, 1, 2, 1, 1, 1, 1, 1]
prices = [499.00, 99.00, 59.00, 299.00, 249.00, 699.00, 299.00, 299.00, 449.00, 449.00, 299.00, 449.00, 249.00, 799.00, 599.00, 299.00, 249.00, 279.00, 299.00, 349.00, 349.00, 199.00]

# Create a DataFrame
df = pd.DataFrame({"Item": items, "Quantity": quantities, "Price": prices})

# Calculate subtotal
df["Subtotal"] = df["Quantity"] * df["Price"]

# Calculate total bill without tax and service charge
total_without_tax_service = df["Subtotal"].sum()

# Calculate tax and service charge
tax_rate = 0.145
service_charge = 406.35
cgst_rate = 0.025
sgst_rate = 0.025

tax = total_without_tax_service * tax_rate
total_with_tax_service = total_without_tax_service + tax + service_charge + (total_without_tax_service * cgst_rate) + (total_without_tax_service * sgst_rate)

# Print the DataFrame and total bill
print(df)
print("\nTotal bill without tax and service charge:", total_without_tax_service)
print("Total bill with tax and service charge:", total_with_tax_service)
```
Qwen2-VL-Max
Qwen VLM successfully generated an accurate table, but faced challenges with larger prompts, leading to slow response times. While the first question was answered correctly, it struggled with the second, failing to accurately calculate the total bill after removing the Adana Kebab, as it didn’t proportionally adjust the taxes and service charge. The responses to the third, fourth, and fifth questions were mostly correct, with only slight numerical deviations by a few points.
LLaVA OneVision
LLaVA OneVision was unable to generate tables from the bill, though it accurately answered the first question. For the second question, the model failed to calculate the proportional amount after excluding the Adana Kebab. It provided reasonably accurate responses for the fourth and fifth questions, but struggled to process large prompts, requiring multiple prompts to be submitted.
InternVL2
InternVL2 was able to partially generate a table from the bill but struggled with processing long prompts in a single instance, necessitating follow-up questions. This often led to incomplete responses. While it calculated the first question accurately, the second answer was partial and inaccurate. The model provided fairly accurate answers for questions three and four. However, for the fifth question, it only partially generated the response and failed to complete it even after a follow-up prompt.
Ovis 1.6 Gemma
Ovis 1.6 Gemma was unable to convert the bill into a table and struggled with lengthy prompts, requiring the queries to be broken down into follow-up questions. The response to the second question was incorrect, as the model failed to calculate the proportionate amount after excluding the Adana Kebab. It completely miscalculated the third question, though the first answer was accurate, and the fourth was roughly correct. However, the model also provided an incorrect response to the fifth question, where it miscalculated the amount after including the percentage share with tax and service charges.
The analysis highlights the limitations of both large language models (LLMs) and visual language models (VLMs) when it comes to handling slightly complex mathematical tasks and data analysis. While models like ChatGPT were able to solve basic arithmetic and perform proportional splits in a restaurant bill, they struggled with more intricate calculations involving taxes, percentages, and logical reasoning. In cases where the prompt involved removing items from the bill and recalculating taxes or service charges, some models required multiple prompts and follow-up queries to arrive at the correct answer. Others failed altogether, producing hallucinations (incorrect or unrelated outputs), especially when faced with large or more detailed queries.
Interestingly, some models were able to arrive at the correct answer only after being given very specific instructions, while others simply could not handle the complexity, especially when multiple related calculations were involved. VLMs, in particular, showed limitations in processing large prompts and struggled with maintaining accuracy when dealing with mathematical operations that require logical progression.
This comparison underscores a broader challenge with LLMs and VLMs: despite being trained on vast datasets and capable of answering a wide range of queries, they still exhibit significant inaccuracies when it comes to handling slightly complex mathematics and logic-based tasks. Many of these models, though powerful, are not yet fully reliable for real-world applications that demand precise calculations or nuanced data interpretation. This calls for further fine-tuning and refinement of these models, particularly in areas like arithmetic and financial reasoning, to ensure they can handle even basic math with a higher degree of accuracy. Improvements in these areas will be crucial for making LLMs and VLMs more effective in practical use cases where such calculations are essential.