Significance of UN Gifts to donor country — A text-based automated pipeline

MehtA+
6 min read · Jul 17, 2024


By: Nayana J, Andy J, Sia K, Sai K, Meghana O — MehtA+ AI/Machine Learning Research Bootcamp students

In a project in partnership with CUNY professor Prof. Elizabeth Macaulay, high school students in the MehtA+ AI/Machine Learning Research Bootcamp were provided with a United Nations Gifts Dataset and tasked with using AI to understand why these gifts were given. In part 7 of a seven-part series, students explore ways in which AI can help us understand archaeological gifts better.

If you would like to learn more about MehtA+ AI/Machine Learning Research Bootcamp, check out https://mehtaplustutoring.com/ai-ml-research-bootcamp/.

*******************

Question Statement

We answered the question “Why is the artifact significant to the country that gifted it?” We built a text-based automated pipeline that uses Google Gemini to generate an answer to the question, and then wrote a function to check the accuracy of the response.

Processing

Before we began coding, we downloaded our dataset. The goal of our pipeline was to produce an AI-generated response to our question in an easily readable format. We therefore wrote a few different functions and then put them together in a for loop, writing the output back into the CSV file we were given.

Scraping Google Searches

First and foremost, we had to obtain sources that would help the AI generate its response. To do this, we wrote two functions, which became our first two steps.

The first function, google_search_scrape(), takes the name of the artifact and returns the top three search results. It then scrapes the text from those results and joins it into paragraph form. The one issue is the sheer volume of text: because the scraped websites vary in type and format, we programmed it to scrape everything on each page, which produces a lot of text we don’t need.
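Here is a minimal sketch of what such a helper might look like, assuming the googlesearch-python, requests, and beautifulsoup4 packages (our notebook’s exact code differs):

```python
import requests
from bs4 import BeautifulSoup
from googlesearch import search

def google_search_scrape(artifact_name, num_results=3):
    """Return the text of the top search results for an artifact name."""
    texts = []
    for url in search(artifact_name, num_results=num_results):
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(html, "html.parser")
        # Every <p> tag on the page is kept, which is why the output
        # contains far more text than we actually need.
        texts.extend(p.get_text(strip=True) for p in soup.find_all("p"))
    return " ".join(texts)
```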

We also created a function called scrape(), which takes a URL and returns the body paragraphs. A problem we ran into was that it returned a list. Since we wanted to write the result into columns of the dataset, returning a list made it harder to exit the for loop where the information is collected. That by itself was not a hard fix, but some trouble while saving the result cost us extra time.
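A sketch of a scrape()-style helper, with the list problem fixed by joining the paragraphs into a single string (the fetching details are assumptions, not our exact code):

```python
import requests
from bs4 import BeautifulSoup

def scrape(url):
    """Return the body text of a page as one string (not a list)."""
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return ""  # skip pages that fail to load
    soup = BeautifulSoup(html, "html.parser")
    body = [p.get_text(strip=True) for p in soup.find_all("p")]
    return " ".join(body)  # a single string fits cleanly into a CSV cell
```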

Gemini API

Our third step was to write a function that calls the Gemini API, so we could submit our query and sources and get an answer to our question.

The ‘genai_response’ function takes two parameters, ‘un_paras’ and ‘relevant2_paras’. The first stores scraped text from the UN website related to the artifact; the second stores paragraphs from other relevant websites. After choosing our model, we initialized a variable called ‘response’ and set it to the content the Gemini API returned, using ‘model.generate_content’ with an f-string containing our query plus ‘un_paras’ and ‘relevant2_paras’ as sources. Finally, we returned response.text, which contains the answer to our question.
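A hedged sketch of what genai_response() might look like with the google-generativeai library; the model name, API key placeholder, and prompt wording here are illustrative:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, replace with a real key
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

def genai_response(un_paras, relevant2_paras):
    """Ask Gemini why the artifact is significant, citing the scraped text."""
    response = model.generate_content(
        f"Why is the artifact significant to the country that gifted it? "
        f"Answer using these sources.\n\nUN website text: {un_paras}\n\n"
        f"Other relevant websites: {relevant2_paras}"
    )
    return response.text
```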

A problem we ran into was overcomplicating the code, making it longer and adding unnecessary extra steps. To fix this, we did some research and found a simple way to generate text from text inputs, and we also realized we should make our query more general. After that, coding the rest of the function was quite easy, and our third step was done in no time.

Accuracy Metric

For the fourth step, we coded the accuracy metric, which checks the validity of a single response returned by the Gemini API function.

The function, ‘check_response_accuracy’, takes ‘response’ as its parameter, which is assumed to be the response object from the Gemini API. Steps two and three are similar: first we make sure the response contains a field named ‘accuracy’, and then we check whether the value associated with ‘accuracy’ is an integer or a float, since accuracy should be represented as a numerical value.

Depending on the specifics of the Gemini API, adjustments may need to be made. Our code checks whether the accuracy value is between 0 and 100, assuming accuracy is represented as a percentage. Once all checks pass, the function returns True, indicating that the response contains valid accuracy information; otherwise, it returns False.
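Putting those checks together, a sketch of check_response_accuracy() might look like this (the ‘accuracy’ field and the 0–100 range are the assumptions described above):

```python
def check_response_accuracy(response):
    # The response must contain a field named 'accuracy'.
    accuracy = getattr(response, "accuracy", None)
    if accuracy is None:
        return False
    # The value must be numeric (int or float).
    if not isinstance(accuracy, (int, float)):
        return False
    # Treat accuracy as a percentage: valid only between 0 and 100.
    return 0 <= accuracy <= 100
```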

For Loops

Now that we had all the necessary functions, it was time to put them together using for loops. These are the steps we tried to accomplish (a sketch of the loop follows the list):

1) Create new columns for the paragraphs of scraped text

2) Feed the data for each row into the Gemini API function

3) Take the response and store it in a column

4) Run each response through the accuracy metric

5) Print out the accuracy and create a new column with it

6) Create a new CSV with all this information for easy access
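A sketch of how these steps might fit together, assuming a pandas DataFrame loaded from the gifts CSV; the filename and column names are illustrative, not the ones in our notebook:

```python
import pandas as pd

data = pd.read_csv("un_gifts.csv")  # illustrative filename

for index, row in data.iterrows():  # iterrows() lets us read rows while adding columns
    un_paras = scrape(row["UN URL"])                        # step 1
    relevant_paras = google_search_scrape(row["Gift"])      # later disabled (see below)
    data.at[index, "un_paras"] = un_paras
    answer = genai_response(un_paras, relevant_paras)       # steps 2-3
    data.at[index, "response"] = answer
    accurate = check_response_accuracy(answer)              # step 4 (expects an 'accuracy' field)
    print(index, accurate)                                  # step 5
    data.at[index, "accuracy"] = accurate

data.to_csv("un_gifts_with_responses.csv", index=False)     # step 6
```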

In the end, viewers can see a preview of some of the outputs: we added columns that let others see the scraped text from the UN sites, the LLM’s response to the problem statement, and the accuracy score.

A problem we encountered was keeping the full output from printing in the notebook. Originally, we planned to append to the CSV file of the gifts, and while we stuck with this, we altered it so the results would not show up in the code output. To access the results, there is a piece of code that lets you save the file to your Google Drive.
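For example, from a Colab runtime the finished CSV can be written to Google Drive like this (the path is illustrative, and `data` is the DataFrame from the loop above):

```python
from google.colab import drive

drive.mount("/content/drive")
data.to_csv("/content/drive/MyDrive/un_gifts_with_responses.csv", index=False)
```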

While writing the for loop itself, the most challenging part was finding a loop that would let us read the information in the dataset while simultaneously adding new information. Eventually, we found the iterrows() method, and from there it was much easier.

The last problem was the Gemini API key: it allows only 15 requests per minute, so it was a struggle to get it to evaluate everything. We tried running the notebook on the CPU, but it timed out after ten minutes. Eventually, after switching back to the GPU and trying again, it worked.
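This is not what we did (we switched runtimes instead), but one common way to stay under a 15-requests-per-minute quota is simply to sleep between calls:

```python
import time

for index, row in data.iterrows():
    ...  # generate and store the response as in the loop above
    time.sleep(4)  # 60 s / 15 = 4 s, so at most 15 Gemini calls per minute
```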

Finally, there was a problem with the google_search_scrape() function. The for loop worked perfectly when run on data.head(3), so we assume that certain searches weren’t possible with the key due to restrictions on inappropriate content. A solution would have been to code around those restrictions or relax them, but due to time we decided not to use the function. It is still in our final code, but it doesn’t affect any results.

Overall Analysis

Since we chose to tackle only one question, we tried to make our answer to it the best it could be. We learned how to work as a team, helping each other to the best of our abilities and researching together. While the project wasn’t exactly what we were hoping for, we got pretty close and are happy with the results. Overall, we learned a lot from this project: LLMs are interesting. If we had a chance to do it over again, we might have tried multiple questions using LangChain and fixed the google_search_scrape() function.

Our Code

https://github.com/MehtaPlusTutoring/studentprojects/blob/main/aimlresearchbootcamp/2024/midterm/mid_term_project_lang.ipynb


MehtA+

MehtA+ was founded by, and is composed of, a team of MIT, Stanford, and Ivy League alumni. We provide technical bootcamps and college consulting services.