Unveiling Archaeological Gifts to the United Nations: A Computational Exploration
By: Faaz M, Glen S, Chloe C, Ece A, Aryan J — MehtA+ AI/Machine Learning Research Bootcamp students
In a project in partnership with CUNY professor Elizabeth Macaulay, high school students in the MehtA+ AI/Machine Learning Research Bootcamp were provided with a United Nations Gifts Dataset and tasked with using AI to answer the question "Why?" In part 6 of a seven-part series, students explore ways in which AI can help us understand archaeological gifts better.
If you would like to learn more about MehtA+ AI/Machine Learning Research Bootcamp, check out https://mehtaplustutoring.com/ai-ml-research-bootcamp/.
*******************
Introduction
The United Nations (UN), an international organization dedicated to global cooperation and peace, has received numerous archaeological gifts from member states over the years. These gifts, rich in historical and cultural significance, are curated as part of the UN’s efforts to preserve global heritage. We seek to explore the motives behind these gifts using advanced computational methods.
Project Statement
Our research aims to develop an automated pipeline that can answer the question "Why?" for any archaeological gift in the UN's collection. Our objective is to cluster the gifts into categories based on their descriptions; by utilizing techniques such as natural language processing (NLP), dimensionality reduction, and clustering, the project seeks to uncover the historical, artistic, and symbolic meanings behind these artifacts. The pipeline integrates data scraping, text preprocessing, feature extraction, and clustering analysis to achieve these goals.
Methodology and Implementation
Data Scraping and Preprocessing
The project begins with data scraping from URLs containing descriptions of archaeological gifts; we also added URLs for all the other gifts in the UN catalog. The DataScraper class retrieves HTML content, parses it using BeautifulSoup, and extracts the relevant text. This text then undergoes preprocessing in the Preprocessing class, which handles contractions, cleans the text, tokenizes it, removes stop words, and lemmatizes the tokens.
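A minimal sketch of this scraping-and-preprocessing flow is shown below. The function names are illustrative rather than the actual DataScraper/Preprocessing internals, and NLTK stands in for whatever toolkit the classes wrap:

```python
import re

import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

def scrape_description(url: str) -> str:
    """Fetch a gift page and extract its visible paragraph text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

def preprocess(text: str) -> list[str]:
    """Clean, tokenize, remove stop words, and lemmatize one description."""
    # (The full pipeline also expands contractions before this step.)
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # strip punctuation/digits
    stops = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text)
            if tok not in stops and len(tok) > 2]
```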
Feature Extraction: Word Embeddings
In order for a machine to work with text data, we had to convert it into a numerical representation using word embeddings. The preprocessed text is then used to train a Word2Vec model that generates these embeddings; the WordEmbeddings_Handler class in the pipeline handles the construction and training of the Word2Vec model. The embeddings capture semantic relationships between words, facilitating deeper analysis of the textual descriptions of archaeological gifts.
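A minimal gensim sketch of this step, assuming the preprocess and scrape_description helpers from above and a list urls of gift pages; the hyperparameters are illustrative defaults, not the project's tuned values:

```python
import numpy as np
from gensim.models import Word2Vec

# urls: list of gift-page URLs gathered during scraping (assumed here).
token_lists = [preprocess(scrape_description(u)) for u in urls]

# Train a skip-gram Word2Vec model on the tokenized descriptions.
w2v = Word2Vec(sentences=token_lists, vector_size=100, window=5,
               min_count=2, sg=1, epochs=30, workers=4)

# Represent each description as the mean of its word vectors
# (assumes every description keeps at least one in-vocabulary token).
doc_vectors = np.array([
    np.mean([w2v.wv[t] for t in toks if t in w2v.wv], axis=0)
    for toks in token_lists
])
```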
Dimensionality Reduction: PCA
To visualize and analyze the high-dimensional embeddings in a 2D space, we utilized Principal Component Analysis (PCA) in the PCA_Handler class. PCA reduces the dimensionality of the embeddings while preserving important relationships between data points. This step prepares the data for clustering analysis.
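With scikit-learn, the reduction itself is a few lines; doc_vectors here is the matrix of averaged embeddings from the sketch above:

```python
from sklearn.decomposition import PCA

# Project the embeddings onto their first two principal components.
pca = PCA(n_components=2, random_state=42)
points_2d = pca.fit_transform(doc_vectors)
print(f"Variance explained by 2 components: "
      f"{pca.explained_variance_ratio_.sum():.2%}")
```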
Clustering Analysis: K-Means
The final step involves clustering analysis using the K-Means algorithm implemented in the KMeans_Model class.
To determine the optimal value of the hyperparameter n_clusters, we employed two methods: silhouette analysis and the elbow method.
The silhouette score metric measures how similar an object is to its own cluster compared to other clusters. Higher silhouette scores indicate better-defined clusters. This method was therefore used to find the optimal number of clusters for the transformed word embeddings by performing KMeans clustering for different cluster counts and evaluating their performance using silhouette scores.
The optimal number of clusters, according to silhouette analysis, is 2, as it provides the highest silhouette score, indicating well-defined clusters.
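A sketch of this search, assuming the PCA output points_2d from above (the range of k values is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Score K-Means partitions for each candidate k (silhouette needs k >= 2).
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points_2d)
    scores[k] = silhouette_score(points_2d, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k} (score = {scores[best_k]:.3f})")
```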
Additionally, the elbow method allows for an automated approach to selecting the optimal number of clusters in KMeans clustering. This was done by plotting the distortion score for different values of k and looking for a point where the rate of decrease sharply slows down (the “elbow point”).
The distortion score decreases rapidly from k=2 to k=6, indicating significant improvement in clustering performance. Beyond k=6, the reduction in distortion score becomes less pronounced, which confirms k=6 as a reasonable choice for the number of clusters. This choice balances the trade-off between the clustering performance and the complexity of having more clusters.
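The curve behind this analysis can be produced as sketched below; here scikit-learn's inertia_ (within-cluster sum of squares) stands in for the distortion score, which may differ from the project's exact metric:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Compute the distortion (inertia) for each candidate k and plot the curve.
ks = list(range(2, 11))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42)
            .fit(points_2d).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Distortion (inertia)")
plt.title("Elbow method for K-Means")
plt.show()
```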
Creating Labels for Clusters: Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is an NLP model for discovering topics within a collection of documents. In our particular use case, we utilized LDA to name the clusters produced by our K-Means analysis.
First, we grouped the preprocessed, untokenized texts by cluster and combined the texts within each cluster into a single document. We then converted each document into a bag-of-words representation, built a corpus from these documents, and applied LDA, which returned the following output.
{
0: Secretary General,
1: Gift,
2: Artist,
4: Sculpture,
}
From these results, we deduced that there were effectively four clusters, compared with the six suggested by the elbow method and the two suggested by silhouette analysis.
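A minimal gensim sketch of this labeling step; cluster_members, mapping each K-Means cluster id to the indices of its documents, is a hypothetical name:

```python
from gensim import corpora
from gensim.models import LdaModel

# Merge each cluster's token lists into one document per cluster.
# cluster_members: {cluster_id: [doc indices]} (hypothetical name).
cluster_docs = [sum((token_lists[i] for i in idxs), [])
                for idxs in cluster_members.values()]

# Build the bag-of-words corpus and fit LDA with one topic per cluster.
dictionary = corpora.Dictionary(cluster_docs)
corpus = [dictionary.doc2bow(doc) for doc in cluster_docs]
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=len(cluster_docs), passes=10, random_state=42)

# The top words of each topic suggest a human-readable cluster label.
for topic_id, words in lda.show_topics(num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```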
Challenges Faced
Initially, we utilized the DBSCAN model to cluster the word embeddings, since it handles arbitrarily shaped data and noise without requiring the number of clusters to be specified in advance. We implemented an automated approach to determine the optimal values for the hyperparameters epsilon and min_samples, but DBSCAN did not produce meaningful clusters.
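For reference, one common way to automate the epsilon choice is the k-distance heuristic sketched below; this illustrates the idea and is not necessarily the project's exact search:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

min_samples = 4  # illustrative; the project searched over this value too

# Sort each point's distance to its k-th nearest neighbor; the knee of
# this curve is a common eps candidate (crudely approximated here by
# the 95th percentile).
knn = NearestNeighbors(n_neighbors=min_samples).fit(points_2d)
distances, _ = knn.kneighbors(points_2d)
k_dist = np.sort(distances[:, -1])
eps = float(k_dist[int(0.95 * len(k_dist))])

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_2d)
print(f"eps = {eps:.3f}, clusters found:", len(set(labels) - {-1}))
```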
Results and Analysis
The pipeline successfully processes and analyzes archaeological gift descriptions from the UN dataset. The clustering results reveal four distinct clusters: artist, gift, sculpture, and gifts related to the Secretary-General. Visualizations, generated using Plotly and displayed with tkinter, illustrate these clusters in two-dimensional space, providing insights into the distribution of and relationships among the gifts.
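A Plotly sketch of such a visualization; kmeans_labels (the final cluster assignments) is a hypothetical name, and the project additionally embedded the figure in a tkinter window:

```python
import plotly.express as px

# Color the 2-D PCA projection by cluster assignment.
fig = px.scatter(
    x=points_2d[:, 0], y=points_2d[:, 1],
    color=[str(lbl) for lbl in kmeans_labels],
    labels={"x": "PC 1", "y": "PC 2", "color": "Cluster"},
    title="UN archaeological gift descriptions, clustered",
)
fig.show()
```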
Our Code: https://github.com/FuzzDOT/MidProject1
Conclusion and Future Work
In conclusion, the automated pipeline developed by our team offers a robust framework for understanding the motivations behind archaeological gifts to the UN. Future work could include improving the pipeline's accuracy, integrating additional feature extraction techniques, and expanding the dataset to encompass a broader range of cultural artifacts.
Acknowledgments
This project was made possible by leveraging advanced AI and NLP techniques, as well as the collaborative efforts of the developers dedicated to preserving and interpreting global cultural heritage.