A project for the Text Mining course, Autumn 2022 semester.
In this study, we use machine learning and text mining techniques to classify and analyze TED talk videos based on their transcripts. We perform sentiment analysis to identify the opinions and feelings that speakers express about each talk topic, and we use topic analysis to cluster the videos and compare the clusters with the categories assigned by the TED website. We also train a random forest classifier to predict the topics of new videos. Our results show that TED talks tend to express positive sentiment and that the clusters generated by Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) align closely with the known categories. The supervised model that combines LSA applied to Term Frequency-Inverse Document Frequency (TF-IDF) features with additional information achieved the highest overall accuracy, 0.90. Limitations and potential directions for future research are also discussed.
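The classification step described above (TF-IDF features, reduced with LSA, fed to a random forest) can be sketched roughly as follows. This is a minimal illustration in Python with scikit-learn, not the project's actual code: the toy transcripts, the topic labels, and all parameter values are assumptions made for the example, and the "additional information" features mentioned above are left out.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for TED transcripts and their topic labels.
transcripts = [
    "the brain and neurons shape how we learn and remember",
    "neurons in the brain fire when we remember a memory",
    "climate change is warming the oceans and melting ice",
    "melting ice and warming oceans threaten the climate",
]
topics = ["science", "science", "environment", "environment"]

# TF-IDF weighting -> LSA (truncated SVD of the TF-IDF matrix) -> random forest.
# n_components and n_estimators are illustrative values, not the report's settings.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(transcripts, topics)

# Predict the topic of a new, unseen transcript.
predicted = model.predict(["the oceans are warming as ice melts"])
print(predicted[0])
```

On a real corpus, the number of LSA components would be much larger and accuracy would be estimated on a held-out test set rather than on a handful of toy sentences.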
Full report: https://rpubs.com/Manunpat/1105498