Exploring Web Scraping Techniques in Data Mining: A Python-Based Data Search Approach

Debora Chrisinta(1*), Justin Eduardo Simarmata(2)

(1) Information Technology Study Program, Universitas Timor, Indonesia
(2) Mathematics Education Study Program, Universitas Timor, Indonesia
(*) Corresponding Author

Abstract


Web scraping is an automated technique for extracting information from web pages, and it is used for data collection in data mining. Two common families of algorithms in data mining are clustering and classification. In this study, the data were sourced from the Google Search Engine. A web scraping script was written in Python to collect the data, process the HTML, and extract information from the web pages. Tourism-related data were successfully gathered from the Google Search Engine, and the number of links retrieved and the processing time were measured. Data processing consisted of cleaning the data and applying a hierarchical clustering algorithm, with the optimal number of clusters selected using the Dunn index. The data were then used to train a decision tree model, which was evaluated using accuracy, a confusion matrix, and a classification report. The results underline the importance of web scraping in data mining and give a comprehensive picture of the effectiveness of web scraping techniques and of the application of data mining.
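As an illustration of the cluster-selection step, the following is a minimal sketch of a Dunn index computation in plain Python. The abstract does not state which distance metric or which Dunn variant was used, so this sketch assumes Euclidean distance, the smallest point-to-point distance between clusters as the separation, and the largest within-cluster distance as the diameter. The function name `dunn_index` and the toy points are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter.

    Assumption: Euclidean distance and the classic min-separation /
    max-diameter variant; the paper may have used a different one.
    """
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("Dunn index needs at least two clusters")

    # Largest distance between two points that share a cluster.
    max_diameter = max(
        (dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )

    # Smallest distance between two points in different clusters.
    min_separation = min(
        dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )

    if max_diameter == 0.0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter


# Hypothetical toy data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(points, [0, 0, 1, 1])  # groups match the geometry
bad = dunn_index(points, [0, 1, 0, 1])   # groups cut across the geometry
print(good, bad)  # 5.0 0.2
```

A higher value means compact, well-separated clusters, so when several candidate cluster counts are compared, the count whose labeling gives the largest Dunn index would be preferred.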






DOI: http://dx.doi.org/10.30998/faktorexacta.v17i1.22393





Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
