Democracy and Human Rights Program
【GGR Workshop】How to Scrape Text Data
Date: September 28, 2023 and October 5, 2023
Place: Conference Room, Mercury Tower
Event Outline

On September 28 and October 5, 2023, the Institute for Global Governance Research (GGR) held a workshop titled “How to Scrape Text Data” with Mr. Pablo Andres Bugueno Echiburu, a software engineer from Chile with expertise in strategic project management and web development. The workshop was conducted in English, and a total of 26 people participated in the two sessions, including students and faculty members from Hitotsubashi University.

The workshop was divided into two days. On the first day, Mr. Bugueno introduced the participants to machine learning, data sourcing, scraping, and the appropriate tools and their challenges, sharing materials such as Python and BeautifulSoup tutorials. Regarding data repositories, Mr. Bugueno distinguished several types of data source: data mining (extraction), analog repositories (physical documents), APIs (application programming interfaces), databases (sorted and maintained), and web scraping (structured extraction from the web). To make these ideas easier to grasp, he explained how to scrape content from webpages effectively and how to select data or information manually for data projects. Participants then tried their hand at web scraping, following Mr. Bugueno's instructions and working with scraper examples and data frames. Thereafter, Mr. Bugueno encouraged the participants to tackle a scraping solution for the GGR website.

On the second day, Mr. Bugueno began with a review of the lecture from the previous day. He then continued the session by showing how to scrape content from HTML pages and encouraged the participants to use BeautifulSoup, which requires the user to load two libraries: requests and BeautifulSoup itself. Mr. Bugueno frequently checked in with participants during the workshop to see whether they had difficulty following his instructions while practicing with the code. Participants also expressed their concerns about HTML file paths and technical terms.
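
For readers who did not attend, the two-library pattern described above might look roughly like the following minimal sketch. It is not the code used in the workshop. The sample HTML, the function names fetch_html and extract_links, and the CSS selector are all hypothetical and chosen only for illustration. The sketch assumes the requests and beautifulsoup4 packages are installed.

```python
import requests                # library 1: downloads the page over HTTP
from bs4 import BeautifulSoup  # library 2: parses the downloaded HTML

# Hypothetical sample page, used so the example runs without network access.
SAMPLE_HTML = """
<html><body>
  <h1>Events</h1>
  <ul>
    <li class="event"><a href="/events/1">Workshop A</a></li>
    <li class="event"><a href="/events/2">Workshop B</a></li>
  </ul>
</body></html>
"""


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    return response.text


def extract_links(html, selector="li.event a"):
    """Return one dict per link matched by the (assumed) CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"title": a.get_text(strip=True), "href": a.get("href")}
        for a in soup.select(selector)
    ]


if __name__ == "__main__":
    # Parse the inline sample; for a real site, pass fetch_html(url) instead.
    for row in extract_links(SAMPLE_HTML):
        print(row["title"], row["href"])
```

Each returned dictionary is one record, so the resulting list can be loaded into a data frame for later analysis. Before scraping a real website, it is advisable to check its terms of use and robots.txt file.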

【Event report prepared by】
Hnin Htet Htet Aung (Master’s student, School of International and Public Policy)