About the Glue Crawler
The Glue Crawler is used to obtain the metadata, or schema, of the data store it scans. Metadata here means the important aspects of the data, such as column names and their data types, and the data store is the place from which the data is obtained, which can be either a database or an S3 bucket. The crawler has a built-in classifier that identifies the data type of each column when the data is stored in a tabular format such as a CSV file or a relational database table. It can also identify the structure of data in JSON format, but you need to write code to flatten it and bring it into a tabular format. After obtaining the metadata, the crawler stores it in a database. The stored metadata is referred to as the AWS Glue Data Catalogue, which is especially useful when running a Glue Job, covered in the next section.
To learn more about the AWS Glue Data Catalogue, you can visit https://rb.gy/n0spln.
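If you prefer to inspect the Data Catalogue programmatically rather than through the console, a minimal boto3 sketch such as the one below lists the catalogue databases and the tables a crawler has written into one of them. The database name cambridge_data_catalogue is just a placeholder taken from the example later in this post, and the sketch assumes your AWS credentials and default region are already configured.

import boto3

glue = boto3.client("glue")  # assumes credentials and a default region are configured

# List the catalogue databases that crawlers have written metadata into.
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])

# List the tables inside one catalogue database (placeholder name).
for table in glue.get_tables(DatabaseName="cambridge_data_catalogue")["TableList"]:
    print("Table:", table["Name"])

Both calls are paginated, so for a large catalogue you would follow the returned NextToken; for a small example like this one the first page is enough.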
This operation of obtaining the metadata and storing it in the AWS Glue Data Catalogue is called scanning the data. It should be noted that you can run a Glue Job without first running the crawler over the data store; crawling simply makes things easier when using the graphical interface in Glue Studio, especially if there are many tables stored in the data store. Also keep in mind that the crawler cannot scan every file type; it supports common formats such as CSV and JSON. If the data store is a relational database this is not an issue, but if your data is stored as files in an S3 bucket, note down the file formats and check whether they are supported. The extremely popular .xlsx (Excel) format is not supported by the crawler or by Glue Studio, so you must convert such files to a format like CSV (a quick way to do this is sketched after this paragraph) before you can run a Glue Job on them. Finally, you do not need to run the crawler every time if you are certain that the columns and data types have not changed since the last run. Running crawlers unnecessarily increases your expenses, so use them judiciously. You only need to rerun the crawler when column names change or when new tables, schemas, or files with different column names are added to the data store.
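If you do have Excel files, one quick way to convert them before crawling is with pandas. This is only a sketch: the file names are placeholders, and reading .xlsx files requires the openpyxl package to be installed.

import pandas as pd

# Hypothetical file names; point these at your own files.
df = pd.read_excel("sales_data.xlsx")     # reading .xlsx needs the openpyxl package
df.to_csv("sales_data.csv", index=False)  # write a CSV the Glue crawler can classify

Upload the resulting CSV back to your S3 bucket (or data store) before running the crawler.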
An example of using the crawler
I have experience using AWS Glue in one of my projects at my organization, so I know the problems one faces while using it, and I wanted to help those who want to learn how to use AWS Glue for the ETL (Extract, Transform, Load) process. This blog describes the AWS Glue functionalities such as crawlers, Glue Jobs, connections, etc. It covers which policies to attach to the IAM roles for diverse types of jobs, and it also covers aspects related to cost saving while using the service.
This example is borrowed from a use case in my organization. I have used a Kaggle dataset to perform the crawl operation for illustration purposes.
Step 1: Go to the Glue console. You can reach the Glue console by using the search bar at the top of the window you see after logging in to your AWS account. Search for AWS Glue and click on the first link.
Step 2: In the console, click on 'Crawlers' in the left pane. This will take you to the Crawlers page.
Step 3: Click on the ‘Add Crawler’ button to create a new Crawler.
Step 4: You should then see a field asking for a crawler name. Assign a suitable name so that you remember what purpose the crawler was created for, as you may need to create several of them. After giving it a suitable name, click on the 'Next' button.
Step 5: This step is important, as you are asked about the source type and what should be done when the crawler run is repeated. The source type will most likely be 'Data stores' unless you have an existing catalogue table. For repeat crawls I have chosen 'Crawl all folders' since I am running it for the first time; you may choose a different option based on how you intend to use it.
Step 6: The next step is to configure the data store. The data store can be a database reached through a 'Connection', or a file or folder in an S3 bucket. In my case it is a file in an S3 bucket. Choose the appropriate option, and after configuring the store click on the 'Next' button.
Step 7: Next you will be asked to select an IAM role. If you are familiar with the IAM role concept, you can select an IAM role and proceed further, but for those who are unfamiliar, I have added images you can go through. It is important to note that the IAM role should have an administrative access policy attached so that it can access and write to the Glue Data Catalogue. It should also have a CloudWatch Logs policy attached, and if you are reading from an S3 bucket it needs an S3 read-only access policy as well.
IAM role image 1: You can choose an existing IAM role or create one. To create one, go to the IAM console using the search bar or the link at the bottom.
IAM role image 2: In the console, look for Glue.
IAM role image 3: Once you have selected Glue, click on the 'Next: Permissions' button.
IAM role image 4: Search for the policies mentioned above and attach them. By attach, I mean check the checkbox beside each of them.
Then click on 'Next: Tags' and give the IAM role a name. Make sure you note down this name and then select it in the crawler. If you prefer to set the role up with a script, a sketch follows these steps.
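As an alternative to clicking through the IAM console, the role and its policies can be created with boto3. The role name below is a hypothetical placeholder, and instead of full administrative access I attach the AWS managed policies AWSGlueServiceRole, CloudWatchLogsFullAccess, and AmazonS3ReadOnlyAccess, which is usually enough for a crawler reading from S3; adjust the list to match your own requirements.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# 'my-glue-crawler-role' is a hypothetical name; pick your own and note it down.
iam.create_role(
    RoleName="my-glue-crawler-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies the crawler needs: Glue service access,
# CloudWatch Logs for crawler logs, and read-only access to S3.
for policy_arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
]:
    iam.attach_role_policy(RoleName="my-glue-crawler-role", PolicyArn=policy_arn)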
Step 8: Once done with selecting the IAM role click on the Next button.
Step 9: The next step is to create or choose a database. Usually we create a new one and give it a logical name. If you already have one and know how to connect to it, you can select it. I am creating a new one with the name 'Cambridge_data_catalogue'; you can create it with a name of your choice and then click on the 'Create' button. This database will store the metadata for the tables in your data store.
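The same catalogue database can be created through the API. Note that Glue folds database names to lowercase when it stores them, so the sketch below uses a lowercase version of the name.

import boto3

glue = boto3.client("glue")

# Create the catalogue database that the crawler will write its metadata into.
glue.create_database(
    DatabaseInput={
        "Name": "cambridge_data_catalogue",  # Glue stores this name in lowercase
        "Description": "Metadata for the Kaggle example data store",
    }
)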
Step 10: Review your crawler configuration and click on the 'Finish' button.
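If you would rather create the same crawler with a script than with the console wizard, the choices made in Steps 4 to 9 map onto a single boto3 create_crawler call. Everything below (crawler name, role, database, and S3 path) is a placeholder to adjust for your own account, and CRAWL_EVERYTHING corresponds to the 'Crawl all folders' option from Step 5.

import boto3

glue = boto3.client("glue")

# All names and the S3 path below are hypothetical placeholders.
glue.create_crawler(
    Name="cambridge-data-crawler",
    Role="my-glue-crawler-role",              # IAM role from Step 7
    DatabaseName="cambridge_data_catalogue",  # catalogue database from Step 9
    Targets={"S3Targets": [{"Path": "s3://my-bucket/kaggle-data/"}]},
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"},  # 'Crawl all folders'
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE",
                        "DeleteBehavior": "LOG"},
)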
Step 11: Finally, it is time to test the crawler. Select your newly created crawler and click 'Run crawler' at the top.
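Programmatically, this run corresponds to start_crawler. The sketch below starts the crawler and polls its state until it returns to READY; the crawler name is the same placeholder used earlier.

import time
import boto3

glue = boto3.client("glue")

glue.start_crawler(Name="cambridge-data-crawler")  # hypothetical crawler name

# Poll until the crawler finishes; a run over a small data store takes a few minutes.
while True:
    crawler = glue.get_crawler(Name="cambridge-data-crawler")["Crawler"]
    if crawler["State"] == "READY":
        print("Last crawl status:", crawler.get("LastCrawl", {}).get("Status"))
        break
    time.sleep(30)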
Step 12: You should see a message at the top telling you how many tables were added to the catalogue.
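The same information is available from the API through get_crawler_metrics, which reports how many tables the crawler has created, updated, or deleted.

import boto3

glue = boto3.client("glue")

# Fetch run statistics for the crawler created earlier (the name is a placeholder).
metrics = glue.get_crawler_metrics(
    CrawlerNameList=["cambridge-data-crawler"]
)["CrawlerMetricsList"][0]

print("Tables created:", metrics["TablesCreated"])
print("Tables updated:", metrics["TablesUpdated"])
print("Tables deleted:", metrics["TablesDeleted"])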
Step 13: You can see the table and the metadata in the tables pane on the left.
Step 14: Finally, you should find the metadata for your table. Verify that the metadata obtained is correct. This concludes the crawler operation.
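To verify the same metadata from code, get_table returns the column names and data types the crawler inferred. The database and table names below are placeholders; the crawler usually names the table after the S3 folder or file it crawled.

import boto3

glue = boto3.client("glue")

# Database and table names are hypothetical; use the ones from your own crawl.
table = glue.get_table(
    DatabaseName="cambridge_data_catalogue",
    Name="kaggle_data",
)["Table"]

# Print the schema the crawler inferred so you can check it against your data.
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], "->", column["Type"])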
To learn more about defining a crawler and its use, you can visit the official documentation: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html.
Finally, if you used a 'Connection' to a database as your data store, check the connection status at the top; it should report that it connected successfully.