AWS Glue - Why, When and How to use it

A brief introduction to AWS

AWS is a subsidiary of Amazon that specializes in providing cloud computing services. It offers a diverse range of services, from simple storage (‘Amazon S3’) to machine learning (‘Amazon SageMaker’). Amazon has consistently claimed it can provide large-scale computing infrastructure more quickly and cheaply than building and maintaining in-house physical servers. AWS services are billed based on usage. According to the latest market share reports, AWS holds about 33% of the cloud market (IaaS, PaaS), while its two major competitors, Microsoft Azure and Google Cloud, hold 21% and 10% respectively.

AWS Glue is one of AWS’s services, and it is the subject of this blog.

To follow this blog, you can create an AWS free tier account if you do not already have access to an AWS account. Most of what is covered here falls under the free tier, except for running the Glue job. You can create a free account at https://rb.gy/he8xij.

Why am I writing this blog?

I have used AWS Glue in one of the projects at my organization, and I know the problems one faces while using it. I want to help those who would like to use AWS Glue for the ETL (Extract, Transform, Load) process. The blog describes AWS Glue functionalities such as crawlers, Glue jobs, connections, etc. It covers which policies are to be attached to the IAM roles for diverse types of jobs. It also covers aspects related to cost-saving while using it.

Note that I will be covering AWS Glue Studio, which is an ETL (Extract, Transform, and Load) tool, and not AWS Glue DataBrew, which is a data preparation tool for building machine learning models.

A brief introduction to AWS Glue

AWS Glue is a service provided by AWS to perform ETL (Extract, Transform, and Load) operations. It is a tool to move data from various sources, such as files and databases, to a common location where it can be used as a data lake or a data warehouse for analytics and deriving insights. It is a serverless technology, so one need not buy and maintain expensive servers or virtual machines to perform these operations. Glue is best used when the data size is very large, for example petabytes of data. It is particularly useful when we want to extract data from several IoT devices for analysis. Nonetheless, it can also be used for data sizes in the GBs or hundreds of MBs, such as extracting sales or order records from a central database or an ERP system to an analytics database or a data warehouse. It may not be such a good solution for small data sizes, where one can look at alternatives such as AWS Lambda functions. Once again, by AWS Glue here I mean AWS Glue Studio and not AWS Glue DataBrew. In the upcoming sections I will cover how to set up a connection, then go through the Glue crawler, and finally show how to set up a Glue job (ETL job).

Why use AWS Glue

The processes that transform data from various sources and bring it together in a single location are called data pipelines. Maintaining these data pipelines requires a lot of effort and time. Since companies would like to focus on their core competencies rather than on constructing and maintaining data pipelines, they tend to use services that do it for them, so they need not worry about the infrastructure and maintainability aspects. Amazon Web Services (AWS) is one such provider of cloud computing services. AWS has several data-related services, one of which is AWS Glue, which focuses on ETL (Extract, Transform, and Load). It is one of two AWS services for moving data from sources to analytics destinations; the other is AWS Data Pipeline, which is more focused on data transfer. AWS Glue is a fully integrated ETL service that makes it easy for customers to prepare and load their data for analysis. Some of the features offered by AWS Glue are:

  • Easy - AWS Glue has all the functionalities required for building, maintaining, and running ETL operations. AWS Glue makes a pass through your data sources, identifies data formats, and suggests schemas and conversions required. AWS Glue automatically generates code to perform your data extraction and load processes.
  • Integrated - AWS Glue is integrated with a wide range of AWS services.
  • No server - AWS Glue requires no servers, so there is no infrastructure to manage. AWS Glue handles everything from provisioning infrastructure to auto-scaling it based on the requirements.

Setting up a connection to a Database in AWS Glue Console

A connection is the mechanism that lets Glue read data from, and write data to, a database. It becomes essential only when the source data comes from a database, the target is a database, or both source and target are databases. If the goal is to move data from a location in one S3 bucket to a location in another S3 bucket, a connection is not required.

Steps to set up an AWS Glue Connection

Step 1: To set up a connection, go to the AWS Management Console and from there to the AWS Glue console. The AWS Management Console is the default page after logging in to an AWS account. It is a graphical interface for accessing a wide range of AWS cloud services and managing compute, storage, and other cloud resources. To learn more about navigating the console to use AWS services, you can go through the official documentation at https://docs.aws.amazon.com/awsconsolehelpdocs/. You can reach the Glue console by searching for AWS Glue after logging in to your account. This step is shown in the figure below:

Step 2: Once in the console, you should be able to find ‘Connections’ in the left pane as shown below. Click on it.

Before making a connection, you should be familiar with concepts such as what an endpoint is and what a database is. You also need everything required to connect to your database, such as the endpoint, username, and password. Here I give an example of making a connection using a JDBC endpoint. Making another type of connection, such as connecting to a MongoDB database, might involve some changes, but the procedure remains the same.

To make a connection to your database, click ‘Add connection’ at the top.

Step 3: Next, fill in details such as the connection name, the connection type, and a brief description of the connection. Name the connection so that you can uniquely identify it and so that it tells you which database it connects to and for what purpose it was created. In my case, I have named it ‘my_connecton’, which may not be such a great name. For the connection type, I have chosen JDBC. You can choose the type based on your database. There is also a direct option to connect to Amazon RDS or Amazon Redshift if you are connecting to one of them. Do not check the ‘Require SSL connection’ option unless SSL is essential for connecting to your database.

Then click on the Next button.

Step 4: Next, fill in details such as the JDBC URL of the database you want to connect to, the username, the password, etc. The required fields may change based on the type of database you are connecting to.
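
The same details can also be supplied programmatically. The following is a minimal sketch, assuming the boto3 SDK is installed and AWS credentials are configured; the connection name, JDBC URL, username, password, and region are hypothetical placeholders, not values from this walkthrough. It builds the structure that the Glue `create_connection` API expects and only calls AWS when `create_connection` is invoked explicitly.

```python
def build_jdbc_connection_input(name, jdbc_url, username, password,
                                description="", enforce_ssl=False):
    """Build the ConnectionInput structure for a Glue JDBC connection."""
    return {
        "Name": name,
        "Description": description,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
            # Mirrors the 'Require SSL connection' checkbox in the console.
            "JDBC_ENFORCE_SSL": "true" if enforce_ssl else "false",
        },
    }


def create_connection(connection_input, region_name="us-east-1"):
    """Create the connection in Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.create_connection(ConnectionInput=connection_input)


# Hypothetical values for illustration only.
connection_input = build_jdbc_connection_input(
    name="sales_db_jdbc",
    jdbc_url="jdbc:mysql://example-host:3306/sales",
    username="etl_user",
    password="change-me",
    description="JDBC connection to the sales database for ETL",
)
print(connection_input["ConnectionType"])
```

A database inside a VPC would typically also need network details (subnet and security groups) in the connection, which this sketch leaves out.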

Then click the Next button. The final page shows all the details of your connection.

Step 5: Click the Finish button to finalize your connection. Once done, you should see your newly created connection. Test it to make sure it is working: check the box to the left of the connection and click the ‘Test connection’ button at the top.

You may be asked to choose an IAM role, as in this case. Choose a basic IAM role if one exists, or create one and attach the policies it needs, such as VPC access, based on your database.
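
As a rough sketch of what such a role looks like, the snippet below builds a trust policy that lets the Glue service assume the role and attaches the AWS managed policy `AWSGlueServiceRole`. It assumes boto3 and credentials with IAM permissions; the role name is a hypothetical placeholder, and your database may require additional policies beyond this one.

```python
import json

# AWS managed policy commonly attached to roles used by Glue.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy():
    """Trust policy allowing the Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(role_name):
    """Create the role and attach the managed policy (needs boto3 and credentials)."""
    import boto3  # imported lazily; nothing above depends on it

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_SERVICE_POLICY_ARN)


# Print the trust policy so it can be reviewed before creating anything.
print(json.dumps(glue_trust_policy(), indent=2))
```

Calling `create_glue_role("my-glue-role")` (a hypothetical name) would create the role; the sketch does not call it automatically.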

To learn more about the IAM role you can visit https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html.

Finally, check the connection status at the top. It should report that the connection was successful.

If you want to learn more about adding a Glue Connection, you can go to the official documentation https://docs.aws.amazon.com/glue/latest/dg/console-connections.html.

Connection properties can be explored in more detail at https://docs.aws.amazon.com/glue/latest/dg/connection-defining.html.

If you face an issue with the IAM role while testing the connection, you can visit https://docs.aws.amazon.com/glue/latest/dg/console-test-connections.html.

Now you should be able to use this connection to perform ETL operations. In the next section, I will give an overview of using the crawler.

What is Glue Crawler and how to use it

A Glue crawler is used to get the metadata, or schema, of the data store it scans. Metadata here means important aspects of the data, such as column names and their data types. Data store means the place from which the data is obtained, which can be either a database or an S3 bucket. The crawler has built-in classifiers that identify the data types of columns when the data is stored in a tabular format, such as a CSV file or a table in a relational database.
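
For readers who prefer code to the console, here is a minimal sketch of defining and starting a crawler with boto3. It assumes boto3 and AWS credentials are available; the crawler name, role name, catalog database, and S3 path are hypothetical placeholders. The crawler can target an S3 path, a database reached through a Glue connection, or both.

```python
def build_crawler_args(name, role, database_name,
                       s3_path=None, connection_name=None, jdbc_path=None):
    """Build keyword arguments for the Glue create_crawler API."""
    targets = {}
    if s3_path:
        targets["S3Targets"] = [{"Path": s3_path}]
    if connection_name and jdbc_path:
        # jdbc_path selects what to crawl, e.g. "database/%" for all tables.
        targets["JdbcTargets"] = [
            {"ConnectionName": connection_name, "Path": jdbc_path}
        ]
    if not targets:
        raise ValueError("Provide an S3 path or a connection name with a JDBC path")
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database_name,  # Data Catalog database for the tables
        "Targets": targets,
    }


def create_and_start_crawler(crawler_args, region_name="us-east-1"):
    """Create the crawler and start it (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.create_crawler(**crawler_args)
    glue.start_crawler(Name=crawler_args["Name"])


# Hypothetical values for illustration only.
crawler_args = build_crawler_args(
    name="sales_csv_crawler",
    role="my-glue-role",
    database_name="sales_catalog",
    s3_path="s3://example-bucket/sales/",
)
print(sorted(crawler_args["Targets"]))
```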

To learn more about glue crawler and how to use it, visit our blog - What is Glue Crawler and how to use it.

How to set up and run a Glue Job

Running a Glue job first requires defining the job to specify the source location, the transformations to be done on the source, a target location, etc. It is very similar to any other ETL tool you may have used. Most of the work can be done in the graphical interface provided by the tool, but for certain customized operations one may need to unlock the script.
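
A job can also be defined and started through the API. The sketch below is an assumption-laden illustration, not the method used in this blog: it assumes boto3 and credentials, an ETL script already uploaded to S3, and hypothetical job, role, script, and connection names. The Glue version and worker settings shown are example choices; a small worker count is used because the job run is the part that is billed.

```python
def build_job_args(name, role, script_location,
                   connection_names=(), number_of_workers=2):
    """Build keyword arguments for the Glue create_job API."""
    args = {
        "Name": name,
        "Role": role,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",   # example choice
        "WorkerType": "G.1X",   # example choice
        "NumberOfWorkers": number_of_workers,
    }
    if connection_names:
        # Glue connections the job may use to reach databases.
        args["Connections"] = {"Connections": list(connection_names)}
    return args


def create_and_run_job(job_args, region_name="us-east-1"):
    """Create the job, start a run, and return the run id (needs boto3)."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue", region_name=region_name)
    glue.create_job(**job_args)
    run = glue.start_job_run(JobName=job_args["Name"])
    return run["JobRunId"]


# Hypothetical values for illustration only.
job_args = build_job_args(
    name="sales_etl_job",
    role="my-glue-role",
    script_location="s3://example-bucket/scripts/sales_etl.py",
    connection_names=["sales_db_jdbc"],
)
print(job_args["Command"]["Name"])
```

The run id returned by `create_and_run_job` is useful later when looking up logs for that specific run.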

To learn how to set up and run a glue Job and how to use it, visit our blog - How to set up and run a glue job.

Solving errors in case of ETL failures by looking at ‘CloudWatch’ logs

I will be using ‘CloudWatch’ logs to find the cause of a Glue job failure. A Glue job can fail for several reasons, such as a data discrepancy, a change in the structure of the source table, or wrong implementation logic.
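
The lookup can be sketched in code as well. This example assumes boto3 and credentials, and it assumes the job uses Glue's default error log group `/aws-glue/jobs/error` with log streams named after the job run id; if your job is configured differently (for example with continuous logging), the log group name will differ. The keyword list used to pick out suspicious lines is my own illustrative choice.

```python
ERROR_LOG_GROUP = "/aws-glue/jobs/error"  # assumed default Glue error log group


def find_error_lines(messages, keywords=("error", "exception", "traceback")):
    """Return the log messages that contain any keyword, ignoring case."""
    return [
        message for message in messages
        if any(keyword in message.lower() for keyword in keywords)
    ]


def fetch_job_run_messages(run_id, log_group=ERROR_LOG_GROUP,
                           region_name="us-east-1"):
    """Fetch log messages for one job run (needs boto3 and AWS credentials).

    Only the first page of results is read; a long log would need pagination.
    """
    import boto3  # imported lazily so find_error_lines works without boto3

    logs = boto3.client("logs", region_name=region_name)
    response = logs.filter_log_events(
        logGroupName=log_group,
        logStreamNamePrefix=run_id,
    )
    return [event["message"] for event in response["events"]]


def get_failure_reason(job_name, run_id, region_name="us-east-1"):
    """Return the error message Glue recorded for a job run, if any."""
    import boto3

    glue = boto3.client("glue", region_name=region_name)
    job_run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
    return job_run.get("ErrorMessage")
```

A typical use would be `find_error_lines(fetch_job_run_messages(run_id))`, where `run_id` is the id of the failed run.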

To learn how to solve errors of ETL failures, visit our blog - Solving errors in case of ETL failures by looking at ‘Cloud Watch’ logs.

Conclusion

The functionalities of AWS Glue covered here are sufficient to perform ETL operations with basic transformations of the data. I have used these same functionalities in one of the projects at my organization. At times you may get stuck at some point or may not know how to approach a problem. AWS has put up a wide variety of advanced use cases on its website, and you can refer to them to move forward. If one knows what the tool comprises and how its resources can be used, it should be easy to follow these use cases. The purpose of this blog was to familiarize its readers with what AWS Glue is, the functionalities present in it, and how they can be used. More in-depth exploration can be done on specific topics based on one's requirements. This concludes it. Thanks.

Written by:

Shubhank Sharma

Data Scientist
