The AWS CloudFormation team has been busy in the last couple of months, adding support for new resource types for recently released AWS services. In this post, I take a deep dive into using AWS Glue with CloudFormation.
AWS Glue was first announced at re:Invent, and was made generally available the following August. AWS Glue is a serverless extract, transform, and load (ETL) service. ETL is a critical step in operationalizing data analytics, since data cleansing and reformatting is almost always necessary when building data marts, warehouses, data lakes, machine learning algorithms, metrics dashboards, operational reports, and many other data science projects.
Data science projects can be very time-consuming from an experimentation and discovery perspective. When a promising candidate project emerges, implementing the necessary production compute and storage resources can cause significant delays. It’s necessary that the business leverage the resources on a repeatable and ongoing basis.
Analyzing historical price data for bitcoin: Scenario
Your company wants to pursue a new service relating to cryptocurrency analysis, and your IT team is asked to produce a database with historical price information for bitcoin, to feed other analysis processes. Rather than require a significant upfront investment in building the database and the necessary tooling for ongoing analysis and model development, you can leverage AWS Glue and open source development tools like Apache Spark, Python, and Apache Zeppelin to enable quick experimentation and further product development. Further, you want to automate the operationalization of this data analysis platform as quickly as possible, and AWS CloudFormation helps with this automation.
Begin by getting historical bitcoin price data. You can choose to consume the data directly from currency exchanges, or you can find an existing dataset that includes past data. There are many sources of public data sets for such projects, as outlined in 18 places to find data sets for data science projects. One of my favorite sources for such data sets is Kaggle, where you can also learn from other scientists’ projects and findings, as well as participate in data science competitions sponsored by companies and academic institutions.
The dataset from the Bitcoin Historical Data page will fit our needs (see Figure 1 below). It includes historical data from January 2012 to today, from several exchanges.
Figure 1. A sample of the CSV data provided for the coinbase exchange.
Setting up our Bitcoin Price Database
- If you’ve never used Kaggle, you are asked to set up a free account.
- Download the data as a zip file from Bitcoin Historical Data.
- Expand the zip file locally and locate the file for the coinbase exchange.
- Set up an Amazon S3 bucket and put the file there.
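The unzip-and-locate step above can also be scripted. The following is a minimal sketch using only the Python standard library. The function name `extract_exchange_file` is hypothetical, and the sketch assumes that the exchange name appears in the file name inside the archive, which is a guess about how the download is organized.

```python
import zipfile
from pathlib import Path


def extract_exchange_file(zip_path, exchange="coinbase", dest="."):
    """Extract archive members whose file name mentions the exchange.

    Assumption: the exchange name is part of the file name. Adjust the
    match if the downloaded archive is organized differently.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            if exchange.lower() in Path(name).name.lower():
                # extract() returns the path of the file it wrote
                extracted.append(archive.extract(name, path=dest))
    return extracted
```

The returned paths point at the extracted files, which you can then upload to your S3 bucket from the console or the CLI.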
Armed with this data, now get it to a point where you can inspect it, query it, and project for further development and experiments.
- Create a new file in your favorite editor.
- Copy and paste the following CloudFormation template into the file (see Figure 2 below).
- Save it, and create a new stack with the console (I used the US East (N. Virginia) region), the CLI, or the API.
Figure 2. The CloudFormation template for the AWS Glue crawler for bitcoin data.
You set up everything with CloudFormation early in the process, so eventually operationalizing the solution in a production environment becomes rapid and repeatable. The template helps you:
- Designate S3 as the storage data lake
- Create an IAM role to use AWS Glue
- Extract the data
- Create a database
- Stream the data into a table that you can query with Amazon Athena or later visualize with Amazon QuickSight.
Note the following from the template code:
- The template creates the minimal resources for setting up your database. This includes the IAM role (ETLRole in the code example), the database itself, which can be used in Athena or QuickSight for further analysis, and the crawler. The crawler does the work of extracting the data from the CSV file that you downloaded and populates the table in the database.
- When creating the database (BitcoinPriceDB in the code example), ensure that you only use lowercase characters in the string for the DatabaseInput/Name attribute (bitcoin-price-db, in the code example).
- For the crawler, specify your bucket in the Targets/S3Targets/Path attribute (s3://cfnda-bitcoin-historical-data/coinbase in the code example). Put your own bucket and folder path where you uploaded your copy of the coinbase file that you got from Kaggle in the previous steps.
- For this example, the template sets up a schedule for the crawler to run every five minutes on weekdays. For your test, you may either omit the Schedule/ScheduleExpression stanza and run the crawler manually from the AWS Glue console, or use your own cron expression (see some examples in Time-Based Schedules for Jobs and Crawlers, keeping in mind that you are charged per the rates on AWS Glue Pricing).
- For this quick example, it makes sense to put everything in one template, with the benefit of quickly deleting all resources by deleting the stack. For a long-term project, you may want to create these resources in separate templates, as you only need to create the role and database one time. To look at the data from the other bitcoin exchanges provided in the Kaggle dataset, reuse the crawler code in separate templates and stacks for bitFlyer, coincheck, and Bitstamp, which are also provided in the same dataset.
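The notes above can be sketched as a small Python script that builds an equivalent template as a dictionary and renders it to JSON, which CloudFormation accepts alongside YAML. This is a hedged approximation, not the template from the figure. The names ETLRole, BitcoinPriceDB, bitcoin-price-db, and the S3 path come from the text. The crawler's logical name `CoinbaseCrawler`, the managed policy, the inline S3 read policy named `ReadBitcoinData`, and the exact cron expression are assumptions.

```python
import json


def build_template(bucket="cfnda-bitcoin-historical-data", prefix="coinbase",
                   db_name="bitcoin-price-db",
                   schedule="cron(0/5 * ? * MON-FRI *)"):
    """Build a CloudFormation template dict for the role, database, and crawler."""
    if db_name != db_name.lower():
        raise ValueError("DatabaseInput/Name must use only lowercase characters")

    crawler_props = {
        "Role": {"Fn::GetAtt": ["ETLRole", "Arn"]},
        "DatabaseName": db_name,
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}"}]},
    }
    if schedule:
        # Omit this stanza to run the crawler manually from the console.
        crawler_props["Schedule"] = {"ScheduleExpression": schedule}

    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ETLRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Principal": {"Service": ["glue.amazonaws.com"]},
                            "Action": ["sts:AssumeRole"],
                        }],
                    },
                    # Assumption: AWS managed policy for Glue service roles.
                    "ManagedPolicyArns": [
                        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
                    ],
                    # Assumption: read-only access to the data bucket.
                    "Policies": [{
                        "PolicyName": "ReadBitcoinData",
                        "PolicyDocument": {
                            "Version": "2012-10-17",
                            "Statement": [{
                                "Effect": "Allow",
                                "Action": ["s3:GetObject", "s3:ListBucket"],
                                "Resource": [
                                    f"arn:aws:s3:::{bucket}",
                                    f"arn:aws:s3:::{bucket}/*",
                                ],
                            }],
                        },
                    }],
                },
            },
            "BitcoinPriceDB": {
                "Type": "AWS::Glue::Database",
                "Properties": {
                    "CatalogId": {"Ref": "AWS::AccountId"},
                    "DatabaseInput": {"Name": db_name},
                },
            },
            "CoinbaseCrawler": {  # hypothetical logical name
                "Type": "AWS::Glue::Crawler",
                "DependsOn": ["ETLRole", "BitcoinPriceDB"],
                "Properties": crawler_props,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

Passing `schedule=None` drops the Schedule stanza, and passing a different `prefix` gives a starting point for crawlers over the other exchanges in the dataset.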
After creating the stack, which should take about 40 seconds, check the AWS Glue console, where your database and crawler show up on the respective lists immediately. After the crawler runs (in about 5 minutes, per your scheduled cron expression), the table also shows up. After the table creation is complete, you can view the schema from the AWS Glue console (see Figure 3), or execute SQL queries against it using the Athena console (see Figure 4).
Figure 3. Schema for the coinbase historical pricing table, created with the AWS Glue crawler.
Figure 4. Sample query results from the coinbase historical pricing table, using the Athena Query Editor.
Now that you’ve figured out how to ingest the data to a place where you can query it, you can start to compare the data with other exchanges from the dataset. You can also merge it with other data, reformat the data to make it easier to consume, and further enable data experiments.
For example, maybe you want to merge the data with other events or logs, but those events lack enough structure to build tables from them and join them with your existing tables. Using AWS Glue classifiers, you can use grok expressions to add structure to these extra data sources.
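Grok expressions are, in essence, named regular expressions. As a rough illustration of the kind of structure such an expression adds, the sketch below uses Python's `re` module on a made-up log line. The log format, the field names, and the grok pattern shown in the comment are all hypothetical, and this is not how AWS Glue runs classifiers internally.

```python
import re

# Roughly what a grok pattern such as
#   %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}
# expresses, written as a plain regular expression with named groups.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)"
)


def parse_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None
```

Each named group becomes a column, which is what makes otherwise free-form lines joinable with an existing table.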
Another example that considers the current data schema would be converting the UNIX time format field to a conventional date field. AWS Glue allows for the creation of jobs that can do such conversions on fields. It also allows you to use PySpark (Python-based Spark scripts) to do such transformations. You can later operationalize those scripts as jobs in AWS Glue, and have them run periodically or based on a trigger, perhaps to get updated data.
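To make the timestamp conversion concrete, here is a minimal sketch in plain Python rather than PySpark, so it runs without a Spark installation. In an AWS Glue job the same idea would typically be expressed with Spark functions such as `from_unixtime`. The column names `Timestamp` and `Date` and the helper `add_date_column` are assumptions for illustration, not taken from the crawled schema.

```python
import csv
import io
from datetime import datetime, timezone


def add_date_column(rows, ts_field="Timestamp", out_field="Date"):
    """Yield copies of each row with a UTC date string derived from UNIX time."""
    for row in rows:
        converted = dict(row)
        seconds = int(float(row[ts_field]))
        converted[out_field] = datetime.fromtimestamp(
            seconds, tz=timezone.utc
        ).strftime("%Y-%m-%d %H:%M:%S")
        yield converted


if __name__ == "__main__":
    sample = io.StringIO("Timestamp,Close\n1325376000,4.58\n")
    for record in add_date_column(csv.DictReader(sample)):
        print(record)
```

Converting in UTC keeps results independent of the time zone of the machine that runs the script.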
You can write and debug your own scripts, from your local IDE, by creating development endpoints in AWS Glue. Further, you can develop scripts and conduct experiments using Apache Zeppelin notebooks, deploying a server with the Zeppelin software ready to use. Not surprisingly, AWS Glue uses CloudFormation to deploy these Zeppelin servers. For more information about setting up your local IDE, see Tutorial: Set Up PyCharm Professional with a Development Endpoint to integrate your AWS Glue endpoint with JetBrains’ PyCharm Professional, or see Tutorial: Set Up an Apache Zeppelin Notebook on Amazon EC2 on how to create a Zeppelin notebook server on an Amazon EC2 instance right from the AWS Glue console, using CloudFormation to deploy the server.
Visit the CloudFormation details page and CloudFormation documentation for more information, as well as the full list of supported resources.
About the Author
Luis Colon is a Senior Developer Advocate for the AWS CloudFormation team. He works with customers and internal development teams to focus on and improve the developer experience for CloudFormation users. In his spare time, he mixes progressive trance music.