Friday, April 06, 2012

Using Data from Windows Azure Blobs with Apache Hadoop on Windows Azure CTP

Update 4/17/2012: See my Using Excel 2010 and the Hive ODBC Driver to Visualize Hive Data Sources in Apache Hadoop on Windows Azure sequel of 4/14/2012 for creating Excel graphs from the same data.

Introduction

Hadoop on Azure Team members Avkash Chauhan (@avkashchauhan) and Denny Lee (@dennylee) have written several blog posts about the use of Windows Azure blobs as data sources for the Apache Hadoop on Windows Azure Community Technical Preview (CTP). Both authors assume readers have some familiarity with Hadoop, Hive or both and use sample files with only a few rows to demonstrate Hadoop operations. The following tutorial assumes familiarity with Windows Azure blob storage techniques, but not with Hadoop on Azure features. Each of this tutorial’s downloadable sample flightdata text files contains about 500,000 rows. An alternative flighttest_2012_01.txt file with 100 rows is provided to speed tests with Apache Hive in the Interactive Hive console.

Note: As mentioned later, an initial set of these files is available from my SkyDrive account here.

My (@rogerjenn) Introducing Apache Hadoop Services for Windows Azure article, updated 4/2/2012, provides an overview of the many facets of Microsoft’s recent replacement for its original big-data analytic offerings from Microsoft Research: Dryad, DryadLINQ and LINQ to HPC.

The Hadoop on Azure Elastic MapReduce (EMR) portal’s Manage Cluster page offers two methods for substituting remote blob storage for Hadoop Distributed File System (HDFS) storage:

  1. Enabling the ASV protocol to connect to a Windows Azure Blob storage account or
  2. Setting up a connection to an Amazon Simple Storage Service (S3) account


Figure 1. The Hadoop on Azure EMR portal’s Manage Cluster page automates substituting Windows Azure or Amazon S3 storage for HDFS storage.

Note: My Importing Windows Azure Marketplace DataMarket DataSets to Apache Hadoop on Windows Azure’s Hive Databases tutorial of 4/3/2012 describes how to take advantage of the Windows Azure Marketplace DataMarket import option. For more information about Disaster Recovery, see the Understanding Disaster Recovery for Hadoop on Azure section at the end of this post.

This tutorial uses the ASV protocol in the following three scenarios:

  1. Hadoop/MapReduce with the Remote Desktop Protocol (RDP) and Hadoop Command Shell
  2. Hadoop/MapReduce with the Interactive JavaScript console
  3. Apache Hive with the Interactive Hive console

Note: According to Denny Lee, ASV is an abbreviation for Azure Storage Vault.

Scenarios 1 and 2 use the Hadoop FileSystem (FS) Shell Commands to display and manipulate files. According to the Guide for these commands:

The FileSystem (FS) shell is invoked by bin/hadoop fs <args>. All the FS shell commands take path URIs as arguments. The URI format is scheme://authority/path.

  • For HDFS the scheme is hdfs
  • For the local filesystem the scheme is file
  • [For Azure blobs, the scheme is asv]

The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost). Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of the commands. Error information is sent to stderr and the output is sent to stdout.
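
For example, the recursive list command used later in this tutorial can address blob storage or HDFS simply by changing the URI (the HDFS path below is illustrative only, not from this tutorial):

REM List the tutorial's blobs in the aircarrier container's flightdata folder
hadoop fs -lsr asv://aircarrier/flightdata/

REM List an HDFS directory; with no scheme, the configured default (HDFS) is used
hadoop fs -lsr /user/hadoop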

Scenario 3 executes Apache Hive’s CREATE EXTERNAL TABLE statement to generate a Hive table that has Windows Azure blobs, rather than HDFS, as its data source.

Note: The technical support forum for Apache Hadoop on Windows Azure is the Apache Hadoop on Azure CTP Yahoo! Group.


Creating the Azure Blob Source Data

The source data is a set of 32 tab-delimited text files derived from the FAA’s On-Time Performance data sets, which are available in zipped *.csv format for each month of every year since 1987 from the Bureau of Transportation Statistics’ Research and Innovative Technology Administration (RITA) site. For more information about these files, see The FAA On_Time_Performance Database’s Schema and Size section of my Analyzing Air Carrier Arrival Delays with Microsoft Codename “Cloud Numerics” article of 3/26/2012. Each *.csv file has 83 columns, about 500,000 rows and an average size of about 225 MB.

These files are processed as follows (an optional script that automates steps 4 through 7 appears after the list):

  1. Download On_Time_On_Time_Performance_YYYY_M.zip files
  2. Extract the *.zip files to On_Time_On_Time_Performance_YYYY_M.csv files
  3. Open each *.csv file in Excel 2010
  4. Delete all columns except Year, Month, DayofMonth, Carrier, Origin, Dest, DepDelayMinutes, ArrDelayMinutes
  5. Save the abbreviated file as flightdata_YYYY_MM.txt, with 0 prepended to single-digit month numbers, but don’t save changes to the *.csv file
  6. Open the abbreviated *.txt file in TextPad, delete the first (column name) row (see Figure 2) and save changes
  7. Open the *.txt file in Excel and replace empty DepDelayMinutes and ArrDelayMinutes cells with 0 (see the section below for the process)
  8. Upload the files to a flightdata folder of an aircarrier container in a Windows Azure storage account with Cerebrata Cloud Storage Studio or an equivalent tool (see Figure 3)
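
If you would rather script the conversion than edit each file in Excel and TextPad, the following minimal Python sketch (an optional alternative that isn’t part of the original workflow; file names are illustrative) performs steps 4 through 7 in one pass:

import csv

# Columns to keep, in output order (step 4 above)
COLUMNS = ["Year", "Month", "DayofMonth", "Carrier", "Origin", "Dest",
           "DepDelayMinutes", "ArrDelayMinutes"]

def convert(csv_path, txt_path):
    # Writes a headerless, tab-delimited file (steps 4 through 6) with empty
    # DepDelayMinutes/ArrDelayMinutes cells replaced by 0 (step 7).
    with open(csv_path, newline="") as src, open(txt_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            values = [(row[name] or "").strip() for name in COLUMNS]
            values[-2] = values[-2] or "0"   # DepDelayMinutes
            values[-1] = values[-1] or "0"   # ArrDelayMinutes
            writer.writerow(values)

# Example (illustrative file names):
# convert("On_Time_On_Time_Performance_2012_1.csv", "flightdata_2012_01.txt")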


Figure 2. TextPad displaying the latest available flightdata_YYYY_MM.txt file after deleting the column names on the first line. Blank cells have been replaced with zeroes as described below.

Note: The *.txt files have an average size of about 13 MB and are available to download in *.zip format from the FlightData Files for Hadoop on Azure folder of my SkyDrive account. TextPad is recommended for editing because it opens and saves the files much more quickly than Notepad.


Figure 3. Cerebrata Cloud Storage Studio displaying the five files of Figure 2 after uploading them to the \oakleaf\flightdata Windows Azure blob container.

Note: Adding the *.txt files to a folder within the container enables the Hive table to use all files in the folder as its data source.

Replacing Empty DepDelayMinutes and ArrDelayMinutes Cells with Zeroes

Records for canceled and diverted flights have empty cells for the departure delay, the arrival delay or both, as shown for rows 1 and 22 below:


Figure 4. Empty cells in numeric (integer) fields occur for canceled and diverted flights.

Empty (null) values in integer columns might cause unexpected results in MapReduce or HiveQL operations that depend on these values. To replace empty cells with zeroes in Excel, do the following:

1. Open the *.txt file in Excel and select the columns whose empty cells you want to fill. Press Ctrl+G to open the Go To dialog, click the Special button to open the Go To Special dialog, and select Blanks:


Figure 5. Use the Go To Special dialog to select all empty cells in the selected columns.

2. Click OK to dismiss the dialog and select the empty cells, and then type a 0 in the first empty cell:


Figure 6. Type a 0 in the first selected empty cell to prepare for propagation to all empty cells.

3. Press Ctrl+Enter to replicate the replacement throughout the worksheet:


Figure 7. Pressing Ctrl+Enter replaces all blank cells with zeroes.

4. Save the modified worksheet.
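
If you choose not to zero-fill the cells, the empty values load into the Hive table created in Scenario 3 as NULLs, which aggregate functions such as avg() silently skip. A hedged alternative (a sketch, assuming the flightdata_asv table defined later in this post) is to coalesce the NULLs to 0 at query time instead:

-- Treat missing arrival-delay values as 0 minutes at query time
SELECT avg(COALESCE(arrdelay, 0)), stddev_pop(COALESCE(arrdelay, 0))
FROM flightdata_asv;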


Configuring the ASV Data Source

1. Click the Manage Cluster page’s Set Up ASV button (refer to Figure 1) to open the Configure Azure Blob Storage page. Type the Storage Account Name and its Primary Access Key value in the two text boxes:


Figure 8. Data required to configure a storage account as an Azure Storage Vault.

Note: The requirement for a primary or secondary account key (called a “passkey” here) prevents using public blobs, such as those in the Cloud Numerics Team’s public flightdata container (see “Cloud Numerics” Example: Analyzing Air Traffic “On-Time” Data, step 2). I have requested a fix from the HadoopOnAzure team and will update this post when the problem is fixed.

2. Click the Save Settings button to create the ASV connection and display this confirmation page:


Figure 9. Confirming the new ASV connection.


Scenario 1: Working with ASV Files in the Hadoop Command Shell via RDP

1. Click the Hadoop on Azure EMR portal landing page’s Remote Desktop tile in the Your Cluster group to start the connection and display the download bar for your connection:


Figure 10. The download bar for a Remote Desktop (RDP) connection.

2. Click the Open button to open the Windows Security dialog for your Hadoop on Azure cluster and type your password:


Figure 11. Providing the Password for your Hadoop on Azure cluster.

3. Click OK to display the Remote Desktop Connection warning dialog:


Figure 12. The primary Remote Desktop Connection dialog for your Hadoop on Azure cluster.

4. Click Yes to start (see Figure 13) and complete (see Figure 14) the connection process:


Figure 13. The RDP’s Connecting dialog.


Figure 14. A Remote Desktop Connection to a Hadoop on Azure cluster.

5. Double-click the Hadoop Command Shell icon to open the Hadoop Command Shell window.

6. To list the files you uploaded previously, change to the c:\apps\dist folder if necessary, type hadoop fs -lsr asv://aircarrier/flightdata/, and press Enter:


Figure 15. Using the Hadoop FileSystem -lsr command to list files in a designated container’s folder.

7. To verify that a text file contains content, execute hadoop fs -tail asv://aircarrier/flightdata/flightdata_2012_01.txt (or substitute another file name) to display the last 1 kB of the file:


Figure 16. Displaying the last 1 kB of a text file with the FileSystem -tail command.

Note: Data for the last three days of January 2012 is missing because monthly data is based on full weeks.
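
The other FileSystem shell commands accept asv:// URIs in the same way. For example (a sketch only; the HDFS and local destination paths are illustrative and weren’t run for this post):

REM Copy a blob-backed file into HDFS
hadoop fs -cp asv://aircarrier/flightdata/flightdata_2012_01.txt /flightdata/flightdata_2012_01.txt

REM Report the size of each blob in the folder
hadoop fs -du asv://aircarrier/flightdata/

REM Download a blob to the cluster's local file system
hadoop fs -copyToLocal asv://aircarrier/flightdata/flightdata_2012_01.txt c:\apps\dist\flightdata_2012_01.txt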


Scenario 2: Working with ASV Files in the Interactive JavaScript Console

The Interactive JavaScript console handles most FileSystem commands you can execute in the Hadoop Command Shell. To emulate the preceding scenario, do the following:

1. On the Hadoop on Azure EMR portal’s landing page, click the Interactive Console tile and sign in if requested; the Interactive JavaScript console opens by default.

2. Type #ls asv://aircarrier/flightdata/ or #lsr asv://aircarrier/flightdata/ (the recursive version) to display a list of the six files uploaded above.

3. Type #tail asv://aircarrier/flightdata/flightdata_2012_01.txt to display the last 1 kB of the file:


Figure 17. Displaying a list of files and the last 1 kB of a specified file’s content in the Interactive JavaScript console.


Scenario 3: Working with ASV Files in the Interactive Hive Console

Hive tables require a schema to define their data structure before adding records from the ASV data source. Hive’s CREATE [EXTERNAL] TABLE syntax is similar to T-SQL’s, as demonstrated by the Apache Hive Language Manual’s Create Table Syntax topic. For this example, the Data Definition Language (DDL) statement to create a table with abbreviated names for the columns listed in step 4 of the file-processing instructions is shown below; the FIELDS TERMINATED BY '9' clause specifies the tab delimiter by its ASCII code:

CREATE EXTERNAL TABLE flightdata_asv (
year INT,
month INT,
day INT,
carrier STRING,
origin STRING,
dest STRING,
depdelay INT,
arrdelay INT
)
COMMENT 'FAA on-time data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '9'
STORED AS TEXTFILE
LOCATION 'asv://aircarrier/flightdata';

To add to the Hive database a table whose data remains in the ASV source folder, do this:

1. Click the Interactive Console tile to open the Interactive JavaScript page and click the Hive button to open the Interactive Hive Console.

2. Copy and paste the preceding CREATE EXTERNAL TABLE command in the text box at the bottom of the page, and click the Evaluate button to create the flightdata_asv table.


Figure 18. A multiple-exposure screen capture displaying available Hive tables and the columns of a selected table.

Note: The flighttest_asv table uses the \aircarrier\flighttest\flighttest_2012_01.txt file, which contains 100 rows. The smaller table speeds tests for live demos. Clicking a >> button adds the table name or column name to the text box so you don’t need to type it.
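
The flighttest_asv table mentioned in the preceding note was created the same way; a sketch of its DDL (reconstructed here, not copied from the original post) differs only in the table name, comment and LOCATION:

CREATE EXTERNAL TABLE flighttest_asv (
year INT,
month INT,
day INT,
carrier STRING,
origin STRING,
dest STRING,
depdelay INT,
arrdelay INT
)
COMMENT 'FAA on-time test data (100 rows)'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '9'
STORED AS TEXTFILE
LOCATION 'asv://aircarrier/flighttest';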

3. To verify the table schema, click the Clear Screen button, type DESCRIBE flightdata_asv in the text box and click Evaluate:


Figure 19. The DESCRIBE command lists column names and data types.

4. Verify that the Windows Azure blob data is accessible by typing SELECT * FROM flightdata_asv LIMIT 10 in the text box and clicking the Evaluate button:


Figure 20. The first 10 rows return quickly.

5. To verify that the Hive table uses all six text files currently in the flightdata folder, type SELECT COUNT(*) FROM flightdata_asv in the text box and click the Evaluate button:


Figure 21. Each *.txt file has an average of about 500,000 rows, so a count of ~3 million (returned in about 74 seconds) verifies that the Hive table uses all files in the flightdata folder.

The full text of the Hive history for this job is:

Hive history file=C:\Apps\dist\logs\history/hive_job_log_RogerJ_201204071855_304201235.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201204071422_0001, Tracking URL = http://10.186.132.33:50030/jobdetails.jsp?jobid=job_201204071422_0001
Kill Command = C:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker=10.186.132.33:9010 -kill job_201204071422_0001
2012-04-07 18:55:42,800 Stage-1 map = 0%, reduce = 0%
2012-04-07 18:55:57,878 Stage-1 map = 17%, reduce = 0%
2012-04-07 18:56:01,910 Stage-1 map = 33%, reduce = 0%
2012-04-07 18:56:04,925 Stage-1 map = 67%, reduce = 0%
2012-04-07 18:56:06,941 Stage-1 map = 72%, reduce = 0%
2012-04-07 18:56:10,003 Stage-1 map = 100%, reduce = 0%
2012-04-07 18:56:13,019 Stage-1 map = 100%, reduce = 22%
2012-04-07 18:56:22,050 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201204071422_0001
OK
Time taken: 74.423 seconds

6. Click the Jobs button to display the result of the previous steps (and a few for the flighttest_asv table):


Figure 22. The Job History page displays items for correct and incorrect syntax, including execution times of HiveQL statements.

7. To return the average and standard deviation of arrival delays, execute a SELECT avg(arrdelay), stddev_pop(arrdelay) FROM flightdata_asv query:


Figure 23. Obtaining the average and standard deviation of the arrival delay column in minutes. Execution time was 77.2 seconds.

The 9.15 and 28.52 values returned are in the same range as those calculated by the Codename “Cloud Numerics” project using 32 months (~16 million rows) of data: 4.51 and 34.97 minutes reported in the Interpreting FlightDataResult.csv’s Data section of my Analyzing Air Carrier Arrival Delays with Microsoft Codename “Cloud Numerics” article of 3/26/2012.

Note: avg(column) and stddev_pop(column) are among HiveQL’s built-in aggregate functions (UDAFs).
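
These UDAFs also work with GROUP BY; for example, a per-carrier version of the preceding query (a sketch that wasn’t run for this post) would be:

-- Average and standard deviation of arrival delays, in minutes, by carrier
SELECT carrier, avg(arrdelay), stddev_pop(arrdelay)
FROM flightdata_asv
GROUP BY carrier;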

8. To generate an array of (x, y) coordinates for a histogram with 20 unevenly spaced bins, where x is the bin center in arrival-delay minutes and y is the approximate number of flights in the bin, execute SELECT histogram_numeric(arrdelay, 20) FROM flightdata_asv:


Figure 24. Data for a 20-bin histogram of arrival delays in minutes.


Using the Interactive JavaScript Console to Create a Simple Graph from Histogram Tail Data

Avkash Chauhan described how to create a graph from HDFS file data in his Apache Hadoop on Windows Azure Part 9 – Using Interactive JavaScript for Data Visualization post of 1/2/2012. The following procedure uses the Interactive JavaScript console and the JavaScript native graph class with edited data from a modified version of Figure 24’s HiveQL statement that returns data for 10 bins. JavaScript native graphs don’t have an option for a logarithmic y-axis, which would be required to accommodate the very large flight counts for delays of less than a few hours. Therefore, the data is edited to show only the last five bins of the tail, which contain counts for delays of 12 hours or longer. Decimal fractions are removed for clarity.

To generate a five-bin histogram of the tail, do the following:

1. In the Interactive Hive console, execute a SELECT histogram_numeric(arrdelay, 10) FROM flightdata_asv command.

2. Select the returned data, copy it to the clipboard, and paste the array into Notepad.

3. Delete the first five members and remove the decimal fractions, leaving this array:


Figure 25. Reducing the number of histogram bins and removing decimal fractions from the data.

4. Select and copy the array data to the clipboard. Alternatively, select and copy this data: [{"x":726,"y":117}, {"x":892,"y":78}, {"x":1046,"y":28}, {"x":1243,"y":10},{"x":1403,"y":9}] 

5. Click the console’s JavaScript button, type data = at the js> prompt, paste the array data and press Enter to provide the graph’s data argument value (see Figure 26).

6. Set the graph options by typing options = { title: "Air Carrier Arrival Delay Histogram", orientation: 20, x: "x", y: "y" } at the prompt and pressing Enter:


Figure 26. Providing data and options argument values for the histogram tail graph.

7. Type graph.bar(data, options) at the prompt and press Enter to display a bar chart with arrival delay minutes on the x-axis and the count of delayed flights in each bin on the y-axis:


Figure 27: A histogram of the number of air-carrier arrival delays of 12 hours and longer
from August 2011 through January 2012.
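
For reference, the complete Interactive JavaScript console input from steps 4 through 7 is repeated below (one entry per js> prompt; the // comments are optional):

// Histogram tail data from step 4: x = arrival delay in minutes, y = approximate number of flights
data = [{"x":726,"y":117}, {"x":892,"y":78}, {"x":1046,"y":28}, {"x":1243,"y":10}, {"x":1403,"y":9}]
// Graph options from step 6
options = { title: "Air Carrier Arrival Delay Histogram", orientation: 20, x: "x", y: "y" }
// Render the bar chart (step 7)
graph.bar(data, options)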


Viewing Hadoop/MapReduce Job and HDFS Head Node Details

The default Remote Desktop window includes a Hadoop MapReduce icon that opens a Hadoop/MapReduce Administration page and a Head Node icon that opens a NameNode ‘10.186.132.33:9000’ page (the Head Node for this example).

MapReduce Job History in the Hadoop/MapReduce Administration Page

Here’s the page’s Completed Jobs section after running four MapReduce jobs from the Interactive Hive console:


Figure 28: The MapReduce Administration Page’s Completed Jobs section.

Head Node Details in the NameNode ‘10.186.132.33:9000’ Page

Here’s most of the NameNode page:


Figure 29: The NameNode page’s Cluster Summary

You can browse the filesystem and NameNode logs, as well as view details of live and other nodes, from links on this page. Most of this information isn’t germane to MapReduce operations generated by interactive Hive queries against ASV data.


Understanding Disaster Recovery for Hadoop on Azure

Figure 29 (above) shows 0.20.203.1 (branch 20) as the Hadoop version running in the Hadoop on Azure CTP as of 4/9/2012. The Apache Software Foundation released branch 20 as Hadoop 1.0 on 12/27/2011. HDFS v1.0 doesn’t meet today’s high-availability standards because the HeadNode is a single point of failure. If the HeadNode breaks, links to all worker nodes and the data they contain are lost. Hadoop v0.23 (branch 23), sometimes prematurely called Hadoop v2, provides a standby head node that takes over if the primary HeadNode fails.

Note: For more information about Hadoop branch 23 features, see Andrew Brust’s (@andrewbrust) Hadoop 2.0: MapReduce in its place, HDFS all grown-up article of 4/6/2012 for ZDNet’s Big on Data blog.

The Introduction’s Figure 1 shows a Disaster Recovery button and a “Disaster Recovery is configured” message. Clicking the button opens a page with the same message. On 4/8/2012, I searched for but couldn’t find a description of Hadoop on Azure disaster recovery, so I posted a “How Is Disaster Recovery for Hadoop Clusters Implemented?” question to the HadoopOnAzureCTP Yahoo! Group.

Brad Sarsfield (@bradoop), a Senior Developer on the Hadoop on Windows team, provided the following detailed information in a reply to my question:

Name Node Disaster Recovery in Azure is transparent and fully automated.

Here’s how things work. There are two ‘roles’ for Hadoop in Azure VM instances. One is the ‘head node’ and the others are ‘worker nodes’. On the head node we run Job Tracker/Name Node and a number of other related services (javascript, ftp etc).

Every 30 minutes we take a HDFS metadata snapshot using the “secondary name node” service; then we zip up the directory and transfer it into an Azure blob. This ensures that the HDFS metadata is physically not on the ‘head node’. The Name Node stores its metadata on the local VM instance hard drive (in c:\). [Emphasis added.]

Now the head node VM instance can fail in a number of ways;

  1. It can be rebooted
  2. The OS can be re-installed (OS is on d:)
  3. The VM [might] be ‘migrated’ to another machine

In the case of #1 and #2 there is no data loss and the Name Node will start up and continue on. But in the case of #3 we have lost the data on c:. #3 is a rare instance, and is not a part of normal operations, primarily results from hardware failure. While there is a very low [percentage] chance of this happening, it can happen. [Emphasis added.]

So now we’re in the state where the ‘head node’ VM is allocated for the second time. When the head node VM is started it first checks to see if HDFS metadata exists (by looking in c:\hdfs\nn) and then it checks to see if there is a metadata backup in [an] Azure blob. If there is a backup, it downloads and loads it; if there isn’t a backup file then it will format HDFS (which is the path taken the first time you start up the Hadoop instance).

This is a disaster recovery solution [for] any data written since the snapshot. Hadoop [v2] addresses this with a high availability solution which will replace this mechanism.

When we release (very shortly) our on premises preview[, Apache Hadoop for Microsoft Windows]; it [will] contain a similar backup mechanism that you can control by modifying the “Name Node Backup” service configuration file to specify your own Azure storage blob to backup your Hadoop HDFS metadata into the cloud as an offsite metadata backup solution. [Emphasis added.]


This post will be updated when more flightdata*.txt files have been uploaded to a replacement for the oakleaf storage account. The North Central US data center hosts Hadoop on Azure clusters, so I will create a new storage account there to reduce latency. The oakleaf storage account is in the South Central US (San Antonio, TX) data center.

Note: According to an earlier message of 4/8/2012 from Brad Sarsfield: “Plans are not final, but this may change in the next few months, likely to W[est] US or E[ast] US.”

Stay tuned.

