Sunday, July 11, 2010

Archive and Mine Tweets In Azure Blobs with The Archivist Application from MIX Online Labs

Update 7/10/2010: Partially explained the date/time strangeness noted in the 7/4/2010 update as the result of Lotus 1-2-3 and Excel processing 1900 as a leap year, which it is not (see Figure 3 and below).

Update 7/8/2010: Karsten Januszewski (@irhetroric) explains How The Archivist Polls Twitter in his 7/7/2010 post to the MIX Online Labs blog.

Update 7/4/2010: Clarified PowerPivot’s problem with DateTime values in Excel data sources (see Figures 2 and 3) and added Tim’s Twitter user name.

imageKarsten Januszewski (@irhetoric) and Tim Aidlen (@Systim) appeared in a 00:22:41 Channel9 The Archivist: Your friendly neighborhood tweet archiver video segment of 6/28/2010, in which the pair explained the the objectives, structure, and data flow of their Azure-based The Archivist application for MIX Online Labs:

The Archivist is a new lab/website from MIX Online that lets people archive, analyze and export tweets. Here’s a little more about why we built The Archivist and who we built it for.

Sounds good, guys. Now, let's go learn about what this really means and how/why Karsten and Tim built this Twitter information analysis and archival service.

Learn more: http://visitmix.com/LabNotes/Intro-To-The-Archivist

The Archivist demonstrates Windows Azure’s potential for ad-hoc cloud data storage, data mining, mashup, and business intelligence (BI) applications. The project offers an API and you can download the source code from the MSDN Code Gallery, which describes the project as:

A Windows Azure service and ASP.NET MVC website that allows you to monitor Twitter, archive tweets, data mine and export archives.

The Archivist’s Architecture

The documentation says little about the project’s architecture, so here’s a data flow diagram I made from Karsten’s whiteboard drawing and narration:

image Figure 1

If you download a CSV file, open it in Excel 2010, and save it as an *.xlsx file, you can take advantage of PowerPivot to analyze large worksheets:

imageFigure 2

The following PowerPivot example uses a row filter on my Twitter user name:

image Figure 3

Update 7/10/2010: PowerPivot appears to interpret the fractional time component of an Excel DateTime value (see Figure 2) as the date equivalent of 0.#, where # is the fractional part of a double value. An Excel double (General Number) value of 1.0 corresponds to a DateTime value of 12/31/1899 12:00:00 AM, not the expected 1/1/1900 12:00:00 AM (see below), so PowerPivot treats 0.# as 12/30/1899, not 12/31/1899.

Date/Time conflicts between Excel’s and PowerPivot’s treatment of 0 as a null date and 0.# as a time value with a null date cause 0.# to appear as 12/30/1899 in PowerPivot instead of HH:MM:SS.

Following are tests of Excel VBA date/time expression in Excel 2010’s Immediate Window:

image  Figure 4

Excel VBA doesn’t recognize 2/29/1900 as a valid date because 1900 wasn’t a leap year. Executing the ? Format(#2/29/1900#, "General Number") expression throws a “Compile error. Expected Expression” exception and there is only a difference of 1 between the Double values of 2/28/1900 and 3/1/1900.

See Wikipedia’s Leap Year topic for rules to determine which century years are leap years:

Years that are evenly divisible by 100 are not leap years, unless they are also evenly divisible by 400, in which case they are leap years.For example, 1600 and 2000 were leap years, but 1700, 1800 and 1900 were not. Similarly, 2100, 2200, 2300, 2500, 2600, 2700, 2900 and 3000 will not be leap years, but 2400 and 2800 will be.

However, the Excel expression service does recognize 1900 as a leap year, as demonstrated by the expression is cells A1 and B1 of this worksheet. Ash Wednesday is 46 days before Easter Sunday:

image Figure 5

Microsoft Knowledge Base’s Excel 2000 incorrectly assumes that the year 1900 is a leap year (Article ID: 214326 - Last Review: January 24, 2007 - Revision: 5.4) states:

Symptoms: Microsoft Excel incorrectly assumes that the year 1900 is a leap year. This article explains why the year 1900 is treated as a leap year, and outlines the behaviors that may occur if this specific issue is corrected.

More Information: When Lotus 1-2-3 was first released, the program assumed that the year 1900 was a leap year, even though it actually was not a leap year. This made it easier for the program to handle leap years and caused no harm to almost all date calculations in Lotus 1-2-3.

When Microsoft Multiplan and Microsoft Excel were released, they also assumed that 1900 was a leap year. This assumption allowed Microsoft Multiplan and Microsoft Excel to use the same serial date system used by Lotus 1-2-3 and provide greater compatibility with Lotus 1-2-3. Treating 1900 as a leap year also made it easier for users to move worksheets from one program to the other.

Although it is technically possible to correct this behavior so that current versions of Microsoft Excel do not assume that 1900 is a leap year, the disadvantages of doing so outweigh the advantages.

Note: Microsoft Excel correctly handles all other leap years, including century years that are not leap years (for example, 2100). Only the year 1900 is incorrectly handled.

It appears to me to be up to the PowerPivot team to correct or at least document the problem.

Sample Public Archives

I created a public archive of my tweets (@rogerjenn), as well as tweets with #Azure and #SQLAzure hastags starting on 7/3/2010. You can access them from these links:

Note: Twitter provides access only to the last two weeks of tweets (less for trending topics) and The Archivist permits a maximum of three simultaneously active archives.

As you can see, I was the Top User of the #SQLAzure and #OData hashtags on 7/3/2010:

imageFigure 6

imageFigure 7

(I was #16 of the Top Users for the #Azure hashtag.)

Archived tweets appear in a list below the Chart Controls that’s sorted by date and time descending:

image

Visualizations

The site includes a Visualizations page that explains how to use the ASP.NET Chart Controls, which Microsoft acquired from Dundas Software, shown in the preceding captures. Clicking a control blows it up to full page size:

image 

blog comments powered by Disqus