Wednesday, January 09, 2008

Dare Obasanjo Compares LINQ with Python for List Processing

Dare's been working with IronPython lately but occasionally lapses into C# 3.0 and LINQ. He recently encountered Brad Fitzpatrick's Naming twins in Python & Perl blog post and decided to write the corresponding LINQ routine to find anagrams in the most common U.S. first names for males. You'll find Brad's Python and Dare's C# 3.0 code in Dare's Finding Names that are Anagrams of Each Other in C# 3.0 post of January 8, 2008.

Update 1/9/2008: Brad Fitzpatrick is the founder of LiveJournal, a proponent of OpenID and Google's OpenSocial initiative, and the Google representative to the DataPortability.org Workgroup. The Workgroup announced on January 8, 2008 that Brad, Benjamin Ling of Facebook, and Joseph Smarr from Plaxo had joined.

The U.S. Department of the Census publishes the rank and frequency distributions of 1,219 male (dist.male.first) and 4,275 female (dist.female.first) first names for the U.S. population, so I picked up the two files and instrumented Dare's code with a counter and Stopwatch timer. Following are the details for the 1990 Census' first names:

Last 10 Male Names with Anagrams

ANTOINE ANTIONE
REUBEN RUEBEN
COLE CLEO
DION DINO
CARMELO MARCELO
CARLO CAROL
FOREST FOSTER
ARIEL ARLIE
CHRISTOPER CRISTOPHER
OLIN LINO
Names with anagrams: 38, Elapsed time: 97 ms.

Compare the 97 ms. value with that for Python (43 ms) and Perl (26 ms) at the end of Brad's post.

Last 10 Female Names with Anagrams

RAISA SARAI
TERISA SERITA
NA AN
MAN NAM
JEANICE JANIECE
LASHON SHALON
UN NU
GALINA ANGILA
SULEMA SAMUEL
LUBA BULA
Names with anagrams: 410, Elapsed time: 557 ms.

If you don't write the names to the console, elapsed times are 47 and 130 ms., respectively.

You can download my instrumented version with both name lists here.

Dare concludes in his earlier Python vs C# 3.0: Tuples vs. Anonymous Types (Redux) post of January 5, 2008:

What I find interesting about this is that even though I’ve been using C# for the past five or six years, I feel like I have to relearn the language from scratch to fully understand or be able to take advantage the LINQ features. Perhaps a few stints as a SQL developer may be necessary as well?

I'm not certain that SQL development experience is all that helpful in coming to grips with LINQ query expressions. In fact, it might be a negative influence.

Update 1/9/2008: Mike Champion noted in his Rough Spots in the LINQ to XML Learning Curve post of November 13, 2006, while he was LINQ to XML Program Manager in SQL Data Programmability:

One overall pattern we've detected is worth noting: The less you know about current XML technologies, the faster you learn LINQ to XML.  The intuition of participants in our studies who know DOM, SAX, XmlReader, etc. is that the problems we posed them should be hard to solve, so they tended to roll up their sleeves and start writing a bunch of imperative code. Those who didn't know much about XML except that it is a tree of elements and attributes could use their SQL and C# intuition and found more elegant solutions.  On the other hand, the really experienced developers fairly quickly learned the core LINQ design patterns and began to apply them effectively to new problems.

I believe the same applies to T-SQL experts when working with LINQ to Objects, SQL, or Entities.

Note: Dimitre Novatchev, who maintains the FXSL: the Functional Programming Library for XSLT project on SourceForge, came up with a similar transform using XSLT. His example from an April 17, 2007 XSLT Text Processing: Fun with Anagrams post uses an 46379-word English dictionary list that requires 280 seconds of preprocessing to create an index. Finding the six anagrams of just the word trace takes 78 milliseconds.

A few of the alternative approaches in comments to Brad's post have elapsed times less than 47 ms.

0 comments: