Friday, August 24, 2007

Clarification of the Object Tracking Problem with LINQ to SQL's Eager Loading Feature

Scott Guthrie responded with the following comment to my suggestion (also in a comment) that he check my Eager Loading Appears to Cause LINQ to SQL Entity Table Problems post before writing his forthcoming post about Deferred/Eager Loading:

Hi Roger,

I'll go into more detail on lazy vs. eager loading in the post after next in my series.  Here are a few comments, though, that I posted in response to one of David's posts:

In cases where the relationship between two entities is (n:1), LINQ to SQL does a JOIN as opposed to separate queries since these tend to be safe from a data explosion perspective.  

In the case of a 1:n associations, LINQ to SQL only supports joining-in one 1:n association per query.

So for example, if you wrote code like below:

    DataLoadOptions options = new DataLoadOptions();
    options.LoadWith<Product>(c => c.Category);
    options.LoadWith<Product>(c => c.OrderDetails);
    options.LoadWith<OrderDetail>(o => o.Order);
    db.LoadOptions = options;
    IEnumerable<Product> products = db.Products.ToList<Product>();

You'd find that only 1 query is executed against the SQL database for everything (even though you are bringing back all of the Products and their associated Category and OrderDetail information).  This is because the product->category relationship is (n:1) and the OrderDetail->Order relationship is (n:1) and so both are automatically joined in.

The reason the relationship in your blog post above is generating multiple queries is that you have two (1:n) relationship (Customers->Orders) and (Orders->OrderDetails).  If you just had one (1:n) relationship (Customer->Orders) or (Orders->OrderDetails) LINQ to SQL would optimize and grab it in one query (using a JOIN).  

Hope this helps,

Scott

I noted in my post that one query per highest member of the object hierarchy probably wasn't a showstopper, but I'm reconsidering that conclusion. One of the primary goals of most data-oriented developers is to reduce server round trips for a particular activity to an absolute minimum. Three-level hierarchies are one of the most common object graphs you encounter in day-to-day development; think course:section:student or division:department:employee.

I've updated my earlier post with the comment that while the TopLevelRows + 1 query count for eager loading of two 1:n joins appears to be by design, it's not a very good design.

Suggested fix: Add a DataContext.AllowDataExplosions property that lets the developer enable an unlimited number of 1:n joins. Posted to Connect Feedback as Provide a DataContext.AllowDataExplosions Property for Eager Loading with Multiple 1:N Joins suggestion.

The Real Problem

The primary issue with LINQ to SQL's eager-loading loading feature seems to me to be the loss of object tracking when requerying the underlying tables. Deferred loading behaves in accordance with this statement in the "Object Identity" section LINQ to SQL: .NET Language-Integrated Query for Relational Data whitepaper by Dinesh Kulkarni, Luca Bolognese, Matt Warren, Anders Hejlsberg, and Kit George (March 2007):

Of course, if the object requested by the query is easily identifiable by its primary key as one already retrieved no query is executed at all. The identity table acts as a cache storing all previously retrieved objects.

Eager loading doesn't conform to this model.

I plan to enhance the sample code with additional tests of the identity table after I finish my LINQ to SQL Stored Procedure Test Harness project which demonstrates some unexpected behavior and limitations when using stored procedures to hydrate LINQ to SQL object graphs.

Stay tuned.