Thursday, November 17, 2005

Google Base and Atom 0.3 Bulk Uploads

Adding Web pages or blog posts as News or Articles or Reference Articles item types to Google Base is problematic for content owners. Google's draconian Terms of Service for the content you upload gives Google carte blanche to "reproduce, modify, adapt, publish, and otherwise use, with or without attribution such Content," as well as "to use your trademarks, service marks, trade names, proprietary logos, domain names and any other source or business identifiers."

I'm certainly not enthusiastic about the Google folks modifying or adapting and then publishing my content without attribution. ZDNet's Garrett Rogers discusses these and related issues in his November 16, 2005 post, "Google Base: Preparing for the worst?"

Note: If you're searching on "Google Base," you're likely to see references to "All your base are belong to us," an idiosyncratic Internet message (apparently in pidgin, akin to "him belong me") that's explained in this lengthy Wikipedia entry. It's an interesting sidelight that Al Qaeda means "the base" in Arabic. The Wiki entry also mentions common derivatives, such as "all your data are belong to us," which is more germane to the Terms of Service issue.

Despite my misgivings about Google's Terms of Service, I decided to invest a few hours of Visual Studio 2005 programming time to clean up Blogger's Atom 0.3 XML file for this site, add some optional tags and attributes, and publish it to Google Base. I'll provide details on the VB 2005 code I used to manipulate the Atom.xml XmlDomDocument object in a future post. As I mentioned in the "Initial Conclusions" section of my earlier "Google Base and Bulk Uploads with Microsoft Access" post:

Initial tests with an Atom 0.3 (Atom.xml) file generated by Blogger for the OakLeafBlog, saved as a local XML file with FireFox 1.5 RC2, and Bulk Uploaded as the Reference Articles item type showed several problems. The description attribute contains HTML markup and error messages state that the value is limited to a maximum of 10,000 characters. Thus only the shorter OakLeafBlog articles publish to the list; HTML markup contributes substantially to description length. Help Center's "What do I include in 'Description'?" topic says "Please ensure that the description does not contain any HTML as we don't currently recognize or display HTML tags in your item." Help Center also says the maximum description length is 1,000 characters.
I was surprised by the inconsistencies between the help topics and the result of an initial test with a moderate-size Atom.xml document from a Google application (Blogger). So I temporarily increased the size of the main page to include all OakLeafBlog posts (50 as of this post), which would permit more complete tests and let me evaluate issues that relate to creating Google Base-enabled XML files.

Note: Atom 1.0 is the current version of the Atom specification, and Tim Bray says it's awaiting an IETF RFC number as of mid-November 2005. Blogger continues to use the outdated Atom 0.3 spec for syndication, as does the Google Base - Atom 0.3 Specification page. Venture capitalist Bill Burnham says in his "RSS and Google Base: Google Feeds Off The Web" post, "Google intends to build the world's largest RSS 'reader' which in turn will become the world's largest XML database." Alan Wood at Folknology refers to Adam Bosworth's MySQL presentation and provides an Atom-oriented analysis. Most readers treat RSS and Atom feeds similarly.

The ultimate objective of this exercise is to determine whether any benefits accrue to Web site publishers (or, for this example, bloggers) by publishing copies of linked content on Google Base. Much of the initial Google Base content (such as real-estate listings) consists of links to existing Web pages. Presumably, Google will have spidered the source site's pages previously. Technorati's Niall Kennedy posits:
Why should you go to the trouble of submitting your information to Google Base? You will be completely sure that Google has all your latest content complete with the appropriate link back to your site. Feeding the content directly to Google may help your posts place better in Google search results.
Whether posts uploaded to Google Base gain precedence in Google search results remains to be seen. Mine haven't so far.

Update 11/22/2005: John Markoff and Michael Barbaro of the New York Times report that Google Base can now provide a local version of Google's Froogle shopping service. There's no announcement of the new feature in the Google Blog; the Google Base Blog still wants a username and password for access. InfoWorld's Jon Udell posts "Dueling simplicities," which analyzes the potential relationships between Microsoft's proposed Simple Sharing Extensions for RSS and OPML (SSE) specification, Adam Bosworth's "Learning from the Web" presentation, and Google Base. This post follows "The two-way data web" article, which was written before (but published after) the release of the SSE specification.

Update 11/30/2005: The "official" Google Base blog, http://googlebase.blogspot.com/, added an entry with tips on bulk-uploading items.

Completing Your Personal Profile
If you have or create a Google account, which you need for most Google applications, you'll probably find it worthwhile to add the additional default attribute values that apply to Google Base only. See this section in the preceding "Google Base and Bulk Uploads with Microsoft Access" post for details.

Creating the Raw XML Bulk Upload File
The http://oakleafblog.blogspot.com/atom.xml document contains data for 50 posts (<entry> groups) in a 498-KB file for an average of about 10,000 characters per <entry>. FireFox 1.5 RC2 displays the HTML tags in the <content> elements, as shown here, which transform to Google Base description attribute values:
FireFox 1.5's View Page Source command displays the Atom 0.3 source code and enables saving the Atom 0.3 source code to a physical file, which is required for bulk XML file uploads:


The stylesheet employed by Internet Explorer 5+ strips the HTML markup from the XML document's content element but won't display or enable saving the unformatted <content> value locally, as shown here:


Thus, you'll need to substitute FireFox for IE to generate and save a file—OakLeafBlogAtom.xml for this example—for the Bulk Upload operation. (Only FireFox 1.5 RC2 and RC3 have been tested to date.)
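Before uploading the saved file, it can help to check how many entries it holds and which ones are likely to exceed the 10,000-character description limit. The following Python sketch is only an illustration, not the VB 2005 code used for this post; the function name summarize_feed and the SAMPLE feed are hypothetical, and it assumes each content element uses mode="escaped", so its parsed text is the raw HTML string.

```python
import xml.etree.ElementTree as ET

# Atom 0.3 namespace, as declared on the feed element of Blogger's Atom.xml.
ATOM_03_NS = "http://purl.org/atom/ns#"


def summarize_feed(xml_text, limit=10000):
    """Return (entry_count, average_content_length, titles_over_limit).

    Assumes <content mode="escaped">, so the element text is the HTML string.
    """
    root = ET.fromstring(xml_text)
    entries = root.findall(f"{{{ATOM_03_NS}}}entry")
    lengths = []
    too_long = []
    for entry in entries:
        content = entry.findtext(f"{{{ATOM_03_NS}}}content", default="")
        title = entry.findtext(f"{{{ATOM_03_NS}}}title", default="").strip()
        lengths.append(len(content))
        if len(content) > limit:
            too_long.append(title)
    average = sum(lengths) / len(lengths) if lengths else 0
    return len(entries), average, too_long


# Hypothetical two-entry feed: one short post and one overlength post.
SAMPLE = (
    '<feed xmlns="http://purl.org/atom/ns#" version="0.3">'
    "<entry><title>Short post</title>"
    '<content mode="escaped" type="text/html">&lt;p&gt;Hello&lt;/p&gt;</content>'
    "</entry>"
    "<entry><title>Long post</title>"
    '<content mode="escaped" type="text/html">' + "x" * 12000 + "</content>"
    "</entry>"
    "</feed>"
)

count, average, too_long = summarize_feed(SAMPLE)
```

For a real file, pass the bytes read from the saved OakLeafBlogAtom.xml to summarize_feed instead of SAMPLE.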

Uploading the Atom 0.3 XML File as a Reference Articles Item Type
The Specify a Bulk Upload page's Choose an Existing Type list doesn't offer the News and Articles Item Type, which would be more appropriate for a list of blog posts. (News and Articles and Wanted Item Types appear in the Choose an Existing Item Type list on the Post an Item page for ad hoc items.) News and Articles supports the following standard attributes, in addition to title and description: author, expiration_date, label, news_source, pages, and publish_date. (It's unfortunate that Google didn't adopt standardized metadata terms, such as those of the Dublin Core Metadata Initiative, DCMI.)

Note: Niall Kennedy's "Google Base blog import instructions" post describes for Movable Type or TypePad Pro users how to output your last n blog posts to an Atom.xml file with his Movable Type Google Base template.

Thus, you're stuck with Reference Articles, which doesn't include several attributes that would be useful for qualifying searches. Reference Articles (presumably included in the "Research Studies and Publications - scholarly literature" Information Type) appear to be limited to author, expiration_date, label, pages, publication_name, publication_volume, and publish_date. (Where is publication_number?) However, you can use the Google Base Provider Namespace to define your own custom attribute taxonomy in the Atom 0.3 document.

Update 11/25/2005: You're no longer stuck with Reference Articles as the Item Type for Blogger Atom 0.3 feeds. The Bulk Upload page's Choose an Existing Type list now includes News and Articles and Wanted Ads Item Types. Google also added Blogs, Coupons, Rentals, and Comic Books as standard search categories to the default home page. Rapid ad hoc changes like this demonstrate another advantage of Web-based services.

The process for uploading an Atom 0.3 XML file is similar to that for uploading a tab-separated value text file to create a list of the Products Item Type:

1. After logging in with your Google account, navigate to the Google Base home page and click the Post Multiple Items with a Bulk Upload File link to open the My Items page.

2. Click the Specify a Bulk Upload File link, type the FileNameAtom.xml file name in the text box, select Reference Articles in the Item Type list, and click Specify Bulk Upload File to open the My Items page.

3. Click Browse, navigate to and double-click the file you saved with FireFox to specify it as the source of the registered FileNameAtom.xml file, as shown here:


4. Click Upload and Process This File. Wait a few minutes (or hours), and then press F5 to determine the publication status of the file. If you can't stand the wait, click the Active Items link after it displays a count of 1 or more to review unpublished items in the list:

5. Click one of the Edit links to display the item in the standard editing form for the Reference Articles Item Type:



Notice the HTML markup in the Description attribute textarea. This example has a substantially lower proportion of markup characters to content than most OakLeafBlog posts. It would be possible—but certainly tedious—to remove the tags manually and add Details attribute-value pairs and Labels keywords tags.

Viewing the Items as a Google Base User
To emulate a search by an ordinary Google Base user, follow this drill:

1. Sign out of your account, navigate to the Google Base home page, type a unique search term, such as xlinq for OakLeafBlog posts, and click Search Base to display the results. Alternatively, click here.


As expected, clicking the OakLeaf Consulting link or here displays all active items for authorid=1063521.

2. Click one of the titles to open the linked page whose URL appears in green, or click here.

Fixing Feed Errors
The inclusion of HTML markup in the description attribute isn't a problem for ordinary users, because they don't see the attribute value. However, large amounts of markup combined with lengthy content can result in failure to post overlength entry groups. In this case, the My Items page displays an error message:


Note: It might take several hours for the preceding warning to appear. Bulk Updates don't occur in real time.

Clicking the Details link displays this page with error messages:


To overcome this problem, you must edit the content element of overlength entries, remove the HTML tags, test for content length, and then trim the string value if it's more than 10,000 characters.
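The cleanup steps just described (strip the tags, test the length, trim to 10,000 characters) can be sketched in a few lines of Python. This is a stand-in illustration rather than the VB 2005 code used for this post; the names clean_description, BLOCK_TAGS, and MAX_DESCRIPTION_CHARS are hypothetical, and the 10,000-character limit is the one reported by the upload error messages, not the 1,000 quoted in Help Center.

```python
from html.parser import HTMLParser

# Limit reported by the bulk-upload error messages (an assumption, since
# Help Center quotes a different figure).
MAX_DESCRIPTION_CHARS = 10000

# Tags treated as word boundaries so adjacent paragraphs don't run together.
BLOCK_TAGS = {"p", "br", "div", "li", "ul", "ol", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6", "tr", "td"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def clean_description(html_text, limit=MAX_DESCRIPTION_CHARS):
    """Remove HTML tags, collapse whitespace, and trim to the limit."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text[:limit] if len(text) > limit else text


short = clean_description("<p>Hello <b>world</b></p><p>Bye &amp; thanks</p>")
trimmed = clean_description("<p>" + "a" * 12000 + "</p>")
```

A hard cut at the limit can end mid-word; trimming back to the last space before the limit would be a reasonable refinement.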

Serious Bug in g:label Custom Attributes Documentation
Google has created its own taxonomy of Atom 0.3 extensions that's identified by an xmlns:g="http://base.google.com/ns/1.0" namespace attribute added to the feed element. The Google Base - Atom 0.3 Specification page includes an example of use of this namespace to add several predefined elements (g:image_link, g:expiration_date, g:job_function, g:location, and g:label) to specify non-standard attributes for a specific Item Type. The example for the <g:label> elements is incorrect. The label item of the Google Base - XML Attributes page has the same error.

Following is an abbreviated version of a Blogger Atom 0.3 test file with the Google Base extension namespace attribute and multiple g:label elements added in accordance with the preceding XML document example and attribute specification. Technorati tag names provide the values of the multiple g:label elements.

<?xml version="1.0" encoding="UTF-8"
standalone="yes"?>
<?xml-stylesheet
href="http://www.blogger.com/styles/atom.css"
type="text/css"?>
<feed xmlns="http://purl.org/atom/ns#"
version="0.3" xml:lang="en-US"
xmlns:g="http://base.google.com/ns/1.0">
<link href="https://www.blogger
  .com/atom/11646261"
rel="service.post" title="OakLeaf Systems"
type="application/atom+xml" />
<link href="https://www.blogger
  .com/atom/11646261"
rel="service.feed" title="OakLeaf Systems"
type="application/atom+xml" />
<title mode="escaped" type="text/html">
OakLeaf Systems
</title>
<tagline mode="escaped" type="text/html">
OakLeaf Systems is a Northern California
software consulting organization specializing
in developing and writing about Microsoft SQL
Server/.NET database and Web services projects.
</tagline>
<link href="http://oakleafblog.blogspot.com"
rel="alternate" title="OakLeaf Systems"
type="text/html" />
<id>tag:blogger.com,1999:blog-
11646261</id>
<modified>2005-11-19T14:32:23Z</modified>
<generator url="http://www.blogger.com/"
version="5.15">
Blogger
</generator>
<info mode="xml" type="text/html">
<div xmlns="http://www.w3.org/1999/xhtml">
  This is an Atom formatted XML site feed.
  It is intended to be viewed in a Newsreader
  or syndicated to another site. Please visit
  Blogger Help for more info.
</div>
</info>
<convertLineBreaks
xmlns="http://www.blogger.com/atom/ns#">
true
</convertLineBreaks>
<entry xmlns="http://purl.org/atom/ns#">
<link href=
  "https://www.blogger.com/atom/
    11646261/113227683042337068"
  rel="service.edit"
  title="Google Base and Atom 0.3 Bulk Uploads"
  type="application/atom+xml" />
<author>
  <name>--rj</name>
</author>
<issued>2005-11-17T16:34:00-08:00</issued>
<modified>2005-11-18T21:48:40Z</modified>
<created>2005-11-18T01:20:30Z</created>
<link
  href="http://oakleafblog.blogspot.com/2005/11/
  google-base-and-atom-03-bulk-uploads.html"
  rel="alternate"
  title="Google Base and Atom 0.3 Bulk Uploads"
  type="text/html" />
<id>tag:blogger.com,1999:blog-11646261.
  post-113227683042337068</id>
<title mode="escaped" type="text/html">
  Google Base and Atom 0.3 Bulk Uploads
</title>
<content mode="escaped" type="text/html"
  xml:base="http://oakleafblog.blogspot.com"
  xml:space="preserve">
  Content with HTML tags removed.
</content>
<draft xmlns="http://purl.org/atom-blog/ns#">
  false
</draft>
<g:label>Databases</g:label>
<g:label>Google Base</g:label>
<g:label>XML</g:label>
<g:label>Atom</g:label>
<g:label>RSS 2.0</g:label>
<g:label>Google</g:label>
</entry>
</feed>

Note: Some line-breaks have been inserted at illegal positions to prevent exceeding the left frame width limit.

Click here for a more readable version of the preceding sample file from Google Groups (in print format).

Uploading the complete 256-KB file as a Reference Article resulted in a Failure status report in the My Items page with a single instance of "Bad data" as the reason for the failure. The Upload page reported 0 Items Processed, 0 Items Succeeded, and 0 Active Items. However, after a few hours (overnight), the Active Items page reported all items had Published status. (The Upload page data didn't change.)

Fixing the g:label Attribute Specification Bug
Opening in the Edit page the few entries that had a single Technorati tag—and thus a single <g:label> element, typically LINQ—showed the tag name in the Label textarea. The text associated with the Label control suggests "Keywords or phrases that describe your item. Maximum of 10. Separate with commas." Based on this hint, I changed the <g:label> elements from:


<g:label>Databases</g:label>
<g:label>Google Base</g:label>
<g:label>XML</g:label>
<g:label>Atom</g:label>
<g:label>RSS 2.0</g:label>
<g:label>Google</g:label>

to:

<g:label>
Databases, GoogleBase, XML, Atom, RSS 2.0, Google
</g:label>
 
This change solved the Failure problems, reported Success as the status, and processed all 50 items, as shown here:
Note: The Google Base - XML Attributes page's image item states that a comma-separated list (such as <g:label> leater, power locks, sunroof, ABS </g:label>) is Not acceptable. (It's doubtful that the list isn't acceptable because of leading or trailing spaces or a missing "h" in "leater.")

The fix to the g:label attribute format also fixed the missing Labels entries problem with multiple <g:label> elements, as shown here:


The Label tags appear on the edit page immediately after Google processes the upload, so you don't need to wait for Published status to test your editing application.
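The g:label fix described in this section, collapsing several <g:label> elements into one comma-separated element, can be sketched as follows. This Python is an illustrative stand-in for the VB 2005 editing application, which isn't shown here; the function name merge_labels and the SAMPLE feed are hypothetical, and the cap of 10 labels comes from the hint text on the Edit page.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://purl.org/atom/ns#"    # Atom 0.3 namespace
G_NS = "http://base.google.com/ns/1.0"  # Google Base extension namespace
MAX_LABELS = 10                         # "Maximum of 10" per the Edit page hint

# Keep the original prefixes when the tree is serialized again.
ET.register_namespace("", ATOM_NS)
ET.register_namespace("g", G_NS)


def merge_labels(xml_text):
    """Replace multiple g:label elements in each entry with one
    comma-separated g:label element; return the feed as a string."""
    root = ET.fromstring(xml_text)
    label_tag = f"{{{G_NS}}}label"
    for entry in root.findall(f"{{{ATOM_NS}}}entry"):
        labels = entry.findall(label_tag)
        if len(labels) < 2:
            continue  # zero or one label already uploads correctly
        names = [(label.text or "").strip() for label in labels]
        names = [name for name in names if name][:MAX_LABELS]
        for label in labels:
            entry.remove(label)
        merged = ET.SubElement(entry, label_tag)
        merged.text = ", ".join(names)
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-entry feed using the multiple-element format.
SAMPLE = (
    '<feed xmlns="http://purl.org/atom/ns#" '
    'xmlns:g="http://base.google.com/ns/1.0" version="0.3">'
    "<entry><title>Google Base and Atom 0.3 Bulk Uploads</title>"
    "<g:label>Databases</g:label>"
    "<g:label>Google Base</g:label>"
    "<g:label>XML</g:label>"
    "</entry></feed>"
)

merged_feed = merge_labels(SAMPLE)
```

Entries with a single label are left untouched, which matches the observation that single-tag entries already displayed correctly in the Label textarea.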

Use Labels to Refine Google Base User Searches
When you add Label tags to your entries, users can refine their searches by clicking links that return entries that match all tags for an entry, as shown here:


Notice that comma-separated Names (tag) values appear under the Titles.
Click here to open the preceding interactive Google Base page, click the More... link to display all Names combinations, and try the various refinement choices. Click the publisher's moniker—Roger Jennings for this example—or click here to display a list of all items (not just Reference Articles) contributed by the publisher (authorid=1071203).

Conclusion
Google needs to clean up its Atom 0.3 documentation to minimize developers' wild-goose chases. The current (beta) UI undoubtedly will confuse potential users. For example, I would not have known the benefit of adding Name tags to search refinement if I hadn't written a simple VB.NET 2005 project to clean up the description attribute (<content> element) value and add the Google Base namespace and a <g:label> element in the correct format.

Robert Niles, editor of the USC Annenberg Online Journalism Review, concludes: "Right now, the UI is geared more toward people upload information than those looking for it." BusinessWeek's Rob Hoff thinks folks are "Ganging Up on Google." The Solution Watch blog offers a positive review and links to other detailed Google Base reviews.

The World Resources Institute (WRI) claims to have submitted information to Google Base "on a 5 million-record database on sustainable development for 200 countries over a period of up to a century." However, a search of Google Base on "World Resources Institute" returns only 4,253 items that were entered between November 15, 2005 and November 28, 2005 as Research Studies and Publications Item Type. This Item Type appears to have been replaced by Reference Articles. The status of the remaining 4.996 million (purported) items isn't clear as of December 2, 2005.

Watch for updates to this post as other developers add their content to Google Base and keep an eye on the Google Base Help Discussion group to see what problems users encounter.
