Google pushes for its robots.txt parser to become internet standard
Google LLC is pushing for its decades-old Robots Exclusion Protocol to be certified as an official internet standard, so today it open-sourced its robots.txt parser as part of that effort.
The REP, as it’s known, is a protocol that website owners can use to exclude web crawlers and other clients from accessing a site. Google said it’s one of the “most basic and critical components of the web” and that it’s in everyone’s interest that it becomes an official standard.
REP was first proposed as a web standard by one of its creators, the Dutch software engineer Martijn Koster, back in 1994, and has already become the de facto standard that’s used by websites to tell crawlers which parts of a website they shouldn’t process.
When indexing websites for its search engine, Google's Googlebot crawler typically scans the robots.txt file to check for any instructions on which parts of the site it should ignore. If it doesn't find a robots.txt file in a site's root directory, it simply assumes it's okay to index the entire site.
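That lookup can be sketched with Python's standard-library `urllib.robotparser` module, which implements the classic REP rules. The rules below are illustrative, not taken from any real site:

```python
# Minimal sketch of how a crawler might consult robots.txt,
# using Python's standard-library parser (urllib.robotparser).
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for illustration.
rules = """
User-agent: *
Disallow: /private/
Allow: /

User-agent: Googlebot
Disallow: /no-google/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Any crawler may fetch public pages, but not /private/.
print(parser.can_fetch("*", "/private/page.html"))    # False
print(parser.can_fetch("*", "/public/page.html"))     # True

# The Googlebot-specific group applies to Googlebot.
print(parser.can_fetch("Googlebot", "/no-google/x"))  # False
```

In a real crawler, `parser.set_url(...)` and `parser.read()` would fetch the file from the site's root instead of parsing an inline string.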
But Google worries that REP was never officially adopted as an internet standard, saying that the “ambiguous de-facto protocol” has been interpreted “somewhat differently over the years” by developers, and that this makes it “difficult to write the rules correctly.”
“On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files,” Google wrote on its Webmaster Central blog. “On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?”
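The BOM corner case Google mentions is easy to reproduce: a text editor that saves robots.txt as "UTF-8 with BOM" prepends an invisible byte-order mark, so a naive parser no longer sees `User-agent:` at the start of the first line. This sketch shows the problem and one tolerant decoding:

```python
# A UTF-8 BOM (\xef\xbb\xbf) prepended by a text editor hides the
# "User-agent" directive from a naive line-oriented parser.
raw = b"\xef\xbb\xbfUser-agent: *\nDisallow: /private/\n"

naive_first_line = raw.decode("utf-8").splitlines()[0]
print(naive_first_line.startswith("User-agent"))     # False: \ufeff prefix

# Decoding with utf-8-sig strips the BOM, as a lenient parser might.
tolerant_first_line = raw.decode("utf-8-sig").splitlines()[0]
print(tolerant_first_line.startswith("User-agent"))  # True
```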
To solve those problems, Google said, it has documented exactly how REP should be used with the modern web and submitted its proposal for it to become an official standard to the Internet Engineering Task Force, which is a nonprofit open-standards organization.
“The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP,” Google said. “These fine-grained controls give the publisher the power to decide what they’d like to be crawled on their site and potentially shown to interested users. It doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web.”
Analyst Holger Mueller of Constellation Research Inc. told SiliconANGLE that standards are vital for the internet to work properly, so it's good to see Google take the lead even on something as basic as REP.
“As with any open-source initiative and standardization attempt, we’ll have to wait and see what kind of uptake there is before we know if this is a success or not,” Mueller said. “It can also be something very self-serving, as Google is one of the biggest web crawlers itself. It’s an area to keep a watchful eye on.”
Image: Google/Twitter