Add a sitemap for Sourcegraph.com covering 400k+ Go symbols and packages
Created by: slimsag
This PR adds a sitemap generation tool and adds a sitemap to Sourcegraph.com with 405,164 API docs pages and sub-pages, covering a wide variety of Go symbols and packages. Two examples:
- https://sourcegraph.com/github.com/golang/go/-/docs/net/http (page)
- https://sourcegraph.com/github.com/golang/go/-/docs/net/http?CookieJar (sub-page)
The sitemap is generated by a tool that issues approximately 1.6 million GraphQL requests to discover all the pages and generate static sitemap.xml.gz files, which are then uploaded to a GCS bucket.
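To illustrate the last step of that pipeline, here is a minimal Go sketch of turning a list of discovered page URLs into gzip-compressed sitemap files. It is not the tool in this PR: the names `chunkURLs` and `gzipSitemap` and the output file naming are assumptions made for the example, and the GraphQL discovery and GCS upload are left out. The one hard constraint it reflects is from the sitemaps.org protocol, which caps each sitemap file at 50,000 URLs, so a sitemap of this size has to be split across several files.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/xml"
	"fmt"
)

// maxURLsPerSitemap is the per-file URL limit from the sitemaps.org protocol.
const maxURLsPerSitemap = 50000

// sitemapURL is a single <url> entry in a sitemap file.
type sitemapURL struct {
	Loc string `xml:"loc"`
}

// urlSet is the root <urlset> element of one sitemap file.
type urlSet struct {
	XMLName xml.Name     `xml:"urlset"`
	XMLNS   string       `xml:"xmlns,attr"`
	URLs    []sitemapURL `xml:"url"`
}

// chunkURLs splits urls into groups of at most size entries (hypothetical helper).
func chunkURLs(urls []string, size int) [][]string {
	var chunks [][]string
	for len(urls) > size {
		chunks = append(chunks, urls[:size])
		urls = urls[size:]
	}
	if len(urls) > 0 {
		chunks = append(chunks, urls)
	}
	return chunks
}

// gzipSitemap renders one chunk of URLs as gzip-compressed sitemap XML
// (hypothetical helper).
func gzipSitemap(urls []string) ([]byte, error) {
	set := urlSet{XMLNS: "http://www.sitemaps.org/schemas/sitemap/0.9"}
	for _, u := range urls {
		set.URLs = append(set.URLs, sitemapURL{Loc: u})
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(xml.Header)); err != nil {
		return nil, err
	}
	if err := xml.NewEncoder(zw).Encode(set); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	urls := []string{
		"https://sourcegraph.com/github.com/golang/go/-/docs/net/http",
		"https://sourcegraph.com/github.com/golang/go/-/docs/net/http?CookieJar",
	}
	for i, chunk := range chunkURLs(urls, maxURLsPerSitemap) {
		data, err := gzipSitemap(chunk)
		if err != nil {
			panic(err)
		}
		// The file name is made up; a real tool would upload data to the bucket here.
		fmt.Printf("sitemap_%04d.xml.gz: %d URLs, %d bytes\n", i, len(chunk), len(data))
	}
}
```

With 405,164 pages and the 50,000-URL limit, this chunking yields nine sitemap files, which would then be listed in a sitemap index file.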
Improving the quality of our SEO
Today, we have no sitemap at all (yes, really!), so this is a first, small step toward ensuring people can find Sourcegraph through Google. Analysis shows that many of the pages Google has indexed on Sourcegraph today are garbage pages, such as empty README files or very old commits in repositories that were somehow discovered by accident.
As such, I will begin the process of updating our metadata to instruct Google and others to not index many of the garbage pages they've indexed today.
Ensuring the pages we ask Google to index are high quality
The pages included in this sitemap are only about 30% of our pages. I've eliminated the remaining 70% or so because they do not meet the following relatively high-quality criteria:
- Is a public Go symbol/package.
- Has a description with >100 characters of text.
- Has at least one usage example.
The pages that remain come from just 2,778 repositories, with 6,159,920 GitHub stars total and 24,055 Go packages combined.
Only 6,247 symbols have a usage example from an external repository, so for now I have chosen not to filter down to just pages with an external usage example. I hope we'll index many more Go repositories very soon to remedy this, and then tighten our inclusion criteria to only symbols that have at least one external usage example.
Over half a million pages are excluded because they have zero usage examples, and a further half a million or so are excluded because they are not exported/public symbols. 114 Go repos are missing API docs; it is not yet clear why: https://github.com/sourcegraph/sourcegraph/issues/24539
Once we've indexed more repositories, we will be able to raise the bar of the inclusion criteria even further. This is just a small stepping stone in the much larger picture of ensuring that what we serve to Google and others is actually high quality. We don't want low-quality content being fed to Google: it's harmful to developers, and we haven't invested enough resources in improving this to date. With a bit of effort, we should be able to make big improvements and ensure any Sourcegraph link a developer comes across is high quality and truly useful.