Today's question comes from the middle
of America in Kansas.
John Heard wants to know, "Why does Google take so long to
naturally remove 404 URLs?" It's a good question.
So in theory, 404s could be transient.
A page could be missing, and then come back later.
Technically, if you really want to signal that this page
is completely gone and will never come back, there's an
HTTP status code called 410, 4-1-0.
But at least last time we checked back in 2007, we
actually treated those the same.
But to get to the meat of your question, why does it take so
long, the answer is webmasters can do kind
of interesting things.
And we sometimes see webmasters shoot
themselves in the foot.
Like, they'll completely remove their site from the
search results.
Or they'll be down and returning 404s instead of
something like a 503 that says come back later.
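To make the status codes mentioned so far concrete, here is a minimal sketch of how a site might choose between them. The function name `choose_status`, the example paths, and the one-hour `Retry-After` value are all assumptions for illustration, not anything stated in the answer.

```python
from http import HTTPStatus

# Hypothetical site state, for illustration only.
PERMANENTLY_REMOVED = {"/old-promo"}
LIVE_PAGES = {"/", "/about"}


def choose_status(path, maintenance=False, retry_after_seconds=3600):
    """Return a (status, headers) pair for a request path."""
    if maintenance:
        # The site is temporarily down: 503 asks clients to come back later.
        headers = {"Retry-After": str(retry_after_seconds)}
        return HTTPStatus.SERVICE_UNAVAILABLE, headers
    if path in PERMANENTLY_REMOVED:
        # 410 signals the page is gone and is not expected to return.
        return HTTPStatus.GONE, {}
    if path in LIVE_PAGES:
        return HTTPStatus.OK, {}
    # 404 only says the page is missing right now.
    return HTTPStatus.NOT_FOUND, {}
```

The maintenance check comes first so that an outage never reports live pages as missing, which is the mistake described above.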
And so, rather than learn very quickly, this is a 404, make
it drop out forever, usually you prefer to build in a
little bit more leeway there, so that if a webmaster is
making a mistake, you can check a few times and make
sure that it really is gone before you drop
it out of the index.
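The idea of checking a few times before dropping a page can be sketched roughly as below. This is not Google's actual logic: the threshold `MISSES_BEFORE_DROP`, the `UrlRecord` type, and the `record_crawl` function are hypothetical, and the answer gives no real numbers. The sketch treats 404 and 410 identically, as described for 2007.

```python
from dataclasses import dataclass

# Hypothetical threshold; the answer does not give an actual number.
MISSES_BEFORE_DROP = 3


@dataclass
class UrlRecord:
    url: str
    consecutive_misses: int = 0
    indexed: bool = True


def record_crawl(record, status):
    """Update a record after one crawl attempt and return it."""
    if status in (404, 410):
        # Count the miss, but only drop after repeated confirmation.
        record.consecutive_misses += 1
        if record.consecutive_misses >= MISSES_BEFORE_DROP:
            record.indexed = False
    elif status == 200:
        # The page is back: reset the counter and keep it indexed.
        record.consecutive_misses = 0
        record.indexed = True
    # Other statuses (for example 503) leave the record unchanged here.
    return record
```

With this kind of leeway, a single 404 from a temporary glitch does not remove the page, and a later 200 resets the count.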
Now, it's always tricky because if you get it wrong
one way, people are unhappy.
If you get it wrong the other way, people are unhappy.
So we try to find a balance based on the feedback that we
hear, the complaints that we hear, what people are happy about,
what they're sad about, to try to sort of find, maybe we'll
try this page a few more times and make sure
that it's really gone.
And otherwise, you would hate it if you had a temporary
glitch with your web server, and then Google didn't come
back and check on that web server for like three years or
something like that.
So it is the sort of thing where we try to find a good
balance there.
Thanks for the feedback, though.
I can always talk to the crawl team and find out, do the 410s
really make things go away faster now?
Or are they still treated the same?
But at least for the time being, we try to build in that
safety margin so that if webmasters do make a mistake,
if their server's overloaded, if their web host configured
something incorrectly, it won't sabotage, it won't cause
long-term damage, and there will be a way to recover.