Today's question comes from Zurich.
Gary wants to know, "How does SafeSearch, both for text and
images, work?"
Well, I worked on the initial version of
SafeSearch for text.
So let's concentrate on that.
Don't want to give away anything that spammers could
use, but I can talk about way back in 2000 how SafeSearch
worked, so you can kind of get an idea.
And the idea is roughly what you would expect, which is we
look for certain words, and we give them certain weight.
And if you have enough words with enough weight, then we
sort of say, OK, this looks like it might be a sort of
*** or ***-related document.
And you can have various thresholds, where you can say,
OK, it might be safe at this level, but unsafe
once you get too many.
And you can do things like, well, if it's a book, if it's
a really long thing and it's got one word, that's not quite
as bad as if you have just like a very small document and
you have that same word.
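
[Editor's sketch] The weighting idea described so far can be illustrated roughly as follows. This is a minimal sketch, not the original SafeSearch code: the word list, the weights, the threshold values, and the function names `score` and `classify` are all made up for illustration. Dividing by document length is one simple way to make a single flagged word count less in a long book than in a very small page.

```python
import re

# Hypothetical weights, invented for this example only.
WEIGHTS = {"xxx": 5.0, "explicit": 3.0, "adult": 1.0}


def score(text, weights=WEIGHTS):
    """Return the summed word weights divided by the number of words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    total = sum(weights.get(word, 0.0) for word in words)
    # Normalize by length: the same flagged word matters less in a long document.
    return total / len(words)


def classify(text, unsafe_at=0.05, borderline_at=0.01):
    """Map a score onto levels; both thresholds are assumed values."""
    s = score(text)
    if s >= unsafe_at:
        return "unsafe"
    if s >= borderline_at:
        return "borderline"
    return "safe"


short_page = "xxx pics"
long_book = "xxx " + "word " * 999
print(classify(short_page), classify(long_book))
```

With these assumed numbers, the two-word page is flagged while the thousand-word document containing the same single word is not.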
And you can very much imagine that some words are worse and
more likely to be pornographic than other words.
So certain slang terms and, it turns out, misspellings, right?
So like amateur misspelled A-M-A-T-U-R-E is much more
likely to be amateur *** than amateur radio or something
along those lines.
But you do have to be careful, because there's words like
breast, which can be breast cancer, or
sex can be sex education.
So you do want to try to do some learning to figure out which
words should carry which weights, which words should
have more weight, and those sorts of things.
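
[Editor's sketch] One common way to learn word weights from labeled examples is a smoothed log-odds ratio, as used in naive Bayes text classifiers. The talk does not say which learning method was actually used, so this is a swapped-in technique for illustration only; the function names and the tiny example documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def learn_weights(unsafe_docs, safe_docs):
    """Return a log-odds weight per word: positive leans unsafe, negative leans safe."""
    unsafe = Counter(w for doc in unsafe_docs for w in tokenize(doc))
    safe = Counter(w for doc in safe_docs for w in tokenize(doc))
    vocab = set(unsafe) | set(safe)
    # Add-one smoothing so words seen in only one class still get a finite weight.
    n_unsafe = sum(unsafe.values()) + len(vocab)
    n_safe = sum(safe.values()) + len(vocab)
    return {
        w: math.log((unsafe[w] + 1) / n_unsafe) - math.log((safe[w] + 1) / n_safe)
        for w in vocab
    }


# Made-up training examples, far too small for real use.
unsafe_docs = ["amature xxx pics", "xxx amature video"]
safe_docs = ["breast cancer screening", "amateur radio club", "sex education class"]

weights = learn_weights(unsafe_docs, safe_docs)
print(weights["amature"] > 0, weights["cancer"] < 0)
```

On this toy data the misspelling ends up with a positive weight, while words that only appear in the safe examples, such as "cancer", end up negative.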
But it actually is relatively sophisticated in terms of
trying to figure out--
you can imagine doing a lot more than just pure content
analysis or using just straight words.
But at least to a first approximation, that's a pretty
good way to sort of classify something as *** or not.
One thing that I wanted to mention: if you go down
to the metadata for this video, we have a place where
you can click.
And if you think you have been detected as *** when you're
not pornographic, or you think you found a bug or an error
with SafeSearch, you can report that and pass that
information along.
And so people can adjust the algorithms or otherwise make
improvements so that we don't necessarily say that a site
that is really, really good is pornographic if it's not.
But you would be surprised at how well just doing some
pretty simple scanning with some relatively simple weights
can catch a large fraction of the *** on the web.
Previous search engines, just a little bit of historical
digression here, at least I remember in the early days,
AltaVista, you could search for sex and have their family
mode on, and they would have only like 20 results returned.
Because they had basically said, OK, we are only going to
allow these results for this query, or we're only going to
say these results are safe.
And the mental model that Google had was different.
We said, OK, if there's a mother, she's searching with
her Cub Scout son, would she be surprised, would she be
offended by the results?
But at the same time, you'd like to get the
comprehensiveness of the web.
So you'd like to score the entire web and find the
documents that are *** and exclude those.
But then if there's something about sex education or things
along those lines, you would like those to be returned.
So it's a pretty good approach.
It's worked very well.
And thankfully, there's a much better team of engineers who
are much more sophisticated in the ways that they analyze
pages now, so all of that original stuff that I wrote
back in 2000 I'm sure has been replaced by much better stuff
at this point.