Today's question comes from Cardiff, UK.
Tristan Perry asks, "Hi Matt.
What resources--
textbooks, online PDFs, et cetera--
would you recommend to people interested in learning more
about LSI, semantic indexing, search engine algorithms, et
cetera?"
That's a really fun question.
I think one of the things I would do is hunt down
the original PageRank papers, the original
Google papers.
So there's "The Anatomy of a Large-Scale Hypertextual Web
Search Engine," and
then also a bunch of papers about PageRank.
Those are really, really good.
There are also a couple of textbooks that I recommend.
One is Modern Information Retrieval.
That's got a lot of good stuff about the scoring and the
science and thinking about that.
And then there's also one called Managing Gigabytes.
I think Ian Witten wrote that one.
And that one is just a little bit more about the logistics
and being able to horse around that much data and thinking
about some of the machine issues and how a
large-scale engine works.
So those three together, and then of course, you can always
do searches.
Google Research actually has a ton of different papers that
we've published.
So you might want to look into that a little bit as well.
But basically, the PageRank and early Google papers can give
you an idea of how to write a very simple search engine that
can scale to 100 million documents or so.
Add Managing Gigabytes and Modern Information Retrieval, and
that will give you a pretty good view of the
different parts of the space.