It's 15:00, let's get started.
Thanks for coming, my name is Michael Stapelberg.
This talk is about Debian Code Search.
Before we start: who uses Debian?
Most of you. Maybe I should have chosen a different title; the topic is actually not Debian-specific.
Here we have the contents of this talk.
It looks like a lot, but we can definitely cover it.
First of all, I want to explain the motivation, i.e. why you would need a code search engine.
Afterwards, I will briefly explain how search engines in general work.
Then I will cover the basis of my work, the codesearch tools from Russ ***.
I will cover why it makes sense to do this project in the context of Debian,
then the general architecture,
what an inverted index means,
how we implemented ranking,
what kinds of optimizations I did
and finally we have some time for questions.
I should mention that Debian Code Search is also the subject of my bachelor thesis,
which was mostly a lucky coincidence. I would have done it anyway, but the time was right and I found a professor supporting my project.
Let's talk about motivation: why would you search for source code?
Frequently, with Open Source projects, I experience the problem that documentation is really bad.
Maybe there is no documentation at all, or out-of-date docs, or docs which are just not very friendly to newcomers.
Two examples are OpenSSL and XCB.
With OpenSSL - who of you has worked with OpenSSL?
Who of you found it easy to get into OpenSSL?
Exactly! Nobody.
OpenSSL is incredibly complex. It does a lot.
All of it is documented, but to understand the simple things, you need to read all the docs.
So it is much much simpler to read a tutorial or look at what other programs are doing, i.e. working with code examples,
instead of reading through all the documentation, which is not a feasible thing in this case.
Another, different, example is XCB.
With XCB, nothing is documented. Why is that?
XCB, first of all, stands for X C Bindings. It is a library that allows you to talk to the X server from a C program.
XCB is the intended successor of Xlib, which partly succeeded, but that is not very interesting right now.
Previously, with Xlib, humans wrote and maintained all the code.
With XCB, they thought: that is not very clever, it led to subtle errors,
so instead, they created a protocol description of the X11 protocol.
This is an XML file from which source code is automatically generated,
for C, for Python, for Lua, whatever. That is a nice advantage,
the other advantage is that by generating code, you cannot make the same mistakes as have happened in the past.
However, in that protocol description, there is no documentation.
So, if you as a programmer look at this library, you will see lots and lots of functions.
They are all generated and come with zero documentation.
That means you have to know the X11 protocol already, or maybe have worked with Xlib in the past, before you can even use this project.
When working with XCB, it helped me a lot to look at the source code of other programs.
So this is the first point. I want to find real-life code examples so that I can avoid dealing with bad documentation.
The other big point is tracing program flow across project boundaries.
Assume I have some application, say evince, on my computer and find a bug. What do I do?
Well, I download the source code, locate the bug in the source and usually realize:
ah, evince does a library call into another project.
So I start from scratch: download that library, locate the line, realize it calls another library.
And it goes on like that. It is annoying to download the whole source code, when all you want to look at is just a tiny fraction of the source.
Also, jumping around between different projects is much easier with a search engine, where you just enter the name of a function and get to the function.
So this is the second reason. There are more, but I think it is enough to explain those two. People who want to read more can read my bachelor thesis.
The next question is of course: why a new search engine? You would think somebody already built such a thing.
Of course there are some code search engines, but they are language-specific.
Two examples are Nullege and Sourcerer.
Nullege is a Python search engine, and it really understands Python. So you can navigate through hierarchies of classes and stuff, which is cool.
But it also implies that Nullege cannot handle non-python source code.
And I don't want to be restricted to one language, I want it to work for all of them.
With sourcerer, it's the same issue, but sourcerer is Java-specific.
The other class of search engines is the one which is not up-to-date or contains very little software in general.
Examples of that are Koders and Ohloh Code, which by now belong together.
They have been changing their system for quite some time, and don't update their index in the meantime.
It is super frustrating when you search for something and find old code.
Another example is Krugle, which, like a few others, has a company behind it.
What this company wants to do is sell a code search engine to other companies. To demonstrate their product, they run a demo version with open source software in it.
The problem is that Krugle, in particular, contains 450 OSS projects.
This is a ridiculously small amount. If you search for anything, you won't find it.
The other problem is: there have been search engines which were pretty good, e.g. Google Code Search, which I used.
but it is not available anymore because Google wants to focus on different projects and just closed it.
Furthermore, all of the previously mentioned search engines are not open source themselves.
It is always good to have an OSS project, but especially when doing something for an OSS project like Debian, it's even more important that the tools themselves are OSS.
Aside from all that, it's very interesting to build your own search engine. I learnt a lot.
So, even if there already are search engines, it's worthwhile to build another one.
So how does a search engine work in general? Who of you implemented something in that area?
2 people, alright. There are a few technical terms you should know, so I'll briefly explain them.
The first one is the corpus, which describes the set of data which is being searched.
Usually, when thinking of a search engine, you think of a web search engine.
For a web search engine, the corpus is just the set of all web sites.
However, web sites change permanently, so a web search engine needs a crawler.
The crawler permanently downloads web sites, constantly expanding the corpus.
But this is not an inherent property of search engines.
A corpus could also, in your local library, be the set of all books.
And when new books arrive, the employee at the library enters them into the system. Then you don't need a crawler.
So there are different ways to construct your corpus, but now you know what the term itself means.
The second thing you think of is the query, meaning a query to your search engine.
This is very simple and everyone of you has dealt with it: this is what you enter for example into the Google search box.
However, not only the query itself is sent to the search engine; it usually gets expanded.
As an example, you could imagine that, when you do a web search in Germany,
your search engine could say: okay, the query was "kebab house", I will add Germany, or even Karlsruhe, to the query, so it knows which documents to return first.
So, query means query for the internal system and can be different from what you entered.
You also need an inverted index,
which is just a data structure leading from a query to the location in your corpus. I will cover how that works in a second.
And finally, the ranking is very important because nobody would be happy with a results page in no particular order.
Obviously, you would want the thing you are most likely looking for to appear at the very top.
This is a pretty hard problem. I will cover later how Debian Code Search does it.
By the way, if there are any questions, please just raise your hand.
The basis of my work are the code search tools by Russ ***. He is the author of Google Code Search and wrote it as his internship project in 2006.
After Google Code Search was taken offline in 2012, he published an article in February 2012, in which he explains how it worked.
The article not only explains how it works, but also comes with source code.
He re-implemented his ideas, and his code makes a few simplifications, so it is not production-ready.
But it provides a good basis, and you understand how it was meant to work.
I want to show a quick demo. Let's first have a quick look at the article itself.
It is called "Regular Expression Matching with a Trigram Index, or, How Google Code Search worked".
Here he described how it all worked.
Now, let's have a look at how those tools work. Assume I have a directory containing source code.
Yes, this is open source under the BSD license, very friendly.
Now let's demonstrate the tools. I have a directory called "i3" and can run the tool "cindex",
which creates an index. I pass "src" as command line argument and now the tool has created an index.
What I can do now is run "csearch" with a search term like "i3Font",
and I will get the results. So it searched the directory I previously indexed and displays the results.
So this is the basis with which I started. In the end, Debian Code Search is much more, though.
Most people know Debian. It is a free, non-commercial Linux distribution. So it is a good idea to host such a project there, where you don't want it to get taken offline after a few years.
In case I cannot maintain the project at some point in time, somebody else can step up.
Personally, I have been a Debian Developer since March 2012.
And one of the reasons that led me to doing this with Debian is that we have a lot of software in Debian: about 17000 source packages, resulting in even more binary packages.
All in all, they contain 129 GiB of source code.
The advantage is that in Debian we only have free software. That means, we don't need a crawler.
So we don't need to download software and figure out its license, or find out whether we can re-distribute it at all.
So a lot of what you normally need for a web search engine is just not an issue here.
Furthermore, in Debian, we have plenty of machine-readable metadata.
Quite recent are machine-readable copyright files.
Obviously, we also have the package names and their description.
But what's even better is that we have usage numbers from the popularity contest. We can see how many popcon participants have installed a specific package.
That's how we can figure out which packages are more important than others. More on that later.
Now let's look at the architecture on a high level. The best thing is to give a quick demo.
What we see here is the index page of Debian Code Search, which is quite simple and just contains a search box.
There are a few links at the top: about, FAQ, etc. Those are static and the search engine only handles input to the search box. Now, question: what should I search for?
Here we have the results page. It took a bit of time, but not incredibly long. A grep on the entire Debian archive would be much slower, so that's a win.
What we see here are the results. The function we searched for is defined in the libc, which is a package that a lot of people have installed, so it shows up at the very top.
The results page is structured like you would expect it. There are more results, e.g. in the python package, and then you can navigate to the next page.
As a short aside, what you wanted to search for is fine, but have a look at what other people enter into search engines after you launch them.
I was surprised at first, especially considering the amount of hits.
The first hit, by the way, was caused by somebody writing a blog post about the issues with Douglas Crockford's license. In that post, he linked to this query as an example.
For the other queries, I just think that this is what people enter once you present them an input box.
This table was created to evaluate the cache ratio. The most important improvement after launch was to enable aggressive caching, otherwise the search engine would have been overloaded.
Now, to the architecture. What you just saw works like this:
To the top left, you see the HTTP frontend, which is just an nginx server in our case.
What nginx does for us is delivering static pages like the about and FAQ page. Furthermore, it can load-balance the requests going into the system.
That is, if I have multiple of these backend processes, I could split the load 50/50. That allows me to disable one of the stacks, update it, then re-enable it.
nginx then sends the query to dcs-web, which is a process that queries the different backends and then renders the results page that we just looked at.
We have the index backend, which is sharded for different reasons. This is the inverted index, which is kept in memory, and allows us to go from a search query to the corresponding position in the Debian archive.
After dcs-web queried the index backend, it only knows which files should be searched. This means, it only reduces the amount of files to search.
Afterwards, dcs-web gets information from a PostgreSQL database, which it uses to define the order in which the files are searched.
And finally, the query is passed to the source backend, which has access to the entire Debian archive. It searches the files and returns the actual matches, resulting in the page you just saw.
Now to the inverted index. Assume we have two documents, d1.txt and d2.txt. d1 contains "Das Wetter ist schoen", d2 contains "Die Wetter-Vorhersage luegt".
In an inverted index, you want to go from a word to the posting list.
So you go through the first document, word by word. Then you store in which document each word appears. So the posting list for "Das" contains just d1.txt.
"Wetter" however appears in both documents. This is a very simple index with a very tiny amount of documents, but that is how it works.
Note that we only use whole words, no punctuation. So if you have asked yourself why web search engines don't care about punctuation: this is the reason.
Right, two words are missing on this slide because I was careless.
This is a good remark. Many search engines completely ignore words that are too frequent, because you would have incredibly long posting lists that don't really help you in finding what you are looking for.
The thing is that we cannot just use a classical inverted index for a regular expression search engine.
I should note that both, Google Code Search and Debian Code Search support regular expressions.
This is a mighty tool for developers searching for code, but it also makes it harder to build a search engine in the first place.
The problem is, when you consider ".*bar", we could search in our inverted index for "bar", but that doesn't match "***". So you would have to apply the regular expression on all keys in our index before you get to the posting list, and that is not feasible.
So this example explained why you cannot just simply search for regular expressions.
Another example is searching for "..._free". What should we do? Search for all combinations? Especially after considering Unicode, this results in combinatoric explosion, and everybody agrees that one cannot do this.
Instead, the approach is to reduce the amount of files to search in. So we somehow want to go from a regular expression to a query that excludes a lot of files, and then we have a closer look at the remaining files.
How does that work? You take the source code files and split them in trigrams.
It looks like this: taking the identifier "XCreateWindow", we take "XCr", "Cre", and so on. So we split the entire source code in trigrams, including punctuation.
"Cre" appears in two documents, so the posting list contains two entries, just like you saw in the example earlier.
So we now transform the regular expression query we get into a trigram query.
In this case, we search for Deb AND ebi AND bia and so on. The trigram query engine only supports AND, OR and subqueries.
Those interested in how all the special cases are dealt with are welcome to read Russ ***'s original article.
Some people might ask: why trigrams and not 2-grams or 4-grams? With 2-grams, we have too many common 2-grams, so the index doesn't tell you anything. With 4-grams, we have too many distinct entries, so the index is too big.
So trigrams are the sweet spot.
Now to another short demo. What I want to show is how to get from a trigram to a posting list.
So we want to consider "i3F" as a trigram. This perl oneliner just prints the ASCII value of each character in hexadecimal.
So 0x69, 0x33, 0x46. These can be passed to a debugging tool which searches for the trigram in the first index shard. Within 20 µs I get a list of files containing "i3F".
When doing this with all 6 different index shards, I get a long list of files. But when ANDing "fon", and so on, the list gets shorter and shorter. And afterwards, I just essentially do a grep on the remaining files.
The question was: are the numbers 16 bit or 32 bit? The answer is it's 32 bit to support unicode.
Yes, you can search for unicode.
Now to the ranking.
This depends on the encoding of the source code.
Yeah, the ASCII range of Latin-1 is also valid UTF-8 [Latin-1 as a whole is not a subset of UTF-8].
No, it does not convert encodings. It assumes everything is UTF-8. So everything in Latin-1 which falls into the ASCII range, I hope that was clearer, just works. But having Latin-1-encoded source is very rare; why would you do such a thing?
Usually, comments are English, ASCII and therefore Unicode-clean, but if you have localized comments, please save them as UTF-8. Any more questions?
The question was: are the sources indexed, too? Only the sources are indexed. I don't get the question yet.
Okay, got it. The question is: are only program source code files indexed or everything which is inside a source package?
Yeah, changelogs and stuff. I wrote the indexer myself and just used a blacklist of things I don't want to index. What the original code search tools already provide is recognizing binary files and not indexing them.
So, files such as images, are not in the index. A few other files, like CHANGELOG and NEWS are ignored on purpose.
Does that answer your question? Any more questions?
Now we saw how we go from the search query to the list of potential results.
Now comes the ranking. We have to decide which files to search first. It's not enough to search all files and sort afterwards, because, depending on the search query, searching all files is not feasible.
So which results to show on top is a thing we decide later on, but we have to consider the ranking now already.
Evaluating the ranking was done by picking search queries and looking for expected results manually.
That is, as an example, we have "XCreateWindow" as a search query, and what I want to have in my results is libx11 in this file in this line.
Then I tested each ranking idea I had and looked at how they performed.
The result is a weighted sum, consisting to a large part of the popularity contest,
as I explained, popcon is an opt-in mechanism which transmits the amount of package installations. I can see that e.g. libc is installed on so many machines whereas python is only installed on so many machines.
This is the most important part. Then we have the reverse dependencies. If I have a library such as libx11, and another library or program depends on libx11, that is a reverse dependency for libx11.
That is, the longer the list of reverse dependencies, the more important is the software. This is very similar to the popularity contest, but it's not the same. I tested it and it brings a different quality to the result.
Then we have the file path, so if the search term is already contained in the file path, that gets a boost.
You can have a look at how many libraries are structured, e.g. libc. Each function is in a separate file with a corresponding name. This is not the only case, so it's helpful to have this in our ranking.
The remaining 14% contain the scope, the exact word match and the package name match.
Scope means having a look at the indentation of the file. From the indentation, we derive the scope of the result.
That is, if you have a result in a line which is not indented at all, the result is at the top-level scope, so a class definition, or a function definition.
If it's indented by one, it is for example a variable definition. If it's indented even further, it might only be a helper variable and not very important.
So you can say the higher in the scope, the more important the match. This is not very scientific and exact, but it works.
The second thing is the exact word match. As an example, when searching for "XCreateWindow", I prefer "XCreateWindow(" over "XCreateWindowEvent". So if the result is contained in another word, it's not as good as when it appears standalone.
The package name match is essentially the same thing as the file path. So if the search term is already in the package name, the result is better. This happens rarely, but when it happens, it's a strong signal.
Now I briefly want to cover two optimizations I performed. The first one is posting list decoding. A posting list, we recall, is the list of documents for a certain word in our inverted index.
Posting lists are saved using variable integer delta encoding.
This sounds more complex than it is. Think of two documents, identified by id. So you don't have d1.txt and d2.txt, but 450 and 451, which makes the index much smaller.
Now you want to save the numbers, but you don't want to use 32 bit for each number, because the index should be small enough to keep it entirely in RAM.
So what you do, is you split the numbers in blocks of 7 bits and use the eighth bit to indicate there is another block, a lot like UTF-8.
So, when encoding 450, you get 0x83 and 0x42.
And delta encoding means we interpret each number based on the previous number. The delta between 450 and 451 is 1, so we just store a 1.
The underlying assumption is that documents which are in the same source package are close to each other in the index, too, so that saves space.
This decoding was implemented naively. Debian Code Search is written in Go, because Russ ***' code search tools are in Go and because I like the language.
The Go implementation is understandable, but not highly optimized. So I decided to re-implement it in C, and that works significantly faster.
Question: are the posting lists sorted due to the delta? Yes, I think they are precisely for that reason.
The second optimization is a query planner. This table is sorted by the amount of documents in the posting list of each term. Let's explain what's what:
Assuming we have the query "XCreateWindow", these are all the trigrams which this query contains.
The second column is how many documents are in each posting list. So, "XCr" is in 763 documents, "teW" in 5732 and "ate" is very frequent, it appears in 419000 documents.
The i column is just an index.
Then we have the amount of documents in the intersection. In this query we have "XCr" AND "teW" and so on. So we start with 763 documents, then we go to 266, 263, and so on. You can already see that there is not much change down here.
This fact is made very explicit in the following column. In the first row we don't have a delta yet, then 497 documents change, then 3, 2, 1 and no change after that.
The second to last column is the delta to the final result, which is big in the beginning and goes down to zero quickly.
The last column is the time of decoding for each trigram individually. A short posting list is decoded in 12 us, but the time is much higher for a longer posting list.
The original source used all trigrams in the order in which they appeared in the index and then ANDed them all.
The observation is that if you sort the trigrams ascending by the number of documents in their posting list, you will realize that after a certain point, nothing changes.
Therefore, my algorithm heuristically waits until not a lot changes in the results anymore and then aborts. It is better to search one more file than to decode and process all remaining, long posting lists.
This works and makes the search engine faster.
To conclude, I want to state that Debian Code Search is useful to developers in their day to day life. Many people told me they like to use it and it's helpful for them.
I would like to see patches. There are many ideas which really simplify life further for developers and other users. If you would like to help, please talk to me.
And just in case you have a machine with unused 8 GiB RAM and 160 GiB SSD connected to the internet, I could make use of that.
Question: Why does the posting list decoding time vary and not get longer as we go further down in the table?
I think that's not an error in the table.
Possibly the posting list just has different values. The amount of documents in the posting list is not the same thing as the length of the posting list in bytes.
That's why they vary.
Exactly, Debian Code Search is open source since January 2013.
The question is how often do we re-index and can you incrementally update it. Currently, we update every Friday.
The problem is that an update takes a long time. Updating incrementally doesn't work unfortunately.
The published code search tools don't do that.
One could think about tackling this issue, but it's a lot of work for relatively little benefit.
As mentioned, an update takes a long time, and we don't have a lot of CPU resources; I just host it all on my server, so just one machine.
An update takes about 8 hours. I could decide to update the index daily and have a slow search engine for 8 of 24 hours, or not have very fresh results, but a faster search engine, which currently is more important.
The question was whether I have automated benchmarks so that I can see what effect changes have on the performance. Yes, I have that.
The tool uses the same queries that I also used for the evaluation of my ranking.
The note was: Complicated regular expressions could take a lot of time to process. The answer is:
Yes and no. It works faster than you might think, but the more complex your query is, the longer the processing takes.
You can definitely think of queries which take a long time to process.
The question was whether we can go from the very simple result list we currently have, open a file, and click on an identifier to see all the callers for example.
This is a bit hard because the search engine currently is language independent. To do that, you at least need to recognize identifiers.
You could do that, but suddenly you have a lot more metadata. So, yes, of course that is technically possible, but it's a lot of work. I personally don't intend to do that, but I don't mind other people trying to implement it.
The remark is that the original use-case of following program flow cross-project gets even easier with such a feature, which is correct.
The remark was that there are tools like ctags which already recognize identifiers. One of the search engines I mentioned in the beginning just uses ctags to build their index.
The question is how do queries below three characters work when using a trigram index? The answer is: they don't. I think that's the reason why many search engines have a three character minimum length.
In that case, the site just says the query is invalid and links you to the FAQ.
The question was: How are the Debian packages stored, are they unpacked? The answer is yes, before building the index, I need to unpack the packages, and I just leave them on disk like that. That way, I can search and display the individual files.
The remark is that there are packages like the linux kernel which contain just another tarball, which then contains the actual source. I didn't think of that, but I think only a handful of packages are affected. We would need to modify the unpacker to handle that case.
This is a little bit risky in terms of security, but, given the package is already in Debian, it should be somewhat clean.
The question was whether popcon still works without atime. I don't know.
Personally, I have disabled atime and use popularity contest. I think packages which only I have installed so far do show up, so it must work, but I'm not entirely sure.
The implementation is entirely in Go except for one or two shell scripts that unpack the sources.
The question is how high the effort of reimplementing it in C++ is.
Or, actually, how big is Debian Code Search and how many dependencies does it have. I can show two things to answer that.
Another remark is that I reimplemented parts of it in C. But I only reimplemented one little file in C, so that's negligible.
So here is the source code, see github.com/Debian/dcs. It is nearly self-contained. There is a document which describes how to build a Debian package from it that you can just install on your server.
What you can see here is how to setup your Go development environment and what dependencies you need: a postgresql driver, a Go debian control file parser and the original code search tools from Russ ***, so 3 external dependencies.
Apart from that, what we can do real quick:
Oh, this doesn't work. sloccount doesn't do Go, so I have to guess. It's not easy to port Debian Code Search to C++, especially when keeping its architecture.
Quite a bit depends on Go's way of handling concurrency, i.e. channels and so on. The architecture uses a lot of Go's features, so it's definitely a good fit.
You certainly can do it, but what is your motivation?
Yeah, but why?
Okay, so it's purely theoretical, without a good reason.
Then again: right tool for the right job. Any more questions?
The question is does the indexer handle stable, unstable, etc.? This is a frequently asked question and we index sid, i.e. unstable.
If you would index more, you'd have the problem that for a single query you'd get multiple results in the same package, but with different versions, which makes the results page not as useful.
Otherwise, you'd need to select which distributions you want to search in. This implies that you need multiple indexes. So this makes everything more complex. But in case you have a really good reason for doing this, please talk to us.
The question was: are package-specific patches in the index? Yes, they are.
Also, the entire debian/ directory.
So if you want to search in things that only appear in debian/rules, you can do that.
Is anything within debian/ excluded? Good question, let's look at the source.
There is dcsindex, which is not very complicated. It has a hard-coded white- and blacklist.
Ah, changelog and readme, when not inside debian/. So within debian/, we don't exclude anything.
I mean, it's called Debian Code Search, so it better finds Debian specific stuff.
Exactly. My work is on top of cindex and csearch which I showed earlier.
The question is are there any alternatives. Yeah, most likely. But I think none of the alternatives does what Russ *** did, i.e. regular expression search on a big index.
Or, on a big corpus. Also, I found his approach attractive. I share his views and like Go, so without looking for alternatives, I just went for it.
Any more questions?
Alright. In case you think of any more questions, feel free to talk to me later on. Thanks for the many questions and your attention!