[Boost][GSOC] proposal for Trie

From 2012 till now, I am studying Computer Science in Peking University
Hi, this is my proposal for GSOC 2013, please read it and give me some useful information on it, thanks == Personal Details == * Name: Zhe(Hardy) Huang * College/University: Peking University * Course/Major: Computer Science * Degree Program: M.Sc. * Email: hzithlony@gmail.com * Homepage: not yet * Availability: * How much time do you plan to spend on your GSoC? I will spend the most part time of the semester(till late June), and then the whole summer on my GSoC project. * What are your intended start and end dates? I basically will obey the rules of GSoC, and I will also listen to mentor’s advice on it. * What other factors affect your availability (exams, courses, moving, work, etc.)? I am having my courses now, and I will have my final exams in June. Since I am not having many courses, I have enough time to do something for the project before exams. And in summer I could almost work full time on it. == Background Information == Please summarize your educational background (degrees earned, courses taken, etc.). I majored in Computer Science in Zhongshan University from 2008 to 2012, by which I earned my bachelor’s degree. During the four years, I learned many basic courses on Computer Science, like algebra, logic, data structure, algorithms, operating system, Compilers, databases, networks. pursuing my master’s degree. My research field is information retrieval. Please summarize your programming background (OSS projects, internships, jobs, etc.). GSoC is the first time that I get involved with a real and large OSS project. Besides, I’ve developed many things in C and C++. I developed a simplified B-tree data structure in C during my intership for a small company. Later a system in C to extract and evaluate the content of webpages for my research, and I am still on it. An implementation of a search system in C++. A webpage template system in C for my lab to generating webpages in order to do some research on them. Please tell us a little about your programming interests. I like to write codes for the basic and general use. I am pursuing well-framework of design and code, and high reusability when I think of develop the project. Please tell us why you are interested in contributing to the Boost C++ Libraries. Boost is widely used by programmers and famous for its powerful and efficient. It is an honor to contribute to such a lib, and have my codes used and reviewed by so many programmers. I like the reusability of well-designed lib, so it also will be a good chance for me to improve my skill of design and programming. And a good chance to communicate with good guys as well. I want to be a good programmer with many good programmers as my friends. What is your interest in the project you are proposing? Because I have participated in many competitions of algorithm, I have a good knowledge of algorithms and data structures. I have also implement many, but for the sake of winning, the codes usually can not be well-reused. Meanwhile the STL in C++, which I’ve used many times during the competitions, such as set, map, priority_queue, sort are good examples of well-reused algorithms and data structures. So I’ve been expecting to develop some general libraries of algorithms and data structures. Boost is a well-formed organization, and Boost lib is a widely-used lib, so I also want to learn some skills to develop a good lib from developing the project with mentors in Boost. Have you done any previous work in this area before or on similar projects? I am good at data structure and algorithm, and I wrote codes of various algorithms and data structures for many times. I am very familiar with the Trie data structure, and as I list above, I've develop a simplified general B-tree data structure in C, which is a good experience when I develop Trie for Boost. What are your plans beyond this Summer of Code time frame for your proposed work?. I will dedicate my summer to my work. And taking the experience as a good beginning, I will be feeling happy to contribute more to Boost, and other OSS projects. Please rate, from 0 to 5 (0 being no experience, 5 being expert), your knowledge of the following languages, technologies, or tools: - C++ 3 - C++ Standard Library 3 - Boost C++ Libraries 2 - Subversion 3 What software development environments are you most familiar with (Visual Studio, Eclipse, KDevelop, etc.)? I use Visual Studio in my lab, and vim + gcc for myself. What software documentation tool are you most familiar with (Doxygen, !DocBook, Quickbook, etc.)? I’ve used Doxygen before, it is simple and efficient. == Project Proposal == Please provide a description of your proposed work. Trie(Radix Tree) is an efficient in memory data structure to deal with string storing and searching. It can be used in many applications, and in comparison to hashtables, it has better expandability and stability. It will be an data structure of great importance and wide usage. == Proposed Milestones and Schedule == Please provide estimated milestones and schedule for completing the proposed work. Before May 27: To familiarize with the Boost lib, with reading the codes and discussing about it in mailing list. To have a better knowledge of lib, with reading some books of STL and Boost, and comparing Boost with STL. To setup us a proper environment for development. Study deep into Trie, reading some others’ good implementations. May 27 – June 17: Keep in touch with the Boost organization and have good communications with my mentor. Get involved with the development of boost. Try some toys about Boost, and discuss them with mentor, in order to know better about mentor and how the development goes on. June 17 – July 15(Coding period begins): Discuss the detail of the design with mento, develop an alpha version of Trie, and have it tested in small scale. July 15 – July 29(To mid-term evaluation): Refine my alpha codes, analyze the performance, and try to improve it. Write an initial report of the project and document of Trie. July 29 – September 23: Refine my codes, and have some thorough tests on it, make it robust. Reorganize codes and improve documents.

2013/4/21 Hardy Huang
Hi, this is my proposal for GSOC 2013, please read it and give me some useful information on it, thanks <...>
Thanks, proposal looks good and interesting and has all the chances to be chosen for GSOC. A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain. -- Best regards, Antony Polukhin

On 2013/4/22 17:15, Antony Polukhin wrote:
2013/4/21 Hardy Huang
: Hi, this is my proposal for GSOC 2013, please read it and give me some useful information on it, thanks <...>
Thanks, proposal looks good and interesting and has all the chances to be chosen for GSOC.
A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain.
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Thanks for reviewing my proposal. I am very glad to see your positive reply. I have thought of what you say when writing my proposal. Since it will be a good addition, I will reorganize my initial conceivation about Trie and post it as soon as I could.

On 4/22/2013 7:36 AM, Hardy wrote:
Thanks for reviewing my proposal. I am very glad to see your positive reply. I have thought of what you say when writing my proposal. Since it will be a good addition, I will reorganize my initial conceivation about Trie and post it as soon as I could.
If possible implement it as a boost.intrusive container which generally allows things to be trivially adapted into non-intrusive containers.

2013/4/22 Michael Marcin
On 4/22/2013 7:36 AM, Hardy wrote:
Thanks for reviewing my proposal. I am very glad to see your positive reply. I have thought of what you say when writing my proposal. Since it will be a good addition, I will reorganize my initial conceivation about Trie and post it as soon as I could.
If possible implement it as a boost.intrusive container which generally allows things to be trivially adapted into non-intrusive containers.
It works both ways :) So do as *you* think would be better and more useful. (Nobody forbids you to implement it other way later and also propose to Boost) -- Best regards, Antony Polukhin

On 4/22/2013 10:55 AM, Antony Polukhin wrote:
2013/4/22 Michael Marcin
: On 4/22/2013 7:36 AM, Hardy wrote:
Thanks for reviewing my proposal. I am very glad to see your positive reply. I have thought of what you say when writing my proposal. Since it will be a good addition, I will reorganize my initial conceivation about Trie and post it as soon as I could.
If possible implement it as a boost.intrusive container which generally allows things to be trivially adapted into non-intrusive containers.
It works both ways :) So do as *you* think would be better and more useful. (Nobody forbids you to implement it other way later and also propose to Boost)
You can take a non-intrusive container and make it intrusive? Black magic?

On 4/22/13 11:08 AM, Antony Polukhin wrote:
2013/4/22 Michael Marcin
: <...> You can take a non-intrusive container and make it intrusive?
Black magic?
And how are you going to make intrusive container manage memory and objects lifetime?
Stefan Strasser said it well not long ago, I'll just quote him. On 4/6/13 10:08 AM, Stefan Strasser wrote:
Zitat von Vadim Stadnik
: If Boost.Intrusive containers (and containers based on Boost.Intrusive containers) outperform all other containers, they are the best possible choice. They are winners by both of these important parameters. Otherwise their advantage is not so obvious.
a non-intrusive container that is implemented in terms of an intrusive container is as efficient as any other non-intrusive container. it is a little more work to implement, to provide the Intrusive public interface. However, you can not implement an Intrusive container in terms of a non-intrusive container but have to reimplement the entire data structure. That's why Boost.Intrusive is not based on namespace std containers but are a reimplementation of the algorithms.
The implementation of a non-intrusive container in terms of an intrusive one is as simple as:
struct value_holder : intrusive::hook{ T value; };
and forwarding all calls of the STL container interface.

2013/4/23 Michael Marcin
On 4/22/13 11:08 AM, Antony Polukhin wrote:
2013/4/22 Michael Marcin
: <...> You can take a non-intrusive container and make it intrusive?
Black magic?
And how are you going to make intrusive container manage memory and objects lifetime?
Stefan Strasser said it well not long ago, I'll just quote him.
On 4/6/13 10:08 AM, Stefan Strasser wrote:
Zitat von Vadim Stadnik
: If Boost.Intrusive containers (and containers based on Boost.Intrusive containers) outperform all other containers, they are the best possible choice. They are winners by both of these important parameters. Otherwise their advantage is not so obvious.
a non-intrusive container that is implemented in terms of an intrusive container is as efficient as any other non-intrusive container. it is a little more work to implement, to provide the Intrusive public interface. However, you can not implement an Intrusive container in terms of a non-intrusive container but have to reimplement the entire data structure. That's why Boost.Intrusive is not based on namespace std containers but are a reimplementation of the algorithms.
The implementation of a non-intrusive container in terms of an intrusive one is as simple as:
struct value_holder : intrusive::hook{ T value; };
and forwarding all calls of the STL container interface.
May be I'm missing something, but Stefan Strasser was talking about STL and Boosts intrusive containers, and his words shall be interpreted in context that we have no access to internals of STL. But when we have access to internals it is a totally different situation: * If we develop a non-intrusive container, we are free to split its implementation to resource management and intrusive logic. * If we develop an intrusive container - we need to add resource management to make a non-intrusive container. In both situations we achieve same results - we have intrusive and non-intrusive containers. So it works both ways *when we have access to internals* . -- Best regards, Antony Polukhin

On 4/23/2013 1:37 AM, Antony Polukhin wrote:
May be I'm missing something
Unfortunately I feel the same.
But when we have access to internals it is a totally different situation: * If we develop a non-intrusive container, we are free to split its implementation to resource management and intrusive logic.
By doing this split aren't you making an intrusive container and then layering the resource management ontop of it exactly as I suggested?
* If we develop an intrusive container - we need to add resource management to make a non-intrusive container.
Isn't this the same thing?
In both situations we achieve same results - we have intrusive and non-intrusive containers.
So it works both ways *when we have access to internals* .
Right if you have a non-intrusive container and access to the source code you can, most likely, refactor it to be an intrusive container but I don't see how that is comparable to being able to layer an non-intrusive interface over top an intrusive container without changing the implementation. To put it another way if there was only intrusive_trie in boost. And a boost user using boost burned to a cd (read-only) she should be able to make trie in user code by writing a simple non-intrusive adapter. If there was only trie in boost in the same scenario she would not be able to make use of boost trie to implement intrusive_trie in user code short of copy-paste-modify.

2013/4/23 Michael Marcin
On 4/23/2013 1:37 AM, Antony Polukhin wrote:
May be I'm missing something
Unfortunately I feel the same.
But when we have access to internals it is a totally different situation: * If we develop a non-intrusive container, we are free to split its implementation to resource management and intrusive logic.
By doing this split aren't you making an intrusive container and then layering the resource management ontop of it exactly as I suggested?
* If we develop an intrusive container - we need to add resource management to make a non-intrusive container.
Isn't this the same thing?
In both situations we achieve same results - we have intrusive and non-intrusive containers.
So it works both ways *when we have access to internals* .
Right if you have a non-intrusive container and access to the source code you can, most likely, refactor it to be an intrusive container but I don't see how that is comparable to being able to layer an non-intrusive interface over top an intrusive container without changing the implementation.
To put it another way if there was only intrusive_trie in boost. And a boost user using boost burned to a cd (read-only) she should be able to make trie in user code by writing a simple non-intrusive adapter.
If there was only trie in boost in the same scenario she would not be able to make use of boost trie to implement intrusive_trie in user code short of copy-paste-modify.
Now I get it... I was talking about intrusive/non-intusive with full access to sources (in my head Boosters were doing one container from another), you were talking about user that has no access to internals of implementation.
From your point of view intrusive containers provide better code reuse, from my point of view Boosters/GSOC will do both implementations and there is no big difference from where to start.
Using your point of view: Code reuse is times better, but users (including me) are usually lazy and more of them will prefer to work with non-intrusive container, which makes intrusive less popular. -- Best regards, Antony Polukhin

On 4/23/2013 3:52 AM, Antony Polukhin wrote:
Now I get it...
I was talking about intrusive/non-intusive with full access to sources (in my head Boosters were doing one container from another), you were talking about user that has no access to internals of implementation.
From your point of view intrusive containers provide better code reuse, from my point of view Boosters/GSOC will do both implementations and there is no big difference from where to start.
Using your point of view: Code reuse is times better, but users (including me) are usually lazy and more of them will prefer to work with non-intrusive container, which makes intrusive less popular.
Right. As a user you'll almost always just use non-intrusive because it is easier, safer and has good enough performance, until it doesn't. Then it's nice to be able to reach for an intrusive version that you can trust. And if your non-intrusive version that you've been using successfully is built on an intrusive version, and that intrusive version is built using the wonderfully designed boost.intrusive library you can migrate your code with a high degree of confidence. Cheers.

On 2013/4/23 16:52, Antony Polukhin wrote:
2013/4/23 Michael Marcin
: On 4/23/2013 1:37 AM, Antony Polukhin wrote:
May be I'm missing something
Unfortunately I feel the same.
But when we have access to internals it is a totally different situation: * If we develop a non-intrusive container, we are free to split its implementation to resource management and intrusive logic.
By doing this split aren't you making an intrusive container and then layering the resource management ontop of it exactly as I suggested?
* If we develop an intrusive container - we need to add resource management to make a non-intrusive container.
Isn't this the same thing?
In both situations we achieve same results - we have intrusive and non-intrusive containers.
So it works both ways *when we have access to internals* .
Right if you have a non-intrusive container and access to the source code you can, most likely, refactor it to be an intrusive container but I don't see how that is comparable to being able to layer an non-intrusive interface over top an intrusive container without changing the implementation.
To put it another way if there was only intrusive_trie in boost. And a boost user using boost burned to a cd (read-only) she should be able to make trie in user code by writing a simple non-intrusive adapter.
If there was only trie in boost in the same scenario she would not be able to make use of boost trie to implement intrusive_trie in user code short of copy-paste-modify.
Now I get it...
I was talking about intrusive/non-intusive with full access to sources (in my head Boosters were doing one container from another), you were talking about user that has no access to internals of implementation.
From your point of view intrusive containers provide better code reuse, from my point of view Boosters/GSOC will do both implementations and there is no big difference from where to start.
Using your point of view: Code reuse is times better, but users (including me) are usually lazy and more of them will prefer to work with non-intrusive container, which makes intrusive less popular.
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Your discussion above help me understand the design of libraries better, but meanwhile I am a little confused about how I can give a good addition to my proposal as your early advice - "A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain." Could you give me some examples or more advices on that? What I think out now are something like STL set, map, and with some specific methods and variables of Trie.

2013/4/23 Hardy
"A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain." Could you give me some examples or more advices on that? What I think out now are something like STL set, map, and with some specific methods and variables of Trie.
You are right, it is better to be as close to STL interface as
possible. You may also say something like:
"
Trie interface will have all the typedefs and methods of std::set excluding:
iterator insert (const_iterator position, const value_type& val);
...
It will additionally have the following methods:
pair

On 2013/4/23 20:59, Antony Polukhin wrote:
2013/4/23 Hardy
: "A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain." Could you give me some examples or more advices on that? What I think out now are something like STL set, map, and with some specific methods and variables of Trie.
You are right, it is better to be as close to STL interface as possible. You may also say something like:
" Trie interface will have all the typedefs and methods of std::set excluding: iterator insert (const_iterator position, const value_type& val); ... It will additionally have the following methods:
pair
prefix_range (const value_type& prefix) const; - returns range of elements starting from `prefix`. ... " You may also add some notes that you think will be interesting or just affect design, like "trie implementation will be storing it's size as a separate field to allow getting size in constant time" or "won't be storing it's size leading to liner complexity of size() function"... Some more examples of containers features: http://www.boost.org/doc/libs/1_53_0/doc/html/container/other_features.html -- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
That's very kind of you. I will try my best to refine my design as soon as possible.

On 4/23/2013 7:59 AM, Antony Polukhin wrote:
2013/4/23 Hardy
: "A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain." Could you give me some examples or more advices on that? What I think out now are something like STL set, map, and with some specific methods and variables of Trie.
You are right, it is better to be as close to STL interface as possible. You may also say something like:
" Trie interface will have all the typedefs and methods of std::set excluding: iterator insert (const_iterator position, const value_type& val); ... It will additionally have the following methods:
pair
prefix_range (const value_type& prefix) const; - returns range of elements starting from `prefix`. ... " You may also add some notes that you think will be interesting or just affect design, like "trie implementation will be storing it's size as a separate field to allow getting size in constant time" or "won't be storing it's size leading to liner complexity of size() function"... Some more examples of containers features: http://www.boost.org/doc/libs/1_53_0/doc/html/container/other_features.html
You should be able to meet at least the requirements of Container. http://en.cppreference.com/w/cpp/concept/Container Other than that it depends. I think a trie is usually an associative container I'm not sure if it is ordered. If you can directly fit a more refined C++ container concept many more algorithms will just work for it.

On 2013/4/23 20:59, Antony Polukhin wrote:
2013/4/23 Hardy
: "A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain." Could you give me some examples or more advices on that? What I think out now are something like STL set, map, and with some specific methods and variables of Trie.
You are right, it is better to be as close to STL interface as possible. You may also say something like:
" Trie interface will have all the typedefs and methods of std::set excluding: iterator insert (const_iterator position, const value_type& val); ... It will additionally have the following methods:
pair
prefix_range (const value_type& prefix) const; - returns range of elements starting from `prefix`. ... " You may also add some notes that you think will be interesting or just affect design, like "trie implementation will be storing it's size as a separate field to allow getting size in constant time" or "won't be storing it's size leading to liner complexity of size() function"... Some more examples of containers features: http://www.boost.org/doc/libs/1_53_0/doc/html/container/other_features.html -- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Hi, the following is what I conceive of Trie, which I will put to the
'description of work' in my proposal, because of its unmaturity, I will
be happy to discuss on it with you, please give me some advice on it so
that I can finish it and submit my application on the GSOC site.
Thank you very much.
==description==
Trie(Radix Tree) is an efficient in memory data structure to deal with
string storing and searching. It can be used in many applications, and
in comparison to hashtables, it has better expandability and stability.
It is a data structure of great importance and wide usage.
C++ has already had powerful common key-value containers such as set and
map(I will only use map to make the comparison later), but when the keys
are strings, map will take O(logN * length(string)) to lookup a key
which has a worse performance when handling long strings and large
amount of data, and stores every key as a full string which also wastes
space. Besides, it is hard to use map traverse some keys with common
prefix.
Trie is good at handling strings especially the above tasks, so it is
needed to improve the performance of handling key-value storage with the
type of key being string.
The basic interface of Trie will have similar typedefs and methods to
stl::map, with some differences:
pair

2013/4/25 Hardy
On 2013/4/23 20:59, Antony Polukhin wrote:
2013/4/23 Hardy
: "A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain." Could you give me some examples or more advices on that? What I think out now are something like STL set, map, and with some specific methods and variables of Trie.
You are right, it is better to be as close to STL interface as possible. You may also say something like:
" Trie interface will have all the typedefs and methods of std::set excluding: iterator insert (const_iterator position, const value_type& val); ... It will additionally have the following methods:
pair
prefix_range (const value_type& prefix) const; - returns range of elements starting from `prefix`. ... " You may also add some notes that you think will be interesting or just affect design, like "trie implementation will be storing it's size as a separate field to allow getting size in constant time" or "won't be storing it's size leading to liner complexity of size() function"... Some more examples of containers features: http://www.boost.org/doc/libs/1_53_0/doc/html/container/other_features.html
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Hi, the following is what I conceive of Trie, which I will put to the 'description of work' in my proposal, because of its unmaturity, I will be happy to discuss on it with you, please give me some advice on it so that I can finish it and submit my application on the GSOC site. Thank you very much.
==description== Trie(Radix Tree) is an efficient in memory data structure to deal with string storing and searching. It can be used in many applications, and in comparison to hashtables, it has better expandability and stability. It is a data structure of great importance and wide usage. C++ has already had powerful common key-value containers such as set and map(I will only use map to make the comparison later), but when the keys are strings, map will take O(logN * length(string)) to lookup a key which has a worse performance when handling long strings and large amount of data, and stores every key as a full string which also wastes space. Besides, it is hard to use map traverse some keys with common prefix. Trie is good at handling strings especially the above tasks, so it is needed to improve the performance of handling key-value storage with the type of key being string.
The basic interface of Trie will have similar typedefs and methods to stl::map, with some differences:
pair
prefix_range (const key_type& prefix) const; - returns range of elements starting from ‘prefix’ void prefix_erase(const key_type& prefix) const; - erases elements with keys starting with ‘prefix’ void prefix_count(const key_type& prefix) const; - returns the number of elements with keys starting with ‘prefix’
Pretty good!
key_type get_key(const_iterator position); - returns the key of the iterator on position, because I think it is a waste of space to store the complete key, so users cannot get the key directly by iterator. vector
get_keys(const Trie &trie); - returns the keys of all elements in Trie vector get_keys(const key_type& prefix); - returns the keys of all elements in starting with ‘prefix’ vector get_keys(const_iterator left, const_iterator right); - returns the keys between iterator left and right.
Did not get the idea. Dereferencing an iterator what shall return? Do you mean 'key' here as a character of substring or 'key' here is a string itself?
Besides, considering that traditional Trie has limited scale of using, I want to design my Trie in 3 layers: The bottom one can provide the user-definition key type, which uses like RawTrie
, in which the key can be a sequence of some type(maybe an array of anything can be a key), and every node of Trie can be taken as a root of sub RawTrie, which gives convenience to add Trie into it and even swap a sub Trie. And the second layer will be a specific Trie with the key of chars(when handling lookup and insert, both string and char[] can be adapted to it). And the feature of sub Trie is provided. The class is like Trie . And the third layer will be a Trie without so much freedom of operating as the above two. I will just provide the basic interface as I list above. And the last two layers of Trie will have a version of compressed storage of strings.
Relations between layers are not clear. Bay the way, student applications can now be submitted to https://google-melange.appspot.com/gsoc/homepage/google/gsoc2013. The deadline is 3 May. Do not forget to submit your proposal and add link to this discussion in proposal. -- Best regards, Antony Polukhin

On 2013/4/26 16:15, Antony Polukhin wrote:
2013/4/25 Hardy
: On 2013/4/23 20:59, Antony Polukhin wrote:
2013/4/23 Hardy
: "A good addition to proposal will be some sketch of trie interface: methods and constructors that Trie class will contain." Could you give me some examples or more advices on that? What I think out now are something like STL set, map, and with some specific methods and variables of Trie.
You are right, it is better to be as close to STL interface as possible. You may also say something like:
" Trie interface will have all the typedefs and methods of std::set excluding: iterator insert (const_iterator position, const value_type& val); ... It will additionally have the following methods:
pair
prefix_range (const value_type& prefix) const; - returns range of elements starting from `prefix`. ... " You may also add some notes that you think will be interesting or just affect design, like "trie implementation will be storing it's size as a separate field to allow getting size in constant time" or "won't be storing it's size leading to liner complexity of size() function"... Some more examples of containers features: http://www.boost.org/doc/libs/1_53_0/doc/html/container/other_features.html
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Hi, the following is what I conceive of Trie, which I will put to the 'description of work' in my proposal, because of its unmaturity, I will be happy to discuss on it with you, please give me some advice on it so that I can finish it and submit my application on the GSOC site. Thank you very much.
==description== Trie(Radix Tree) is an efficient in memory data structure to deal with string storing and searching. It can be used in many applications, and in comparison to hashtables, it has better expandability and stability. It is a data structure of great importance and wide usage. C++ has already had powerful common key-value containers such as set and map(I will only use map to make the comparison later), but when the keys are strings, map will take O(logN * length(string)) to lookup a key which has a worse performance when handling long strings and large amount of data, and stores every key as a full string which also wastes space. Besides, it is hard to use map traverse some keys with common prefix. Trie is good at handling strings especially the above tasks, so it is needed to improve the performance of handling key-value storage with the type of key being string.
The basic interface of Trie will have similar typedefs and methods to stl::map, with some differences:
pair
prefix_range (const key_type& prefix) const; - returns range of elements starting from ‘prefix’ void prefix_erase(const key_type& prefix) const; - erases elements with keys starting with ‘prefix’ void prefix_count(const key_type& prefix) const; - returns the number of elements with keys starting with ‘prefix’ Pretty good!
key_type get_key(const_iterator position); - returns the key of the iterator on position, because I think it is a waste of space to store the complete key, so users cannot get the key directly by iterator. vector
get_keys(const Trie &trie); - returns the keys of all elements in Trie vector get_keys(const key_type& prefix); - returns the keys of all elements in starting with ‘prefix’ vector get_keys(const_iterator left, const_iterator right); - returns the keys between iterator left and right. Did not get the idea. Dereferencing an iterator what shall return? Do you mean 'key' here as a character of substring or 'key' here is a string itself?
Sorry for not stating my design clearly. Firstly, the 'key' here represent a string. The first 'get-key' function is to get a key which corresponds to the iterator(node in Trie), because in order to save space(which is one of Trie's advantages), the whole key string should not store in a single iterator(node), so it is necessary to provide such an interface under my design. So I think it is not difficult to understand the last three function prototypes here is to show all the keys in different ranges. Since the relationship between key and string makes you confused, I am now trying to explain more about it. In my current design, a key is a 'string' itself. My idea is to take sequences of anytype as an abstract string in the bottom layer of the lib, in order to extend the capability of Trie. And after thinking on it several times, I now prefer to use 'key' as an element of a 'string', for it is more clear to users. Since it is a crucial and essential problem of the whole Trie project, I want some advices of you on it.

2013/4/26 Hardy
Firstly, the 'key' here represent a string. The first 'get-key' function is to get a key which corresponds to the iterator(node in Trie), because in order to save space(which is one of Trie's advantages), the whole key string should not store in a single iterator(node), so it is necessary to provide such an interface under my design.
This looks bad. Iterating through elements of 'string' may be a useful feature, but not the main. User will be putting in container 'string' and he will be expecting to retrieve 'string' when dereferencing iterator.
Since the relationship between key and string makes you confused, I am now trying to explain more about it. In my current design, a key is a 'string' itself. My idea is to take sequences of anytype as an abstract string in the bottom layer of the lib, in order to extend the capability of Trie. And after thinking on it several times, I now prefer to use 'key' as an element of a 'string', for it is more clear to users. Since it is a crucial and essential problem of the whole Trie project, I want some advices of you on it.
'key' is what we are going to store in container. User shall not care that we store key in parts, so it looks right to treat 'key' close to 'string'.
Ok, you mean that I should add this link of our discussion to the 'additional info' of my proposal in GSoC page?
Yes -- Best regards, Antony Polukhin

On 2013/4/26 18:47, Antony Polukhin wrote:
2013/4/26 Hardy
: Firstly, the 'key' here represent a string. The first 'get-key' function is to get a key which corresponds to the iterator(node in Trie), because in order to save space(which is one of Trie's advantages), the whole key string should not store in a single iterator(node), so it is necessary to provide such an interface under my design.
This looks bad. Iterating through elements of 'string' may be a useful feature, but not the main. User will be putting in container 'string' and he will be expecting to retrieve 'string' when dereferencing iterator.
Since the relationship between key and string makes you confused, I am now trying to explain more about it. In my current design, a key is a 'string' itself. My idea is to take sequences of anytype as an abstract string in the bottom layer of the lib, in order to extend the capability of Trie. And after thinking on it several times, I now prefer to use 'key' as an element of a 'string', for it is more clear to users. Since it is a crucial and essential problem of the whole Trie project, I want some advices of you on it.
'key' is what we are going to store in container. User shall not care that we store key in parts, so it looks right to treat 'key' close to 'string'.
Ok, you mean that I should add this link of our discussion to the 'additional info' of my proposal in GSoC page?
Yes
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I agree with you, that is why I design a layer of interfaces as TrieMap,
and TrieSet, in my conceivation, it can be used as stl:map and stl:set,
with the difference that they retrieve the keys using Trie data
structure. So of course user can get 'key' when dereferencing iterator.
Meanwhile, I still think that it is useful to provide some classes on
lower layer to not store the keys in the iterators, so that users can
have a more space-saving application.
In conclusion, I adjust my design to the following form:
bottom layer:
class Trie

2013/4/26 Hardy
On 2013/4/26 18:47, Antony Polukhin wrote:
2013/4/26 Hardy
: Firstly, the 'key' here represent a string. The first 'get-key' function is to get a key which corresponds to the iterator(node in Trie), because in order to save space(which is one of Trie's advantages), the whole key string should not store in a single iterator(node), so it is necessary to provide such an interface under my design.
This looks bad. Iterating through elements of 'string' may be a useful feature, but not the main. User will be putting in container 'string' and he will be expecting to retrieve 'string' when dereferencing iterator.
Since the relationship between key and string makes you confused, I am now trying to explain more about it. In my current design, a key is a 'string' itself. My idea is to take sequences of anytype as an abstract string in the bottom layer of the lib, in order to extend the capability of Trie. And after thinking on it several times, I now prefer to use 'key' as an element of a 'string', for it is more clear to users. Since it is a crucial and essential problem of the whole Trie project, I want some advices of you on it.
'key' is what we are going to store in container. User shall not care that we store key in parts, so it looks right to treat 'key' close to 'string'.
Ok, you mean that I should add this link of our discussion to the 'additional info' of my proposal in GSoC page?
Yes
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I agree with you, that is why I design a layer of interfaces as TrieMap, and TrieSet, in my conceivation, it can be used as stl:map and stl:set, with the difference that they retrieve the keys using Trie data structure. So of course user can get 'key' when dereferencing iterator. Meanwhile, I still think that it is useful to provide some classes on lower layer to not store the keys in the iterators, so that users can have a more space-saving application. In conclusion, I adjust my design to the following form:
bottom layer: class Trie
; The key and value are not in the for as map(pair ), so it is space-saving. second layer: typedef TrieMap
Trie > The two layers can satisfy different needs of users.
Duplicating data is bad. If we have an iterator pointing to leaf node we can restore the whole key, without storing it one more time (leaf node == final element of whole key). In my head node looks like this: template <class Element> struct node { node* next_on_same_level_; node* deeper_or_parent_; // points to parent if next_on_same_level_ is NULL Element value_; }; If next_on_same_level_ is NULL - we are in leaf node. To retrieve key we need to move to root, gathering elements. And it looks like it is possible to get rid of one of the pointers using http://en.wikipedia.org/wiki/XOR_linked_list technique. Or am I missing something? -- Best regards, Antony Polukhin

On 2013/4/27 1:20, Antony Polukhin wrote:
2013/4/26 Hardy
: On 2013/4/26 18:47, Antony Polukhin wrote:
2013/4/26 Hardy
: Firstly, the 'key' here represent a string. The first 'get-key' function is to get a key which corresponds to the iterator(node in Trie), because in order to save space(which is one of Trie's advantages), the whole key string should not store in a single iterator(node), so it is necessary to provide such an interface under my design.
This looks bad. Iterating through elements of 'string' may be a useful feature, but not the main. User will be putting in container 'string' and he will be expecting to retrieve 'string' when dereferencing iterator.
Since the relationship between key and string makes you confused, I am now trying to explain more about it. In my current design, a key is a 'string' itself. My idea is to take sequences of anytype as an abstract string in the bottom layer of the lib, in order to extend the capability of Trie. And after thinking on it several times, I now prefer to use 'key' as an element of a 'string', for it is more clear to users. Since it is a crucial and essential problem of the whole Trie project, I want some advices of you on it.
'key' is what we are going to store in container. User shall not care that we store key in parts, so it looks right to treat 'key' close to 'string'.
Ok, you mean that I should add this link of our discussion to the 'additional info' of my proposal in GSoC page?
Yes
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I agree with you, that is why I design a layer of interfaces as TrieMap, and TrieSet, in my conceivation, it can be used as stl:map and stl:set, with the difference that they retrieve the keys using Trie data structure. So of course user can get 'key' when dereferencing iterator. Meanwhile, I still think that it is useful to provide some classes on lower layer to not store the keys in the iterators, so that users can have a more space-saving application. In conclusion, I adjust my design to the following form:
bottom layer: class Trie
; The key and value are not in the for as map(pair ), so it is space-saving. second layer: typedef TrieMap
Trie > The two layers can satisfy different needs of users.
Duplicating data is bad. If we have an iterator pointing to leaf node we can restore the whole key, without storing it one more time (leaf node == final element of whole key).
In Trie a non-leaf node can be the ending of a string, so you thought here confused me. How will your design handle this situation? A pointer to a single leaf node from the non-leaf node to it?
In my head node looks like this:
template <class Element> struct node { node* next_on_same_level_; node* deeper_or_parent_; // points to parent if next_on_same_level_ is NULL
Element value_; };
If next_on_same_level_ is NULL - we are in leaf node. To retrieve key we need to move to root, gathering elements.
And it looks like it is possible to get rid of one of the pointers using http://en.wikipedia.org/wiki/XOR_linked_list technique.
That is an amzazing trick, it is good to see such a beautiful method implementing linked list.

2013/4/27 Hardy
On 2013/4/27 1:20, Antony Polukhin wrote:
2013/4/26 Hardy
: On 2013/4/26 18:47, Antony Polukhin wrote:
2013/4/26 Hardy
: Firstly, the 'key' here represent a string. The first 'get-key' function is to get a key which corresponds to the iterator(node in Trie), because in order to save space(which is one of Trie's advantages), the whole key string should not store in a single iterator(node), so it is necessary to provide such an interface under my design.
This looks bad. Iterating through elements of 'string' may be a useful feature, but not the main. User will be putting in container 'string' and he will be expecting to retrieve 'string' when dereferencing iterator.
Since the relationship between key and string makes you confused, I am now trying to explain more about it. In my current design, a key is a 'string' itself. My idea is to take sequences of anytype as an abstract string in the bottom layer of the lib, in order to extend the capability of Trie. And after thinking on it several times, I now prefer to use 'key' as an element of a 'string', for it is more clear to users. Since it is a crucial and essential problem of the whole Trie project, I want some advices of you on it.
'key' is what we are going to store in container. User shall not care that we store key in parts, so it looks right to treat 'key' close to 'string'.
Ok, you mean that I should add this link of our discussion to the 'additional info' of my proposal in GSoC page?
Yes
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I agree with you, that is why I design a layer of interfaces as TrieMap, and TrieSet, in my conceivation, it can be used as stl:map and stl:set, with the difference that they retrieve the keys using Trie data structure. So of course user can get 'key' when dereferencing iterator. Meanwhile, I still think that it is useful to provide some classes on lower layer to not store the keys in the iterators, so that users can have a more space-saving application. In conclusion, I adjust my design to the following form:
bottom layer: class Trie
; The key and value are not in the for as map(pair ), so it is space-saving. second layer: typedef TrieMap
Trie > The two layers can satisfy different needs of users.
Duplicating data is bad. If we have an iterator pointing to leaf node we can restore the whole key, without storing it one more time (leaf node == final element of whole key).
In Trie a non-leaf node can be the ending of a string, so you thought here confused me. How will your design handle this situation? A pointer to a single leaf node from the non-leaf node to it?
How about a node, that has no value and only shows that this is the end of substring: struct node_base { node* next_on_same_level_; node* deeper_or_parent_; // points to parent if next_on_same_level_ is NULL }; template <class Element> struct node_nonleaf: public node_base { Element value_; }; struct node_leaf: public node_base {}; // next_on_same_level_ is always NULL So sting 'abc' in trie would look like node_nonleaf('a') -> node_nonleaf('b') -> node_nonleaf('c') -> node_leaf Now if we add string 'ab' our structure will look like node_nonleaf('a') -> node_nonleaf('b') -> [node_nonleaf('c'), node_leaf] -> node_leaf
In my head node looks like this:
template <class Element> struct node { node* next_on_same_level_; node* deeper_or_parent_; // points to parent if next_on_same_level_ is NULL
Element value_; };
If next_on_same_level_ is NULL - we are in leaf node. To retrieve key we need to move to root, gathering elements.
And it looks like it is possible to get rid of one of the pointers using http://en.wikipedia.org/wiki/XOR_linked_list technique.
That is an amzazing trick, it is good to see such a beautiful method implementing linked list.
There are some more tricks, that allow to hide a few bits of information in XORed value. But generic implementation first, optimizations later. -- Best regards, Antony Polukhin

On 2013/4/28 0:54, Antony Polukhin wrote:
2013/4/27 Hardy
: On 2013/4/27 1:20, Antony Polukhin wrote:
2013/4/26 Hardy
: On 2013/4/26 18:47, Antony Polukhin wrote:
2013/4/26 Hardy
: Firstly, the 'key' here represent a string. The first 'get-key' function is to get a key which corresponds to the iterator(node in Trie), because in order to save space(which is one of Trie's advantages), the whole key string should not store in a single iterator(node), so it is necessary to provide such an interface under my design.
This looks bad. Iterating through elements of 'string' may be a useful feature, but not the main. User will be putting in container 'string' and he will be expecting to retrieve 'string' when dereferencing iterator.
Since the relationship between key and string makes you confused, I am now trying to explain more about it. In my current design, a key is a 'string' itself. My idea is to take sequences of anytype as an abstract string in the bottom layer of the lib, in order to extend the capability of Trie. And after thinking on it several times, I now prefer to use 'key' as an element of a 'string', for it is more clear to users. Since it is a crucial and essential problem of the whole Trie project, I want some advices of you on it.
'key' is what we are going to store in container. User shall not care that we store key in parts, so it looks right to treat 'key' close to 'string'.
Ok, you mean that I should add this link of our discussion to the 'additional info' of my proposal in GSoC page?
Yes
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
I agree with you, that is why I design a layer of interfaces as TrieMap, and TrieSet, in my conceivation, it can be used as stl:map and stl:set, with the difference that they retrieve the keys using Trie data structure. So of course user can get 'key' when dereferencing iterator. Meanwhile, I still think that it is useful to provide some classes on lower layer to not store the keys in the iterators, so that users can have a more space-saving application. In conclusion, I adjust my design to the following form:
bottom layer: class Trie
; The key and value are not in the for as map(pair ), so it is space-saving. second layer: typedef TrieMap
Trie > The two layers can satisfy different needs of users.
Duplicating data is bad. If we have an iterator pointing to leaf node we can restore the whole key, without storing it one more time (leaf node == final element of whole key).
In Trie a non-leaf node can be the ending of a string, so you thought here confused me. How will your design handle this situation? A pointer to a single leaf node from the non-leaf node to it?
How about a node, that has no value and only shows that this is the end of substring:
struct node_base { node* next_on_same_level_; node* deeper_or_parent_; // points to parent if next_on_same_level_ is NULL };
template <class Element> struct node_nonleaf: public node_base { Element value_; };
struct node_leaf: public node_base {}; // next_on_same_level_ is always NULL
So sting 'abc' in trie would look like node_nonleaf('a') -> node_nonleaf('b') -> node_nonleaf('c') -> node_leaf Now if we add string 'ab' our structure will look like node_nonleaf('a') -> node_nonleaf('b') -> [node_nonleaf('c'), node_leaf] -> node_leaf
In my head node looks like this:
template <class Element> struct node { node* next_on_same_level_; node* deeper_or_parent_; // points to parent if next_on_same_level_ is NULL
Element value_; };
If next_on_same_level_ is NULL - we are in leaf node. To retrieve key we need to move to root, gathering elements.
And it looks like it is possible to get rid of one of the pointers using http://en.wikipedia.org/wiki/XOR_linked_list technique.
That is an amzazing trick, it is good to see such a beautiful method implementing linked list.
There are some more tricks, that allow to hide a few bits of information in XORed value. But generic implementation first, optimizations later.
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Ok, thanks, I will digest your idea, and continue optimizing my design. BTW, I've submitted a version of my proposal which shows the result of our discussions till now. And I will refine it carefully when we have further discussion.

On 2013/4/26 16:15, Antony Polukhin wrote:
Besides, considering that traditional Trie has limited scale of using, I want to design my Trie in 3 layers: The bottom one can provide the user-definition key type, which uses like RawTrie
, in which the key can be a sequence of some type(maybe an array of anything can be a key), and every node of Trie can be taken as a root of sub RawTrie, which gives convenience to add Trie into it and even swap a sub Trie. And the second layer will be a specific Trie with the key of chars(when handling lookup and insert, both string and char[] can be adapted to it). And the feature of sub Trie is provided. The class is like Trie . And the third layer will be a Trie without so much freedom of operating as the above two. I will just provide the basic interface as I list above. And the last two layers of Trie will have a version of compressed storage of strings. Relations between layers are not clear.
bottom layer:
class RawTrie
Bay the way, student applications can now be submitted to https://google-melange.appspot.com/gsoc/homepage/google/gsoc2013. The deadline is 3 May. Do not forget to submit your proposal and add link to this discussion in proposal.
Ok, you mean that I should add this link of our discussion to the 'additional info' of my proposal in GSoC page?
-- Best regards, Antony Polukhin
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On 22/04/2013 2:32 AM, Hardy Huang wrote:
July 15 – July 29(To mid-term evaluation): Refine my alpha codes, analyze the performance, and try to improve it. Write an initial report of the project and document of Trie.
July 29 – September 23: Refine my codes, and have some thorough tests on it, make it robust. Reorganize codes and improve documents.
Perhaps you could review this library and consider getting it "incorporated" into Boost: http://code.google.com/p/patl/

On 2013/4/22 17:28, Arash Partow wrote:
On 22/04/2013 2:32 AM, Hardy Huang wrote:
July 15 – July 29(To mid-term evaluation): Refine my alpha codes, analyze the performance, and try to improve it. Write an initial report of the project and document of Trie.
July 29 – September 23: Refine my codes, and have some thorough tests on it, make it robust. Reorganize codes and improve documents.
Perhaps you could review this library and consider getting it "incorporated" into Boost:
http://code.google.com/p/patl/
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Thanks you, I'll try to study into it, and learn to refine my design.
participants (5)
-
Antony Polukhin
-
Arash Partow
-
Hardy
-
Hardy Huang
-
Michael Marcin