Hi,
I'm a student from Tsinghua University in china and interested
in the idea about "approximate string matching". Below is my
proposal for this. Could you please view it and give me any
kinds of suggestions on it? Thanks!
(1) Personal Details
* Name: Yu Jiang
* College/University: Tsinghua University, China
* Course/Major: Computer Science and Technology
* Degree Program: M.Sc.
* Email: sunlightt07@gmail.com
* Homepage: do not have one
* Availability:
- How much time do you plan to spend on your GSoC?
Since I have a long summer holiday, I can spend 30 hours a week.
- What are you intended start and end dates?
I will follow the calendar of GSoC. I have plenty of time from June
to September.
- What other factors affect your availability (exams, courses,
moving, work, etc.)?
I have some courses this term which will end before June. 15th.
However, these courses don't have heavy workload, I still have
enough time to work on GSoC.
(2) Background Information
* Please summarize your educational background (degrees earned,
courses taken, etc.).
I majored in Computer Science and got my bachelor degree in Tsinghua
University from 2008 to 2012. I learned some basic courses such as
algorithm,
data structure, algebra and logic. I also learned many advanced courses
such as database, operating system, network, formal language, compiler,
computer architecture. Currently I'm working for my master degree.
My research interest is database.
* Please summarize your programming background (OSS projects,
internships, jobs, etc.).
This is the first time I plan to take part in an OSS projects.
I have worked as an intern in Hulu, an online video website,
for one year, mainly focus on frontend webpages.
I have used C++ to finish many course projects, such as a simple
ftp server and client, completing a tiny os and so on.
Recently I am learning the new features in C++11 and get to know
much more about the C++ language.
* Please tell us a little about your programming interests.
I really love to optimize on the implantation of an algorithm by
making full use of the compiler and the architecture(e.g. cache, SIMD).
Besides, I like to write webpages using different kinds of excellent
frameworks.
* Please tell us why you are interested in contributing to the
Boost C++ libraries.
Boost is the most famous C++ libraries. It is powerful and can be
used in many different kinds of programs. It's my honor to add
some new features into it.
* What is your interest in the project you are proposing?
It's related to my research interest. Nowadays data are produced
rapidly. To clean or integrate them so that we can get a better view,
"Approximate string matching" is rather a powerful tool. I'd like to
see such functions in boost library.
* Have you done any previous work in this area before or on
similar projects?
Yes. My lab has focused on researching on such algorithms for 3-5 years.
The "best" algorithm for general string similarity search and join is
called "passjoin", which is invented by our lab. I have implemented it
and done a lot of optimizations. Early this year we attended a string
similarity search & join competition[1] and achieved champion.
The code is mainly written by me.
[1]
http://www2.informatik.hu-berlin.de/~wandelt/searchjoincompetition2013/Resul...
* What are your plans beyond this Summer of Code time frame for
your proposed work?
I will make efforts to make it accepted by the boost library. If
it can be released someday, I will maintain it.
* Please rate, from 0 to 5(0 being no experience, 5 being expert),
your knowledge of the following languages, technologies, or tools:
-C++: 3
-C++ Standard Library: 3.5
-Boost C++ Libraries: 2
-Subversion: 3
* What software development environments are you most familiar with?
Eclipse. I use it in both windows and linux.
* What software documentation tool are you most familiar with?
Doxygen.
(3) Project Proposal
Approximate string matching is a "hotspot" in recent years. It's widely
used in big data processing such as search engine, DNA matching. It also
can be used in simple tasks like spell checking. However, there isn't a
high performance and well-designed library in such area. I wish to see
such algorithms in boost so that many people can benefit from this!
My proposal contains two parts:
A. To implement basic similarity metrics. Edit distance and Jaccard
may be the most popular ones. Well, I'll implement more similarity metrics,
such as Hamming distance, dice and cosine.
B. To implement the fastest algorithm for string similarity search and join.
Similarity metrics will be edit distance only since it is most commonly
used.
Similarity search is to find out all the strings in a given set S, which are
"similar" to a query string. Similarity join is to find out all the string
pairs