Genetics library: Volunteers needed
Hi All,

I am recruiting users for the putative genetics library.

https://github.com/andy-thomason/genetics

We have a few simple examples of gene searching, and I am working on a more complete aligner example and some performance improvements to the index data structure.

For data, you can obtain the human genome from:

ftp://ftp.ensembl.org/pub/release-81/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Interesting problems we would like to solve:

Given a 20-character sequence with up to six errors, what is the fastest way to list all possibilities other than a brute-force search (CRISPR)?
Can we use JNI to connect the library to Hadoop and other distributed search systems?
Can we construct a database of all known viral genomes, including recombination?
Can we detect variations in MHC VDJ regions within a single sample?

Many other interesting puzzles are there to be found...

Andy.
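For reference, the brute-force baseline that the CRISPR question hopes to improve on can be sketched roughly as follows. This is only an illustration, not code from the genetics library: the function name `brute_force_search` is hypothetical, and it assumes that "errors" means substitutions only (Hamming distance), with no insertions or deletions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical baseline: return every offset in `genome` where `pattern`
// matches with at most `max_mismatches` substitutions (Hamming distance).
std::vector<std::size_t> brute_force_search(const std::string& genome,
                                            const std::string& pattern,
                                            std::size_t max_mismatches) {
    assert(!pattern.empty());
    std::vector<std::size_t> hits;
    if (genome.size() < pattern.size()) return hits;

    for (std::size_t pos = 0; pos + pattern.size() <= genome.size(); ++pos) {
        std::size_t mismatches = 0;
        for (std::size_t i = 0; i < pattern.size(); ++i) {
            // Stop scanning this window as soon as the error budget is exceeded.
            if (genome[pos + i] != pattern[i] && ++mismatches > max_mismatches) break;
        }
        if (mismatches <= max_mismatches) hits.push_back(pos);
    }
    return hits;
}
```

The cost grows with the genome length times the pattern length, which is why an index-based alternative is attractive for a genome-sized input.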
-----Original Message----- From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Andy Thomason Sent: 21 July 2015 13:11 To: boost@lists.boost.org Subject: [boost] Genetics library: Volunteers needed
Potential users may find the draft docs useful at https://rawgit.com/andy-thomason/genetics/master/doc/html/index.html (though they lack some icons and style sheets needed to see them in all their glory ;-)
2015-07-21 15:11 GMT+03:00 Andy Thomason
Hi All,
I am recruiting users for the putative genetics library.
Hi,

I like the idea of a genetics library in Boost!

However, the code misses some essential optimizations and suffers from premature ones.

* dna_string misses a reserve() call in assignment. This makes some of the push_back() calls slow.
* Attempting to understand the exact search rewarded me with a headache (cool hack, I enjoyed it!). There are too many magic constants and variables, which makes the algorithm hard to maintain. I also doubt that the algorithm is optimal: you are comparing 4 nucleotides at a time, and 256 (4^4) combinations of 4 nucleotides exist. Let's assume for simplicity that nucleotides are uniformly distributed. The algorithm will often give false positives: it will be triggered roughly once every 256 nucleotide comparisons. You're doing some kind of vectorization, so the algorithm will give a false positive every ~8 loop bodies. Comparing longer nucleotide chains will trigger compare_inexact less often. For example, comparing 8 nucleotides at a time will trigger a false positive once per ~65,500 comparisons (4^8 = 65,536).
* The comparison operators need improvement. Compare sizes first (it's cheap!). Use memcmp in cases like `values < rhs.values || values == rhs.values`. memcmp gives you an integer that already shows whether the value is bigger, smaller, or equal, without needing to iterate over the data a second time.
* `const auto str_values = str.get_values();` must be `const auto& str_values = str.get_values();`
* Provide an enum for nucleotides { nA = 0, nT = ...}. This would make the library more user friendly.

There's more. If you're interested, I can investigate further.

-- Best regards, Antony Polukhin
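Several of these suggestions (calling reserve() before repeated push_back(), a nucleotide enum, and a single memcmp-based three-way comparison) can be illustrated together in a minimal sketch. Everything here is an assumption made for illustration: the names `nucleotide`, `dna_sketch`, `to_nucleotide` and `from_string` are hypothetical, the numeric codes for nC, nG and nT are guesses (the message leaves them unspecified), and the sketch stores one base per byte rather than whatever packed layout the library's dna_string actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical enum; only nA = 0 comes from the message, the rest is assumed.
enum nucleotide : std::uint8_t { nA = 0, nC = 1, nG = 2, nT = 3 };

// Map an ASCII base to its code; asserts on anything outside A, C, G, T.
inline nucleotide to_nucleotide(char c) {
    switch (c) {
        case 'A': return nA;
        case 'C': return nC;
        case 'G': return nG;
        case 'T': return nT;
    }
    assert(false && "expected one of A, C, G, T");
    return nA;  // only reached when assertions are disabled
}

struct dna_sketch {
    std::vector<std::uint8_t> values;  // one code per base (assumption)

    // Three-way comparison: negative, zero or positive, from one pass over the data.
    int compare(const dna_sketch& rhs) const {
        const std::size_t n =
            values.size() < rhs.values.size() ? values.size() : rhs.values.size();
        const int c = n ? std::memcmp(values.data(), rhs.values.data(), n) : 0;
        if (c != 0) return c;
        if (values.size() == rhs.values.size()) return 0;
        return values.size() < rhs.values.size() ? -1 : 1;  // shorter prefix sorts first
    }

    // Equality can reject on size alone before touching the data.
    bool operator==(const dna_sketch& rhs) const {
        return values.size() == rhs.values.size() && compare(rhs) == 0;
    }
    bool operator<(const dna_sketch& rhs) const { return compare(rhs) < 0; }
    bool operator<=(const dna_sketch& rhs) const { return compare(rhs) <= 0; }
};

// Build from text, reserving once so push_back never has to regrow the buffer.
inline dna_sketch from_string(const std::string& s) {
    dna_sketch d;
    d.values.reserve(s.size());
    for (char c : s) d.values.push_back(to_nucleotide(c));
    return d;
}
```

In this sketch the size check comes first only for equality. Applying the same shortcut to operator< would change the ordering from lexicographic to length-first, so the ordering operators compare the common prefix and use the sizes only as a tie-breaker.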
Whoa, you want to use the JNI manually? Have you heard about JNA? It might
be good to take a look at that because it really takes the pain out of
working with Java <-> native.
On Thu, Jul 23, 2015 at 2:38 PM, Antony Polukhin
Hi Kenneth,
Whoa, you want to use the JNI manually? Have you heard about JNA? It might be good to take a look at that because it really takes the pain out of working with Java <-> native.
That is useful information, for example.

Andy.
participants (4)
- Andy Thomason
- Antony Polukhin
- Kenneth Adam Miller
- Paul A. Bristow