Tuesday, November 21, 2006

Free Bioinformatics tools


Arka
Description
Arka is a program that serves as a graphical interface for the programs from the GP package and adds some interesting functions of its own. The main scope of the program is the manipulation and visualisation of DNA / RNA / protein sequences.
The GP package contains many command-line utilities which fulfill a whole bunch of tasks (from DNA sequence searches to restriction analysis and determining the melting temperature of oligonucleotides). While those programs are convenient to use in batch processing and CGI scripts (which was the purpose of those programs), they lack a nice GUI.
Arka remembers the options for the GP programs and knows what both the programs and the options do. Besides, it has some gadgets of its own. It requires GTK+, but doesn't need GNOME. Also, it is small and quick: look, I write and use my programs on an old 486 laptop. It should run like hot butter on your computer. Unless, of course, it is a 386 SX. The name comes from the "UAG" stop codon, which is traditionally called "arka codon".
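One of the GP tasks mentioned above, estimating the melting temperature of an oligonucleotide, can be illustrated with the simple Wallace rule (2 degrees C for each A or T, 4 degrees C for each G or C). This is only a sketch of the general idea: the function name is hypothetical, and nothing here says which formula the GP utilities actually use.

```python
def wallace_tm(oligo: str) -> int:
    """Rough melting temperature in degrees C by the Wallace rule.

    Assumption: this is a generic textbook estimate for short oligos,
    not necessarily the method implemented in the GP package.
    """
    seq = oligo.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence contains characters other than A, C, G, T")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    print(wallace_tm("ATGCATGCATGCATGC"))
```

The rule is reasonable only for short oligonucleotides; longer sequences need a nearest-neighbour model.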
Available Downloads
Source + i386 binaries, tar and gzipped: arka-0.11.tgz
RPM, i386 binaries: arka-0.11-1.i386.rpm
RPM, source: arka-0.11-1.src.rpm

Bioperl
Description
The Bioperl server provides an online resource for modules, scripts, and web links for developers of Perl-based software for life science research. They can also provide web, FTP and CVS space for individuals and organizations wishing to distribute or otherwise make freely available standalone scripts & code.
Tutorial
BioPerl 1.4 Tutorial
Pasteur Institute Bioperl Course
BioPerl 1.4 Module Documentation
Available Downloads
Core:bioperl-1.4
gzip
bz2
zip
ppm.gzip
Run:bioperl-run-1.4
gzip
bz2
Ext:bioperl-ext-1.4
gzip
bz2
DB:bioperl-db-1.4
gzip
zip
Microarray:bioperl-microarray-0.1
gzip
bz2
zip
GUI:bioperl-gui-0.7
gzip

Chemtool
Description
Chemtool is a small program for drawing chemical structures on Linux and Unix systems using the GTK toolkit under X11. A short and possibly outdated description of the available functions is available here.
Chemtool relies on transfig by Brian Smith for PostScript printing and exporting files in PicTeX and EPS formats. Its companion program, XFig, is recommended for enhancing the output of chemtool, and for creation of 2D diagrams and schematics in general. Both are included with most distributions of Linux, and are available through a number of websites including www.xfig.org. If you want to import chemtool drawings into word processing programs other than LaTeX, you will probably want to add a preview bitmap to them, as neither StarOffice/OpenOffice nor that software from Redmond seems to be able to display PostScript inserts on screen without them. For this purpose, either ps2epsi, which comes with ghostscript, or epstool, a part of gsview, is recommended. Since chemtool-1.6, this option is supported directly (through the equivalent function offered by recent versions of transfig).
Chemtool was originally written by Thomas Volk, then a student of chemistry and biology at the University of Ulm, Germany. His version, which was described in an article in the German periodical LinuxMagazin, used plain X11. A more recent review of chemtool appeared in Nachr. Chem. Tech. Lab. 49 (2001) 1310-1314.
Available Downloads
chemtool-1.6.8, source code in tar.gz format
chemtool-1.6.8-1.src.rpm, source code in rpm format
chemtool-1.6.8-1.i586.rpm, SuSE 9.3 rpm package

ClustalW
Description
Clustal W is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen via viewing Cladograms or Phylograms.
Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.
Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). This is true for pairwise and multiple alignments. Global alignments need to use gaps (representing insertions/deletions) while local alignments can avoid them, aligning regions between gaps. ClustalW is a fully automatic program for global multiple alignment of DNA and protein sequences. The alignment is progressive and considers the sequence redundancy. Trees can also be calculated from multiple alignments. The program has some adjustable parameters with reasonable defaults. EBI provides a version of Clustal W that can be executed over the Internet on their computers. In addition, you can download a copy of the basic software to run on your own computer. Versions exist for UNIX, DOS, Windows XP (command line mode only) and Mac OSX.
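The difference between global and local alignment can be made concrete with a pairwise sketch. The code below scores a global alignment (Needleman-Wunsch style) and a local alignment (Smith-Waterman style) using simple dynamic programming. The scoring values are arbitrary assumptions chosen for illustration, and this is not the progressive multiple-alignment procedure that ClustalW itself runs.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning a and b over their entire length."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised: the alignment must cover both sequences.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning any pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets the alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACG"))
    print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

The global score pays for every unmatched end, while the local score only reports the best-matching region, which is why the two approaches suit different questions.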
Tutorials
Nucleotides
Proteins
Available Downloads
DOS
XP
MAC-OSX
UNIX

COALESCE
Description
Metropolis-Hastings Markov Chain Monte Carlo genealogy sampler.
For use in cases without recombination, selection or migration and with constant population size.
This program takes as input a set of aligned DNA or RNA sequences from different individuals in a population and uses them to make a maximum likelihood estimate of the parameter "theta," using the method described in Kuhner et al. (1995). Theta is defined as 4 times the effective population size times the mutation rate in a diploid organism, or 2 times the effective population size times the mutation rate in a haploid. (Note that this is mutation rate per site, not per locus.)
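The definition of theta given above is plain arithmetic, and a minimal sketch makes the diploid/haploid factor explicit. The function and parameter names are hypothetical; COALESCE estimates theta from sequence data rather than computing it from known values like this.

```python
def theta(effective_population_size, mutation_rate_per_site, ploidy="diploid"):
    """Theta = 4*Ne*mu for a diploid organism, 2*Ne*mu for a haploid.

    The mutation rate is per site, not per locus, as noted in the text.
    """
    factors = {"diploid": 4, "haploid": 2}
    if ploidy not in factors:
        raise ValueError("ploidy must be 'diploid' or 'haploid'")
    return factors[ploidy] * effective_population_size * mutation_rate_per_site


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from any data set.
    print(theta(10_000, 1e-8))
    print(theta(10_000, 1e-8, ploidy="haploid"))
```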
COALESCE assumes that the sampled population is of constant size, and that the loci sampled are not affected by selection or recombination. If these assumptions are violated the results may be erroneous. The algorithm begins with a genealogy for the sequences and sequentially makes modifications in it, accepting or rejecting the modifications based on the sequence data, and sampling the current genealogy at intervals. From the sampled genealogies it constructs a likelihood curve and maximum likelihood estimate for theta. The aim is to preferentially sample those genealogies which can contribute substantial information to the estimate of theta, avoiding the myriads of possible but unlikely and thus uninformative genealogies. If more than one locus is analyzed, likelihoods from all loci are summed to make an overall likelihood curve and estimate of theta. The basic unit of progress of the program is a "step"--one proposed change to the genealogy, which may be accepted or rejected. A continuous series of steps, all using the same parameter values, is a "chain".
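The "step" and "chain" vocabulary above is that of a Metropolis-Hastings sampler. The toy sketch below runs one chain over four abstract states instead of genealogies: each step proposes a change, accepts or rejects it, and records the current state. The state space, the weights, and the function name are all assumptions made for illustration; COALESCE's actual proposals act on genealogies and its acceptance test uses the sequence data.

```python
import random


def run_chain(weights, steps, seed=0):
    """Run one Metropolis-Hastings chain over states 0..len(weights)-1.

    weights are unnormalised target probabilities (a stand-in for the
    likelihood of the data given a genealogy). Returns (visits, accepted).
    """
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    visits = [0] * n
    accepted = 0
    for _ in range(steps):
        # One "step": propose a neighbouring state (symmetric proposal).
        proposal = (state + rng.choice((-1, 1))) % n
        # Accept with probability min(1, target(proposal) / target(state)).
        if rng.random() < min(1.0, weights[proposal] / weights[state]):
            state = proposal
            accepted += 1
        # Sample the current state whether or not the proposal was accepted.
        visits[state] += 1
    return visits, accepted


if __name__ == "__main__":
    visits, accepted = run_chain([1, 2, 3, 4], steps=50_000)
    total = sum(visits)
    print([round(v / total, 3) for v in visits], accepted)
```

Over a long chain the visit frequencies approach the normalised weights, which is the property that lets a sampler concentrate on the states that matter.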
Documentation
User's guide
Available Downloads
Unix source code:
UNIX tar.gz encoded [156kb]
UNIX tar.Z encoded [211kb]
PowerMac binary:
HQX encoded CompactPro archive [256kb; binary + documentation]

fastDNAml
Description
fastDNAml is a program for estimating maximum likelihood phylogenetic trees from nucleotide sequences. Much of this program is based on version 3.3 of Joseph Felsenstein's DNAML program.
Reference
G. J. Olsen, H. Matsuda, R. Hagstrom, and R. Overbeek. 1994. fastDNAml: A tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10: 41-48.
Available Downloads
fastDNAml -- The current release of the program.
mpi_fastDNAml and pvm_fastDNAml -- Parallel versions based upon MPI or PVM are available from Indiana University.
fastDNAml_p4 -- A version of the program using the p4 (Portable Programs for Parallel Processing) library. This version is now quite old and is not being supported (see the link above for newer MPI and PVM versions from Indiana University).

FLUCTUATE
Description
FLUCTUATE fits a model in which a single population has been growing (or shrinking) according to an exponential growth law. It estimates 4Nu and g, where N is the effective population size, u is the neutral mutation rate per site, and g is the growth rate of the population. If you have a PowerMac, you will want to fetch the PowerMac binary, or if you have an Intel processor with Windows 95/98/NT/2000/XP, you will want the exe file.
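An exponential growth law can be written out directly. In the sketch below time runs backwards from the present, so a positive growth rate g means the population was smaller in the past. The function name and the time units are assumptions; the exact time scaling FLUCTUATE uses is not stated here, so check its documentation before comparing numbers.

```python
import math


def population_size_in_past(current_size, growth_rate, time_before_present):
    """Population size at a time before the present under exponential growth.

    N(t) = N(0) * exp(-g * t), with t measured backwards from today.
    Assumption: g and t are expressed in compatible (reciprocal) units.
    """
    return current_size * math.exp(-growth_rate * time_before_present)


if __name__ == "__main__":
    # Illustrative values only.
    for t in (0.0, 1.0, 2.0):
        print(t, population_size_in_past(1000.0, 0.5, t))
```

A negative g describes a shrinking population (it was larger in the past), and g = 0 recovers the constant-size case.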
Available Downloads
LINUX tar.gz encoded source and documentation.
LINUX tar.gz encoded binary and documentation.
Self-extracting HQX encoded CompactPro archive [PowerMac binary and documentation]
Self extracting Windows archive [Windows 95/98/NT/2000/XP binary and documentation]
LaTeX version of paper
PostScript version of paper

HMMER
Description
Profile hidden Markov models (profile HMMs) can be used to do sensitive database searching using statistical descriptions of a sequence family's consensus. HMMER is a freely distributable implementation of profile HMM software for protein sequence analysis.
Documentation
User's guide
Available Downloads
All distributions below come with full source code, the User's Guide (PDF format), UNIX man pages, and other documentation. Once you download, uncompress (gunzip), and un-tar (tar xf), see the file INSTALL for quick installation instructions.
HMMER should build cleanly on any UNIX platform, including Mac OS/X. It should also compile on Microsoft Windows platforms, but you would have to work around the GNU configure script and UNIX makefiles. Porting to other non UNIX operating systems such as VAX/VMS should not be difficult. The code is standard ANSI/POSIX C.
Source code
AMD Opteron/Linux
Apple Macintosh PowerPC OS/X
Compaq Alpha Tru64
Compaq Alpha Linux
Hewlett/Packard IA64 (Itanium2), Linux
Hewlett/Packard IA64 (Itanium2), HP/UX
IBM Power4, Linux
IBM Power4, AIX
Intel FreeBSD
Intel GNU/Linux
Intel GNU/Linux as RPM
Intel OpenBSD
Intel Solaris
Silicon Graphics IA64 (Itanium2), Linux
Silicon Graphics MIPS IRIX
Sun Sparc Solaris not currently available; Use source code above

GeneSplicer
Description
A fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been trained and tested successfully on Plasmodium falciparum (malaria), Arabidopsis thaliana, human, Drosophila, and rice. Training data sets for human and Arabidopsis thaliana are included. Use the GeneSplicer Web Interface to run GeneSplicer directly, or see below for instructions on downloading the complete system including source code. GeneSplicer is released as source code and was tested on Linux RedHat 6.x+, Sun Solaris, and Alpha OSF1, but should work on any Unix system.
Available Downloads
GeneSplicer system

GP
Description
GP is a set of small utilities written in ANSI C to manipulate DNA sequences in a Unix fashion, fit for combining within shell and CGI scripts. I wrote these utilities for myself and found them very useful for my work; they are fast and quite reliable, and playing with large numbers of sequences is much more convenient with a command-line interface than with standard GUI tools. Feel free to mail me bug reports and suggestions. The programs are supposed to compile fine under any ANSI C compiler, but I never tried any platform other than Unix / Linux. You will find more details online on the GP man pages. And here is an example of a site using GP programs in CGI scripts to do promoter searches on-the-fly.
Available Downloads
Source + i386 binaries, tar and gzipped: gp-0.26.tgz
RPM, i386 binaries: gp-0.26-1.i386.rpm
RPM, source: gp-0.26-1.src.rpm

Lucy
Description
A Sequence Cleanup Program. Lucy is a utility that prepares raw DNA sequence fragments for sequence assembly, possibly using the TIGR Assembler. The cleanup process includes quality assessment, confidence reassurance, vector trimming and vector removal. The primary advantage of Lucy over other similar utilities is that it is a fully integrated, stand alone program.
Reference
H. H. Chou and M. H. Holmes. 2001. DNA sequence quality trimming and vector removal. Bioinformatics. 17(12): 1093-1104.
Documentation
Program Requirements
Available Downloads
Lucy [Unix version]
Lucy2 [Hui-Hsien Chou's Windows version]

NUT
Description
NUT is an open-source free nutrition software that records what you eat and analyzes your meals for nutrient levels in terms of the "Daily Value" or DV which is the standard for food labeling in the US. The program uses the free food composition database from the USDA. By experimenting, you can find the optimal level of the various nutrients and how to implement this with foods available to you. NUT can help reconstruct the lost instruction manual to your care and feeding because, when the authorities and crackpots disagree on the proper human diet, you can design an experiment using the food composition tables to discover the truth!
Features of NUT include:
7146 foods and 136 nutrients--the complete, latest USDA database
Foods easy to find and add to daily meals
Configurable for 1-19 meals per day and any dietary plan--including low carb, zone, low fat
Comprehensive meal analysis for any number of consecutive meals
Presents both easy-to-read percentage summaries and in-depth nutrient analysis, including Omega-3 and Omega-6 essential fatty acids
Defaults to ounces or grams based on user input
Suggests foods based on current diet
Can easily create additional databases for other family members
Auto-transfer of successful dietary strategies from analysis screen to configuration settings
Allows recording of recipes and customary meals for fast data entry
Guesses recipes of packaged foods
Creates graphs of nutrient intake showing daily and monthly trends
Sorts foods richest in each of the 136 nutrients
Reveals which foods contribute most to user's nutrition
Runs on Linux, Un*x, Windows (DOS); allows dual-boot systems to share the same data; and has no dependencies on other programs
The price is right--it's free! And you can read and modify the source code.
Documentation
Man page
Installation instructions
Opinions on how to improve your nutrition
A frequently-asked question
How To Use NUT
Find the Optimum: It's Easy as 1, 2, 3!
Find the Optimum: How NUT's Default Polyunsaturated Fat Reference Values Were Derived
Find the Optimum: Which Fats?
Find the Optimum: A Word about Insulin Resistance (Which Carbohydrates?)
Find the Optimum: Notes on Vitamins and Minerals
Read about Feline Nutrition and Consider Its Resonance with Human Nutrition
Available Downloads
latest source archive compressed with gzip: nut-11.1.tar.gz
latest source archive compressed with bzip2: nut-11.1.tar.bz2
latest source archive compressed with zip: nut-11.1.zip

PdbAlign
Description
Given a GCG multiple sequence alignment file (a GCG MSF file) which includes a sequence of known structure, the program pdbalign maps the sequence variability onto the known structure. The central premise is, of course, that for a closely related family of proteins (sequence ID > 40%) the 3-D structures will not be significantly different.
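The idea of mapping variability onto a reference sequence can be sketched in a few lines. The code below scores each alignment column by Shannon entropy and keeps only the columns where the reference sequence (the one with known structure) has a residue. Entropy as the variability measure, the gap characters, and sequential residue numbering from 1 are all assumptions made for illustration; they are not taken from pdbalign.

```python
import math
from collections import Counter

GAPS = "-.~"  # assumed gap characters


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [c for c in column if c not in GAPS]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def map_onto_reference(aligned, ref_index=0):
    """Return (position, residue, entropy) for each reference residue."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    mapped = []
    position = 0
    for residue, column in zip(aligned[ref_index], zip(*aligned)):
        if residue in GAPS:
            continue  # column has no counterpart in the known structure
        position += 1
        mapped.append((position, residue, column_entropy(column)))
    return mapped


if __name__ == "__main__":
    alignment = ["AC-D", "ACGE", "AC-D", "ACGE"]
    for row in map_onto_reference(alignment):
        print(row)
```

The resulting per-residue values could then be used to colour the structure, for example in a molecular viewer.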
Reference
Roger A. Sayle, Mansoor A. S. Saqi, M. Weir, Andrew Lyall. 1995. PdbAlign, PdbDist and DistAlign: tools to aid in relating sequence variability to structure. Computer Applications in the Biosciences. 11(5): 571-573.
Documentation
README
Available Downloads
UNIX

PHYLIP
Description
PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees). It is available free over the Internet, and written to work on as many different kinds of computer systems as possible. The source code is distributed (in C), and executables are also distributed. In particular, already-compiled executables are available for Windows (95/98/NT/2000/me/xp), MacOS 8 and 9, MacOS X, and Linux systems. Complete documentation is available in the documentation files that come with the package.
Methods that are available in the package include parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.
The programs are controlled through a menu, which asks the users which options they want to set, and allows them to start the computation. The data are read into the program from a text file, which the user can prepare using any word processor or text editor (but it is important that this text file not be in the special format of that word processor -- it should instead be in "flat ASCII" or "Text Only" format). Some sequence analysis programs such as the ClustalW alignment program can write data files in the PHYLIP format. Most of the programs look for the data in a file called "infile" -- if they do not find this file they then ask the user to type in the file name of the data file.
Output is written onto special files with names like "outfile" and "outtree". Trees written onto "outtree" are in the Newick format, an informal standard agreed to in 1986 by authors of a number of major phylogeny packages. At this stage PHYLIP does not have a mouse-and-windows interface.
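Preparing the plain-text "infile" described above can be scripted. The sketch below writes aligned sequences in the classic sequential layout: a header line with the number of sequences and the number of sites, then one line per sequence with the name padded to ten characters. The layout is stated from memory of the PHYLIP documentation and should be checked against the version you install; the function name is hypothetical.

```python
def to_phylip(records):
    """Format (name, sequence) pairs as sequential PHYLIP text.

    Assumptions: names are truncated or padded to exactly 10 characters,
    and all sequences are already aligned to the same length.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = to_phylip([("Alpha", "ACGTACGT"), ("Beta", "ACGAACGT"), ("Gamma", "ACGTACGA")])
    # PHYLIP programs look for a file called "infile" by default.
    with open("infile", "w", encoding="ascii") as handle:
        handle.write(text)
    print(text, end="")
```

Because names are cut to ten characters, long names that differ only near the end would collide, so it is worth checking for duplicates before writing the file.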
Documentation
Overview
PHYLIP programs and documentation
Installation instructions
Available Downloads
Linux or Unix gzip'ed tar archive of C sources and documentation
Windows Documentation and C source code
Windows95/98/NT/2000/me/xp executables, part 1
Windows95/98/NT/2000/me/xp executables, part 2
Mac OS X Documentation, source code and executables
Mac OS 8 or 9 Single Stuffit Documentation and C source code
Mac OS 8 or 9 Multiple Stuffit Documentation and C source code
Macintosh Mac OS 8 or 9 executables, part 1
Macintosh Mac OS 8 or 9 executables, part 2
Macintosh Mac OS 8 or 9 executables, part 3
Please register after downloading

ProFit
Description
ProFit (pronounced Pro-Fit, not profit!) is designed to be the ultimate program for performing least squares fits of two protein structures. It performs a very simple and basic function, but allows as much flexibility as possible in performing this procedure. Thus one can specify subsets of atoms to be considered, specify zones to be fitted by number, sequence, or by sequence alignment. ProFit does not try to address the question of sorting out equivalent atoms for you beyond doing a sequence alignment. There are other programs such as SSAP and GAFIT which address that problem. You must specify which residues and atoms you consider to be equivalent although the program supports internal sequence alignment to set the zones automatically.
Documentation
Full ProFit documentation
Frequently asked questions
Available Downloads
ProFit is freely available for use by not-for-profit organisations and for commercial organisations (providing they inform the author that they are using it). It may not be distributed without the author's permission, but must be obtained from this site. It is supplied as a gzipped tar file of source code and as a Linux binary.
Bernhard Rupp has kindly provided a ZIP file of ProFit compiled for Windows (Win32). This is only available for Version 2.3 of ProFit.
Registration and download

RasMol
Description
RasMol is a molecular graphics program intended for the visualisation of proteins, nucleic acids and small molecules. The program is aimed at display, teaching and generation of publication quality images. The program has been developed at the University of Edinburgh's Biocomputing Research Unit and the Biomolecular Structures Group at Glaxo Research and Development, Greenford, UK.
RasMol reads in molecular co-ordinate files in a number of formats and interactively displays the molecule on the screen in a variety of colour schemes and representations. Currently supported input file formats include Brookhaven Protein Databank (PDB), Tripos' Alchemy and Sybyl Mol2 formats, Molecular Design Limited's (MDL) Mol file format, Minnesota Supercomputer Center's (MSC) XMol XYZ format, CHARMm format, MOPAC format, CIF format and mmCIF format files. If connectivity information and/or secondary structure information is not contained in the file this is calculated automatically.
The loaded molecule may be shown as wireframe, cylinder (Dreiding) stick bonds, alpha-carbon trace, spacefilling (CPK) spheres, macromolecular ribbons (either smooth shaded solid ribbons or parallel strands), hydrogen bonding and dot surface. Atoms may also be labelled with arbitrary text strings. Alternate conformers and multiple NMR models may be specially coloured and identified in atom labels. Different parts of the molecule may be displayed and coloured independently of the rest of the molecule or shown in different representations simultaneously. The space filling spheres can even be shadowed. The displayed molecule may be rotated, translated, zoomed and z-clipped (slabbed) interactively using either the mouse, the scroll bars, the command line or an attached dials box.
RasMol can read a prepared list of commands from a `script' file (or via interprocess communication) to allow a given image or viewpoint to be restored quickly. RasMol can also create a script file containing the commands required to regenerate the current image. Finally the rendered image may be written out in a variety of formats including both raster and vector PostScript, GIF, PPM, BMP, PICT, Sun rasterfile or as a MolScript input script or Kinemage.
RasMol will run on a wide range of architectures and systems including SGI, sun4, sun3, sun386i, DEC, HP and E&S workstations, IBM RS/6000, Cray, Sequent, DEC Alpha (OSF/1, OpenVMS and Windows NT), IBM PC (under Microsoft Windows, Windows NT, OS/2, Linux, BSD386 and *BSD), Apple Macintosh (System 7.0 or later), PowerMac and VAX VMS (under DEC Windows). UNIX and VMS versions require an 8-bit, 24-bit or 32-bit X Windows frame buffer (X11R4 or later). The X Windows version of RasMol provides optional support for a hardware dials box and accelerated shared memory rendering (via the XInput and MIT-SHM extensions) if available.
Available Downloads
DEC/Compaq/HP
HP
Linux (RedHat 7, i386)
Mac
MS Windows
RS/6000
SGI

SeWeR
Description
SeWeR is an acronym that stands for SEquence analysis using WEb Resources. It serves as a single door to all the common web-based services for sequence analysis. And it sews. It sews all these services together. For a refined mind, SeWeR is an integrated portal to common web-based services in bioinformatics. SeWeR is cross-browser DHTML. It is written entirely in JavaScript 1.2. Hence it will run only in Netscape 4.0 or higher and Internet Explorer 4.0 or higher.
Reference
M. K. Basu. 2001. SeWeR: a customizable and integrated dynamic HTML interface to bioinformatics services. Bioinformatics. 17(6): 577-578.
Available Downloads
SeWeR is feather-light! The whole package is just around 300K. You can even run it from a floppy. The zip archive is available at two locations:
IUBIO
http://www.bioinformatics.org/sewer/sewer.zip

STRIDE
Description
STRIDE is a program to recognize secondary structural elements in proteins from their atomic coordinates. It performs the same task as DSSP by Kabsch and Sander but utilizes both hydrogen bond energy and mainchain dihedral angles rather than hydrogen bonds alone. It relies on database-derived recognition parameters with the crystallographers' secondary structure definitions as a standard-of-truth. Please see Frishman and Argos for detailed description of the algorithm.
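The mainchain dihedral angles mentioned above (phi and psi) are each defined by four consecutive backbone atoms. The sketch below computes a signed dihedral angle from four 3-D points using only the standard library. It shows the geometric quantity only; STRIDE's hydrogen bond energy and its recognition parameters are not reproduced, and the function name is hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond.

    For phi the four atoms would be C(i-1), N(i), CA(i), C(i);
    for psi they would be N(i), CA(i), C(i), N(i+1).
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Made-up coordinates: a cis arrangement, then a trans arrangement.
    print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
    print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))
```

Using atan2 rather than acos keeps the sign of the angle, which matters because phi and psi regions for helices and strands are distinguished by sign as well as magnitude.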
Reference
D. Frishman & P. Argos. 1995. Knowledge-based secondary structure assignment. Proteins. 23: 566-579.
Available Downloads
Executables of STRIDE for several UNIX platforms, VAX/VMS, OpenVMS, DOS and Mac together with documentation and source code are available by anonymous FTP from ftp://ftp.ebi.ac.uk/ (directories /pub/software/unix/stride, /pub/software/dos/stride, /pub/software/vms/stride, /pub/software/mac/stride). Data files with STRIDE secondary structure assignments for the current release of the PDB databank are in the directory /pub/databases/stride of the same site. Atomic coordinate sets can be submitted for secondary structure assignment through electronic mail to stride@embl-heidelberg.de. A mail message containing HELP in the first line will be answered with appropriate instructions. See also WWW page http://www.embl-heidelberg.de/stride/stride_info.html.

XYLEM
Description
XYLEM(1) is a package of tools designed to exploit the Unix environment to enable the user to identify, extract and manipulate data from major databases such as GenBank, EMBL and PIR. SPLITDB splits database files into annotation, sequence, and index files for more efficient searching. Fundamental to the power of these programs is the ability to perform operations on groups of sequences, represented by names or accession numbers which function as virtual database subsets. Keyword searches can be performed by FINDKEY. Hits can be retrieved using FETCH. The most powerful program is FEATURES, which uses the GETOB parser to evaluate GenBank/EMBL/DDBJ Features Table expressions, thereby extracting features (e.g. mRNA, sig_peptide, intron) from lists of entries. Additional programs perform operations such as translation or randomization of datasets, and formatting of multiply-aligned sequences for publication. XYLEM is compatible with the Fristensky Sequence Analysis Package, and the Pearson FASTA programs(2), and can be used from within the Genetic Data Environment (GDE) of Steven Smith(3).
Reference
(1) B. Fristensky. 1993. Feature expressions: creating and manipulating sequence datasets. NAR. 21: 5997-6003.
(2) W. R. Pearson and D. J. Lipman. 1988. Improved tools for biological sequence comparison. PNAS. 85: 2444-2448.
(3) S. W. Smith, R. Overbeek, C. R. Woese, W. Gilbert and P. M. Gillevet. 1994. The genetic data environment: an expandable GUI for multiple sequence analysis. Computer Applications in the Biosciences. 10: 671-675.
Available Downloads
Source code and documentation (xylem.1.8.7.tar.Z, 418 k)
Solaris/Sparc binaries (xylem.1.8.7.solaris-bin.tar.Z, 192k)
Linux/Intel binaries (xylem.1.8.7.linux-bin.tar.Z, 179k)
