== Techniques ==
Computational biologists use a wide range of software and algorithms to carry out their research.

===Unsupervised Learning===
[[Unsupervised learning]] is a type of algorithm that finds patterns in unlabeled data. One example is [[k-means clustering]], which partitions ''n'' data points into ''k'' clusters so that each data point belongs to the cluster with the nearest mean. A variant is the [[k-medoids]] algorithm, which represents each cluster by one of its own data points (the medoid) rather than by the average of the cluster's points.
[[File:Jmatrix.png|thumb|A heat-map of the Jaccard distances of nuclear profiles]]
The k-medoids algorithm follows these steps:
# Randomly select ''k'' distinct data points. These are the initial cluster centers.
# Measure the distance between each point and each of the ''k'' cluster centers.
# Assign each point to the nearest cluster.
# Find the center of each cluster (medoid).
# Repeat steps 2 to 4 until the clusters no longer change.
# Assess the quality of the clustering by adding up the variation within each cluster.
# Repeat the process with different values of ''k''.
# Pick the best value for ''k'' by finding the "elbow" in the plot of within-cluster variation against ''k'', the point beyond which increasing ''k'' no longer reduces the variation substantially.
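
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes distinct points, Euclidean distance, and a simple alternating update, and the names <code>k_medoids</code>, <code>assign</code>, and <code>distance</code> as well as the toy points are hypothetical.

```python
import random

def distance(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def assign(points, medoids):
    """Steps 2-3: put every point in the cluster of its nearest medoid."""
    clusters = [[] for _ in medoids]
    for p in points:
        nearest = min(range(len(medoids)), key=lambda i: distance(p, medoids[i]))
        clusters[nearest].append(p)
    return clusters

def k_medoids(points, k, seed=0, max_iter=100):
    """Return (medoids, clusters, cost); assumes the points are distinct."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)            # step 1: k distinct starting points
    clusters = assign(points, medoids)
    for _ in range(max_iter):
        # step 4: the new medoid is the member with the smallest
        # total distance to the other members of its cluster
        new_medoids = [
            min(c, key=lambda cand: sum(distance(cand, q) for q in c))
            for c in clusters
        ]
        if new_medoids == medoids:             # step 5: stop when nothing changes
            break
        medoids = new_medoids
        clusters = assign(points, medoids)
    # step 6: total within-cluster variation (sum of distances to the medoid)
    cost = sum(distance(p, medoids[i]) for i, c in enumerate(clusters) for p in c)
    return medoids, clusters, cost

# Hypothetical toy data: two well-separated groups of 2D points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# steps 7-8: record the cost for several values of k and look for the elbow
costs = {k: k_medoids(points, k)[2] for k in range(1, 5)}
```

Because the starting medoids are random, a single run can settle in a poor local optimum; in practice the procedure is often repeated with several seeds and the lowest-cost result is kept.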

One application of this in biology is the 3D mapping of a genome. For example, data on the HIST1 region of mouse chromosome 13 can be obtained from the [[Gene Expression Omnibus]].<ref>{{cite web | url=https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64881 | title=GEO Accession viewer }}</ref> The data record which nuclear profiles show up in which genomic regions. With this information, the [[Jaccard distance]] can be used to compute a normalized distance between every pair of loci.
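
As a rough sketch of that calculation, the Jaccard distance between two loci can be computed from the sets of nuclear profiles in which each locus was detected. The locus names, profile names, and the <code>detections</code> table below are invented for illustration and are not taken from the GEO dataset.

```python
def jaccard_distance(a, b):
    """Jaccard distance: 1 - |A intersect B| / |A union B|, between 0 and 1."""
    a, b = set(a), set(b)
    union = a | b
    if not union:          # two empty sets are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical toy data: locus -> nuclear profiles in which it was detected
detections = {
    "locus_1": {"NP1", "NP2", "NP3"},
    "locus_2": {"NP2", "NP3", "NP4"},
    "locus_3": {"NP5"},
}

# Pairwise distance matrix, rows and columns in sorted locus order
loci = sorted(detections)
matrix = [
    [jaccard_distance(detections[r], detections[c]) for c in loci]
    for r in loci
]
```

Here <code>locus_1</code> and <code>locus_2</code> share two of four profiles, giving a distance of 0.5, while <code>locus_3</code> shares none with the others, giving the maximum distance of 1. A matrix of this kind can be drawn as a heat map or passed to a clustering algorithm.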

===Graph Analytics===
Graph analytics, or [[Network theory|network analysis]], is the study of graphs that represent connections between different objects. Graphs can represent many kinds of networks in biology, such as [[Protein–protein interaction|protein–protein interaction]] networks, regulatory networks, and metabolic and biochemical networks. There are many ways to analyze these networks, one of which is to measure [[centrality]]. Centrality measures rank nodes by their importance or connectedness in the graph, which is useful for finding the most important nodes. For example, given data on the activity of genes over a time period, degree centrality can be used to see which genes are most active throughout the network, or which genes interact with others the most. This contributes to the understanding of the roles certain genes play in the network.

There are many ways to calculate centrality in a graph, each of which gives a different kind of information about a node's importance. Centrality analysis can be applied in many different circumstances in biology, including gene regulatory, protein interaction, and metabolic networks.<ref name=":3">{{Cite journal |last1=Koschützki |first1=Dirk |last2=Schreiber |first2=Falk |date=2008-05-15 |title=Centrality Analysis Methods for Biological Networks and Their Application to Gene Regulatory Networks |journal=Gene Regulation and Systems Biology |volume=2 |pages=193–201 |doi=10.4137/grsb.s702 |issn=1177-6250 |pmc=2733090 |pmid=19787083}}</ref>
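
Degree centrality, the simplest of these measures, can be sketched as follows. The example assumes an undirected graph given as a list of edges and normalizes each node's degree by the largest possible degree, ''n'' − 1; the gene names and interactions are hypothetical.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Map each node to degree / (n - 1) for an undirected edge list."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:                    # self-loops are ignored in this sketch
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Hypothetical interaction network between four genes
edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
]
scores = degree_centrality(edges)
most_central = max(scores, key=scores.get)
```

In this toy network <code>geneA</code> interacts with every other gene and therefore receives the highest score. Other measures, such as betweenness or closeness centrality, would rank nodes by different criteria.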

===Supervised Learning===
[[Supervised learning]] is a type of algorithm that learns from labeled data and then assigns labels to new, unlabeled data. In biology, supervised learning is helpful when some data have already been categorized and further data need to be sorted into the same categories.

[[File:Random forest explain.png|thumb|350x350px|Diagram showing a simple random forest]]
A common supervised learning algorithm is the [[random forest]], which combines numerous [[Decision tree learning|decision trees]] into a single model that classifies data. Forming the basis of the random forest, a decision tree is a structure that classifies, or labels, a data point using certain known features of that data point. A practical biological example would be taking an individual's genetic data and predicting whether that individual is predisposed to develop a certain disease or cancer. At each internal node the algorithm checks exactly one feature of the input, a specific gene in the previous example, and then branches left or right based on the result. At each leaf node, the decision tree assigns a class label. In practice, the algorithm follows a single root-to-leaf path through the decision tree, determined by the input, and the leaf it reaches gives the classification. Commonly, decision trees have target variables that take on discrete values, like yes/no, in which case the tree is referred to as a [[Classification chart|classification tree]]; if the target variable is continuous, it is called a [[regression tree]]. A decision tree is constructed by training it on a training set to identify which features are the best predictors of the target variable.
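
The root-to-leaf walk and the combination of several trees can be illustrated with a small sketch. The trees below are written by hand rather than learned from a training set, and the gene names, labels, and the tuple-based tree encoding are assumptions made for the example. The forest's answer is taken as the majority vote of its trees.

```python
from collections import Counter

# A tree is either a leaf (a label string) or an internal node:
# (feature_name, subtree_if_feature_absent, subtree_if_feature_present)

def classify(tree, sample):
    """Walk one root-to-leaf path, checking one feature per internal node."""
    while not isinstance(tree, str):
        feature, if_absent, if_present = tree
        tree = if_present if sample[feature] else if_absent
    return tree

def forest_predict(trees, sample):
    """Return the label predicted by the majority of the trees."""
    votes = Counter(classify(tree, sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical hand-written trees over three gene variants
tree_1 = ("gene_a", "low risk", ("gene_b", "low risk", "high risk"))
tree_2 = ("gene_b", "low risk", "high risk")
tree_3 = ("gene_c", "low risk", "high risk")
forest = [tree_1, tree_2, tree_3]

# Hypothetical individual: carries the gene_a and gene_b variants only
sample = {"gene_a": True, "gene_b": True, "gene_c": False}
prediction = forest_predict(forest, sample)
```

For this sample the first two trees vote "high risk" and the third votes "low risk", so the forest predicts "high risk". A real random forest would learn each tree from a random resampling of the training data and a random subset of the features, rather than using fixed trees as here.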

===Open source software===
[[Open source software]] provides a platform for computational biology where everyone can access and benefit from software developed in research.<ref>{{Cite journal |last1=Boudreau |first1=Mathieu |last2=Poline |first2=Jean-Baptiste |last3=Bellec |first3=Pierre |last4=Stikov |first4=Nikola |date=2021-02-11 |title=On the open-source landscape of PLOS Computational Biology |journal=PLOS Computational Biology |language=en |volume=17 |issue=2 |pages=e1008725 |doi=10.1371/journal.pcbi.1008725 |doi-access=free |issn=1553-7358 |pmc=7877734 |pmid=33571204|bibcode=2021PLSCB..17E8725B }}</ref> [[PLOS]] cites four main reasons for the use of open source software:
* [[Reproducibility]]: Researchers can use the exact methods used to calculate the relations between biological data.
* Faster development: Developers and researchers do not have to reinvent existing code for minor tasks. Instead, they can use pre-existing programs to save time on the development and implementation of larger projects.
* Increased quality: Input from multiple researchers studying the same topic provides a layer of assurance against errors in the code.
* Long-term availability: Open source programs are not tied to any businesses or patents. This allows them to be posted to multiple [[web page]]s, which helps ensure that they remain available in the future.<ref>{{cite journal|journal=PLOS Computational Biology| doi=10.1371/journal.pcbi.1002799 | volume=8| issue=11 |title=The PLOS Computational Biology Software Section|pages=e1002799|year=2012|last1=Prlić|first1=Andreas| last2=Lapp | first2=Hilmar |pmc=3510099| bibcode=2012PLSCB...8E2799P | doi-access=free }}</ref>