| Title: | Mathematical and Statistical Tools in 'ribios' |
|---|---|
| Description: | Mathematical and statistical tools for computational biology in drug discovery. Functions are designed for high performance. Zhang (2025) <https://github.com/bedapub/ribiosMath>. |
| Authors: | Jitao David Zhang [aut, cre, ctb] (ORCID: <https://orcid.org/0000-0002-3085-0909>) |
| Maintainer: | Jitao David Zhang <[email protected]> |
| License: | GPL-3 |
| Version: | 1.2.0 |
| Built: | 2026-06-03 08:33:31 UTC |
| Source: | https://github.com/bedapub/ribiosMath |
The function returns column-wise kappa statistics of a matrix, using a linear algebra procedure implemented in C++.
colKappa(matrix, minOverlap = 0L)
matrix |
An adjacency matrix, containing values of either 0 or 1 (default), or values between 0 and 1 (weighted). |
minOverlap |
Integer, minimal overlap between two columns in order to be considered. Pairs with fewer overlaps will return |
A matrix of size $n \times n$ if the input matrix is of size $m \times n$.
A kappa statistic of 1 indicates perfect agreement. A value of 0 indicates no agreement beyond chance. Note that the value can be negative, which implies that the agreement is worse than random.
rowKappa to calculate the statistic of rows
Other kappa functions:
kappaSimp(),
rowKappa()
testMat <- cbind(c(1,1,0,0,1,0), c(1,1,0,1,1,0))
colKappa(testMat)
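To make the statistic concrete, the following base R sketch computes a kappa value for every pair of columns of a binary matrix. It assumes that the statistic is Cohen's kappa for two binary raters; `cohen_kappa` and `col_kappa_sketch` are hypothetical helper names, and the explicit loop is for illustration only, not the linear algebra routine that the package implements in C++.

```r
# Cohen's kappa between two binary (0/1) vectors of equal length.
# Assumption: this illustrates the statistic, not the package's C++ routine.
cohen_kappa <- function(a, b) {
  stopifnot(length(a) == length(b))
  po <- mean(a == b)                    # observed agreement
  pe <- mean(a) * mean(b) +             # chance agreement on value 1
    (1 - mean(a)) * (1 - mean(b))       # chance agreement on value 0
  (po - pe) / (1 - pe)
}

# Kappa for all pairs of columns (hypothetical helper)
col_kappa_sketch <- function(mat) {
  n <- ncol(mat)
  res <- matrix(NA_real_, nrow = n, ncol = n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      res[i, j] <- cohen_kappa(mat[, i], mat[, j])
    }
  }
  res
}

testMat <- cbind(c(1,1,0,0,1,0), c(1,1,0,1,1,0))
kappaSketch <- col_kappa_sketch(testMat)
kappaSketch
```

In this example the two columns agree in five of six positions and the chance agreement is 0.5, so the off-diagonal value is (5/6 - 1/2) / (1 - 1/2) = 2/3.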
Calculate the cosine distance between two vectors (matrices)
cosdist(x, y, na.rm = TRUE)
x |
An integer or numeric vector or matrix |
y |
An integer or numeric vector or matrix |
na.rm |
Logical, whether |
Cosine distance is defined by $1 - \cos(\alpha)$, where $\cos(\alpha)$ represents the cosine similarity.
If parameters are given as matrices, the function calculates the cosine distance between all pairs of columns of both matrices.
Numeric vector or matrix, the cosine distance between the inputs.
Currently, na.rm is only considered when both inputs are vectors.
Jitao David Zhang <[email protected]>
https://en.wikipedia.org/wiki/Cosine_similarity
testVal1 <- rnorm(10)
testVal2 <- rnorm(10)
testVal3 <- c(rnorm(9), NA)
cosdist(testVal1, testVal2)
cosdist(testVal1, testVal3, na.rm=TRUE)
cosdist(testVal1, testVal3, na.rm=FALSE)
## test matrix
testMat1 <- matrix(rnorm(1000), nrow=10)
testMat2 <- matrix(rnorm(1000), nrow=10)
testVecMatDist1 <- cosdist(testMat1[,1L], testMat2)
testVecMatDist <- cosdist(testMat1, testMat2)
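As a reference point for interpreting the results, here is a minimal base R sketch of the cosine distance between two vectors, taken as one minus the cosine similarity. `cos_dist_sketch` is a hypothetical helper, it does not handle missing values, and it is not the package's implementation.

```r
# Cosine distance between two numeric vectors of equal length.
# cos_dist_sketch is a hypothetical helper, not part of the package.
cos_dist_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  cos_sim <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  1 - cos_sim
}

dSame <- cos_dist_sketch(c(1, 0), c(1, 0))      # same direction: 0
dOrth <- cos_dist_sketch(c(1, 0), c(0, 1))      # orthogonal: 1
dOpposite <- cos_dist_sketch(c(1, 0), c(-1, 0)) # opposite direction: 2
```

The distance therefore ranges from 0 (same direction) to 2 (opposite direction).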
Calculate the cosine similarity between two vectors (matrices)
cossim(x, y, na.rm = TRUE)
x |
An integer or numeric vector or matrix |
y |
An integer or numeric vector or matrix |
na.rm |
Logical, whether |
If given as vectors, x and y must be of the same
length. If given as matrices, both must have the same number of
rows. If given as a pair of matrix and vector, the length of the
vector must match the number of rows of the matrix. Otherwise the
function aborts and prints an error message.
If parameters are given as matrices, the function calculates the cosine similarity between all pairs of columns of both matrices.
If na.rm is set to FALSE, any NA in the input vectors
will cause the result to be NA, or NaN if all values
turn out to be NA.
Numeric vector or matrix, the cosine similarity between the inputs.
Currently, na.rm is only considered when both inputs are vectors.
Jitao David Zhang <[email protected]>
https://en.wikipedia.org/wiki/Cosine_similarity
testVal1 <- rnorm(10)
testVal2 <- rnorm(10)
testVal3 <- c(rnorm(9), NA)
cossim(testVal1, testVal2)
cossim(testVal1, testVal3, na.rm=TRUE)
cossim(testVal1, testVal3, na.rm=FALSE)
cosdist(testVal1, testVal2)
cosdist(testVal1, testVal3, na.rm=TRUE)
cosdist(testVal1, testVal3, na.rm=FALSE)
## test matrix
testMat1 <- matrix(rnorm(1000), nrow=10)
testMat2 <- matrix(rnorm(1000), nrow=10)
system.time(testMatCos <- cossim(testMat1, testMat2))
testMatVec <- cossim(testMat1, testMat2[,1L])
testVecMat <- cossim(testMat1[,1L], testMat2)
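The matrix case, where similarities are calculated between all pairs of columns, can be sketched in base R with a single cross product divided by the outer product of the column norms. `cos_sim_cols` is a hypothetical helper written under the assumption that neither input contains missing values or all-zero columns; it is not the package's implementation.

```r
# Cosine similarity between all pairs of columns of X and Y.
# cos_sim_cols is a hypothetical helper, not part of the package.
cos_sim_cols <- function(X, Y) {
  stopifnot(nrow(X) == nrow(Y))
  normX <- sqrt(colSums(X^2))            # Euclidean norm of each column of X
  normY <- sqrt(colSums(Y^2))            # Euclidean norm of each column of Y
  crossprod(X, Y) / outer(normX, normY)  # t(X) %*% Y, scaled by the norms
}

set.seed(1)
X <- matrix(rnorm(30), nrow = 10)
Y <- matrix(rnorm(40), nrow = 10)
simMat <- cos_sim_cols(X, Y)
dim(simMat)  # 3 4: one row per column of X, one column per column of Y
```

Element `[i, j]` of the result is the similarity between column `i` of `X` and column `j` of `Y`.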
The function implements the hierarchical fuzzy multi-linkage partitioning method used in the DAVID Bioinformatics tool.
davidClustering_kappa(
  kappaMatrix,
  kappaThr = 0.35,
  initialGroupMembership = 3L,
  multiLinkageThr = 0.5,
  mergeRule = 1L
)
kappaMatrix |
A numeric matrix of Kappa statistics, which is likely returned by |
kappaThr |
Numeric, the threshold of the Kappa statistic, which is used to select initial seeds. Default value: 0.35, as recommended by the authors of the original study based on their experience. |
initialGroupMembership |
Non-negative integer, the minimal number of members in initial groups. Default value: 3. |
multiLinkageThr |
Numeric, the minimal linkage between two groups to be merged. Default value: 0.5. |
mergeRule |
Integer, how two seeds are merged. See below. Currently following merge rules are implemented:
|
A list of integer vectors. Each element represents a cluster and contains the indices of rows belonging to that cluster. Rows can appear in multiple clusters (fuzzy clustering).
The function has only been tested on a few anecdotal examples. Caution and more systematic testing are required before it is applied to critical datasets.
Jitao David Zhang <[email protected]>
Huang et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biology, 2007. doi:10.1186/gb-2007-8-9-r183
synData <- matrix(c(rep(c(rep(1, 10), rep(0, 5)), 3),
                    rep(0, 4), rep(1, 7), rep(0,4),
                    rep(c(rep(0,5), rep(1,10)), 3),
                    rep(c(rep(0,3), 1), 4)[-16]),
                  ncol=15, byrow=TRUE)
rownames(synData) <- sprintf("Gene %s", letters[1:8])
colnames(synData) <- sprintf("t%d", 1:15)
synKappaMat <- rowKappa(synData)
synKappaMat.round2 <- round(synKappaMat, 2)
davidClustering_kappa(synKappaMat.round2)
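To illustrate how `kappaThr` and `initialGroupMembership` could interact, the base R sketch below performs a simplified seed selection on a toy kappa matrix. It is an assumption based on the parameter descriptions above, not the package's algorithm: `select_seeds` is a hypothetical helper, and the later merging of groups controlled by `multiLinkageThr` is not shown.

```r
# Simplified seed selection (hypothetical helper, not the package's code).
# For each row, the candidate seed contains all rows whose kappa with it
# reaches kappaThr; seeds with too few members are dropped.
select_seeds <- function(kappaMatrix, kappaThr = 0.35,
                         initialGroupMembership = 3L) {
  seeds <- lapply(seq_len(nrow(kappaMatrix)), function(i) {
    which(kappaMatrix[i, ] >= kappaThr)
  })
  Filter(function(s) length(s) >= initialGroupMembership, seeds)
}

# Toy kappa matrix: items 1-3 agree strongly, items 4-5 agree strongly
km <- matrix(0, nrow = 5, ncol = 5)
km[1:3, 1:3] <- 0.8
km[4:5, 4:5] <- 0.9
diag(km) <- 1

seeds <- select_seeds(km)
uniqueSeeds <- unique(seeds)  # drop redundant (identical) seeds
```

Items 1 to 3 each yield the same seed with three members, whereas the pair formed by items 4 and 5 is too small to qualify, so one unique seed remains after removing redundant ones.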
The function implements the hierarchical fuzzy multi-linkage partitioning method used in the DAVID Bioinformatics tool.
davidClustering_kappa_R(
  kappaMatrix,
  kappaThr = 0.35,
  initialGroupMembership = 3,
  multiLinkageThr = 0.5,
  removeRedundant = TRUE,
  debug = FALSE
)
kappaMatrix |
A numeric matrix of Kappa statistics, which is likely returned by |
kappaThr |
Numeric, the threshold of the Kappa statistic, which is used to select initial seeds. Default value: 0.35, as recommended by the authors of the original study based on their experience. |
initialGroupMembership |
Integer, the minimal number of members in initial groups. Default value: 3. |
multiLinkageThr |
Numeric, the minimal linkage between two groups to be merged. Default value: 0.5. |
removeRedundant |
Logical, whether redundant initial groups should be removed before clustering. Used for debugging. Setting as |
debug |
Logical, whether seed information is printed for debugging purposes. |
The function has only been tested on a few anecdotal examples. Caution and more systematic testing are required before it is applied to critical datasets.
Jitao David Zhang <[email protected]>
Huang et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biology, 2007. doi:10.1186/gb-2007-8-9-r183
Calculate empirical p-values from real values and simulated values
empval(stat, sim)
stat |
A numeric vector of calculated statistic from the actual data |
sim |
A numeric vector (or matrix) of simulated statistics, e.g. by Monte-Carlo methods. |
The estimate of the P-value is obtained as $\hat{p} = (r+1)/(n+1)$,
where $n$ is the number of replicate samples that have been
simulated and $r$ is the number of these replicates that produce
a test statistic greater than or equal to that calculated for the
actual data.
A vector of empirical p-values, of the same length as the input
Jitao David Zhang <[email protected]>
Davison AC, Hinkley DV (1997) Bootstrap methods and their applications. Cambridge University Press, Cambridge, United Kingdom.
North BV, Curtis D, Sham PC (2002) A note on the calculation of empirical p values from Monte Carlo Procedures. Am J Hum Genet 71(2):439–441.
set.seed(1995)
testStat <- c(-100, -3, -1, 0, 1, 3, 100)
testSim <- rnorm(1000)
empval(stat=testStat, sim=testSim)
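The calculation can be sketched in a few lines of base R. The sketch assumes the estimator (r+1)/(n+1) discussed by North et al. (2002), with r counting simulated values greater than or equal to the observed statistic. `emp_pval_sketch` is a hypothetical helper that accepts a vector of simulated values only, and it is not the package's implementation.

```r
# Empirical p-value as (r + 1) / (n + 1).
# emp_pval_sketch is a hypothetical helper, not part of the package.
emp_pval_sketch <- function(stat, sim) {
  n <- length(sim)                       # number of simulated replicates
  vapply(stat, function(s) {
    r <- sum(sim >= s)                   # replicates at least as extreme
    (r + 1) / (n + 1)
  }, numeric(1))
}

pvals <- emp_pval_sketch(c(0, 5, 10), sim = 1:9)
pvals  # 1.0 0.6 0.1
```

Adding one to both the numerator and the denominator means that the estimate is never exactly zero, even when no simulated value reaches the observed statistic.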
Calculate column-wise kappa statistics of a matrix, using a simple procedure that goes through the matrix and counts.
kappaSimp(matrix, minOverlap = 0)
matrix |
A binary matrix containing values of either 0 or 1 |
minOverlap |
Numeric/integer, the minimal overlap between two columns to be considered for further calculation |
A matrix of size $n \times n$ if the input matrix is of size $m \times n$ ($m$ is arbitrary).
colKappa to calculate the same statistic using a linear algebra based routine
Other kappa functions:
colKappa(),
rowKappa()
The function returns row-wise kappa statistics of a matrix, using a linear algebra procedure implemented in C++.
rowKappa(matrix, minOverlap = 0L)
matrix |
An adjacency matrix, containing values of either 0 or 1. |
minOverlap |
Integer, minimal overlap between two rows in order to be considered. Pairs with fewer overlaps will return |
A matrix of size $m \times m$ if the input matrix is of size $m \times n$.
A kappa statistic of 1 indicates perfect agreement. A value of 0 indicates no agreement beyond chance. Note that the value can be negative, which implies that the agreement is worse than random.
colKappa to calculate the statistic of columns
Other kappa functions:
colKappa(),
kappaSimp()
testMat <- cbind(c(1,1,0,0,1,0), c(1,1,0,1,1,0), c(0,1,0,0,1,0), c(1,0,1,0,1,0))
rowKappa(testMat)
stopifnot(identical(rowKappa(testMat), colKappa(t(testMat))))
Compared with bootstrapping, the results do not reveal input values, and the empirical distribution can be smoother. The function assumes that the distribution can be approximated using a Gaussian kernel.
simulate_from_density(vec, N = 1e+05)
vec |
Numeric vector |
N |
Integer, number of simulated instances |
A numeric vector of length N with values simulated from the
kernel density estimate of vec.
Iakov Davydov
my_vec <- c(23, 27, 26, 24, 25)
simulate_from_density(my_vec, 10)
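One common way to draw from a Gaussian kernel density estimate is to resample the observed values and add Gaussian noise whose standard deviation equals the kernel bandwidth. The base R sketch below shows this idea. It is an illustration under stated assumptions rather than the package's code: `sim_from_kde_sketch` is a hypothetical helper, and the bandwidth rule `bw.nrd0` (the default of `density()`) is an assumed choice.

```r
# Draw N values from a Gaussian kernel density estimate of vec.
# sim_from_kde_sketch is a hypothetical helper, not part of the package.
sim_from_kde_sketch <- function(vec, N = 1e5) {
  bw <- stats::bw.nrd0(vec)                        # assumed bandwidth rule
  idx <- sample.int(length(vec), size = N, replace = TRUE)
  vec[idx] + rnorm(N, mean = 0, sd = bw)           # kernel centre plus noise
}

set.seed(42)
my_vec <- c(23, 27, 26, 24, 25)
simVals <- sim_from_kde_sketch(my_vec, 10)
length(simVals)  # 10
```

Because noise is added to every resampled value, the simulated values generally differ from the inputs, which is consistent with the statement that the results do not reveal input values.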
Calculate TF-IDF using an input matrix with terms in rows and documents in columns
tfidf(
  tdMat,
  tfVariant = c("raw", "binary", "frequency", "log", "doubleNorm0.5"),
  idfVariant = c("raw", "smooth", "probabilistic"),
  idfAddOne = TRUE
)
tdMat |
A term-document matrix, terms in rows, documents in columns, and counts as integers (or logical values) in cells |
tfVariant |
Variant of term frequency. See details below. |
idfVariant |
Variant of inverse document frequency. See details below. |
idfAddOne |
Logical, whether one should be added to both numerator and denominator to calculate IDF. See details below. |
tfVariant accepts the following options:
raw: The input matrix is used as it is.
binary: The input matrix is transformed into logical values.
frequency: Term frequency per document is calculated from the input matrix.
log: Transformation log(1+tfMat)
doubleNorm0.5: Double normalisation 0.5
idfVariant accepts the following options:
raw: log(N/nt)
smooth: log(1+N/nt)
probabilistic: log((N-nt)/nt)
Here N represents the total number of documents in the corpus, and nt is the number of documents in which the term t appears. If idfAddOne is set to TRUE, 1 is added to both numbers to prevent division by zero.
A numeric matrix of the same dimensions as tdMat, containing
the TF-IDF values.
The Wikipedia item on TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
tiExample <- matrix(c(1,1,1,1,1,
                      1,1,0,0,0,
                      1,0,0,0,0,
                      0,1,0,0,0,
                      0,0,0,1,0,
                      1,0,1,0,1,
                      0,0,0,0,1),
                    ncol=5, byrow=TRUE)
colnames(tiExample) <- sprintf("D%d", 1:ncol(tiExample))
rownames(tiExample) <- sprintf("t%d", 1:nrow(tiExample))
tiRes <- tfidf(tiExample)
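For orientation, the base R sketch below combines the raw term frequency with the raw inverse document frequency, adding one to both N and nt. It assumes that TF-IDF is the product of the two quantities and that the first option of each variant is the default; `tfidf_sketch` is a hypothetical helper and may not reproduce the output of `tfidf()` exactly.

```r
# Raw TF multiplied by raw IDF, with one added to both N and nt.
# tfidf_sketch is a hypothetical helper, not part of the package.
tfidf_sketch <- function(tdMat) {
  N <- ncol(tdMat)                  # total number of documents
  nt <- rowSums(tdMat > 0)          # number of documents containing each term
  idf <- log((N + 1) / (nt + 1))    # one IDF value per term (row)
  tdMat * idf                       # scale each row by the IDF of its term
}

tiSketchInput <- matrix(c(1,1,1,1,1,
                          1,1,0,0,0,
                          1,0,0,0,0),
                        ncol=5, byrow=TRUE)
tiSketch <- tfidf_sketch(tiSketchInput)
```

A term that appears in every document, such as the first row here, receives an IDF of log(6/6) = 0 and therefore a TF-IDF of zero throughout, while the term in the third row, which appears in only one document, receives log(6/2) = log(3) in that document.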