Trends in Population-Based Studies: Molecular and Digital Epidemiology (Review)

The development of high-throughput technologies has sharply increased the opportunities to research the human body at the molecular, cellular, and organismal levels in the last decade. Rapid progress in biotechnology has caused a paradigm shift in population-based studies. Advances in modern biomedical sciences, including genomic, genome-wide, post-genomic research and bioinformatics, have contributed to the emergence of molecular epidemiology focused on the study of the personalized molecular mechanism of disease development and its extrapolation to the population level. The work of research teams at the intersection of information technology and medicine has become the basis for highlighting digital epidemiology, the important tools of which are machine learning, the ability to work with real world data, and accumulated big data. The developed approaches accelerate the process of collecting and processing biomedical data, testing new scientific hypotheses. However, new methods are still in their infancy, they require testing of application under various conditions, as well as standardization. This review highlights the role of omics and digital technologies in population-based studies.


Introduction
Global medical and demographic problems, such as population ageing, the rising prevalence of chronic noncommunicable diseases, and the pandemic of a new coronavirus infection, pose new large-scale challenges for healthcare, and precision medicine is becoming one of the tools for solving them. Initially in demand mainly for the diagnosis and treatment of oncological diseases, it is now being introduced into all areas of medicine.
Major research projects and campaigns are being initiated worldwide to develop and implement precision medicine strategies. Experts estimate the global precision medicine market to reach $87.7 billion by 2023. The leading scientific institutions are located in the USA, United Kingdom, France, and China. Since 2018, the number of publications in the field of precision medicine has amounted to about 16 thousand worldwide.
Molecular and digital epidemiology are among the main tools of precision medicine.

Molecular epidemiology

Genomic research
Biological research has traditionally been carried out using reductionist approaches, partly due to limitations in both the experimental power of the devices and the complexity of the analytical data evaluation processes. In the last decade, the development of high-throughput technologies has led to a sharp increase in the opportunities for studying the human body at the molecular, cellular, and organismal levels [1]. The rapid progress of biotechnology has led to a paradigm shift in genomic epidemiology, from linkage analysis to genome-wide association studies (GWAS) and the widespread use of next-generation sequencing (NGS). Technological developments have improved research design, enhanced our understanding of disease etiology, and led to numerous scientific discoveries [2].
In genomics, first-generation sequencing methods could sequence the human genome for $300,000; two decades later, next-generation methods can sequence the human genome in a few hours at a cost of $1000. Measurements of characteristics such as the epigenome, transcriptome, proteome, etc. have undergone similar changes, which has allowed researchers to start studying pathologies by their characteristics at the molecular level rather than at the tissue level [3]. Therefore, both the study of individual organisms and the study of populations require computational and statistical approaches to the data of the various "omics", which consider metabolism in cells, tissues, and organs as a whole, as an integrated system rather than as isolated separate processes.
The reduction in the cost of genome sequencing, combined with an increase in computational power, has caused a strong revival of interest in the application of whole genome sequencing in public health [4]. Today, genomic epidemiology makes it possible to study the genomes of pathogens in order to gain better insight into the spread of infectious diseases among populations and to respond quickly to outbreaks [5]. Together with phylodynamics (a combination of epidemiology, evolution, and immunodynamics), genomic epidemiology is a rapidly developing field of science that addresses key issues related to epidemic preparedness and management in real time [6].
Initially, genomic data were used to study a variety of viruses, particularly the influenza A virus and human immunodeficiency virus (HIV). The Ebola virus epidemic in West Africa (2013-2016) was the first major, large-scale challenge for the study of virus genomes; this work revealed the origin of the virus and the causes of the rapid spread of the epidemic, and also made it possible to detect subsequent sources of local outbreaks [7]. Genomic epidemiology has become a valuable source of information for scientists about the nature of threats to public health such as the Zika, Middle East Respiratory Syndrome (MERS), Ebola, and SARS-CoV-2 outbreaks [8]. These threats have required a variety of approaches, including intensive genome sequencing to understand transmission dynamics during the acute phase of epidemics (Ebola virus in the Democratic Republic of the Congo) and broader genomic "surveillance" to detect a hidden increase in prevalence (poliomyelitis) [9]. During the SARS-CoV-2 pandemic, many countries that had not previously used genomic data began to actively conduct such studies and rely on their results. Genomic technologies have made more than 2.5 million SARS-CoV-2 sequences from over 185 countries available [10], and owing to the resulting public interest in genomic epidemiology, new methodologies have been rapidly developed to make full use of this dataset in the fight against the pandemic.
The transmission of all infections occurs at different spatial scales, which depend on the pathogen, the nature of the host's movement, immunity, and other factors [11].
The impact of obtaining genomic data on the formation of public health is shown in Figure 1.
Genomic data can be used to characterize clinical cases of infection by location and time and to track outbreaks at all spatial scales, from nosocomial infections to pandemics [12]. The analysis of pathogen genomes in the context of other sequences obtained from the same outbreak, as well as their comparison with previously characterized variants, allows researchers to develop intervention strategies at the individual and population levels to minimize the burden of infectious diseases on the individual and society [13]. This comprehensive approach involving pathogen sequencing, analysis, and response is called molecular epidemiology. In contrast to the development of individual-level treatment strategies that focus on the functional roles of host and/or pathogen mutations, outbreak-scale genomic analysis uses pathogen mutations as markers of transmission events [14]. Genomic epidemiology studies the dynamics of outbreaks and the rapid evolution of pathogens, which often accumulate mutations on the same scale as the spread of these pathogens.
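The idea of using pathogen mutations as markers of transmission events can be illustrated with a minimal sketch. It assumes that the genomes have already been aligned to the same length and that a small number of differing positions suggests a possible epidemiological link; the function names, the threshold of 2 differences, and the toy sequences are hypothetical and are not taken from the studies cited above. Real analyses rely on phylogenetic and phylodynamic models together with location and date metadata.

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences differ.

    Positions with an ambiguous base (anything other than A, C, G, T)
    are ignored, which is a simplifying assumption.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a in "ACGT" and b in "ACGT"
    )


def candidate_links(genomes: dict[str, str], max_snps: int = 2) -> list[tuple[str, str, int]]:
    """Return pairs of cases whose genomes differ by at most max_snps positions."""
    links = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(sorted(genomes.items()), 2):
        distance = snp_distance(seq_a, seq_b)
        if distance <= max_snps:
            links.append((id_a, id_b, distance))
    return links


# Toy aligned fragments (hypothetical data, not real pathogen sequences).
toy_genomes = {
    "case_1": "ACGTACGTAC",
    "case_2": "ACGTACGTAT",  # one difference from case_1
    "case_3": "TCGAACGGAT",  # several differences from both other cases
}

print(candidate_links(toy_genomes))
```

In this toy example only case_1 and case_2 are flagged as a candidate pair. A small distance is compatible with direct transmission but does not prove it, so such pairs are a starting point for investigation rather than a conclusion.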
NGS makes it possible to detect various types of genomic and epigenetic variations with high accuracy. Such sequencing allows researchers to study all these variations directly in an individual, increasing the chance of detecting mutations [15]. Although the use of NGS is still limited due to its high cost, the success of several recent projects demonstrates the great potential of this method in genomic epidemiology, especially in view of the decline in sequencing cost.
With a sufficient sample size, appropriate metadata (such as location and date), and an appropriate statistical framework, pathogen genomes may assist in the identification of patterns in the spread of an epidemic with a small number of patients studied, allowing the development of more precisely targeted interventions than traditional methods and the use of demographic data permit [16]. In the near future, we will also be able to estimate the prevalence of chronic noncommunicable diseases using patients' pedigree data.
In 2011, the National Human Genome Research Institute (USA) published a review on genetic medicine, noting that the most effective way to improve human health is to understand normal biology (in this case, the biology of the human genome) as a basis for studying the biology of diseases, which then becomes the basis for health promotion. To date, it is still difficult to fully determine the future prospects of genetic epidemiology for improving public health [17].
When evaluating the contribution of genetic epidemiology to public health, it is equally important to understand that the etiology of diseases is complex and that a genetic risk of developing pathology does not equate to genetic determinism [18]. The complex relationship between genetics and disease poses an ethical dilemma for practitioners regarding the correct interpretation of genetic test results. Genetic tests may incidentally reveal disorders that will never manifest as clinical disease [19]. An ethical question arises: should patients be informed of these incidental findings, which may have medical value?

Biomarkers
In the epidemiological study of diseases, metabolite concentrations are increasingly used as biomarkers that serve as indirect indicators of the rate of metabolic reactions. However, assessing the rate of individual reactions can provide more accurate information about the changes occurring directly in the organ [20].
Direct measurement of the rate of metabolic reactions in situ is currently impractical in large population studies, since such measurements are costly, technically complex, and require high-throughput equipment. This method is more successful when applied on a smaller scale, primarily through the use of non-invasive nuclear magnetic resonance spectroscopy (NMR spectroscopy) [21]. Metabolic pathway imaging techniques using hyperpolarized metabolites have shown promising results in the diagnosis and localization of tumors in patients with prostate cancer [22].
In a prospective clinical study involving 58 patients with chronic heart failure, the rate of adenosine triphosphate (ATP) synthesis was measured by studying the activity of cardiac creatine kinase in situ using the 31P NMR spectroscopy method [23].
ATP and creatine phosphate concentrations, as well as general clinical parameters, were used as predictors of chronic heart failure over an 8-year follow-up period. Excessive creatine kinase activity exceeded in significance such parameters as patient age, gender, and the concentrations of other metabolites in predicting heart failure events and death, including hospitalization for heart failure and ventricular assist device insertion [24].
These results relate to a relatively small group of patients, but they add weight to the case for the development of biomarkers based on the rate of metabolic pathways and reactions in the study of disease.
Metabolism works as a continuously operating system of movement and transformation of molecules through reactions. Since the flow of metabolites is regularly redirected, metabolites are accumulated at various points or become depleted which results in a change in their concentration. The concentrations of metabolites reflect the effects of combined changes in the reaction rate, but do not give a direct idea of the dysfunctions of the processes themselves, for example, in pathology affecting enzymes, genes, and other molecular products derived from the human genome [24]. In this regard, a systematic assessment of the reaction rate on the scale required for epidemiology will be done by integrating metabolomic data with genomic, transcriptomic and/or proteomic information to determine enzymatic function.
Due to their ability to characterize diverse variants of endogenous and exogenous metabolites in biological specimens, metabolomic approaches have quickly been recognized as an important tool in public health studies [25]. The results show that the use of small volumes of blood, urine, feces, saliva, exhaled air condensate, cerebrospinal fluid, and biopsy material for measuring the metabolome can provide information on possible mechanisms underlying the disease [26-30]. However, most of the existing evidence has come from case-control or crossover studies, which do not allow for a clear temporal relationship between exposure, biomarkers, and disease.
Recently, the metabolic characterization of amniotic fluid, cord blood, and maternal/child urine or serum samples has been used to assess complex effects on the fetus and mother that may potentially be associated with developmental problems. Dried newborn blood spots, used to identify metabolic biomarkers of future risk for cancer and other diseases, have been proposed as a promising sample for metabolomic profiling [31][32][33][34].
The application of metabolomics for the study of disease risks, screening, and treatment efficacy has yielded promising initial results, although the field is still under development. These studies include ones on neurodegenerative diseases [35], type 2 diabetes [36], cancer [37], HIV, tuberculosis [38], malaria [39], and cardiovascular diseases [40]. The next important step in the application of metabolomics to study the etiology of diseases and early detection of pathologies will be longitudinal studies, which have already shown their effectiveness in creating biological models of the environmental impact on humans [41,42].

Digital epidemiology
To conduct large multicenter epidemiological studies, digital technologies are actively used to facilitate work planning, remote data collection and entry control, as well as the subsequent presentation and reuse of results [43][44][45][46].
Although the epidemiology of chronic noncommunicable diseases in Russia still lags behind that of infectious diseases [47], there is a need to create and implement digital services for the epidemiology of chronic noncommunicable diseases [48]. This need arises from the increasing availability of omics technologies, the accumulation of many years of research results, the need to compare the findings of similar studies, and the increased requirements for the practical application and implementation of the results [47,49].

Digital systems for clinical research
The basis for conducting research in the field of precision medicine is the formation of databases of clinical information annotated with the data on the collected biomaterials for each clinical case [50,51]. This significantly expands resource opportunities for research at the intersection of clinical areas when new members of the research team are involved or in the case of a long-term work [52].
Coppola et al. [50] emphasized the importance of combining primary data with paraclinical information, including data from imaging studies, in a digital system. According to the authors, a service for visual data processing should have the options not only to display, but also to analyze data, which requires pre-processing and data markup. The selection of areas with suspected lung infiltration according to computed tomography (CT) data or with pathological signal foci in the magnetic resonance imaging (MRI) pictures can be an example. Integration of genomic analysis into the data system contributes to the development of genomics and radiomics (radiomics is aimed at creating mathematical models and computer algorithms that, through the analysis of medical images, such as MRI or CT images, provide a finding about the pathophysiological features of tissues) [50,53]. According to the research teams accumulating biomedical data, the imaging biobank data are to be used in accordance with the already-known standards until specific standards have been developed [50,54]. Harmonization of processing will make it possible to combine data from multi-omics studies and visual materials for the integration of phenotypic and genotypic data [50,55].
Over the past 10 years, many medical institutions have built integrated databases (integrated data repositories, IDRs) [56], which are populated from electronic medical records [57]. The accumulated data are used not only to test scientific hypotheses but also to build clinical decision support systems [56]. Gagalova et al. [56] identified four models for the architecture of medical data collection and storage, which differ in data sources, the purpose of use, the availability of storage, etc. The purpose of this work was to initiate the development of guidelines on IDR creation in hospitals.

Online databases
Interactive monitoring systems have gained wide popularity [58]. Over the past 20 years, many services for monitoring infectious diseases have emerged [59,60]. To monitor the situation with antibiotic resistance, many services have been created that are limited geographically as well as by the microorganisms described and the metrics assessed: EARS-Net (https://atlas.ecdc.europa.eu/public/index.aspx); CDDEP Resistance Map (https://resistancemap.cddep.org/index.php); SGSS (https://sgss.phe.org.uk/Security/Register); ATLAS (https://atlas-surveillance.com/#/login); SMART (https://globalsmartsite.com/#/auth/login). The free-access web application AMRmap (https://amrmap.ru/) [61] is a Russian development which displays data on antibiotic resistance obtained in multicenter clinical trials. The system has a section on genetic markers. Information in the database has been stored since 1997, and access is provided free of charge.
Since 2018, the University of Bristol has been developing the EpiGraphDB project [62], a data-based analytical platform designed for the intellectual analysis of epidemiological indicators. The project is developing approaches to the interpretation of causal relationships in the systematic automated analysis of many phenotypes using data from an array of bioinformatic resources. The university is also developing software for the statistical processing of omics studies, MR-Base being an example [63].
A large collection of biological pathways, that is, sequences of biological reactions in the body, is presented in the WikiPathways system [64]. Currently, this system is being actively populated with omics research data. The STRING database contains known and predicted protein-protein interactions [65].
Toom et al. [66] compared the results of an epidemiological study of headache in Estonia obtained with an online questionnaire with the results obtained during face-to-face visits of patients. The use of online questionnaires can significantly speed up the data collection process, increase population coverage, and reduce manual data entry errors. However, the authors noted that the majority of people in the online survey did not have a headache, so the sample of people who completed the online questionnaires differed greatly from the sample of patients who came for face-to-face visits. This lowered the estimated incidence of headache in the population. In addition, more women, young people, married people, urban residents, and people with a high level of education participated in the online survey. These characteristics of the sample are typical and should be considered a limitation of studies using online questionnaires [67][68][69].
The integrated (online access, telephone, and paper mail) National Australian StepUp System for Dementia Research [70] is an interesting solution. In this system, patients with dementia and researchers studying diseases accompanied by cognitive deficits are registered in one of three convenient ways [70]. This accelerates the process of collecting data for research hypotheses and developing new approaches to combat dementia [71,72]. The authors note that the system continued to operate without interruption after the start of the pandemic of a new coronavirus infection [70]. Over the two years of the platform's operation, more than 1000 patients and 120 researchers have been registered, and more than 40 studies have been initiated [70].
For clinical trials, there are a number of free services that allow the creation of electronic case report forms, such as REDCap [73] or Ark [74]. The use of specialized services may be limited, since access is provided to the organization after the conclusion of an agreement with the copyright holders and not directly to the researcher. However, such services ensure secure storage of personal data without third-party access, unlike many open resources, including Google Forms [75]. In the future, research services will be used to create large databases on certain nosologies, diagnostic methods, or treatments. The services are constantly evolving, and additional specialized analysis modules are being created, for example, for building a pedigree [74].
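The effect of an unrepresentative online sample on a prevalence estimate can be sketched with post-stratification weighting, a standard survey technique that is not described by the cited authors and is used here only as an illustration. The strata, counts, and population shares below are hypothetical. The sketch corrects only for the demographic composition of the sample; it cannot correct for self-selection that depends on the outcome itself, such as people without headache being more willing to respond.

```python
def crude_prevalence(sample: dict[str, dict[str, int]]) -> float:
    """Prevalence computed directly from the pooled sample."""
    cases = sum(stratum["cases"] for stratum in sample.values())
    respondents = sum(stratum["n"] for stratum in sample.values())
    return cases / respondents


def poststratified_prevalence(
    sample: dict[str, dict[str, int]],
    population_share: dict[str, float],
) -> float:
    """Prevalence reweighted so each stratum counts by its population share."""
    if abs(sum(population_share.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(
        share * sample[name]["cases"] / sample[name]["n"]
        for name, share in population_share.items()
    )


# Hypothetical online sample in which women are over-represented.
online_sample = {
    "women": {"n": 800, "cases": 240},  # 30% report headache
    "men": {"n": 200, "cases": 30},     # 15% report headache
}
# Hypothetical shares of each stratum in the target population.
census_share = {"women": 0.5, "men": 0.5}

print(round(crude_prevalence(online_sample), 3))
print(round(poststratified_prevalence(online_sample, census_share), 3))
```

With these toy numbers the crude estimate is pulled toward the over-represented stratum, while the weighted estimate reflects the assumed population structure.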
The pandemic of a new coronavirus infection caused an accelerated and forced introduction of digital technologies in all spheres of life, including all stages of research [76,77]. Since the beginning of the pandemic in 2019, many national and international online monitoring systems have been developed [78]. The challenges for these fast-growing services are their weak integration with each other, the lack of centralized management, and difficulties in the interpretation and practical application of data [79]. Another limiting factor is the reluctance of patients to use digital questionnaires or remote methods of communication, owing to uncertainty about confidentiality or unwillingness to become dependent on gadgets [80], which is especially common among older patients.

Open data
The annual increase in the accumulated data requires the introduction of new guidelines for the management of captured data. One of the most common standards for such work with data is FAIR (findability, accessibility, interoperability, and reusability) [81], which has become a fundamental requirement for open science [82,83]. In their paper, Suhre et al. [84] emphasize the importance of data exchange for omics research, giving an example of a combination of GWAS and proteomic analysis. The authors consider the prospects for the creation of a database that will accumulate information about the genetic colocalization of genomic information and characteristics of the molecular phenotype of a disease (for example, gene expression and metabolomic characteristics) with clinical trial endpoints.

Real world data
Real world data in biomedical research refers to data captured from electronic medical records, medical registries, medical insurance companies, non-interventional clinical trials, and other sources in which information was not obtained under experimental conditions [85]. The HealthMap online system (https://www.healthmap.org/ru/) has been operating since 2006, accumulating data on disease outbreaks from open web resources [86]. In 2008, the web-based influenza surveillance system Influenzanet was launched [87,88]. Limitations in the use of these data are their redundancy (repetitions), heterogeneity (different input formats), and inconsistency (violation of the chronology of events). Chatzidimitriou et al. [89] created a database (n=20,463) of clinical cases of chronic lymphocytic leukemia (The ERIC CLL Database) filled with data from more than 90 centers and 31 countries. The authors consider standardization, the integration of retrospective data, and the assessment of input data quality to be necessary for the successful functioning of the distributed database [89].
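The three limitations named above (redundancy, heterogeneity, and inconsistency) can be made concrete with a minimal cleaning sketch. The record layout, field names, and list of accepted date formats are assumptions made for illustration and do not describe HealthMap, Influenzanet, or the ERIC CLL Database. In particular, treating slash-separated dates as month/day/year is an arbitrary choice that a real pipeline would have to settle per data source.

```python
from datetime import date, datetime

# Assumed input formats; the slash form is read as month/day/year.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def parse_date(text: str) -> date:
    """Heterogeneity: map several textual date formats onto one date type."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def clean_records(raw_records: list[dict[str, str]]):
    """Return (cleaned, flagged) rows of (patient_id, admitted, discharged)."""
    seen = set()
    cleaned, flagged = [], []
    for record in raw_records:
        row = (
            record["patient_id"],
            parse_date(record["admitted"]),
            parse_date(record["discharged"]),
        )
        if row in seen:  # redundancy: the same event entered twice
            continue
        seen.add(row)
        if row[2] < row[1]:  # inconsistency: discharge precedes admission
            flagged.append(row)
        else:
            cleaned.append(row)
    return cleaned, flagged


# Hypothetical records from two sources with different date conventions.
raw = [
    {"patient_id": "p1", "admitted": "2021-03-01", "discharged": "2021-03-05"},
    {"patient_id": "p1", "admitted": "01.03.2021", "discharged": "05.03.2021"},
    {"patient_id": "p2", "admitted": "03/10/2021", "discharged": "03/08/2021"},
]

cleaned, flagged = clean_records(raw)
print(cleaned)
print(flagged)
```

Here the second record is recognized as a repeat of the first only after both are normalized, and the third is set aside for manual review rather than silently corrected.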

Digital epidemiology as a separate field of knowledge
According to Salathé [90], digital epidemiology has become a separate area of scientific knowledge. Its purpose is to understand the patterns of disease development and the dynamics of the health status of the population, as well as to determine the causes of these patterns in order to find ways to prevent the development of diseases and promote health. The broadest definition of digital epidemiology is epidemiology that uses digital data. However, the author then specifies that digital epidemiology works with data that were not collected primarily for the purpose of conducting epidemiological studies. Such data can include electronic medical records, information from insurance funds and from city, regional, and federal health departments, as well as data from search engines, social networks, and mobile phones [90].
Google Flu Trends (GFT) became one of the first known digital epidemiology services; it used search queries on acute respiratory symptoms for epidemiological analysis [91,92]. A serious problem was that the collected data were owned by a private company and the analysis algorithms used were unavailable even to national healthcare systems [90]; in addition, independent testing of the capabilities of this service for epidemiological studies showed a low efficiency in assessing the incidence of infectious diseases [93]. Unofficial Internet sources can be a valuable resource for epidemiological research, but the current trend towards protecting personal data and maintaining privacy is an important limiting factor. Salathé identifies two ways to solve this problem [90]: (1) the creation of monitoring systems by groups of scientists or professional communities, which will be more understandable and transparent for national healthcare and will thereby increase the potential for their practical application; and (2) greater involvement of the population in epidemiological studies. The rights to the data generated by individuals belong to the developers of the resource. A representative part of the population should be persuaded to share their personal health data with public health authorities for scientific research, the results of which can benefit society.
Roth et al. [94] have shown the formation of digital epidemiology (Figure 2).
According to the authors, machine learning methods based on the data from healthcare systems or social networks (Twitter), which help determine the prognosis for survival and complications, had already been developed by 2018.
It is important to note that the transformation of epidemiology leads to a change in its teaching principles [95]. Werler et al. [96] note that new curricula in epidemiology require the formation of causal thinking and the subsequent formation of a scientific hypothesis. Common mistakes made by young epidemiologists include estimation of one risk factor for one outcome, inaccurate formulation of research questions, and giving greater importance in research to epidemiological and statistical approaches compared to public importance.

Ethical issues
The development of high-precision medicine technologies entails the need to form new ethical standards [97]. The classical basic ethical principles are respect for the patient's autonomy and privacy [98]. In this case, ethical requirements must ensure that individuals cannot be identified in open data portals for the exchange of scientific data. The ethics of precision public healthcare regulates the interaction between patients who have given voluntary informed consent to their attending doctor for the use of their clinical specimens in precision medicine research and the public decision-making process that drives public health activities. The development of a new hybrid ethical paradigm is possible only with the well-coordinated work of these process participants. Conducting omics studies allows obtaining detailed information about any subject. However, in order to plan disease control measures in a particular area or in a particular population, the following data indicating the demographic characteristics of an individual are important: geographical location, migration history, stay in prison, lifestyle and profession, etc. All these data are personal; they must not be widely disseminated, since they increase the risk of revealing the identity of the subject. In this regard, particular attention is paid to the way in which the obtained information is presented. The ethics of precision medicine includes a public health ethic commitment to social justice and an emphasis on professional transparency and the trust built through it. The collected data should be transparent and aimed at improving the existing system and people's lives, and not at stigmatizing social groups with high risk factors or relatively high incidence [97].
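One way to check whether demographic attributes could single out an individual before data are shared is a k-anonymity test: every combination of quasi-identifying values must be shared by at least k records. This criterion is not named in the sources cited above and is introduced here only as an illustrative sketch; the field names, the 10-year age bands, and the value of k are hypothetical, and passing such a check does not by itself guarantee that re-identification is impossible.

```python
from collections import Counter


def smallest_group(records: list[dict], quasi_identifiers: tuple[str, ...]) -> int:
    """Size of the rarest combination of quasi-identifier values (0 if empty)."""
    if not records:
        return 0
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(counts.values())


def is_k_anonymous(records: list[dict], quasi_identifiers: tuple[str, ...], k: int = 5) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group(records, quasi_identifiers) >= k


def generalize_age(record: dict, width: int = 10) -> dict:
    """Replace an exact age with a band, e.g. 34 becomes '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical study records; region and age act as quasi-identifiers.
subjects = [
    {"age": 34, "region": "North", "profession": "teacher"},
    {"age": 37, "region": "North", "profession": "driver"},
    {"age": 52, "region": "South", "profession": "nurse"},
    {"age": 58, "region": "South", "profession": "farmer"},
]
fields = ("age", "region")

print(is_k_anonymous(subjects, fields, k=2))
coarse = [generalize_age(subject) for subject in subjects]
print(is_k_anonymous(coarse, fields, k=2))
```

With exact ages every subject is unique, whereas after grouping ages into bands each combination of age band and region is shared by two subjects. The trade-off is a loss of detail, which has to be weighed against the needs of the planned analysis.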
The development of electronic systems for capturing and storing data requires careful study of the risks to maintaining the security of the collected data [99]. New requirements for data management and professional confidentiality are emerging [98]. The speed, accuracy, and efficiency of big data processing offer great opportunities for public health, but entail a responsibility to adapt in a society that is committed to privacy, respect for human rights in matters of health, and social justice.
Sharma et al. [100] advocate the development of legislation to maintain the confidentiality of personal data collected during scientific and clinical research, as well as auditing and the implementation of independent oversight to assess the management of the risks related to the reuse of data on research subjects. Solving this problem requires new approaches to working with patient data that take into account the increased activity of scientific communication, the creation of open repositories, and the exchange of primary research data, which is an integral part of large epidemiological studies. However, people are motivated to participate in a study by their own interests, such as the reputation of the organization with which they interact. Reuse of data by other organizations carries certain risks, about which patients should be informed before giving voluntary informed consent to participate in a study.
FAIR-Health is a new paradigm of open science that has been developed in view of the peculiarities of biomedical research [101]. This paradigm is aimed at considering the information and biomaterials collected in research to be a single resource. It is this principle that, according to Holub et al. [101], will help ensure the reproducibility of studies and the subsequent integration of results.

Conclusion
Modern methods of population-based studies, including both omics technology data and the results of monitoring the conditions and behavior of patients over a long period of time, provide detailed data on subjects. At the moment, a search for methods of standardizing the collected data, their analysis and synthesis for further use is in progress. One of the major challenges to science is the integration of research results not only for rational storage, but also for the creation of dynamic digital models of subjects and processes.

10. Liu T., Chen Z., Chen W., Chen X., Hosseini M., Yang Z., Li J., Ho D., Turay D., Gheorghe C.P., Jones W., Wang C. A benchmarking study of SARS-CoV-2 whole-genome sequencing protocols using COVID-19 patient samples.