Data Management
RELEVANT FOR RESEARCH & DS
MATHEMATICS, ECONOMICS & HEALTH
Topics:
Data Management
Research Standards
Naming Conventions
SOLIC Principles
Additional Information and References
The aim of this topic is to emphasize the importance of scientific rigor and research integrity. Both make research outcomes more reliable and reproducible, and both lead to high-quality scientific data that can be reused effectively and that make research more efficient.
As researchers, data analysts, or data scientists, we must effectively manage, store, and share data. While various institutions and funding organizations may have their own policies and principles, there are common public policies that help researchers manage data in an appropriate manner. The NIH Policy for Data Management & Sharing is a good example, and we use it as a starting point.
The FAIR data principles, which refer to data that is Findable, Accessible, Interoperable, and Reusable, are crucial to enabling validation of research results and providing accessibility to high-value datasets.
It is also important to note that various journals have their own data sharing guidelines, which researchers must adhere to in order to ensure their research findings are published effectively.
RESEARCH STANDARDS
Research standards refer to the guidelines and principles that researchers must adhere to during the research process to ensure that their work is conducted with scientific rigor and research integrity. These standards are designed to promote transparency, reproducibility, and the responsible conduct of research.
Some common research standards include:
Informed consent: Researchers must obtain informed consent from study participants, ensuring that they are fully informed about the nature and purpose of the study, as well as any potential risks or benefits.
Data management: Researchers must implement effective data management practices to ensure that data is stored securely, managed ethically, and available for future analysis and replication.
Research ethics: Researchers must adhere to ethical principles, such as the protection of human subjects and the responsible use of animals in research.
Publication standards: Researchers must comply with publication standards, such as ensuring that data is reported accurately and transparently, and that authorship is appropriate and properly attributed.
Reproducibility: Researchers must design their studies in a way that allows others to reproduce their findings, using transparent and well-documented methods.
Statistical analysis: Researchers must use appropriate statistical methods to analyze their data, ensuring that results are valid and reliable.
By adhering to these standards, researchers can ensure that their work is conducted in an ethical and responsible manner, and that their findings are trustworthy and of high scientific quality.
Some research standards commonly used in various fields:
- Good Clinical Practice (GCP)
- International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH)
- Consolidated Standards of Reporting Trials (CONSORT)
- Standards for Reporting Diagnostic Accuracy (STARD)
- Transparent Reporting of Evaluations with Nonrandomized Designs (TREND)
- Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)
- Minimum Information for Biological and Biomedical Investigations (MIBBI)
- Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE)
- Minimum Information about a Microarray Experiment (MIAME)
- Minimum Information about a Genome Sequence (MIGS)
- Minimum Information about a Marker Gene Sequence (MIMARKS)
- Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK)
- Strengthening the Reporting of Observational Studies in Epidemiology (STROBE)
- The Guidelines for Reporting Reliability and Agreement Studies (GRRAS)
- The Society for Immunotherapy of Cancer Immunoscore guidelines
- The Equator Network (Enhancing the Quality and Transparency Of health Research)
Data Management Plan
A data management plan (DMP) is a document that outlines how research data will be collected, stored, organized, backed up, preserved, shared, and disposed of, among other aspects of data management. Here are the main steps to create a DMP:
Define the scope and objectives of the research project: Before starting a DMP, you need to have a clear idea of what your research project is about, what data it will generate, and what are the objectives and potential impact of the project.
Identify the types of data you will be collecting: Depending on your research, you may be collecting different types of data, such as quantitative or qualitative data, survey responses, images, audio or video recordings, and so on.
Define data documentation and metadata standards: Data documentation is crucial to ensure that your data is findable, understandable, and reusable. You need to define the metadata standards and data dictionary for each type of data you will be collecting.
Determine the data storage and backup requirements: You need to identify the appropriate data storage and backup solutions for your research data. This includes considering the size, format, security, and access control of the data.
Define the data sharing and dissemination policies: You need to decide who will have access to your research data, under what conditions, and for how long. You also need to define the data sharing and dissemination policies, including the licensing terms and citation requirements.
Identify the long-term preservation and curation requirements: You need to plan for the long-term preservation and curation of your research data, including the selection of appropriate digital preservation strategies and the allocation of sufficient resources for this purpose.
Consider ethical, legal, and regulatory issues: You need to be aware of any ethical, legal, and regulatory issues that may affect your research data, such as data privacy, intellectual property rights, and data ownership.
Write the DMP document: Based on the above considerations, you need to write a DMP document that includes all the relevant information about your data management plan. There are several DMP templates and tools available online that can help you structure your DMP in a standardized format.
Overall, a good DMP can help you ensure that your research data is managed effectively, efficiently, and ethically, and can improve the reproducibility, transparency, and impact of your research project.
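The data dictionary mentioned in the documentation step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a standard format: the variable names (participant_id, age, weekly_exercise_hours), the allowed ranges, and the helper check_record are all hypothetical.

```python
# Hypothetical data dictionary: one entry per variable in the dataset.
DATA_DICTIONARY = {
    "participant_id": {"type": str, "description": "Anonymized participant code"},
    "age": {"type": int, "description": "Age in years", "min": 18, "max": 99},
    "weekly_exercise_hours": {
        "type": float,
        "description": "Self-reported hours of exercise per week",
        "min": 0.0,
        "max": 168.0,
    },
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty list = valid)."""
    problems = []
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"missing variable: {name}")
            continue
        value = record[name]
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            problems.append(f"{name}: below minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{name}: above maximum {spec['max']}")
    # Variables that the dictionary does not document are flagged as well.
    for name in record:
        if name not in dictionary:
            problems.append(f"undocumented variable: {name}")
    return problems
```

Keeping the dictionary next to the data means the same file documents the variables for readers and drives a basic validity check.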
Specific Sample for a Research Project:
--------------------------------------------------------------------------------------
Introduction
This data management plan outlines the policies and procedures for managing and sharing data generated by the XYZ research project. The goal of this plan is to ensure that data is managed in a way that promotes scientific integrity, maximizes the research's impact, and complies with any applicable regulations or policies.
Data Collection
Data will be collected through interviews, surveys, and observations. Data will be recorded using paper forms and digital tools, such as a tablet or smartphone. Data will be stored securely to prevent loss or damage, and access to data will be restricted to authorized personnel.
Data Storage and Backup
All data collected will be stored on a secure server maintained by the research institution. The server will be backed up regularly to prevent data loss in case of a technical failure. Additionally, all paper records will be stored in a secure location with restricted access.
Data Analysis
Data will be analyzed using appropriate statistical software, such as SPSS or R. All data analysis will be documented to ensure transparency and reproducibility. Any identifying information will be removed from the data before analysis.
Data Sharing and Archiving
The data generated by this research project will be made available to other researchers upon request, as long as the data does not contain identifying information. Data will be archived in a secure location for at least five years after the project has ended. Any publications resulting from the data will acknowledge the source of the data.
Ethics and Legal Considerations
The research project will comply with all applicable ethical and legal requirements for data collection, storage, and sharing. This includes obtaining informed consent from research participants and complying with any data privacy regulations.
Conclusion
This data management plan outlines the policies and procedures for managing and sharing data generated by the XYZ research project. The plan ensures that data is managed in a way that promotes scientific integrity, maximizes the research's impact, and complies with any applicable regulations or policies.
--------------------------------------------------------------------------------------
Plan - Store & Organize - Share
Plan
Storing and documenting data also allows more people to use the data in the future, potentially leading to more discoveries beyond the initial research.
Plan and budget the data management and sharing process:
- A brief summary and associated costs
- Data description
- Review of Existing Datasets
- Formats
- metadata
- Storage and Backup
- Security
- Responsibility (data lifecycle)
- Access and Sharing
- Domain Repositories
- Self-dissemination
- Preservation
- Institutional Repositories
- Budget
- Other considerations
- Audience
- Selection and Retention Period
- Archiving and Preservation
- Ethics and Privacy
Include:
- Data type
- Tools, Software and/or code
- Data Standards
- Data Preservation, Access and Associated Timelines
- Access, Distribution, or Reuse Considerations
- Oversight of Data Management and Sharing
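The list of elements above can double as a checklist while drafting a plan. The sketch below assumes the draft is kept as a simple mapping from element name to free text; that layout and the helper missing_elements are hypothetical, not part of any official template.

```python
# The six elements listed above, used as a checklist.
REQUIRED_ELEMENTS = [
    "Data type",
    "Tools, software and/or code",
    "Data standards",
    "Data preservation, access and associated timelines",
    "Access, distribution, or reuse considerations",
    "Oversight of data management and sharing",
]


def missing_elements(plan):
    """Return the required elements that are absent or left blank in `plan`.

    `plan` is a hypothetical dict mapping element name -> free-text answer.
    """
    return [
        element
        for element in REQUIRED_ELEMENTS
        if not plan.get(element, "").strip()
    ]
```

For example, a draft that only fills in "Data type" would get the other five elements back as still to be written.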
This is a sample of the minimum necessary elements.
This is a common reference standard. However, based on my experience in planning, projects, and budgeting, I prefer
It is important to get an overview of the most common standards for a specific project or data protocol. We do not have to memorize them, but we should know where to search for or consult them.
Data Store and Organization
Length of Time to Maintain and Make Data Available
Documentation and Metadata
Here are examples of metadata or other information that may be included with research data:
- Methodology and procedures used to collect the data
- Any other information necessary to reproduce and understand the data
Metadata standards
Examples:
• FGDC (Federal Geographic Data Committee)
• DDI (Data Documentation Initiative)
• Dublin Core
• Darwin Core
• ABCD (Access to Biological Collections Data)
• AVMS (Astronomy Visualization Metadata Standard)
• CSDGM (Content Standard for Digital Geospatial Metadata)
Recording Metadata
In a filename
In a readme file
In a spreadsheet
In an XML file
Into a database
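As an illustration of the XML option, the sketch below writes a few Dublin Core elements with Python's standard library. The namespace URI is the one used by the Dublin Core Metadata Element Set 1.1; the <metadata> wrapper element, the helper dublin_core_xml, and the field values are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_xml(fields):
    """Build an XML string with one Dublin Core element per (name, value) pair.

    The <metadata> wrapper element is an arbitrary choice for this sketch.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_xml(
    {
        "title": "Exercise and mental health survey",  # illustrative values
        "creator": "XYZ research project",
        "date": "2024-01-15",
        "format": "text/csv",
    }
)
```

The resulting string can be saved next to the data file, for example as a .xml file that shares the data file's base name.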
Systematic Folders Hierarchy
Files
Naming Conventions
- Be descriptive
- Be consistent
When you are looking for a file, how do you think about it?
Avoid overlapping categories
- Don't use space or special characters
- Use leading zeros for sequential numbering
- Use a period only before the file extension
- Limit to less than 32 characters
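The naming rules above are mechanical enough to check automatically. The following sketch is one possible reading of them; the allowed character set (letters, digits, underscore, hyphen) and the helper names check_filename and sequential_name are assumptions for illustration.

```python
import re

MAX_LENGTH = 32  # the list above asks for fewer than 32 characters


def check_filename(name):
    """Return a list of convention violations for `name` (empty = OK)."""
    problems = []
    if len(name) >= MAX_LENGTH:
        problems.append("32 characters or more")
    if name.count(".") != 1:
        problems.append("period should appear only before the extension")
    # Anything other than letters, digits, underscore, hyphen or period
    # is treated as a space or special character in this sketch.
    if re.search(r"[^A-Za-z0-9_.-]", name):
        problems.append("contains a space or special character")
    return problems


def sequential_name(prefix, number, extension, width=3):
    """Build a name with a zero-padded sequence number (width=3 gives 007)."""
    return f"{prefix}_{number:0{width}d}.{extension}"
```

Leading zeros matter because plain text sorting puts "10" before "2", while "002" and "010" sort in numeric order.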
Naming conventions make life easier!
Naming conventions should be:
• Descriptive
• Consistent
README: File & Folder Schema (Example)
The mantra:
Make a System - Share the System - Follow the System
Be consistent
Document it
Naming Conventions
Data Storage Format
Open
unencrypted and uncompressed
Lossless
Known problems, inconsistencies, limitations
Readme
Best Practices:
- Create one for each data file/dataset
- Name it so that it is easy to associate with the data file(s)
- Write it as a plain text file
- Structure all readme files identically
- Use standardized date formats
- Follow the conventions for your discipline
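A small generator helps keep readme files identically structured. The sketch below is a minimal example: the four section labels and the helper names readme_name and build_readme are assumptions, and a real project would follow the conventions of its own discipline.

```python
from datetime import date
from pathlib import Path


def readme_name(data_file):
    """Derive the readme name from the data file so the two sort together."""
    return f"{Path(data_file).stem}_README.txt"


def build_readme(data_file, description, created, known_problems="None recorded"):
    """Return plain-text readme content with the same sections every time.

    `created` is a datetime.date; isoformat() gives the ISO 8601 form YYYY-MM-DD.
    """
    lines = [
        f"File: {data_file}",
        f"Date created: {created.isoformat()}",
        f"Description: {description}",
        f"Known problems, inconsistencies, limitations: {known_problems}",
    ]
    return "\n".join(lines) + "\n"


text = build_readme("survey_wave01.csv", "Survey responses, first wave", date(2024, 1, 15))
```

Writing the returned text to the name given by readme_name produces one plain-text readme per data file, each with the same layout and the same date format.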
Back up
At least three copies:
- local/working
- remote
- other in a remote location or local/external
HNF - Here Near Far
Weed out obsolete data as you go
Dropbox
OneDrive
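The three-copy rule can be paired with a checksum so that each copy is verified rather than assumed to be intact. This is a sketch using only the Python standard library; the helper names sha256_of and back_up are hypothetical, and the destination folders stand in for whatever local, remote, or external locations are actually used.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, destinations):
    """Copy `source` into each destination folder and verify every copy.

    Returns the paths of the verified copies; raises OSError on a mismatch.
    """
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(target) != expected:
            raise OSError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Storing the digest alongside each copy also makes it possible to detect silent corruption later by recomputing it.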
Data Security
Additional considerations:
• My research is top secret!
Then you can use encryption
• Don’t rely on 3rd party encryption alone
• Use something like PGP (Pretty Good Privacy)
• Write the keys down on two pieces of paper
• Store each piece of paper securely in separate locations
Repository to Share and Preservation
the FAIRness (Findable, Accessible, Interoperable, and Re-usable)
• GenBank (for genome data)
• ICPSR (for numeric social science data)
Reasons:
• Further science as a whole
• Further your research/reputation
• Enable new discoveries with your data
• Comply with funder/publisher data sharing requirements
Share Informally:
Posting on a web site, sending via email upon request
Share Formally:
Via a repository, which may also provide preservation and make your data more accessible and citable
Requirements
Confidentiality
Managing Data Checklist - MIT
Tools:
Make the data management plan public, and get feedback from collaborators.
Best Practice:
Serially indexed archives:
Poli
Some MIT workshop material on specific topics:
Funding
Copyright and conflict of interest
Ethics
Conflict of Interest
Declaring any conflict of interest is good practice and is often required. It is common in health sector research, and some journals require a specific form and statement.
Indexed Journals
In Health:
My first professional experience was in the health sector, where there are high standards for research.
These include data standards in general and health data standards in particular. That is why this article is strongly influenced by health documentation.
Reference Institution:
This is the research field with the most advanced standards and the most rigorous specifications and requirements for research.
In economics JEL:
The "JEL" classification system is a standard set of codes for classifying scholarly literature in the field of economics.
In Math:
Best practices:
Generals
AI in Data Management
There are five common data management areas where we see AI playing important roles:
Classification: Broadly encompasses obtaining, extracting, and structuring data from documents, photos, handwriting, and other media.
Cataloging: Helping to locate data.
Quality: Reducing errors in the data.
Security: Keeping data safe from bad actors and making sure it’s used in accordance with relevant laws, policies, and customs.
Data integration: Helping to build “master lists” of data, including by merging lists.
Style - Reference and Citation
APA
LaTeX
Some journals provide or require a LaTeX template or format.
CERTIFICATIONS
Key Concepts:
Repository
National Institutes of Health (NIH)
Resource:
https://sharing.nih.gov/data-management-and-sharing-policy
https://libraries.mit.edu/data-management/
Database schema: defines how data is organized within a relational database, including logical constraints such as table names, fields, data types, and the relationships between these entities. There are three types: a conceptual database schema, a logical database schema, and a physical database schema.
Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures.
Pending review and verification of evidence:
https://blog.dmptool.org/about-the-dmptool/
https://youtu.be/u-QdhVszRsg
https://www.youtube.com/watch?v=O_lRDw-_MKs&embeds_euri=https%3A%2F%2Fwww.nnlm.gov%2F&feature=emb_rel_pause
https://www.youtube.com/watch?v=RWX2mj_yh5o&embeds_euri=https%3A%2F%2Fwww.nnlm.gov%2F&feature=emb_rel_pause
SOLIC PRINCIPLES
SOLIC principles stand for the following:
- Self-describing: Data should be self-describing or self-explanatory, meaning it should contain all the necessary information to understand its content, structure, and context.
- Open: Data should be open and accessible to anyone, without any restrictions or barriers to access, use, reuse, or distribution.
- Linked: Data should be linked to other related datasets and resources, using standard and interoperable methods and technologies, to facilitate integration, analysis, and discovery.
- Interoperable: Data should be interoperable, meaning it should be able to be exchanged and used across different platforms, systems, and domains, using standard and harmonized formats, protocols, and vocabularies.
- Reusable: Data should be reusable, meaning it should be available and usable for multiple purposes and applications, without any limitations or constraints, and with appropriate acknowledgement and citation.
These principles are commonly used in the context of open data and data sharing initiatives, to promote the availability, accessibility, and usability of research data.
Application:
Suppose you are conducting a study on the impact of exercise on mental health in college students. You plan to collect data through surveys and fitness tracking devices, and you will analyze the data using statistical software.
SOLIC principles for this study could be:
- S: Store the data in a secure location, such as a password-protected server or a cloud-based storage solution that meets security standards. Limit access to the data only to authorized personnel.
- O: Organize the data by creating a data dictionary that describes the variables and their values, as well as a file naming convention that is clear and consistent. Use version control to keep track of changes to the data.
- L: Label the data with appropriate metadata that includes information about the study, the participants, the data collection methods, and any relevant contextual information. Use standardized terminology where possible.
- I: Interoperability can be achieved by using standardized file formats and data models. For example, you could use the Data Documentation Initiative (DDI) metadata standard to describe your data and make it easier to share with other researchers.
- C: Ensure that the data is complete and accurate by validating it against the original source documents and using quality control measures such as double-entry data input and outlier detection.
By following the SOLIC principles, you can help ensure that your data is well-managed, organized, and reusable, which can improve the integrity of your research and make it easier to share your findings with others.
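The two quality control measures named under C, double-entry data input and outlier detection, can be sketched as follows. The record layout (a dict from record ID to value), the helper names, and the use of Tukey's 1.5 x IQR rule are assumptions for this example; the source does not prescribe a particular outlier method.

```python
from statistics import quantiles

_MISSING = object()  # marks a record ID that is absent from one pass


def double_entry_mismatches(first_pass, second_pass):
    """Compare two independent entries of the same records.

    Both arguments are dicts mapping a record ID to its entered value.
    Returns the sorted IDs whose values differ or that appear in only one pass.
    """
    all_ids = set(first_pass) | set(second_pass)
    return sorted(
        record_id
        for record_id in all_ids
        if first_pass.get(record_id, _MISSING) != second_pass.get(record_id, _MISSING)
    )


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]
```

Flagged records are candidates for checking against the original source documents, not values to delete automatically.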
Source: https://www.ibm.com/topics/database-schema