Uploaded by Sherry Lake
Best practices data management

1. The document discusses best practices for managing research data over the data life cycle, from collection through sharing and archiving. It provides tips for organizing, documenting, and storing data in sustainable file formats and naming conventions. Following best practices helps ensure usability, reproducibility, and long-term access to research data.
2. Specific best practices covered include using consistent organization, standardized naming and formats, descriptive filenames, quality assurance, scripting for processing, documenting file contents, and choosing open file formats. The document also addresses data security, backup, and storage considerations.
3. Managing data properly is important for reusing and sharing data with others now or in the future. Scripting helps capture data workflows for reproducibility.

Best Practices: Creating and Managing Research Data
Presented by Sherry Lake
ShLake@virginia.edu
http://dmconsult.library.virginia.edu/
Data Life Cycle: Proposal Planning & Writing, Project Start-Up, Data Collection, Data Analysis, Data Sharing, End of Project, Data Archive, Data Discovery, Re-Purpose / Re-Use, Deposit
Why Manage Your Data?
Best Practices for Creating Data
1. Use Consistent Data Organization
2. Use Standardized Naming, Codes and Formats
3. Assign Descriptive File Names
4. Perform Basic Quality Assurance / Quality Control
5. Preserve Information - Use Scripted Languages
6. Define Contents of Data Files; Create Documentation
7. Use Consistent, Stable and Open File Formats
Spreadsheet Examples
Spreadsheets
Consistent Data Organization
• Spreadsheets (such as those found in Excel) are sometimes a necessary evil
– They allow “shortcuts” which will result in your data not being machine-readable
• But there are some simple steps you can take to ensure that you are creating spreadsheets that are machine-readable and will withstand the test of time
Spreadsheets
Spreadsheet Problems?
Problems
• Dates are not stored consistently
• Values are labeled inconsistently
• Data coding is inconsistent
• Order of values is different
Problems
• Confusion between numbers and text
• Different types of data are stored in the same columns
• The spreadsheet loses interpretability if it is sorted
How would you correct this file?
Spreadsheet Best Practices
• Include a header line as the 1st line (or record)
• Label each column with a short but descriptive name
– Names should be unique
– Use letters, numbers, or “_” (underscore)
– Do not include blank spaces or symbols (+ - & ^ *)
Spreadsheet Best Practices
• Columns of data should be consistent
– Use the same naming convention for text data
• Each line should be “complete”
• Each line should have a unique identifier
Spreadsheet Best Practices
• Columns should include only a single kind of data
– Text or “string” data
– Integer numbers
– Floating point or real numbers
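The spreadsheet rules above can be sketched as a small, machine-readable CSV written from a script. The well-salinity values, column names, and the -9999 missing-value code here are illustrative, not from the original slides:

```python
import csv
import io

# Hypothetical records following the slide's rules: header line first,
# short underscore-only names, one data type per column, a unique
# identifier per row, and no blank cells (-9999 marks a missing value).
rows = [
    {"sample_id": 1, "sample_date": "20050523", "salinity_top": 33.0, "salinity_bot": 34.5},
    {"sample_id": 2, "sample_date": "20051002", "salinity_top": 32.1, "salinity_bot": -9999},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["sample_id", "sample_date", "salinity_top", "salinity_bot"]
)
writer.writeheader()   # the header record occupies line 1
writer.writerows(rows)
print(buf.getvalue())
```

Because every row is complete and every column holds one type, the file survives sorting and loads cleanly into analysis software.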
Use Naming Standards & Codes
• Use commonly accepted label names that describe the contents (e.g., precip for precipitation)
• Use consistent capitalization (e.g., not temp, Temp, and TEMP in the same file)
• Standard codes
– State postal codes (VA, MA)
– FIPS Codes for Counties and County Equivalent Entities (http://www.census.gov/geo/reference/codes/cou.html)
Use Standardized Formats
• Use standardized formats for units
– International System of Units (SI): http://physics.nist.gov/Pubs/SP330/sp330.pdf
• ISO 8601 Standard for Date and Time
– YYYYMMDDThh:mm:ss.sTZD
– 20091013T09:12:34.9Z
– 20091013T09:12:34.9+05:00
• Spatial Coordinates for Latitude/Longitude
– +/- DD.DDDDD
– -78.476 (longitude), +38.029 (latitude)
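The ISO 8601 forms above can be produced directly from a language's date library rather than typed by hand, which avoids inconsistent formats creeping in. A minimal Python sketch (the timestamp value is illustrative):

```python
from datetime import datetime, timezone, timedelta

# A timezone-aware timestamp, formatted in the compact ISO 8601 basic
# form from the slide (YYYYMMDDThh:mm:ss with a time-zone designator).
utc_time = datetime(2009, 10, 13, 9, 12, 34, tzinfo=timezone.utc)
print(utc_time.strftime("%Y%m%dT%H:%M:%S") + "Z")  # 20091013T09:12:34Z

# The same instant expressed in a +05:00 offset, in ISO 8601 extended form.
offset_time = utc_time.astimezone(timezone(timedelta(hours=5)))
print(offset_time.isoformat())  # 2009-10-13T14:12:34+05:00
```

Letting the library render dates guarantees every file uses the same pattern, so the values sort chronologically as plain text.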
File Names
File Names
• Use descriptive names
• Not too long; CamelCase
• Try to include time
– Date using YYYYMMDD
– Use version numbers
• Don’t use spaces
– May use “-” or “_”
• Don’t change default extensions
Organize Files Logically
Make sure your file system is logical and efficient, e.g. folders by project (Biodiversity), location (Lake, Grassland), and activity (Experiments, Field Work), with file names built as ProjectName_Location_ExperimentName_Date.FileFormat:
Biodiv_H20_heatExp_2005_2008.csv
Biodiv_H20_predatorExp_2001_2003.csv
Biodiv_H20_planktonCount_start2001_active.csv
Biodiv_H20_chla_profiles_2003.csv
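A small helper can enforce the naming convention so files never pick up spaces or ad-hoc dates. This is a sketch; the function name and parts chosen (project, location, experiment, date) follow the pattern on the slide but are otherwise an assumption:

```python
from datetime import date

def data_filename(project, location, experiment, when, ext="csv"):
    """Build ProjectName_Location_ExperimentName_Date.ext:
    underscores instead of spaces, YYYYMMDD date, extension kept."""
    parts = [project, location, experiment, when.strftime("%Y%m%d")]
    name = "_".join(p.replace(" ", "_") for p in parts)
    return f"{name}.{ext}"

print(data_filename("Biodiv", "H20", "heatExp", date(2008, 6, 30)))
# Biodiv_H20_heatExp_20080630.csv
```

Generating names from one function keeps every file in a project consistent and makes bulk renames or searches trivial.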
Data Validation
• Check for missing, impossible, anomalous values
– Plotting
– Mapping
• Examine summary statistics
• Verify data transfers from notebooks to digital files
• Verify data conversion from one file format to another
Hook, et al. 2010. Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online: http://daac.ornl.gov/PI/BestPractices-2010.pdf.
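The missing/impossible-value checks and summary statistics above are easy to script. A minimal sketch, where the readings, the -999 error code, and the plausible temperature range are all assumed for illustration:

```python
import statistics

# Hypothetical air-temperature readings; None = missing,
# -999.0 = an assumed instrument error code.
readings = [18.2, 19.0, None, 21.5, -999.0, 20.1]

present = [v for v in readings if v is not None]
# Flag impossible values using an assumed plausible range for air temperature.
plausible = [v for v in present if -50 <= v <= 60]

print(f"missing: {readings.count(None)}")
print(f"out of range: {len(present) - len(plausible)}")
print(f"mean: {statistics.mean(plausible):.2f}  "
      f"min: {min(plausible)}  max: {max(plausible)}")
```

Running checks like these before analysis catches transfer and conversion errors early, which is the point of the validation step.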
Data Manipulation
• You will need to repeat reduction and analysis procedures many times
– You need to have a workflow that recognizes this
– Scripted languages can help capture the workflow
– You could just document all steps by hand
– After the 20th iteration through your data set, however, you may feel more fondly towards scripted languages
• Learn the analytical tools of your field
– Talk to colleagues, etc., and choose at least one tool to master
Preserve Information
• Keep the original (raw) file
– Do not include transformations, interpolations, etc.
– Consider making the raw data “read-only”
• Save processed data as a new file, produced by a processing script (R)
Preserving: Scripted Notes
• Use a scripted language to process data
– R statistical package (free, powerful)
– SAS
– MATLAB
• Processing scripts record the processing
– Steps are recorded in textual format
– Can be easily revised and re-executed
– Easy to document
• GUI-based analysis may be easier, but harder to reproduce
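The raw-file and scripted-processing advice can be combined in one short script: read the original, write derived values to a new file, and mark the raw file read-only. This is a sketch in Python rather than the R the slides mention; file names and the Celsius-to-Fahrenheit step are illustrative:

```python
import csv
import os
import tempfile

# Create a stand-in "raw" file in a temp directory for the demonstration.
raw_path = os.path.join(tempfile.mkdtemp(), "temps_raw.csv")
out_path = raw_path.replace("_raw.csv", "_processed.csv")
with open(raw_path, "w", newline="") as f:
    csv.writer(f).writerows([["site", "temp_c"], ["A", "18.5"], ["B", "21.0"]])

# Processing step: never edit the raw file; write results to a NEW file.
with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    writer.writerow(next(reader) + ["temp_f"])  # extend the header line
    for site, temp_c in reader:
        writer.writerow([site, temp_c, round(float(temp_c) * 9 / 5 + 32, 1)])

os.chmod(raw_path, 0o444)  # make the raw data read-only, as the slide suggests
```

Because every transformation lives in the script, re-running it after a correction regenerates the processed file exactly, which is the reproducibility benefit the slide describes.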
Data Documentation (Metadata)
• Informal or formal methods to describe your data
• Important if you want to reuse your own data in the future
• Also necessary when sharing your data
Define Contents of Data Files
• Create a Project Document File (Lab Notebook)
• Details such as:
– Names of data & analysis files associated with the study
– Definitions for data and codes (include missing value codes, names)
– Units of measure (accuracy and precision)
– Standards or instrument calibrations
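One common form for these definitions is a data dictionary: one row per variable, recording its name, definition, units, and missing-value code. A minimal sketch, with hypothetical variables, written as its own CSV so it travels alongside the data:

```python
import csv
import io

# A hypothetical data dictionary for a measurement file: every variable's
# name, meaning, units, and missing-value code in one small table.
dictionary = [
    ["variable", "definition", "units", "missing_code"],
    ["sample_date", "date of field visit, ISO 8601", "YYYYMMDD", "none"],
    ["salinity_top", "salinity at top of water column", "PSU", "-9999"],
    ["precip", "daily total precipitation", "mm", "-9999"],
]

buf = io.StringIO()
csv.writer(buf).writerows(dictionary)
print(buf.getvalue())
```

Keeping the dictionary in the same plain-text format as the data means anyone who can open the data can also read its documentation.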
Data Dictionary Example
Data Documentation
Project Documentation:
• Context of data collection
• Data collection methods
• Structure, organization of data files
• Data sources used
• Data validation, quality assurance
• Transformations of data from the raw data through analysis
• Information on confidentiality, access and use conditions
Dataset Documentation:
• Variable names and descriptions
• Explanation of codes and schemas used
• Algorithms used to transform data
• File format and software (including version) used
File Format Sustainability
Types and examples:
Text: ASCII, Word, PDF
Numerical: ASCII, SPSS, STATA, Excel, Access, MySQL
Multimedia: JPEG, TIFF, MPEG, QuickTime
Models: 3D, statistical
Software: Java, C, Fortran
Domain-specific: FITS in astronomy, CIF in chemistry
Instrument-specific: Olympus Confocal Microscope Data Format
Choosing File Formats
• Accessible data (in the future)
– Non-proprietary (software formats)
– Open, documented standard
– Common, used by the research community
– Standard representation (ASCII, Unicode)
– Unencrypted & uncompressed
Best Practices for Creating Data
1. Use Consistent Data Organization
2. Use Standardized Naming, Codes and Formats
3. Assign Descriptive File Names
4. Perform Basic Quality Assurance / Quality Control
5. Preserve Information - Use Scripted Languages
6. Define Contents of Data Files; Create Documentation
7. Use Consistent, Stable and Open File Formats
Following these Best Practices…
• Will improve the usability of the data by you or by others
• Your data will be “computer ready”
• Will save you time
Research Life Cycle / Data Life Cycle
Proposal Planning & Writing, Project Start-Up, Data Collection, Data Analysis, Data Sharing, End of Project, Data Archive, Data Discovery, Re-Purpose / Re-Use, Deposit
Managing Data in the Data Life Cycle
• Choosing file formats
• File naming conventions
• Document all data details
• Access control & security
• Backup & storage
Data Security & Access Control
• Network security
– Keep confidential or sensitive data off internet servers or computers connected to the internet
• Physical security
– Access to buildings and rooms
• Computer systems & files
– Use passwords on files/systems
– Virus protection
Backup Your Data
• Reduce the risk of damage or loss
• Use multiple locations (here, near, far)
• Create a backup schedule
• Use a reliable backup medium
• Test your backup system (i.e., test file recovery)
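The "multiple locations" and "test your backup" steps can be sketched in a few lines. Here the three destination directories are temporary folders standing in for here/near/far locations; the file names are illustrative:

```python
import os
import shutil
import tempfile
from datetime import date

# A stand-in data file to back up.
work = tempfile.mkdtemp()
data_file = os.path.join(work, "survey.csv")
with open(data_file, "w") as f:
    f.write("site,count\nA,12\n")

# Copy to several locations with a date-stamped name (here/near/far).
stamp = date.today().strftime("%Y%m%d")
backups = []
for location in ("local_drive", "network_share", "offsite"):
    dest_dir = os.path.join(work, location)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, f"survey_{stamp}.csv")
    shutil.copy2(data_file, dest)  # copy2 preserves file timestamps
    backups.append(dest)

# "Test your backup system": verify each copy is readable and complete.
for path in backups:
    with open(path) as f:
        assert f.read() == "site,count\nA,12\n"
print(f"{len(backups)} verified backups")
```

In practice the destinations would be real drives or remote storage and the script would run on the backup schedule, but the verify-after-copy step is the part most backup routines skip.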
Storage & Backup
Sustainable Storage
Lifespan of Storage Media: http://www.crashplan.com/medialifespan/
Best Practices Bibliography
Borer, E. T., Seabloom, E. W., Jones, M. B., & Schildhauer, M. (2009). Some simple guidelines for effective data management. Bulletin of the Ecological Society of America, 90(2), 205-214. http://dx.doi.org/10.1890/0012-9623-90.2.205
Graham, A., McNeill, K., Stout, A., & Sweeney, L. (2010). Data Management and Publishing. Retrieved 05/31/2012, from http://libraries.mit.edu/guides/subjects/data-management/.
Hook, L. A., Santhana Vannan, S. K., Beaty, T. W., Cook, R. B., & Wilson, B. E. (2010). Best Practices for Preparing Environmental Data Sets to Share and Archive. Available online (http://daac.ornl.gov/PI/BestPractices-2010.pdf) from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. http://dx.doi.org/10.3334/ORNLDAAC/BestPractices-2010.
Best Practices Bibliography (Cont.)
Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to social science data preparation and archiving: Best practices throughout the data cycle (5th ed.). Ann Arbor, MI. Retrieved 05/31/2012, from http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf.
Van den Eynden, V., Corti, L., Woollard, M., & Bishop, L. (2011). Managing and Sharing Data: A Best Practice Guide for Researchers (3rd ed.). Retrieved 05/31/2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf.


Editor's Notes

  • #4 The following are seven basic data habits that will help improve the information content of your data and make it easier to share data with others. Some have estimated that researchers can spend up to 80% of their time finding, accessing, understanding, and preparing data and only 20% of their time actually analyzing the data. The habits described in this module will help scientists spend more time doing research and less time doing data management.
  • #5 Spreadsheets are widely used for simple analyses. They are easy to use, BUT they allow (encourage) users to structure data in ways that are hard to use with other software. You can use them like Word, with columns. Spreadsheets in this format are good for “human” interpretation, not computers; since you will probably need to either write a program or use a software package, the “human” format is not best. These formats are good for presenting your findings, such as publishing, but they will be harder to use with other software later on (if you need to do any analysis). It is better to store the data in ways that it can be used in automated ways, with minimal human intervention.
  • #6 Example of poor data practice for collaboration and sharing. This illustration shows an example of poor practice for working in spreadsheets for data collection. At first glance, the data may appear well formulated, but a closer look reveals a number of practices that will make it difficult to re-use in its present state. For example, there are calculations in the far-right columns that appear to have been made during a data analysis phase but that do not represent valid data entries. Notice in the upper right corner a comment stating “Don’t use - old data” and “Peter’s lab”. These remarks leave the viewer wondering who Peter is and which lab he was located in, as well as why this may not be the most accurate data spreadsheet. One may also wonder what the “c” in the far-right column represents, and what the numbers at the bottom of the spreadsheet represent, since they are unaffiliated with a particular row of data. Notice there are numbers added in inconsistent places (two numbers at the bottom of the chart), and the letter “C” appears in an unlabeled column.
  • #7 Spreadsheets are widely used for simple analyses. They are easy to use; however, they allow (encourage) users to structure data in ways that are hard to use with other software. You can use them like you would a Word document, with columns and colors. Spreadsheets in this format are good for “human” interpretation, not computers; since you will probably need to either write a program or use a software package, the “human” format is not best. These formats are good for presenting your findings (publishing), but they will be harder to use with other software later on (if you need to do any further analysis). It is better to store the data in formats that can be used in automated ways, with minimal human intervention.
  • #9 These are some well data measurements, where a salinity meter was used to measure the salinity (top and bottom) and the conductivity (top and bottom). Take a look at this spreadsheet… What’s wrong with it? Could this be easily automated? Sorted? Would you create a file like this?
  • #10 Dates are not stored consistently: sometimes the date is stored with a label (e.g., “Date:5/23/2005”), sometimes in its own cell (10/2/2005). Values are labeled inconsistently: sometimes “Conductivity Top”, elsewhere “conductivity_top”; for salinity, sometimes two cells are used for top and bottom, while in others they are combined in one cell. Data coding is inconsistent: sometimes YSI_Model_30, sometimes “YSI Model 30”, so you can’t quite tell whether it’s a “label” or a data value; Tide State is sometimes a text description, sometimes a number. The order of values in the “mini-table” for a given sampling date differs: “Meter Type” comes first in the 5/23 table and second in the 10/2 table.
• #11 More problems:
  – Confusion between numbers and text: most software considers "39%" or "<30" to be TEXT, not numbers (what is the average of 349 and <30?).
  – Different types of data are stored in the same columns: many software products require that a single column contain either text or numbers, but not both.
  – The spreadsheet loses interpretability if it is sorted: dates are related to a set of attributes only by their position in the file, and once sorted, that relationship is lost. (Not that you would necessarily want to sort this.)
• #12 Hint – think about representing missing values and about sortability. You want each row to be a complete record, with no blank cells, so think of a way to represent missing values. Design the file to be machine readable, not just human readable. The original spreadsheet loses interpretability if it is sorted: dates are related to a set of attributes only by their position in the file, and once sorted, that relationship is lost.
• #13 (Sherry) A standard convention for many software programs (usually offered as a yes/no "header" check box) is for the first line (record) to be a header line listing the names of the variables in the file; the rest of the records (lines) are data. Don't make variable names too long – some software programs may not work with long variable names.
• #14 Every cell in each line of the spreadsheet should be filled. Otherwise the file isn't machine-readable, and it won't even survive a "sort" operation. Note that we've changed the format of the date to an ISO YYYYMMDD format.
• #15 Format the columns so they contain a single type of data. One problem with Excel is that it doesn't like to show trailing zeros, so "33.0" in F2 is shown as "33" unless you change the formatting, as we have done here.
• #17 ISO 8601 date/time format (am/pm NOT allowed; "T" appears literally in the string; the minimum for a date is YYYY):
  – YYYY = four-digit year
  – MM = two-digit month (01 = January, etc.)
  – DD = two-digit day of month (01 through 31)
  – hh = two digits of hour (00 through 23)
  – mm = two digits of minute (00 through 59)
  – ss = two digits of second (00 through 59)
  – s = one or more digits representing a decimal fraction of a second
  – TZD = time zone designator (Z or +hh:mm or -hh:mm)
  Example coordinates: latitude 38.029N; longitude -78.476W.
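This format is straightforward to produce with standard date libraries; a minimal Python sketch (the sample timestamp is made up for illustration):

```python
from datetime import datetime, timezone

# A hypothetical sampling time, stored with an explicit time zone.
dt = datetime(2005, 5, 23, 14, 30, 0, tzinfo=timezone.utc)

# Full ISO 8601 timestamp: "T" separates date and time,
# and the +hh:mm suffix is the time zone designator (TZD).
print(dt.isoformat())         # 2005-05-23T14:30:00+00:00

# Compact date-only form (YYYYMMDD) sorts chronologically in file listings.
print(dt.strftime("%Y%m%d"))  # 20050523
```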
• #18 File names should reflect the contents of the file and uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type. Think about how the name will look in a directory with lots of other files – you want to be able to pick it out. Having trouble finding files, or telling which is the most recent one?
• #19 File names are the easiest way to indicate the contents of a file. Use terse names that are indicative of the content and uniquely identify the data file. Think about the organizing principle; don't just make up a system as you go along. Don't make names too long – some scripting programs have a filename length limit for file importing (reading). Don't use blanks/spaces in file names; some software may not be able to read file names containing them. Think about how the name will look in a directory with lots of other files – you want to be able to pick it out.
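One way to keep names terse, unique, and space-free is to assemble them from fixed parts in a fixed order; a small sketch (the project, site, and data-type values here are hypothetical examples, not from the slides):

```python
# Build a descriptive, machine-friendly file name from its parts.
def make_filename(project, site, date_yyyymmdd, datatype, version, ext):
    # Underscores instead of spaces; date in YYYYMMDD so files sort
    # chronologically; explicit version number.
    parts = [project, site, date_yyyymmdd, datatype, f"v{version}"]
    return "_".join(parts) + "." + ext

name = make_filename("bigfoot", "NE", "20050523", "salinity", 2, "csv")
print(name)  # bigfoot_NE_20050523_salinity_v2.csv
```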
• #20 As with naming files, similar logic is useful when designing file directory structures and names – make sure the design is logical and efficient.
• #21 Perform basic quality assurance. Doing quality control will help you in your project, and it will also help those who want to use your data. Would you want to use data whose quality you were not sure of?
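A basic QA pass can be scripted: flag cells that are not numeric (like the "<30" and "39%" entries mentioned earlier) and values outside a plausible range. A minimal sketch; the column and the 0–45 range are illustrative assumptions, not values from the slides:

```python
# Flag non-numeric entries and out-of-range values in one column.
def check_column(values, lo, hi):
    problems = []
    for i, v in enumerate(values):
        try:
            x = float(v)
        except (TypeError, ValueError):
            problems.append((i, v, "not a number"))  # e.g. "<30" or "39%"
            continue
        if not (lo <= x <= hi):
            problems.append((i, v, "out of range"))
    return problems

salinity = ["33.0", "32.5", "<30", "339"]
print(check_column(salinity, 0, 45))
```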
• #22 You don't want to change (or delete) something that could be important later – and you don't know now what that might be. Make corrections and deletions in a derivative file, never in the original file. Things to think about:
  – Operationally, you want to keep the raw data until you are "finished".
  – Whether you preserve the raw data after the project is over depends on various factors, most importantly whether the data can be easily regenerated. Experimental data often can be; observational or survey data usually isn't reproducible and needs to be preserved after the end of the project.
  – How would you name the new file?
  – If you use a scripted language, you can re-run analyses.
  – It is important to take good notes on what changes you make to the data file.
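The "never edit the original" rule can be enforced mechanically: copy the raw file to a derivative, then mark the raw file read-only at the file-system level. A minimal sketch, assuming POSIX-style permissions; the paths are hypothetical:

```python
import os
import shutil
import stat

# Keep the raw file untouched and read-only; do all corrections in a
# clearly named derivative copy.
def protect_raw(raw_path, derived_path):
    shutil.copy2(raw_path, derived_path)  # work on the copy from now on
    os.chmod(raw_path, stat.S_IREAD)      # strip write permission on the original
```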
• #23 To preserve your data and its integrity, save a "read-only" copy of your raw data files with no transformations, interpolation, or analyses. Use a scripted language such as R, SAS, or MATLAB to process the data into a separate file, located in a separate directory. In this example, a call is made from R on the data set to plot the data and perform a log transform – this way, changes are not retained in the original, raw data file.
• #24 "Scripted" analysis software: R, SAS, SPSS, MATLAB. Analysis scripts are written records of the various steps involved in processing and analyzing data (a sort of "analytical metadata"). They are easily revised and re-executed at any time if the analysis needs to be modified. Document scripted code with comments on why the data is being changed. The scripts you have written are an excellent record of data processing; they can be quickly revised and rerun in the event of data loss or requests for edits, and they have the added benefit of allowing a future worker to follow up on or reproduce your processing. Keep in mind that while GUI-based programs are easier on the front end, they do not keep a record of changes to your data and make reproducing results difficult.
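The idea above can be made concrete with a small, re-runnable script: every transformation lives in code that reads the raw file and writes a derived file, so the whole chain can be repeated or audited. The slides use R/SAS/MATLAB; this is an equivalent Python sketch with hypothetical file and column names:

```python
import csv
import math

# A re-runnable processing step: read the raw file, apply a documented
# transform, write a derived file. The raw file is never modified.
def process(raw_path, derived_path):
    with open(raw_path, newline="") as src, \
         open(derived_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["log_salinity"])
        writer.writeheader()
        for row in reader:
            # Log transform, with the reason recorded here in the script
            # rather than lost in an interactive GUI session.
            row["log_salinity"] = f"{math.log(float(row['salinity'])):.4f}"
            writer.writerow(row)
```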
• #25 Metadata and associated documentation are absolutely crucial for any potential use or reuse of data; no one can responsibly re-use or interpret data without accompanying standardized metadata or documentation. Metadata describe your data so that others can understand what your data set represents; think of them as "data about the data" or the "who, what, where, when, and why" of the data. Metadata should be written from the standpoint of a reader who is unfamiliar with your project, methods, or observations: what does a user, 20 years into the future, need to know to use your data properly? Informal documentation is something like a ReadMe file; formal documentation uses a structured format such as a data dictionary, codebook, or metadata standard. Different disciplines may have their own format standards. Informal is better than nothing.
• #26 More documentation (documentation can also be called metadata):
  – Description of the data file names (especially if using acronyms and abbreviations)
  – Why you are collecting the data
  – Details of the methods of analysis
  – Names of all data and analysis files
  – Definitions for data (include coding keys)
  – Missing value codes
  – Units of measure
  Structured metadata (XML) format standards exist for some disciplines (e.g., Ecological Metadata Language – EML).
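A data dictionary can itself be machine-usable, tying each variable to its definition, units, and missing-value code so that readers (and scripts) decode values the same way. A minimal sketch; the variable names and the -999 missing code are illustrative assumptions, not from the slides:

```python
# A machine-usable data dictionary: definition, units, and
# missing-value code for each variable.
DATA_DICTIONARY = {
    "salinity_top": {"definition": "salinity at surface", "units": "ppt", "missing": "-999"},
    "cond_top": {"definition": "conductivity at surface", "units": "mS/cm", "missing": "-999"},
}

def decode(field, raw_value):
    # Return the value as a float, or None if it is the missing-value code.
    if raw_value == DATA_DICTIONARY[field]["missing"]:
        return None
    return float(raw_value)

print(decode("salinity_top", "33.0"))  # 33.0
print(decode("cond_top", "-999"))      # None
```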
  • #27 Here the data dictionary specifies units for each field (parameter)
• #29 In fact, you probably already have metadata in some form; you just may not recognize it as such. Among your work records, you certainly have notebooks stuffed with color-coded pages or assorted keys to your data stored on your computer. Perhaps the most common form of metadata you already have is a file folder filled with notes on your data sources and the procedures you used to build your data. However, unless you've been unusually diligent, your information is probably not organized so that a stranger could stroll into your office at any time and read and understand it easily. Start at the beginning of the project and continue throughout data collection and analysis: record why you are collecting the data and the exact details of the methods of collecting and analyzing it. This is good for reproducibility – you can go back when questioned, or when updating your results, and reproduce the algorithms. There are also efficiencies in how the science is done: if you have to spend a lot of time figuring out what was done last time, you lose efficiency in reproducing those results or updating the analysis. Along the same lines, much of the work we do nowadays is collaborative, involving more than one agency, university, or partner; documenting the data and the analysis helps to share the information and lets everyone on the collaborative team understand what's being done. Documenting the data and the analysis also creates a provenance: a full history of when the project was started, how the analysis was done, and how the final results were completed.
• #30 The collection/analysis format does not have to be the same as the preservation format, but if it is not, the data will need to be converted to an interchangeable format for archiving (more on this later). Choose a file format that can be read well into the future and is independent of software changes. Fundamental Practice #3: use stable file formats. Data re-use depends on the ability to return to a dataset, perhaps long after the proprietary software you used to develop it is no longer available. Remember floppy disks? It is difficult to find a computer that will read a floppy disk today; we must think of digital data in a similar way. If your data collection process used proprietary file formats, converting those files into a stable, well-documented, non-proprietary format is a best practice that maximizes others' ability to use and build upon your data. When possible, convert your tabular dataset into ASCII text format. To be accessible in the future, a format should be:
  – Non-proprietary
  – An open, documented standard
  – Common, used by the research community
  – A standard representation (ASCII, Unicode)
  – Unencrypted
  – Uncompressed
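Exporting tabular data to plain-text CSV is the simplest version of this conversion. A minimal sketch; the rows are made-up sample data and the file name is hypothetical:

```python
import csv

# Write tabular data as plain ASCII CSV: non-proprietary, unencrypted,
# uncompressed, readable by virtually any software.
rows = [
    {"date": "20050523", "salinity_top": "33.0"},
    {"date": "20051002", "salinity_top": "32.5"},
]
with open("well_data.csv", "w", newline="", encoding="ascii") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "salinity_top"])
    writer.writeheader()  # first line names the variables
    writer.writerows(rows)
```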
• #31 Storing data in recommended formats with detailed documentation will allow your data to be easily read many years into the future. Spreadsheets are widely used for simple analyses, but they have poor archival qualities: different versions over time are not compatible, and formulas are hard to capture or display. Plan what type of data you will be collecting, and choose a file format that can be read well into the future and is independent of software changes. These formats are more likely to remain accessible than relying on replacing old media or maintaining devices that can still read a proprietary format or media type. The format of a file is a major factor in the ability to use the data in the future; as technology changes, plan for software and hardware obsolescence. System files (SAS, SPSS) are compact and efficient but not very portable – use the software to "export" the data to a portable (or transport) file. Convert proprietary formats to non-proprietary ones, and check for data errors in the conversion.
• #32 Recap of the best practices:
  – Consistent organization: create spreadsheets so they can be automated.
  – Standardized codes and formats: date/time standards, geospatial coordinates, species names, and other standards from your discipline.
  – Descriptive file names: file names can help identify what's inside.
  – Quality assurance: when planning data entry, you can "program" data checks into forms (Access and Excel), create pick lists (codes), and define missing data values.
  – Scripted languages: make it easier to replicate data transformations, which can then be documented.
  – Documentation: document EVERYTHING – dataset details, database details, collection notes and conditions. You will not remember everything 20 years from now! Record what someone would need to know about your data in order to use it.
  – Stable file formats: it is easier if all files are the same format, and know which formats are better in the long term.
• #33 Planning the management of your data before you begin your research AND throughout its lifecycle is essential to ensure its current usability and its long-term preservation and access. With a repository keeping your data, you can focus on your research rather than fielding requests or worrying about data on a web page. Your project may have many people working on it; you will need to know what each is doing and has done, and the project may last years. Funding agencies now require a data management plan. Having your data documented will allow future users to understand it and be able to use it. If you follow the plan, the data should be ready for archiving – documenting the data throughout ensures that a proper description of the data is maintained.
• #34 Collecting the data is just part of a research project. Here's a view of the complete life cycle of research; the data you collect (all the files and notes) need to be managed throughout the project. Steps in the research life cycle:
  – Proposal Planning & Writing: conduct a review of existing data sets; determine whether the project will produce a new dataset (or combine existing ones); investigate archiving challenges, consent, and confidentiality; identify potential users of your data; determine costs related to archiving; contact archives for advice (look for archives).
  – Project Start Up: create a data management plan; make decisions about document form and content; conduct pretests and tests of materials and methods.
  – Data Collection: follow best practices; organize files, backups, and storage; perform QA during data collection; manage access control and security.
  – Data Analysis: manage file versions; document analysis and file manipulations.
  – Data Sharing: determine file formats; contact the archive for advice; do more documenting and cleaning up of the data.
  – End of Project: write the paper; submit and report findings; deposit the data in a data archive (repository).
  Remember: managing data in a research project is a process that runs throughout the project. Good data management is the foundation for good research, especially if you are going to share your data. It is essential to ensure that data can be preserved and remain accessible in the long term, so it can be re-used and understood by other researchers. When managed and preserved properly, research data can be successfully used for future scientific purposes.
  • #35 Here’s the details about what we are going to manage in the Data Life Cycle.Many of the criteria for managing data are the best practices that we already went over. The 2 highlighted we haven’t talked about yet.
• #36 Access control and security:
  – Assign the master copy to a designated team member; restrict write access to specific members; record changes with version control.
  – Network: keep confidential data off internet servers (or behind firewalls); put sensitive materials on computers not connected to the internet.
  – Physical security: who has access to your office? What about allowing repairs by an outside company?
  – Computer: keep virus protection up to date; make sure your computer has a login password; don't send personal or confidential data via e-mail or FTP – transmit it encrypted; impose confidentiality agreements on data users.
• #37 Why back up data? Keeping reliable backups is an integral part of data management. Regular backups protect against data loss due to hardware failure, software or media faults, virus infection or hacking, power failure, and human error. Recommendation: keep three copies – the original, an external/local backup, and an external/remote backup. Use full backups plus incrementals. If you use a departmental server, check on its backup/restore procedures (how quickly can you get files restored?); you may want the backup procedures under your own control. Test your backup system, test restoring files, and don't over-reuse backup media.
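Testing a backup need not wait for a restore drill: a copy can be verified at the moment it is made by comparing checksums. A minimal sketch; the paths are hypothetical:

```python
import hashlib
import shutil

def _sha256(path):
    # Checksum of a file, read in chunks so large files are handled too.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def backup(src, dest):
    # Copy, then verify the copy against the original so a corrupted
    # backup is caught immediately instead of at restore time.
    shutil.copy2(src, dest)
    if _sha256(src) != _sha256(dest):
        raise IOError("backup verification failed: checksums differ")
```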
• #38 There are a variety of methods to store and share your data, from thumb drives to shared online environments: a personal computer; a departmental or university server; a home directory or UVa Collab (storage only); tape backups; a subject archive. Each type of storage (and don't forget backups) has its strengths and weaknesses; you need to evaluate them for your research.
• #39 The point: think about migrating information from obsolete media to new media. Using CD-ROMs as data backups is popular – blank CDs are inexpensive, and copying data onto CDs is easy – but this is the most unreliable of the backup methods listed here. Who hasn't put a CD into a drive only to find that the data is unreadable and the disk "doesn't work"? CDs, like the floppy disks they've replaced, have a limited shelf life; if you are writing your backup files onto CDs, make sure you make (and keep) multiple copies over time. An external hard drive for data backups is recommended: external hard drives are relatively cheap and easy to use (in many cases, all you have to do is plug the drive into your computer's USB port), and while hard drives do fail, their failure rate is much lower than that of backup media such as CDs. Cloud storage is available for a fee, or free for up to 10 GB (consider storage costs and data transfer).
