Tidy data refers to a structured and standardised way of formatting and organising datasets that adheres to the principles of simplicity, consistency, and usability. Many popular tools for analysis (including Python, R, and MATLAB) work best when data is arranged in a tidy way.
Tidy Data Principles
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
The aim is to make sure that each cell contains a single piece of information. This aligns with relational database principles and with tools commonly used in data and statistical analysis, so that data can be more easily manipulated, analysed, and visualised.
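As an illustration of the three principles, here is a minimal Python sketch that reshapes a "wide" table into a tidy one. The table contents, the column names (`site`, `year`, `count`) and the helper `to_tidy` are hypothetical and exist only for this example; in practice a data library would usually do this step for you.

```python
# Hypothetical "wide" table: one column per year, so the variable "year"
# is hidden in the column headers instead of having its own column.
wide_rows = [
    {"site": "A", "2022": 14, "2023": 17},
    {"site": "B", "2022": 9, "2023": 11},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Return one row per observation, with one column per variable."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy.append({
                id_column: row[id_column],
                variable_name: column,  # former header becomes a value
                value_name: value,      # one value per cell
            })
    return tidy


tidy_rows = to_tidy(wide_rows, "site", "year", "count")
for row in tidy_rows:
    print(row)
```

Each of the four resulting rows describes a single observation (one site in one year), which is the shape most analysis tools expect.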
Tips for Setting Up Data Files
- Don't combine multiple pieces of information in one cell. A value can look like a single thing, but consider whether that is the only way you will want to use or sort the data, e.g. use FirstName and LastName rather than Name.
- Always keep a copy of the raw data separate from your working files.
- Avoid formatting to convey information, e.g. bolding words, colour coding, adding comments to cells.
- Avoid merged cells.
- Export the cleaned data to a text-based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.
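Some of these tips can be sketched in a few lines of Python using only the standard library. The sample rows, the `Score` column and the helper `split_name` are made up for illustration, and splitting on the first space is a simplifying assumption that will not suit every name.

```python
import csv
import io

# Hypothetical raw data with two pieces of information in the Name cell.
raw_rows = [
    {"Name": "Ada Lovelace", "Score": "91"},
    {"Name": "Alan Turing", "Score": "88"},
]


def split_name(row):
    """Return a new row with FirstName and LastName instead of Name."""
    # Assumption: the first space separates first name from last name.
    first, _, last = row["Name"].partition(" ")
    return {"FirstName": first, "LastName": last, "Score": row["Score"]}


# Build cleaned rows as new objects so the raw data stays untouched.
cleaned_rows = [split_name(row) for row in raw_rows]

# Write CSV text to an in-memory buffer to keep the example self-contained.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["FirstName", "LastName", "Score"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(cleaned_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To save a real file instead, the same writer can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`, where `path` is a location separate from the raw data.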
Other Help with Tidy Data
The Library Carpentry project provides lessons on tidy data. UC also runs training that includes Tidy Data and OpenRefine, and you can attend in person.