Project tips + resources

Tips and tricks for visualisation

Consistency is the most important thing for a statistician.
- jumping between different citation styles is bad
- having some captions centered above the figure produced by R and others flushed right below using Markdown is annoying
The report should be self-contained.
- it should not be a tutorial for the code
- the reader should not have to jump between the report and the code to understand your work
Save space for better readability.
- plots that convey little information don’t have to be large, several (related) plots can be put next to each other on the same line, etc.
- even in an html file, unnecessary scrolling back and forth when reading a report is annoying
- a plot frame with one boxplot is a waste of space (e.g., histogram would be better)
- barplots can often be replaced by tables to both save space and improve readability
- in general, if a plot only shows like 3 numbers and is not important for any argument made, it should not be a plot
Story-telling matters.
- it is important to grasp attention with an introduction (and describe at the same time what to expect from a report)
- re-iterate the most important ideas/results in several places
- comment on plots even if they are self-explanatory
More on plots.
- text in figures (labels, etc.) should be of similar size as the main text
- labels have to be readable, e.g., no overlaying etc.
- captions are necessary and should make the plot self-contained (without looking at the paragraphs around it)
List of references should be itemized or enumerated (in order to be readable).
Avoid using local paths.
- reproducibility of the report itself!
Transforming variables.
- if your plots look bad because of a clear skewing in one of the variables, transform the varible (typically plot it on a log-scale)
- if plotting on a log-scale, you might consider log10 or log2 to have better interpretability

Some Links to Open Data

fivethirtyeight: article data of Nate Silver’s data journalism platform freely available (see also R package - fivethirtyeight)
data-is-plural: weekly newsletter of datasets by Jeremy Singer-Vine
re3data: Registry of research data repositories
openml datasets: many uniformly formatted datasets for training machine learning models – however, not always good descriptions available
Worldbank Datacatalog: the World Bank data catalogue
UK Data Service: UK’s largest collection of social, economic and population data resources (filter for open data) or also data.gov.uk
ICPSR: unit within the Institute for Social Research at the University of Michigan, social and behavioral research. In particular including replication datasets for published studies.
govdata: Open Government - German administrative data freely accessible
gapminder: “an independent educational non-proﬁt ﬁghting global misconceptions”; collection and vizualisation of datasets concerning gobal developement
nature.com: peer-reviewed, open-access journal for descriptions of datasets (broad range of natural science disciplines)
NIH (National Institute of Health) Data Sharing Repositories: overview on different thematically sorted medical databases
UCI Machine Learning Repository or the new beta version: containing various datasets – however, sometimes with a little few description
data.bris Research Data Repository: Data repository of the University of Bristol

… no systematic selection. Much more out there