Interesting methods or simple tips: Did you know?

I’d love to hear about epi or stats methods, or even coding tricks, that you don’t think many people know about but that are actually robust and helpful.

Here’s one of mine.

The rule of three

If you have zero events out of n in a group, you can get an approximate upper bound for the 95% confidence interval of the event rate with 3/n. Here’s the link

Really simple but also really helpful in strange circumstances.
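
For example, a quick base-R check (my own illustration, not from the link): the rule of three approximates the exact solution of (1 - p)^n = 0.05 when you have 0 events out of n.

# With 0 events out of n, the 95% upper bound for the event rate is roughly 3/n
n <- 100
3 / n              # rule of three: 0.03
1 - 0.05^(1 / n)   # exact solution of (1 - p)^n = 0.05: about 0.0295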

3 Likes

This is a good topic!

One thing I like to do is define my colour palettes at the top of my script, or separately from my reports. Then all the categories have consistent colours throughout my reports and if I need to change a colour then I only have to change it once.
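
A minimal sketch of what I mean (the categories and colours are made up):

# Define the palette once at the top of the script (or in a separate shared file)
library(ggplot2)

region_cols <- c(North = "#1b9e77", South = "#d95f02", East = "#7570b3")

# Every plot then reuses the same named vector, so the colours stay consistent
df <- data.frame(region = c("North", "South", "East"), cases = c(120, 80, 95))

ggplot(df, aes(x = region, y = cases, fill = region)) +
  geom_col() +
  scale_fill_manual(values = region_cols)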


2 Likes

A Quick Tip for Smoother Choropleth Maps

Here’s a little trick to make your choropleth maps easier on the eyes if you’re dealing with a lot of polygons, like the 5,570 municipalities in Brazil in the example below.

Set both colour and fill to the same colour, and then decrease the alpha a bit (I find 0.8 works pretty well).

With this small tweak, you can go from a cluttered, overwhelming map (especially in areas with many small polygons) (A) to a cleaner, more polished look (B).

What do you think? Do you prefer (A) or (B)? Sure, we could make the black lines in (A) thinner, but I still find that the approach I mentioned creates a more visually appealing result.

4 Likes

@lnielsen that’s really cool! And such a great example country for it. Brazil is so hard to map due to the size and scale.

1 Like

Overlapping Confidence Intervals

Does anyone else still have to go back to their old stats notes every time they need to interpret overlapping confidence intervals?

Often when looking at a graph of estimates with confidence intervals, people assume that if the intervals overlap then the estimates are not statistically different. But the proper test is on the difference between the two groups, not on the two individual intervals, and that test has more power, so two estimates whose confidence intervals overlap can still be significantly different.

The picture below summarizes this concept, and this article goes into more detail.
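
Here’s a small numeric sketch (made-up values, not from the article) of two estimates whose 95% CIs overlap but whose difference is still significant:

m1 <- 0;   se1 <- 1
m2 <- 3.5; se2 <- 1

# The individual 95% CIs overlap: group 1 goes up to 1.96, group 2 starts at 1.54
c(m1 - 1.96 * se1, m1 + 1.96 * se1)
c(m2 - 1.96 * se2, m2 + 1.96 * se2)

# But the test on the difference uses sqrt(se1^2 + se2^2), which is smaller
# than se1 + se2, so the difference is still statistically significant
z <- (m2 - m1) / sqrt(se1^2 + se2^2)
2 * pnorm(-abs(z))   # p is about 0.013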

3 Likes

I came across an article today, and it says that there are 9 algorithms to calculate quantiles in R :open_mouth::open_mouth::open_mouth:

Have you ever noticed that quantiles calculated by R differ from other statistical software such as SPSS and Graphpad Prism?
This discrepancy arises from the fact that there are different methods for calculating quantiles. Interestingly, R provides 9 algorithms for calculating quantiles.
Quantile algorithms – AMRBiostats
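
You can see it directly in base R (my own toy example):

# The same data and probability give different answers depending on which of
# the nine algorithms ("type" 1 to 9) quantile() uses
x <- c(1, 3, 5, 7, 20)
sapply(1:9, function(type) quantile(x, probs = 0.25, type = type, names = FALSE))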

Haha yeah, that’s kind of mad Luong! Once you start using multiple software packages you see that they’ll give different results for the same regression model. Sometimes it’s things like Stata and R comparing the 1 of a binary group to the 0 while SAS compares the 0 to the 1; that’s all fine because it’s just the inverse.
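
A tiny illustration of that (simulated data, just to show the point): flipping the reference level of a binary predictor simply reverses the sign of the coefficient.

set.seed(42)
df <- data.frame(group = factor(rep(c("0", "1"), each = 50)))
df$y <- rnorm(100, mean = ifelse(df$group == "1", 1, 0))

coef(lm(y ~ group, data = df))                      # compares 1 to 0
coef(lm(y ~ relevel(group, ref = "1"), data = df))  # compares 0 to 1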

Other times you learn that the algorithms themselves are different. There’s a whole host of ways to calculate the same things.

Linear regression can be fitted by ordinary least squares or by maximum likelihood estimation, which are equivalent in this case, but you could also use gradient descent, and that’s all before you get to the underlying computer code. Often you discover it’s some Fortran or C code under the hood, and often it’s running something called BLAS (Basic Linear Algebra Subprograms).
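
For instance, here’s a quick sketch (simulated data) showing lm()’s OLS fit and a hand-rolled maximum likelihood fit landing on essentially the same coefficients:

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# OLS via lm()
coef(lm(y ~ x))

# MLE via a general-purpose optimiser: minimise the negative normal
# log-likelihood over (intercept, slope, log sigma)
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}
optim(c(0, 0, 0), negloglik)$par[1:2]   # should be close to the lm() coefficients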

For the most part I’d imagine most Epis and Statisticians don’t spend much time thinking about different fitting algorithms. Mathematical modellers will, but usually in order to answer a specific problem. Then you get optimisation specialists who really delve into which algorithm you should use for each thing.

For us mere mortals, though, we just click the run button; it’s a much easier life!

Hi Chris. This is a very good explanation of the chaotic world we are dealing with :joy::joy: I was only aware of OLS and MLE for estimating linear regression, and personally I like the idea of MLE. Yes, you are right, an epi may not need to understand the full difference between methods as they are only slightly different. Save the time and do other fun jobs :sweat_smile::sweat_smile:

B is definitely much easier on the eyes - I like that it almost looks like a heatmap (though one would have to remember the relative lack of granularity).

1 Like

Has anyone here used word frequency analysis to encode free text responses from a surveillance or case investigation questionnaire?

I find it really useful: it saves a lot of time, and the encoding is more consistent than having different individuals do it manually. You can create the word frequency table from a document-term matrix with the tm package and friends, then create relative word clouds to represent the results of a logistic regression, for example.
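
A rough sketch of the word-frequency step (toy free-text responses, assuming the tm package):

library(tm)

responses <- c("fever and cough", "cough with fever", "headache only")

# Build a corpus, clean it, and create the document-term matrix
corp <- VCorpus(VectorSource(responses))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corp)

# Word frequency table across all responses
sort(colSums(as.matrix(dtm)), decreasing = TRUE)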

There are a couple of caveats: I did have a cleaning script to catch common spelling variations, and when I was working on this in 2019 it was not possible to find a lemmatization package that worked independently in R (to bring singular, plural etc. words back to their common root). Also, the pre-prepared list of ‘stop words’ to ignore in the tm package is fine for English, but I haven’t tried doing this in other languages (yet).

This might all be moot now that we can use structured data with AI large language models - I’m experimenting with this at the moment.

2 Likes

I’ve never used it formally, only for fun stuff like analysing conversations.

Can you tell me a bit more about how the words represent the results of a logistic regression? What is the regression on?

Which LLMs have you been trying: Perplexity, ChatGPT, or Gemini?

@lnielsen are you decreasing the alpha on both or just the colour?

The idea is to decrease the alpha on fill. Here is a reprex:

# Install pacman if not already installed
if (!require("pacman")) install.packages("pacman")

# Load required libraries
pacman::p_load(geobr, ggplot2, dplyr, cowplot)

# Download Brazilian municipalities shapefile
mun <- read_municipality(year = 2020)

# Add a random numeric column
set.seed(123)  # For reproducibility
mun <- mun %>% mutate(rand_val = runif(nrow(mun), min = 100, max = 1000))

# Maps

# Map A: default polygon borders (thin black outlines)
A <- ggplot(mun) +
  geom_sf(aes(fill = rand_val)) +
  theme_void() +
  labs(title = "A") +
  scale_fill_viridis_c() +
  theme(legend.position = "none")

# Map B: colour and fill mapped to the same variable, with alpha reduced
B <- ggplot(mun) +
  geom_sf(aes(fill = rand_val, colour = rand_val), alpha = 0.8) +
  theme_void() +
  labs(title = "B") +
  scale_fill_viridis_c() +
  scale_colour_viridis_c() +
  theme(legend.position = "none")

# Print both maps side by side
plot_grid(A, B)

1 Like