(Making of) the one figure that summarizes HBV prevalence research in Bangladesh

Hassan uz-Zaman
Feb 16, 2020

My first scientific publication came out in MDPI’s Genes. Very simply put, it’s a quasi-systematic review of all the research published so far on Hepatitis B virus (HBV) epidemiology in Bangladesh. The paper involved some quantitative analysis of the prevalence data we pooled (somewhat akin to a meta-analysis), but that was only a part of it. We also comment extensively on the studies themselves: the study populations chosen, the motivations behind designing the studies in a certain way, major gaps in research and why those gaps exist, and so forth. At the risk of immodesty, I would venture to say it’s a very important resource for HBV epidemiology researchers in Bangladesh going forward.

In this post, I want to go a little behind the scenes.

In his book Mobile DNA: Finding Treasures in Junk, Haig Kazazian Jr. recounts the following story about L1 retrotransposon research carried out by two of his graduate students (pp. 110–111):

When Brouha was writing his paper on these data, he wanted to have one summary figure that showed the chromosomal location of all 82 tested L1s, their allele frequency, their L1 subset, their ability to retrotranspose, and their relative activity as retrotransposons. Brouha and Shustak thought very hard about this figure and finally came up with a plan. They would show all the chromosomes individually and portray the L1s as human stick figures next to their chromosomal location. Allele frequencies were shown by the act of shading of the human figure. The extent of activity was shown by both the size of the figure and its state of recumbence. The “hot” L1s were shown as large and standing tall. Dead L1s were smaller figures lying flat on their backs. Different L1 subsets were shown as different shadings of the figures. For example, an upright tall figure shaded in from the waist down next to the short arm of chromosome 6 is a highly active Ta subset L1 with an allele frequency of 0.5 on the short arm of chromosome 6. In my view, this figure is one of the most innovative I have ever seen, and all the credit for it goes to Brouha and Shustak.

Here’s one part of the figure under discussion (for the first 8 chromosomes):

I was blown away by how much information is condensed into this one figure. To review:

  1. L1 locations: by their position next to the chromosomes
  2. Activity: Size and posture of stick figures
  3. Allele frequency: Degree of shading
  4. Subclass: Color.

And at the same time, it didn’t seem overly congested. Ever since reading this story, I have wanted to include such a figure in my own publications: one that would summarize a lot of information and, at the same time, use distinctive markers so that it doesn’t look cluttered. I mean, a table can summarize a lot of information, but it’s often not easy on the eyes.

Coming back to our paper on HBV, we spent a large part of it discussing HBV prevalence in the general population of Bangladesh. Researchers had pointed out earlier that a lot of the HBV epidemiology work conducted in Bangladesh focused on particular risk groups: injecting drug users or commercial sex workers, say. The prevalence rate among these groups couldn’t be said to represent the prevalence rate of the general population. In our article, we wanted to explore this in detail. After we looked at the prevalence studies conducted on putative “general” populations (i.e. people not explicitly engaging in HBV risk behaviors), we realized this characterization was not only true, but true to a degree that we hadn’t foreseen. We could hardly find any recent study that was done on a truly representative population with a decent sample size.

I wanted to summarize these findings in one grand figure, one that would show:

  1. The prevalence values spread across time (from the early 1980s to the present),
  2. The sample sizes of the studies,
  3. Whether the study populations were over- or under-representing HBV prevalence in the general population.

This is the story of how that figure came to be.

To start off, I made a horizontal bar diagram fused to a table that showed the prevalence values of all the studies, arranged downward from earliest to latest. At this point the figure looked like this:

Of course, percentages are useless without sample sizes. The larger the sample size of a study, the more seriously it has to be taken. Is the study reporting an 8.74% prevalence just as reliable as the study reporting a 0% prevalence? Well, to begin to make that assessment, we need sample sizes. I could’ve just written the sample sizes next to all of the bars, but that’s mundane and adds clutter, which was exactly what I was trying to avoid. So instead, I decided to make the bars thicker or thinner depending on their sample sizes. A thicker bar meant a larger sample size, which in turn meant the study needed to be taken more seriously. To achieve this, I first normalized the sample sizes, and then set the height of each row to a value proportional to its normalized sample size. I knew these operations were way too niche and ad hoc, so I did everything manually in Excel instead of using any graph or chart tool. This is how the figure looked at this point:

The sample sizes ranged from 130 in [4] to above 40,000 in [8]. This shows that neither the 0% nor the 8.74% prevalence value needs to be taken super seriously, as both studies have very thin bars (i.e. very low sample sizes).
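For anyone who would rather script this step than do it by hand, here is a minimal Python sketch of the row-height idea. It assumes min-max normalization and a linear mapping to heights; I did this manually in Excel, so the function name `row_heights`, the height limits and the two middle sample sizes are all hypothetical. Only 130 and (roughly) 40,000 come from the studies mentioned above.

```python
def row_heights(sample_sizes, min_height=4.0, max_height=40.0):
    """Map sample sizes to row heights using min-max normalization.

    min_height and max_height are arbitrary illustrative limits (in points).
    """
    smallest = min(sample_sizes)
    largest = max(sample_sizes)
    span = largest - smallest
    heights = []
    for n in sample_sizes:
        # Normalized sample size in [0, 1]; use 0 if every study has the same size
        normalized = (n - smallest) / span if span else 0.0
        # Scale the normalized value linearly into the allowed height range
        heights.append(min_height + normalized * (max_height - min_height))
    return heights


# 130 and 40,000 are taken from the text ("above 40,000" is rounded down here);
# 1,000 and 5,000 are made-up values for illustration only.
sizes = [130, 1000, 5000, 40000]
heights = row_heights(sizes)
```

With a linear mapping like this, one very large study dominates and most other bars end up close to the minimum height. A square-root or log scale would even things out, but that would be a different design choice from the one described here.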

This is still not the full story, however. Remember how I said almost none of the populations of the recent studies were truly representative of the general population? I wanted to capture that information in the figure as well. After all, a very high prevalence rate doesn’t translate to actual population prevalence if the study is done among people of a low socio-economic status (and hence low health awareness), or on a sample with a high male:female ratio (HBV is more common among males than females). This time, I thought this could be done by coloring the bars: red would indicate studies underrepresenting actual population prevalence, while green would indicate studies overrepresenting it. Black means no bias either way, and yellow means both. I also added one more column listing the factors that lead to this over- or underrepresentation. Here’s how the final product looked after a little more polish:

Voilà.
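The color scheme is simple enough to write down as a rule. Here is a minimal Python sketch, assuming each study has been tagged by hand with two yes/no judgments. The function name `bar_color` and the two flags are hypothetical, and the example calls are illustrative readings of the factors discussed in this post, not rows copied from the actual figure.

```python
def bar_color(overrepresents, underrepresents):
    """Return the bar color for a study, following the scheme described above."""
    if overrepresents and underrepresents:
        return "yellow"  # biased in both directions
    if overrepresents:
        return "green"   # likely overestimates general-population prevalence
    if underrepresents:
        return "red"     # likely underestimates general-population prevalence
    return "black"       # no obvious bias either way


# Illustrative cases only (assumptions, not data from the paper):
# a low socio-economic status sample would push prevalence up
low_ses_color = bar_color(overrepresents=True, underrepresents=False)
# a sample with high awareness AND a high male:female ratio pulls both ways
mixed_color = bar_color(overrepresents=True, underrepresents=True)
```

The judgment calls behind the two flags are the hard part. The code only makes the mapping from those calls to colors explicit.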

I know this is nowhere close to the degree of innovative thinking that went into Brouha and Shustak’s figure, but it manages to summarize a ton of quantitative and qualitative information while keeping things accessible. Because the prevalence values are arranged chronologically from early to recent, one can note the decline in prevalence in recent years. The very few black bars (none since 2013) indicate a clear dearth of prevalence studies that can be considered representative of the population at large. At a broader level, the figure also shows that a lot of these study designs fall short of truly capturing the general population prevalence. This last point is bolstered by other analyses in our paper as well, like the predominance of prevalence studies among blood donors. These studies have huge sample sizes, but are almost never representative because of higher awareness among the donor population and a consistently high male:female ratio. What was clearly happening with a lot of these studies is that medical institutes were sitting on top of a stash of screening data, which they decided to publish in national research journals. A proper study design that truly reflects general prevalence would in fact need a lot of forethought (a good example is [1]), as opposed to piggy-backing on screening data collected for other purposes and calling it prevalence.

So that’s what I wanted to plug. Can I get y’all to cite my paper now?
