Posts tagged “data”

Some more graphs of Beijing's Air Pollution

A bunch of folks across the internet have been doing some great stuff with the air quality data coming out of China via official channels and the US Embassy Twitter feeds. My advisor asked for some graphs of the available data. They are posted below (all were created in R using ggplot2). If time ever permits, I’ll post some interactive visualizations.

Using RAW to visualize Global Burden of Disease Data

RAW is a really impressive and easy-to-use data visualization tool created by Density Design. I created the following plot in about five minutes from existing GBD data (of DALYs in India for women of all ages).

[Figure: flow diagram generated with RAW. Risk categories (left column) link to individual risk factors (middle column), which link to causes of disease (right column); each flow is labeled with its attributable DALYs for women of all ages in India, GBD 2010.]

The first column contains risk categories as defined by the comparative risk assessment of the 2010 Global Burden of Disease. The second column contains individual risk factors (each of which fits into an aforementioned risk category). The final column shows attributable DALYs by cause. Some color would help differentiate the different risks and causes, but the basic picture is clear if you spend a few minutes with the graph. Women in India, according to the 2010 GBD, predominantly lose healthy life years from CVD, chronic respiratory diseases, nutritional deficiencies, and infectious disease. A fair amount of this is attributable to air pollution.

To make this plot, I opened a CSV, copied its contents, pasted them into a text field at RAW, and then used its simple, elegant GUI to generate the code for the plot. The options are a little limited for now (I'd like to add some color, shift label positions around, etc.). If I really wanted to make those changes, I could edit the code and do it manually. It's a really impressive showcase of what can be done in the browser, and definitely worth checking out and keeping an eye on.

A robust, low-cost particle monitor and data platform for evaluation of cookstove performance

Johnson M, Pillarisetti A, Allen T, Charron D, Pennise D, Smith KR. A robust, low-cost particle monitor and data platform for evaluation of cookstove performance. EPA Air Sensors 2013: Data Quality & Applications. Research Triangle Park, NC: March 18-19, 2013.

Beautiful new service from the creators of Dark Sky. A number of cool things about it, including its beautiful visualizations and use of data from around the globe. Particularly, the developers note:

We’ve gathered hour-by-hour observations from tens of thousands of ground stations world-wide, in some places going back a hundred years. We expose it as a sort of “time machine” that lets you explore the past weather at any given location. We’ve also used the data to develop statistical forecasts for any day in the future. For example, say you have an outdoor family reunion in 6 months: with the time machine, you can see what the likely temperature and precipitation will be at the exact day and hour.

Their API sounds good, too, though I haven’t taken the plunge on that yet.

Now that we’ve developed a general-purpose weather API, we’re trying to compete with the other weather APIs available around the Internet. We’ve found those APIs to be difficult and clunky to use, so we’ve tried to make our API as streamlined as possible: you can sign up for a developer account without needing a credit card, and start making requests right away—you can worry about payment information when your app is ready. Additionally, we’ve lowered our prices so that we’re competitive with the other data providers out there.

Via DF

R + Global Burden of Disease / Comparative Risk Assessment Data: A tutorial (version 0.1)

R can be scary for those new to it, but it is exceptionally useful for a number of things, including managing, importing, and merging text files; resaving them; and performing statistical analyses to your heart’s content. It is your friend, albeit one that you must learn to love slowly and painfully.

This brief tutorial does not serve as an introduction to R. Instead, it focuses on reading in a large, complex data set with ~1 million rows and 50+ columns. It was created to help facilitate some analysis in a GBD course at Berkeley. It will help you figure out how to do some basic manipulation and subsetting and export these subsetted data into a comma-separated text file (“csv”) for analysis in your favorite spreadsheet program. It is a work in progress and will be updated over time.
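The tutorial itself is written in R and isn't reproduced here. As a rough, language-neutral illustration of the workflow described above (read a big delimited file, keep only the rows you care about, write them back out as a csv), here is a minimal Python sketch. The file names and the `country` and `sex` column names are hypothetical placeholders, not the actual GBD column names.

```python
import csv


def subset_csv(src_path, dst_path, keep):
    """Stream src_path row by row and write the rows where keep(row) is true.

    Reading one row at a time keeps memory use flat, which matters for a
    file with around a million rows. Returns the number of rows written.
    """
    written = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)  # the first row is treated as the header
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if keep(row):
                writer.writerow(row)
                written += 1
    return written


# Hypothetical usage (file and column names are assumptions):
# subset_csv("gbd_full.csv", "india_women.csv",
#            lambda r: r["country"] == "India" and r["sex"] == "Female")
```

The resulting file should open directly in a spreadsheet program, which is the end point the tutorial aims for.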

The Internet, a series of tubes.

Google’s data center doors flung open earlier this week. And, somehow, it looks remarkably like Ted Stevens’ often-teased quotation about the internet being nothing but “a series of tubes.”

click here to see many more photos from Google’s data centers

Obviously, that’s not the whole of it. The tubes, the languages, the infrastructure all come together, a weird amalgamation of technologies that gives rise to our internet, a sum that transcends the somewhat mundane parts. Andrew Blum, author of Tubes: A Journey to the Center of the Internet, said the following in an interview with Terry Gross:

The Internet is absolutely made of tubes. What else could it be made of? It’s many other things — these protocols and languages and machines and a whole set of fantastically complex layers and layers of computing power that feeds the Internet every day. But if you think of the world in physical terms, and you’re trying to be as reductive as possible and try to understand what this is, there’s no way around it — these are tubes. And from the very first moment, from the basement of a building in Milwaukee to Facebook’s high-tech, brand-new data center, and along the ceiling and the walls, are these steel conduits. But I know a tube when I see one.

A couple of days ago, Wired published a piece by Steven Levy about Google’s data centers. Levy was one of the first people other than essential Google staff to visit one of the centers, and his report is pretty astonishing. Google has built a lot of its own infrastructure in an attempt to meet two important standards: speed and energy efficiency.

All of these innovations helped Google achieve unprecedented energy savings. The standard measurement of data center efficiency is called power usage effectiveness, or PUE. A perfect number is 1.0, meaning all the power drawn by the facility is put to use. Experts considered 2.0—indicating half the power is wasted—to be a reasonable number for a data center. Google was getting an unprecedented 1.2.

For years Google didn’t share what it was up to. “Our core advantage really was a massive computer network, more massive than probably anyone else’s in the world,” says Jim Reese, who helped set up the company’s servers. “We realized that it might not be in our best interest to let our competitors know.”

Make no mistake, though: The green that motivates Google involves presidential portraiture. “Of course we love to save energy,” Hölzle says. “But take something like Gmail. We would lose a fair amount of money on Gmail if we did our data centers and servers the conventional way. Because of our efficiency, we can make the cost small enough that we can give it away for free.”
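To make the PUE numbers in Levy's piece concrete: PUE is the total power drawn by the facility divided by the power that actually reaches the computing equipment. The short Python sketch below (function names are mine, purely illustrative) turns a PUE value into the share of power spent on overhead such as cooling and power conversion.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_fraction(pue_value):
    """Share of total facility power that never reaches the IT equipment."""
    return 1.0 - 1.0 / pue_value


for value in (2.0, 1.2):
    print(f"PUE {value}: {overhead_fraction(value):.0%} of power goes to overhead")
```

At a PUE of 2.0, half of the power is overhead, matching the quote; at 1.2, roughly one sixth is.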

thanks to Charlotte K. for sharing Levy’s article + the photos

It's getting hot in here: Shifting Distribution of Northern Hemisphere Summer Temperature Anomalies, 1951-2011

From Goddard Space Flight Center:

This bell curve graph shows how the distribution of Northern Hemisphere summer temperature anomalies has shifted toward an increase in hot summers. The seasonal mean temperature for the entire base period of 1951-1980 is plotted at the top of the bell curve. Decreasing in frequency to the right are what are defined as “hot” anomalies (between 1 and 2 standard deviations from the norm), “very hot” anomalies (between 2 and 3 standard deviations) and “extremely hot” anomalies (greater than 3 standard deviations). The anomalies fall off to the left in mirror-image categories of “cold,” “very cold” and “extremely cold.” The range between the .43 and -.43 standard deviation marks represent “normal” temperatures.

As the graph moves forward in time, the bell curve shifts to the right, representing an increase in the frequency of the various hot anomalies. It also gets wider and shorter, representing a wider range of temperature extremes. As the graph moves beyond 1980, the temperatures are still compared to the seasonal mean of the 1951-1980 base period, so that as it reaches the 21st century, there is a far greater frequency of temperatures that once fell 3 standard deviations beyond the mean.
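The categories in Goddard's description boil down to standardizing each summer's temperature against the 1951-1980 base period and binning the result. Here is a minimal Python sketch of that idea, not Goddard's or Hansen's actual code. Two assumptions of mine: the population standard deviation is used for the base period, and the band between 0.43 and 1 standard deviations, which the quoted text does not name, is returned as "unlabeled".

```python
from statistics import mean, pstdev


def standardized_anomalies(temps_by_year, base_start=1951, base_end=1980):
    """Express each year's temperature in base-period standard deviations."""
    base = [t for year, t in temps_by_year.items() if base_start <= year <= base_end]
    mu, sigma = mean(base), pstdev(base)
    return {year: (t - mu) / sigma for year, t in temps_by_year.items()}


def classify(z):
    """Map a standardized anomaly onto the categories in the quoted text."""
    size = abs(z)
    if size <= 0.43:
        return "normal"
    if size < 1:
        return "unlabeled"  # this band is not named in the quoted description
    side = "hot" if z > 0 else "cold"
    if size > 3:
        return f"extremely {side}"
    if size >= 2:
        return f"very {side}"
    return side


for z in (0.2, 1.5, -2.4, 3.6):
    print(z, classify(z))
```

Because later years are still standardized against 1951-1980, a warming trend shows up as more and more years landing in the hot bins.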

There’s another telling animation showing extreme heat events across the Northern Hemisphere. This one, like the above, was created by Goddard using Hansen’s data. It is available here.

From Twitter to Graph: Plotting the US Embassy's Air Quality Monitor in Beijing

There’s been a lot of kerfuffle around the US Embassy monitoring air quality in Beijing and posting the results to Twitter at @BeijingAir. I personally like this kind of thing; it’s almost as though the government is acting as an environmental activist with infinite clout, stirring up problems by bringing known issues to light.

I thought, in passing, that it would be fun to pull the data stream from Twitter, parse it, and graph it. The embassy updates the data hourly; I figured I could make a call to Twitter’s API, without the need for any hacky AJAX refreshing. When people view the post, it’ll show the most recent two hundred tweets, representing 200 hours of data. Perhaps there’d be a need/interest to back up more to a database, but I was running out of steam; it turns out that this undertaking wasn’t as easy as one would have hoped.
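For the curious, the parsing step looks roughly like the Python sketch below. It is not the code that runs this page, and it skips the Twitter API call entirely. It assumes each hourly tweet is a semicolon-separated string of the form `MM-DD-YYYY HH:MM; PM2.5; concentration; AQI; category`; that layout, and the sample strings, are assumptions for illustration.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    timestamp: datetime
    pm25: float   # concentration in µg/m³
    aqi: int
    category: str


def parse_tweet(text: str) -> Optional[Reading]:
    """Parse one tweet in the assumed format; return None if it doesn't fit."""
    parts = [p.strip() for p in text.split(";")]
    if len(parts) < 5 or parts[1] != "PM2.5":
        return None  # e.g. a "no data" tweet or a different pollutant
    try:
        return Reading(
            timestamp=datetime.strptime(parts[0], "%m-%d-%Y %H:%M"),
            pm25=float(parts[2]),
            aqi=int(parts[3]),
            category=parts[4],
        )
    except ValueError:
        return None  # malformed date or number


# Made-up sample strings in the assumed format, not real tweets.
samples = [
    "03-18-2013 14:00; PM2.5; 75.0; 161; Unhealthy (at 24-hour exposure at this level)",
    "03-18-2013 15:00; PM2.5; no data",
]
readings = [r for r in map(parse_tweet, samples) if r is not None]
print(readings)
```

Returning `None` for anything unexpected means one odd tweet drops a single point from the graph instead of breaking the whole thing.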

So, without further ado, here’s approximately the latest week of PM2.5 data from Beijing. The lower line — in red — is the PM2.5 concentration; the upper line — in green — is the air quality index (AQI). The dotted, light-grey line is the US EPA 24h PM2.5 standard. Note that Beijing is rarely, if ever, below that designation. I’ll do my best to explain what each of those lines represents below. But now, the graph:

PM2.5 is defined by the US EPA as follows:

Particles less than 2.5 micrometers in diameter are called “fine” particles. These particles are so small they can be detected only with an electron microscope. Sources of fine particles include all types of combustion, including motor vehicles, power plants, residential wood burning, forest fires, agricultural burning, and some industrial processes.

Exposure to particles of this size has been implicated in a wide range of health effects. Like other chemical exposures, at a first approximation the intensity of the health effect depends on the duration of exposure, the concentration of particles in the environment, and an individual’s proximity to the source. There’s increasing evidence that any exposure above very low levels (the kind we rarely see anywhere on Earth these days) is bad for health and can exacerbate heart and lung disease, asthma, bronchitis, and the like.

The Air Quality Index (or AQI) is a summary measure that

tells you how clean or polluted your air is, and what associated health effects might be a concern for you. The AQI focuses on health effects you may experience within a few hours or days after breathing polluted air. EPA calculates the AQI for five major air pollutants regulated by the Clean Air Act: ground-level ozone, particle pollution (also known as particulate matter), carbon monoxide, sulfur dioxide, and nitrogen dioxide. For each of these pollutants, EPA has established national air quality standards to protect public health. Ground-level ozone and airborne particles are the two pollutants that pose the greatest threat to human health in this country.
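Under the hood, the AQI for a single pollutant is a piecewise-linear rescaling of its concentration: find the breakpoint band the concentration falls in, then interpolate between that band's index limits. Below is a minimal Python sketch for PM2.5. The breakpoint table is what I recall from EPA's 2012 revision; treat it as an assumption and check EPA's current guidance before relying on it, since the breakpoints have been revised over time.

```python
from decimal import Decimal, ROUND_DOWN

# (conc_low, conc_high, index_low, index_high), PM2.5 in µg/m³.
# Assumed values, recalled from EPA's 2012 revision; verify before use.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration):
    """Linear interpolation within the matching breakpoint band.

    Returns None for negative values or values beyond the top of the table.
    """
    if concentration < 0:
        return None
    # Truncate to one decimal place so no value falls between bands.
    c = float(Decimal(str(concentration)).quantize(Decimal("0.1"), rounding=ROUND_DOWN))
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return None


for conc in (12.0, 35.4, 75.0):
    print(conc, pm25_to_aqi(conc))
```

This is why the AQI line and the concentration line on a plot track each other without being simple multiples of one another: the slope changes from band to band.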

Finally, the US EPA standard is pretty straightforward. In the US, 24-hour average PM2.5 levels are not supposed to exceed 35 µg/m³. Of course, as we can expect, not every locale in the country can meet this standard.
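Since the embassy reports hourly values and the standard is a 24-hour average, comparing the two means averaging by calendar day first. A minimal Python sketch follows. The requirement of at least 18 hourly values per day is my assumption for illustration, the demo data is synthetic, and the simple "daily mean above 35" check is a simplification: the formal regulatory test is based on the 98th percentile of daily values averaged over three years.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from statistics import mean

EPA_24H_PM25_STANDARD = 35.0  # µg/m³


def daily_means(readings, min_hours=18):
    """Average (timestamp, pm25) pairs by calendar day.

    Days with fewer than min_hours readings are dropped (assumed threshold).
    """
    by_day = defaultdict(list)
    for timestamp, pm25 in readings:
        by_day[timestamp.date()].append(pm25)
    return {day: mean(values) for day, values in by_day.items() if len(values) >= min_hours}


def exceedance_days(readings, standard=EPA_24H_PM25_STANDARD, min_hours=18):
    """Sorted list of days whose mean concentration is above the standard."""
    means = daily_means(readings, min_hours)
    return sorted(day for day, avg in means.items() if avg > standard)


# Synthetic demo data: one day at a constant 40 µg/m³, then one day at 20 µg/m³.
start = datetime(2013, 1, 1)
demo = [(start + timedelta(hours=h), 40.0 if h < 24 else 20.0) for h in range(48)]
print(exceedance_days(demo))
```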

Back to China.

It’d be interesting to add some summary statistics and look at variation between weekdays and weekends — I’m working on that now. I’m also trying to find an accessible data source from China to plot along with the US data. Some comparison would be good, especially after China began posting its own data not too long ago.

The previous (and awesome) work that inspired this undertaking was done by China Air Daily. They’ve got some amazing visuals of the air pollution. One is attached below; I recommend checking out their site for more great stuff.

all rights reserved
snarglr is written & maintained by ajay pillarisetti
