Tuesday, October 26, 2010

On Sharing and Caring: Two Unusual Requests

I receive lots of requests every day: requests to give extensions on homework, requests to serve on committees of all sorts, requests to review papers or grants, requests to consider unsolicited grad school and postdoc applications... People ask for my time or my expertise, and often both.

But today I received two requests that were fairly unusual. The first one explores the issue of sharing in science, and the second the issue of caring for students and when our duty to care ends.

Request 1, or "On Sharing": I received an email from a student I don't know, who is in a foreign country and at a university I have never heard of. I am paraphrasing the email:
Dear Prof GMP, I want to work on Interesting Physical System. I want to simulate it using Cool Technique and I have seen that you wrote papers on simulation of the System using Cool Technique. I do not know how to write this code, so can you send me yours?

Huh? Send you my code?

I don't blame the student for asking, and I actually replied politely, thanking him/her for the interest and stating that regrettably I cannot share the code. My group develops detailed microscopic simulations of certain physical phenomena. These codes can have wonderful predictive power and take years to develop. Sharing codes is absolutely not the norm in my field, and there is no way in hell I would share any of my research codes with anyone other than close collaborators and colleagues. (There are plenty of simpler codes I use for teaching, and those are free to use.)

The Cool Technique for the simulation of Interesting System, which the student mentioned, is actually one my group pioneered. Developing it was one student's entire PhD, and plenty of our work on it is published, so someone could develop the code on their own if they so desired. I am also happy to share my student's thesis, but I am not giving out the source code. It's like an experimentalist having developed a unique technique or built a unique piece of equipment; you do want to keep the edge that it gives you and not let everyone use it. I suppose that once you develop and publish a new experimental technique, others are free to copy it, but it's not like you hand over your actual tool/apparatus.

But this got me thinking about how much we should really share in good faith. Is dissemination of information in journals, out there for everyone to repeat, all that is required of us? How much of a duty -- ethically -- do we have to the public (nationally or internationally) to enable others to benefit directly from our work? It is important to remember that federal funding does not preclude claiming intellectual property: some people copyright and sell their codes, experimentalists often patent their work and some start companies based on their federally funded, patented work.

Request 2, or "On Caring": A former student, who just started graduate school at Awesome University, emailed me asking for help with his fellowship application. I have known him for a couple of years, first through classes and then as he worked with me this past summer before starting grad school. I have a high opinion of his abilities and consider him a smart and focused person. At Awesome U, he now has their fellowship (common for 1st-year students); after that, he's expected to have research funding. I advised him to apply for several fellowships with federal agencies in order to ensure funding and give himself more freedom in choosing an advisor.

Now, he writes that he's approaching a deadline for the NSF fellowship and wants to apply but basically doesn't know what to write about. He said that he had asked some professors at Awesome U, but they are all busy and only have time for their graduate students. So he's asking me to give him ideas and help him write the fellowship application.

I was fairly amused by the fact that he apparently thinks I have oodles of time (unlike the apparently very important profs at Awesome U), and that I am endlessly selfless, willing to spend this time developing a project for him so that he can go and work for someone else at another school.

Still, I feel a perhaps misplaced duty to help him. I told him he was free to use the project he did over the summer and build on it using anything he learned in my group; I sent him several of my group's related papers; and I told him to draft a paragraph on each of a few possible topics, promising feedback once he has something written. We set a date for his first draft.

What I am pondering here is whether our duty to help our trainees ever ends. I think the bond between grad students and their PhD advisors is fairly deep and often means a lifetime of mutual support, and I think the same holds for postdocs and advisors. Probably not so much for short-term advising relationships, such as with undergrads. How many readers would just blow off the student once he's gone elsewhere to grad school, and how many would decide to help?

65 comments:

Anonymous said...

I would actually say sharing code is not that uncommon, and there should be more of it. Free, open-source software packages have been developed to do a lot of simulation, some of which would take a student the full 3 years of their PhD to write. I would like to see more of it, and I have no problem sharing my code with others, regardless of how long it took to develop. I think everyone starting from scratch to get to the same endpoint wastes a lot of time in which 'cool science' could be discovered. Sharing also means that bugs are more likely to be picked up, even if you think you have done rigorous and extensive testing. I guess the current 'publish or perish' nature of academic science inhibits this, though, making it attractive to stay ahead of others by not sharing your work. On the other hand, a simpler coding exercise is good for a student who is only learning to program. But overall, I'd vote for a) share code.

Re b) I think it really depends on the advisor/mentor and the relationship the student has with them, and of course on how reliable and 'fast' they are at providing feedback and advice.

I know people who are developing code that

Anonymous said...

I have to respectfully disagree with your stance regarding "sharing". I think those of us who work in computational sciences have a duty to share the code used to generate results that we publish (regardless of the funding source). In an empirical study, one can expect the underlying physical phenomenon to stay the same and be verifiable by another lab. However, with computation, so much is wrapped up in the implementation of an idea (rather than just the idea itself) that it is almost impossible to expect someone else to reproduce your work exactly. Replicability and validation are obviously cornerstones of scientific research, and without them I believe the computational sciences are incomplete. Furthermore, in contrast to your statements about gaining a professional advantage by keeping the code secret, in my subfield the evidence is overwhelming that there is much more of a professional advantage to sharing code. The citations, visibility, and follow-up research for your work increase exponentially. The advantages of being more widely known as a leader in your area (e.g., more success finding grants to significantly extend the work) far outweigh the extra paper or two you might get out while people are still trying to figure out what you've done or whether your first paper is even correct.

Take a look at: http://reproducibleresearch.net . Though the specific field is very different, I particularly like this commentary: http://www.stanford.edu/~arianm/Reproducible_research.pdf

Namnezia said...

Once it's published, aren't you required to share your code? If a lab develops a transgenic mouse to produce a hot new result, after the paper is published the lab is required to share those mouse lines. Most granting agencies and journals would require this. You could request a collaboration as a condition of sharing.

Comrade PhysioProf said...

You published papers containing simulation results generated using code that you are now refusing to share?????? Are you fucken kidding!?!?!?!?

This is despicable behavior, and it almost certainly violates all sorts of conditions you have agreed to with your funding agencies and the journals you have published in. And trying to browbeat someone into a "collaboration" in order to share published materials is also despicable. You publish shit, you are obligated to share it.

Dave said...

I'm surprised at your strong aversion to sharing the code, although I suppose this post is some evidence that you are having second thoughts.

"These codes... take years to develop." Yes, of course. This is a reason to share, not the opposite. You are deliberately slowing down the science by making others waste their time.

Anonymous said...

So the former student is basically asking you to give him a fundable research idea. Huh? Part of getting a PhD is learning how to formulate your own ideas. He shouldn't be asking you this question. For me, coming up with novel ideas to pursue is the hardest part of the research process and takes a lot of work. He needs to do this himself.

Anonymous said...

Whoops! Anon 9:41 here. I see that he is a new graduate student, not a new post-doc. Yes, in that case I would probably help, and I would approach it the same way you did.

Anonymous said...

Sharing the code is absolutely NOT the norm in my area of computational research (like you, I am between physics and engineering). Developing a code can take several graduate-student or postdoc-years, a substantial investment of time and money. The mouse analogy is interesting, but it illustrates that the norms in biology are very different from those in engineering/physics/physical sciences.

Some groups do choose to release their code as open source, for the use of the entire research community. But it is not mandatory, and they do it on their own timeline (for example, after publishing certain key papers). Some groups also sell their code, and I have bought one of these tools for use in my group.

GMP said...

My field is like that of Anon at 10:03 AM. Sharing the code is absolutely NOT the norm, so there is no reason to call my practices despicable; they are the established practices in my field. I pass no judgment on the practices in other fields.

I think the mouse analogy is not right here. It's more like developing a very unique and expensive piece of equipment. I do not see my experimental colleagues enabling anyone to use their one-of-a-kind, in-house equipment unless they are a collaborator or a collaborator's student/postdoc.

Regarding compliance with federal funding agencies: first, codes are intellectual property, and being federally funded does not preclude claiming intellectual property. The university and the developer have dibs on claiming copyright, and some people copyright and sell their codes, just as a lot of people start companies based on their federally funded and subsequently patented research.

Anonymous said...

Anon 8:05 here. GMP, I think the mouse analogy is closer than the one regarding new equipment. A new piece of equipment is only useful insofar as it gives access to a new physical phenomenon. Since this phenomenon doesn't change (presumably), it can be verified by other groups, potentially using an entirely different approach. Said another way, the engineering of new equipment enables the science, but what's important is that the science can be independently verified. In the mouse example, though, it seems that the phenomenon IS the mouse. The only way to really let another group verify the work is to give them a mouse from the line so they can inspect it. There is no other way to validate the claim "I have a mouse with properties XYZ".

As I mentioned above, with computational work, when you publish results you are publishing specifically the results of your current implementation (which may have errors, much the same way experimental work can have errors). Without releasing the code, there is no way that anyone can actually verify the work that you've done. How can we have a scientific process with any intellectual honesty when we cannot verify each other's work?

I won't go so far as to call it despicable because I know social norms differ and cultural inertia is powerful, but I do think there is an ethical obligation not being fulfilled when published results cannot be verified against the code that produced them. How can we have a legitimate scientific process like this?

GMP said...

Anon at 11:11 AM, codes enable new science. Codes are computational tools used to predict new physical phenomena, inspire new experiments, and explain existing experiments. New codes are carefully calibrated to reproduce existing experiments and compared against other well-established techniques. I assure you that before a new code gets used to predict any new physics, we all make painfully sure that it correctly reproduces the known physics -- that means several papers whose whole purpose is validation of the code against known codes and known experimental results. Nobody would publish predictions of new phenomena based on a new code until the code has been extensively, and I mean EXTENSIVELY, tested.

Please rest assured that the field has very stringent mechanisms for bullshit elimination; besides, experiments are the ultimate test of the computational results.

but I do think there is an ethical obligation not being fulfilled when published results cannot be verified against the code that produced them.

Anon, these are not black box codes. It takes 1-2 years for a new grad student to learn enough to just be able to understand and run an already existing code. I don't have the resources or the time to provide customer service and train someone unknown, remotely, for years just so they would be able to run my code. I do share codes with some groups that already have expertise similar to mine, as I trust they would train their students on their own (for the most part) to be able to use the code.
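To make concrete what I mean by validation against known results, here is a toy sketch (entirely hypothetical and far simpler than any real research code) of the kind of test a solver must pass before anyone trusts it: check it against a problem with a known analytic answer.

import math

def euler_decay(rate, t_end, steps):
    """Toy forward-Euler integrator for dn/dt = -rate * n, with n(0) = 1."""
    dt = t_end / steps
    n = 1.0
    for _ in range(steps):
        n -= rate * n * dt
    return n

# Known physics: the analytic solution is n(t) = exp(-rate * t).
exact = math.exp(-2.0)
approx = euler_decay(rate=2.0, t_end=1.0, steps=100_000)
assert abs(approx - exact) < 1e-4  # a regression test pins this tolerance

A real code is checked this way against many known limits, against prior codes, and against published experiments before its predictions are taken seriously.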

Venkat said...

Some want to share the code, saying it is good for science. Others want restricted release, so that they can (in a way rightly) benefit from all the work they put in.

The same dichotomy exists in the commercial world too. For example, some people say that drug companies are unethical when they profit billions from their drugs without releasing the drug formula (or what not) sooner. But the drug companies want to benefit from all that investment.

I had the same debate going on in my head. Then I heard a lecture from an expert in intellectual property. He said (what seemed obvious in retrospect) that without IP ownership, many people would not have the incentive to put in all that effort, and in that sense advances in science and technology would be adversely affected.

I think that the present rules are designed for people as they are (wanting security, etc.), not for what they ought to be (selfless, etc.). Rules will only go so far. Also, the rules are often vague enough that there is wiggle room for doing one thing or the other. If someone wants to change the status quo, I think it should be of one's own volition, not based on what the field does. That is why the only question for me is "what am I going to do?", and only I can answer that question for myself. There is a risk of adverse effects on one's career and such, but the (to some degree) selfless human should be able to take that in stride, right?

Venkat said...

tl;dr: Only you can decide what you are going to do. Whatever rules you impose will be cleverly circumvented by people.
Also, the practical issues GMP refers to are a whole other ball game. What if you send the code and then get 3 more emails asking for clarifications of what's in the code? Will not answering those emails be considered "effectively not sharing the code"?

Alex said...

I'm a theoretical physicist who works on the edge of biology. I think the ethos of the biomedical field is (on this issue) quite appropriate. (On other issues, I find them to be a weird bunch.) Experimental physicists may not share equipment, but biologists share mice and reagents, because they are smaller and easier to replicate. Although I'm a theoretician, my understanding is that when there's a cost associated with the production of a reagent or organism, they can charge for reasonable costs of materials. (Correct me if I'm wrong.)

But, much like your experimental physics colleagues, biologists don't usually share home-built machines. Machines, unlike fruit flies, do not (yet) reproduce.

Codes are easy to reproduce. They aren't like machines. They're much more like fruit flies.

Now, as I understand it, while biologists share reagents and organisms, they aren't obligated to do any of your experiment for you. So, they'll give you a mouse, but if their study involved putting the mouse on a special diet, subjecting it to some sort of exercise, or implanting something in it, they won't put the mouse through the regimen and surgery for you. You still have to do that. So, the core routines that do the key computations would be appropriate for sharing, but your gold-plated user interface or customized tools for digging through the results, those might appropriately be yours.

FWIW, when I read biophysics papers I see that lots of people are willing to share image-processing software. I don't do molecular modeling, though, so I don't know what the custom is in that field.

Finally, I recognize that as long as you are adhering to the customs of your field you are acting ethically. I think the customs should change in a more open direction. Also, with interdisciplinary research, be prepared for the possibility that you might encounter people with different but reasonable expectations. I gather that you do some sort of materials simulation. If at some point you find yourself working on materials that turn out to have biological relevance, don't be shocked if some biology types show up wanting some sharing.

Alex said...

BTW, while the exact same standards may not be appropriate for every field, if you look around condensed matter and materials physics it's clear that Schon's data forgery would have been exposed sooner with some sort of tradition of sharing samples. I'm not saying it's appropriate to every situation (some samples fare poorly when brought to room temperature, exposed to atmosphere, etc.), but some tradition along those lines would be helpful.

I also harbor a suspicion that a few of the "nanotoys" groups should really start sharing some samples. Not that their data is wrong, but I'd love to know whether the nanotoy with the cool images was a representative sample or an exceptional sample...

Psycgirl said...

I have mixed feelings about the sharing -- in my field (psychology) I am supposed to provide the data accompanying a publication if anyone requests it, but that doesn't mean I don't have weird possessive feelings about it. From what I understand, this rarely happens.

Re: Caring. What kind of ridiculous grad school has this student gone to, where students are expected to win external funding but no one helps them with the application? How successful can a newbie grad student be, writing a proposal alone? It's nice of you to help him (and he will be a colleague someday), but why is his program not filling that hole?

Anonymous said...

Anon from 8:05 and 11:11 here. I work in the computational sciences, so I understand full well what goes into this work. I also understand that nothing gets published until you have verified it extensively (and I'm not claiming someone is trying to push BS on the community). In the experimental sciences, we presume that there can be errors, which is why we insist on reproducibility. In the case you're describing, it is impossible for anyone else to actually verify that your work (meaning your *implementation* of a computational model) is correct. Mistakes can happen, and we have ample evidence that mistakes happen even in the code of the most well-meaning and competent individuals (and even with extensive validation). It could be as simple as a code that reproduces a physical phenomenon well, but where a typo means there is a difference between the stated model and the actual code.

Please don't be disingenuous about my comments. Nowhere did I suggest that you provide worldwide customer support for anything anyone wants to do with your code. I am suggesting that the only way to verify your work (and therefore have a legitimate scientific process) is to make your code available. This would involve putting your code online so that the data (presumably the figures in your papers) could be reproduced by running the code as-is. I understand it can take years for a new student to really understand the code well enough to build on it, but it's certainly possible, with minimal burden on you, to package your code so that any reasonably experienced person working in your area can simply RUN it and reproduce a figure. People are doing this all the time in scientific disciplines that are very complex (did you read anything I linked above?). If this is really not possible in your codebase while other disciplines do it regularly, is it possible that there is a problem with the way your codebase is structured?

The bottom line is, how can anyone ever verify your implementation if they cannot access it (similar to the mouse example above, where its properties cannot be verified without access)? You asked for opinions on code sharing, and obviously several of us (including myself) have the opinion that people working in the computational sciences have an obligation to share the code.

Anonymous said...

I work in a field similar to GMP's (materials physics), and I can say that sharing the code is certainly not the norm in the field, while sharing the raw data is. In that sense, I agree with GMP that code is more like a complex piece of equipment that you build yourself with 2-3 years of painstaking work in your lab. Perhaps in the biosciences the norms are different. It would be very interesting to hear what scientists who build new biomedical equipment think about this sharing issue.

A lot has been said about sharing the code being the only way to verify the work, and I do take issue with that. Presumably all the underlying theoretical principles and implementation details appear in the published work; why, then, can one not build one's own code based on the knowledge that is out there?

And it is exasperating to see that some people think that in scientific computing one simply runs the code and reproduces the figure. I wish life were as simple as that!

Girlpostdoc said...

Well, I wanted to comment, but it got too long and turned into a post. You can find it here.

GMP said...

My apologies for some of the comments posting late. Apparently Blogger has a new spam filter that caught them (come to think of it, FCS did mention it the other day)...

Also, thanks to the many Anons for the comments, and to Venkat, Psycgirl, Dave, Namnezia, Girlpostdoc, and yes, even you, CPP! :)

@Anon at 10:03 AM and Anon at 5:53 PM, thanks for your supportive comments. I completely agree. I am in a field at the physics/electronics/materials science interface. All of our in-house codes were built by me or my group members. If we like someone else's work, we write our own code based on published information and then develop it further. Others in my field do the same. Among trusted colleagues, people will sometimes offer to share codes or pieces thereof, but I could never imagine asking anyone to give me their code.

@Alex, I like the idea of my codes being fruit flies. Although I like to think of them more as beautiful, colorful butterflies! ;) But, seriously, I still really think codes are more like equipment (which you use as a tool to perform a study) than like a fruit fly or a mouse (the object of the study).

@Anon at 8:05, 11:11, and 4:19 (sorry about your last comment posting late), I didn't mean to be disingenuous. And I certainly appreciate your thoughts on sharing codes. I have no problem having the data checked by peers in my community and we do share codes among groups. So I believe we agree here.

Also, there is no shared codebase in my field. Zero. For instance, the research code referred to in the post was developed by my student literally from scratch. All the research codes are legacy codes, developed and closely guarded by research groups. The community is relatively small, which is probably part of the problem. We nominally have a couple of repositories of simulation tools, but no one will post their cutting-edge research codes there (politics, how the repositories are run, etc.) The only things that are freely available are simple teaching codes.

I am also a bit disappointed that only a few people commented on the "caring" part of the post, about advising students once they have left the nest.

Namnezia said...

One more thing... the tools are also part of the published work, not just the data. Transgenic mouse lines are also tools used to generate data, and some can take several years and lots of money to develop, yet they are made available to the scientific community once results are published. Same with molecular tools: the folks who developed channelrhodopsin to activate neurons with light spent years tweaking its molecular structure and now make it available, and the same goes for calcium-sensitive fluorescent molecules. And also for very novel pieces of equipment: when two-photon microscopy was introduced, the lab that developed the equipment provided the know-how and instructions for those who wanted to implement it in their labs. I know that the NSF and NIH certainly require that you share data/tools once they are published. So just because people don't do it in your field doesn't mean it's ethical not to share, in my opinion.

Now, as far as the caring part -- I would absolutely help your former trainee. It sounds like he's in a grad program full of idiots if no one wants to help him develop something that would help not only him but the grad program in general.

GMP said...

Namnezia: the lab that developed the equipment provided the know-how and instructions for those who wanted to implement it in their labs.

This. We provide this too: I am happy to send my students' papers, notes, and dissertations to whoever is developing a code. We share expertise with whoever wants it. I answer questions that arise as they develop it. But I do not just hand over the code. For instance, if I send you my non-copyrighted code, nothing prevents you from legally copyrighting it yourself and selling it. Or just using it without so much as citing the original work in any of your subsequent papers. Or worse yet -- tainting it by tinkering with it and then selling it as though it's my original code. All of these have happened to people I know. That's why I will send the code to you only if I trust you.

I know that the NSF and NIH certainly require that you share data/tools once they are published.

I can't speak for NIH, but I know the NSF does not require you to share computational tools unless you have said in your proposal that you would do so. I am very careful in my NSF proposals about specifying what I will share with the community (in my Broader Impacts) and I do that. I have no problem sharing any data with whoever wants it. (See what Anon at 5:53 PM says.)

Federal agencies enable the funded parties to retain ownership of intellectual property. People start companies and make tons of money on work that was patented but had originally been funded by federal agencies. I can license and sell the code if I want to, and be totally in compliance with the NSF.

Let me conclude here with a general remark (this is not directed at Namnezia per se): several posters from physics/engineering/materials science confirmed that sharing practices there are similar to what I describe. There is nothing unethical about these practices, and nothing non-compliant with federal agencies. I think we will have to agree to disagree: the established practices in the biomedical sciences and in at least some physical sciences/engineering are simply different.

Please refrain from referring to the codes of conduct in other fields as unethical. You probably don't fully understand why established practices are what they are; god knows I don't understand a whole bunch of practices in biomed, but that is because I don't know enough about those fields, not because there's something wrong with them. Let's just assume that there are reasons why practices are what they are and not immediately label our fellow scientists in other fields as unethical bastards, shall we?

Alex said...

On caring: Take pity on your former student. You don't need to hold his hand, but a critique of a draft wouldn't be so bad. I know that in my first year of grad school, when I was still learning the ropes and getting to know the people, I sometimes emailed former professors.

There's nothing more flattering than a former student who still thinks you have something useful to say. One of my current undergraduate research students seems to think that he knows more than I do, or at least that if he goes and gets advice from other professors, he can ignore my advice on a project based on ideas that I developed. I'm willing to arrange a trade: my soon-to-be alum for yours.

Anonymous said...

Same anon as above. No problem regarding the late post from blogger.

Your post above seems to indicate you have no problem having your code checked by people you decide to share it with. But if we view this as a form of peer review (which is really what I'm trying to argue for), why should that constitute impartial validation? We all know that peer review of our writing would be useless if we got to decide who all of the reviewers were and only close collaborators were selected. How is this different?

Also, I wasn't talking about a shared codebase across your whole community -- just your lab. This was in reference to your comment above about it taking years for a new student just to run a piece of code. I work in the computational sciences, and I simply don't believe that the codebase in your lab couldn't be arranged so that a collection of scripts necessary to reproduce a figure with minimal intervention could be released. (I'm not sure why Anon 5:53 is so exasperated by the suggestion that figures produced from a computational model could be regenerated by another piece of code; this is commonplace in many areas of science/engineering, and I can't believe that many figures are drawn by hand these days.) Good code structure, of the kind I would expect from any professional in the computational sciences, should permit this with relatively little overhead. If this doesn't seem true, then we can have a separate conversation about how to do this efficiently.

You keep saying that code is more like equipment than a mouse line, but I haven't heard you articulate any reasons why. Can you please elaborate? As another poster noted, it's not because the same physical difficulty/cost of sharing a large piece of equipment is incurred with sharing code. If you have a website (or use a repository such as the Reproducible Research one I linked above), it can be distributed to as many people as want it with essentially no additional cost above your effort to prepare it.

Sorry... I know cultural norms differ between fields, but that doesn't mean I have to think those norms are all good for the larger scientific community. I continue to maintain that there is no way to get an *independent* validation of your results without releasing the code (despite your obligations to your funding agency), and this is necessary for good science no matter what your field is. I haven't heard you say anything that actually addresses this point. Review by your collaborators doesn't count as independent. Examination of the output doesn't count as validation of your result, because your result *is your implementation*. Despite the comments above about how your papers include enough detail to reproduce the result, the whole point I'm trying to make is that this can never be true. Your papers might describe the model, but there are so many variations and choices that you make in your particular implementation (each with the possibility of errors being introduced) that there is no way to give all of that detail without the code.

Also, be flattered about the former student. Despite his current situation and the apparent deficiencies of his new department, he obviously thinks highly of you. I bet he actually is aware of your tight schedule and feels pretty bad for asking for help, but is also in a situation where he feels pretty desperate.

Dr. Shellie said...

Anon@10:03, here. Switching over to my (defunct) blog name.

I wrote my NSF graduate fellowship application on a project idea that came from my undergraduate research, because I had not yet started grad school. I didn't end up doing that specific project, because my thesis advisor was in a related but different area. The NSF graduate fellowship did not actually hold me to the specific project I proposed; the essay describing a possible project seemed to be more of a test that the applicant could, in fact, formulate and describe a reasonable line of research.

In your former student's case, he has already started grad school, so he will have to develop a project with someone at his new university sooner or later in order to find his actual PhD topic. I would also think that if he describes a project that no one at his university works on, it is not going to be very plausible.

I think that helping him out by reading a draft seems reasonable, but encourage him to think about things that will actually move him closer to a thesis advisor and project at the new university. Writing a project description related to your research may be easier for him, but it is kind of a waste of time for all involved unless he can find some links between your areas/interests and potential advisors' areas/interests.

Anonymous said...

In my field, it wouldn't be odd to ask someone to share code (especially for anything already published). Many people may simply not respond to the email asking for code or other materials, though...

I'd say no to the student. I'd offer to write recommendation letters, suggest they build on the undergrad project they did with me, and maybe give quick feedback (e.g., if it takes less than 10 minutes) on a draft. Then it's good luck and leave it at that. But then again, I am tenure track, so I say no to everything right now so I can sit and be writing-blocked in front of my grant applications.

GMP said...

Alex, Dr Shellie, and Anon 11:30 AM, thanks for the comments.

Anon at 7:42 AM (this is comment 1 of 3 in response to your post above):

Good code structure, of the kind I would expect from any professional in the computational sciences, should permit this with relatively little overhead. If this doesn't seem true, then we can have a separate conversation about how to do this efficiently.

I am going to pretend that you are not implying what you seem to be implying here: that I don't know how to do my job and that you can somehow help me do it better, even though you know next to zilch about what I do. I will refrain from speculating where this sense of superiority comes from.

I simply don't believe that the codebase in your lab couldn't be arranged so that a collection of scripts necessary to reproduce a figure with minimal intervention could be released

I continue to maintain that there is no way to get an *independent* validation of your results without releasing the code (despite your obligations to your funding agency), and this is necessary for good science no matter what your field is. … Your papers might describe the model, but there are so many variations and choices that you make in your particular implementation (each with the possibility of errors being introduced) that there is no way to give all of that detail without the code.

Let me put it this way: if you have a PhD in my field and have expertise similar to mine, i.e. you have written codes similar to mine in the past, then it will probably not take you more than a few days to reproduce any figure of your choosing directly from my source files, with minimal documentation from me.

What I don't understand is how reproducing a figure is the same as validating that the code doesn't have bugs or that the implementation is sound. If all you want is to generate the data for a figure, I can provide an executable with an input file; there's no need to look at the source code. But if you want to absolutely validate my implementation and check for bugs, you have to go through the code line by line and presumably run tests that I never thought of, and that would certainly take weeks or months even for an expert.

in reference to your comment above about it taking years for a new student just to run a piece of code.

What takes a newbie student 1-2 years is mostly learning enough physics to understand what the code really does, in order to be able to develop it further. This lead time has much more to do with grasping the underlying physical concepts than with the actual code. Once they have a reasonable grasp of the physics, then we can talk about what the code does, what the limitations of the technique are, and coding minutiae. A newbie student can be trained to change a couple of parameters and compile the code fairly quickly, but there isn't much use in that. I would say it takes 1-2 years to competently start developing the code.

(To be continued...)

GMP said...

Comment 2 of 3, in response to Anon 7:42 AM

You keep saying that code is more like equipment than a mouse line, but I haven't heard you articulate any reasons why. Can you please elaborate? As another poster noted, it's not because the same physical difficulty/cost of sharing a large piece of equipment is incurred with sharing code. If you have a website (or use a repository such as the Reproducible Research one I linked above), it can be distributed to as many people as want it with essentially no additional cost above your effort to prepare it.

About codes being tools: let's say you are holding a piece of a semiconductor and you wonder what its resistance is. An experimentalist would perform a van der Pauw 4-point-probe measurement and get the resistance. I may write a transport code, with the details of the material's band structure, the important scattering mechanisms, and the given geometry and position of contacts, and compute the resistance by solving the Boltzmann equation. To me, the measurement and the theory/coding are two different ways of asking the question "what is the resistance?" and obtaining an answer. In this particular case, it takes a newbie experimental student no more than 2-3 months to learn how to prepare the sample and perform the measurement, whereas writing the code above from scratch takes much, much longer (of order years), because it involves learning the microscopic physics of carrier transport in significant detail. On the other hand, if I already have a code in place, then a student can be trained in a week's time to run it and get the resistance for differently shaped samples and different materials. To me, once this simulation tool has been built, it gives answers to questions just like experiments do. That's why I consider codes tools.
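To illustrate the "code as a tool that answers a question" idea with a deliberately trivial, made-up example (a Drude-level shortcut with hypothetical numbers, nothing like a real Boltzmann-equation solver):

# Toy "resistance tool": given a material and a geometry, the code
# answers the question "what is the resistance?"
Q = 1.602e-19  # elementary charge, in coulombs

def drude_resistance(n, mu, length, area):
    """Drude-level estimate: resistivity rho = 1/(q*n*mu), then R = rho*L/A."""
    rho = 1.0 / (Q * n * mu)    # resistivity, ohm*m
    return rho * length / area  # resistance, ohm

# Doped-silicon-like numbers: n = 1e23 m^-3, mu = 0.14 m^2/(V*s),
# a 1 mm long bar with a 0.1 mm^2 cross section.
print(drude_resistance(n=1e23, mu=0.14, length=1e-3, area=1e-7))

Once the real, vastly more detailed version of such a tool exists, running it for a new sample shape or material is quick; building it is what takes the years.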

I hope that the cost incurred and the ability to reproduce/share are not the only criteria for what constitutes a valuable tool. If it's so heavy that I cannot move it, or so expensive that only the chosen few can purchase it, then it's equipment/a tool. But if it is not expensive, or I can replicate or lift it, then it is not a tool but some sort of lesser tool-light, and I am automatically obligated to share it just because it's easy to share? Codes are intellectual property, and by implying otherwise you are trivializing the work that goes into them. I am sorry, but I see no reason to release codes into the wild without some protection of intellectual property.

GMP said...

Comment 3 of 3, in response to Anon 7:42 AM

Your post above seems to indicate you have no problem having your code checked by people you decide to share it with. But if we view this as a form of peer review (which is really what I'm trying to argue for), why should that constitute impartial validation? We all know that peer review of our writing would be useless if we got to decide who all of the reviewers were and only close collaborators were selected. How is this different?

Actually, I don't mind the code being checked by anyone who (a) is an expert in the field and (b) is not going to screw me over by misappropriating my code. I do not know how (b) is ever ensured when you just post codes online, but let's leave that for now. Regarding (a), I honestly do not want to have to provide training to random people in addition to providing the code (what I referred to as customer service). My experiences have been really off-putting in this regard. I get a lot of email inquiries about some smaller codes I posted online for teaching purposes, which have reasonable user interfaces and documentation; because anyone can access them, people have questions, and then there are tons of emails... Regarding research codes, posting a code online for anyone to use without a good enough interface and supporting documentation is really problematic, and creating those at good quality is a time investment of little interest to the person developing the code. Posting stuff also means being ready to answer seemingly endless inquiries, or having someone paid to do that... Ideally, I could get a person to just write GUIs, write up detailed documentation, and answer inquiries, and then we could post things online, but getting funding for such a person would not be easy.

Btw, what is your field, if you don’t mind me asking?

prodigal academic said...

Field norms are field norms. Biomedical scientists want to treat GMP's code like a mouse line. Perhaps a more appropriate comparison is a material (given that GMP is at the border of physics/electronics/materials science). My understanding is that if someone WANTS to send a material they synthesized, it is their choice to do so. Otherwise, it is completely normal to send the synthesis procedure and answer questions about it from interested parties. No one would consider this unethical behavior.

I think it is a totally valid and realistic concern that someone might modify GMP's code and then send it on without comment, try to patent/sell it, or use it without attribution. If GMP provides all the information necessary to reconstitute the code, others can do so if desired, just as with a synthesis, and I think that is where her obligation ends. I do not think this is unethical, nor do I think it means her results can't be validated -- the validation is in comparison to observed physics, not in repeating the calculations!

As for the student, I agree with Dr. Shellie. The student needs to develop a mentor-mentee relationship with people at Awesome U. Helping her/him in this way could be beneficial in the short term but detrimental to their development in the long term. I have seen this at National Lab with postdocs who remain attached to their grad advisors and never develop a mentor-mentee relationship with their postdoc advisors because of that crutch (this is with postdoc advisors who have fine relationships with all their other trainees, so the pathology is not on the advisors' end).

Becca said...

First, it seems there is significant confusion about why people are making the mouse comparison.
Transgenic mice are FREQUENTLY the TOOLS used to study something. Half the reason people make knockout mice, at least in my field (immunology), is to develop a model for a disease.
That kind of application DOES constitute intellectual property. Mice are not (merely) a biological self-renewing resource.
@prodigal academic -- a material that requires significant person-hours to produce in enough quantity to send to someone should, ideally, be commercialized if it's useful enough to the field. Neither code nor mice take MASSIVE additional investment to distribute. That is why the analogy is being made (if anything, the mouse takes a lot more, and biologists DO regularly charge for the cost of the mouse based on housing costs from their institution).

What do folks think goes into making a transgenic mouse? Putting two existing mice in a cage, dimming the lights, putting on some Barry White, and BOOM, data pops out 6 weeks later??

It's more like:
Step one: take several weeks to plan and clone the genetic construct to go into the mouse. This is *comparatively* easy-peasy (unless of course you irk the cloning gods), and biologists will nearly always send you a bit of the product (plasmid), for free.
Step two: do the embryo injection. Often these days this is shipped out to a core facility or company. Much like electrophysiology, or brain surgery, the physical technique is very challenging and is simply not possible for everyone. If you have to develop the technical expertise to use the method from scratch, you're looking at months, years, or never, to become efficient, depending on the scientist.
Step three: wait a month or two, tail snip test the offspring
Step four: wait another month, Barry White time
Step five: perform necessary backcrossing and begin characterizing the mouse. This could involve repeating steps three and four 10 times (ten backcrosses is the gold standard).
Two years is wonderfully fast for the entire process, BEFORE we talk about generating any real data. If you're talking about a 'newbie student'- well, there's a reason people tell grad students that want to graduate in <8 years not to do a knockout mouse project.

Creating transgenic mice is routine. That does NOT mean it's fast or easy.

GMP, you seem to want to imply that a student who takes 1-2 years to write a code is investing OMG SOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO MUCH TIME AND EFFORT!1111
in comparison to those 'easy' things that biologists can share because anyone can do biology and it takes no time at all.


You also list these objections to sharing:
"For instance, if I send you my non-copyrighted code, nothing prevents you from legally copyrighting it yourself and selling it. "
So use a creative commons thingy.
"Or just using it without so much as citing the original work in any of your subsequent papers. "
Shockingly, this simply does not (to my knowledge) happen in biology, if you give someone a mouse or the like. If it's a COMMON problem in your field (not a freak horror story), your field sucks. A lot.
"Or worse yet -- tainting it by tinkering with it and then selling it as though it's my original code. "
Again. Perfectly addressable with the appropriate IP sticker.

In a technical sense, what YOU want done with the code might actually be irrelevant if you work at a university: it owns the IP, not you personally. You can always call up the IP office and ask, "If I *wanted* to share this without getting burned by people not citing it, or meddling with it and distributing the meddled-with version, how could I do that?"

I don't think you are unethical *within the context of your field*. I also don't think a soldier killing some dude in war is unethical *in context*. It's still nasty behavior.

S&O said...

GMP -- I get your analogy comparing your codes to a novel experimental technique/apparatus (I'm also in the materials science/physics/chemistry camp). There are fancy, cool experimental tools out there that are not "off the shelf" (like an SEM -- can we get this thing off Amazon now??), that perhaps only a few groups in the whole world have the capability to design, build, and operate properly. Of course, other groups can go try to build one, because I'm sure there's a paper about that cool tool in Review of Scientific Instruments. So I like to think of your work as that shiny, high-efficiency photon spectrometer trying to detect those transitions with very low cross sections. By the way, I love your comment at 12:21pm!

GMP said...

Prodigal, thanks for your thoughtful comment!

Spins & Orbitals, it's good to hear from you again! Thanks for the support and glad you enjoyed the comment above. :)

Becca, thank you for your, as usual, spirited post. ;)

I now know more about those infernal transgenic mice than I ever wanted to... Nobody here said work in the biomedical sciences is easy or quick.

My students take ~2 years to understand the code well enough to start developing it further. Codes take many more years to develop from scratch and often span multiple student generations. The standard duration of a PhD in my field is about 5 years.

About IP: actually, the university has dibs on IP, and if they pass (which they often do for codes) then I can claim intellectual property myself.

I don't think you are unethical *within the context of your field*. I also don't think a soldier killing some dude in war is unethical *in context*. It's still nasty behavior.

So my not sharing code is equivalent to a soldier killing someone?! WTF?! Talk about a metaphor gone too far...

Context is everything. Always.

grumpy said...

It is my (not so well-informed) understanding that there is a growing movement in condensed matter computation to publicly release lengthy, time-consuming codes. I'm sure those who do it spend a fair amount of time dealing with IP issues and working out things like how many papers they want to get out beforehand and how much support they are willing to offer... anyone care to set me straight?

Personally, as an experimental physicist who wouldn't dare spend more than a week or two writing code for a simulation, I think more sharing of these types of codes is good for science. It would be nice if people could iron out the IP/other issues to make this happen.

I can also totally relate to the desire to hold on to trade secrets/equipment whenever possible, so obviously this only works if lots of people are willing to share, so that everybody benefits.

Opencode said...

Regarding customer service:

Again, I am not contending that you offer endless customer support to any novice who can click a mouse. I am saying that the code should be "available" so that it can be evaluated by the experts in your field (not just the ones you already collaborate with). By available, I mean accessible online and in a state where it can reproduce results such as your figures (did you look at the paper regarding reproducible research I linked yesterday morning?). Of course this simple reproduction does not count as "evaluation", and an expert in your field will want to do much more with it, but it's what I consider the minimum standard for being really "available". If people write to you with all sorts of requests for help using it in other ways, and you have provided adequate documentation to reproduce what you have published with it, I think you are completely justified in saying you cannot provide more support than that. Really, look at examples of code people release as part of the reproducible research initiative. With a few exceptions, these are not the fancy interfaces you seem to think are necessary -- just scripts that run a set of code with the right collection of parameters to reproduce a figure, plus the raw source code so that the implementation can be examined by experts. For concreteness, I sketch the flavor of such a driver script below.
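Here is the kind of driver script I have in mind, with entirely hypothetical file and program names (a sketch of the idea, not anyone's actual release):

# reproduce_fig3.py -- one command to regenerate Fig. 3 of a (hypothetical)
# paper: rerun the simulation with the frozen parameters, then replot.
import json
import subprocess

# The exact parameters used for the published figure live in a small
# version-controlled file next to the source code.
with open("fig3_params.json") as f:
    params = json.load(f)
print("Reproducing Fig. 3 with parameters:", params)

# Run the simulation executable with those parameters...
subprocess.run(["./simulate", "--config", "fig3_params.json"], check=True)

# ...and regenerate the plot from its output.
subprocess.run(["python", "plot_fig3.py", "output/fig3_data.csv"], check=True)

Nothing fancy: no GUI, and no documentation beyond "run this and Fig. 3 comes out".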

Opencode said...

Having trouble posting this...sorry for any duplicates.

This is the repeated anon poster from above. I'll start using the pseudonym Opencode so you don't have to reference everything by timestamp. I am in an ECE/EECS-type department at a large R1. I don't work in materials and device physics, but our areas are closely related. We are acquainted, and any more information would probably let you identify me, which is why I am posting anonymously here. I'll try to group my responses into topics to make them easier to parse. Regarding making code available:

I didn't mean any offense by my comment regarding code structure, and I certainly wasn't trying to imply that I know anything about your particular research area. However, I realize that there is a broad spectrum of backgrounds in engineering/physical sciences (especially with regard to software development). This comment was in response to your comment at 11:41 that "I don't have the resources or the time to provide customer service and train someone unknown, remotely, for years just so they would be able to run my code". My assertion is that good code structure and internal documentation should make it a minimal burden to publish online code that runs and reproduces the data you have published. You also said that "creating [a good enough interface and supporting documentation] at good quality is a time investment of little interest to the person developing the code." I would contend that the interface doesn't have to be fancy and graphical -- just "type this to run, and the figures/data from paper ABC will emerge in XYZ format". I would also contend that with a clear publication and well-documented code, the supporting documentation can be just a few pages put together in a day or two. If your code is not documented already (and it's a codebase that spans several generations of students), how do new students learn how to use and modify it? Are they really interpreting undocumented code from scratch?

I have some experience writing software and releasing it to the scientific community, but I wouldn't claim the same level of expertise as a software engineer. When I said "we" can have a separate conversation about code structure to make this easier, I was trying to say that if this is something you decided you wanted to do, and the actual overhead of preparing the code was a barrier, then this community could brainstorm ways to make it easier. It is certainly possible with reasonable effort (my lab and many others do it), and there is expertise available from others in the community that we can all learn from to reduce this barrier. I can certainly contribute my experience, but there is a lot I have to learn about this as well.

Opencode said...

Regarding the need for validation:

Let's try an example. Several years ago we were working on a piece of code and were calling a built-in FFT routine (in Matlab, in this case). The FFT is a well-defined mathematical function with many different ways of actually calculating it. The exact algorithm used depends on a number of things (including the length of the data vector you pass in). It turns out that in some cases Matlab chose the actual FFT algorithm based on a number of undocumented factors, and we could alter that choice (and consequently the final answer) simply by changing the order of a few lines in our code. This behavior was verified at the time by their tech support (it's been 10 years since, so I have no idea if it still happens). While the difference was not huge in sheer magnitude (it was out several decimal places), it was noticeable and actually changed the specific quality of the answer we were looking for (we were studying a very small phenomenon). Here is the main point of my story: changing something about the implementation that is irrelevant with respect to the physical model (i.e., the exact ordering of a few independent lines of code) produced different results. There is no way I could have specified this level of detail about my implementation without releasing the code, but in this case the exact implementation (and not just the model equations) mattered to anyone who wanted to verify my claim that this model produced this phenomenon. You said earlier that the real validation of your code is that it reproduces a physical phenomenon, but what I am contending is that differences (and even mistakes) can arise in your implementation of a model, and that needs to be evaluated just as much as the output.
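You don't even need an FFT to see the effect. Here is a self-contained toy version of the same lesson (a generic illustration, not the Matlab behavior itself): reordering mathematically equivalent operations changes the floating-point result in the trailing decimal places.

import random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(1_000_000)]

s_forward = sum(xs)            # one summation order
s_reverse = sum(reversed(xs))  # the same numbers, summed in reverse
print(s_forward - s_reverse)   # typically nonzero (roundoff-level), even
                               # though the "model" -- a plain sum -- is
                               # identical in both cases

If your published phenomenon lives several decimal places down, this kind of implementation detail matters, and it is invisible in the paper.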

Regarding code as tools: I don't disagree with what you said about the value of code as a tool, but it doesn't address my main point. Using your resistance example: since the underlying physical phenomenon is the same, any other lab should be able to make the same physical measurement of the resistance to validate your answer (using any measurement method they want). But no one can exactly reproduce your implementation of the model, which may contain errors. If you are doing science with this model, then you have produced something that can never be exactly reproduced by another lab (they will never guess the exact implementation you've used). Reproducibility is key to our scientific enterprise.

Opencode said...

Regarding code release:

You mention several times that you are concerned about people using or modifying your IP in ways you don't want ("screw me over by misappropriating my code"). As another poster noted, you can take care of this legally by claiming your IP rights. For example, place your code under a Creative Commons license (creativecommons.org) that specifies what you allow people to do with it. While people may of course violate that license illegally, the same risk is present that someone will copy your writing and violate your (or more likely your publisher's) copyright every time you publish a paper. This is a risk we live with to disseminate research findings.

You've mentioned several things, but I'm having a hard time actually pinning down your major objection to releasing your code. You've mentioned:
1) The burden of preparing it for release. I hope I've addressed this, and there are certainly ample examples of labs that do it efficiently so that I hope it's clear this can be done.
2) The burden of answering inquiries. As I've said, with enough documentation to reproduce your results, I don't find it necessary to provide any more support unless there is a serious follow-up claim from an expert trying to validate something specific.
3) The threat of misappropriation. As I noted, this can be taken care of legally. While there is some risk, it's the same type of risk we accept when publishing anything.

Way back in your original post you said "you do want to keep the edge that it gives you and not let everyone use it". Is this really it? Then is this a consideration of what your obligation is to the scientific community, or is it a consideration of your own professional advancement? As I said initially, in my subfield the opposite trend has clearly been established. People releasing their code gain much more advantage than they would by keeping the code private. Citations go up and recognition as a pioneer in the field goes up (both of which lead to more opportunities for funding advancement of the work). I suspect that your mind is not really open on this (since you said in your original post "there is no way in hell I would share any of my research codes with anyone other than close collaborators and colleagues"), but if you are really thinking about this openly then I hope you will help me pinpoint what the real substantial objections are.

structurefunction said...

Just wanted to clarify some points related to copyright (in the US, and most of the world thanks to various treaties):

But I do not just hand over the code. For instance, if I send you my non-copyrighted code, nothing prevents you from legally copyrighting it yourself and selling it. Or just using it without so much as citing the original work in any of your subsequent papers. Or worse yet -- tainting it by tinkering with it and then selling it as though it's my original code.

Copyright attaches automatically when you create something. You don't even need to say that it's copyrighted (thanks to the Berne Convention); all rights are reserved by default. It doesn't require an act of "copyrighting" -- registering a copyright just means that you can more easily sue infringers for damages and attorney's fees. If what you describe actually happened, the authors would have the right to sue for copyright infringement.

On the openness note, you might be interested in the Artistic License (which requires maintaining the original copyright notice [aka authorship credit] and documenting exactly how modified versions differ from the original) or the Academic Free License (which requires attribution of the author and a notice of modification).
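As a concrete sketch (the names, date, and routine are invented, not from any real project), keeping the required notices can be as simple as a comment block at the top of each source file:

# Copyright (c) 2010 A. Researcher, Example University.
# Licensed under the Academic Free License v3.0
# (http://opensource.org/licenses/AFL-3.0).
# Modified 2010-11-02 by B. Student: rewrote the mesh-refinement routine.

The license text does the legal work; the header just carries the attribution and the notice of modification that these licenses ask for.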

You could be the pioneer! The way I see it, the code isn't the interesting part; the problems you apply it to are. I find it hard to imagine that there is such a limited set of problems that everyone works on the same ones. If problems are indeed plentiful, the "competitive advantage" wouldn't amount to much in most cases.

Anonymous said...

xmds is a nice example of open-source code in physics that has been shared (and is extremely user friendly).

http://xmds.sourceforge.net/index.html

You don't necessarily have to go down this route and make your code as user friendly; I just offer it as an example of something that works in another field of physics. If you look, you can find other small packages around -- code that can take a physics PhD student the full period of their PhD to develop and apply. It does happen, and yes, it happens even in fields of physics, not just in biology. I think it is a positive thing.

GMP said...

Opencode and Structurefunction,

Thanks for your comments. There's a lot of stuff here to address, so I will try to do it in a few parts (let's aim for 3).

Let me start with the one thing that both of you address, which is the competitive advantage that not sharing the code gives someone.

Well, roughly, I would say my group has two types of codes:

(a) Codes based on established numerical techniques, with generally well understood limits of validity. Typically, several groups would have similar codes. In this case, the novelty of a work isn't in the tool but in the novel physical systems it addresses. For instance, such codes form the basis of much of my work with experimentalists, where we set out to address new experimental findings on systems that fall within the range of applicability of existing computational tools. (I also write small codes for my experimental colleagues and their students to play with.)
I agree that these codes per se do not give me a competitive advantage, and I would not mind sharing them at all, as the codes themselves are not novel. I suppose in this case the barrier to sharing would be my laziness.

(b) Codes that feature a significant advancement in the state of the art, such as opening completely new vistas in the physics that can be captured and/or featuring algorithmic breakthroughs. Now, these are codes that indeed do give me a competitive advantage, as no one else has them. On the other hand, as Opencode rightly states, these are the codes whose outside validation by independent parties would be the most beneficial, if not critical. However, how does one balance the need to retain a competitive edge with the need for validation? Perhaps I am not a very trusting person, but the prospect of putting my shiny new one-of-a-kind code on the web gives me the creeps. For instance, I know a couple of groups that are two or three times the size of mine and have multiple postdocs -- if they can access my code, due to sheer manpower they can build on it much faster and completely scoop me on a number of new projects. Not all, of course, but some problems are hotter than others at a given point in time, so yes, lots of people work on them. Only certain problems have the potential to result in high-impact pubs now and not later. If I get scooped on those, I have to drop them altogether.
That's why for these one-of-a-kind codes I stick to sharing only with the people I trust.

The way science currently works (at least in my field) is that whoever is first takes all. We may argue that this is not how it should be, but it is what it is. As long as publishing first and publishing high-impact matter most -- which means novelty is prized most highly -- I have to say that, when I weigh keeping a competitive advantage while testing the code internally against having it out there for everyone else to validate or simply scoop me, I just cannot see how it is professionally in my interest to release the new code. As I said, I am not a very trusting person, and I know some major assholes out there.

More coming...

GMP said...

Anon at 10:21 AM, thanks for the comment and the link. And thanks structurefunction for the information about the different licensing options.

Continuation of response to Opencode (part 2 of 3):

Using your example of resistance, since the underlying physical phenomenon is the same, any other lab should be able to make the same physical measurement of the resistance to validate your answer (using any measurement method they want). But, no one can exactly reproduce your implementation of the model, which may contain errors. If you are doing science with this model, then you have produced something that can never be exactly reproduced by another lab (they will never guess the exact implementation you've used). Reproducibility is key to our scientific enterprise.

While I agree with you about the need for validation of the code, I think your parallel is not entirely accurate.

I actually think that validation of my code by an external party would be equivalent to another researcher coming to the lab of the experimentalist in the original example and asking to perform the same experiment themselves on the original van der Pauw setup.

Performing the measurement of resistance in different labs or even using different techniques is not validation of the original experiment. The original experiment could also have had leaky contacts, incompetent student, faulty data analysis, whatever, just like my code can have bugs. Performing a measurement in a different lab would be equivalent to another computational group writing their own code, not to another group running my code.

Regarding code documentation: there is some electronic documentation for our larger codes, but the most valuable information is in extensive notes in students' lab notebooks. We have multiple hard copies of these, and they are great for training new students, but I don't think they are appropriate as web-documentation materials even if scanned (I would hate it if my unedited scribbles made it onto the web, and students would too).

GMP said...

Thanks grumpy for the comment above!

Continuation of the response to Opencode (part 3 of 3):

Opencode would like me to pinpoint what my true objection to code sharing really is, among the ones I listed at various points in the post and comments:

You've mentioned several things, but I'm having a hard time actually pinning down your major objection to releasing your code. You've mentioned:
(1) The burden of preparing it for release. I hope I've addressed this, and there are certainly ample examples of labs that do it efficiently so that I hope it's clear this can be done.
(2) The burden of answering inquiries. As I've said, with enough documentation to reproduce your results, I don't find it necessary to provide any more support unless there is a serious follow-up claim from an expert trying to validate something specific.
(3) The threat of misappropriation. As I noted, this can be taken care of legally. While there is some risk, it's the same type of risk we accept when publishing anything.
(4) (Added by GMP): The threat of losing the edge, as described in comment at 10:23 AM


The answer is "all of them". For codes of type (a) in my comment at 10:23 AM, where the codes are not unique, I guess the barrier is that of preparing the release (1), followed by answering inquiries (2). Misappropriation (3) is a slight concern.

For codes of type (b), where the codes themselves are novel and unique, losing the edge (4) is the most critical concern [I suppose it comes with the threat of misappropriation (3)], followed by (1) and (2). Btw, the code in the original post is a type (b) code.

My attitude towards sharing codes is certainly influenced by my career stage (I still have a lot to prove) and the practices in the field (sharing not viewed as common or prudent). I understand and respect the rationale behind the open source movement, and I appreciate your comments and the offer to help lower the barrier regarding code releases. (I also welcome a continuation of this conversation, here or offline.)

Comrade PhysioProf said...

Here is a really good example of what can happen when people develop in-house computer code and do not share it with anyone else for independent validation:

http://en.wikipedia.org/wiki/Geoffrey_Chang

Chang and coauthors published papers on the structures of three multidrug resistance transporters, known as EmrE, MsbA, and NorM, between 2001 and 2010. Although the initial structures were widely considered puzzling in the field due to the unexpected placement of the ATP binding sites in the assembled dimer,[2] the publication of an additional structure in the same protein family indicated that the Chang structures were unlikely to represent the biologically active conformation of the molecules.[3] Chang and coauthors issued retractions of their structural papers on EmrE, MsbA, and NorM, citing an error in an internal software utility as the source of the data misinterpretation that led to the appearance of wrongly assembled dimers.[4][5] The application of a popular protein structure validation tool to one of the retracted MsbA structures results in scores that indicate severe errors in this structure.[6]

Dude retracted five motherfucken papers, including three in Science, because his home-brewed code was fucken wrong.

GMP said...

CPP, when I see something like this, I think Hendrik Schön scandal. Your dude, Geoffrey Chang, was at best too eager to publish data in GlamourMag to properly run all the necessary tests, or, at worst, committed fraud by deliberately skipping tests and/or overlooking negative data in order to get into a high-IF journal.

My policy is that, if the results look too good to be true, that's often because they aren't true. The more earth-shattering the results look, the more you need to test and retest and check every freakin' angle to ensure they are real. I delayed the submission of one of our papers by nearly 2 years because, while our code yielded perfectly plausible results, its output in a subset of tests was ever so slightly off. It was months and months of meticulous debugging, checking every line over and over and over, and talking to people in the field, until we did indeed find a very subtle implementation error. I honestly cannot imagine that someone else would ever invest the same effort and time to go through it with a fine-tooth comb. We did it because our reputation hinges on the quality of the codes and the quality of our data. At some point one has to rely on the integrity of scientists and trust that they will do everything in their power to ensure the accuracy of the data.

I think it's offensive on some level to assume that all codes are somehow wrong and buggy and not to be trusted unless someone else checks them. Does that mean all experiments are true and flawless and need never be checked? One experiment in my field, published in 2000 in Science, made a big splash as it predicted certain properties of certain nanostructures to be unbelievably exciting. The paper got a gazillion citations, and we now know the prediction was completely wrong. It was an indirect measurement, and the quantity of interest was obtained from the raw data by assuming (rather than measuring) certain parameter values -- values that were sort-of plausible but more in the realm of wishful thinking. So it was a review-process fail that let an incomplete, sensationalistic report be published in a high-profile journal. Some people were suspicious of the data right away, but only after a few years, when different experiments were designed and performed, did people realize how far off the predictions were. The equivalent of new experiments in the theory/computation world would be another group writing their own code and simulating the same system independently.

Comrade PhysioProf said...

I honestly cannot imagine that someone else would ever invest the same effort and time to go through it with a fine-tooth comb.

HAHAHAHAHAHAHAH!!!!!!!!! Yeah, I'm sure you're the only scientist on the entire fucken planet conscientious enough to have done this.

I'm sure you have plenty of awesome convincing reasons you tell yourself why you conceal primary data--which is what the details of your algorithmic implementations in code are--from others in your field, and thus make it impossible to replicate your experiments. That doesn't make it any less unsavory.

GMP said...

@CPP: Actually, what I meant is that no one else would invest 2 years to go through my code with a fine-tooth comb. I know that I don't have the time/resources -- to pick up someone else's code and then have my own student comb through it for years. Who's going to pay for that? People get funding to develop their own codes, not to debug other people's codes.

And I am not concealing primary data. Primary data are perfectly accessible -- you can have any and all output of any of my codes. But my freaking computational tools themselves are not accessible. No, you do not get to run my one-of-a-kind code unless I am 100% sure that you are not a backstabbing ass.

What you choose to consider unsavory or any other derisive adjective you've thrown at me is your business. For the last time, my practices are perfectly within the norms of my field; I don't care if you agree with them or understand them. There are a few practices in the biomedical sciences that would be considered anywhere from weird to unacceptable in the physical sciences, so don't go acting all high and mighty.

Dave said...

"...my practices are perfectly within the norms of my field." I don't think anyone has disputed that. What people have said -- some explicitly, as with Becca's analogy to the wartime soldier -- is that the norms might be harmful in some respects. Your initial post seemed to be asking for opinions on whether a different norm would be worthwhile, such as one where people did share their code. It might be more productive to discuss the merits of the norm and not get caught up in defending not releasing your code.

For what little it's worth, I read your 11:46 comment the same way CPP did -- that somehow you were claiming to be more diligent than anyone else in checking your code. Glad to see that interpretation was wrong.

I find this statement rather strange: "People get funding to develop their own codes, not debug other people's codes." Surely code is not the output? If you can use a simulation to get new information about a physical phenomenon, that's great no matter who wrote the simulation: you are reporting about the physics, not about the process of building the simulator.

GMP said...

Dave, thanks for the thoughtful comment.

It might be more productive to discuss the merits of the norm and not get caught up in defending not releasing your code.

Agreed. But it's hard to do that with commenters who open with name calling or comparing my professional practices to acts of wartime murder.

I understand the reasons for code sharing: validation, advancement of science as a whole, etc. But just because this works in some fields, where the culture is one of sharing, does not mean it can simply be wished on or forced on others. For instance, I work in a field where codes are closely guarded, where people have been known to backstab, and where code release can result in a significant danger of losing your footing on a number of projects. I would like someone to convince me how, in this particular environment, it is in my professional interest to share my code.

In my field, code sharing (for the benefit of all science) would likely result in professional suicide (getting scooped). Must you always do what's best for all others even if it comes at your own great professional expense? Or are you entitled to safeguard your own professional interest as long as you are working within the ethical norms of your field? I would say the latter. I think it's a good question to ponder in general, not just in science: how much do you owe to your community/mankind vs how much do you owe to yourself.

I find this statement rather strange: "People get funding to develop their own codes, not debug other people's codes." Surely code is not the output? If you can use a simulation to get new information about a physical phenomenon, that's great no matter who wrote the simulation: you are reporting about the physics, not about the process of building the simulator.

Dave, physics is the ultimate goal. However, it is not uncommon that experimental advances make interesting phenomena accessible at, for instance, ever shorter spatial and temporal scales, where current theoretical understanding and simulation tools don't work. So it is not uncommon to have a proposal saying "Cool New Experiments open doors into phenomena at previously inaccessible time/space scales. The Interesting Physical Phenomena at these scales reveal signatures of New Complexity, not fully understood and far beyond the limits of current techniques used to describe Interesting Physical Phenomena at previously accessible and now well understood scales. In order to properly describe New Complexity that emerges in such and such systems, we propose to develop Cool New Simulation Tool that significantly pushes the boundaries of our understanding of the physics at new temporal/spatial scales and Interesting Physical Phenomena on these scales in particular." I know this sounds vague, sorry. Let me just say there is a lot of activity in the development of multiscale and multiphysics simulation tools for a number of systems. Certain divisions in several funding agencies will fund work on the development of novel broad-impact tools that open doors into understanding a whole new class of phenomena and/or systems.

Dr. Shellie said...

Off topic, but of possible interest (and relevant to your post some time back about sometimes not being that interested in the million things you have signed yourself up for):

http://www.joyful-professor.com/workshops.php

Massimo said...

I have to agree with those who disagree with your stand on sharing. Why in the world would you not want to share your code? You have nothing to lose and everything to gain. They will cite your papers; one more group adopting your pioneering and visionary approach will make your contribution look even more valuable and prize-worthy; maybe they will need your collaboration and your name will end up on the paper; they will send you students and postdocs and hire your own students and postdocs...
The only drawback is that they may use it to study cool physical systems that you could have studied yourself but have not thought of. But, number one, they will study them anyway -- possibly with a crappy method, but they will still get the credit for the idea, whereas you will get nothing, even if you redo the calculation better after them; number two, they deserve credit if they have the idea to use your code for something you have not thought of, and you will still derive some recognition.
Just my opinion. By the way, want my code? :-)

Anonymous said...

Dear GMP,

I would like to invite you to reconsider your opinion on sharing code. The rewards for sharing code will be manifold.

1) I would think that it is fair to charge a modest amount for a license for the code, which you can use to offset the cost of maintaining the software and to reward those who wrote it. Obviously, the cost shouldn't be prohibitive. In line with analogies cited earlier in this thread, this is similar to charging for time on some piece of fancy equipment, or for food/cage/staff time in the case of mice.

2) By sharing your code, it will be more likely that others pick up your work and cite your papers. You will have more collaborations, increasing your impact in the field. The fact that other people use your software is similar to an external stamp of approval of your work.

3) Let me also mention that there are various degrees of sharing code. E.g., you could begin by distributing compiled (binary) code. This would allow others to reproduce your published results, while maintaining your competitive edge on producing new models. It's fair to do so for some amount of time.

4) You will receive valuable feedback and possibly corrections from others.

Comrade PhysioProf said...

In my field, code sharing (for the benefit of all science) would likely result in professional suicide (getting scooped). Must you always do what's best for all others even if it comes at your own great professional expense? Or are you entitled to safeguard your own professional interest as long as you are working within the ethical norms of your field? I would say the latter.

What you are not allowed to do is reap the professional benefit of publishing experiments if you do not reveal sufficient information for others to replicate those experiments. In the case of simulation studies, this means you can't keep your code secret. The fact that your field tolerates this kind of unethical behavior doesn't make it any less unethical.

Anonymous said...

What you are not allowed to do is reap the professional benefit of publishing experiments if you do not reveal sufficient information for others to replicate those experiments. In the case of simulation studies, this means you can't keep your code secret. The fact that your field tolerates this kind of unethical behavior doesn't make it any less unethical.

The methods section of any good simulation paper has enough detail on the equations and algorithms to allow reproduction. If you know what you are doing, it is probably faster to code it all up yourself than to deal with someone else's code. This also facilitates replication, since the errors in the implementation would (probably) be independent.

GMP said...

Anon 3:03, right on!

Hermitage said...

Jesus Maria, who are all these oblivious people? The 'sharing' part of the post makes perfect sense, maybe because my field is less out in lala kumbaya land than others.

Any good computational paper exhaustively covers the math/physics/whatever involved and the important parameters, shows how they scale using analytical relations, and indicates which numerical methods were implemented to generate the result. Using that, any douchehat can try their hand at reproducing the results. I'm an idiot n00b grad student who's not even in a computational field, and I was perfectly capable of reading a few papers and cobbling together some code that reproduced the results I found interesting.

This is EXACTLY the same as a methods paper describing the steps of fabricating a device that is used to investigate some phenomena. You don't go fedexing the shit out to anyone who wants to use your chip/equipment/doohickey. You are liable to run into precisely the problems GMP has outlined again and again: 1) they might not know how to use the shit and promptly turn out a paper saying you don't know anything; 2) to avoid that, you dedicate a few of your grad minions to being service reps for your shit; 3) someone with a ton of firepower will simply crank out data using your instrument 100x faster than you ever could. However, it is perfectly reasonable for Joe Blow from Nowhere University of Sunshine to ask for clarification of a step in your fabrication process so they can go build it themselves. There is nothing intellectually dishonest about that at all.

I think many of these commenters don't understand the difference between a non-novel (but possibly unwieldy) hunk of code that looks at a novel thing, and a novel code that can look at novel things. How that informs sharing is very different.

qaz said...

Wow. I'm shocked at the vehemence of this discussion. I originally trained in computation, and have experience in several mathematical, physical, computational, and biological fields. I know of NO fields in which the academic community does not expect a basic willingness to share code. I would be very interested in what scientific field GMP comes from that believes that code is private.

It is very important to recognize that code-sharing does not mean that one trains another student in the 1-2 years it takes to learn to use the code properly. Nor does it mean that you will supply any support. Nor does it mean you should not get credit for your code. In the fields I'm familiar with, code is cited (often with a methods paper), and people know who supports code and who doesn't. (Supporting code gets you "favor" points, which is often useful in an academic setting.)

In any case, many (not all) journals require some placement of code in a public repository so that your work can be basically replicated.

Finally, do not underestimate the expense, uniqueness, or difficulty involved in sharing transgenic constructs. All of your complaints apply completely to these transgenic biomedical constructs, yet they are reliably shared, as required by NIH.

GMP said...

Hi qaz, thanks for coming over.

If you have some time, try to go through the comment thread above in detail (I know its length is quite daunting). You will see that there are a number of fields where code sharing is not the norm, and that most of the points you raised were already discussed above.

znoop333 said...

I have experience publishing in a field where sharing code is not typical. I think the value of sharing code is being vastly overestimated here.

Writing code is easy; reading code is hard. Even if you had all of the code, much of it may be written so poorly that it barely works, and it can be a challenge just to compile even well-written code and get it running. Don't take my word for it: go download ITK and try to get it to work. I'll wait. I bet that you didn't accomplish any research goals while configuring your C++ compiler.

You might not get any insight into the fundamental problem being solved just by seeing one particular implementation of how to solve it. The published paper already has everything you need to know to write your own code. Reproducing the experiment should include rewriting the code to catch bugs, as previously noted. The odds are actually pretty bad that you'll be able to take someone else's code and directly make it work to solve your own problems with no further effort. Your results might be limited by using someone else's buggy code that just barely worked under one narrow range of conditions.

The comparison of sharing mouse lines to sharing code is unfair. Breeding more mice is expensive and slow, but having them can produce new results. Copying source code files is trivial and fast, and having the code will produce exactly the same results. These types of sharing are fundamentally different. The current system of informally sharing code after publication works pretty well -- I've never had anyone refuse to share their code, given that we will collaborate further.

I'd further argue that sharing code is more like sharing grant text than sharing any physical object. Are you all willing to post the full text of your grant applications on the public internet the day that they are funded?

Spiny Norman said...

I'm a biological scientist but in a past life I worked in a private sector technical software house which developed proprietary analytical solutions.

The problem with not making your code base available is that the devil is, inevitably, hiding in the details. It's a variant on the classic Sidney Harris "then a miracle occurs" cartoon, but all the more insidious because you and your colleagues don't even have to specify where the "miracle" is occurring.

It's one thing to sit on your secrets prior to publication. It is quite another to sit on them after publication. Secret methods are sometimes necessary for commerce or warfare, but they are antithetical to the progress of science.

If the status quo in your field is to hide the methods and sequester the tools, it's either a corrupt and rotten field, or it's raw applied commerce -- not open science. If you are not expecting major financial rewards, you might want to ask yourself why you'd want to spend your finite life and energy there. This is especially so if your work is paid for by taxpayers, who have every right to expect maximum benefit from their investment (v. personal benefit accruing to the researchers). In such a situation, the best things you could possibly do are to either: (a) change your field (by publishing your code), or (b) change your field (by finding a more open environment within which to do science).

But, you whine -- and to these ears, it's unambiguously a whine -- we've spent years developing our tools. Well, Sakmann and Neher can say the same. But they shared their methods freely -- and then shared the Nobel Prize. As did Roger Tsien and his colleagues. As did Capecchi and Evans, Milstein, Arber, Nathans, and Smith, Mullis, Geim and Novoselov, and many, many, many others.

Their fields are not less competitive than yours.

I also can readily think of at least two high-profile examples where proprietary code led to faulty conclusions and embarrassing retractions, with the labs relying on proprietary code being completely overtaken by labs who made their code available under reasonable licensing terms.

GMP said...

znoop333, thanks for the comment (sorry it took a while to publish; it got caught up in moderation). You make several excellent points.

Spiny Norman, with all due respect, you don't have a clue about my field, no freakin' idea how competitive it is, and certainly no right to call the whole field corrupt. (But I suspect this is just yet another manifestation of the general sentiment that I get from many biomedical folks who comment here -- that biomedical fields are oh so much more awesome and competitive and important and ethical and overall holy than all others.)

If you read the comments above carefully, you'll see that people who actually develop code have addressed many pertinent details (read, e.g., the comment just before yours, by znoop333), and no, my field is not unique in not sharing codes freely. Details are published in the peer-reviewed literature, and anyone can write their own version if they wish.

The fact that taxpayers pay for someone's research does not prevent people from patenting their results, starting a company and making money off of it, so please don't go into the "but taxpayers have the right to your cooooode...!"

Spiny Norman said...

"The fact that taxpayers pay for someone's research does not prevent people from patenting their results, starting a company and making money off of it, so please don't go into the "but taxpayers have the right to your cooooode...!"

Others behave unethically! QED!

Thanks for clarifying your thought process.

GMP said...

Researchers and the universities that employ them retain intellectual property over the results of their federally funded research. There are even grants specifically meant to foster transition of research to industry.
http://www.acq.osd.mil/osbp/sbir/

Just because you don't like something doesn't make it unethical.

Spiny Norman said...

To confuse "legal" with "ethical" is a common enough error (or, for some, tactic).

GMP said...

Ahahahahaha!

How lucky am I, having Holy Norman here to educate me about ethics...