Original page: http://www.drmaciver.com/2008/12/living-on-the-edge-of-academia/
As for the implementation issues:
I work in a different field, but I've found the overstatement of an algorithm's effectiveness to be a universal attribute of most scholarly works. It's incredibly disappointing to spend a few days implementing a silver bullet only to find that it's a lump of coal when applied to real-world data. This is just the way it is when the work you're doing is pushing the boundaries of common knowledge.
On implementation: I understand that overstatement of effectiveness is a general problem. It would just be nice to find that out up front rather than after I've put in all the work. :-)
I try to be charitable about it - in some cases it's probably just that it only works in a certain subset of cases. I'm sure many of the things we're doing in SONAR are similarly ineffective when applied to things outside our domain (e.g. our techniques are optimised for lots of short documents and probably don't do nearly so well on fewer long ones). And sometimes I just can't be charitable. A lot of algorithms published are actually complete nonsense and don't do what they claim to do but get away with it because of massaging of data and other pre/postprocessing.
It's not always enough to release the code, though. You can find plenty of bits of software out there. Sometimes it's in a state so bad, it's almost worse having the code than just reimplementing it yourself (at least, that's what you tell yourself).
the culture is changing, so i'm not too worried. keep spreading the word and encouraging academics to use GitHub or whatever makes sharing knowledge most convenient and effective.
I've generally had very good luck just emailing the authors. I don't think a polite request for a paper of mine has ever been turned down, and I've certainly always been willing to send my paper to anyone who asks me.
As for not publishing a reference implementation, I'm willing to bet a huge part of that is also that most academics are kind of embarrassed about the quality of their code. It's not always the most elegant software engineering, and the effort to make it publication-worthy is often neglected with so much other stuff to do. Again, I've had good luck emailing them and asking if they'd be willing to send the source code. I don't think I've ever had a reply back that didn't include some sort of apology for the quality of the code.
Jason: Thanks for the link. It was an interesting read.
On the subject of bad code: I genuinely would rather have bad code than no code. Even if I can't actually get the damn thing to work it still gives me a source for figuring out the hidden details. To take the punkt example - as far as I can tell, nowhere in the paper does it specify its exact tokenizing scheme. It's easy to guess an appropriate one, and the one I've guessed seems to work, but if it didn't work it would be really nice to be able to go back to the source and figure out exactly what it considers a token.
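To illustrate why an unspecified tokenizing scheme matters, here is a minimal Python sketch comparing two plausible guesses. Neither is taken from the Punkt paper or from any implementation of it; the function names and the sample sentence are made up purely for illustration.

```python
import re


def tokens_whitespace(text):
    """Guess A (hypothetical): split on whitespace only.

    Punctuation stays attached, so "Dr." is a single token.
    """
    return text.split()


def tokens_split_punct(text):
    """Guess B (hypothetical): runs of word characters are tokens,
    and every punctuation mark is a separate token.
    """
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Dr. Smith arrived at 5 p.m. today."

guess_a = tokens_whitespace(sample)
guess_b = tokens_split_punct(sample)

print(guess_a)
print(guess_b)
print(len(guess_a), len(guess_b))
```

The two guesses disagree on what "Dr." and "p.m." even are, which is exactly the kind of hidden detail that a sentence-boundary algorithm depends on and that source code, however ugly, would settle.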
Good point about asking people for the papers and code. For some reason I never think to do that. I shall try to be better about it in future.
My personal experience with asking for papers and code is as follows: papers good, code bad. For papers beyond my research area (e.g. Springer LNCS) I don't have free access either. My solution: ask other researchers for an ssh account within their LAN and download the papers via a VPN tunnel. Regarding reference implementations, I have had very bad experiences asking for non-open-sourced code. If researchers don't open-source their code, they will still give it to you, but they become very possessive about everything you do afterwards, however unrelated it might be.
Through Northwestern University's library I can get electronic access to almost everything worth reading that has been published since 1996. A subscription costs something on the order of 120 USD/year.
Also, about publishing code, consider the incentive structure.
The citation distribution follows a power law, meaning that a few papers get many citations but most papers get few citations or none at all. In other words, the probability is high that no one will care about your code. Since academics are often ashamed to release an inferior product, they feel they must make a cleaned-up version for public release. And the expected return on this is negative, since the probability is so low that anyone actually cares.
Academics would publish their code much more freely if they were rewarded for it, and those rewards were clear and tangible. For example, if released code factored into tenure decisions, you would see a quantal change in how much open-sourcing goes on.
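The incentive argument above can be sketched as a small expected-value calculation. All numbers below are hypothetical and chosen only to show the shape of the argument; the function names are likewise made up for this sketch.

```python
def expected_return(p_reuse, benefit_if_reused, cleanup_cost):
    """Expected payoff of cleaning up code for release.

    p_reuse:           assumed probability that anyone uses the code
    benefit_if_reused: assumed payoff to the author if they do
    cleanup_cost:      assumed cost of making the code presentable
    (all in the same arbitrary unit, e.g. hours of work)
    """
    return p_reuse * benefit_if_reused - cleanup_cost


def break_even_probability(benefit_if_reused, cleanup_cost):
    """Probability of reuse at which the cleanup pays for itself."""
    return cleanup_cost / benefit_if_reused


# Hypothetical scenario: a 5% chance of reuse, a payoff worth 40 hours,
# and 10 hours spent on cleanup.
value = expected_return(0.05, 40.0, 10.0)
threshold = break_even_probability(40.0, 10.0)

print(f"expected return: {value:.1f} hours")
print(f"break-even probability of reuse: {threshold:.2f}")
```

Under these made-up numbers the cleanup loses time on average, and it would only break even if one paper in four attracted users. Raising the reward (for instance, counting released code in tenure decisions) or dropping the cleanup cost (releasing the code as-is) both push the result towards positive.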