David R. MacIver: Cleaning up a set of tags, part 1

  • Jason Adams · 11 months ago
    In irb, I've taken to habitually tacking ";nil" onto the end of a command that would return a ginormous hash/array/etc. Speeds things up quite a bit, too.

    >> a = (1..1000000).map {1}; nil
    => nil
    >> a.size
    => 1000000
  • david · 11 months ago
    Yeah, I know. I try to do that, but I forget every now and then and when you have 100K items you only need to do it once and you've lost the log.
  • Jason Adams · 11 months ago
    Yeah, true. I wish there were a command line option to suppress it. There's an option to uglify the output (--noinspect), which slightly reduces the space. I guess there's no easy way to figure out what to suppress and what not to.
  • david · 11 months ago
    One slightly 'orrible trick I tried was to monkey patch inspect to return "", but for some reason this didn't work correctly. I think it might be that some of the inspect methods are intrinsic.
  • A · 11 months ago
    to mute IRB ...

    IRB.conf[:PROMPT][ IRB.conf[:PROMPT_MODE] ][:RETURN]=''

    from

    http://groups.google.com/group/ruby-talk-google...
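
    For what it's worth, irb also has an echo setting that seems to cover the "command line option" wish above: `--noecho` on the command line, or `IRB.conf[:ECHO]` from `.irbrc`. A minimal sketch, assuming a reasonably recent irb; older versions may behave differently:

```ruby
# Sketch for ~/.irbrc. The require is redundant inside irb itself;
# it is only here so the snippet also loads as a standalone script.
require 'irb'

# Stop irb from printing the "=> ..." result after each expression.
# Explicit output via puts/p is unaffected.
IRB.conf[:ECHO] = false
```

    Unlike blanking the :RETURN prompt, this skips the inspect call entirely, so it should also avoid the slowdown on very large objects.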
  • david · 11 months ago
    Wow. Fantastic. That's going right in my .irbrc.
  • Fergus Gallagher · 11 months ago
    Re: tag "all-articles". We sometimes get requests to tag all articles with the same tag (typically so a user can delete all their articles) and this is the tag name we typically use.
  • david · 11 months ago
    @Fergus: Ah. Interesting. Thanks for the info.
  • Porges · 11 months ago
    As you can see from the pingback I've written a reply semi-tutorial on how to use Awk to do the same task. Hope you don't mind :)
  • david · 11 months ago
    I don't mind at all. :-) It was an interesting read. Thanks.

    I will however probably continue using Ruby for these tasks. In particular I think you're going to be completely unable to port the second one to awk because that's where it stops being remotely line oriented and where I start making use of more general purpose libraries. I could start out with awk and switch to Ruby when that happens, but frankly I'm much more comfortable keeping it in the same language throughout.
  • Joshua Drake · 11 months ago
    One comment on pluralization. Although it sounds like this did not apply to the data set you normalized, it is possible that Statistic applied to a single fact or piece of information, while Statistics referred to the field, and there may be additional cases where the plural and singular are usefully different entities.

    That said, I appreciate your work on this, especially the details given on your approach. I hope that because of work like yours, sites will consider cleaning up their tags. Only two sites I can think of attempt to control their tags to any degree: Amazon, which presents previously used tags, and Stack Overflow, which handles this with tag auto-completion and a reputation requirement for creating new tags.
  • david · 11 months ago
    It's true that stripping off pluralisation can change the meaning of the tag - my expectation is that these cases will be sufficiently rare compared to the number of cases where it helps that I can live with the slight loss of useful information. Generally the meaning will be close enough to being preserved that it's tolerable.

    I'm going to do some analysis on usage later when I try to clean up things further, and when I do I'll see if I can spot any cases where it's actually breaking things.
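
    The trade-off discussed here can be sketched roughly as follows. This is not the code from the post, just an illustration: `KEEP_PLURAL`, `normalise_tag` and `merge_report` are hypothetical names, and the exception list is made up.

```ruby
# Hypothetical exception list: plurals that mean something different
# from their singular form and should be left alone.
KEEP_PLURAL = %w[statistics news physics].freeze

# Naive singularisation: trim, lowercase, drop one trailing "s".
# Short words and words ending in "ss" are left untouched.
def normalise_tag(tag)
  t = tag.strip.downcase
  return t if KEEP_PLURAL.include?(t)
  return t if t.length <= 3 || t.end_with?("ss") || !t.end_with?("s")
  t.chomp("s")
end

# Group raw tags by normalised form and keep only the groups where
# distinct raw tags were merged, so the merges can be reviewed by eye.
def merge_report(tags)
  tags.group_by { |t| normalise_tag(t) }.select { |_, v| v.uniq.size > 1 }
end
```

    Running `merge_report` over the raw tag list gives a short list of merges to eyeball, which is one way to spot the cases where stripping the plural is actually breaking things.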
  • david · 11 months ago
    Also I'm glad you're finding it interesting. :-) I don't expect that much of what I'm doing will see use on these sites - it's not really for that. It's more for building data analysis on top of a noisy tag set.