The original paper itself, for those who are interested.
Overall, this is really interesting research and a really good “first step.” I will be interested to see if this can be replicated on other models. One thing that really stood out, though, was that certain details are obfuscated because of Sonnet being proprietary. Hopefully follow-on work is done on one of the open source models to confirm the method.
One of the notable limitations is quantifying activation’s correlation to text meaning, which will make any sort of controls difficult. Sure, you can just massively increase or decrease a weight, and for some things that will be fine, but for real manual fine tuning, that will prove to be a difficulty.
I suspect this method is likely generalizable (maybe with some tweaks?), and I’d really be interested to see how this type of analysis could be done on other neural networks.
YES! I study AI, and this is exactly how I feel!
Side note-One of my favorite things to do is ask people what their use case for using AI is, and watch them sputter out “uh…emails and productivity and things.”