Interpretations, questions, and a few speculations from “Deep Learning with Python” by François Chollet.
First of all, I need to clarify that my point of view is all wrong for this book, though I enjoyed reading it and found it quite thought provoking. Using one of my typically strained metaphors, this book is similar to a recipe book for practitioners, and I’m not a practitioner. (Just a neophiliac? (Or do I suffer from full-blown neomania?))
Comparing deep learning to cooking, the raw data is like the ingredients and the recipes in the book are like instructions to tell you how to produce delicious information and predictions by using various implements (rather than cooking utensils) to analyze and do various things to the raw data. The book focuses on the various hardware, utilities, and programs that you use to handle the ingredients in various ways. The book makes it clear that the author is a very skilled practitioner, a master cook, in this field of dreamy metaphors. I used to be a pretty good cook and even a fair programmer, but I’m definitely (and a bit sadly) not one of the practitioners this book was written for.
However to really understand what is going on, you would need to approach the topics from a mathematical perspective. As far as Keras is concerned, in this book all of the deep mathematics have been buried inside “magic spells” that you chant as you add layers to the networks. Not a functional approach, but object oriented. Or you could say it feels like parameters being passed in to invoke the right tools. Notwithstanding his clear explanations, what is going on feels akin to deep magic. This book basically shows you how to poke at the various parameters and settings to get various kinds of results, without requiring any deep mathematical understanding of what is really going on inside. In spite of my lengthy studies, I’m only a shallow mathematician, but at the other end of the mathematical spectrum, I actually know one deep mathematician who might be capable of approaching these analyses from above, understanding what is going on and figuring out what the proper parameters should be without all the fiddling about. (It’s even likely that some people who are using Keras seriously consult him for guidance in optimizing their models. Certainly not the mathematically precise way to word it, but if you make the models too big or use the wrong kinds of models, then you may be able to get some results, but with extreme inefficiency. In contrast, if the models are too small or the wrong tools are selected, then the results will be weak or meaningless. (In Chapter 7 of the book Chollet briefly introduces some tools to support more automated tuning of the models.))
So much for the practitioner’s and the mathematician’s perspectives. Where is the author and where am I? It is clear that Chollet knows more math than I do, but in the end I concluded he is mostly a programmer, probably even a genius of programming, but not a deep mathematician. And me? My perspective is very shallow but broad. I have never been enough of a programmer to practice on this level, nor am I a sufficiently deep mathematician to really understand what is going on. But I am able to link what is described in the book to a lot of other things that I’ve read about, and the premise of this “descriptive reaction” is that I can write something interesting from my odd perspective. I certainly found the book interesting enough.
A good place to start is on page 165, which shows various levels in the interpretation of a Keras model. This model has been trained to distinguish between images of dogs and cats. A test image of a cat has been input, and this page shows what each of the nodes in each of the layers is reacting to as it responds to that image. He didn’t say anything like this in the text, but I interpreted all of those thumbnail images from the perspective of neuroscience analyzing how the visual cortex works. The first layer at the top of page 165 (32 thumbnail images) corresponds to the surface levels of the visual cortex where various low-level features are identified, and the image triggers many of those neurons (or groups of neurons). At the next level (64 thumbnails) certain features are contributing towards the target of recognizing a cat, and some of the “neurons” at this level are no longer activated (15 black thumbnails), because those features are not relevant. At the level below that (corresponding to going deeper into the visual cortex) more cells have become black and irrelevant and the features are higher-level constructs. This particular model ends with a fourth level where most of the thumbnails have become black and irrelevant because the features they are looking for are not relevant to the input image in relation to being an image of a cat versus a dog. (Later on the book talks about splitting things up and using pretrained models, which corresponds to taking out some of the deeper levels and then training only for the interpretation of the features triggered at higher levels, thereby limiting the training work. But from the perspective of how our brains are wired, those surface-level models correspond to the basically naive features that are most nearly hardwired for expression by our genes, with the deeper-level learning built upon them.)
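The "black thumbnails" can be made concrete with a small sketch. It assumes each layer's output is a list of 2D feature maps whose values are zero or positive (as they would be after a ReLU-style activation), so a map that is zero everywhere would be drawn as a black thumbnail. The function name and the toy data are my own illustrations, not anything taken from the book.

```python
def count_blank_maps(feature_maps, tol=0.0):
    """Count feature maps whose every activation is at most `tol`.

    feature_maps: list of 2D lists (one per channel), assumed non-negative.
    """
    blank = 0
    for fmap in feature_maps:
        if all(value <= tol for row in fmap for value in row):
            blank += 1
    return blank


# Toy layer with three 2x2 channels; only the middle one is silent.
toy_layer = [
    [[0.0, 1.2], [0.3, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.1, 0.0], [0.0, 0.0]],
]
print(count_blank_maps(toy_layer))
```

Run on the real activations of each layer, a count like this would give the tally of silent "neurons" per level that I was doing by eye.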
Another strongly beyond-the-text reaction was on page 208, where he’s talking about weather forecasting using vectors with 14 pieces of data. He writes about predicting the temperature on the next day, but he doesn’t go to the level of considering how the model works. What the Keras library seems to be doing behind the scenes is looking for correlations among the various features in the input vectors. For example, if the air pressure has been falling and the wind is from a certain direction at a certain time of year, and the temperature is compatible with a certain kind of storm pattern, then the next day’s temperature prediction should reflect the arrival of that kind of weather pattern. (This would also make sense when extended to data from more locations, but I may be mapping too much to the same kind of model that was used for image recognition, considering the sequential patterns in the weather data as corresponding too closely to the surface levels of the visual cortex.)
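One way to picture what such a model is actually given: each training example is a window of consecutive readings paired with a temperature some number of steps in the future. The sketch below is only my guess at the general shape of that preparation step. The names make_windows, lookback, delay, and target_index are hypothetical, and the toy rows have 2 features rather than 14 to keep it short.

```python
def make_windows(rows, lookback, delay, target_index):
    """Pair each window of `lookback` consecutive rows with a future target.

    rows: list of equal-length feature lists, one per timestep.
    delay: how many timesteps past the end of the window the target lies.
    target_index: which feature column holds the value to predict.
    """
    samples = []
    last_start = len(rows) - lookback - delay
    for start in range(last_start + 1):
        window = rows[start:start + lookback]
        target = rows[start + lookback + delay - 1][target_index]
        samples.append((window, target))
    return samples


# Toy series: feature 0 is the timestep, feature 1 a made-up temperature.
rows = [[t, 10.0 + t] for t in range(6)]
samples = make_windows(rows, lookback=3, delay=2, target_index=1)
for window, target in samples:
    print(window, "->", target)
```

Whatever patterns the network finds (falling pressure, wind direction, and so on) have to be found inside windows like these, which is why the length of the window matters so much.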
I had a strong reaction to his cautionary note on page 224. He says that you shouldn’t jump right in and try to predict the stock market with this kind of data, citing the infamous warning that “past performance is not a good predictor of future returns” [Author’s italics]. I thought he was kind of detached from reality here, because the reality of the performance isn’t the point of gambling in the stock market now. You don’t need a model that understands the reality, especially when the reality is fundamentally unpredictable. You just need a model that is smarter than the models used by the many suckers who are putting their money into the stock market and buying and selling based on what they fantasize the future share prices are going to do. (I still think in terms of solutions, and I still think the best solution to many of the problems of the stock market would be a transaction charge on every gamble.)
On page 234 he begins Section 7.1 about mixing models. This actually relates to something that was bothering me back with the visual cortex example. The connections among neurons don’t have to be strictly constrained to one level, and mixing models seems in a sense similar to allowing for the extra branches between networks of neurons at different levels.
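To show what I mean by an extra branch between levels, here is a toy forward pass in plain Python in which the output of an earlier layer skips past the next layer and is added back in afterwards. This is a sketch with made-up weights and my own function names, not Keras code and not an example from the book.

```python
def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by ReLU.

    weights: one list of input weights per output unit.
    """
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0.0, total))
    return outputs


def forward_with_skip(inputs, layer1, layer2):
    """Run two layers, then add the first layer's output back in."""
    hidden = dense_relu(inputs, *layer1)
    deeper = dense_relu(hidden, *layer2)
    # The skip: the earlier level's output bypasses layer2 and is summed in.
    return [h + d for h, d in zip(hidden, deeper)]


layer1 = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])   # identity weights
layer2 = ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0])
result = forward_with_skip([2.0, 3.0], layer1, layer2)
print(result)
```

The final values depend on both the deeper layer and the shallower one directly, which is the kind of cross-level wiring a strictly stacked model cannot express.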
Then on page 235 he mentioned something quite close to a deep learning application that had started to pique my interest as I read the book. He was considering models to identify the genres and dates of novels, and I had been wondering about looking for stylistic consistencies and patterns within the work of a single author. Another use would be when comparing books by one author with continuations written by other authors. (In particular, I was just thinking about Robert Goldsborough’s Nero Wolfe novels based on Rex Stout’s work. But the same kind of analysis could be used in various ways, such as trying to identify authors who write along the lines of such deceased authors as Iain Banks, Robert Parker, or Umberto Eco.) He returned to this theme from the angle of AI generated art and literature in Chapter 8, even claiming to be a bit prophetic in 2014. (However I’m sure I was thinking about similar ideas earlier than that, so I file this part under “intuitively obvious to the most casual observer”. Of course someone is going to want to make a deep fake video of the Marx Brothers visiting a certain clown (“He whose name need not be mentioned”) in the White House. However my earlier speculations were about Laurel and Hardy appearing in new computer-generated films. (And that was before I learned how much of the later Star Wars movies were CG over green screens.))
(For a major diversion, I’m going to slip in a capsule review of Archie Goes Home by Robert Goldsborough. (If I was editing this piece, I’d demand this paragraph get cut, but I’m basically writing for my own amusement, so…) He is trying to imitate Stout’s style, but with limited success. While it’s hard for me to describe the differences, I think a deep learning model trained on the original novels would find patterned differences when tested with this derivative work. Tempo? Lightness? Uncharacteristic behaviors and speeches from the recurring characters? It would be most amazing if it could find the major plot holes, in this case [Spoiler alert!] involving a character who was too far away to be framed.)
For these kinds of analyses I actually predict that the Keras models will identify many characteristics that we cannot label in non-mathematical and human terms. However it will still be possible to use such features to compare authors. Or perhaps more interesting, we could track how these features evolve over time within the work of a specific author and even predict what sort of books that author would have written had he lived longer. Or perhaps characterize the lost works of Aristotle? (Then those missing books could be “written”?) Chollet mentioned analyzing some translations, so how about high-level characterizations of translated books to be compared with the original versions, abstracting and filtering out many of the language-specific features? (Obvious gold standard available from bilingual authors who translated their own work.)
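As a very rough stand-in for the learned features I am imagining, the sketch below compares texts by how often they use a handful of common words and then measures the similarity of those profiles. To be clear, this is a much older and simpler technique than deep learning, and the vocabulary, the function names, and the sample sentences are all invented for illustration.

```python
import math
from collections import Counter


def style_profile(text, vocabulary):
    """Relative frequency of each vocabulary word in `text`.

    Splitting on whitespace is crude; punctuation is not handled.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[word] / total for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two profiles (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = ["the", "and", "of", "but"]
profile_a = style_profile("the cat and the dog", vocabulary)
profile_b = style_profile("the dog and the cat", vocabulary)
profile_c = style_profile("but of of but", vocabulary)
print(cosine_similarity(profile_a, profile_b))
print(cosine_similarity(profile_a, profile_c))
```

A deep model would presumably replace the hand-picked vocabulary with features it discovers for itself, but the comparison step at the end could look much the same.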
The last part of the book was where I finally concluded that Chollet isn’t a heavy mathematician, because his predictions for the future were quite heuristic rather than analytical. Also much too concrete for a pure mathematician (though I had already ruled out that possibility). He’s clearly a stellar programmer and a pretty good author.
However that reminds me I should include a few mechanical notes. The book seemed to be carefully written. The only typo I noticed was in the last line on page 343, where it should say “… should look like this.” The word “look” was omitted. In a couple of other places the text described results that were slightly different from the examples. Also, the intro made it sound like Appendix A belonged inside of Chapter 3, so maybe that was a late change in the editing.
Interesting overall and fairly practical. Rating most things with Likert scales these days, so I guess I’d call it 4 out of 5 stars. Recommended, but not so strongly.
P.S. One more thing. I disagree with him about the history of neural networks. I think the links to neuroscience are stronger than he seemed to indicate in the last chapter, and I was alive and paying some attention in those days. I think the whippersnapper is wrong on that one, even though it’s his field.