Alexandre Trilla, PhD - Data Scientist
 

Blog

-- Thoughts on data analysis, software development and innovation management. Comments are welcome.


Post 84

The value-added and the value-perceived of an MBA

16-Feb-2014

The other day I was riding a high-speed train to Madrid when I couldn't help overhearing a conversation between two businessmen sitting next to me (besides, they were speaking rather loudly). They were discussing the use (and value) of an MBA in the business world. That inevitably caught my attention, as I have just enrolled in one this year (which takes up all my spare time and keeps me from writing here; I apologise for the poor throughput of posts lately). One of them argued how useful it is to have such a pragmatic view of a business, while the other countered that such a tight focus is a double-edged sword, and that it all comes down to the competence of the candidate, sometimes for the better and sometimes for the worse.

That left me a little puzzled. I had already asked this question on Hacker News some time ago, and the answers I got were somewhat negative. Then why did I still enrol? I guess the first thing that comes to mind is my obstinate passion for continuous learning and improving my skills. I already have a technical education, experience and background that allow me to face technical problems with confidence (I end up solving the problems I already know how to solve, or else I first learn from the literature and then find a practical solution), but I feel my professional career needs a veneer of business to achieve a more compelling profile. In addition, my commute takes quite a lot of time and I wanted to make the most of my reading time on board the train. Thus, getting an MBA made sense to me. Do I think of ever running my own business? Of course I do, although I also enjoy treating my job at the big company (which I like very much) as my own business, bearing in mind that my employer is my customer, and that I must always deliver a service a little above their expectations. In the end, it pays my bills, it allows me to reinvest in my own education, and I feel fulfilled with my work. However, in order to fully develop this compelling profile I'm seeking, I also reckon it is of the utmost importance to take care of my online profile so as to explore new business opportunities and to commit to freelance projects on the side. This is great for getting better at what I do. I have long directed my efforts towards this goal, I have gathered positive feedback, and I am determined to follow this path.

Before I bring this post to a close: I recently found a piece of evidence that supports my decision to enrol in an MBA in order to polish my professional profile. This month's edition of Emprendedores magazine features an article that depicts some entrepreneurship ideas, and in addition to the ones directly related to tech consultancy and tech transfer, many of them are related to strategic management, financial management and customer management. Notably, all three are covered in the syllabus of an MBA.



Post 83

Down to the nitty-gritty of graph adjacency implementation

02-Sep-2013

The core implementation of the adjacency scheme in a graph abstract data type is a rich source of discussion. Common knowledge, i.e., Stack Overflow knowledge, indicates that a matrix is usually good if the graph in question is dense (quadratic space complexity is worth paying in exchange for constant-time random access). If the graph is sparse, however, a linked list is more suitable (linear space complexity at the expense of linear access time). But is this actually a fair tradeoff?

Part of this argument hinges on whether all of the graph's elements are handled internally or some of them are managed externally. Matrix-based graphs are only fast to access if it is assumed that the storage positions of the nodes are already known. Since computers don't know (nor assume) things per se, they generally have to be instructed step by step. Thus, the matrix-based approach (i.e., the fastest to access) needs some sort of lookup table to keep track of the node identifiers and their associated storage positions (be it maintained internally or externally). Bearing in mind that storing and searching this dictionary introduces some complexity offsets, its impact must be taken into account in any scenario. Moreover, the list-based graph also needs this lookup table to be able to reference all its nodes, unless it is asserted that the graph is always connected so that every node is accessible from a single root node (which is not very realistic, especially for directed graphs). Therefore, the choice of implementation does not simply come down to a tradeoff between graph time and space complexities, but to a somewhat blurrier scenario involving the complexities of the auxiliary data structures. Let's dig into this.

The goal of conducting this asymptotic analysis of graph adjacency implementation is to validate the following hypothesis:

Matrix-based graphs are more effective than list-based graphs for densely connected networks, whereas list-based graphs are more effective than matrix-based graphs for sparsely connected networks. This is what is usually taken for granted, but it only holds if the complexities introduced by the auxiliary (and necessary) data structures of the implementation are of a lesser order of magnitude than the default graph complexities.

In order to carry out some experimentation, time and space complexities are taken as proxies for effectiveness. A matrix-based and a list-based approach are evaluated on some graph properties and functionalities (storage, node adjacency and node neighborhood functions) and on two topologies with different levels of connectivity. The sparse scenario is defined to be a tree and the dense scenario is defined to be a fully meshed network. Considering that V is the set of nodes (aka vertices) and E is the set of edges, the two topologies relate as follows:

Network topology          Number of edges |E|
Sparse graph (tree)       |V| - 1
Dense graph (full mesh)   |V| (|V| - 1) / 2

Now the typical complexities of the two implementations of graph adjacency are shown as follows:

Complexity          Matrix      Linked list
Storage             O(|V|^2)    O(|V| + |E|)
Node adjacency      O(1)        O(deg(v))
Node neighborhood   O(|V|)      O(deg(v))

Just for the record: without loss of generality, node degrees deg(v) in a sparse graph are on the order of 1, whereas in a dense graph they are on the order of the number of nodes |V|.
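
To make the tradeoff concrete, here is a minimal Python sketch of both schemes (illustrative only, not the TADTs code); note how each implementation leans on an id-to-position dictionary, here Python's built-in dict (a hash table):

    class MatrixGraph:
        def __init__(self, node_ids):
            # Dictionary mapping node identifiers to storage positions.
            self.index = {n: i for i, n in enumerate(node_ids)}
            n = len(node_ids)
            # O(|V|^2) storage, regardless of |E|.
            self.adj = [[False] * n for _ in range(n)]

        def add_edge(self, u, v):
            i, j = self.index[u], self.index[v]
            self.adj[i][j] = self.adj[j][i] = True

        def adjacent(self, u, v):
            # O(1) once the positions have been looked up.
            return self.adj[self.index[u]][self.index[v]]

    class ListGraph:
        def __init__(self, node_ids):
            # The same dictionary, now holding one adjacency list per
            # node: O(|V| + |E|) storage overall.
            self.neighbors = {n: [] for n in node_ids}

        def add_edge(self, u, v):
            self.neighbors[u].append(v)
            self.neighbors[v].append(u)

        def adjacent(self, u, v):
            # O(deg(u)) linear scan of u's neighbors.
            return v in self.neighbors[u]

With a hash-based dictionary these lookups cost O(1) on average, so the figures in the table above hold; swap in a slower dictionary and every graph operation inherits its lookup cost.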

And these are the complexities of some implementations tackling the dictionary problem:

Dictionary implementation       Space complexity    Time complexity
Hash table                      O(|V|)              O(1)
Double list (unordered keys)    O(|V|)              O(|V|)
Double list (ordered keys)      O(|V|)              O(log|V|)

Note that the dictionary used to relate node identifiers with storage positions introduces a diverse scenario of complexities. The implementation based on a hash table is the least intrusive on the graph complexities, as its space complexity is of equal or lesser order and it provides constant-time access. The remaining two approaches, by contrast, do introduce changes of equivalent order that worsen the performance of the graph functions (the linear search for the double list with unordered keys and the binary search for ordered keys).
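
For illustration, Python happens to ship all three ends of this spectrum; a minimal sketch with toy data:

    import bisect

    # Hash table: O(1) expected lookup (Python's built-in dict).
    positions = {"A": 0, "B": 1, "C": 2}
    print(positions["B"])  # -> 1

    # Double list with ordered keys: two aligned lists, where a binary
    # search over the sorted keys gives O(log|V|) lookup.
    keys = ["A", "B", "C"]
    values = [0, 1, 2]
    i = bisect.bisect_left(keys, "B")
    print(values[i] if i < len(keys) and keys[i] == "B" else None)  # -> 1

    # Double list with unordered keys: a plain linear search, O(|V|).
    ukeys = ["C", "A", "B"]
    uvalues = [2, 0, 1]
    print(uvalues[ukeys.index("B")])  # -> 1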

Conclusion
Common approaches to deal with the graph adjacency scheme are based on a matrix and on a linked list. The former is usually suitable for densely connected graphs while the latter is more effective for sparse graphs. However, both implementations require an additional dictionary to associate node identifiers with their storage positions, and the time and space complexity of this auxiliary structure must be taken into account in the whole graph abstract data type. In this light, the dictionary shaped as a hash table stands as the only approach that allows appreciating the differences between the matrix-based and the linked-list-based approaches, because it operates unobtrusively with respect to the fundamental adjacency structures. The other two dictionary approaches based on linked lists (with ordered or unordered keys) introduce a complexity burden that hinders the advantages of the adjacency implementations with respect to the level of connectivity of the graph in question.

Anyhow, in order to prevent premature optimization, TADTs implements a graph abstract data type with a dictionary based on a linked list. This is good enough for most real-world scenarios.



Post 82

A story about growing better tomatoes with genetic selection and other farming hacks

14-Aug-2013

This is a seasonal post about my grandfather's story and his relentless efforts to produce better tomatoes throughout his whole farming career. His approach is based on manual genetic selection. My grandfather is a farming hacker; he doesn't know what this means, but I admire what he has always done. He has conducted this selection process over more than forty years, and now I have tomato seeds that develop into the sweetest tomatoes ever. I have now taken up his baton and continue his magical deed, though at a smaller scale.

The whole process begins with sowing some tomato seeds in pots by the end of January, watering them about three times a week (constant humidity is important for germination) while keeping them in a warm place (indoors). By early March all plants should have poked out of the soil, and they keep growing until May, when they are to be replanted in a bigger plot of land, preferably a vegetable garden. Flowers follow, pollination is automatic as the flowers are bisexual, and the red fruits culminate the cycle. At this point, all is set for some human intervention (otherwise nature tends to maximum entropy by growing a little bit of everything). Only the seeds from the best tomatoes (i.e., the ones with adequate size, intense rosy color, well-rounded shape, firm touch, delicate smell, sweet taste, ...) must be released from the pulp, dried under the sun, and safely stored for the next season (next year's summer). A good way to do it is by smashing the selected tomatoes and leaving them in a jar with some water for 8-10 days, so that the pulp rots and releases the seeds, which, by density, settle at the bottom of the container. This is an old farmer's trick for obtaining excellent seeds. By repeating this process over many years, the genetic content of the selected ones is such that the resulting tomatoes have no equal in the market.

image

The only setback of this variety of tomatoes, which we call "pink-colored tomatoes" (see picture), is that the plants do not produce as many tomatoes (or pounds of tomatoes) as other hybridized varieties like bodar, for example. This is why the pink ones are rarely seen on the grocer's shelves. Nonetheless, many farmers I know still cultivate some of them for personal delight, in addition to the rest of the varieties for trading. However, the commercial viability of my grandfather's forgotten tomatoes is yet unknown. I can't presently afford to spend much on them so as to iterate on them with consumers, or with other producers (there is a lovely community of farmers on the outskirts of Barcelona that I see every day during my commute). I just grow a couple of plants on the roof of my house to maintain the wonderful seeds that my grandfather cared to select for such a long time. As a research engineer, I will definitely continue the manual selection process of tomato seeds, and measure as many variables as I can to quantitatively model and analyze the process. In the end, as Burrell Smith (designer of the Macintosh computer) put it at a hacker conference, hacking has to do with careful craftsmanship, and this is not limited to tinkering with high tech. In fact, there are already other personalities in this field with many interesting ideas in this regard. One of them is John Seymour. From his books I learned to revitalize plants with fermented nettle broth, to kill caterpillars with tobacco infusions, and to eliminate greenflies with basil and ladybugs. By applying these natural techniques I contribute to the enchanting and miraculous process of tomato improvement with ecologically friendly methods free from pernicious chemicals. This is not a matter of fashion; we are what we eat, and fortunately more and more people nowadays adhere to hacking for a better food system beyond growing vegetables.



Post 81

Facelifting my homepage while sticking to my long-term research/engineering projects: writing and coding

20-Jul-2013

The first point of this post is that I'm a little fed up with changing the style and layout of my homepage, and I want to settle on a clear line. I'm a little dazed and confused with so many tweaks, not to mention what this can mean for my readers (I beg your understanding). Nonetheless, I usually despise critique that focuses on style first and content second. It is positive feedback on the latter that has brought me some freelance work, for example. Anyhow, I need a sort of stable brand that identifies me with my research engineer role. Lately I had come up with a catchy title, "researchineering", a portmanteau of "research" and "engineering", the two terms that best define my professional career. I first thought this was great, but now I see it is overcomplicated, missing the simplicity and straightforwardness that I pursue. Now I just go with my bunny (a present drawn on the back cover of Steven Levy's masterpiece) and a most simple design: no more double content column, no search box, and no lousy online resume (I am a curious telecom engineer in Barcelona, I speak English and I code; for further details, see LinkedIn).

Sticking to a long-term project is an important determination. And here I am with my 5-year-old homepage, right after having renewed the service contract with my provider. I'm ready to keep on sharing my thoughts and my rants, seeking to produce fruitful content that is of interest (at least to me), in addition to hosting online services that serve me well to validate business hypotheses against the real world (I don't want to live in a vacuum) and to improve my skills by iteration. I want to be constantly learning something new (sometimes re-learning for the sake of clarification), tackling challenging projects and reporting my experiments here. There is a high correlation between good writing and good learning, because writing entails learning. However, I am not an assiduous writer yet (I still tend to code more than I write), but I'm trying to develop into one. I am deeply content with Google Reader shutting down because it forced me to clean the dust off my old rusty list of feeds (I had long wanted to do that but never found the moment), and now I have a frightening amount of accumulated content that deserves my reading and subsequent digestion here.

In the end, writing is a creative process to acquire experience. It is a social activity that exposes one's work to others and reveals how to turn good products into great ones. But there is no creativity without facing the fear of failure. I guess the first step to overcome this fear is to maintain a publicly accessible blog. Done. But the challenge of writing better and keeping up with readers' expectations increases with time (well, I did mention I love challenging projects, so that's fine). Learning to write (and to speak publicly) is a requisite for becoming a good software engineer. This takes time and a lot of deliberate practice, just like programming. And to maintain professional appeal, one must do something that obviously takes a lot of effort and time, showing one can do the job. Hence, there I go. Writing and coding are my reasons to be here.



Post 80

Introducing VSMpy: a bare bones implementation of a Vector Space Model classifier in Python

27-Jun-2013

Following the good advice to publish old personal code projects on GitHub, this post introduces VSMpy. VSMpy is a bare-bones Python package implementing a standard Vector Space Model classifier (i.e., binary-valued vectors from a Bag-of-Words language model compared with the cosine similarity measure) in ready-to-distribute form. It illustrates:

  • package and module structure
  • configuration for build and installation
  • running scripts for testing
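
For the curious, the classifier's core computation boils down to very little code. Here is a minimal sketch of the idea (the function names are illustrative, not the actual VSMpy interface): with binary vectors, the cosine similarity between two documents reduces to the size of their shared vocabulary over the geometric mean of their vocabulary sizes.

    import math

    def bow(text):
        # Binary Bag-of-Words: the set of lowercased tokens.
        return set(text.lower().split())

    def cosine(a, b):
        # For binary vectors: |A & B| / sqrt(|A| * |B|).
        if not a or not b:
            return 0.0
        return len(a & b) / math.sqrt(len(a) * len(b))

    def classify(text, labeled_docs):
        # Return the label of the most similar training document.
        query = bow(text)
        return max(labeled_docs, key=lambda d: cosine(query, bow(d[0])))[1]

    docs = [("the match ended in a draw", "sports"),
            ("the stock index fell two percent", "finance")]
    print(classify("what a match", docs))  # -> sports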

In the end, this is almost the same as Bob Carpenter's (Alias-i, Inc.) pyhi skeletal project distribution, with a veneer of Text Classification on top. Nonetheless, a project like this one may come in handy when starting something new from scratch with Python. Python is a wonderful and powerful programming language that's starting to take over mammoths like Matlab for scientific and engineering purposes (e.g., for signal processing), including Natural Language Processing. I use it extensively at work, and many of the suppliers I deal with do as well. It is a great tool for product development, to iterate fast and release often. Paul Graham noted it explicitly in his "Hackers and Painters" book: during the years he worked on Viaweb, he worried about competitors seeking Python programmers, because that sounded like companies where the technical side, at least, was run by real hackers.



Post 79

New publication: Condition Based Maintenance On Board

16-Jun-2013

Rotating mechanical components are critical elements in the rail industry. A healthy condition of these mechanisms is vital to provide a reliable long-term service. In this regard, optimizing the corresponding maintenance operations with predictive technology offers many attractive advantages to operating and maintenance companies, with safety and life cycle cost (LCC) improvement (i.e., economic savings) being among the most important criteria. To this end, this paper describes the ongoing research and development work on Prognosis and Health Management at ALSTOM Transport, named Condition Based Maintenance On Board (CBM OB). The system is coined CBM OB after its purpose: to be flexible and easy to use and install on moving train units. The paper describes a general-purpose framework, with an emphasis on data processing power. It is based on a Wireless Sensor Network that is able to monitor, diagnose and prognosticate the health condition of different mechanical elements. The architecture of this framework is modular by design in order to accommodate data processing modules fitting specific needs, adapted to the peculiarities of the problem under analysis and to the environmental conditions of the data acquisition. Empirical experimentation shows that CBM OB provides a detailed analysis that is equivalent to that of other commercial solutions, even those with stronger hardware equipment.

This paper is to appear in September in the Chemical Engineering Transactions journal (vol. 33, 2013), and the reported CBM OB system is to be presented at the 2013 Prognostics and System Health Management Conference.



Post 78

Gaining control over the tools: goodbye Google Reader

25-May-2013

Gaining control is one of the most important traits to acquire with one's career capital (Newport, 2012). Having a say in what one does and how one does it entails having the liberty to do so, and relying solely on Google Reader to keep up to date with posted news is too much of a risk to accept. Thus, I truly celebrate Google's decision to shut it down. It made me realize how dependent on its service I was. Now I run Tiny Tiny RSS on my own server and I feel I'm a lot wealthier than I was before (citing Paul Graham, by the way), just because of my increased control over my tool. Since I mainly read about professional topics like software development, machine learning and business management, this is to be taken seriously.

Hosting one's own web services does cost some money, indeed. Nothing is for free, but freedom is priceless. There is no such thing as a "free web service" anyway. Users always pay with their trust and their personal information, which is then sold to advertising companies, like Google! Because Google is an advertising company, right? If Google can't deliver (i.e., stuff) personalized ads into your browsing experience, then it must change its strategy. And it is determined to rule the computing world through Chrome. Thus, its products must be able to do so, and Reader did not seem to do very well at this. Don't get me wrong, I believe Google provides wonderful service products developed by brilliant professionals; it just happens that I don't want to feel constrained by its business objectives. Therefore I have decided to provide myself with my own tool to gain control and autonomy. Will we be seeing more actions like this one in the months to come? Note in passing that Google Code has just deprecated its download service for project hosting.

--
[Newport, 2012] Newport, C., "So Good They Can't Ignore You", New York: Business Plus, 2012, ISBN: 978-1-4555-2804-2



Post 77

Lean Startup hackers were already there back in the early eighties

02-May-2013

The fancy "lean" adjective that accompanies every rocking tech business issue nowadays is already an old story. I found this out the other day while skimming through Steven Levy's groundbreaking book "Hackers: Heroes of the Computer Revolution".

image

Many hackers of the Homebrew Computer Club (HCC) followed these lean principles as a means to avoid building things that no one really wanted or needed:

  • Bob Marsh, following Ed Roberts of MITS, would announce his product first, and then collect the money required to design and manufacture it.
  • Lee Felsenstein would incorporate the user into the design of the product.
  • Steve Wozniak would sit in the back of the auditorium of the HCC, where the electrical outlet was, getting suggestions for improvements and incorporating those improvements into the Apple II design.

Considering that all these business approaches arose during the recessionary period of the early eighties, and that they served the American economy very well, perhaps they should still be regarded as being of the utmost importance nowadays.



Post 76

NLP-Tools broadens its capabilities with a RESTful API service

18-Apr-2013

In the software tool development business, the API is the new language of the developers, i.e., the customers. In this regard, nlpTools keeps pace with the evolution of the market and introduces its RESTful API service to facilitate its integration. And in that quest for added value and kaizen, it partners with Mashape to handle the commercialisation issues. The original website still maintains the evaluation service, but further performance features now need to be routed through the Mashape nlpTools endpoint.

image

In the dark jungle of validated learning through product iteration nlpTools relies on the five keys that make a great API:

  • Provide a valuable service: the amount of email asking for the API indicates that there is at least a fair deal of demand; it is therefore sensible to think that this is an added-value service product.
  • Have a plan and a business model: this is a tool for the software development market offering a paid service to scale up to customer needs.
  • Make it simple and flexible: a domain-specific service is represented by a single identity parameter, which can be tuned to customer needs by offering a differentiated service.
  • It should be managed and measured: Mashape keeps track of these measurement aspects and provides the necessary information to make informed decisions.
  • Provide great developer support: we are all working to deliver a wonderful experience to our customers, and considering that the service is still in its first stages, we pay a lot of attention to personalised developer support.

APIs may nonetheless have some caveats that could threaten the success of a project built around them, but most of them boil down to not having a paid option that guarantees a high quality of service. However, nlpTools does consider this commercial option and may thus scale up to the needs of the developers by contracting more powerful hosting features. The added value of the service (which is also its core business) lies in its customisation, that is, its ability to adapt to the particularities of the developer's problem, such as fitting the specific salient characteristics that represent their data.
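
By way of illustration, a client integration through Mashape would look roughly like the following Python sketch (the endpoint URL, query parameter and header name are hypothetical placeholders, not the documented nlpTools contract):

    import requests

    # Hypothetical call: URL, query parameter and API-key header are
    # placeholders; check the Mashape nlpTools listing for the real contract.
    response = requests.get(
        "https://nlptools.p.mashape.com/classify",
        params={"text": "This service is delightful to integrate"},
        headers={"X-Mashape-Key": "YOUR_API_KEY"},
    )
    print(response.status_code, response.json())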



Post 75

Foraging ants as living particle filters

24-Feb-2013

Ant colonies are admirable examples of cooperative societies. Some of their members are prepared to build their complex lairs, some constitute an army to protect the population, some explore the outer world and gather food, etc. With respect to the latter function, which to me is the most representative of ant colony behaviour, I coded a simple simulation in JavaScript inspired by the js1k competition (demo and code available here).

image

The ants in the app have been implemented as a state machine. Initially, they forage for food, drawing a random walk while they operate in this searching state. Once they find a source of nurture, the ants transition to another state where they return home, leaving a pheromone trail behind for others to follow. Finally, they end up in a loop, going back and forth collecting more food. And as time goes by, more and more ants flock to the food-fetching loop. Therefore, they get the job done more rapidly and minimise the danger from outside menaces.
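
A minimal sketch of that state machine could read as follows (rendered in Python here for brevity; the actual demo is written in JavaScript):

    import random

    class Ant:
        # Two-state machine: FORAGE (random walk) and RETURN (head home,
        # dropping pheromone on the way).
        def __init__(self, home):
            self.state = "FORAGE"
            self.home = home
            self.pos = home

        def step(self, food, pheromone):
            if self.state == "FORAGE":
                # Random walk on the grid until the food is found.
                dx, dy = random.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
                self.pos = (self.pos[0] + dx, self.pos[1] + dy)
                if self.pos == food:
                    self.state = "RETURN"
            elif self.state == "RETURN":
                # Mark the trail and move one cell toward home.
                pheromone.add(self.pos)
                self.pos = toward(self.pos, self.home)
                if self.pos == self.home:
                    self.state = "FORAGE"  # head back out for more food

    def toward(pos, target):
        # One Manhattan step toward the target cell.
        x, y = pos
        if x != target[0]:
            x += 1 if target[0] > x else -1
        elif y != target[1]:
            y += 1 if target[1] > y else -1
        return (x, y)

Biasing the forage walk toward pheromone-marked cells is what would make newcomers flock to the established trail.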

In a sense, foraging ants remind me of a particle filter where the particles are living beings moving stochastically to reach some objective. Thus, their behaviour could be cast as a biologically-inspired search algorithm for an optimisation procedure, considering that the objective is a cost function to be minimised.





All contents © Alexandre Trilla 2008-2024