April 27, 2007: 8:41 pm: Nikola Toshev: Uncategorized

Paul Buchheit, the author of Gmail, has a great post called “The problem with conventional databases”. He discusses how long-term trends in hardware influence the architecture options for heavily loaded servers. At Blue Edge we have had good success with non-database web server architectures, where the application keeps most of its data on the filesystem, loads it at startup and rarely touches the disc while running. This achieves great performance with little hardware.

Paul touches on an interesting topic, the price of Flash memory and its uses:

Finally, one more interesting stat: 8 GB of flash memory cost about $80

Flash has some weird performance characteristics, but those can be overcome with smarter controllers. I expect that flash will replace disk for all applications other than large object storage (such as video streams) and backup.

I think there is a great use for Flash memory right now: you can get your computer 128GB of RAM for about $1000, have your application work directly and only in RAM and still have all your data persistently stored when you reboot or in case of power failure! How can you do that?

First, let’s review the “weird performance characteristics” - here is a summary of what Wikipedia says:

  1. Flash transfer speed is relatively low: the “normal” speed seems to be about 10MB/sec, top speed I’ve seen in the specs is 32 MB/sec
  2. Reading is a little faster than writing. When you need to overwrite a block of memory, you first erase it, and then write the new data
  3. The erase-write cycles are finite; most Flash is guaranteed to work for at least 1 million cycles
  4. Flash memory, of course, doesn’t need power to preserve its state, and it is a random-access device

With the hardware available today you can use Flash for disc storage. It has a slower transfer rate than normal discs but no seek time. To overcome the speed limit, you can gang several flash devices together and access them in parallel, in RAID-0 fashion. Then you could install a database and have it operate faster than on a hard disc under specific access patterns (lots of small transactions).

However, databases are not optimized for this kind of operation. They are optimized to minimize seek, and they solve a lot of hard problems under assumptions that are not necessarily true anymore.

Alternatively, you can use memory-mapped files persisted on your Flash RAID array. Such a mapping makes the content of the file available at some memory location - you get a pointer and can read/write data at a random offset of the file. The virtual memory mechanism takes care of reading a portion of the file if it’s not in memory yet, and of writing it back if you changed it. This almost means you can treat it like any other memory and store arbitrary data structures. You can’t quite, because the base address of the mapping is usually different each time, so pointers stored within that memory area would be wrong after remapping the file. This problem is solved by specialized pointers (for example, offsets relative to the start of the mapping), so let’s forget about it for now.
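For the curious, here is a minimal sketch of the “specialized pointers” idea, under some loud assumptions: Ruby has no memory-mapped files in its standard library, so a plain binary String stands in for the mapped region, and the node layout, constants and method names are all hypothetical. The point is only that links stored as offsets from the start of the region stay valid wherever the region ends up in memory.

```ruby
# A linked list stored inside one flat byte buffer. Links are offsets from
# the start of the buffer, not raw addresses. The String is only a stand-in
# for a memory-mapped region (hypothetical layout).

NODE_FORMAT = 'l<l<'  # value and offset of next node, both signed 32-bit little-endian
NODE_SIZE   = 8       # bytes per node
NIL_OFFSET  = -1      # marks the end of the list

# Append a node to the buffer and return the offset where it was stored.
def append_node(buffer, value, next_offset)
  offset = buffer.bytesize
  buffer << [value, next_offset].pack(NODE_FORMAT)
  offset
end

# Follow the offsets starting at head_offset and collect the stored values.
def read_list(buffer, head_offset)
  values = []
  offset = head_offset
  while offset != NIL_OFFSET
    value, offset = buffer.byteslice(offset, NODE_SIZE).unpack(NODE_FORMAT)
    values << value
  end
  values
end

buffer = String.new                      # empty binary string
tail = append_node(buffer, 30, NIL_OFFSET)
mid  = append_node(buffer, 20, tail)
head = append_node(buffer, 10, mid)

# "Remapping": a copy of the bytes lives at a different address,
# yet the offsets stored inside it are still valid.
remapped = buffer.dup
p read_list(remapped, head)
```

With raw addresses instead of offsets, the copy would point back into the old buffer; with offsets, only the base of the region needs to be known after a remap.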

In this way you can basically work with flash storage space as if it were RAM. Your real DRAM becomes a cache for the persistent flash storage. You still need to tell it when to flush the pages to disc, and take care to keep some locality of data so that the “cache” is effective. You can have a 64-bit architecture with a huge virtual memory space and big 64 KB memory pages to make the transfer to/from flash more efficient. You need some software infrastructure to support this scheme, but it is nowhere near the complexity of a database.

You can play on the strengths of flash memory and get really decent results. Working with packs of USB sticks or Compact Flash cards seems messy and unreliable, but you can bet hardware manufacturers will produce “hard drives” made of inexpensive flash memory chips soon. The trends in hardware development do open new possibilities for software architectures.

Update: It seems flash memory performance has another quirk: random writes are very slow, much slower than random reads, sequential writes, or even hard disk random writes. This seems to be caused not by some inherent characteristic of the medium, but rather by controllers having to implement their algorithms using very little memory.

So, does the above still hold true? To some extent, yes. You can implement a datastore with multiversion concurrency control (the modern way to implement concurrency control anyway) that uses only sequential writes and random reads. Or you can wait for the next generation of smarter controllers.
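A minimal sketch of what such a datastore could look like, assuming a single writer and an in-memory index. The class name AppendOnlyStore, its methods and the record layout are hypothetical, and a StringIO stands in for the file on flash. Every update is appended at the end of the log (a sequential write), old versions are never overwritten, and reads seek directly to a recorded offset (a random read).

```ruby
require 'stringio'

# Hypothetical append-only, multiversion key/value store.
class AppendOnlyStore
  def initialize(log = StringIO.new(String.new))
    @log = log        # stand-in for a file stored on flash
    @versions = {}    # key => offsets of every version, oldest first
  end

  # Append a new version of key at the end of the log.
  # Returns the version number (0 for the first version).
  def put(key, value)
    record = value.to_s.b
    @log.seek(0, IO::SEEK_END)
    offset = @log.pos
    # record layout: 4-byte big-endian length, then the bytes of the value
    @log.write([record.bytesize].pack('N') + record)
    (@versions[key] ||= []) << offset
    @versions[key].size - 1
  end

  # Read the latest version of key, or an older one if version is given.
  # Returns nil when the key or the version does not exist.
  def get(key, version = -1)
    offset = (@versions[key] || [])[version]
    return nil if offset.nil?
    @log.seek(offset)
    size = @log.read(4).unpack('N').first
    @log.read(size).force_encoding(Encoding::UTF_8)
  end
end

store = AppendOnlyStore.new
store.put('color', 'red')
store.put('color', 'blue')
puts store.get('color')      # latest version
puts store.get('color', 0)   # the first version is still readable
```

A real implementation would also need to rebuild the index from the log at startup and reclaim space from old versions; both are left out here.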

March 7, 2007: 10:50 pm: Nikola Toshev: Uncategorized

My girlfriend is translating / scientific editing a book about the brain, with exercises. Lots of them are not directly translatable, and one is about solving anagrams. So she needed a bunch of anagrams in Bulgarian, and making an anagram is very different from solving one (and harder). I set out to write a program for generating anagrams from a wordlist (I didn’t find such a program on the net).

First, a wordlist is available as part of OpenOffice’s Bulgarian language support.

So, how do you find the anagrams in a wordlist? Anagrams are words composed of the same letters, in a different order. You could start with the list and, for every word, make a pass over the list to find its anagrams. The problem is that this gives an O(N^2) solution, which would be way too slow for realistically long wordlists.

The better way to do it is with a single pass over the wordlist, somehow memoizing the words and comparing every incoming word with what you already got. “Comparing” of course would be using a hashtable. So we need a word signature that is invariant for anagrams. This means that two anagrams would have the same signature, and words that are not anagrams would have different signatures. In this way we can do simple hash lookups to check if the current word has an anagram in the ones we already passed.

One simple representation is to have all the letters in the word and their respective counts. So, in Ruby:

signature("anagram") = { 'a' => 3, 'n' => 1, 'g' => 1, 'r' => 1, 'm' => 1 }

We’d want to use this Hash object as a key in another Hash mapping it to the words themselves. We can’t do this directly, as Hash#hash (in the Ruby version I am using) is based on object identity and not on the values inside the table. We could override Hash#eql? and Hash#hash, but here I’ll just convert it to a string and use that as the key.

dictionary = Hash.new { |hash, key| hash[key] = [] } # signature => words
File.foreach('bg_words.dat') { |line|
        word = line.chomp
        next if word.length < 8 # interested just in longer words
        # count how many times every letter occurs in the word
        sig = Hash.new(0)
        word.each_char { |c| sig[c] += 1 }
        # the sorted letter/count pairs, joined into a string, are the same for all anagrams
        sig = sig.sort.join(',')
        dictionary[sig] << word
}
# print every group of two or more words that share a signature
dictionary.find_all { |key, value| value.length >= 2 }.each { |key, value|
        puts value.join(",")
}
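For comparison, a shorter variant of the same single-pass idea. It swaps in a different signature: instead of counting letters, it simply sorts them, which also gives the same string for all anagrams. The method names and the English sample words are mine, just for illustration.

```ruby
# Signature built by sorting the letters: identical for all anagrams of a word.
def signature(word)
  word.chars.sort.join
end

# Group the words by signature and keep only groups of two or more words.
def anagram_groups(words)
  words.group_by { |word| signature(word) }
       .values
       .select { |group| group.size >= 2 }
end

words = %w[listen silent enlist google banana]
anagram_groups(words).each { |group| puts group.join(',') }
```

Sorting costs a bit more per word than counting, but for words of ordinary length the difference hardly matters.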

All this works well, but I am also interested in the question of which anagrams are harder to solve than others. This will be the subject of a subsequent post.

: 2:23 am: Nikola Toshev: Uncategorized

Continuing the topic of beliefs in software development.

The stereotype of programmers (geeks) says they love open source and hate Microsoft. But often it is the other way around - look at any social site dedicated to Microsoft technologies.

This strikes me as next to impossible in theory. A religion (or any contagious belief) usually appeals to higher ideals. Freedom works perfectly for Open Source (especially for Americans, as freedom is the defining value of being American). Microsoft has nothing to offer here - they act out of selfish commercial interest. It is debatable whether the community (industry, etc.) benefits from Microsoft. How could they create a cult following?

I started programming professionally in Microsoft Visual Basic 5. The most recent thing I did for the Microsoft platform was a toolbar for their browser about a year ago, which involved C++ and COM programming, interfacing with IE. I went through several substantially different platforms and languages in between and after, so I think I am pretty qualified to talk about this from experience.

The root of the Microsoft religion seems to be the convenience of living in a monoculture. Developers get accustomed to the tools Microsoft gives them, and it is a very complete set of tools that work in a pretty consistent way. You can do anything with Microsoft tools, as long as you release your software for Windows. It is not that Microsoft’s tools are the best in every area (in some they are among the best); it is about being accustomed to them. Even going out and searching for something better is a pain, and you may well not recognize that it is better unless you actually invest time in it. On the other hand, the label “Microsoft” on some random technology almost guarantees it is somewhat decent.

From that point of view, the open source world is pretty hostile. It is full of random projects, and most of them are bad or don’t work for you. Even the good ones often lack documentation and convenience, and they are not consistent with one another. And the biggest advantage, access to the source, Microsoft developers don’t even notice, because they are not accustomed to having it. (I was surprised to see our .NET development team looking at Mono’s source to figure out some details of the .NET framework - instead of decompiling, for example.)

So, one problem is that programming only in a Microsoft environment makes you less adaptive. But I think the worst way it cripples you is that you get accustomed to waiting for Microsoft to release their next big thing that enables you to do things in a better way. It rarely crosses your mind that you can invent this new way yourself, because you think you need enormous resources. More importantly, you need to mentally escape from the convenient model Microsoft currently gives you. It is no coincidence that most .NET innovations outside Microsoft have a Java history (NHibernate, ReSharper, etc.).

All this can explain why perfectly good developers can feel good about The Microsoft Way ™. Being religious about something is easier than trying to evaluate alternatives.

December 14, 2006: 12:49 am: Nikola Toshev: Uncategorized

I watched The Departed recently.

The first lines in the movie were:

I don’t want to be a product of my environment. I want my environment to be a product of me.

By the mafia boss Frank Costello (Jack Nicholson).

Too bad the movie didn’t quite live up to the good start.

November 24, 2006: 5:00 pm: Nikola Toshev: Uncategorized

It is beyond any doubt for me that in software development, people do their jobs supported mostly by faith. That’s curious, because programming itself mercilessly points out your own mistakes and demands that they be fixed, and your good intentions or anything else but correct code make no difference. So you’d think your beliefs get validated, daily, by your programs that work. This is indeed the case on the small scale. Then complexity surges exponentially with the number of elements in whatever you are doing, and you get only partial validation. On the large scale you find your programs and beliefs mostly, and never perfectly, working. You can get no meaningful metrics for the most useful things: your performance or how bug-free your program is, for example.

That’s why programmers sign up early with sets of beliefs, and then occasionally change them as they get imperfect feedback from what they actually accomplish. Some examples are that you should program in the X programming language because it is fastest, or that you will be most productive in the Y programming language, that the Z framework will make your life so much easier, or that you should have the Cappa process in place to get consistent results from your job.

You hear hype about new things all the time, but the results depend too much on subtleties in the context where they are applied. None of the claims is truly verifiable. This is sad.

August 23, 2006: 6:30 pm: Nikola Toshev: Uncategorized

Sisley ad

Branding is ubiquitous in modern markets, although it is a popular notion that brands add no real value for the customer. For every piece of clothing by a big name brand you are likely to find another, no-name alternative with same or higher quality which costs less. That is, if you are willing to invest your time in searching.

The internet may strongly reduce this differentiation of brands by making things easy to find and generally matching customers with goods. Currently this is achieved by consumer reviews and ratings and automatic recommendations on sites like Amazon. These mechanisms do not realize their full potential (partly because Google is so bad at searching commercial stuff), and yet we have started to see changes in shopping patterns like the “long tail” concept.

So if quality products become easy to find via the internet, will brands become devoid of meaning? They will still matter a lot for impulse purchases, for one. More importantly, brands are associated with feelings that complete the self-image of the customer – he/she may feel more “sporty” for choosing Adidas, for example. This kind of value would not decrease in any way.

There is another effect, however, which is similar to the last one, but may actually become much more important. I’ll give an example with Sisley, which is one of the few brands that seem to exploit it. I have noticed Sisley billboard photographs for their very specific provocative and sexual spirit. Well, I just did a quick search in Google Trends and I realize this brand is probably unknown outside Continental Europe and a part of Asia, but anyway - think about the heading photo.

If you see someone wearing Sisley clothing (perhaps only recognized by the label), you make an association between the Sisley advertising style and the wearer. In this way the brand becomes a tool for the buyer to convey a pretty specific message of self-expression. This doesn’t stem from the design or colors or anything else but the brand itself. The brand becomes a symbol - something that everyone recognizes in the same way. The way pushed by advertising.

What should brand managers do differently from what they already do, which is trying to create a set of positive feelings toward the brand? They should put less stress on quality and be more specific about the message the brand sends. It shouldn’t be a universal message appealing to as many people as possible; it should rather be a specific message that appeals very strongly to a limited number of people. The buyers should choose the brand to express parts of their personalities, and different brands would suit different personalities. Also, the advertising should be targeted not only at the potential buyers, but also at their social environment, in order to achieve the desired effect.

This may seem to work only if you cover everyone in a geographic region with your ad campaign and make your brand really recognizable. But that is not necessarily the case: not everyone needs to recognize the brand, just the people who matter to your potential buyers. As we see subcultures proliferating and growing stronger, a brand may target just a specific subculture. Interestingly, the internet both facilitates higher diversity and strength of subcultures and makes more specific targeting of advertising possible.

July 10, 2006: 10:44 pm: Nikola Toshev: Uncategorized

There is a summer school in cognitive science at NBU. In the first week I attended just a neuroscience course by Kalina Christoff (great lecturer); too bad Jeff Elman couldn’t come. During the second week the focus is on the cognition of apes and young children, which looks very interesting too.

If only I could get enough sleep, it would be perfect.

June 29, 2006: 9:08 pm: Nikola Toshev: Uncategorized

I like to be able to listen to all mp3s linked from a page with just the click of a button, without downloading or copy-pasting individual links into Winamp for streaming. The m3ugen bookmarklet for Firefox does just that, all happening client-side in your browser, without referring to external sites.

I had problems using it on certain sites, so I modified it to work everywhere (or at least the sites I encountered ;-) ).

Enhancements include:

  • handling pages with frames
  • removing duplicated entries
  • treating a link as an mp3 if the text ends with .mp3, even if the link itself doesn’t have the extension
  • isolating identifiers used in the bookmarklet from the page namespace to avoid conflicts
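The link-filtering rules above can be sketched roughly like this. Note the assumptions: the bookmarklet itself is JavaScript running in the browser, while this is a Ruby illustration that works on already extracted (URL, link text) pairs; the method names and example URLs are hypothetical.

```ruby
# Collect mp3 URLs from [href, link text] pairs:
# a link counts as an mp3 when its URL or its visible text ends with ".mp3",
# and duplicate URLs are dropped (the first occurrence wins).
def mp3_links(links)
  links.select { |href, text| href =~ /\.mp3\z/i || text.strip =~ /\.mp3\z/i }
       .map { |href, _text| href }
       .uniq
end

# Build the body of a simple m3u playlist: one URL per line.
def m3u(links)
  mp3_links(links).join("\n") + "\n"
end

links = [
  ['http://example.com/a.mp3',      'Track A'],
  ['http://example.com/get?id=7',   'track-b.mp3'],
  ['http://example.com/a.mp3',      'Track A again'],
  ['http://example.com/about.html', 'About']
]
print m3u(links)
```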

Here is the modified bookmarklet:
m3ugen

Drag the link to your Bookmarks toolbar in Firefox. Then load a page with links to mp3 files and click on it. You will be asked to open an m3u file, so just confirm it and you’ll hear the music.

: 9:03 pm: Nikola Toshev: Uncategorized

If you do code reviews with newbies you have inevitably seen it.

if (myBoolVar==true) {

Or:

if (_some_expression_) {
    return true;
} else {
    return false;
}

Of course this can and should be reduced to:

if (myBoolVar) {

And:

return _some_expression_;

If you don’t think there is a lot of code of the first kind floating around, try searching with Google. You would expect to be able to use the specialized source code search engines, but they aren’t any better at searching; they just add a glamorous UI and some metadata. They don’t even get tokenization right, and they ignore == or brackets.

Ok now, back to the code. Some people even argue the first kind is “easier to understand”. Why do they find such code easier to write and understand, when there is a direct, shorter version with exactly the same semantics?

I think there are two ways that people use to think about boolean expressions. One is about a type with values true and false, and (rarely) boolean operations to perform upon them. The other is about checking the state and conditions in your program. These two vaguely correspond to a formal logic system and the semantics behind it.

When you use statements like if or while, you are in “semantic mode”. When you check a value of a boolean variable you may fall out of it to “syntactic mode”. If the variable is not properly named it is easier to think like “is this flag true?” (syntactic) instead of e.g., “is the connection secure?” (semantic mode).

Obviously you are a better programmer if you operate in semantic mode and apply the syntactic operations automatically. Semantic mode is closer to the programmer’s intent and code reviews should enforce it, thus facilitating deeper understanding of the program.

If you happen to hear a programmer complaining about the lack of boolean XOR operator, he has probably fallen into syntactic mode. Otherwise he would have realized that what he needs is the inequality operator. A smell for expressions written in syntactic mode is the use of boolean constants as part of the expression.
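A small illustration of the XOR point, sketched in Ruby rather than in the C-style syntax of the snippets above. The method and parameter names are hypothetical.

```ruby
# Exclusive-or spelled out with boolean operations ("syntactic mode").
def xor(a, b)
  (a && !b) || (!a && b)
end

# The same check in "semantic mode": the two settings differ.
def settings_out_of_sync?(local_enabled, remote_enabled)
  local_enabled != remote_enabled
end

# Both agree for every combination of boolean inputs.
[true, false].product([true, false]).each do |a, b|
  puts "#{a}, #{b}: xor=#{xor(a, b)} inequality=#{settings_out_of_sync?(a, b)}"
end
```

The equivalence holds only when both operands are real booleans, which is exactly the case in question.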

June 21, 2006: 4:01 pm: Nikola Toshev: Uncategorized

I tend to write in Bulgarian (my native language) disregarding some rules although I know and can follow them. For example, I often don’t place all the commas, emphasizing only the main structure of a sentence with several subordinate sentences (an example follows).

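Например често не поставям запетайките при всички подчинени изречения, за да подчертая главната структура и да бъда по-лесно разбираем (тук липсват 2 запетайки).

(In English: “For example I often don’t place the commas for all subordinate clauses, in order to emphasize the main structure and to be easier to understand (2 commas are missing here).”)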

I think this is ok with informal text. After all, the primary purpose of language is communication, and if a modified or a slang version seems to work better, why not use it? This might be obvious to someone coming from a culture that uses slang heavily (e.g., an American).

Disclaimer: Any English errors in this blog are unintentional. I’m not sure what my text here says about my code ;-)

On the other hand, I expect good programmers to be able to use natural language correctly. There should be a relation between someone’s ability to use natural and artificial languages. In a programming interview, if someone understands his last project’s architecture but is unclear in describing it, will he be able to write clean code? Maybe understanding in natural language is a somewhat subjective thing, but if you can’t easily understand the guy talking about technical stuff, that’s a problem anyway, even if it’s not really his fault.

Programmers often mistakenly call spelling and other errors “syntax errors”, because their compiler calls them that. But syntax is only about word order. When you mistype a keyword, the compiler thinks it is an identifier out of place, so it calls it a syntax error. You should know better (ideally compilers should too, but current parser theory pretty much ignores the problem).
