trees

Archive for September, 2010

More sed trickery

I recently needed to do the following on the Linux command line:

  1. Search through the files within a directory for a certain bit of text
  2. List each file containing that text, once per file

This required a bit of jiggery. First I’ll show how to do it using three separate commands, then I’ll show the shortcut method I actually used, which really displays the power of Linux/Unix command line workflows.

First we have to use grep on the files within a directory, which is done like this:

grep -rin keyword(s) directory > outputfile

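Then we have to get rid of the non-useful bits that grep outputs, keeping only the filename before the first colon. The result goes to a new file, because redirecting to the same file we are reading from would empty it before sed gets to read it:

sed 's/:.*//' outputfile > filenames

Finally we have to reduce that list to the unique filenames:

uniq filenames > uniquefiles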
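
As a rough, self-contained sketch of those three steps run end to end: the directory and file names below are made up for illustration, and each step writes to its own file, since the shell truncates the target of `>` before the command gets to read it.

```shell
#!/bin/sh
# Hypothetical sample data; the directory and file names are invented for this sketch.
work_dir=$(mktemp -d)
mkdir "$work_dir/docs"
printf 'keyword once\nkeyword twice\n' > "$work_dir/docs/first.txt"
printf 'no match here\n' > "$work_dir/docs/second.txt"

# Step 1: recursive, case-insensitive search with line numbers.
grep -rin keyword "$work_dir/docs" > "$work_dir/matches.txt"

# Step 2: keep only the file name (everything before the first colon).
sed 's/:.*//' "$work_dir/matches.txt" > "$work_dir/names.txt"

# Step 3: collapse adjacent duplicate names.
uniq "$work_dir/names.txt" > "$work_dir/unique_names.txt"

step_result=$(cat "$work_dir/unique_names.txt")
echo "$step_result"
rm -rf "$work_dir"
```

Note that uniq only removes duplicates that sit next to each other. That is enough here because grep prints all the matches from one file together.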

It is quite neat that we can do it at all with the commands shown above. But here is the pièce de résistance, Linux/Unix piping:

grep -rin keyword . | sed 's/:.*//g' | uniq > outputfile.csv

Here we see I’ve piped the output of grep into sed and then into uniq, which outputs its result into a CSV file… so none of that working-on-and-saving-files-multiple-times malarkey. All your lines are consolidated into one easily manageable line :-)
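
For what it's worth, grep can also do the whole job on its own: its `-l` option prints the name of each matching file once instead of the matching lines. The sketch below runs both approaches on some throwaway sample files (the names and contents are hypothetical) and checks that they agree. Both results are sorted only so that the comparison does not depend on directory traversal order.

```shell
#!/bin/sh
# Hypothetical sample data; the file names and contents are invented for this sketch.
demo_dir=$(mktemp -d)
printf 'needle here\nand a second needle\n' > "$demo_dir/a.txt"
printf 'nothing to see\n' > "$demo_dir/b.txt"
printf 'one Needle only\n' > "$demo_dir/c.txt"

# The pipeline from the post: strip everything from the first colon onwards,
# then collapse adjacent duplicate file names.
pipeline_result=$(grep -rin needle "$demo_dir" | sed 's/:.*//' | uniq | sort)

# The shorter form: -l lists each matching file name once.
short_result=$(grep -ril needle "$demo_dir" | sort)

echo "$pipeline_result"
echo "$short_result"
rm -rf "$demo_dir"
```

The piped version is still the better demonstration of how small tools chain together; `-l` is just handy when the file names are all you want.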

Hopefully that’s quite a neat example of piping, grepping, sedding and uniqing… feel free to use, and feel free to comment!

Back in the office / Language Processing

Back in the office

First of all, I’d just like to say that I am back in the home-office after the honeymoon, and working hard! So I am now picking up emails, having meetings and doing work as usual… so do feel free to get in touch.

Language Processing

Secondly… Beki and I went to Sicily (to a town called Cefalù) on our honeymoon, and before we went I decided that it would probably be a good idea to learn some of the basics of Italian. I know a bit of Spanish, a tiny bit of French and a very tiny bit of Swedish, so picking up the basics of another language could potentially be a bit easier than learning a language from scratch. While learning Italian it came to me that Italian and the other Latin-based languages are incredibly “rules-based”: they have quite strict grammar rules which are almost always followed, and even the shortening of sentences follows a certain rule. This differs from English, which I would call “lexicon-based”, where words have set meanings with only a very subtle grammatical influence and are strung together using very liberal grammar rules.

The question, then, is whether it would be easier (i.e. more semantically viable) to do computational language processing, in its logical form rather than its statistical form, on a language such as Italian, or even Latin.

This is relevant for my research and development in Natural Language Processing, where I have always favoured symbolic processing over the connectionist or statistical approaches. Maybe it is something I need to look into in more detail, and maybe there is some research about it elsewhere (if my readers know of any then please do let me know, either by a comment or by sending me an email!).