18 June 2016
by Deimos
0 comments

Do not waste time with grep anymore, use tag



It's been a long time since my last post. This one is not very technical, but it's a nice-to-have if you use grep regularly. Aren't you fed up with typing vim <file> and searching for the line again after every grep command? If so, this post is for you.

First of all, you may know that a more user-friendly alternative to grep exists, called ag (see the perf comparison). I really like both ag and grep, but something has been wasting my time for several years, and I'm pretty sure it's your case too.

Let's say you generally want to grep through files, find the information and jump into it with vim. With tag (which only works on top of ag) you can do it in only 2 commands. Let's say I want to search for my username in /etc:
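A sketch of the workflow, assuming the shell integration suggested by the tag project (a wrapper function that sources the alias file tag regenerates after each search):

    # One-time shell setup (e.g. in ~/.bashrc): source the generated aliases
    tag() { command tag "$@"; source ${TAG_ALIAS_FILE:-/tmp/tag_aliases} 2>/dev/null; }

    # Command 1: search; every match is printed with a shortcut number [1], [2], ...
    tag deimos /etc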

Now if I want to edit /etc/gshadow at line 54 with vim, I only have to type in the terminal:
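Assuming the match was numbered 10 in tag's output, with the default alias prefix that's:

    # Command 2: open match number 10 (here /etc/gshadow, line 54) in vim
    e10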

"e" will read a temporary file containing the results, and 10 will tell vim to jump to the matching line number in the file.

User friendly and really simple, right?

Hope this helps!

21 January 2016
by Deimos
1 Comment

Zero downtime upgrade with Ansible and HAProxy



Some of you may not be familiar with the terms "rolling upgrade" or "rolling restart". This is the action of upgrading or restarting a cluster without service interruption (aka zero downtime). In most cases this is done node by node, but it really depends on the technology you're managing and on the number of active nodes in your cluster.

At Nousmotards we have several Java Spring Boot applications running. Restarting one application can take up to 1 minute, and during that time the service was down, which is a shame when you deploy multiple times per day. To avoid this, I wrote an Ansible playbook that performs a rolling upgrade with the help of HAProxy. I'm going to share it with you and explain in detail how it works.

First of all, you need to understand our infrastructure. Each of our Java micro services runs at least two instances, with a load balancer in front in active/active mode. All our applications are stateless, which spares us from worrying about data. Here is a simplified example of our infrastructure for a single Java app:

[diagram: nm_haproxy_all]

The goal is to perform a live upgrade without losing users' connections and without bringing the service down. I'm going to explain the playbook step by step:
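The play header might look like this (a sketch; the names are illustrative):

    - name: Rolling upgrade
      hosts: "{{ nm_app }}"
      serial: 1                  # one host at a time
      any_errors_fatal: true     # abort everything on the first failure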

Here you can see the hosts parameter as a variable (nm_app is the name of the application, and the container DNS name as well) because I don't want something specific, but something applicable to any kind of application. This variable is in fact a group of machines pulled out of Consul (it could just as well come from a flat file).

The serial parameter means those tasks should not run on several hosts in parallel (limited to 1 host at a time). If you've got a large set of machines, you can define a percentage instead. This will stop the upgrade/restart if it fails at some point.

The any_errors_fatal parameter ensures that the play stops as soon as an issue occurs. In case of a failure, I personally prefer to repair the issue manually rather than take the service down and perform a full restart. It's a safety option.
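A sketch of that first check, using the stock Ansible haproxy module (the socket path and the haproxy_host variable are assumptions):

    pre_tasks:
      - name: Check that the application is registered in HAProxy
        haproxy:
          state: enabled
          backend: "{{ nm_app }}"
          host: "{{ inventory_hostname }}"
          socket: /var/run/haproxy.sock
        delegate_to: "{{ haproxy_host }}"
        register: app_in_haproxy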

Here I'm running pre_tasks to pull the application out of HAProxy (meaning setting it in maintenance mode, so no new connections will be established to this application). With this first task, I just want to be sure that the application I'm targeting is registered in HAProxy, to avoid a playbook failure during the next steps. Of course, HAProxy is not on the same host as the application; that's why I'm using the delegate_to parameter, pointing to a dictionary with the host information of the application (which, in my case, is running inside a Docker container, but could be resolved from anywhere, e.g. the Consul kv store).

The result is stored in the app_in_haproxy variable. Now the idea is to put HAProxy in maintenance mode for the current host and stop the service, to get something like this:

[diagram: nm_haproxy_right]

I'm using Monit to manage the services inside Docker. That's why I stop the service through Monit (note that I'm not using the monit module in Ansible because it's buggy). Then I wait for the application port to be closed. That way, I'm sure the application is gracefully stopped and no clients are connected anymore:
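Roughly like this (again with assumed variable names):

    - name: Put the current host in maintenance mode in HAProxy
      haproxy:
        state: disabled
        backend: "{{ nm_app }}"
        host: "{{ inventory_hostname }}"
        socket: /var/run/haproxy.sock
        wait: yes                        # wait for active sessions to drain
      delegate_to: "{{ haproxy_host }}"

    - name: Stop the application through Monit
      command: monit stop {{ nm_app }}

    - name: Wait for the application port to be closed
      wait_for:
        port: "{{ app_port }}"
        state: stopped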

Now I'm calling a role that ensures my application is installed in the right version. If it's not, it performs the upgrade:
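Something like this (the role name and version variable are assumptions):

    roles:
      - role: app
        app_version: "{{ app_version }}"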

When finished, the new application is deployed. We can now start everything in reverse order:
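A sketch of the post_tasks:

    post_tasks:
      - name: Start the application through Monit
        command: monit start {{ nm_app }}

      - name: Wait for the application port to be open
        wait_for:
          port: "{{ app_port }}"
          state: started

      - name: Enable the host back in HAProxy
        haproxy:
          state: enabled
          backend: "{{ nm_app }}"
          host: "{{ inventory_hostname }}"
          socket: /var/run/haproxy.sock
        delegate_to: "{{ haproxy_host }}"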

To be sure HAProxy has had enough time to switch to the green state, I poll the state for up to 20s and only continue when it's green. This is normally not necessary, but it's a useful extra guarantee:
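One way to do it is to poll the HAProxy stats socket until the server line reports UP, with 10 retries spaced 2 seconds apart (20s in total); the exact check below is an assumption:

    - name: Wait for HAProxy to report the server as UP
      shell: echo "show stat" | socat stdio /var/run/haproxy.sock | grep "{{ nm_app }},{{ inventory_hostname }}" | grep -c UP || true
      delegate_to: "{{ haproxy_host }}"
      register: haproxy_state
      until: haproxy_state.stdout == "1"
      retries: 10
      delay: 2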

This host is now delivering the latest version of the application. Ansible will now continue with the other hosts, one by one:

[diagram: nm_haproxy_left]

When finished, all apps are upgraded without any downtime and without stress.

This kind of solution is not new, but with the help of Ansible, it becomes really easy.

20 January 2016
by Romaric Philogene
0 comments

Elasticsearch: how we speed up news feed and user wall display



In my previous post, I talked about how to speed up reads and writes by coupling Redis with Neo4J. Now I want to share with you how you can offload your server and use Elasticsearch to speed up the news feed and user wall.

1/ Show me your news feed

A news feed is the main page of traditional social networks. Its main goal is to show you all the recent updates from your friends and interests. The Nousmotards news feed is like any other social network's (Facebook, LinkedIn...): users' feeds are available from their web browser and from the NM mobile application, and it looks like this.

[screenshot: web news feed]

The web feed above, the Android feed below.

[screenshot: Android news feed]

As a user, you want the most relevant content and the most recent posts from your friends. And while you're scrolling, you want the best experience possible: no glitches or freezes in the UI, no latency while loading images and batches of news, all ordered from more to less relevant.

2/ First attempt

In our first approach, we used Neo4J directly to read data and display it to users. Direct database access is expensive: with 40 concurrent accesses, your server becomes so slow that nobody will ever come back.

The data to retrieve is unique to each user, because it depends on what they liked, who they are "friends" with, what the latest relevant items are... This is contextual data to retrieve for each user.

[diagram: per-user graph traversals]

I'll let you imagine the complexity of this, and the effort the database has to make to respond to 40 concurrent users, enrich the results and serialize them all into JSON before sending them to the requester (note: red dots are users).

Enrichment is the process of setting some properties depending on who made the request. For instance: Romaric has liked a post from Pierre; when Romaric requests his feed page, he retrieves the post he liked in that state, and Pierre sees that his post has been liked by Romaric.

Enrichment processing is very time-consuming, and must be done for each document (10 documents per page) that you want to send to the requester.

[diagram: enrichments]

4 requesters means 4 read queries in the database, plus 40 enrichments resulting in 40 more database reads... the server took at best 4 seconds to respond to each of them. Very bad for user experience. The more requesters you have, the longer the service takes to respond: latency ≈ O(n²), where n is the number of simultaneous requesters.

3/ Elasticsearch to the rescue

Elasticsearch is now a very popular document store, heavily used by ops and developers. It's commonly used to store logs, search through them and run sophisticated analytics. Access times are very good, and the product has a large community driving it in the right direction.

At Nousmotards we are using it for log analytics, for monitoring and, above all, as the cache solution for feeds and walls!

Instead of using Neo4J directly to ship documents to the user, we placed Elasticsearch between Neo4J and the "core app": documents are retrieved from Elasticsearch, then enriched from Redis, before replying to the user.
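For example, one feed page becomes a single Elasticsearch query, something like this (index and field names are made up for illustration):

    # Fetch one feed page (10 documents) for a given user, newest first
    curl -s -XPOST 'http://localhost:9200/feeds/post/_search' -d '{
      "query": { "term": { "recipient_id": 123 } },
      "sort":  [ { "created_at": { "order": "desc" } } ],
      "from": 0,
      "size": 10
    }'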

[diagram: Elasticsearch between Neo4J and the core app]

Did Neo4J disappear? No, it's just no longer used for read requests. Now, with 40 concurrent requests, we are between 200 and 300ms per reply; that's a huge difference! Neo4J is no longer bothered by users fetching their latest updates, and it can process write requests smoothly (stats from yesterday: 41,245 requests, 260ms average response time).

[screenshot: request statistics]

4/ To conclude

Elasticsearch is our life saver; we are using it for more and more things at Nousmotards. On the other hand, it is not a database, and you should not use it as your primary datastore! Corrupt indices may happen, and you should be able to reload all your data from scratch. If you know that, you can use it at its full potential and with no fear of breaking things 😉

To get more information on Nousmotards: blog.nousmotards.com

Join us! 🙂


15 January 2016
by Romaric Philogene
2 Comments

How we use Neo4J on our social network and work around performance issues



Neo4J is the most popular graph database in the NoSQL family. We're using it in the Nousmotards project to store all social data as nodes and relationships.

Today I am going to talk about our experience after using it for 2 years now (take a look at directed graphs if you are not really familiar with graph databases).

1/ Our use case

Nousmotards is a social network for bikers; we aim to provide valuable tools to ride, meet and share good times with people who love motorcycling.

Neo4J is how we manage connections between riders. Connections are how people interact with each other, what they do, how and where... All actions are represented by relationships (edges) and nodes (vertices).

For instance, if you ask someone to be your friend, it's just a relationship between the two of you.


Now imagine that you want to express that A asked B to be their friend: the relationship gets a direction, and that is what makes it a directed graph.


Each node and relationship can have properties. It's like properties on an object instance, except that each instance of the same type can have totally different properties if you want. It's schema-free by design.


Our social network is currently composed of 100k+ nodes, 300k+ relationships and 500k+ properties. We access Neo4J in server mode through its REST API, more precisely via Spring Data Neo4J 3.4.x. Server mode gives you multiple clients and HA features at the cost of performance (note: if you are looking for intensive graph processing, you should consider using embedded mode or developing your own server extension). This way (using server mode), our micro services can access graph data at any time.

2/ Fast to implement

At first use, Neo4J was really impressive at finding connected data, even when it's really "far away". Neo Technology wrote their own query language, called "Cypher", which lets you express queries to retrieve data easily. For example, getting friends of friends (FOF) is amazingly easy:

MATCH (a: Rider)-[:IS_FRIEND]-(b: Rider)-[:IS_FRIEND]-(c: Rider) WHERE id(a)=123  RETURN c;

And if you want to avoid returning riders who are already friends with A:

MATCH (a: Rider)-[:IS_FRIEND]-(b: Rider)-[:IS_FRIEND]-(c: Rider) WHERE id(a)=123 AND NOT (a: Rider)-[:IS_FRIEND]-(c: Rider) RETURN c;

Definitely easy.

In this way you can implement all the social features intuitively. This is how we manage friends, likes, who uploaded photos, where you live and many other biker things.

3/ The reality

Let's talk about the reality of using Neo4J, and about performance. The Nousmotards backend exposes a RESTful API, directly consumed by the browser and the mobile (Android and iOS) applications. A user will consider your application bad quality if loading data is too slow. That means your end-user app is badly designed and/or your API has high latency (> ~2s) and low throughput, which means your backend stack is slow to fetch data, process it and send it back to the requester... And this is what happened to us!

The first mistake we made was to think we would be able to fetch data in real time from Neo4J. That was a really bad idea!

You should simply forget about doing this if you are using Neo4J in server mode. Server mode goes through the REST API, and the overhead of opening an HTTP connection > sending your request > the server deserializing it > fetching the data > serializing the response > sending it back to your application > closing the HTTP connection costs so much that you will never get past it. Note that Spring Data Neo4J adds an abstraction layer, and its own performance penalty, on top of that.

How did we work around it? By putting a cache layer between our application and Neo4J. This layer intercepts each call to the Neo4J API and retrieves the data from Redis first, falling back to Neo4J if it's not available there.
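The principle is the classic cache-aside pattern. Our layer is Java, but it boils down to something like this sketch (key names, TTL and endpoint are illustrative):

    # Try Redis first, fall back to the Neo4J REST API and populate the cache
    get_rider() {
      local id="$1" cached fresh
      cached=$(redis-cli GET "rider:${id}")
      if [ -n "$cached" ]; then
        echo "$cached"                                           # cache hit
      else
        fresh=$(curl -s "http://localhost:7474/db/data/node/${id}")
        redis-cli SETEX "rider:${id}" 3600 "$fresh" > /dev/null  # cache for 1h
        echo "$fresh"
      fi
    }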


This drastically improved read performance. It's not only about network overhead, but also about how long Neo4J takes to respond: it uses locking to ensure data coherence, which is a good thing, but which is bad for performance when you expect to read or write data in near real time.

The second problem was write performance. As with almost every database, what you have to do is batch your inserts into one statement instead of issuing one statement per insert. The Neo4J REST API gives you the possibility to do batch inserts, but SDN (Spring Data Neo4J) doesn't (there is an unofficial workaround available)! A user who posts a new message does not want to wait 10s to see it appear; that is a very bad user experience, and your app will surely lose its users.
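For reference, Cypher itself can batch writes in a single statement with UNWIND and a parameter (hypothetical labels and properties):

    UNWIND {posts} AS post
    MATCH (r: Rider) WHERE id(r) = post.author_id
    CREATE (r)-[:WROTE]->(p: Post {text: post.text, created_at: post.created_at});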

What we came up with to solve write latency was the following (sketched after the list):

  1. Use a queue to append write intents to, and process them in the background.
  2. Respond to the user directly.
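For illustration, with a Redis list as the queue (key and payload are made up):

    # 1. The API appends the write intent and replies to the user immediately
    redis-cli LPUSH nm:write_queue '{"action": "create_post", "author_id": 123, "text": "hello"}'

    # 2. A background worker pops intents and batches them into Neo4J
    redis-cli BRPOP nm:write_queue 0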


This is one part of our solution to improve read and write latency; our average response time stays below 300ms throughout the day. In this configuration it can sometimes be higher, when we hit Neo4J to refresh the Redis cache while there is a lock on the object we want to access.

4/ To conclude

Neo4J is definitely a very good product, but it lacks community support at this time. It fits graph use cases perfectly and is very good for analytics, but not for concurrent real-time access (through server mode, at least).


To get more information on Nousmotards: blog.nousmotards.com

Join us! 🙂


13 January 2016
by Deimos
0 comments

Docker: why you should use Monit instead of supervisord



This title may sound like a troll, but it's not! I'm writing this post as feedback. When I presented the Nousmotards infrastructure at an Ansible meetup, I got several questions about why I chose Monit as the init system instead of supervisord. That's what encouraged me to write this post and clarify things here.

Introduction

Docker is made to run one process in a confined area. The problem appears when you want to run several processes inside a container! Docker is simply not designed for it by default, and that's why a process manager should be considered. Docker officially recommends supervisord as the init system, which is why I first started with it. However, I encountered several supervisord crashes when a failing managed process kept restarting in a loop. I also looked at the available options, and I was very surprised by the lack of features compared to Monit.

Why is supervisord a bad idea for Docker?

I decided to switch to the well-known Monit and ran into unexpected behavior related to how Docker works. When you run a Docker container, you need to define the first command that will be launched, and it takes PID 1. This is generally where a process manager like supervisord or Monit enters the game. When you stop a Docker container, it kills the process manager and its forked processes.

With supervisord it works like a charm, as it forks the commands it has to manage, and those must not run in daemon mode.
Monit, however, prefers to run services as daemons and periodically checks the process states. This may not be the best method with regard to the Docker ideology, but in my opinion it is the best one. Why? For several reasons, in fact! Here they are:

  • Have you ever run a packaged Java application like Elasticsearch, Cassandra, Tomcat, etc.? Have you seen the number of arguments passed when you run that kind of service with SysV or systemd? If not, here is an example:
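This is the kind of command line you end up with (an Elasticsearch 1.x process shown as an illustration; the exact flags vary by version and distribution):

    java -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC \
      -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 \
      -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError \
      -Dfile.encoding=UTF-8 -Des.pidfile=/var/run/elasticsearch.pid \
      -Des.path.home=/usr/share/elasticsearch -cp ':/usr/share/elasticsearch/lib/*' \
      org.elasticsearch.bootstrap.Elasticsearch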

Are you really going to copy/paste all those lines into supervisord to start your process? What will happen when one of those arguments needs to change, as can happen during an application's life cycle through your distribution's package manager? Will you always be aware of every change and update them accordingly? I'm not sure, and this is typically a source of potential issues that can be avoided. So in Docker, using the distribution's init scripts is the best solution in my opinion, as you don't have to care about them. You just let the distribution/editor choose the best arguments for you, and override them in the /etc/default folder as usual if needed.

  • Now, if we compare them outside the Docker scope: what will happen when supervisord completely crashes? You lose all your forked applications! And what if one of them is a database? In the best case, a recovery scenario will get your data back; in the worst case, it is corrupted. Monit, on the other hand, starts daemonized services: you can restart it, upgrade it or do whatever you want with the Monit process, as it is completely decoupled from the launched processes.
  • The lack of features in supervisord is a problem as well, in my opinion. In Docker, you really need a full-featured process manager if you want to run several processes. That means this entry point should be able to manage boot order, dependencies, retry counts, timeouts, etc., and only Monit is able to do that.

Avoid Monit as PID 1

I hope you're now convinced that Monit is a very good candidate for managing processes inside a Docker container. As said, the first command launched in Docker takes PID 1. This is where the main issue lies, as you may want to restart Monit or upgrade it in an emergency (even though, in theory, you shouldn't touch processes inside your container, and should instead deploy a new container to replace the current one). On top of that, when you kill Monit, it won't gracefully shut down the managed processes. So you need to handle the kill signal and properly stop all of them. I've made a basic and simple init script for Docker that does exactly that:
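A minimal sketch of the idea (paths and the helper script name are assumptions):

    #!/bin/bash
    # Docker entrypoint: keep a dumb loop as PID 1 and run Monit as a daemon

    # On docker stop (SIGTERM), gracefully stop all Monit-managed services
    trap '/usr/local/bin/monit_stop_all.sh; exit 0' SIGTERM SIGINT

    # Monit forks into the background, so it does not hold PID 1
    monit -c /etc/monit/monitrc

    # Keep PID 1 alive and responsive to signals
    while true; do
      sleep 1 &
      wait $!
    done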

The kill signal will call another script that asks Monit to stop all managed services and waits for them to be stopped before returning:
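Sketched like this (monit_stop_all.sh is the hypothetical name used above):

    #!/bin/bash
    # Ask Monit to stop every managed service, then wait until none is running
    monit stop all

    while monit summary | grep -q "Running"; do
      sleep 1
    done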

Conclusion

With Monit and those basic scripts, you get a full-featured and stable launcher, you can do emergency work on Monit if necessary, and you keep things simple by reusing your distribution's init scripts.

I recently discovered dumb-init, another solution that does nearly the same thing. I personally prefer to stick with the most stupid, but stable, thing.

To finish: I have nothing against supervisord, it's just not the most appropriate solution for Docker.

8 January 2016
by Deimos
0 comments

Graphite events on Debian jessie



I'm using Graphite at work and for my Nousmotards project. For Nousmotards, I use the Graphite version available in the default Debian Jessie repositories, to avoid mismatched Django dependencies and the like.

A few days ago, I wanted to try Graphite events to get something pretty cool in Grafana:

[screenshot: Grafana annotations]

This makes it possible to know when a new app version is deployed, and makes it easier to understand when an issue occurs. However, I had trouble with the curl command for sending data, because there is a dependency mismatch between the Django version and the Graphite version in Debian Jessie:
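For context, posting an event to the Graphite events API looks like this (host and payload are illustrative); with the stock Jessie packages, this request came back with a Django error:

    # Post a deployment event; Grafana can then display it as an annotation
    curl -X POST "http://graphite.example.com/events/" \
      -d '{"what": "deployment", "tags": "nm_app", "data": "app v1.2.3 deployed"}'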

I found several solutions proposing a Django upgrade, etc., but the simplest solution for me was to make a replacement directly in a Graphite lib. Edit the file /usr/lib/python2.7/dist-packages/graphite/events/views.py and update the event line like this:

Then restart the uwsgi service, and it finally works like a charm 🙂

7 January 2016
by Deimos
0 comments

Ansible color and buffer fix on Jenkins



Hi all! It's been a long time since the last blog post! I'm using Jenkins with Ansible on my Nousmotards project to build Docker containers and deploy to production.

You may have noticed that you get no Ansible colors when running it in Jenkins! The simple way to fix it is to export an environment variable in your Jenkins job:
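The variable in question is most likely Ansible's force-color switch:

    # Force Ansible to emit ANSI colors even when stdout is not a TTY (Jenkins)
    export ANSIBLE_FORCE_COLOR=true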

Now that's pretty cool! However, you may also have noticed some lag in the console when watching the real-time console log on Jenkins. That's pretty annoying when you're doing rolling upgrades or restarts. The solution is once again an environment variable:
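Since Ansible is a Python program, the buffering culprit is presumably Python's output buffering:

    # Flush Python's stdout/stderr immediately so Jenkins shows logs in real time
    export PYTHONUNBUFFERED=1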

And now it's nice, just like a shell display!