Smart Video Filter project

After 2 years and 3 months of effort I finally released the Smart Video Filter project. It started as a long-held hope of mine from several years ago: as a YouTube viewer, I should be able to filter for universally liked videos with a high ratio of likes to dislikes. YouTube search results and recommendations often contain objectively bad videos, e.g. ones that are low quality or take up 10% of the screen, consist of static pictures, are offensive or cruel, or advertise some product with an unrelated video. All those videos get a lot of dislikes and relatively few likes, so their ratio = likes/dislikes is low. The videos I would really like to watch are the ones with a very high ratio = likes/dislikes: these videos are objectively amazing and out of this world. Hence the idea of a filtered search, where for any combination of search terms the user can keep only the videos with a high enough "ratio". The idea extends trivially to ratios of likes to views, comments to views, dislikes to views, etc.

A small concern with the above approach is that videos of similar quality with fewer views tend to naturally have higher ratios of likes to dislikes (or likes to views). Videos with fewer views tend to be watched by subscribers and aficionados. As a video gains popularity, it gets exposed to a wider audience, which typically likes it less, and the ratio goes down. The primary goal is to find the best of all videos. So, for each number of views, we can check how many videos have a higher ratio of, e.g., likes to views and return only the results in the top x% of that ratio. A uniform sample of videos across the whole range of views is readily obtained by selecting videos in the top x% (of some ratio) without specifying the target number of views. That is precisely the idea behind Smart Video Filter.
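
As an illustration (not the production code), here is a minimal sketch of the percentile idea in Java, assuming a plain in-memory list of videos with hypothetical views and likes fields; the real service computes this over Elasticsearch data and selects the views bucket separately:

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical video record; the real metadata lives in MongoDB/Elasticsearch.
    class Video {
        String id;
        long views;
        long likes;
        double ratio() { return views == 0 ? 0.0 : (double) likes / views; }
    }

    class RatioFilter {
        // Keep only the videos whose likes/views ratio is within the top xPercent
        // among videos with a comparable number of views (one "views bucket").
        static List<Video> topByRatio(List<Video> bucket, double xPercent) {
            List<Video> sorted = bucket.stream()
                    .sorted(Comparator.comparingDouble(Video::ratio).reversed())
                    .collect(Collectors.toList());
            int keep = Math.max(1, (int) Math.ceil(sorted.size() * xPercent / 100.0));
            return sorted.subList(0, keep);
        }
    }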

Though the idea appears simple, the implementation took many evenings over a long period of time. The project is quite rich in required expertise: UX design, architecture, back-end engineering, front-end engineering, and devops. The technologies include the NoSQL databases MongoDB and Elasticsearch, the front-end technologies Angular 5 and Angular Material from Google, and CentOS 7 and cluster administration. The project went through several stages. The initial UI was written in AngularJS and then rewritten in Angular 5. The initial backend datastore was PostgreSQL, which was unable to support the required I/O load and was replaced with MongoDB. Initially the project used Elasticsearch 2.1 and then gradually migrated up to Elasticsearch 6.2. Retrieval and refresh of video and channel metadata rely on non-trivial algorithms to be efficient and not use up the daily quota before noon. All that runs on a heavily multi-threaded and fault-tolerant Java backend. The underlying cluster implements high availability: after a complete loss of one machine all systems still work. YouTube's terms of service and developer policies are quite strict and took a while to comply with. Recently I got an audit review from the YouTube compliance team, and, well, they aren't shutting me down yet.

I'm greatly enjoying the final implementation myself, searching for the best videos and filtering over duration in a fine-grained way. My long-term hope came true! Even now I got heavily distracted using the service instead of writing this post. The 46:1-ratio video of Donald Trump "singing" Shake it Off by Taylor Swift is an amazing composition! The project has obvious ways to improve in order to substitute for the original YouTube search even more: provide recommendations and play videos on the site without redirecting to YouTube. However, Smart Video Filter is a "small market share" application. If the "market maker" YouTube itself were to implement it, then lots of videos would not be regularly shown in search results/recommendations, which would discourage the content creators. Hope you enjoy this niche service as much as I enjoy it myself!

HA (high-availability) setup for Smart Video Filter

With high expectations on website and service availability in 2018, it is especially important to ensure that redundant DR (disaster recovery) copies of the service are running at all times and are ready to take on the full PROD load within seconds. Hosting companies like Amazon have long solved this problem for standard services, e.g. for an Elasticsearch cluster. Since the cluster always runs with 1 or more replicas, a replica node is ready to take over for the short period until a new primary is spun up and synced after the primary failure. A layer of abstraction such as Kubernetes also allows the creation of high-availability services.

With all the available options, what should we use in the real world? It depends on the budget and the available hardware. My recently released service, Smart Video Filter, is a low-budget solution running on 2 physical machines with CentOS 7. Two enterprise-grade SSDs with a large TBW (terabytes written) resource are substantially cheaper than AWS in terms of storage cost and provisioned IOPS cost. It is recommended to run HA setups with 3 machines, but 2 machines (PROD and DR) provide enough reliability and redundancy in most cases. Four different services on those machines needed to seamlessly switch between PROD and DR: Elasticsearch, MongoDB, the Mining Service, and the Search Service.

The Elasticsearch setup over 2 machines consists of creating a cluster with 1 primary and 1 replica copy of each shard. Elasticsearch reads and writes happily proceed even if one of those nodes is down. No special setup is necessary.

The MongoDB setup on 2 nodes is trickier. MongoDB has protection against a split-brain condition: the cluster does not allow writes if a primary is not elected, a primary can only be elected by a majority of nodes, and there is no majority with 1 out of 2 nodes down. Adding an arbiter instance is recommended in such cases. However, a simple arbiter setup isn't going to work if the arbiter is deployed on one of the data nodes: if the entire node goes down, it takes the arbiter down with it. What I ended up implementing is a workaround of the split-brain protection, where the MongoDB config is overwritten by the mining service. The mining service provides an independent confirmation that one of the data nodes is dead and adds an arbiter on a different machine to the replica set, while removing the arbiter running on the same machine as the failed data node. Node health detection by the mining service is described below.
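
A rough sketch (with the MongoDB Java driver, not the exact production code) of how the replica set configuration could be rewritten once an external health check has confirmed that the other data node is dead; host names are hypothetical:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class ArbiterSwitcher {
        // Re-point the arbiter after the mining service has independently
        // confirmed that the PROD data node is down.
        public static void switchArbiterToSurvivingNode() {
            try (MongoClient client = MongoClients.create("mongodb://dr-node:27017")) {
                MongoDatabase admin = client.getDatabase("admin");
                // Read the current replica set configuration.
                Document result = admin.runCommand(new Document("replSetGetConfig", 1));
                Document config = (Document) result.get("config");
                // ... edit the members list here: remove the arbiter that lived on the
                // failed machine and add one running on the surviving machine ...
                config.put("version", config.getInteger("version") + 1);
                // Force the reconfiguration, since a majority of voting members is down.
                admin.runCommand(new Document("replSetReconfig", config).append("force", true));
            }
        }
    }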

The Search Service makes use of a health API. One instance of the service is deployed on each of the PROD and DR nodes. Each instance exposes a RESTful endpoint with a predictable response, which consists simply of the string "alive". Each instance also runs a client that reads this status from itself and from the other node. When both nodes are alive, the PROD node takes over. When the DR node detects that it is alive but the PROD node is not, it takes over. Each node is self-aware: it detects its role by comparing its static IP (within the LAN) to the defined IPs of the PROD and DR nodes. When a node takes over, it uses port triggering on the router to direct future external requests to itself. Testing showed that port triggering can switch the routed node within seconds.
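
A minimal sketch of the health endpoint and the failover decision, assuming Spring Boot on both nodes; the endpoint path and LAN IPs are made up for illustration:

    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;
    import org.springframework.web.client.RestTemplate;

    // Each node exposes the predictable "alive" response...
    @RestController
    class HealthController {
        @GetMapping("/health")
        public String health() {
            return "alive";
        }
    }

    // ...and each node also polls itself and the other node.
    class FailoverCheck {
        private final RestTemplate rest = new RestTemplate();

        // The DR node takes over only when it is alive itself and PROD is not.
        boolean drShouldTakeOver() {
            boolean prodAlive = isAlive("http://192.168.1.10:8080/health"); // PROD
            boolean selfAlive = isAlive("http://192.168.1.11:8080/health"); // DR (this node)
            return selfAlive && !prodAlive;
        }

        private boolean isAlive(String url) {
            try {
                return "alive".equals(rest.getForObject(url, String.class));
            } catch (Exception e) {
                return false;
            }
        }
    }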

The Mining Service employs the same health API plus another external API, which reports whether a job is running. When the PROD or DR node is ready to take over, it lets the current job finish on the other node before scheduling jobs on itself; jobs should not run on the PROD and DR nodes simultaneously. The health detection also helps switch the MongoDB arbiter between the nodes to ensure MongoDB can elect a primary.
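
And a sketch of the corresponding takeover rule for jobs, assuming a hypothetical endpoint /job/running that returns "true" while a job is in progress on a node:

    import org.springframework.web.client.RestTemplate;

    class MiningTakeover {
        private final RestTemplate rest = new RestTemplate();

        // Schedule jobs locally only when no job is running on the other node;
        // an unreachable node is treated as dead, so no job can be running there.
        boolean safeToScheduleLocally(String otherNodeBaseUrl) {
            try {
                String running = rest.getForObject(otherNodeBaseUrl + "/job/running", String.class);
                return !"true".equals(running);
            } catch (Exception e) {
                return true;
            }
        }
    }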

After the full setup is implemented, the system keeps functioning correctly, with disruptions of only several seconds, when one of the machines goes down entirely. This was readily demonstrated in testing, when all services remained highly available throughout a rolling restart of both machines!

Specialization Review – Leading People and Teams (University of Michigan)

Here is my review of the "Leading People and Teams" specialization, which I took on Coursera from Aug 2017 to Jan 2018. Courses in this specialization are rated very highly, between 4.5 and 5.0. I passed with an average grade of 99.0%. The specialization consists of 3 courses focusing on leadership and teamwork, 1 course emphasizing the human resources (HR) side of management, and a capstone project.

The first course, "Inspiring and Motivating Individuals", is quite inspirational indeed. Surprising research evidence suggests that most employees around the world are not engaged or motivated at work, and many of them are even actively disengaged. The course outlines the origins of the meaning of work, the importance of company vision and engagement, the drivers of people's motivation, and the ways to align employees with the company's goals.

The second course, "Managing Talent", is aimed primarily at managers conducting onboarding, managing performance and evaluations, coaching team members, and maintaining continuity of talent. Research shows that managers play a crucial role in personnel turnover. A variety of organizational behavior effects and biases are discussed, such as the Dunning-Kruger effect, availability error, racial bias, and gender bias. Knowledge from this course might, just like CSM and PMP certifications, backfire in startups or companies without a rigid structure, where many of the standard techniques are not followed.

The third course, "Influencing People", is probably the most practical of the specialization. It outlines the bases of power and the bases of strong relationships with people and goes into great depth with examples. The course offers practical advice on how to positively interact with colleagues, how to build relationships, and how to gain influence while protecting oneself from unwanted influence. Expert knowledge, information power, and referent power are presented as means of influencing without formal authority. The material assumes a workplace in the US, which provides great insight into the informal expectations for immigrant workers. For example, the expected level of socializing at the workplace differs around the world and is somewhat higher than average in the US.

The fourth course, "Leading Teams", takes it to the higher level of team dynamics. It provides practical advice for improving teamwork, coordination, output, and overall happiness. The course discusses topics such as team structure, team size, subteams, and splits based on demographics/similarity. Coordination problems and common decision-making flaws are emphasized and prevention methods are presented. Psychological safety is presented as a cornerstone of team performance. Team charters and team norms are discussed. Performance-oriented vs. learning-oriented mindsets are shown to produce different outcomes.

The final capstone project, "Leading People and Teams Capstone", is automatically graded as a pass. It offers 3 options for improving leadership skills: (1) solve a real-world leadership business case, (2) take on a leadership challenge at work, or (3) interview a business leader to gain insight into their practices. Option (2) is probably best aligned with the main goal of the course: to improve the learner's leadership skills.

Overall, I had a great experience taking the specialization. It emphasizes that leadership skills are not something a person is born with; they can and should be acquired through systematic work. A lot of material is focused on leading without formal authority, which is especially helpful to members of self-organizing Scrum teams in the software industry. The courses are filled with real-life stories and interviews with people from industry, which help solidify the concepts. Many pieces of homework are peer graded. Other learners' assignments provide insight into the ideas, styles, and techniques of people at various stages of the career ladder. Those techniques summarize the real-life experience of people managing their subordinates, resolving conflicts, and influencing their teams, which might not otherwise be accessible to the learners.

The specialization is taught by instructors from the University of Michigan's Ross School of Business: Scott DeRue, Full Professor and business school Dean; Maxym Sych, Associate Professor; and Cheri Alexander, Chief Innovation Officer. All three are charismatic, knowledgeable, and great presenters. The material is delivered very coherently and to the point. The lecture slides are very detailed and are great for returning to the material in the future.

Spring Boot + Angular 4

Modern web applications rely on the best services frameworks and the best user interface frameworks to be reliable, versatile, and easy to develop and maintain. That is why many software development teams choose Spring Boot for the services layer and Angular for the UI layer. Sustainable practices for continuous integration and development of these layers are key to the productivity of your team. Two more choices the team needs to make are the Integrated Development Environment (IDE) and a build automation tool. The de facto leading choices are, respectively, IntelliJ IDEA, with top-of-the-line support for both backend and UI, and Maven, traditionally used for the backend, with multiple plugins for the UI. Let me describe my latest setup for a project with Spring Boot 2.0.0, Angular 4.2, and Maven 3.3.9 using IntelliJ IDEA 2017.2.3. Backend code resides in its own (server) Maven module, while UI code resides in a separate (client) Maven module. A module, parent to both, provides easy means of building the entire application. An Angular 4 project has its own dependency management with npm, but it can readily be integrated with Maven using frontend-maven-plugin. I develop on Windows 10, but the steps should be practically the same for other OSes.

Part 1. Front-end module.
  1. Choose a distribution of NodeJS and install it on your machine. I installed v6.11.2.
  2. The installed NodeJS contains the "npm" executable. Check its version with "npm -v". My version is 3.10.10.
  3. Install the Angular command line interface package globally by running "npm install -g @angular/cli" as Administrator. The previous version of this package exists under the name "angular-cli" - do NOT install that one, as it only supports Angular 2.
  4. Create a maven module in IntelliJ for the UI (mine is named "search-client") with a "pom.xml" file, but without any other files or directories.
  5. Populate the "search-client" module with a UI template by executing "ng new search-client --skip-git" in the folder parent to the "search-client" folder. I have a separate version control repository and prefer to skip the provided git integration.
  6. Merge your existing Angular 4 files into the "search-client" project or write your UI from scratch, and integrate with the version control of your choice.
  7. Open "package.json" and define a command for prod compilation:
    "scripts": {
    ...
      "prod": "ng build --prod --env=prod"
    },
  8. Use the following setup in "pom.xml" for search-client in the <build><plugins> section. Match the NodeJS and npm versions to the ones discovered above. The "npm install" execution can be commented out after the first run to save time.
    <plugin>
      <groupId>com.github.eirslett</groupId>
      <artifactId>frontend-maven-plugin</artifactId>
      <version>${frontend-maven-plugin.version}</version>
      <executions>
        <execution>
          <id>install node and npm</id>
          <goals>
            <goal>install-node-and-npm</goal>
          </goals>
          <configuration>
            <nodeVersion>${node.version}</nodeVersion>
            <npmVersion>${npm.version}</npmVersion>
          </configuration>
        </execution>
    
        <execution>
          <id>npm install</id>
          <goals>
            <goal>npm</goal>
          </goals>
          <configuration>
            <arguments>install</arguments>
          </configuration>
        </execution>
        <execution>
          <id>prod</id>
          <goals>
            <goal>npm</goal>
          </goals>
          <configuration>
            <arguments>run-script prod</arguments>
          </configuration>
          <phase>generate-resources</phase>
        </execution>
      </executions>
    </plugin>
  9. Take note of the compilation output directory specified in the ".angular-cli.json" file under the option "apps -> outDir" and use it in the pom.xml <build> section as a resource:
    <resources>
        <resource>
          <filtering>false</filtering>
          <directory>dist</directory>
        </resource>
    </resources>
  10. Execution of "mvn clean install" produces a jar file "search-client-1.0-SNAPSHOT.jar" in the local repository containing the compiled frontend code. The command takes about 20 seconds for a new project on the second and subsequent runs.
  11. Define a new "npm" Run/Debug Configuration in IntelliJ to run the UI code in development mode: Run -> Edit Configurations -> "+"; then choose the correct path to the "package.json" file; Command -> run; Scripts -> ng; Arguments -> serve; choose a node interpreter, the global one is fine.
  12. Run this configuration and open http://localhost:4200 in a browser. Try modifying TypeScript, JS, CSS, or HTML files and observe how the displayed pages change.
Part 2. Integration with backend module.
  1. In our backend module (named "search-server") declare a dependency on the frontend module in the "pom.xml" file. Apart from providing access to the UI code, this ensures that the UI code builds before the backend code.
    <dependencies>
        <dependency>
            <groupId>${project.groupId}</groupId>
            <artifactId>${frontend.artifact.id}</artifactId>
            <version>${project.version}</version>
        </dependency>
    </dependencies>
  2. Here the property ${frontend.artifact.id} is defined in the parent module ("search-parent"):
    <properties>
        <frontend.artifact.id>search-client</frontend.artifact.id>
    </properties>
  3. Define the path to the UI files within the "target" folder of compiled code in the "search-server" module:
    <properties>
        <UI.files.folder>${project.build.directory}/classes/static</UI.files.folder>
    </properties>
  4. Use "maven-dependency-plugin" to unpack UI files into the target resources folder. Here the version of the plugin is managed by "spring-boot-starter-parent" project, which is the parent of "search-parent".
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-dependency-plugin</artifactId>
        <executions>
            <execution>
                <id>unpack</id>
                <phase>compile</phase>
                <goals>
                    <goal>unpack-dependencies</goal>
                </goals>
                <configuration>
                    <includeGroupIds>${project.groupId}</includeGroupIds>
                    <includeArtifactIds>search-client</includeArtifactIds>
                    <outputDirectory>${UI.files.folder}</outputDirectory>
                    <excludes>META-INF/**</excludes>
                    <overWriteReleases>true</overWriteReleases>
                    <overWriteSnapshots>true</overWriteSnapshots>
                </configuration>
            </execution>
        </executions>
    </plugin>
  5. Create a Run configuration in IntelliJ to execute the main application class, e.g. one annotated with @SpringBootApplication (or with @ComponentScan and @EnableAutoConfiguration); a minimal sketch of such a class is given after this list. After running "mvn install" and starting this configuration, the UI entry point, e.g. "index.html", will be accessible at the specified application root, port (and host).
  6. An executable "jar" or "war" file is readily produced with the "spring-boot-maven-plugin" in the <build><plugins> section, where the jar file of the search-client dependency is excluded:
    <plugin>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-maven-plugin</artifactId>
        <configuration>
            <mainClass>SUBSTITUTE_YOUR_PACKAGE_NAME.SearchServerAppInitializer</mainClass>
            <classifier>exec</classifier>
            <excludeArtifactIds>search-client</excludeArtifactIds>
        </configuration>
    </plugin>
  7. An additional binding of "maven-clean-plugin" helps to refresh the UI of a running application started from the main application class configuration in step 5. For that, run "mvn install" on the client module and then "mvn install" on the server module (or run "mvn install" on the parent module instead of both). The new UI will load upon refreshing the browser page. The application doesn't need to be stopped and no "clean" goal needs to be issued:
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-clean-plugin</artifactId>
        <executions>
            <execution>
                <phase>generate-resources</phase>
                <configuration>
                    <excludeDefaultDirectories>true</excludeDefaultDirectories>
                    <filesets>
                        <fileset>
                            <directory>${UI.files.folder}</directory>
                        </fileset>
                    </filesets>
                </configuration>
                <goals>
                    <goal>clean</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
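
A minimal sketch of the main application class referenced in step 5 of this part, assuming Spring Boot 2.0 and a hypothetical package name (substitute your own):

    package com.example.search;

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;

    @SpringBootApplication
    public class SearchServerAppInitializer {
        public static void main(String[] args) {
            SpringApplication.run(SearchServerAppInitializer.class, args);
        }
    }
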
The provided instructions are optimized for developer experience and compilation time, e.g. one doesn't have to run "clean" each time. One can find similar setups online. This notable discussion operates with an old version of the Angular CLI plugin. While a global installation of "angular-cli" is not necessary, some suggest using only a locally installed NodeJS to do the compilation. Then, however, one can't readily generate the UI project template. This GitHub project uses a setup very similar to mine.

E-mail server hosting on Amazon EC2

In the previous post I described how to set up web hosting with HTTPS and WordPress. All those steps require less work than setting up a fully secured e-mail server.

Technologies

For e-mail self-hosting we need postfix as the message transfer agent (MTA), dovecot as the POP3 server, cyrus SASL (Simple Authentication and Security Layer) for SMTP relay security, Amazon SES (Simple Email Service) as the SMTP relay authority with reverse DNS lookup, and the SSL certificate from Let's Encrypt described in the previous post.

Base setup
  1. Install postfix, dovecot, and cyrus SASL; start them and enable the corresponding services (postfix, dovecot, saslauthd); and remove sendmail.
    1. sudo yum install postfix dovecot cyrus-sasl
    2. sudo yum remove sendmail
    3. sudo service postfix start # repeat for dovecot and saslauthd
    4. sudo chkconfig postfix on # repeat for dovecot and saslauthd
  2. Create real user with password + directory (or a virtual user with a virtual mailbox).
    1. sudo useradd admin
    2. sudo passwd admin
    3. sudo mkdir /home/admin/mail/
    4. sudo chown admin /home/admin/mail
  3. Configure postfix for basic SMTP on port 25
    1. Edit /etc/postfix/main.cf to specify
      1. myhostname=yourhostname.com
      2. mydomain=yourhostname.com
      3. inet_interfaces=all
      4. inet_protocols=all
      5. home_mailbox=mail/
      6. message_size_limit=10485760 # for 10MB
      7. mailbox_size_limit=1073741824 # for ~1GB
      8. smtpd_recipient_restrictions=permit_mynetworks,permit_auth_destination,permit_sasl_authenticated,reject
  4. Configure dovecot for basic POP3 on port 110
    1. Edit /etc/dovecot/conf.d/10-auth.conf to specify
      1. disable_plaintext_auth=no
      2. auth_mechanisms=plain login
    2. Edit /etc/dovecot/conf.d/10-mail.conf to specify
      1. mail_location=maildir:~/mail
    3. Edit /etc/dovecot/conf.d/10-ssl.conf to specify
      1. ssl=no
  5. Open ports 25 and 110 in the EC2 security group, restart dovecot and postfix, and check that you can send e-mail to yourself and receive it via your favorite e-mail client (SMTP and POP3 hosts are yourhostname.com, no encryption, no SSL/TLS).

Authenticated SMTP

The above setup is the least secure. The first step toward amending this is to require authentication for SMTP. For that, use dovecot for SASL authentication with the SMTP server (smtpd).
  1. Edit /etc/postfix/main.cf to specify
    1. smtpd_sasl_type = dovecot
    2. smtpd_sasl_path = private/auth
    3. smtpd_sasl_auth_enable = yes
    4. smtpd_sasl_security_options = noanonymous
    5. smtpd_sasl_local_domain=$myhostname
    6. broken_sasl_auth_clients=yes
    7. smtpd_sasl_authenticated_header = yes
  2. Edit /etc/dovecot/10-master.conf to specify
    1. unix_listener /var/spool/postfix/private/auth  {
    2. mode = 0666
    3. user = postfix
    4. group = postfix
    5. }
  3. In your favorite e-mail application set "My outgoing server (SMTP) requires authentication" -> "Use same settings as my incoming mail server" and test that the new setup can send and receive e-mails to yourself and to/from one external account.

Secure SMTP and POP3

The above setup doesn't allow anonymous access to the e-mail server. However, the established connections are not secure. Both POP3 and SMTP can be secured with the same SSL certificate we used for HTTPS, as long as the connection server names coincide with the domain name.
  1. Enable SMTP port 587, which makes life easier as an addressee, as many popular mail servers prefer to send to port 587. Note that the SMTP port number itself has little to do with the use of SSL.
    1. Edit /etc/postfix/master.cf and uncomment "submission inet n ..." line.
  2. Configure smtpd setting to require SSL by editing /etc/postfix/main.cf:
    1. smtpd_tls_cert_file=/etc/letsencrypt/live/yourhostname.com/fullchain.pem
    2. smtpd_tls_key_file=/etc/letsencrypt/live/yourhostname.com/privkey.pem
    3. smtpd_tls_security_level = encrypt # this is the main setting to require SSL
    4. smtpd_tls_loglevel = 1 # raise to 2 or 3 if you plan to dig through logs /var/log/maillog
    5. smtpd_tls_received_header=yes
  3. Configure dovecot to require SSL:
    1. Edit /etc/dovecot/conf.d/10-auth.conf to specify
      1. disable_plaintext_auth = yes
    2. Edit /etc/dovecot/conf.d/10-master.conf to specify
      1. service pop3-login { ...
      2. inet_listener_pop3s {
      3. port = 995
      4. ssl = yes
      5. }
      6. }
    3. Edit /etc/dovecot/conf.d/10-ssl.conf. Mind "<" signs for ssl_cert and ssl_key.
      1. ssl = required
      2. ssl_cert=</etc/letsencrypt/live/yourhostname.com/fullchain.pem
      3. ssl_key=</etc/letsencrypt/live/yourhostname.com/privkey.pem
  4. Restart postfix and dovecot, open ports 587 and 995 on the EC2 instance, configure SMTP in your client to use port 587 with "Use the following type of encrypted connection = TLS", and configure POP3 in your client to use port 995. Tests should pass.

Relay sending SMTP messages to Amazon SES

The above SMTP and POP3 client setup looks identical to the one for Gmail, which brings the false sense that we are done. Your first e-mail from such a self-hosted SMTP server to Gmail will end up in the Spam folder. I know because I tried it. The problem is that your own SMTP server doesn't have an authority standing behind it to certify that the sender is good. Amazon SES serves as such an authority after you promise them you won't be doing anything bad. In short, an e-mail from your SMTP server needs to be relayed to the Amazon SES server in the correct hosting zone. Then Amazon SES provides the reverse DNS lookup.
  1. Sign up with Amazon SES, verify your primary e-mail on yourhostname.com and e-mail on Gmail, obtain a correct relay host based on a hosting zone, obtain SMTP credentials, verify DKIM. Generally follow guide for integration with postfix.
  2. Configure the smtp client for relay. As a rule of thumb, the "smtpd" process is the SMTP server receiving e-mail, while the "smtp" process is the SMTP client delivering outgoing e-mail => we need "smtp", and many smtpd options need to be duplicated as smtp options:
    1. Edit /etc/postfix/main.cf to specify
      1. relayhost = email-smtp.us-east-1.amazonaws.com:25 # port doesn't matter - 587 is as good as 25, the server depends on a hosting zone
      2. smtp_sasl_auth_enable = yes
      3. smtp_sasl_security_options = noanonymous
      4. smtp_tls_security_level = encrypt #outgoing connection must be secure as well
      5. smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
      6. smtp_use_tls = yes
      7. smtp_tls_note_starttls_offer = yes
      8. smtp_sasl_mechanism_filter = plain, login # essential, but not found in the official guide
      9. smtp_tls_CAfile = /etc/ssl/certs/ca-bundle.crt # we verify authenticity of Amazon SES server
      10. smtp_sasl_type = cyrus # which is the default
    2. It may come as a surprise, but dovecot doesn't support SASL authentication for the "smtp" client, so we have to use cyrus-sasl. One can store hashed passwords in a file, which is simpler than a database:
      1. Ensure saslauthd service is running and is set to start automatically.
      2. Create /etc/postfix/sasl_passwd file with Amazon SES SMTP server and SMTP credentials.
      3. Run "sudo postmap hash:/etc/postfix/sasl_passwd" to generate a hash file referenced by password_maps above.
  3. Restart postfix and test sending/receiving e-mail between your 2 verified accounts.
  4. Apply to Amazon SES for a production account, which allows sending e-mail to unverified accounts (aka clients).
This is basically it! We now have a production e-mail system, which is fully secured and can send 50,000 high-authority e-mails per day. Depending on the use case, you may consider forwarding incoming e-mails to Gmail.

Website self-hosting on Amazon EC2 cloud

The upcoming 3-yr renewal of my website hosting plan on Hostgator and the desire to learn the AWS cloud got me thinking about self-hosting my personal website http://astroman.org.

Costs

Hostgator gradually increased the regular cost of its Hatchling shared plan from $3.95/mo to $6.95/mo, plus the cost of the domain rose to $15/yr => the total for a 3-yr term is about $300. For a Positive SSL certificate one has to pay $50/yr and upgrade to the next tier of shared hosting plans => the total cost over 3 yrs readily rises to $600. Amazon cloud prices are predictably lower. At present t2.nano EC2 instances are priced at $0.0059/hour without a long-term commitment and at $69 for a 2-yr dedicated instance = $2.9/mo. A standard 8 GiB of EBS storage goes for $0.8/mo. Thus, one beats even the most discounted pricing of Hostgator... except one has to do much more work!

Technologies

Typical web hosting consists of a lot of static content in the form of HTML pages, images, and CSS + a WordPress blog + e-mail. SSL support is a premium paid feature. Under the hood, web hosting implies:
  • a lot of HTML/PDF/CSS/JPEG/PHP/etc in a folder on a Linux host
  • a domain name with adequate DNS service
  • Apache web server routing to the content
  • PHP engine + MySQL database to run WordPress
  • Postfix SMTP and Dovecot POP3 servers for e-mail
  • CA-signed SSL certificate with a mechanism for certificate renewal
I aimed at replicating all those features on an EC2 instance and succeeded in about 2 weeks, working on it about 5 evenings a week.

Base setup

A t2.nano EC2 instance has only 500 MB of memory, which prohibits installation of WHM/cPanel => more manual work. Luckily, all other software runs on such a box without a hitch on the chosen Amazon Linux AMI distribution. Provisioning of the EC2 instance is fairly standard, except that I got a discounted dedicated instance on the AWS marketplace with $2.9/mo pricing, but with a 2-yr commitment. An instance should have an associated Elastic IP address, which is free for as long as the instance is running. Regardless of the domain registrar, the DNS service can be provided by Amazon via the Route 53 service, which offers seamless integration with other AWS services and the best possible access to your domain. A hosted zone costs an extra $0.5/month, and I decided to pay that.

Web hosting

Amazon Linux is based on RedHat and has standard tools like yum available. However, be careful, as the default versions of packages may need to be abandoned in favor of compatible versions, e.g. use httpd24 instead of httpd:
  1. Configure DNS, edit /etc/sysconfig/network set HOSTNAME=yourhostname.com and restart server "sudo reboot".
  2. Install Apache: "sudo yum install httpd24".
  3. Copy files to /var/www/html with the entry point file named index.html.
  4. Edit /etc/httpd/conf/httpd.conf and comment out "AddDefaultCharset UTF-8" unless you have a Unicode-compatible website.
  5. Start Apache and set the service to autostart "sudo service httpd start" and "sudo chkconfig httpd on".
The website should now be accessible at http://yourhostname.com.

WordPress

A WordPress blog can either be installed at the website root or made available at a specific URL such as http://yourhostname.com/blog. It requires the PHP engine and a MySQL database. Using PHP56 avoids conflicts with other versions. I migrated my WordPress from a different hosting.
  1. Install PHP and MySQL
    1. sudo yum install php56-mysqlnd php56-gd php56 php56-common
    2. sudo yum install mysql-server mysql
  2. Start and enable autostart of "mysqld" service.
  3. Secure MySQL installation with "sudo mysql_secure_installation", set root password etc.
  4. Connect to MySQL from command line and create a database for WordPress.
    1. mysql -u root -p (enter the root password when prompted)
    2. CREATE DATABASE wordpress;
    3. CREATE USER wordpressuser@localhost IDENTIFIED BY 'password';
    4. GRANT ALL PRIVILEGES ON wordpress.* TO wordpressuser@localhost IDENTIFIED BY 'password';
    5. FLUSH PRIVILEGES;
  5. On old WordPress instance install Duplicator plugin and create the archives, then copy the archives to the relevant folder in the new hosting.
  6. Access installer.php and follow the prompts to hook up to MySQL database, unpack the archive and make selections.
  7. If the website address/folder changes at a subsequent time, make necessary changes to MySQL database.
  8. Consider using TRUEedit plugin, which prevents conversion of "--" and other symbols to something non-copy-pastable.

SSL certificate

Free self-signed certificates cannot be used for anything other than testing. SSL certificates signed by a trusted CA were always a paid premium feature, but not anymore. A new certificate authority, Let's Encrypt, now provides free SSL certificates for anyone! The certificates are cross-signed by IdenTrust, whose Certificate Authority public key is already present in most major browsers/operating systems. Steps to get the certificate and use it with Apache:
  1. Get Let's Encrypt project
    1. sudo yum install git
    2. sudo git clone https://github.com/letsencrypt/letsencrypt /opt/letsencrypt
  2. Obtain a certificate. Amazon Linux AMI support is experimental, but the --debug flag successfully forces installation of the relevant dependencies.
    1. sudo -H /opt/letsencrypt/letsencrypt-auto certonly --standalone -d astroman.org --debug
  3. The resultant 3 certificate files are referenced for Apache in /etc/httpd/conf.d/ssl.conf file as:
    1. SSLCertificateFile /etc/letsencrypt/live/yourhostname.com/cert.pem
    2. SSLCertificateKeyFile /etc/letsencrypt/live/yourhostname.com/privkey.pem
    3. SSLCertificateChainFile /etc/letsencrypt/live/yourhostname.com/fullchain.pem
  4. Include a permanent redirect to HTTPS in Apache config file /etc/httpd/conf/httpd.conf
    1. <VirtualHost *:80>
    2. ServerName yourhostname.com:80
    3. Redirect permanent / https://yourhostname.com/
    4. </VirtualHost>
  5. Allow overrides in VirtualHost on port 443 (otherwise links to individual posts will display error 404)
    1. vi /etc/httpd/conf.d/ssl.conf
    2. <Directory /var/www/html/blog>
    3. DirectoryIndex index.php
    4. AllowOverride All
    5. Order allow,deny
    6. Allow from all
    7. </Directory>
  6. Open port 443 on the EC2 instance and restart Apache. All links on your website, including the main page and WordPress, will now be HTTPS.
  7. Set up automatic certificate renewal to run daily in root's crontab and redirect the output to a file to verify that the renewal command runs. Most of the time the renewal script will skip the renewal, as the certificate is not yet due - it will only renew once every 60 days.
    1. sudo crontab -e
    2. 30 2 * * * /home/ec2-user/renewal.sh
  8. The renewal.sh file logs the date and attempts to renew the certificates without an Apache restart and without updating dependencies:
    1. sudo echo `date` >> /home/ec2-user/renew.log
    2. sudo /opt/letsencrypt/certbot-auto renew --webroot -w /var/www/html --no-bootstrap >> /home/ec2-user/renew.log

First experience with Elasticsearch

Many modern enterprise applications rely on search to some extent. As of Nov 2016 the most popular search engine is Elasticsearch, an open source engine based on Apache Lucene. The need to perform search arose in my home project as well. I chose Elasticsearch for the engine and readily dived into the tutorials. My methodology for writing interactions with 3rd party systems is to create Facade APIs within a Test Driven Development process. The tests for indexing and retrieving documents worked flawlessly, but the test results for the search queries got me puzzled. I have formal training in search engines from the Coursera Data Mining specialization, so I know concepts like TF-IDF. The hope was to get the relevance scores and match them precisely to the numbers computed by the formulas in the tutorials. A basic index of 4 test documents returned numbers vastly different from my expectations... After some googling I turned on the "explain" functionality and was in for an even bigger shock: the returned scores didn't match the scores in the explain section. I started suspecting the unthinkable: the relevance calculations are broken! The Elasticsearch tutorial confirmed my worst fears... well, it rather explained to me how little I know about real search engines.

After a couple more hours of comparing numbers, the discrepancies were decomposed into an optimization feature, a bug pretending to be a feature, and a bug. The optimization feature is that several shards are created for each index and documents are randomly distributed between those shards. The relevance calculations are only performed within each shard for the DEFAULT search type. Setting the search type to DFS_QUERY_THEN_FETCH forces the shard statistics to be combined into a single IDF calculation, thus leading to values closer to the expected numbers. However, the "explain" functionality always employs the DEFAULT search type, leading to a mismatch, hence a bug. The bug pretending to be a feature is the really coarse-grained rounding of the relevance norm. The discrepancies reach 15%, which hurts testing.
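
For illustration, here is how the search type can be set from Java, sketched with the Elasticsearch 2.x transport client of that era; the index and field names are made up:

    import java.net.InetAddress;

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;
    import org.elasticsearch.index.query.QueryBuilders;

    public class DfsSearchExample {
        public static void main(String[] args) throws Exception {
            TransportClient client = TransportClient.builder().build()
                    .addTransportAddress(new InetSocketTransportAddress(
                            InetAddress.getByName("localhost"), 9300));

            // DFS_QUERY_THEN_FETCH gathers term statistics from all shards first,
            // so IDF is computed over the whole index rather than per shard.
            SearchResponse response = client.prepareSearch("videos")
                    .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
                    .setQuery(QueryBuilders.matchQuery("title", "test query"))
                    .get();

            System.out.println(response.getHits().getTotalHits());
            client.close();
        }
    }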

Course Review – Big Data – Capstone Project (Ilkay Altintas, Amarnath Gupta)

Here is my review of the Big Data Capstone Project course offered on Coursera in Jul 2016. The course represents the final project for the Big Data specialization; it does not have a separate ranking, while I passed with a 98.2% score.

Technologies/Material: As a final project, the course does not have lectures, but rather brief descriptions of the relevant project parts each week. The project is about making suggestions on how to increase the revenue of a company promoting a fictional game, "Catch the Pink Flamingo". A lot of simulated game data is made available to the learners. The part assigned each week represents a separate area of big data analytics: data exploration, classification, clustering, and graph analysis. The suggested technologies are Splunk, KNIME, Apache Spark, and Neo4j, respectively. As usual within the specialization, instead of free exploration a "correct" path is given along with substantial help on the way. The assignment each week is peer graded with the ability to submit multiple times and get regraded. Grading asks to compare the learners' numbers with the correct numbers, which means that almost everyone gets correct answers on their second attempt. Unfortunately, many people slack off on their first attempt or simply submit an empty report. At the end of the course a final report and a PowerPoint presentation are submitted and also peer graded.

Instructor/lectures: The task instructions are given by Amarnath Gupta and Ilkay Altintas. The course offers a realistic view of the job of a Data Scientist: analyze all available data to increase the revenue of a company, improve retention rates, suggest ways of development, and, most importantly, make presentations to the management. The instructors emphasize each week that the company's bottom line is of the utmost importance. Even though the specialization is called Big Data, there is no emphasis on especially large volumes of data or on distributed computations, so we are really in the Data Science realm.

Course Review – Graph Analytics for Big Data (Amarnath Gupta)

Here is my review of the Graph Analytics for Big Data course offered on Coursera in Feb 2016. The course is ranked 2.5 out of 5, while I passed with a 99.4% score.

Technologies/Material: The course provides an introduction to graph theory with practical examples of graph analytics. Most of the examples and homework are done in Neo4j, a leading graph database. The last assignment employs the GraphX API in Spark. Since graph databases are so different from regular databases, a special graph query language called Cypher was developed to write code for Neo4j. An extensive Cypher tutorial and executable code samples grouped by topic are given. Graph analytics offers simple answers to many questions. The discussed techniques are Path Analytics (the Dijkstra algorithm and its variations), Connectivity Analytics, Community Analytics, and Centrality Analytics.

Instructor/lectures: The course is taught by Amarnath Gupta, an Associate Director of the San Diego Supercomputer Center. Amarnath is an amazing instructor. The course is well taught, with just the right speed and the right amount of material. In my view, he succeeded in making an introduction to graphs without oversimplifying the concepts.

Course Review – Machine Learning with Big Data (Natasha Balac, Paul Rodriguez)

Here is my review of the Machine Learning with Big Data course offered on Coursera in Jan 2016. The course got a ranking of 2.0 out of 5, and I passed it with a 100% score.

Technologies/Material: The course provides basic theory and some exercises on popular machine learning techniques after presenting the business justification and the ML pipeline. The presented techniques are decision trees, association rules, and clustering. Exercises are largely done in KNIME with some parts in Apache Spark. Thankfully, the course has copyable code samples and provides basic information on how to get started with KNIME. The assignments require digging into non-trivial details of KNIME from its documentation, the Internet, and forums. For me the course provided valuable insights into and examples of decision trees and association rules, which not many other courses offer.

Instructor/lectures: The course is taught by Natasha Balac, who provides most of the business background, and Paul Rodriguez, who is the technical person. The presentation is organized better than in the previous courses, though the depth of the material is often not sufficient for solid learning. Some slides can be reused to present Big Data to managers.