Sorting Your Digital Mess – How to Easily Set Up a Private Search Engine

Motivation

During my vacation, I kicked off some new projects. One of them is an ARM64-based SBC with integrated SATA, much like the Odroid-HC1/2. The main difference is the ARM core (64 vs. 32 bit), which makes it possible to run Ceph on it.

If you read some of my previous posts, you will notice that I already put some effort into making Ceph run on 32-bit controllers. But the effort to get it running and keep it running was way too high. The new SBC is another story, which I will tell when the time is ripe for it.

When I thought of putting all my data on a Ceph cluster, I stumbled over the problem of how to manage all these bits and bytes and keep an overview. There should be something like a Google search for your private data. When you dig through the net, you will find many sites, blogs and books about Elasticsearch. But there is always the problem of how to get your data in there without deep knowledge of data providers, ETL, graphs,… You can also find OpenSemanticSearch (OSS), and I gave it a try.

A Private Search Engine – Open Semantic Search

To be honest, OpenSemanticSearch is not the most beautiful engine you could think of, but it gives you very deep insight into your data, assuming you configure it correctly. Unfortunately, the documentation is quite sparse and not many bloggers have written on the topic, at least not for a current version of it.

After you have finished the installation, first do some deeper configuration. If you choose to use a VM, I suggest a RAM-centric configuration: better take 12+ GB of RAM and only 4 cores. Every core will lead to an ETL task that eats up a significant amount of RAM, depending on the data it is asked to index.

To avoid the web UI presenting links that point to an unreachable destination, do some configuration before adding local crawler paths. If you mount your data to a path in the filesystem of the OSS server, the link will be provided unchanged (e.g. /mnt/myData), i.e. the URL will just be the location on your server without http, hostname, location suffix,… Not really helpful if you have mounted NFS shares and want to access them from a remote Windows machine.

OSS already provides a document server that proxies your requests to HTTP, but you need to configure it correctly. If you do not configure it from the beginning, the whole indexing run will create wrong links, and it is hard to stop the indexing and do a rerun (see the troubleshooting section below). The task will keep running until it has done its job; for terabytes of data, this takes at least a few days.

Installation and Configuration of OSS

Installation of OSS is quite straightforward, as described on their web page.
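Just as an illustration of the Debian-based route (the package file name is a placeholder for whatever release you actually downloaded):

sudo apt update
sudo apt install ./open-semantic-search_<version>.deb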

Configuration is a bit more complex, due to the fact that the documentation is rather sparse. Some of it can be done in the web UI, but other parts are more hidden. But first things first.

Mounting Your Data

The first thing to do is to make your data accessible to the indexer.

[TODO]
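As a minimal sketch (the server address and export path are just placeholders), an NFS share could be mounted like this:

sudo mkdir -p /mnt/myData
sudo mount -t nfs 192.168.1.10:/export/myData /mnt/myData

For a permanent setup, the corresponding entry belongs in /etc/fstab (see the note on systemd in the restart section below).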

Configuring Apache to Proxy the Documents

The quick hack is to add the path you want to proxy in /etc/apache2/sites-available/000-default.conf.

Insecure Proxy for documents

This allows the documents to be accessed through http://<IP_OF_YOUR_OSS>/documents/mnt but also through http://<IP_OF_YOUR_OSS>/mnt. We will secure this later on, but for debugging this is quite helpful.
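A minimal sketch of such an entry, assuming your data lives under /mnt (adapt the paths to your setup), could look like this:

# inside the VirtualHost of /etc/apache2/sites-available/000-default.conf
Alias /documents/mnt /mnt
Alias /mnt /mnt
<Directory /mnt>
    Options Indexes FollowSymLinks
    Require all granted
</Directory>

After changing the configuration, reload Apache with sudo systemctl reload apache2.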

To get the links in OSS right, you also need to adjust /etc/opensemanticsearch/connector-files to contain a line with the following content:

config['mappings'] = { "/": "http://172.22.2.108/documents/" }

You can also add more mappings, but this one already gets the links in the OSS web UI right.
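As a sketch of what additional mappings could look like (the paths are placeholders, the IP is the one of my OSS host):

config['mappings'] = {
    "/mnt/myData": "http://172.22.2.108/documents/mnt/myData/",
    "/mnt/media": "http://172.22.2.108/documents/mnt/media/"
}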

(Re-)Starting the OSS server

I would advise a simple reboot of your server, also to check whether the shares get mounted correctly. For my Debian Buster this is not the case: it fails to mount the NFS shares for a reason I have not yet dug into, so I mount them manually after it has started. I assume the systemd start dependencies are a bit buggy.
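If you run into the same issue, one thing that may help (not verified on my setup; server and paths are placeholders) is to let systemd handle the network dependency and mount the share on first access via the fstab options:

# /etc/fstab
192.168.1.10:/export/myData  /mnt/myData  nfs  _netdev,noauto,x-systemd.automount  0  0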

[Much more to write on… Coming soon]

Troubleshooting

So, what to do if something goes wrong? E.g. if you decided to index a ton of data and need to stop it, it is time to purge the queue. But how?

Purging the Queue

OpenSemanticSearch uses RabbitMQ to organize the indexing tasks. If OSS decides to index a path, it simply lists all files in this path and puts them into the open_semantic_etl_tasks queue of your RabbitMQ server. The user interface of OSS, however, does not provide a means of purging or deleting the content of the queue. For that, you need to activate the RabbitMQ management web UI.

sudo rabbitmq-plugins enable rabbitmq_management

After this, you can check whether your server listens on the appropriate interface (0.0.0.0 for access from another host) and port (15672).
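For example (assuming netstat from net-tools is installed):

sudo netstat -nlpt | grep 15672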

netstat -nlpt output of a typical OSS host with enabled RabbitMQ Web UI

Once this is checked, we need to add a user with administrative rights to the RabbitMQ instance on the OSS host.

rabbitmqctl cluster_status   # Check if everything is fine
rabbitmqctl list_users       # Check that a user named webuser does not exist yet
rabbitmqctl add_user webuser <PASSWORD>
rabbitmqctl set_user_tags webuser administrator
rabbitmqctl set_permissions -p / webuser ".*" ".*" ".*"

This done, you can simply access the user interface: type http://<IP_OF_OSS_HOST>:15672/ into your browser and the login will be presented to you.

RabbitMQ Web Login

After logging in, head over to the Queues tab and open your open_semantic_etl_tasks queue.

Queues view of RabbitMQ web UI

You should be presented with the queue details

Queue details of open_semantic_etl_tasks

In my example, you can see a very limited number of ready tasks. This can go up to a few thousand tasks when indexing a large directory with many files. Each file adds a task to this queue, and RabbitMQ hands these out to the ETL workers of OSS.

To delete all messages in the queue, you can simply hit the Purge Messages button.

Purging the queue

After this, you can re-enqueue the files to index on your OSS search site through http://<IP_OF_OSS_HOST>/search-apps/files/.
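If you prefer to stay on the command line, the same should be achievable without the web UI (untested on my setup; the queue name is the one shown above):

sudo rabbitmqctl purge_queue open_semantic_etl_tasks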

Doxygen – Tips and Tricks

LaTeX non-interactive

To make LaTeX skip some errors without user interaction, you can add the option --interaction=nonstopmode to the latex/pdflatex call. The easiest way to do so is to change the LATEX_CMD_NAME in your Doxyfile.

LATEX_CMD_NAME = "latex --interaction=nonstopmode"

Do not forget the double quotation marks. Otherwise doxygen will remove the space and the command in your make.bat will fail.

If you now want to generate the documentation, step into the doxygen-generated latex folder (designated by the LATEX_OUTPUT option in the Doxyfile) and execute make.bat (on Windows) or make all (on *nix).
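For example, on a *nix machine (assuming LATEX_OUTPUT points to the default latex folder inside your output directory):

doxygen Doxyfile   # generate the LaTeX sources
cd latex           # or wherever LATEX_OUTPUT points to
make all           # builds refman.pdf, now without stopping at every error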

Adding a favicon to html output

To add a favicon to the HTML output, you need to specify it in a custom header and ship the image alongside the generated HTML, as described here. To extract the default header file (doxygen writes a default footer and stylesheet at the same time):

doxygen -w html headerFile footer.html customdoxygen.css

Add the following line to headerFile within the <head> section of the HTML:

<link rel="shortcut icon" href="favicon.png" type="image/png">

Then set your headerFile as HTML_HEADER and add the image to HTML_EXTRA_FILES in your Doxyfile. The paths are relative to your Doxyfile.

HTML_HEADER = headerFile
HTML_EXTRA_FILES = some_rel_path/favicon.png

Now you can generate your HTML documentation with the favicon in place.

PDF output destination

Did you ever search for the PDF file that doxygen (or rather the Makefile in the latex folder) generates? I just added an option to doxygen that copies the refman.pdf to a location of your choice. (Hopefully it soon gets merged and released.)

Want to test it out? Compile doxygen from my doxygen fork and add the following option to the Doxyfile of your project:

PDF_DST_FILE = ../MyGenerated.pdf

The destination is relative to the Makefile in your doxygen latex folder. As soon as make has finished its job, the PDF ends up in the folder the latex folder resides in.

That’s all. Enjoy generating software documentation with doxygen

Atlassian And New SSL Certificates

Or How CAs Drain Your Lifetime

When you check my activity, you can easily see that my time for writing has been quite limited over the past weeks. Since I had a few very urgent projects, I had no time to care about my blog. And just when you think it can get no worse, your website provider, which previously relied on the Symantec CA, decides to switch its root CA to a better and quite new CA that 1. is not listed in some JRE distributions' keystores and 2. uses trust chains with intermediate certificates. This leads to an ugly situation when you run e.g. an Atlassian tool environment where the tools in turn use SSL to connect to each other. I would not have realized the problem so early, nor would it have been so urgent, had I not decided to do all authentication (Bitbucket and Confluence) through JIRA's Crowd API.

But let's start from the beginning… My provider, where I get my SSL certificate from (namely the German 1&1), used the Symantec CA for years. This CA somehow attracted Google's anger, so Google decided to remove Symantec from the trusted CA list of its Chrome browser and announced this around the beginning of 2018. 1&1 saw no reason to hurry, so they kept their CA until mid-2018. Then they started to remind their customers to update their SSL certificates, forcing them to hurry a lot. I realized that there was some necessity to follow their appeal, but I also felt I still had some time and could do it when the certificate expired…

Then the time came and my Chrome browser refused to show my Atlassian pages. So I logged in to my 1&1 account and to the Linux machine where JIRA & Co. run, ordered a new SSL certificate, copy-pasted it into my Nginx configuration (which I use as an SSL proxy), and everything looked fine at first glance. I could log in to JIRA without a hassle, did not see any JIRA warnings or hovers, and my browser was not shouting at me about the SSL connection either. Everything was fine. But then came the surprise… I tried to log into Confluence (no, I currently have no SSO 🙁 ). I let the browser enter my credentials and… got refused. I tried a few more times manually with different combinations of users and passwords, checked my password store, checked CAPS LOCK,… but I did not get in. Since I had disabled the "local admin" of Confluence (due to being tight on the 10-user limit), I could not check from inside Confluence either.

What happened? After digging (this is what I often do 😉 ) through the settings, I stumbled over the Application Links section, which stated that the connection could not be established due to SSL errors. Ah, OK, nice that Atlassian recommends installing the JIRA SSL add-on that helps with all that stuff. Really all? For sure not! Especially not with the problems I encounter 🙁 . After digging deeper and deeper, I found that the new chain of trust 1&1 uses is not set up in the Java keystore the Atlassian tools ship with their JRE. Chrome's certificate store, in contrast, is updated tightly.

How to fix it? First you need to find the Java keystore to which the CA certificate must be added as trusted. This is quite difficult if you do not know about the tools' bundled JRE. So don't hassle around with your distribution's keystore; step into the folder where your tools are installed (/opt/atlassian/ in my case) and do a

find . -type f -name cacerts

In my case, the list shows as follows:

./jira/jre/lib/security/cacerts
./confluence/jre/lib/security/cacerts
./bitbucket/5.2.2/jre/lib/security/cacerts

I know, Bitbucket 5.2.2 is quite old (and the other tools, too), but keeping everything up to date is quite hard for a spare-time setup. I believe the problem would not have occurred with an up-to-date installation. Let's see when I update… With this knowledge, it is quite easy (albeit some manual work) to add your cert to the keystores.

To get the certificate from your server, just do a

bash$ openssl s_client -showcerts -servername www.example.com -connect www.example.com:443 </dev/null | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > ${HOME}/www.example.com.crt

Then you can import that certificate into the Java keystore with the following command set (e.g. for JIRA):

bash$ cd /opt/atlassian/jira/jre
bash$ bin/keytool -delete -alias www.example.com \
-keystore lib/security/cacerts -storepass changeit
bash$ bin/keytool -import -alias www.example.com \
-keystore lib/security/cacerts -storepass changeit \
-noprompt -file ${HOME}/www.example.com.crt

After doing that (I didn't even need to restart the tools), the application links come back to life and you can log in to your Confluence again.

The more convenient way is to write a little bash script that does the job. You can find mine here on GitHub. Feel free to improve it and open a pull request if you think it's worth sharing.
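As a rough sketch of the idea (this is not my actual script from GitHub; host name and install path are placeholders), such a script could loop over all bundled keystores like this:

#!/usr/bin/env bash
# Import a server certificate into every bundled Atlassian JRE keystore.
# HOST and ATLASSIAN_DIR are placeholders – adjust them to your setup.
set -euo pipefail

HOST="www.example.com"
ATLASSIAN_DIR="/opt/atlassian"
CERT="$(mktemp)"

# Fetch the server certificate chain
openssl s_client -showcerts -servername "${HOST}" -connect "${HOST}:443" </dev/null \
  | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > "${CERT}"

# Import it into every cacerts file shipped with the tools' JREs
find "${ATLASSIAN_DIR}" -type f -name cacerts | while read -r STORE; do
  KEYTOOL="$(dirname "${STORE}")/../../bin/keytool"
  "${KEYTOOL}" -delete -alias "${HOST}" -keystore "${STORE}" -storepass changeit || true
  "${KEYTOOL}" -import -noprompt -alias "${HOST}" -keystore "${STORE}" \
    -storepass changeit -file "${CERT}"
done

rm -f "${CERT}"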

Lessons learned

  1. Always keep a local admin account active in each and every Atlassian tool 😉
  2. Better use an automated SSL framework like Let’s Encrypt. With it, you have to get the certificate rollover working from the beginning, not when it is too late. (OK, this would not have helped in my situation, but it is nevertheless a good idea.)
  3. Document your problem solutions (which I do with scripts and this blog 🙂 )
  4. Don’t document the tool inside the tool itself (e.g. this howto in Confluence), or you will shoot yourself in both feet 🙂
  5. Keep your tools up to date!