
Re: Chris working on issue

Posted: Wed Oct 04, 2017 5:17 pm
by Steve Sokolowski
OK, now I'm back. I thought it was worth commenting on the issues pointed out here so that people can better understand this problem. First, I want to note that miners who encountered the problem received a 15% bonus, as always, in return for their trouble.

This particular problem was a result of two issues: unavailability at the time the problem occurred and the integration problem that the system represents.

First, the core issue was that nobody was available to troubleshoot the problem when it occurred. Had someone been available, we could have investigated right away and arrived at the solution eight hours earlier. The timing was particularly unfortunate, as Chris had just fallen asleep and I had just left for work. While Chris does have all day to work on this project, he does need to sleep sometime, and I can't access the system while at work. I suspect that Coinbase and other systems have issues too, but the difference is that their employees are monitoring all the time and are able to fix problems more quickly.

Second, the system is fundamentally an integration problem. When I worked for traffic.com in 2006, I developed software to manage commercials. The software was used by about 30 people in another department and consisted of a Windows program that connected to a SQL Server database. Users entered the cost of the spots, the times the commercials were to air, and what was to be read on the radio or TV; the program then generated printouts for the accounting department and for the anchors to read the copy. When someone reported a bug, we copied the production database, ran the system and did the exact same thing in the UI, and then attached a debugger and figured out what the problem was.

With this system, we can't do that. There are many different types of mining equipment and configurations, and the cost of the equipment connected to this system is tens of millions of dollars. There are hundreds of types of coins and many exchanges with incompatible APIs. Almost every problem we encounter is unreproducible in development because the user has a very specific configuration. We can't attach a debugger because it immediately puts CPU usage at 100%. Our best option is to carefully examine the problems and make educated guesses, ensure that the guesses don't create bugs, and then ask whether the new release with the guess fixed the problem. There are few domains where this type of programming is necessary.

This particular issue was caused by coins reaching 100% CPU load, one of those integration problems. We tested to see what would happen if the target coin was overloaded, but did not consider what would happen if another coin on the same server became overloaded and slowed down the target coin. The fix is for Chris to write a script that will monitor coins for high CPU usage and send him weekly emails recommending high-CPU, low-profit coins for discontinuation.
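To make the idea of that monitoring script concrete, here is a minimal sketch of the selection logic only. This is not Chris's actual script: the `CoinStats` fields, the function names, the thresholds, and the sample figures are all assumptions made up for illustration, and collecting the CPU data and actually sending the email are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoinStats:
    """Hypothetical weekly summary for one coin (field names are assumptions)."""
    name: str
    avg_cpu_percent: float    # average CPU load attributed to the coin
    weekly_profit_usd: float  # profit earned from the coin over the week


def recommend_discontinuation(
    stats: List[CoinStats],
    cpu_threshold: float = 80.0,     # assumed cutoff for "high CPU"
    profit_threshold: float = 10.0,  # assumed cutoff for "low profit"
) -> List[CoinStats]:
    """Return coins that are both CPU-heavy and unprofitable, worst CPU first."""
    flagged = [
        s for s in stats
        if s.avg_cpu_percent >= cpu_threshold
        and s.weekly_profit_usd < profit_threshold
    ]
    return sorted(flagged, key=lambda s: s.avg_cpu_percent, reverse=True)


def format_report(flagged: List[CoinStats]) -> str:
    """Build the plain-text body that a weekly email could carry."""
    if not flagged:
        return "No coins flagged this week."
    lines = ["Coins with high CPU usage and low profit:"]
    for s in flagged:
        lines.append(
            f"- {s.name}: {s.avg_cpu_percent:.0f}% CPU, "
            f"${s.weekly_profit_usd:.2f} profit"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    sample = [
        CoinStats("coin_a", 95.0, 2.0),
        CoinStats("coin_b", 90.0, 500.0),
        CoinStats("coin_c", 20.0, 1.0),
    ]
    print(format_report(recommend_discontinuation(sample)))
```

Requiring both conditions matters here: a coin that uses a lot of CPU but earns well is probably worth keeping, while a cheap, unprofitable coin is not what slows down its neighbors on the same server.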

Hopefully that helps everyone understand a little bit more about the challenge. The first reason, the unavailability of anyone to fix the problem at specific times, is possible to resolve if the lawyer returns with a positive outlook. The second reason is something that improves over time, but has no quick solutions. Every time an issue occurs, a fix or a detection routine is implemented to prevent it from happening again. That means that there are fewer remaining potential issues to cause problems. Long-term users have probably noticed that the number of these issues has declined significantly since July as more and more issues are discovered and resolved.

Finally, I wanted to discuss the issue of performance. Like many here, we continue to be struck by how persistent a problem performance has been. Performance improvements are all that I seem to do anymore, and the initial 7GH/s capacity has now been improved to at least 20TH/s, a roughly 3000x increase. I think, however, that those who suggest that the system be overhauled are making a grave mistake. The best course of action is to continue the current process of identifying problems, fixing them permanently, and slowly improving and parallelizing components one by one. For example, we reduced the charts' bandwidth and CPU usage by half last weekend, and this weekend we will reduce the miner status updates in the same way. Later, we will parallelize coin assignment.

Many projects get into a loop of starting over, believing that the next time will result in fewer problems, but instead that means that every problem that has been permanently fixed would then be at risk of recurring. Starting over might be a good choice for a self-contained program, but for a huge integration problem like this, starting from scratch would be disastrous.

Re: Chris working on issue

Posted: Wed Oct 04, 2017 6:09 pm
by mikhalkinv
I don't understand: are you planning any share correction for Saturday? Because I still don't have any. Did anyone get a correction? Am I the only one without? Steve, could you please look into it?

Re: Chris working on issue

Posted: Wed Oct 04, 2017 6:10 pm
by Steve Sokolowski
I also wanted to add a post here just to say that spirited discussion is welcomed, but this isn't reddit, and personal attacks are not allowed here. Please keep things civil, as I had to delete a post.

Re: Chris working on issue

Posted: Wed Oct 04, 2017 7:16 pm
by Harrison
Appreciate the very detailed response, Steve. Thank you for taking the time to do that!