
· 5 min read

About Nacos

Nacos (/nɑ:kəʊs/), short for Dynamic Naming and Configuration Service, is a platform for dynamic service discovery, configuration management, and service management that makes it easier to build cloud-native applications. Incubated at Alibaba and battle-tested by ten years of Double Eleven traffic peaks, it has earned a core reputation for ease of use, stability, and excellent performance.

After years of community co-development, Nacos supports mainstream languages such as Java, Go, and Python; mainstream service and configuration frameworks such as Dubbo and Spring Cloud Alibaba; and integrates with cloud-native components such as Istio and CoreDNS. It is among the most widely used pieces of microservice infrastructure today.

Since going open source, Nacos has quickly become a first choice in China, widely adopted by companies in the Internet, video, live-streaming, and finance sectors, supporting core business systems that matter to people's daily lives.

Why Choose Nacos

An open and active community

Nacos has 17.6k stars and 6.9k forks on GitHub. The community has 28 core committers, half from Alibaba and half from across the wider community.

Powerful product features

Nacos provides core capabilities such as service discovery and health monitoring, dynamic configuration, dynamic DNS, and service and metadata management, with the advantages of ease of use, rich features, high performance, large capacity, and high availability.

Close ecosystem ties

Nacos supports mainstream microservice and application frameworks as well as the cloud-native ecosystem. Choosing Nacos makes it easier to adopt and learn the whole microservice and cloud-native stack.

Participating in the Nacos Community

Community participation happens mainly through the official website and GitHub, and is not limited to code: documentation, translation, testing, article sharing, community governance, and more all count.

Nacos is licensed under Apache 2.0 and embraces, in its own way, the Apache idea that community is greater than code. By building capabilities with ecosystem partners, welcoming requests and contributions from all kinds of users, and taking part in two editions of Summer of Coding, the community has gained more than 200 contributors and a well-balanced committer composition. We welcome anyone who wants to participate in open source, grow technically through it, and share their thinking to join the Nacos community.

Summer 2021

The Nacos Summer 2021 projects are now online, and project applications opened on May 24. Interested students can visit the official website for project details, get in touch in advance, and sign up. The Nacos community has prepared several tasks of varying difficulty to help students quickly learn about the configuration center and service management parts of the microservice ecosystem:

  • Adapt the Nacos Python SDK to the new gRPC interface (difficulty: easy)
  • Add a plugin SPI for config encryption/decryption, plus one simple encryption implementation (difficulty: medium)
  • Add a plugin SPI for authentication and turn the current default auth implementation into a plugin (difficulty: medium)
  • Add a new observability system (difficulty: medium)
  • Build an embedded anti-fragility plugin system (difficulty: hard)

For more details, see the Nacos official website, the Nacos GitHub, or the ISCAS website:

Nacos official website: https://nacos.io

Nacos community details: https://github.com/alibaba/nacos/issues/5693

ISCAS details: https://summer.iscas.ac.cn/#/org/orgdetail/nacos?lang=en

ISCAS student guide: https://summer.iscas.ac.cn/help/

· 5 min read

Nacos is Alibaba's open-source service discovery and configuration management project. The 2.0.1 release is mainly dedicated to supporting the MCP-OVER-XDS protocol and making JRaft leader election more stable. The 1.4.2 release likewise greatly enhances the stability of JRaft leader election.

Nacos 2.0.1

Main upgrades in 2.0.1:

  1. Supporting the MCP-OVER-XDS protocol in the nacos-istio plug-in and module.
  2. Fixing the stability of JRaft leader election in Kubernetes environments.
  3. Fixing the problem of frequently throwing 'Server is Down' errors.

Detailed change logs:

[#3484] Support ldap login.
[#4856] Support mcp over xds.
[#5137] Support viewing subscribers in the service list.
[#5367] Support client encryption plugin for nacos 2.0.
[#5307] Support more parameters for config push.
[#5334] Fix Server is Down problem in k8s environment.
[#5361] Check isUseGrpcFeatures() when register instance using GRPC protocol.
[#5486] Refactor Distro Config as singleton and replace GlobalConfig.
[#5169] Fix instance beat run only by responsible server.
[#5175] Fix publishConfig lost type.
[#5178] Fix NPE when init server list failed.
[#5182] Fix throw NoSuchFieldException in ConfigController when service connect to nacos.
[#5204] Fix query error when pageNo is larger than service number.
[#5268] Fix subscriber app unknown
[#5327] Fix ThreadPool usage problem and add some monitor for distro.
[#5384] Fix the problem of unable to shutdown NacosConfigService.
[#5404] Fix frequently udp push for client 1.X.
[#5419] Fix Nacos 2.0 client auth may invalid for non public namespace.
[#5442] Change state to UP when receiving status from an old-version server.
[#5096] Add unit tests in nacos 2.0.
[#5171][#5421][#5436][#5450][#5464] Fix IT for nacos 2.0.

Nacos 1.4.2

  1. This version mainly enhances the stability of leader election in the JRaft protocol. Used together with the latest nacos-k8s project, it improves stability to a large extent.
  2. Moreover, it fixes many bugs in version 1.4.1, such as the "Server is Down" prompt.

Detailed change logs:

[#4452] Add config compare features.
[#4602] Add new way for export config.
[#4996] Make log level changeable for nacos-core module.
[#5367] Add pre-plugin in client for encrypting config.
[#3922] Method createServiceIfAbsent in ServiceManager requires sync.
[#4274] Skip master-select task when db.num is 1.
[#4753] Use SafeConstructor to parse yaml configuration.
[#4762] Support user-defined naming health check thread num.
[#4770] Beta publish: change the way of selecting betaIps, from input to select.
[#4778] Make SecurityProxy.accessToken thread-safe in single-writer multi-reader usage.
[#4903] Add security hint for login page.
[#4917] Raft ops interface add auth.
[#4980] Log4J2NacosLogging.loadConfiguration() returns directly when location is blank.
[#5010] Fix the usage of TemplateUtils.
[#5190] Add some hint log when login failed.
[#5234] Solve the problem that page can be edited while publishing-config request is processing.
[#5331] Fix the mouse hovers over the margin in a pointer state and cannot be clicked.
[#5350] Add hint and detail reason for consistence status Down.
[#5439] Support specified naming UDP push port for client.
[#5434] Optimize the ConfigType.isValidType method.
[#3779] Check groupName can't be empty.
[#4661] ConfigServletInner#doGetConfig code optimization.
[#3610] Fix the press F1 to full screen issue in new config page.
[#3876] Fix push empty service name.
[#4306] Fix search service by group error problem.
[#4573,#4629] Fix JRaft leader status check error.
[#4672] Fix cloning configuration losing its description.
[#4699] Fix metadata batch operation possibly deleting instances.
[#4756] Fix config list sort and search problem.
[#4787] Fix losing a member when parsing its host throws UnknownHostException.
[#4806] Fix addListener method comment.
[#4829] Remove instance when distro and raft remove instances data.
[#4852] Fix main.js is too large problem.
[#4854] Modify Header to support Keys Ignore Case.
[#4898] Fix instance list page bug.
[#4925] Fix member list change will cover member status and metadata problem.
[#5078] Fix the problem of inconsistent results for querying subscriber list data multiple times.
[#5026] Fix MetricsHttpAgent metrics twice.
[#5018] Check group and dataId in groupKey.
[#5114] ConcurrentHashSet.java is not compatible with jdk1.6 or 1.7.
[#5253] Fix missing auth identity header error.
[#5291] Fix Beat task will stop when throw unexpected exception.
[#5301] Respond all kinds of collections for istio's request.
[#5351] Fix Consistence status can't switch to UP after Jraft election.
[#5390] Fix ip verify error.
[#5427] Fix NPE if Jraft leader is null in CurcuitFilter.
[#5437] Fix config beta feature will lost dump event problem.
[#5451] Fix the tag can't be removed problem.
[#4822][#4823][#4824][#4825][#4979][#5506] Fix dependency security problem.
[#5277] Make the service name required in the subscriber list query.
[#5380][#5418] Add and enhance unit test.

Community

With Nacos 2.0.1 released, the Nacos community has a new committer: haoyann.

This committer has made many contributions to multi-data source support, authentication and security, and configuration module optimization and improvement, and has actively participated in community discussions.

The Nacos community welcomes more partners to contribute, including but not limited to:

  • Source code
  • Documentation
  • Community discussion
  • Multi-language support
  • Integration with surrounding ecosystem products

Active participants will receive exquisite small gifts from the Nacos community~

About Nacos

Nacos is committed to helping you discover, configure, and manage your microservices. It provides a set of simple and useful features enabling you to realize dynamic service discovery, service configuration, service metadata and traffic management.

Nacos makes it easier and faster to construct, deliver and manage your microservices platform. It is the infrastructure that supports a service-centered modern application architecture with a microservices or cloud-native approach.

· 4 min read

The annual Double Eleven shopping festival is here again. Have you grabbed the goods you want?

For this festival, the Nacos community has a gift for you: Nacos 1.4.0 and nacos-sdk-go 1.0.1 have been released.

Nacos 1.4.0

Main upgrade in 1.4.0:

  1. Refactored the Distro protocol of the naming module and sank it into the nacos-core module.
  2. Replaced the old self-implemented Raft protocol with JRaft to improve performance and the accuracy of Raft semantics.
  3. Unified the HTTP clients used by Nacos, optimized the usage of some HTTP clients, and reduced connection cost, especially the number of CLOSE_WAIT connections.
  4. Added a BETA interface to modify service metadata separately.
  5. Fixed some old bugs and optimized console usage.

Detailed change logs:

[#1654] Fix content highlight not working in config detail page.
[#2792] Save user information at login when auth is open.
[#2835] Fix the console loading continuously when there is no permission on the namespace.
[#2866] Fix client lacking permission for api /nacos/v1/ns/operator/metrics.
[#3117] Sink and Optimize the Notify implementation into common module.
[#3192] Unified http client in nacos server.
[#3315] nacos-client support https.
[#3397] Fix some error in start script.
[#3384] Fix raft information show error in console.
[#3500] Make page list of service manager same as config manager.
[#3509] Fix address server mode failing to obtain application.properties.
[#3518] When binding roles, the user list is changed to the drop-down selection mode.
[#3530] Add refresh buttons for each page in console.
[#3533] Change client cache directory config.
[#3515][#3536][#3899] Upgrade dependency to fix security problem.
[#3528] Fix client getting illegal project.version.
[#3550] Fix persistency file can't create in server side for raft protocol.
[#3560] Change title logo in browser.
[#3566] Extract and sink auth feature to nacos-auth from nacos-config.
[#3576] Add the destroy lifecycle method on NamingMaintainService.
[#3592] Fix incorrect prompt when accessing unauthorized namespace.
[#3628] Enhance the client update interval when subscribe non-exist service.
[#3635] Replace raft of naming module by Jraft of consistency module.
[#3651] Enhance http client usage to reduce CLOSE_WAIT connection in nacos-server.
[#3661] Enhance raft group update logic for using Jraft.
[#3671] Move some util class into common package.
[#3676] Fix revert chunk does not work in Content Comparison page.
[#3692] Refactor Distro protocol in nacos naming module.
[#3687] Check serviceName's format in server and client.
[#3710] Fix service metadata not supporting special characters.
[#3781] Fix service list intermittently losing services.
[#3790] Fix a configuration garbling problem that may occur on the client.
[#3815] Fix client cache possibly being truncated when it contains Chinese characters.
[#3833] Fix NotifyCenter throwing NullPointerException when there is no subscriber.
[#3855] Add change detail from the previous version in the configuration detail page.
[#3904] Support operating an instance's metadata independently.
[#3909] Fix nacos server being unable to configure domains.
[#3973] Fix load config failure during the first run.
[#4110] Fix naming modules failing to work properly during nacos capacity expansion.

Nacos Go SDK 1.0.1

This version mainly fixes some bugs in the old version and supports https.

Detailed change logs are in the release notes.

Community

With Nacos 1.4.0 released, the Nacos community welcomes two new committers: Maijh97 and wangweizZZ.

They have made many contributions in unifying the http client, sinking the auth module, supporting https in the client, reorganizing part of the server thread pools, and fixing bugs, and have actively participated in community discussions.

The Nacos community welcomes more partners to contribute, including but not limited to:

  • Source code
  • Documentation
  • Community discussion
  • Multi-language support
  • Integration with surrounding ecosystem products

Active participants will receive exquisite small gifts from the Nacos community~

About Nacos

Nacos is committed to helping you discover, configure, and manage your microservices. It provides a set of simple and useful features enabling you to realize dynamic service discovery, service configuration, service metadata and traffic management.

Nacos makes it easier and faster to construct, deliver and manage your microservices platform. It is the infrastructure that supports a service-centered modern application architecture with a microservices or cloud-native approach.

· 3 min read

Boosted by Alibaba Summer of Coding (ASoC), nacos-sdk-csharp has released a new version, 0.5.0. With this release, nacos-sdk-csharp now has largely the same capabilities as the Java SDK.

Thanks to Aman and Wenqing Huang for their contributions during ASoC.

Main release notes of nacos-sdk-csharp v0.5.0

  1. Fixed auth requests returning 403
  2. Changed configuration failover from memory to file
  3. Fixed being unable to retrieve available services after specifying a load balance strategy
  4. Fixed accessToken not refreshing because login happened only once
  5. Support YAML and INI parsers
  6. Support subscribe and unsubscribe for naming
  7. Support PreferredNetworks to choose the network adapter
  8. Improved ASP.NET Core integration

Coming Soon

The results of the Nacos community and Alibaba Summer of Coding go well beyond this: new C++ and Python SDK releases are coming soon.

How to Contribute

Everyone is welcome to participate in the Nacos community. If you find a typo in the docs or a bug in the code, or you want a new feature or have a suggestion, you can create an issue on GitHub.

If you want to get started, pick an issue in the GitHub repository with one of the following labels:

  • good first issue: a great entry point for newcomers.
  • contribution welcome: problems that badly need solving and important modules that currently lack contributors.

Beyond these general labels, you can also follow Nacos' multi-language efforts; all major languages are already supported:

Welcome to join and contribute to the Nacos community. As Apache puts it, "community over code"!

Newcomer corner - "What is Nacos?"

Don't know what Nacos is yet? No problem — star it on GitHub to say hi to the devs!!

Nacos is a project that Alibaba open-sourced in July 2018. Its vision is to provide easy-to-use infrastructure for dynamic service discovery, configuration management, and service sharing and management, helping users build, deliver, and manage their own microservice platforms in the cloud-native era.

The GitHub project is here.

· 4 min read

Since Nacos went open source on August 5, 2018, two years of community effort have brought it 13,400+ stars, 30 releases, 125 outstanding contributors, and hundreds of enterprise case studies. On the second anniversary, the community is celebrating by releasing Nacos 1.3.2 and Go SDK 1.0.0.

Nacos 1.3.2

Building on 1.3.1, Nacos 1.3.2 continues to refactor and optimize kernel features. The main improvements are:

  1. Refactored and unified the HTTP client in nacos-client for better extensibility and readability
  2. Rolled back the use of Apache HTTP in nacos-client 1.3.1 to reduce dependency conflicts and uncontrollable log output
  3. Refactored the event notification module in Nacos for better performance and readability
  4. Fixed the Nacos server failing to start on Windows
  5. Fixed several console issues
  6. Fixed some documentation errors

Nacos Go SDK 1.0.0

The main improvements in nacos-sdk-go v1.0.0 are:

  1. Support listening to multiple configs over one connection
  2. Support canceling config listening
  3. Reimplemented logging, with log rotation and pluggable log implementations
  4. Improved instance selection performance
  5. Improved docs and examples
  6. Integrated Nacos with the Go microservice ecosystem, such as dubbo-go and sentinel-golang

The official Go SDK 1.0.0 release takes Nacos another step toward supporting all mainstream development languages. The community now has Java, Golang, Python, and Node.js support, and the C++ and C# SDKs are progressing steadily through Summer of Coding and should be available soon.

Roadmap

In the coming year, Nacos will keep growing, focusing on building out the Nacos kernel to create a more stable, secure, and efficient microservice engine. The current core plans are:

  • Sink and unify the consistency model
  • Sink and optimize the auth model
  • Upgrade the connection channel to improve interaction efficiency
  • Upgrade the service data model to improve server-side efficiency

Acknowledgements

With the rapid growth of the video, live-streaming, and online education industries, Nacos has landed at fast-growing companies such as Huya, iQiyi, Mango TV, Zoom, Zhangmen, and KK Live. None of this would have been possible without the joint efforts of community contributors and users.

Our thanks to every contributor and user who has taken part in the community and provided code, designs, and discussion for Nacos.

We also welcome more individuals and enterprises to join in building Nacos and making it more efficient and stable.

Closing

Nacos is committed to helping you discover, configure, and manage your microservices. It provides a set of simple, easy-to-use features for dynamic service discovery, service configuration, service metadata, and traffic management.

Nacos makes it easier and faster to build, deliver, and manage your microservice platform. It is the service infrastructure for building modern, service-centered application architectures, such as the microservice and cloud-native paradigms.

· 15 min read

-- A first-person account from a 2019 Alibaba Summer of Coding student

Author: Liao Chuntao

About the author:

  • Member of the Alibaba Nacos project management committee
  • Committer on the Alibaba spring-cloud-alibaba project
  • Maintainer of the Alibaba nacos-spring-project
  • Maintainer of the Alibaba nacos-springboot-project
  • Contributor to spring-cloud/spring-cloud-sleuth
  • Speaker at Alibaba Cloud Native Day
  • Student in the first season of Alibaba Summer of Coding, 2019
  • Third prize (national level), 2018 China College Student Service Outsourcing Competition
  • University-level project, 2017 College Student Innovation and Entrepreneurship Competition
  • Second prize, 2017 Hangzhou Dianzi University "Internet+" Competition

Origins

Four years of university flew by, and I graduated not just as an ordinary new grad but also as a PMC member of an open source project.

My connection with open source began in the first semester of junior year, when a classmate and I took on a commercial outsourcing project that used the open source project WxJava. After the project was handed over, I planned to port it to Golang and studied it along the way. While learning, I found a few small issues and contributed an optimization, which is when my open source journey formally began.

I started participating in open source in earnest in the second semester of junior year. Back then I often attended technical talks with a senior classmate: Apache Flink, Apache APISIX, Service Mesh, distributed databases, service governance, and so on. One day in April, he shared with me the community group of an open source service governance project, just as I was looking to put what I had learned into real practice. From that day on, I truly threw myself into open source.

Moving Forward

Participating in open source is like an RPG: you level up by fighting monsters. I went from adding simple CRUD features on the SDK side to helping maintain two Spring ecosystem components. Along the way I relearned Spring's internals, deepened my understanding of its overall design, and became able to use the hooks Spring provides more flexibly to meet users' needs for the components.

The thing I am proudest of from that period is finding a bug in spring-cloud-sleuth and submitting a PR to fix it. Discovering it was a winding road. It started with users reporting that Zipkin could not be integrated with the service governance center. Digging through documentation and source code, I found that from a certain version on, Zipkin shipped its own web server, so Spring's mechanisms could no longer register zipkin-server with the service governance center. I did a simple test that adjusted the registration timing, but the change was too ad hoc to contribute back (and it was not the root cause anyway), so I just shared the workaround with affected users. Following up later, I found there were still integration problems between Zipkin and the service governance center, this time on the client side. After a long stretch of tracking and debugging, I pinned down the root cause, reported it, and finally submitted a PR with the fix. That experience taught me that solving problems is not about burying yourself in Google or Baidu; you have to start from the problem itself, trace it, observe it, and work it through.

Breakthrough

With one success behind me, I grew more confident. I moved from the client side to the server side, into the core of the service governance center. By then I had become a committer. To participate better, and to live up to the role, I restudied the project's source code and design, corrected many misunderstandings from my first reading, and gained a deeper grasp of how certain functional modules are designed. Weaving high-availability thinking through the source code also gave me more to draw on later during my internship and project refactoring work. After becoming a committer, perhaps out of a newcomer's fearlessness, I took on two daunting tasks: refactoring the kernel module and removing the MySQL dependency. The kernel refactor involved an abstract design for the consistency protocol layer, unifying the addressing modes, and unifying the event mechanism. The hardest part was abstracting and designing the consistency protocol layer. In truth, I did not know much about consistency protocols beyond CAP and BASE theory. Even so, once I took the task, I started digging into open source implementations such as JRaft, etcd, Memberlist, and hashicorp/raft, and downloaded all kinds of PDFs to study, which laid the theoretical groundwork for the work that followed.

Exploration

After autumn recruiting and my internship ended, I formally started on the tasks: writing design documents, building up the underlying theory, studying related project designs, and writing code — the full life cycle of a requirement from inception to delivery, a very comprehensive exercise. Code design was no longer freewheeling: turning a single-node relational store into a distributed, strongly consistent one means guaranteeing data consistency and the ACID properties of transactions, which required consulting a great deal of material and prior project designs; at the time I proposed four or five candidate approaches. To solve the problem from inside the database, I even studied the Apache Derby source code, learning how inserting a row works and how its master-slave mechanism is implemented. All that preparation, plus discussions with more experienced contributors, made the subsequent coding much smoother.

Reflections

For a new graduate like me, participating in an open source project and becoming a committer was a real advantage: in autumn recruiting I essentially got an offer from every company I interviewed with, including some special (SP) offers.

Participating in open source is an effective way to put theory into production practice. It forces you to weigh many factors — interface design, data compatibility across versions, extensibility, edge cases, and more — while broadening and deepening your knowledge. Beyond that, open source has you exchanging ideas with developers around the world; those exchanges can deepen your understanding of your own designs, expose their weaknesses, and sharpen your spoken and written communication.

Although I will soon start working at a large company, I hope to keep up the quality of my work while using spare time to go deeper into my field, stay enthusiastic about open source, keep learning from it, and give back what I learn.

Finally, an update on this year's Alibaba Summer of Coding: after three rounds of strict selection, 29 students made the cut and are about to start their summer of coding. Here are some of their "first day of school" photos. Only passion can withstand the long years — good luck, everyone!


The Nacos team is hiring~ Students graduating in 2021 are welcome to join Alibaba Cloud and build cloud native together!

1. Who we're hiring

New graduates graduating between November 2020 and October 2021

2. About the team

Who are we?

The Cloud Native team is dedicated to building the world's most advanced and stable cloud-native infrastructure and is one of the most core departments at Alibaba Cloud. Our goal is to make the cloud the lowest-cost, most efficient, and most stable environment for running applications. Here, we design ultra-large-scale container and scheduling systems that unleash the cloud's extreme elasticity; we build high-performance microservice architectures that give cloud applications unlimited scalability; and we craft a standard, easy-to-use PaaS platform that makes cloud development simple and controllable. You will take part in cutting-edge R&D on containers, Kubernetes, Service Mesh, Serverless, and more, and work alongside one of the top cloud-native teams in China, including CNCF TOC members and SIG co-chairs and the creators of etcd and the Kubernetes Operator pattern. You will also work on world-class open source projects such as Kubernetes, Containerd, OAM, Apache Dubbo, Nacos, and Arthas, pushing the boundary of cloud technology — empowering the Alibaba economy worldwide and serving developers everywhere.

Team leaders

  • Ding Yu (alias Shutong, Researcher), head of the Cloud Native Application Platform team. Joined Taobao in 2010, has fought through nine Double Elevens, leads Alibaba's high-availability architecture and Double Eleven stability, and is responsible for Alibaba's container, scheduling, cluster management, and operations technology, having driven and taken part in several generations of Double Eleven architecture evolution and upgrades.
  • Zhang Liben (alias Gupu, Researcher), responsible for cluster resource management and utilization optimization. Previously spent more than five years in the cluster management department of Google's infrastructure group, where he led the resource management and optimized scheduling team responsible for products such as FlexBorg and Autoscaling. Before Google, he researched intelligent systems at UC Berkeley; bachelor's and PhD from Tsinghua University.
  • Yi Li (alias Weiyuan, Senior Staff Engineer), currently leads R&D for Alibaba Cloud's blockchain and container services. Previously a senior technical specialist at the IBM China Development Lab, where, as architect and lead developer, he was responsible for or participated in a series of products and innovations in cloud computing, blockchain, Web 2.0, and SOA. Bachelor's and master's from Peking University.
  • Li Xiang (Senior Staff Engineer), head of open source strategy for foundational software and one of the nine CNCF TOC members worldwide. Former head of distributed projects at CoreOS, responsible for CoreOS' work on Kubernetes, etcd, and other distributed systems. His main interests are distributed consensus protocols, distributed storage, and distributed scheduling; author of the etcd project.
  • Zhang Lei (Staff Engineer), senior member and co-maintainer of the Kubernetes project, focusing on the Container Runtime Interface (CRI), scheduling, resource management, and virtualization-based container runtimes; co-responsible for Kubernetes upstream work and Alibaba's large-scale cluster management systems; previously at Microsoft Research (MSR) and the Kata Containers team.

3. Requirements

  • Bachelor's degree or above in computer science, mathematics, electronic engineering, communications, or a related field;
  • Solid grounding in data structures and computer systems; proficient in at least one programming language;
  • Passion for foundational software and strong hands-on ability; successful research or practical project outcomes are a plus;
  • Attention to open source technology; open source contributions are a plus;
  • Fast learner who keeps breaking through technical bottlenecks, enjoys exploring the unknown, and is always ready for new challenges;
  • Good team spirit: rigorous, resilient, optimistic.

4. Open positions

  • Golang engineer
  • Java engineer
  • C/C++ engineer
  • Infrastructure platform engineer
  • Front-end engineer

5. Locations

Hangzhou / Beijing / Shenzhen

6. How to apply

water.lyl@alibaba-inc.com

· 12 min read
# Nacos Access Control Design

## Background

Ever since Nacos went open source, access control has been a strongly requested feature, which reflects users' need to deploy Nacos to production. The newly released Nacos 1.2.0 supports access control for both service discovery and configuration management, so users can go to production safely. This article introduces the design of Nacos access control and a usage guide.

### What is access control?

In distributed service calls, requests from unknown or untrusted sources need to be identified and rejected. Access control generally has two stages: authentication and authorization. Authentication establishes who the caller is; authorization decides whether that caller has permission on the corresponding resource.

In Nacos, access control for configuration management means controlling whether a given config can be read or written by a given user; users without permission simply cannot read or write the corresponding config. Access control for service discovery means controlling whether a user may register or subscribe to a given service. Note that service discovery access control can only govern whether a user obtains a service's addresses from Nacos or modifies them on Nacos; once the addresses have been obtained, Nacos cannot enforce permissions at actual call time — at that point, access control must be done by the service framework.


### Common implementations

#### Authentication

  • Username + password
  • Cookie (browsers only)
  • Session
  • Token (JWT, OAuth, LDAP, SAML, OpenID)
  • AK/SK

#### Authorization

  • ACL: specifies which subjects may perform which operations on a resource;
  • DAC: specifies which subjects may perform which operations on a resource; additionally, a subject may grant its permissions on a resource to other subjects;
  • MAC: (a) specifies which categories of subjects may perform which operations on a resource, and (b) specifies which levels of resources a subject may operate on; an operation is allowed when it satisfies both a and b;
  • RBAC: (a) specifies which operations a role may perform on which resources, and (b) specifies which roles a subject holds; an operation is allowed when it satisfies both a and b;
  • ABAC: specifies which subjects, with which attributes, may perform which operations, under which conditions, on which resources with which attributes.
## Design Details

The goal of Nacos access control is to meet users' basic auth needs while remaining extensible: it should be able to integrate with users' own user management or auth systems, and later fit seamlessly into the Kubernetes and Service Mesh ecosystems. With that in mind, the current design ships a basic built-in implementation and supports user extensions. The details follow.

### Module design

The overall module design abstracts the auth logic out as much as possible, rather than adding it to the service discovery or configuration management modules. A config file selects the auth system in use. Nacos' built-in authentication uses JWT tokens, and its built-in authorization uses RBAC.


### Authentication flow

Whether from the console or a client, a user uploads a username and password to obtain a token, and every subsequent request to Nacos carries that token to prove identity. The token has an expiry time: the console simply prompts the user to log in again, while clients need logic to periodically refresh the token from Nacos.

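The token lifecycle just described — issue on login, attach a TTL, refresh before expiry — can be sketched as follows. This is a hypothetical illustration, not Nacos' actual classes: `TokenSketch`, `TTL_MS`, and the half-lifetime refresh rule are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the token flow described above: login issues a
// token with a TTL, later requests present it, and clients refresh it
// before it expires. Names here are illustrative only.
class TokenSketch {
    static final long TTL_MS = 5_000;
    static final Map<String, Long> issued = new HashMap<>();

    static String login(String user, long now) {
        String token = user + "-token"; // stand-in for a signed JWT
        issued.put(token, now + TTL_MS); // record the expiry moment
        return token;
    }

    static boolean isValid(String token, long now) {
        Long expiry = issued.get(token);
        return expiry != null && now < expiry;
    }

    // Client-side rule of thumb: refresh once the token is past half of
    // its lifetime, so requests never race against the expiry.
    static boolean shouldRefresh(String token, long now) {
        Long expiry = issued.get(token);
        return expiry == null || now >= expiry - TTL_MS / 2;
    }

    public static void main(String[] args) {
        String token = login("user1", 0);
        System.out.println(isValid(token, 1_000));       // true: still within TTL
        System.out.println(shouldRefresh(token, 3_000)); // true: past half the TTL
    }
}
```

A real client would run the refresh check on a background schedule, as the article notes.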

### Authorization

Nacos' built-in authorization uses the RBAC model, on which plenty of material is available online.

#### Data model

The authorization data model is also designed on standard RBAC and consists of users, roles, and permissions. A user is user information made up of a username and password. A role is a logical user group; at startup, Nacos ships with a global admin role, and only that role can add users, add roles, and make grants, which keeps the system secure. A permission is a resource plus an action.
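The user → role → permission (resource + action) model above can be sketched in a few lines. This is a hedged, stand-alone illustration with hypothetical names (`RbacSketch`, `"rw"` action strings), not Nacos' real implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal RBAC sketch: a user maps to roles, a role maps to permissions,
// and a permission is a resource plus an action string ("r", "w", "rw").
class RbacSketch {
    static final Map<String, Set<String>> userRoles = new HashMap<>();
    static final Map<String, Set<String>> rolePermissions = new HashMap<>();

    static void bindRole(String user, String role) {
        userRoles.computeIfAbsent(user, k -> new HashSet<>()).add(role);
    }

    static void grant(String role, String resource, String action) {
        rolePermissions.computeIfAbsent(role, k -> new HashSet<>()).add(resource + ":" + action);
    }

    // A user may act on a resource if any of its roles holds a permission
    // on that resource whose action string contains the requested action.
    static boolean hasPermission(String user, String resource, String action) {
        for (String role : userRoles.getOrDefault(user, Set.of())) {
            for (String perm : rolePermissions.getOrDefault(role, Set.of())) {
                String[] parts = perm.split(":");
                if (parts[0].equals(resource) && parts[1].contains(action)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        bindRole("user1", "role1");
        grant("role1", "test", "rw");
        System.out.println(hasPermission("user1", "test", "r"));   // true
        System.out.println(hasPermission("user1", "public", "r")); // false
    }
}
```

This mirrors the console walkthrough later in the article: binding user1 to role1 and granting role1 read-write on the test namespace gives user1 access to test but not to public.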


### API design

The following APIs cover all login and authorization logic. Except for the login API, they can only be called by the global admin.

#### User management

  • Create user: POST /nacos/v1/auth/users?username=xx&password=yy
  • Delete user: DELETE /nacos/v1/auth/users?username=xx&password=yy
  • Update user: PUT /nacos/v1/auth/users?username=xx&oldPassword=yy&newPassword=zz
  • Login: POST /nacos/v1/auth/users/login?username=xxx&password=yyy
#### Role management

  • Create a role / bind a user to a role: POST /nacos/v1/auth/roles?role=xx&username=yy
  • Delete a user's role: DELETE /nacos/v1/auth/roles?role=xx&username=yy
  • Get all roles of a user: GET /nacos/v1/auth/roles?username=xxx
#### Permission management

  • Grant a permission to a role: POST /nacos/v1/auth/permissions?role=xxx&resource=yyy&action=zzz
  • Remove a permission from a role: DELETE /nacos/v1/auth/permissions?role=xxx&resource=yyy&action=zzz
  • Get a role's permissions: GET /nacos/v1/auth/permissions?role=xxx
# Nacos Access Control in Practice

## Install Nacos 1.2.0

  1. Prepare the distribution. Download the release package from https://github.com/alibaba/nacos/releases/tag/1.2.0, or clone the Nacos master branch and build from source:
mvn -Prelease-nacos -Dmaven.test.skip=true clean install -U
  2. Unpack the package, then initialize the database with distribution/nacos-mysql.sql, which mainly adds three new tables: users, roles, and permissions. In standalone mode, initialize with distribution/schema.sql instead.
  3. Turn on access control on the server side by editing conf/application.properties:
nacos.core.auth.enabled=true

This switch is hot-reloaded, so it takes effect without restarting the server. If anything goes wrong with access control, you can roll back to the no-auth mode directly.

Note: in Nacos 1.2.0, login and authorization are bound together, and since this switch defaults to false, there is no login page on a default startup — please keep that in mind.

## Using Access Control

  1. Log in to the Nacos console with the admin account (if the page shows an error, clear the browser cache and refresh the page):


As you can see, the left sidebar gains a parent menu and three submenus, used for creating users, creating roles, and managing permissions. This menu only appears when an admin is logged in, which means only admins can manage and assign permissions.

  2. Manage users. Click "Users" to open the user management page, where you can create, update, and delete users:


  3. Manage roles. Because Nacos' built-in permissions are assigned via roles, you need to bind some roles to the users you created:


  4. Manage permissions. Once a role is created, you can grant it specific permissions:


In the "Add Resource" dialog you can choose the role to bind, the namespace resource, and the action type. In the screenshot above, we grant the role role1 read-write permission on the namespace test. Since we just bound user1 to role1, user1 can now read and write resources in the test namespace.

  5. Log in to the console as user1. Click the top-right corner of the console to log out of the admin account, then log in with the newly created user1:


As shown above, the permission management menu on the left disappears, because the current user is not an admin, and an authorization-failure dialog pops up. Don't worry: it only means user1 has no read permission on the public namespace, and it does not stop us from switching the namespace to test:

As shown above, we can now see the config data in the test namespace. Next, let's look at client usage.

  6. First, depend on the latest Nacos 1.2.0 client, then add the following code at initialization:
Properties properties = new Properties();
properties.put(PropertyKeyConst.NAMESPACE, "99a791cf-41c4-4535-9e93-b0141652bad0");
properties.put(PropertyKeyConst.SERVER_ADDR, "127.0.0.1:8848");
// set the username:
properties.put(PropertyKeyConst.USERNAME, "user1");
// set the password:
properties.put(PropertyKeyConst.PASSWORD, "pwd1");
ConfigService iconfig = NacosFactory.createConfigService(properties);
  7. Use the client to read and write configs as usual.
# We're Hiring

The Alibaba Cloud Native foundational technology platform team is a core R&D team in the Alibaba Cloud Infrastructure Products BU, dedicated to building a stable, standard, and advanced cloud-native application platform and driving the industry's upgrade toward cloud native. Here you will work closely with top experts in cloud computing and big data on core cloud-native technologies — Kubernetes, Service Mesh, Serverless, Open Application Model (OAM), Cloud Native Microservices, OpenMessaging, Event Streaming — and on top open source projects such as Apache Dubbo, Apache RocketMQ, Nacos, and Arthas, at a scale and in scenarios found nowhere else, serving both the Alibaba economy worldwide and developers everywhere. We are currently hiring for technical expert positions; see [http://www.posterhr.com/html/CkgpBwD6f?from=timeline&isappinstalled=0](http://www.posterhr.com/html/CkgpBwD6f?from=timeline&isappinstalled=0) for details (you can apply directly, or send your resume to dungu.zpf#alibaba-inc.com, replacing # with @).

· 26 min read

Summarize

Version 1.3.0 is a major overhaul, involving modifications to two large modules and the addition of a core module:

  1. nacos-core module modification
    1. nacos cluster member node addressing
    2. nacos internal event mechanism
    3. nacos consistency protocol layer
  2. nacos-config module modification
    1. Add embedded distributed data storage components
    2. Separation of embedded storage and external storage
    3. Simple operation and maintenance of embedded storage
  3. Add nacos-consistency module
    1. Unified abstraction for AP protocol and CP protocol

System parameter changes

Updated parameters (all in the core module):

| Parameter | How to set | Description |
| --- | --- | --- |
| nacos.watch-file.max-dirs | JVM parameter | Maximum number of monitored directories |
| nacos.core.notify.ring-buffer-size | JVM parameter | Maximum length of the fast notification queue |
| nacos.core.notify.share-buffer-size | JVM parameter | Maximum length of the slow notification queue |
| nacos.core.member.fail-access-cnt | JVM parameter, application.properties | Maximum number of failed accesses to cluster member nodes |
| nacos.core.address-server.retry | JVM parameter, application.properties | Number of request retries at first startup in address-server addressing mode |

The future overall logical architecture of Nacos and its components


Nacos cluster member node addressing mode

Before 1.3.0, the naming module and the config module each managed their own member list. To unify how Nacos manages the member list, the implementation was pulled out of the naming and config modules and merged into the addressing module of the core module. A command-line parameter, -Dnacos.member.list, was also added to set the member list, as an alternative to the cluster.conf file. The current Nacos addressing modes are as follows:

  1. Stand-alone mode: StandaloneMemberLookup
  2. Cluster mode
    1. The cluster.conf file exists: FileConfigMemberLookup
    2. The cluster.conf file does not exist, or -Dnacos.member.list is not set: AddressServerMemberLookup

If you want to force a specific addressing mode, set this parameter: nacos.core.member.lookup.type=[file,address-server]
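As a concrete sketch, the addressing mode could be pinned in conf/application.properties like this (the property name and values are taken from the options above):

```properties
# Force file-based addressing: read the member list from cluster.conf
nacos.core.member.lookup.type=file

# Or poll the address server instead:
# nacos.core.member.lookup.type=address-server
```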

The logical diagram is as follows

Addressing mode details

Next, I introduce the two addressing modes other than stand-alone mode.

FileConfigMemberLookup

This addressing mode is managed through the cluster.conf file. Each node reads the member list from cluster.conf under its own ${nacos.home}/conf to form the cluster. After the first read, it registers a directory listener with the operating system's inotify mechanism to monitor file changes in the ${nacos.home}/conf directory (note that only files are monitored; changes in subdirectories are not).
When you need to scale the cluster out or in, you must manually edit the member list in cluster.conf under ${nacos.home}/conf on every node.

private FileWatcher watcher = new FileWatcher() {
    @Override
    public void onChange(FileChangeEvent event) {
        readClusterConfFromDisk();
    }

    @Override
    public boolean interest(String context) {
        return StringUtils.contains(context, "cluster.conf");
    }
};

@Override
public void start() throws NacosException {
    readClusterConfFromDisk();

    if (memberManager.getServerList().isEmpty()) {
        throw new NacosException(NacosException.SERVER_ERROR,
                "Failed to initialize the member node, is empty");
    }

    // Use the inotify mechanism to monitor file changes and automatically
    // trigger the reading of cluster.conf
    try {
        WatchFileCenter.registerWatcher(ApplicationUtils.getConfFilePath(), watcher);
    }
    catch (Throwable e) {
        Loggers.CLUSTER.error("An exception occurred in the launch file monitor : {}", e);
    }
}

On startup, the node list is read directly from the cluster.conf file; a directory listener is then registered with WatchFileCenter, so that whenever cluster.conf changes, readClusterConfFromDisk() is automatically triggered to re-read the file.
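The parsing step that readClusterConfFromDisk() has to perform can be sketched in isolation like this — a hypothetical stand-alone version (the real method also handles file IO and error cases):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of reading a cluster.conf-style member list:
// each non-blank, non-comment line is one member address.
class ClusterConfSketch {
    static List<String> parseMembers(List<String> lines) {
        List<String> members = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            members.add(trimmed);
        }
        return members;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# example cluster.conf",
                "192.168.16.101:8848",
                "",
                "192.168.16.102:8848");
        System.out.println(parseMembers(lines)); // [192.168.16.101:8848, 192.168.16.102:8848]
    }
}
```

The watcher shown above would invoke exactly this kind of re-parse whenever the file-change event fires.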

AddressServerMemberLookup

This addressing mode relies on an additional web server to manage cluster.conf. Each node periodically requests the content of the cluster.conf file from the web server, which provides addressing among cluster nodes as well as scaling out and in.
When you need to scale the cluster, you only need to modify the cluster.conf file; each node then automatically gets the latest content the next time it polls the address server.

@Override
public void start() throws NacosException {
    if (start.compareAndSet(false, true)) {
        this.maxFailCount = Integer.parseInt(ApplicationUtils.getProperty("maxHealthCheckFailCount", "12"));
        initAddressSys();
        run();
    }
}

private void initAddressSys() {
    String envDomainName = System.getenv("address_server_domain");
    if (StringUtils.isBlank(envDomainName)) {
        domainName = System.getProperty("address.server.domain", "jmenv.tbsite.net");
    } else {
        domainName = envDomainName;
    }
    String envAddressPort = System.getenv("address_server_port");
    if (StringUtils.isBlank(envAddressPort)) {
        addressPort = System.getProperty("address.server.port", "8080");
    } else {
        addressPort = envAddressPort;
    }
    addressUrl = System.getProperty("address.server.url",
            ApplicationUtils.getContextPath() + "/" + "serverlist");
    addressServerUrl = "http://" + domainName + ":" + addressPort + addressUrl;
    envIdUrl = "http://" + domainName + ":" + addressPort + "/env";

    Loggers.CORE.info("ServerListService address-server port:" + addressPort);
    Loggers.CORE.info("ADDRESS_SERVER_URL:" + addressServerUrl);
}

@SuppressWarnings("PMD.UndefineMagicConstantRule")
private void run() throws NacosException {
    // With the address server, a synchronous member-node pull is performed at
    // startup, retried up to maxRetry times; break out on the first success
    boolean success = false;
    Throwable ex = null;
    int maxRetry = ApplicationUtils.getProperty("nacos.core.address-server.retry", Integer.class, 5);
    for (int i = 0; i < maxRetry; i++) {
        try {
            syncFromAddressUrl();
            success = true;
            break;
        } catch (Throwable e) {
            ex = e;
            Loggers.CLUSTER.error("[serverlist] exception, error : {}", ExceptionUtil.getAllExceptionMsg(ex));
        }
    }
    if (!success) {
        throw new NacosException(NacosException.SERVER_ERROR, ex);
    }

    GlobalExecutor.scheduleByCommon(new AddressServerSyncTask(), 5_000L);
}

During initialization, the node proactively pulls the current cluster member list from the address server, retrying on failure; the maximum number of retries can be controlled with nacos.core.address-server.retry (default 5). On success, a scheduled task is created to keep synchronizing the cluster member list from the address server.
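The retry-then-schedule startup logic above boils down to a bounded retry loop. Here is a minimal stand-alone sketch of that pattern — hypothetical names, not the actual Nacos code:

```java
import java.util.function.Supplier;

// Hedged sketch of the startup retry described above: attempt the sync up
// to maxRetry times, return on the first success, and surface the last
// failure if every attempt fails (as the real run() does via NacosException).
class RetrySketch {
    static <T> T retry(Supplier<T> task, int maxRetry) {
        RuntimeException last = null;
        for (int i = 0; i < maxRetry; i++) {
            try {
                return task.get(); // success: stop retrying
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("address server not ready");
            }
            return "member-list";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "member-list after 3 attempts"
    }
}
```

In the real code, the success path additionally schedules the periodic AddressServerSyncTask, which this sketch omits.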

How node management and addressing modes are combined

After MemberLookup starts, it performs addressing tasks according to the chosen addressing mode, collects the cluster node list, and calls memberChange to trigger a cluster node change, which then publishes node change events.

Nacos consistency protocol layer abstraction

As the future overall architecture of Nacos shows, the consistency protocol layer will be a core module of Nacos, serving each functional module built on top of the core module, as well as the core module itself. Because of partition tolerance, a consistency protocol must choose between availability and consistency, yielding two broad classes: eventual consistency and strong consistency. Nacos uses both: the naming module manages service instance data with AP and CP respectively, while the config module involves CP. On top of this, there are the following functional requirements:

  1. The persistence service currently uses a modified version of Raft, with business logic coupled to the protocol; the two need to be decoupled, and a standard Java Raft implementation chosen.
  2. For small and medium users, the config volume is usually modest, and a standalone MySQL is relatively heavy, so a lightweight storage option is needed — not depending on MySQL in 2.0, and making the MySQL dependency configurable in 3.0.
  3. Both CP and AP have many implementations, so the consistency protocol layer must be abstracted well enough that the underlying implementation can be swapped quickly in the future. For Raft, Nacos currently chooses JRaft, but it is not ruled out that Nacos will implement a standard Raft or Paxos protocol itself someday.
  4. Nacos has multiple functional modules working independently, and they must not affect one another: if module A handles requests too slowly or hits an exception, it must not affect the normal operation of module B. So when the modules share a consistency protocol, how is each module's data processing isolated?

Based on the consistency protocols and the functional requirements above, an abstract consistency protocol layer and its related interfaces were designed.

Consistency protocol abstraction

ConsistencyProtocol

Consistency is the property that multiple replicas stay in agreement. A replica is ultimately data, and operations on data are either reads or writes. A consistency protocol also exists precisely for the distributed case, which necessarily involves multiple nodes, so an interface is needed through which the protocol can coordinate the participating nodes. And what if we want to observe the protocol at work? For the Raft protocol, for instance, we may want to know who the current leader is, the current term, and which member nodes are in the cluster. The interface must therefore also expose the protocol's metadata.
In summary, the overall design of ConsistencyProtocol falls out as follows:

public interface ConsistencyProtocol<T extends Config, P extends LogProcessor> extends CommandOperations {

    /**
     * Consistency protocol initialization: perform initialization operations based on the incoming Config
     *
     * @param config {@link Config}
     */
    void init(T config);

    /**
     * Add a log handler
     *
     * @param processors {@link LogProcessor}
     */
    void addLogProcessors(Collection<P> processors);

    /**
     * Copy of the metadata information for this consistency protocol
     *
     * @return metaData {@link ProtocolMetaData}
     */
    ProtocolMetaData protocolMetaData();

    /**
     * Obtain data according to the request
     *
     * @param request request
     * @return data {@link Response}
     * @throws Exception when the query fails
     */
    Response getData(GetRequest request) throws Exception;

    /**
     * Get data asynchronously
     *
     * @param request request
     * @return data {@link CompletableFuture<Response>}
     */
    CompletableFuture<Response> aGetData(GetRequest request);

    /**
     * Data operation: submit synchronously; the Log already carries the
     * corresponding data operation information
     *
     * @param data {@link Log}
     * @return submit operation result {@link Response}
     * @throws Exception when the submit fails
     */
    Response submit(Log data) throws Exception;

    /**
     * Data operation: submit asynchronously; the Log already carries the
     * corresponding data operation information. Returns a future; any exception
     * raised during the submit surfaces through the CompletableFuture
     *
     * @param data {@link Log}
     * @return {@link CompletableFuture<Response>} submit result
     */
    CompletableFuture<Response> submitAsync(Log data);

    /**
     * New member node list; the consistency protocol itself decides whether each
     * member node has joined or left
     *
     * @param addresses [ip:port, ip:port, ...]
     */
    void memberChange(Set<String> addresses);

    /**
     * Shut down the consistency protocol service
     */
    void shutdown();

}

For CP protocols, the concept of a Leader exists, so a method is needed to query who the current Leader is.

public interface CPProtocol<C extends Config, P extends LogProcessor> extends ConsistencyProtocol<C, P> {

    /**
     * Returns whether this node is the leader node
     *
     * @param group business module info
     * @return is leader
     * @throws Exception when the query fails
     */
    boolean isLeader(String group) throws Exception;

}

Data operation request objects: Log and GetRequest

As mentioned above, a consistency protocol ultimately deals with data operations, which fall into two categories: queries and modifications. At the same time, data isolation between functional modules must be preserved. The two categories are therefore described separately.

  1. Data modification
    1. A modification request must identify which functional module it belongs to.
    2. It must state what kind of modification operation it is, so that the functional module can run the corresponding logic for the actual data change.
    3. It must carry the modified data, i.e. the request body. To keep the consistency protocol layer generic, the request body is represented as a byte[] array.
    4. Because the real data is serialized into a byte[] array, the type of the data may also need to be recorded so that it can be deserialized correctly.
    5. A digest or identifier for this request.
    6. Extra information for this request, reserved for extending the transmitted data in the future.

From this, the design of the Log object follows:


message Log {
    // Functional module grouping information
    string group = 1;
    // Digest or identifier
    string key = 2;
    // Concrete request data
    bytes data = 3;
    // Type of the data
    string type = 4;
    // The concrete data operation
    string operation = 5;
    // Extra information
    map<string, string> extendInfo = 6;
}
  2. Data query
    1. A query request must identify which functional module initiated it.
    2. It must carry the query conditions; to stay compatible with queries over various storage structures, these are stored as a byte[] array.
    3. Extra information for this request, reserved for extending the transmitted data in the future.

From this, the design of the GetRequest object follows:

message GetRequest {
    // Functional module grouping information
    string group = 1;
    // Concrete request data
    bytes data = 2;
    // Extra information
    map<string, string> extendInfo = 3;
}
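To make points 3 and 4 of the Log design concrete: the protocol layer only ever sees a byte[] plus a type name; the functional module serializes before submitting and deserializes again inside its processor. A minimal, self-contained sketch (the helper names are hypothetical; real modules use their own serializers):

```java
import java.nio.charset.StandardCharsets;

public class PayloadSketch {

    // What a module would place into Log.data / Log.type before submitting.
    static byte[] toBytes(String payload) {
        return payload.getBytes(StandardCharsets.UTF_8);
    }

    // What the module's processor does on apply, guided by Log.type.
    static Object fromBytes(byte[] data, String type) {
        if ("java.lang.String".equals(type)) {
            return new String(data, StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("unknown type: " + type);
    }

    public static void main(String[] args) {
        byte[] data = toBytes("instance=192.168.1.10:8848");
        // Round-trips back to the original string
        System.out.println(fromBytes(data, "java.lang.String"));
    }
}
```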

How functional modules use the consistency protocol: LogProcessor

After a data operation is submitted through the consistency protocol, each node needs to process the resulting Log or GetRequest object. We therefore abstract a processor for Log and GetRequest objects; each functional module implements its own processor, and ConsistencyProtocol internally routes Log and GetRequest objects to the specific processor according to their group attribute. The processor, in turn, declares which functional module it belongs to.

public abstract class LogProcessor {

    /**
     * Get data by request
     *
     * @param request request {@link GetRequest}
     * @return target type data
     */
    public abstract Response onRequest(GetRequest request);

    /**
     * Process a submitted Log
     *
     * @param log {@link Log}
     * @return {@link Response}
     */
    public abstract Response onApply(Log log);

    /**
     * Irrecoverable error that needs to trigger business degradation
     *
     * @param error {@link Throwable}
     */
    public void onError(Throwable error) {
    }

    /**
     * In order for the state machine that handles the transaction to be able to route
     * the Log to the correct LogProcessor, the LogProcessor needs to carry identity
     * information
     *
     * @return business unique identification name
     */
    public abstract String group();

}
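The group-based routing described above can be sketched in plain Java as follows (simplified stand-ins for Log and LogProcessor, not the actual nacos-core classes):

```java
import java.util.HashMap;
import java.util.Map;

public class LogRouterSketch {

    // Simplified stand-ins for the Log message and the LogProcessor class.
    static class Log {
        final String group;
        final String data;
        Log(String group, String data) {
            this.group = group;
            this.data = data;
        }
    }

    interface Processor {
        String group();
        String onApply(Log log);
    }

    private final Map<String, Processor> byGroup = new HashMap<>();

    void register(Processor p) {
        byGroup.put(p.group(), p);
    }

    // Route a committed log to the processor that owns its group.
    String apply(Log log) {
        Processor p = byGroup.get(log.group);
        if (p == null) {
            throw new IllegalStateException("no processor for group " + log.group);
        }
        return p.onApply(log);
    }

    public static void main(String[] args) {
        LogRouterSketch router = new LogRouterSketch();
        router.register(new Processor() {
            public String group() { return "naming_persistent_service"; }
            public String onApply(Log l) { return "naming applied " + l.data; }
        });
        // prints: naming applied instanceA
        System.out.println(router.apply(new Log("naming_persistent_service", "instanceA")));
    }
}
```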

CP protocols such as Raft include a snapshot mechanism, so a dedicated extension method is added for the CP case:

public abstract class LogProcessor4CP extends LogProcessor {

    /**
     * Discover snapshot handlers.
     * It is up to the LogProcessor to decide which SnapshotOperations it loads and saves
     *
     * @return {@link List<SnapshotOperation>}
     */
    public List<SnapshotOperation> loadSnapshotOperate() {
        return Collections.emptyList();
    }

}

Summary

As the points above show, ConsistencyProtocol is the interface exposed to the upper-layer functional modules, and each ConsistencyProtocol is backed by a concrete consistency protocol implementation. Because such a backend cannot be made directly compatible with the existing Nacos architecture, the additional LogProcessor abstraction was designed to bridge the gap.
At the same time, because the backend inside the consistency protocol layer must isolate the data of different business modules, this isolation logic is driven by the group attribute of the request objects and of the LogProcessor.

Consistency protocol layer workflow

We can look at a sequence diagram showing the general workflow of the consistency protocol layer.

The CP protocol implementation chosen for the Nacos consistency protocol layer: JRaft

With the consistency protocol layer abstracted, what remains is choosing a concrete implementation. Here we chose Ant Financial's open-source JRaft. So how is JRaft used as a backend of the CP protocol? The following flow chart describes the initialization process when JRaft serves as a CP-protocol backend.

/**
* A concrete implementation of CP protocol: JRaft
*
* <pre>
* ┌──────────────────────┐
* │ │
* ┌──────────────────────┐ │ ▼
* │ ProtocolManager │ │ ┌───────────────────────────┐
* └──────────────────────┘ │ │for p in [LogProcessor4CP] │
* │ │ └───────────────────────────┘
* ▼ │ │
* ┌──────────────────────────────────┐ │ ▼
* │ discovery LogProcessor4CP │ │ ┌─────────────────┐
* └──────────────────────────────────┘ │ │ get p.group() │
* │ │ └─────────────────┘
* ▼ │ │
* ┌─────────────┐ │ │
* │ RaftConfig │ │ ▼
* └─────────────┘ │ ┌──────────────────────────────┐
* │ │ │ create raft group service │
* ▼ │ └──────────────────────────────┘
* ┌──────────────────┐ │
* │ JRaftProtocol │ │
* └──────────────────┘ │
* │ │
* init() │
* │ │
* ▼ │
* ┌─────────────────┐ │
* │ JRaftServer │ │
* └─────────────────┘ │
* │ │
* │ │
* ▼ │
* ┌────────────────────┐ │
* │JRaftServer.start() │ │
* └────────────────────┘ │
* │ │
* └──────────────────┘
* </pre>
*
* @author <a href="mailto:liaochuntao@live.com">liaochuntao</a>
*/

JRaftProtocol is the concrete ConsistencyProtocol implementation used when JRaft is the backend of the CP protocol. It holds a JRaftServer member, and JRaftServer dispatches the various JRaft API operations: data operation submission, data query, member node changes, leader node query, and so on.

Note: the data generated while JRaft runs is stored in the ${nacos.home}/data/protocol/raft directory, with a separate file group per business module. If the node crashes or shuts down abnormally, clear the files in this directory and restart the node.

Since JRaft implements the concept of a raft group, we can use that design to create one raft group per functional module. The partial code below shows how the LogProcessor is embedded into the state machine and how a Raft Group is created for each LogProcessor.

synchronized void createMultiRaftGroup(Collection<LogProcessor4CP> processors) {
    // There is no reason why the LogProcessor cannot be processed because of the synchronization
    if (!this.isStarted) {
        this.processors.addAll(processors);
        return;
    }

    final String parentPath = Paths
            .get(ApplicationUtils.getNacosHome(), "data/protocol/raft").toString();

    for (LogProcessor4CP processor : processors) {
        final String groupName = processor.group();
        if (multiRaftGroup.containsKey(groupName)) {
            throw new DuplicateRaftGroupException(groupName);
        }

        // Ensure that each Raft Group has its own configuration and NodeOptions
        Configuration configuration = conf.copy();
        NodeOptions copy = nodeOptions.copy();
        JRaftUtils.initDirectory(parentPath, groupName, copy);

        // Here the LogProcessor is passed into the StateMachine; when the StateMachine
        // triggers onApply, the onApply of the LogProcessor is actually called
        NacosStateMachine machine = new NacosStateMachine(this, processor);

        copy.setFsm(machine);
        copy.setInitialConf(configuration);

        // Set snapshot interval, default 1800 seconds
        int doSnapshotInterval = ConvertUtils.toInt(raftConfig
                        .getVal(RaftSysConstants.RAFT_SNAPSHOT_INTERVAL_SECS),
                RaftSysConstants.DEFAULT_RAFT_SNAPSHOT_INTERVAL_SECS);

        // If the business module does not implement a snapshot processor, cancel the snapshot
        doSnapshotInterval = CollectionUtils
                .isEmpty(processor.loadSnapshotOperate()) ? 0 : doSnapshotInterval;

        copy.setSnapshotIntervalSecs(doSnapshotInterval);
        Loggers.RAFT.info("create raft group : {}", groupName);
        RaftGroupService raftGroupService = new RaftGroupService(groupName,
                localPeerId, copy, rpcServer, true);

        // Because the RpcServer has been started before, it must not be started again here
        Node node = raftGroupService.start(false);
        machine.setNode(node);
        RouteTable.getInstance().updateConfiguration(groupName, configuration);
        RaftExecutor.executeByCommon(() -> registerSelfToCluster(groupName, localPeerId, configuration));

        // Turn on periodic leader refresh for this group
        Random random = new Random();
        long period = nodeOptions.getElectionTimeoutMs() + random.nextInt(5 * 1000);
        RaftExecutor.scheduleRaftMemberRefreshJob(() -> refreshRouteTable(groupName),
                nodeOptions.getElectionTimeoutMs(), period, TimeUnit.MILLISECONDS);
        multiRaftGroup.put(groupName,
                new RaftGroupTuple(node, processor, raftGroupService, machine));
    }
}

Q&A: why create multiple raft groups?

Some readers may wonder: since the LogProcessor has already been designed, a single Raft group would suffice, because the state machine could route each applied Log to the right LogProcessor by its group attribute. Doesn't creating one Raft group per functional module consume a lot of resources?
As noted earlier, we want independently working modules not to affect one another. For example, module A's log processing may be slow because of a blocking operation, or may fail partway through with an exception. Under the Raft protocol, when a log apply fails the state machine cannot advance: if it pressed on, every subsequent apply could fail because of the earlier failure, leaving this node's data permanently inconsistent with the other nodes'. If all independently working modules shared one raft group, i.e. one state machine, these problems would inevitably arise, and uncontrollable factors in one module's log apply would disturb the normal operation of the other modules.

JRaft operation and maintenance

To let users perform simple operation and maintenance on JRaft, such as switching the leader, resetting the current Raft cluster members, or triggering a snapshot on a node, a simple HTTP interface is provided. It has one restriction: only one operation command can be executed per request.

1. Switch the leader node of a Raft Group

POST /nacos/v1/core/ops/raft
{
    "groupId": "xxx",
    "command": "transferLeader",
    "value": "ip:{raft_port} or ip:{raft_port},ip:{raft_port},ip:{raft_port}"
}

2. Reset the members of a Raft Group cluster

POST /nacos/v1/core/ops/raft
{
    "groupId": "xxx",
    "command": "resetRaftCluster",
    "value": "ip:{raft_port},ip:{raft_port},ip:{raft_port},ip:{raft_port}"
}

Note: this is a high-risk operation. Use it only when, after crashes, the surviving Raft cluster nodes can no longer satisfy the majority-vote requirement of n/2 + 1 nodes; it quickly reassembles the remaining nodes into a Raft cluster that can serve externally. This operation can easily cause data loss.

3. Trigger a Raft Group to perform a snapshot operation

POST /nacos/v1/core/ops/raft
{
    "groupId": "xxx",
    "command": "doSnapshot",
    "value": "ip:{raft_port}"
}

4. Remove a member from a Raft Group

POST /nacos/v1/core/ops/raft
{
    "groupId": "xxx",
    "command": "removePeer",
    "value": "ip:{raft_port}"
}

5. Remove multiple members from a Raft Group in one batch

POST /nacos/v1/core/ops/raft
{
    "groupId": "xxx",
    "command": "removePeers",
    "value": "ip:{raft_port},ip:{raft_port},ip:{raft_port},..."
}
The following JRaft-related parameters can be tuned in application.properties:

### Sets the Raft cluster election timeout, default value is 5 seconds
nacos.core.protocol.raft.data.election_timeout_ms=5000
### Sets how often the Raft snapshot executes periodically, default is 30 minutes
nacos.core.protocol.raft.data.snapshot_interval_secs=30
### Number of request retries, default value is 1
nacos.core.protocol.raft.data.request_failoverRetries=1
### Number of raft internal worker threads
nacos.core.protocol.raft.data.core_thread_num=8
### Number of threads required for raft business request processing
nacos.core.protocol.raft.data.cli_service_thread_num=4
### raft linear read strategy, defaults to ReadOnlySafe
nacos.core.protocol.raft.data.read_index_type=ReadOnlySafe
### rpc request timeout, default 5 seconds
nacos.core.protocol.raft.data.rpc_request_timeout_ms=5000

Analysis of the linear read modes

  1. ReadOnlySafe
    1. In this linear read mode, every follower read request must synchronize the commit point with the Leader, and the Leader must prove it is still the Leader by sending a lightweight RPC to more than half of the followers. A single follower read therefore costs at least 1 + (n/2) + 1 RPC requests.
  2. ReadOnlyLeaseBased
    1. In this linear read mode, for each follower read request the Leader only needs to check whether its leader lease has expired; if it has not, the Leader can directly answer that it is still the Leader. This mechanism places strict requirements on machine clocks, so consider this mode only when clocks are well synchronized across machines.
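As a quick arithmetic check of the ReadOnlySafe cost described above (a hypothetical helper for illustration, not JRaft code):

```java
public class LinearReadCost {

    // Minimal RPC count for one follower read under ReadOnlySafe, per the
    // description above: 1 RPC (follower -> leader) plus the leader confirming
    // its leadership with a majority, i.e. n/2 + 1 lightweight RPCs.
    static int readOnlySafeRpcs(int clusterSize) {
        return 1 + (clusterSize / 2 + 1);
    }

    public static void main(String[] args) {
        System.out.println(readOnlySafeRpcs(3)); // 3-node cluster: 3 RPCs
        System.out.println(readOnlySafeRpcs(5)); // 5-node cluster: 4 RPCs
    }
}
```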

Nacos embedded distributed ID

The distributed ID embedded in Nacos is a Snowflake implementation. dataCenterId defaults to 1, and workerId is computed as follows:

InetAddress address;
try {
    address = InetAddress.getLocalHost();
} catch (final UnknownHostException e) {
    throw new IllegalStateException(
            "Cannot get LocalHost InetAddress, please check your network!");
}
byte[] ipAddressByteArray = address.getAddress();
workerId = (((ipAddressByteArray[ipAddressByteArray.length - 2] & 0B11)
        << Byte.SIZE) + (ipAddressByteArray[ipAddressByteArray.length - 1]
        & 0xFF));
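As a self-contained illustration of the derivation above (a hypothetical helper, not the Nacos class): the low 2 bits of the second-to-last octet and the full last octet of the local IP are combined into a value in [0, 1023]. For an address such as 192.168.1.10 this works out to 266:

```java
import java.net.InetAddress;

public class SnowflakeWorkerId {

    // Mirrors the snippet above: (secondToLastOctet & 0b11) << 8, plus the
    // last octet, yielding a workerId in the range [0, 1023].
    static long workerId(byte[] ip) {
        return ((ip[ip.length - 2] & 0B11) << Byte.SIZE)
                + (ip[ip.length - 1] & 0xFF);
    }

    public static void main(String[] args) throws Exception {
        // getByName with a literal IP parses it without a DNS lookup
        byte[] ip = InetAddress.getByName("192.168.1.10").getAddress();
        System.out.println(workerId(ip)); // 266: (1 & 0b11) << 8 + 10
    }
}
```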

If you need to specify dataCenterId and workerId manually, set them in application.properties or pass command-line parameters at startup:

### set the dataCenterID manually
# nacos.core.snowflake.data-center=
### set the WorkerID manually
# nacos.core.snowflake.worker-id=

Nacos embedded lightweight Derby-based distributed relational storage

Background

  1. When the number of configuration files is small, the cost of running a highly available database cluster alongside a Nacos cluster is too high; a lightweight distributed relational storage is desirable instead.
  2. Nacos stores some internal metadata, such as user information and namespace information.
  3. Source of the idea: https://github.com/rqlite/rqlite

Design ideas

Goals

The design goal is for Nacos to support two data storage modes: one is the current approach, where data is stored in an external data source (a relational database); the other is an embedded data source (Apache Derby). Users can switch freely between the two storage modes via command-line parameters.

Overall design

Record, in order, the SQL contexts involved in a request operation. Then synchronize this request's SQL contexts through the consistency protocol layer, and have each node parse them and re-execute them sequentially within a single database session.

The DML statements of a database are select, insert, update, and delete. By the nature of the data operation, SQL statements divide into two categories: queries and modifications. The select statement corresponds to data query; insert, update, and delete correspond to data modification. In addition, PreparedStatement is used when operating on the database to avoid SQL injection, so each operation needs a SQL statement plus its parameters. This yields two request objects for database operations:

  1. SelectRequest

public class SelectRequest implements Serializable {

    private static final long serialVersionUID = 2212052574976898602L;
    // Query category: because JdbcTemplate is currently used, this marks single-row
    // vs. multi-row queries and whether a RowMapper turns the result into an object
    private byte queryType;
    // SQL statement
    // select * from config_info where
    private String sql;
    private Object[] args;
    private String className;
}

  2. ModifyRequest

public class ModifyRequest implements Serializable {

    private static final long serialVersionUID = 4548851816596520564L;

    private int executeNo;
    private String sql;
    private Object[] args;
}

Configuration publishing

The configuration publishing operation involves three transactions:

  1. config_info saves the configuration information
  2. config_tags_relation saves the association between the configuration and its tags
  3. his_config_info saves a history record of the configuration operation

These three transactions all sit inside the larger transaction of publishing a configuration. If we submitted one Raft proposal per sub-transaction, then transactions 1 and 2 might apply successfully after Raft submission while the third fails during apply; the whole publish would then need to be rolled back to preserve atomicity, and the rollback itself would need yet another Raft submission. Overall complexity would rise, and we would effectively be pulling in distributed transaction management. To avoid this, we merge the SQL contexts of the three transactions into one large SQL context and submit that single context through the Raft protocol, so the three sub-transactions execute in the same database session and atomicity is preserved. Moreover, because the Raft protocol processes transaction logs serially, this effectively raises the database transaction isolation level to serializable.

public void addConfigInfo(final String srcIp,
        final String srcUser, final ConfigInfo configInfo, final Timestamp time,
        final Map<String, Object> configAdvanceInfo, final boolean notify) {

    try {
        final String tenantTmp = StringUtils.isBlank(configInfo.getTenant()) ?
                StringUtils.EMPTY :
                configInfo.getTenant();
        configInfo.setTenant(tenantTmp);

        // Obtain the database primary keys through the snowflake ID algorithm
        long configId = idGeneratorManager.nextId(RESOURCE_CONFIG_INFO_ID);
        long hisId = idGeneratorManager.nextId(RESOURCE_CONFIG_HISTORY_ID);

        addConfigInfoAtomic(configId, srcIp, srcUser, configInfo, time,
                configAdvanceInfo);
        String configTags = configAdvanceInfo == null ?
                null :
                (String) configAdvanceInfo.get("config_tags");

        addConfigTagsRelation(configId, configTags, configInfo.getDataId(),
                configInfo.getGroup(), configInfo.getTenant());
        insertConfigHistoryAtomic(hisId, configInfo, srcIp, srcUser, time, "I");
        EmbeddedStorageContextUtils.onModifyConfigInfo(configInfo, srcIp, time);
        databaseOperate.blockUpdate();
    }
    finally {
        EmbeddedStorageContextUtils.cleanAllContext();
    }
}

public long addConfigInfoAtomic(final long id, final String srcIp,
        final String srcUser, final ConfigInfo configInfo, final Timestamp time,
        Map<String, Object> configAdvanceInfo) {
    ...
    // parameter handling
    ...
    final String sql =
            "INSERT INTO config_info(id, data_id, group_id, tenant_id, app_name, content, md5, src_ip, src_user, gmt_create,"
                    + "gmt_modified, c_desc, c_use, effect, type, c_schema) VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
    final Object[] args = new Object[] { id, configInfo.getDataId(),
            configInfo.getGroup(), tenantTmp, appNameTmp, configInfo.getContent(),
            md5Tmp, srcIp, srcUser, time, time, desc, use, effect, type, schema, };
    SqlContextUtils.addSqlContext(sql, args);
    return id;
}

public void addConfigTagRelationAtomic(long configId, String tagName, String dataId,
        String group, String tenant) {
    final String sql =
            "INSERT INTO config_tags_relation(id,tag_name,tag_type,data_id,group_id,tenant_id) "
                    + "VALUES(?,?,?,?,?,?)";
    final Object[] args = new Object[] { configId, tagName, null, dataId, group,
            tenant };
    SqlContextUtils.addSqlContext(sql, args);
}

public void insertConfigHistoryAtomic(long configHistoryId, ConfigInfo configInfo,
        String srcIp, String srcUser, final Timestamp time, String ops) {
    ...
    // parameter handling
    ...
    final String sql =
            "INSERT INTO his_config_info (id,data_id,group_id,tenant_id,app_name,content,md5,"
                    + "src_ip,src_user,gmt_modified,op_type) VALUES(?,?,?,?,?,?,?,?,?,?,?)";
    final Object[] args = new Object[] { configHistoryId, configInfo.getDataId(),
            configInfo.getGroup(), tenantTmp, appNameTmp, configInfo.getContent(),
            md5Tmp, srcIp, srcUser, time, ops };

    SqlContextUtils.addSqlContext(sql, args);
}

/**
 * Temporarily saves all insert, update, and delete statements under
 * a transaction in the order in which they occur
 *
 * @author <a href="mailto:liaochuntao@live.com">liaochuntao</a>
 */
public class SqlContextUtils {

    private static final ThreadLocal<ArrayList<ModifyRequest>> SQL_CONTEXT =
            ThreadLocal.withInitial(ArrayList::new);

    public static void addSqlContext(String sql, Object... args) {
        ArrayList<ModifyRequest> requests = SQL_CONTEXT.get();
        ModifyRequest context = new ModifyRequest();
        context.setExecuteNo(requests.size());
        context.setSql(sql);
        context.setArgs(args);
        requests.add(context);
        SQL_CONTEXT.set(requests);
    }

    public static List<ModifyRequest> getCurrentSqlContext() {
        return SQL_CONTEXT.get();
    }

    public static void cleanCurrentSqlContext() {
        SQL_CONTEXT.remove();
    }

}
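A quick, runnable illustration of why a ThreadLocal context fits here (simplified String statements instead of ModifyRequest, not the Nacos class): statements accumulate per thread in submission order, and concurrent requests on other threads never see each other's context.

```java
import java.util.ArrayList;
import java.util.List;

public class SqlContextSketch {

    private static final ThreadLocal<ArrayList<String>> CTX =
            ThreadLocal.withInitial(ArrayList::new);

    static void add(String sql) { CTX.get().add(sql); }
    static List<String> current() { return CTX.get(); }
    static void clean() { CTX.remove(); }

    public static void main(String[] args) throws Exception {
        add("INSERT INTO config_info ...");
        add("INSERT INTO his_config_info ...");
        // Another thread gets its own, empty context
        Thread t = new Thread(() -> {
            if (!current().isEmpty()) throw new AssertionError("context leaked");
        });
        t.start();
        t.join();
        System.out.println(current().size()); // 2: this thread's statements, in order
        clean();
    }
}
```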

A sequence diagram gives a more intuitive view of the whole flow.

How to use the new feature

./startup.sh -p embedded

The activity diagram below shows how Nacos decides whether to enable the embedded distributed relational storage.

Directly query the data stored in each node's Derby:

GET /nacos/v1/cs/ops/derby?sql=select * from config_info

Returns List<Map<String, Object>>.

Limitations

  1. Building a distributed data-operation synchronization layer on top of the database restricts how the database can be operated: a sequence that interleaves queries with modification statements (say insert, then select, then update) is not supported.
  2. Database performance is limited: because the transaction isolation level is indirectly raised to serializable, concurrency is artificially reduced.

Future evolution

The Apache Derby community intends to explore Raft-based synchronous binlog replication, providing database-level synchronization from the bottom up.

· 10 min read

As more and more enterprises adopt Nacos, the two most frequent questions have been how to correctly use namespace and endpoint in production. This article discusses best practices for configuring these two parameters when using Nacos.

namespace

Regarding namespace, the discussion below covers two aspects: the design background of namespace and best practices for namespace.

The design background of namespace

namespace is the basis on which Nacos isolates data (configurations and services) across multiple environments and multiple tenants. That is:

  • From the perspective of a single tenant (user): if there are several different environments, you can create a namespace per environment to isolate them. For example, with daily, staging, and production environments, one Nacos cluster can host the following three namespaces, as shown below:

  • From the perspective of multiple tenants (users): each tenant (user) can have its own namespace, and each tenant's configuration data and registered services belong to its own namespace, isolating data between tenants. For example, a super administrator provisions three tenants: Zhang San, Li Si, and Wang Wu. After allocation, each tenant logs in with its own account and password and creates its own namespace, as shown below:

Note: this capability is still being planned.

Best practices for namespace

The best practices for namespace involve two actions:

  • How to obtain the value of a namespace
  • How the namespace parameter is initialized

How to obtain the value of a namespace

Whether you use Nacos through Spring Cloud or Dubbo, you will need to supply a namespace parameter. Where does its value come from?

  1. If you never explicitly supply this parameter, Nacos uses a default namespace: nacos naming initializes with public as the default, and nacos config initializes with an empty string.

  2. If you need a custom namespace, how is its value produced?

    The left-hand menu of the Nacos console has a Namespace feature; clicking it shows a Create Namespace button with which you can create your own namespace. On success, a namespace ID is generated, mainly to guard against duplicate namespace names. Therefore, when your application needs to target a specific namespace, what you fill in is the namespace ID. To repeat the important point three times:

    1. When your application needs to target a specific namespace, fill in the namespace ID
    2. When your application needs to target a specific namespace, fill in the namespace ID
    3. When your application needs to target a specific namespace, fill in the namespace ID

Note: the namespace public is reserved by Nacos. When creating your own namespace, do not reuse the name public; choose a name with concrete business semantics for your actual scenario, so that namespaces stay easy to tell apart.
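As an illustration, with Spring Cloud Alibaba the client property takes the namespace ID, not the display name; the ID below is a placeholder:

```properties
### Fill in the namespace ID generated by the console, not the namespace name
spring.cloud.nacos.config.namespace=f3d55c2c-xxxx-xxxx-xxxx-xxxxxxxxxxxx
spring.cloud.nacos.discovery.namespace=f3d55c2c-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```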

How the namespace parameter is initialized

The nacos client initializes namespace as shown in the following diagram:

The initialization consists of two parts:

  • The namespace passed in through the properties parameter when user code constructs a nacos client instance.

  • Resolution of the namespace parameter in cloud environments (EDAS on Alibaba Cloud).

    -Dnacos.use.cloud.namespace.parsing=true/false controls whether the namespace parameter is automatically resolved in cloud environments. The default is true (resolve automatically), so that applications can move to the cloud smoothly at zero cost. If, on the cloud, you need to use a namespace from a self-hosted Nacos, simply set -Dnacos.use.cloud.namespace.parsing=false.

endpoint

Regarding endpoint, the discussion likewise covers the design background of endpoint and the initialization of the endpoint parameter.

The design background of endpoint

When the nacos server cluster is scaled out or in, clients need a way to perceive the change promptly. That is what the endpoint provides: the client periodically sends requests to the endpoint to refresh the server cluster list held in client memory.

Initialization of the endpoint parameter

The Nacos Client can rule-parse the endpoint parameter passed to it. When endpoint is initialized through the constructor's properties, the value can be a concrete value or a placeholder of the form:

${endpoint.options:defaultValue}

Notes:

  1. endpoint.options is a variable name; its value can be read from system properties or system environment variables.
  2. defaultValue is the fallback used when no value can be obtained from the variable.

The full endpoint resolution rule is fairly involved; the overall resolution flow is shown below:

Note: the parts highlighted in blue support automatically reading the endpoint value from system environment variables in cloud environments (EDAS on Alibaba Cloud), so that local development and migration to the cloud work smoothly with zero-cost changes.

Details:

  • With endpoint parameter rule parsing enabled:

    1. If no endpoint is specified through properties when initializing the Nacos Client, the value of the system environment variable ALIBABA_ALIWARE_ENDPOINT_URL is used; if that is not set either, an empty string is returned.

    2. If an endpoint is set:

    3. When the configured endpoint is a concrete value:

      The system environment variable ALIBABA_ALIWARE_ENDPOINT_URL is consulted first; if it is not set, the concrete value supplied by user code initializes the endpoint.

    4. When the input is a placeholder:

      The placeholder name is extracted, and then:

      1. Values are read from system properties first, then environment variables.

        For example, given ${nacos.endpoint:defaultValue}, the extracted placeholder is nacos.endpoint. The client first checks whether the system property nacos.endpoint (i.e. System.getProperty("nacos.endpoint")) is set; if not, it reads the environment variable named nacos.endpoint.

      2. If the endpoint is still not initialized through the placeholder, the value of the environment variable ALIBABA_ALIWARE_ENDPOINT_URL is used.

      3. If the endpoint is still uninitialized after the two steps above, the default value, if given, is used; otherwise the extracted placeholder itself is returned.

  • With endpoint parameter rule parsing disabled:

    The endpoint value that user code passes through the properties parameter when constructing the Nacos Client is used as-is.

By default, the Nacos Client has endpoint parameter rule parsing enabled. To disable it, there are two ways:

  1. When initializing the Nacos Client, set the key isUseEndpointParsingRule to false in the properties instance passed in.
  2. For Java applications, pass -Dnacos.use.endpoint.parsing.rule=false.

Note: the first way takes precedence over the second.
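The placeholder branch of the resolution order above can be sketched as follows. This is a simplified model that checks system properties, then environment variables, then the default value; it omits the ALIBABA_ALIWARE_ENDPOINT_URL step and is not the actual Nacos Client code:

```java
public class EndpointPlaceholder {

    // Resolve "${name:default}": system property first, then environment
    // variable, then the default; a concrete value is returned as-is, and an
    // unresolved placeholder with no default falls back to the placeholder name.
    static String resolve(String input) {
        if (input == null || !input.startsWith("${") || !input.endsWith("}")) {
            return input; // concrete value
        }
        String body = input.substring(2, input.length() - 1);
        int idx = body.indexOf(':');
        String name = idx >= 0 ? body.substring(0, idx) : body;
        String def = idx >= 0 ? body.substring(idx + 1) : null;
        String v = System.getProperty(name);
        if (v == null) {
            v = System.getenv(name);
        }
        if (v == null) {
            v = def;
        }
        return v != null ? v : name;
    }

    public static void main(String[] args) {
        System.setProperty("nacos.endpoint", "nacos.example.com");
        System.out.println(resolve("${nacos.endpoint:defaultValue}")); // nacos.example.com
        System.clearProperty("nacos.endpoint");
        System.out.println(resolve("${nacos.endpoint:defaultValue}")); // defaultValue
    }
}
```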