From c4e0e1d23d0f73643e54944327decfbca6602a8c Mon Sep 17 00:00:00 2001 From: irebai Date: Wed, 4 Mar 2020 14:53:55 +0100 Subject: [PATCH 001/172] Update test_deployment.sh --- test/test_deployment.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/test_deployment.sh b/test/test_deployment.sh index 79d509d..b1b8d36 100755 --- a/test/test_deployment.sh +++ b/test/test_deployment.sh @@ -1 +1 @@ -curl -X POST "http://localhost:8888/transcribe" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "wavFile=@bonjour.wav;type=audio/wav" +curl -X POST "http://localhost:8888/transcribe" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "file=@bonjour.wav;type=audio/wav" From b42d6e043c7cad6aa67f190587186f694d41033c Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 8 Apr 2020 11:11:24 +0200 Subject: [PATCH 002/172] add LICENCE and RELEASE files --- LICENCE | 661 +++++++++++++++++++++++++++++++++++++++++++++++++++++ RELEASE.md | 2 + 2 files changed, 663 insertions(+) create mode 100644 LICENCE create mode 100644 RELEASE.md diff --git a/LICENCE b/LICENCE new file mode 100644 index 0000000..c39e3a4 --- /dev/null +++ b/LICENCE @@ -0,0 +1,661 @@ + GNU AFFERO GENERAL PUBLIC LICENSE + Version 3, 19 November 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU Affero General Public License is a free, copyleft license for +software and other kinds of works, specifically designed to ensure +cooperation with the community in the case of network server software. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +our General Public Licenses are intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + Developers that use our General Public Licenses protect your rights +with two steps: (1) assert copyright on the software, and (2) offer +you this License which gives you legal permission to copy, distribute +and/or modify the software. + + A secondary benefit of defending all users" freedom is that +improvements made in alternate versions of the program, if they +receive widespread use, become available for other developers to +incorporate. Many developers of free software are heartened and +encouraged by the resulting cooperation. However, in the case of +software used on network servers, this result may fail to come about. +The GNU General Public License permits making a modified version and +letting the public access it on a server without ever releasing its +source code to the public. + + The GNU Affero General Public License is designed specifically to +ensure that, in such cases, the modified source code becomes available +to the community. It requires the operator of a network server to +provide the source code of the modified version running there to the +users of that server. 
Therefore, public use of a modified version, on +a publicly accessible server, gives the public access to the source +code of the modified version. + + An older license, called the Affero General Public License and +published by Affero, was designed to accomplish similar goals. This is +a different license, not a version of the Affero GPL, but Affero has +released a new version of the Affero GPL which permits relicensing under +this license. + + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU Affero General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. 
A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work"s +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users" Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. + + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work"s +users, your or third parties" legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. 
+ + You may convey verbatim copies of the Program"s source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. + + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation"s users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. 
+ + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. + + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. 
+ + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. 
+ + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. + + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. 
If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party"s predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor"s "contributor version". + + A contributor"s "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor"s essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). 
To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. + + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient"s use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others" Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Remote Network Interaction; Use with the GNU General Public License. 
+ + Notwithstanding any other provision of this License, if you modify the +Program, your modified version must prominently offer all users +interacting with it remotely through a computer network (if your version +supports such interaction) an opportunity to receive the Corresponding +Source of your version by providing access to the Corresponding Source +from a network server at no charge, through some standard or customary +means of facilitating copying of software. This Corresponding Source +shall include the Corresponding Source for any work covered by version 3 +of the GNU General Public License that is incorporated pursuant to the +following paragraph. + + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU General Public License into a single +combined work, and to convey the resulting work. The terms of this +License will continue to apply to the part which is the covered work, +but the work with which it is combined will remain governed by version +3 of the GNU General Public License. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU Affero General Public License from time to time. Such new versions +will be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU Affero General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU Affero General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU Affero General Public License can be used, that proxy"s +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. 
+ + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU Affero General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU Affero General Public License for more details. + + You should have received a copy of the GNU Affero General Public License + along with this program. If not, see . + +Also add information on how to contact you by electronic and paper mail. + + If your software can interact with users remotely through a computer +network, you should also make sure that it provides a way for users to +get its source. For example, if your program is a web application, its +interface could display a "Source" link that leads users to an archive +of the code. There are many ways you could offer source, and different +solutions will be better for different programs; see section 13 for the +specific requirements. + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU AGPL, see +. 
diff --git a/RELEASE.md b/RELEASE.md new file mode 100644 index 0000000..cc6807c --- /dev/null +++ b/RELEASE.md @@ -0,0 +1,2 @@ +# 1.0.0 +- First build of LinTO-Platform-stt-standalone-worker \ No newline at end of file From 667e6b9af22cbb479070914ae823cd5c2f57ed25 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 19 May 2020 23:51:51 +0200 Subject: [PATCH 003/172] fix minor bugs and replace swagger docker by a python package --- .envdefault | 6 +-- Dockerfile | 115 +++++++++++++++++++++++-------------------- Jenkinsfile | 51 +++++++++++++++++++ RELEASE.md | 8 ++- docker-compose.yml | 17 ++----- document/swagger.yml | 10 +--- run.py | 109 ++++++++++++++++++++++++---------------- 7 files changed, 193 insertions(+), 123 deletions(-) create mode 100644 Jenkinsfile diff --git a/.envdefault b/.envdefault index 42419d0..2246e24 100644 --- a/.envdefault +++ b/.envdefault @@ -1,7 +1,3 @@ AM_PATH=/path/to/acoustic/models/dir LM_PATH=/path/to/language/models/dir -WORKER_PORT=8888 -SERVICE_PORT=2000 - -SWAGGER_PATH=./document -SWAGGER_JSON=/app/swagger/swagger.yml +SWAGGER_PATH=/path/to/swagger/file \ No newline at end of file diff --git a/Dockerfile b/Dockerfile index 988ed0e..83d7300 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,5 +1,5 @@ FROM debian:9 -MAINTAINER Ilyes REBAI +LABEL maintainer="irebai@linagora.com" # Install all our dependencies and set some required build changes RUN apt-get update &&\ @@ -10,62 +10,71 @@ RUN apt-get update &&\ python3-dev \ python-pip \ python3-pip \ - autoconf \ - automake \ - unzip \ - bc \ - bzip2 \ - default-jre \ - g++ \ - git \ - gzip \ - libatlas3-base \ - libtool-bin \ - make \ - sox \ - libsox-fmt-all \ - libav-tools \ - subversion \ - vorbis-tools \ - wget \ - zlib1g-dev &&\ - apt-get clean autoclean && \ - apt-get autoremove -y && \ - ln -s /usr/bin/python2.7 /usr/bin/python ; ln -s -f bash /bin/sh + g++ make automake autoconf bzip2 unzip wget sox libtool git subversion zlib1g-dev ca-certificates gfortran patch ffmpeg nano && \ + apt-get clean -ENV BASE_DIR /opt/speech-to-text +## Build kaldi and Clean installation (intel, openfst, src/*) +RUN git clone --depth 1 https://github.com/kaldi-asr/kaldi.git /opt/kaldi && \ + cd /opt/kaldi && \ + cd /opt/kaldi/tools && \ + ./extras/install_mkl.sh && \ + make -j $(nproc) && \ + cd /opt/kaldi/src && \ + ./configure --shared && \ + make depend -j $(nproc) && \ + make -j $(nproc) && \ + mkdir -p /opt/kaldi/src_/lib /opt/kaldi/src_/bin && \ + mv /opt/kaldi/src/base/libkaldi-base.so \ + /opt/kaldi/src/chain/libkaldi-chain.so \ + /opt/kaldi/src/cudamatrix/libkaldi-cudamatrix.so \ + /opt/kaldi/src/decoder/libkaldi-decoder.so \ + /opt/kaldi/src/feat/libkaldi-feat.so \ + /opt/kaldi/src/fstext/libkaldi-fstext.so \ + /opt/kaldi/src/gmm/libkaldi-gmm.so \ + /opt/kaldi/src/hmm/libkaldi-hmm.so \ + /opt/kaldi/src/ivector/libkaldi-ivector.so \ + /opt/kaldi/src/kws/libkaldi-kws.so \ + /opt/kaldi/src/lat/libkaldi-lat.so \ + /opt/kaldi/src/lm/libkaldi-lm.so \ + /opt/kaldi/src/matrix/libkaldi-matrix.so \ + /opt/kaldi/src/nnet/libkaldi-nnet.so \ + /opt/kaldi/src/nnet2/libkaldi-nnet2.so \ + /opt/kaldi/src/nnet3/libkaldi-nnet3.so \ + /opt/kaldi/src/online2/libkaldi-online2.so \ + /opt/kaldi/src/rnnlm/libkaldi-rnnlm.so \ + /opt/kaldi/src/sgmm2/libkaldi-sgmm2.so \ + /opt/kaldi/src/transform/libkaldi-transform.so \ + /opt/kaldi/src/tree/libkaldi-tree.so \ + /opt/kaldi/src/util/libkaldi-util.so \ + /opt/kaldi/src_/lib && \ + mv /opt/kaldi/src/online2bin/online2-wav-nnet2-latgen-faster \ + 
/opt/kaldi/src/online2bin/online2-wav-nnet3-latgen-faster \ + /opt/kaldi/src/latbin/lattice-1best \ + /opt/kaldi/src/latbin/lattice-align-words \ + /opt/kaldi/src/latbin/nbest-to-ctm /opt/kaldi/src_/bin && \ + rm -rf /opt/kaldi/src && mv /opt/kaldi/src_ /opt/kaldi/src && \ + cd /opt/kaldi/src && rm -f lmbin/*.cc lmbin/*.o lmbin/Makefile fstbin/*.cc fstbin/*.o fstbin/Makefile bin/*.cc bin/*.o bin/Makefile && \ + cd /opt/intel/mkl/lib && rm -f intel64/*.a intel64_lin/*.a && \ + cd /opt/kaldi/tools && mkdir openfsttmp && mv openfst-*/lib openfst-*/include openfst-*/bin openfsttmp && rm openfsttmp/lib/*.a openfsttmp/lib/*.la && \ + rm -r openfst-*/* && mv openfsttmp/* openfst-*/ && rm -r openfsttmp +## Install python packages +RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml -# Build kaldi -## Install main libraries -RUN cd /opt && git clone https://github.com/kaldi-asr/kaldi.git && \ - cd /opt/kaldi/tools && make -j$(nproc) +## Create symbolik links +RUN cd /opt/kaldi/src/bin && \ + ln -s online2-wav-nnet2-latgen-faster kaldi-nnet2-latgen-faster && \ + ln -s online2-wav-nnet3-latgen-faster kaldi-nnet3-latgen-faster && \ + ln -s lattice-1best kaldi-lattice-1best && \ + ln -s lattice-align-words kaldi-lattice-align-words && \ + ln -s nbest-to-ctm kaldi-nbest-to-ctm -#Install MKL package -RUN cd /opt/kaldi/tools && \ - extras/install_mkl.sh +# Set environment variables +ENV PATH /opt/kaldi/src/bin:/opt/kaldi/egs/wsj/s5/utils/:$PATH -## Install main functions -RUN cd /opt/kaldi/src && \ - sed -i -e ':a;N;$!ba;s:\\\n::g' Makefile && \ - sed -i -e 's:^SUBDIRS = .*$:SUBDIRS = base matrix util feat tree gmm transform fstext hmm lm decoder lat cudamatrix nnet bin nnet2 nnet3 chain ivector online2:g' -e 's:^MEMTESTDIRS = .*$:MEMTESTDIRS = :g' Makefile && \ - ./configure --shared && make depend -j$(nproc) && make -j$(nproc) && rm */*{.a,.o} - -RUN apt install -y libatlas-dev -RUN pip2 install flask configparser requests flask-cors - -RUN echo "/opt/kaldi/src/lib/" > /etc/ld.so.conf.d/kaldi.conf && \ - echo "/opt/kaldi/tools/openfst/lib/" >> /etc/ld.so.conf.d/kaldi.conf && \ - ldconfig - -COPY Makefile /opt -RUN cd /opt && ./Makefile - -RUN mkdir -p /opt/tmp -RUN cp /opt/kaldi/egs/wsj/s5/utils/int2sym.pl /opt -ENV PATH /opt:$PATH - -WORKDIR $BASE_DIR +WORKDIR /usr/src/speech-to-text COPY run.py . 
-CMD ./run.py +EXPOSE 80 + +CMD python3 ./run.py diff --git a/Jenkinsfile b/Jenkinsfile new file mode 100644 index 0000000..b4bdffc --- /dev/null +++ b/Jenkinsfile @@ -0,0 +1,51 @@ +pipeline { + agent any + environment { + DOCKER_HUB_REPO = "lintoai/linto-platform-stt-standalone-worker" + DOCKER_HUB_CRED = 'docker-hub-credentials' + + VERSION = '' + } + + stages{ + stage('Docker build for master branch'){ + when{ + branch 'master' + } + steps { + echo 'Publishing latest' + script { + image = docker.build(env.DOCKER_HUB_REPO) + VERSION = sh( + returnStdout: true, + script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" + ).trim() + + docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { + image.push("${VERSION}") + image.push('latest') + } + } + } + } + + stage('Docker build for next (unstable) branch'){ + when{ + branch 'next' + } + steps { + echo 'Publishing unstable' + script { + image = docker.build(env.DOCKER_HUB_REPO) + VERSION = sh( + returnStdout: true, + script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" + ).trim() + docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { + image.push('latest-unstable') + } + } + } + } + }// end stages +} diff --git a/RELEASE.md b/RELEASE.md index cc6807c..a2826a4 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,2 +1,6 @@ -# 1.0.0 -- First build of LinTO-Platform-stt-standalone-worker \ No newline at end of file +# 1.1.2 +- New features: + - Word timestamp computing + - Response type: plain/text: simple text output and application/json: the transcription and the words timestamp. + - Swagger: integrate swagger in the service using a python package + - Fix minor bugs \ No newline at end of file diff --git a/docker-compose.yml b/docker-compose.yml index b8ee470..8c8e9aa 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -9,19 +9,10 @@ services: volumes: - ${AM_PATH}:/opt/models/AM - ${LM_PATH}:/opt/models/LM + - ${SWAGGER_PATH}:/opt/swagger.yml ports: - - ${WORKER_PORT}:${SERVICE_PORT} + - target: 80 + published: 8888 env_file: .env environment: - - ${SERVICE_PORT} - - swaggerui: - image: swaggerapi/swagger-ui - ports: - - 80:8080 - hostname: swaggerui - volumes: - - ${SWAGGER_PATH}:/app/swagger/ - env_file: .env - environment: - - SWAGGER_JSON + SWAGGER_PATH: /opt/swagger.yml \ No newline at end of file diff --git a/document/swagger.yml b/document/swagger.yml index 9cab647..ebc5c08 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -20,19 +20,13 @@ paths: - "multipart/form-data" produces: - "application/json" + - "text/plain" parameters: - name: "file" in: "formData" - description: "Wave File" + description: "Audio File (wav, mp3, aiff, flac, ogg)" required: true type: "file" - - name: "metadata" - in: "query" - description: "Accepted header" - required: true - type: "string" - enum: [ "Text", "Json" ] - default: "text" responses: 200: description: Successfully transcribe the audio diff --git a/run.py b/run.py index 822ee68..d678e94 100755 --- a/run.py +++ b/run.py @@ -1,33 +1,26 @@ -#!/usr/bin/env python2 +#!/usr/bin/env python3 # -*- coding: utf-8 -*- from flask import Flask, request, abort, Response, json +from flask_swagger_ui import get_swaggerui_blueprint from flask_cors import CORS -from os import path -import uuid, os -import configparser -import subprocess -import shlex -import re +import uuid, os, configparser, subprocess, shlex, re, yaml app = Flask(__name__) -CORS(app) - -global busy -busy=0 +# 
Main parameters AM_PATH = '/opt/models/AM' LM_PATH = '/opt/models/LM' -TEMP_FILE_PATH = '/opt/tmp' #/opt/wavs -TEMP_FILE_PATH1= '/opt/models' +TEMP_FILE_PATH = '/opt/tmp' +CONFIG_FILES_PATH = '/opt/config' +SERVICE_PORT=80 +SWAGGER_URL='/api-doc' +if not os.path.isdir(TEMP_FILE_PATH): + os.mkdir(TEMP_FILE_PATH) +if not os.path.isdir(CONFIG_FILES_PATH): + os.mkdir(CONFIG_FILES_PATH) -def dockerId(): - with open('/proc/self/cgroup') as f: - lines = f.readlines() - for l in lines: - if '/docker/' in l: - return l.split('/')[2][:20] def run_shell_command(command_line): try: @@ -36,25 +29,25 @@ def run_shell_command(command_line): output, error = process.communicate() return False, output except OSError as err: - print("OS error: {0}".format(err)) + app.logger.info("OS error: {0}".format(err)) return True, '' except ValueError: - print("data error.") + app.logger.info("data error.") return True, '' except: - print("Unexpected error:", sys.exc_info()[0]) + app.logger.info("Unexpected error:", sys.exc_info()[0]) return True, '' def decode(audio_file,wav_name,do_word_tStamp): # Normalize audio file and convert it to wave format error, output = run_shell_command("sox "+audio_file+" -t wav -b 16 -r 16000 -c 1 "+audio_file+".wav") - if not path.exists(audio_file+".wav"): + if not os.path.exists(audio_file+".wav"): app.logger.info(output) return False, 'Error during audio file conversion!!! Supported formats are wav, mp3, aiff, flac, and ogg.' decode_file = audio_file+".wav" - decode_conf = TEMP_FILE_PATH1+"/online.conf" + decode_conf = CONFIG_FILES_PATH+"/online.conf" decode_mdl = AM_PATH+"/"+AM_FILE_PATH+"/final.mdl" decode_graph = LM_PATH+"/HCLG.fst" decode_words = LM_PATH+"/words.txt" @@ -62,20 +55,27 @@ def decode(audio_file,wav_name,do_word_tStamp): # Decode the audio file + decode_opt =" --min-active="+DECODER_MINACT + decode_opt+=" --max-active="+DECODER_MAXACT + decode_opt+=" --beam="+DECODER_BEAM + decode_opt+=" --lattice-beam="+DECODER_LATBEAM + decode_opt+=" --acoustic-scale="+DECODER_ACWT + + if DECODER_SYS == 'dnn3': - error, output = run_shell_command("kaldi-nnet3-latgen-faster --do-endpointing=false --frame-subsampling-factor="+DECODER_FSF+" --frames-per-chunk=20 --online=false --config="+decode_conf+" --minimize=false --min-active="+DECODER_MINACT+" --max-active="+DECODER_MAXACT+" --beam="+DECODER_BEAM+" --lattice-beam="+DECODER_LATBEAM+" --acoustic-scale="+DECODER_ACWT+" --word-symbol-table="+decode_words+" "+decode_mdl+" "+decode_graph+" \"ark:echo "+wav_name+" "+wav_name+"|\" \"scp:echo "+wav_name+" "+decode_file+"|\" ark:"+TEMP_FILE_PATH+"/"+wav_name+".lat") + error, output = run_shell_command("kaldi-nnet3-latgen-faster --do-endpointing=false --frames-per-chunk=20 --online=false --frame-subsampling-factor="+DECODER_FSF+" --config="+decode_conf+" --minimize=false "+decode_opt+" --word-symbol-table="+decode_words+" "+decode_mdl+" "+decode_graph+" \"ark:echo "+wav_name+" "+wav_name+"|\" \"scp:echo "+wav_name+" "+decode_file+"|\" ark:"+TEMP_FILE_PATH+"/"+wav_name+".lat") elif DECODER_SYS == 'dnn2' or DECODER_SYS == 'dnn': - error, output = run_shell_command("kaldi-nnet2-latgen-faster --do-endpointing=false --online=false --config="+decode_conf+" --min-active="+DECODER_MINACT+" --max-active="+DECODER_MAXACT+" --beam="+DECODER_BEAM+" --lattice-beam="+DECODER_LATBEAM+" --acoustic-scale="+DECODER_ACWT+" --word-symbol-table="+decode_words+" "+decode_mdl+" "+decode_graph+" \"ark:echo "+wav_name+" "+wav_name+"|\" \"scp:echo "+wav_name+" "+decode_file+"|\" 
ark:"+TEMP_FILE_PATH+"/"+wav_name+".lat") + error, output = run_shell_command("kaldi-nnet2-latgen-faster --do-endpointing=false --online=false --config="+decode_conf+" "+decode_opt+" --word-symbol-table="+decode_words+" "+decode_mdl+" "+decode_graph+" \"ark:echo "+wav_name+" "+wav_name+"|\" \"scp:echo "+wav_name+" "+decode_file+"|\" ark:"+TEMP_FILE_PATH+"/"+wav_name+".lat") else: return False, 'The "decoder" parameter of the acoustic model is not supported!!!' - if not path.exists(TEMP_FILE_PATH+"/"+wav_name+".lat"): + if not os.path.exists(TEMP_FILE_PATH+"/"+wav_name+".lat"): app.logger.info(output) return False, 'One or multiple parameters of the acoustic model are not correct!!!' # Normalize the obtained transcription - hypothesis = re.findall('\n'+wav_name+'.*',output) + hypothesis = re.findall('\n'+wav_name+'.*',output.decode('utf-8')) trans=re.sub(wav_name,'',hypothesis[0]).strip() trans=re.sub(r"#nonterm:[^ ]* ", "", trans) trans=re.sub(r" ", " ", " "+trans+" ") @@ -88,7 +88,7 @@ def decode(audio_file,wav_name,do_word_tStamp): error, output = run_shell_command("kaldi-nbest-to-ctm ark:"+TEMP_FILE_PATH+"/"+wav_name+".words "+TEMP_FILE_PATH+"/"+wav_name+".ctm") error, output = run_shell_command("int2sym.pl -f 5 "+decode_words+" "+TEMP_FILE_PATH+"/"+wav_name+".ctm") if not error and output != "": - words = output.split("\n") + words = output.decode('utf-8').split("\n") trans = "" data = {} data["words"] = [] @@ -117,8 +117,14 @@ def transcribe(): global busy busy=1 fileid = str(uuid.uuid4()) - metadata = True if request.args.get('metadata').lower() == 'json' else False + if request.headers.get('accept').lower() == 'application/json': + metadata = True + elif request.headers.get('accept').lower() == 'text/plain': + metadata = False + else: + return 'Not accepted header', 400 + if 'file' in request.files.keys(): file = request.files['file'] file_ext = file.filename.rsplit('.', 1)[-1].lower() @@ -144,20 +150,27 @@ def transcribe(): json_string = json.dumps(out, ensure_ascii=False) return Response(json_string,content_type="application/json; charset=utf-8" ), 200 -@app.route('/check', methods=['GET']) +@app.route('/healthcheck', methods=['GET']) def check(): return '1', 200 -@app.route('/stop', methods=['POST']) -def stop(): - while(busy==1): - continue - subprocess.call("kill 1",shell=True) - return '1', 200 +# Rejected request handlers +@app.errorhandler(405) +def page_not_found(error): + return 'The method is not allowed for the requested URL', 405 -if __name__ == '__main__': - SERVICE_PORT = os.environ['SERVICE_PORT'] +@app.errorhandler(404) +def page_not_found(error): + return 'The requested URL was not found', 404 +if __name__ == '__main__': + if 'SERVICE_PORT' in os.environ: + SERVICE_PORT = os.environ['SERVICE_PORT'] + if 'SWAGGER_PATH' not in os.environ: + exit("You have to provide a 'SWAGGER_PATH'") + + SWAGGER_PATH = os.environ['SWAGGER_PATH'] + #Decoder parameters applied for both GMM and DNN based ASR systems decoder_settings = configparser.ConfigParser() decoder_settings.read(AM_PATH+'/decode.cfg') @@ -174,15 +187,15 @@ def stop(): AM_FINAL_PATH=AM_PATH+"/"+AM_FILE_PATH with open(AM_FINAL_PATH+"/conf/online.conf") as f: values = f.readlines() - with open(TEMP_FILE_PATH1+"/online.conf", 'w') as f: + with open(CONFIG_FILES_PATH+"/online.conf", 'w') as f: for i in values: f.write(i) - f.write("--ivector-extraction-config="+TEMP_FILE_PATH1+"/ivector_extractor.conf\n") + f.write("--ivector-extraction-config="+CONFIG_FILES_PATH+"/ivector_extractor.conf\n") 
f.write("--mfcc-config="+AM_FINAL_PATH+"/conf/mfcc.conf") with open(AM_FINAL_PATH+"/conf/ivector_extractor.conf") as f: values = f.readlines() - with open(TEMP_FILE_PATH1+"/ivector_extractor.conf", 'w') as f: + with open(CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: for i in values: f.write(i) f.write("--splice-config="+AM_FINAL_PATH+"/conf/splice.conf\n") @@ -192,6 +205,18 @@ def stop(): f.write("--diag-ubm="+AM_FINAL_PATH+"/ivector_extractor/final.dubm\n") f.write("--ivector-extractor="+AM_FINAL_PATH+"/ivector_extractor/final.ie") + ### swagger specific ### + swagger_yml = yaml.load(open(SWAGGER_PATH, 'r'), Loader=yaml.Loader) + swaggerui = get_swaggerui_blueprint( + SWAGGER_URL, # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' + SWAGGER_PATH, + config={ # Swagger UI config overrides + 'app_name': "STT API Documentation", + 'spec': swagger_yml + } + ) + app.register_blueprint(swaggerui, url_prefix=SWAGGER_URL) + ### end swagger specific ### #Run server app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=1) From e53ed3f244244fb67a5d597ad8e2eb60ba28d2ae Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 2 Jun 2020 20:20:33 +0200 Subject: [PATCH 004/172] New stt-worker based on pykaldi package --- Dockerfile | 178 ++++++++++++++++++++++++++++----------------- Jenkinsfile | 19 +++++ Makefile | 14 ---- RELEASE.md | 8 +- run.py | 205 +++++++--------------------------------------------- 5 files changed, 158 insertions(+), 266 deletions(-) delete mode 100755 Makefile diff --git a/Dockerfile b/Dockerfile index 83d7300..dc10fa9 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,80 +1,126 @@ -FROM debian:9 +# Dockerfile for building PyKaldi image from Ubuntu 16.04 image +FROM ubuntu:18.04 LABEL maintainer="irebai@linagora.com" -# Install all our dependencies and set some required build changes -RUN apt-get update &&\ - apt-get install -y \ - python2.7 \ - python3 \ - python-dev \ - python3-dev \ - python-pip \ +# Install necessary system packages +RUN apt-get update \ + && apt-get install -y \ + python3 \ python3-pip \ - g++ make automake autoconf bzip2 unzip wget sox libtool git subversion zlib1g-dev ca-certificates gfortran patch ffmpeg nano && \ - apt-get clean + python2.7 \ + autoconf \ + automake \ + cmake \ + make \ + curl \ + g++ \ + git \ + graphviz \ + libatlas3-base \ + libtool \ + pkg-config \ + sox \ + subversion \ + bzip2 \ + unzip \ + wget \ + zlib1g-dev \ + ca-certificates \ + gfortran \ + patch \ + ffmpeg \ + nano && \ + ln -s /usr/bin/python3 /usr/bin/python && \ + ln -s /usr/bin/pip3 /usr/bin/pip -## Build kaldi and Clean installation (intel, openfst, src/*) -RUN git clone --depth 1 https://github.com/kaldi-asr/kaldi.git /opt/kaldi && \ - cd /opt/kaldi && \ - cd /opt/kaldi/tools && \ - ./extras/install_mkl.sh && \ - make -j $(nproc) && \ - cd /opt/kaldi/src && \ - ./configure --shared && \ - make depend -j $(nproc) && \ - make -j $(nproc) && \ - mkdir -p /opt/kaldi/src_/lib /opt/kaldi/src_/bin && \ - mv /opt/kaldi/src/base/libkaldi-base.so \ - /opt/kaldi/src/chain/libkaldi-chain.so \ - /opt/kaldi/src/cudamatrix/libkaldi-cudamatrix.so \ - /opt/kaldi/src/decoder/libkaldi-decoder.so \ - /opt/kaldi/src/feat/libkaldi-feat.so \ - /opt/kaldi/src/fstext/libkaldi-fstext.so \ - /opt/kaldi/src/gmm/libkaldi-gmm.so \ - /opt/kaldi/src/hmm/libkaldi-hmm.so \ - /opt/kaldi/src/ivector/libkaldi-ivector.so \ - /opt/kaldi/src/kws/libkaldi-kws.so \ - /opt/kaldi/src/lat/libkaldi-lat.so \ - /opt/kaldi/src/lm/libkaldi-lm.so \ - 
/opt/kaldi/src/matrix/libkaldi-matrix.so \ - /opt/kaldi/src/nnet/libkaldi-nnet.so \ - /opt/kaldi/src/nnet2/libkaldi-nnet2.so \ - /opt/kaldi/src/nnet3/libkaldi-nnet3.so \ - /opt/kaldi/src/online2/libkaldi-online2.so \ - /opt/kaldi/src/rnnlm/libkaldi-rnnlm.so \ - /opt/kaldi/src/sgmm2/libkaldi-sgmm2.so \ - /opt/kaldi/src/transform/libkaldi-transform.so \ - /opt/kaldi/src/tree/libkaldi-tree.so \ - /opt/kaldi/src/util/libkaldi-util.so \ - /opt/kaldi/src_/lib && \ - mv /opt/kaldi/src/online2bin/online2-wav-nnet2-latgen-faster \ - /opt/kaldi/src/online2bin/online2-wav-nnet3-latgen-faster \ - /opt/kaldi/src/latbin/lattice-1best \ - /opt/kaldi/src/latbin/lattice-align-words \ - /opt/kaldi/src/latbin/nbest-to-ctm /opt/kaldi/src_/bin && \ - rm -rf /opt/kaldi/src && mv /opt/kaldi/src_ /opt/kaldi/src && \ - cd /opt/kaldi/src && rm -f lmbin/*.cc lmbin/*.o lmbin/Makefile fstbin/*.cc fstbin/*.o fstbin/Makefile bin/*.cc bin/*.o bin/Makefile && \ - cd /opt/intel/mkl/lib && rm -f intel64/*.a intel64_lin/*.a && \ - cd /opt/kaldi/tools && mkdir openfsttmp && mv openfst-*/lib openfst-*/include openfst-*/bin openfsttmp && rm openfsttmp/lib/*.a openfsttmp/lib/*.la && \ - rm -r openfst-*/* && mv openfsttmp/* openfst-*/ && rm -r openfsttmp +# Install necessary Python packages (pykaldi dependencies) +RUN pip install --upgrade pip \ + numpy \ + setuptools \ + pyparsing \ + ninja -## Install python packages -RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml +## Install Protobuf, CLIF, Kaldi and PyKaldi and Clean installation +RUN git clone --depth 1 https://github.com/pykaldi/pykaldi.git /pykaldi \ + && cd /pykaldi/tools \ + && sed -i "s/make \-j4/make -j $(nproc)/g" ./install_kaldi.sh \ + && sed -i "s/\-j 2/-j $(nproc)/g" ./install_clif.sh \ + && sed -i "s/make \-j4/make -j $(nproc)/g" ./install_protobuf.sh \ + && ./check_dependencies.sh \ + && ./install_protobuf.sh \ + && ./install_clif.sh \ + && ./install_kaldi.sh \ + && cd /pykaldi \ + && python setup.py install \ + && rm -rf /pykaldi/CMakeLists.txt \ + /pykaldi/LICENSE \ + /pykaldi/README.md \ + /pykaldi/setup.cfg \ + /pykaldi/setup.py \ + /pykaldi/docker \ + /pykaldi/docs \ + /pykaldi/extras \ + /pykaldi/pykaldi.egg-info \ + /pykaldi/tests \ + /pykaldi/build/CMakeCache.txt \ + /pykaldi/build/bdist.linux-x86_64 \ + /pykaldi/build/build.ninja \ + /pykaldi/build/cmake_install.cmake \ + /pykaldi/build/docs \ + /pykaldi/build/kaldi \ + /pykaldi/build/lib \ + /pykaldi/build/rules.ninja \ + /pykaldi/tools/check_dependencies.sh \ + /pykaldi/tools/clif* \ + /pykaldi/tools/find_python_library.py \ + /pykaldi/tools/install_* \ + /pykaldi/tools/protobuf \ + /pykaldi/tools/use_namespace.sh \ + /pykaldi/tools/kaldi/COPYING \ + /pykaldi/tools/kaldi/INSTALL \ + /pykaldi/tools/kaldi/README.md \ + /pykaldi/tools/kaldi/egs \ + /pykaldi/tools/kaldi/misc \ + /pykaldi/tools/kaldi/scripts \ + /pykaldi/tools/kaldi/windows \ + && mkdir -p /pykaldi/tools/kaldi/src_/lib \ + && mv /pykaldi/tools/kaldi/src/base/libkaldi-base.so \ + /pykaldi/tools/kaldi/src/chain/libkaldi-chain.so \ + /pykaldi/tools/kaldi/src/cudamatrix/libkaldi-cudamatrix.so \ + /pykaldi/tools/kaldi/src/decoder/libkaldi-decoder.so \ + /pykaldi/tools/kaldi/src/feat/libkaldi-feat.so \ + /pykaldi/tools/kaldi/src/fstext/libkaldi-fstext.so \ + /pykaldi/tools/kaldi/src/gmm/libkaldi-gmm.so \ + /pykaldi/tools/kaldi/src/hmm/libkaldi-hmm.so \ + /pykaldi/tools/kaldi/src/ivector/libkaldi-ivector.so \ + /pykaldi/tools/kaldi/src/kws/libkaldi-kws.so \ + /pykaldi/tools/kaldi/src/lat/libkaldi-lat.so \ + 
/pykaldi/tools/kaldi/src/lm/libkaldi-lm.so \ + /pykaldi/tools/kaldi/src/matrix/libkaldi-matrix.so \ + /pykaldi/tools/kaldi/src/nnet/libkaldi-nnet.so \ + /pykaldi/tools/kaldi/src/nnet2/libkaldi-nnet2.so \ + /pykaldi/tools/kaldi/src/nnet3/libkaldi-nnet3.so \ + /pykaldi/tools/kaldi/src/online2/libkaldi-online2.so \ + /pykaldi/tools/kaldi/src/rnnlm/libkaldi-rnnlm.so \ + /pykaldi/tools/kaldi/src/sgmm2/libkaldi-sgmm2.so \ + /pykaldi/tools/kaldi/src/transform/libkaldi-transform.so \ + /pykaldi/tools/kaldi/src/tree/libkaldi-tree.so \ + /pykaldi/tools/kaldi/src/util/libkaldi-util.so \ + /pykaldi/tools/kaldi/src_/lib \ + && rm -rf /pykaldi/tools/kaldi/src && mv /pykaldi/tools/kaldi/src_ /pykaldi/tools/kaldi/src \ + && cd /pykaldi/tools/kaldi/tools && mkdir openfsttmp && mv openfst-*/lib openfst-*/include openfst-*/bin openfsttmp && rm openfsttmp/lib/*.a openfsttmp/lib/*.la && \ + rm -r openfst-*/* && mv openfsttmp/* openfst-*/ && rm -r openfsttmp -## Create symbolik links -RUN cd /opt/kaldi/src/bin && \ - ln -s online2-wav-nnet2-latgen-faster kaldi-nnet2-latgen-faster && \ - ln -s online2-wav-nnet3-latgen-faster kaldi-nnet3-latgen-faster && \ - ln -s lattice-1best kaldi-lattice-1best && \ - ln -s lattice-align-words kaldi-lattice-align-words && \ - ln -s nbest-to-ctm kaldi-nbest-to-ctm +# Install main service packages +RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml # Set environment variables -ENV PATH /opt/kaldi/src/bin:/opt/kaldi/egs/wsj/s5/utils/:$PATH +ENV PATH /pykaldi/tools/kaldi/egs/wsj/s5/utils/:$PATH WORKDIR /usr/src/speech-to-text +COPY tools.py . COPY run.py . EXPOSE 80 -CMD python3 ./run.py +CMD python3 ./run.py \ No newline at end of file diff --git a/Jenkinsfile b/Jenkinsfile index b4bdffc..530e391 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -47,5 +47,24 @@ pipeline { } } } + + stage('Docker build for pykaldi (unstable) branch'){ + when{ + branch 'pykaldi' + } + steps { + echo 'Publishing new Feature branch' + script { + image = docker.build(env.DOCKER_HUB_REPO) + VERSION = sh( + returnStdout: true, + script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" + ).trim() + docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { + image.push('pykaldi') + } + } + } + } }// end stages } diff --git a/Makefile b/Makefile deleted file mode 100755 index 6b72774..0000000 --- a/Makefile +++ /dev/null @@ -1,14 +0,0 @@ -echo "Compile nnet2_decoder" -g++ -std=c++11 -L/usr/lib/atlas-base/atlas -L/opt/kaldi/tools/openfst/lib -L/opt/kaldi/src/lib -lblas -lkaldi-decoder -lkaldi-lat -lkaldi-fstext -lkaldi-hmm -lkaldi-transform -lkaldi-gmm -lkaldi-tree -lkaldi-util -lkaldi-matrix -lkaldi-base -lkaldi-nnet3 -lkaldi-online2 -lfst -lkaldi-cudamatrix -lkaldi-ivector -I /opt/kaldi/src -I /opt/kaldi/tools/openfst/include /opt/kaldi/src/online2bin/online2-wav-nnet2-latgen-faster.cc -o kaldi-nnet2-latgen-faster /opt/kaldi/src/lib/libkaldi-feat.so /opt/kaldi/src/lib/libkaldi-nnet2.so -lrt -lm -lpthread - -echo "Compile nnet3_decoder" -g++ -std=c++11 -L/usr/lib/atlas-base/atlas -L/opt/kaldi/tools/openfst/lib -L/opt/kaldi/src/lib -lblas -lkaldi-decoder -lkaldi-lat -lkaldi-fstext -lkaldi-hmm -lkaldi-transform -lkaldi-gmm -lkaldi-tree -lkaldi-util -lkaldi-matrix -lkaldi-base -lkaldi-nnet3 -lkaldi-online2 -lfst -lkaldi-cudamatrix -lkaldi-ivector -I /opt/kaldi/src -I /opt/kaldi/tools/openfst/include /opt/kaldi/src/online2bin/online2-wav-nnet3-latgen-faster.cc -o kaldi-nnet3-latgen-faster /opt/kaldi/src/lib/libkaldi-feat.so -lrt -lm 
-lpthread - -echo "Compile lattice-to-1best" -g++ -std=c++11 -L/usr/lib/atlas-base/atlas -L/opt/kaldi/tools/openfst/lib -L/opt/kaldi/src/lib -lblas -lkaldi-decoder -lkaldi-lat -lkaldi-fstext -lkaldi-hmm -lkaldi-transform -lkaldi-gmm -lkaldi-tree -lkaldi-util -lkaldi-matrix -lkaldi-base -lkaldi-nnet3 -lkaldi-online2 -lfst -lkaldi-cudamatrix -lkaldi-ivector -I /opt/kaldi/src -I /opt/kaldi/tools/openfst/include /opt/kaldi/src/latbin/lattice-1best.cc -o kaldi-lattice-1best /opt/kaldi/src/lib/libkaldi-feat.so -lrt -lm -lpthread - -echo "Compile lattice-align-words" -g++ -std=c++11 -L/usr/lib/atlas-base/atlas -L/opt/kaldi/tools/openfst/lib -L/opt/kaldi/src/lib -lblas -lkaldi-decoder -lkaldi-lat -lkaldi-fstext -lkaldi-hmm -lkaldi-transform -lkaldi-gmm -lkaldi-tree -lkaldi-util -lkaldi-matrix -lkaldi-base -lkaldi-nnet3 -lkaldi-online2 -lfst -lkaldi-cudamatrix -lkaldi-ivector -I /opt/kaldi/src -I /opt/kaldi/tools/openfst/include /opt/kaldi/src/latbin/lattice-align-words.cc -o kaldi-lattice-align-words /opt/kaldi/src/lib/libkaldi-feat.so -lrt -lm -lpthread - -echo "Compile nbest-to-ctm" -g++ -std=c++11 -L/usr/lib/atlas-base/atlas -L/opt/kaldi/tools/openfst/lib -L/opt/kaldi/src/lib -lblas -lkaldi-decoder -lkaldi-lat -lkaldi-fstext -lkaldi-hmm -lkaldi-transform -lkaldi-gmm -lkaldi-tree -lkaldi-util -lkaldi-matrix -lkaldi-base -lkaldi-nnet3 -lkaldi-online2 -lfst -lkaldi-cudamatrix -lkaldi-ivector -I /opt/kaldi/src -I /opt/kaldi/tools/openfst/include /opt/kaldi/src/latbin/nbest-to-ctm.cc -o kaldi-nbest-to-ctm /opt/kaldi/src/lib/libkaldi-feat.so -lrt -lm -lpthread diff --git a/RELEASE.md b/RELEASE.md index a2826a4..1d02d63 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,6 +1,2 @@ -# 1.1.2 -- New features: - - Word timestamp computing - - Response type: plain/text: simple text output and application/json: the transcription and the words timestamp. 
- - Swagger: integrate swagger in the service using a python package - - Fix minor bugs \ No newline at end of file +# 2.0.0 +- New ASR engine based on pykaldi package \ No newline at end of file diff --git a/run.py b/run.py index d678e94..88b6649 100755 --- a/run.py +++ b/run.py @@ -4,7 +4,7 @@ from flask import Flask, request, abort, Response, json from flask_swagger_ui import get_swaggerui_blueprint from flask_cors import CORS -import uuid, os, configparser, subprocess, shlex, re, yaml +import yaml, os app = Flask(__name__) @@ -16,139 +16,36 @@ SERVICE_PORT=80 SWAGGER_URL='/api-doc' + if not os.path.isdir(TEMP_FILE_PATH): os.mkdir(TEMP_FILE_PATH) if not os.path.isdir(CONFIG_FILES_PATH): os.mkdir(CONFIG_FILES_PATH) +# Environment parameters +if 'SERVICE_PORT' in os.environ: + SERVICE_PORT = os.environ['SERVICE_PORT'] +if 'SWAGGER_PATH' not in os.environ: + exit("You have to provide a 'SWAGGER_PATH'") +SWAGGER_PATH = os.environ['SWAGGER_PATH'] -def run_shell_command(command_line): - try: - command_line_args = shlex.split(command_line) - process = subprocess.Popen(command_line_args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - output, error = process.communicate() - return False, output - except OSError as err: - app.logger.info("OS error: {0}".format(err)) - return True, '' - except ValueError: - app.logger.info("data error.") - return True, '' - except: - app.logger.info("Unexpected error:", sys.exc_info()[0]) - return True, '' - -def decode(audio_file,wav_name,do_word_tStamp): - # Normalize audio file and convert it to wave format - error, output = run_shell_command("sox "+audio_file+" -t wav -b 16 -r 16000 -c 1 "+audio_file+".wav") - if not os.path.exists(audio_file+".wav"): - app.logger.info(output) - return False, 'Error during audio file conversion!!! Supported formats are wav, mp3, aiff, flac, and ogg.' - - - decode_file = audio_file+".wav" - decode_conf = CONFIG_FILES_PATH+"/online.conf" - decode_mdl = AM_PATH+"/"+AM_FILE_PATH+"/final.mdl" - decode_graph = LM_PATH+"/HCLG.fst" - decode_words = LM_PATH+"/words.txt" - decode_words_boundary = LM_PATH+"/word_boundary.int" - - - # Decode the audio file - decode_opt =" --min-active="+DECODER_MINACT - decode_opt+=" --max-active="+DECODER_MAXACT - decode_opt+=" --beam="+DECODER_BEAM - decode_opt+=" --lattice-beam="+DECODER_LATBEAM - decode_opt+=" --acoustic-scale="+DECODER_ACWT - - - if DECODER_SYS == 'dnn3': - error, output = run_shell_command("kaldi-nnet3-latgen-faster --do-endpointing=false --frames-per-chunk=20 --online=false --frame-subsampling-factor="+DECODER_FSF+" --config="+decode_conf+" --minimize=false "+decode_opt+" --word-symbol-table="+decode_words+" "+decode_mdl+" "+decode_graph+" \"ark:echo "+wav_name+" "+wav_name+"|\" \"scp:echo "+wav_name+" "+decode_file+"|\" ark:"+TEMP_FILE_PATH+"/"+wav_name+".lat") - elif DECODER_SYS == 'dnn2' or DECODER_SYS == 'dnn': - error, output = run_shell_command("kaldi-nnet2-latgen-faster --do-endpointing=false --online=false --config="+decode_conf+" "+decode_opt+" --word-symbol-table="+decode_words+" "+decode_mdl+" "+decode_graph+" \"ark:echo "+wav_name+" "+wav_name+"|\" \"scp:echo "+wav_name+" "+decode_file+"|\" ark:"+TEMP_FILE_PATH+"/"+wav_name+".lat") - else: - return False, 'The "decoder" parameter of the acoustic model is not supported!!!' - - if not os.path.exists(TEMP_FILE_PATH+"/"+wav_name+".lat"): - app.logger.info(output) - return False, 'One or multiple parameters of the acoustic model are not correct!!!' 
- - - # Normalize the obtained transcription - hypothesis = re.findall('\n'+wav_name+'.*',output.decode('utf-8')) - trans=re.sub(wav_name,'',hypothesis[0]).strip() - trans=re.sub(r"#nonterm:[^ ]* ", "", trans) - trans=re.sub(r" ", " ", " "+trans+" ") - - - # Get the begin and end time stamp from the decoder output - if do_word_tStamp: - error, output = run_shell_command("kaldi-lattice-1best --acoustic-scale="+DECODER_ACWT+" ark:"+TEMP_FILE_PATH+"/"+wav_name+".lat ark:"+TEMP_FILE_PATH+"/"+wav_name+".1best") - error, output = run_shell_command("kaldi-lattice-align-words "+decode_words_boundary+" "+decode_mdl+" ark:"+TEMP_FILE_PATH+"/"+wav_name+".1best ark:"+TEMP_FILE_PATH+"/"+wav_name+".words") - error, output = run_shell_command("kaldi-nbest-to-ctm ark:"+TEMP_FILE_PATH+"/"+wav_name+".words "+TEMP_FILE_PATH+"/"+wav_name+".ctm") - error, output = run_shell_command("int2sym.pl -f 5 "+decode_words+" "+TEMP_FILE_PATH+"/"+wav_name+".ctm") - if not error and output != "": - words = output.decode('utf-8').split("\n") - trans = "" - data = {} - data["words"] = [] - for word in words: - _word = word.strip().split(' ') - if len(_word) == 5: - meta = {} - word = re.sub("","",_word[4]) - word = re.sub("","",_word[4]) - if word != "": - trans = trans+" "+word - meta["word"] = word - meta["stime"] = float(_word[2]) - meta["etime"] = (float(_word[2]) + float(_word[3])) - meta["score"] = float(_word[1]) - data["words"].append(meta) - data["transcription"] = trans.strip() - return True, data - else: - app.logger.info("error during word time stamp generation") - - return True, trans.strip() +def swaggerUI(): + ### swagger specific ### + swagger_yml = yaml.load(open(SWAGGER_PATH, 'r'), Loader=yaml.Loader) + swaggerui = get_swaggerui_blueprint( + SWAGGER_URL, # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' + SWAGGER_PATH, + config={ # Swagger UI config overrides + 'app_name': "STT API Documentation", + 'spec': swagger_yml + } + ) + app.register_blueprint(swaggerui, url_prefix=SWAGGER_URL) + ### end swagger specific ### @app.route('/transcribe', methods=['POST']) def transcribe(): - global busy - busy=1 - fileid = str(uuid.uuid4()) - if request.headers.get('accept').lower() == 'application/json': - metadata = True - elif request.headers.get('accept').lower() == 'text/plain': - metadata = False - else: - return 'Not accepted header', 400 - - - if 'file' in request.files.keys(): - file = request.files['file'] - file_ext = file.filename.rsplit('.', 1)[-1].lower() - file_type = file.content_type.rsplit('/', 1)[0] - if file_type == "audio": - filename = TEMP_FILE_PATH+'/'+fileid+'.'+file_ext - file.save(filename) - b, out = decode(filename,fileid,metadata) - if not b: - busy=0 - return 'Error while file transcription: '+out, 400 - else: - busy=0 - return 'Error while file transcription: The uploaded file format is not supported!!! 
Supported formats are wav, mp3, aiff, flac, and ogg.', 400 - else: - busy=0 - return 'No audio file was uploaded', 400 - - # Delete temporary files - for file in os.listdir(TEMP_FILE_PATH): - os.remove(TEMP_FILE_PATH+"/"+file) - busy=0 - json_string = json.dumps(out, ensure_ascii=False) - return Response(json_string,content_type="application/json; charset=utf-8" ), 200 + return 'Test', 200 @app.route('/healthcheck', methods=['GET']) def check(): @@ -164,60 +61,8 @@ def page_not_found(error): return 'The requested URL was not found', 404 if __name__ == '__main__': - if 'SERVICE_PORT' in os.environ: - SERVICE_PORT = os.environ['SERVICE_PORT'] - if 'SWAGGER_PATH' not in os.environ: - exit("You have to provide a 'SWAGGER_PATH'") - - SWAGGER_PATH = os.environ['SWAGGER_PATH'] + #start SwaggerUI + swaggerUI() - #Decoder parameters applied for both GMM and DNN based ASR systems - decoder_settings = configparser.ConfigParser() - decoder_settings.read(AM_PATH+'/decode.cfg') - DECODER_SYS = decoder_settings.get('decoder_params', 'decoder') - AM_FILE_PATH = decoder_settings.get('decoder_params', 'ampath') - DECODER_MINACT = decoder_settings.get('decoder_params', 'min_active') - DECODER_MAXACT = decoder_settings.get('decoder_params', 'max_active') - DECODER_BEAM = decoder_settings.get('decoder_params', 'beam') - DECODER_LATBEAM = decoder_settings.get('decoder_params', 'lattice_beam') - DECODER_ACWT = decoder_settings.get('decoder_params', 'acwt') - DECODER_FSF = decoder_settings.get('decoder_params', 'frame_subsampling_factor') - - #Prepare config files - AM_FINAL_PATH=AM_PATH+"/"+AM_FILE_PATH - with open(AM_FINAL_PATH+"/conf/online.conf") as f: - values = f.readlines() - with open(CONFIG_FILES_PATH+"/online.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--ivector-extraction-config="+CONFIG_FILES_PATH+"/ivector_extractor.conf\n") - f.write("--mfcc-config="+AM_FINAL_PATH+"/conf/mfcc.conf") - - with open(AM_FINAL_PATH+"/conf/ivector_extractor.conf") as f: - values = f.readlines() - with open(CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--splice-config="+AM_FINAL_PATH+"/conf/splice.conf\n") - f.write("--cmvn-config="+AM_FINAL_PATH+"/conf/online_cmvn.conf\n") - f.write("--lda-matrix="+AM_FINAL_PATH+"/ivector_extractor/final.mat\n") - f.write("--global-cmvn-stats="+AM_FINAL_PATH+"/ivector_extractor/global_cmvn.stats\n") - f.write("--diag-ubm="+AM_FINAL_PATH+"/ivector_extractor/final.dubm\n") - f.write("--ivector-extractor="+AM_FINAL_PATH+"/ivector_extractor/final.ie") - - ### swagger specific ### - swagger_yml = yaml.load(open(SWAGGER_PATH, 'r'), Loader=yaml.Loader) - swaggerui = get_swaggerui_blueprint( - SWAGGER_URL, # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' - SWAGGER_PATH, - config={ # Swagger UI config overrides - 'app_name': "STT API Documentation", - 'spec': swagger_yml - } - ) - app.register_blueprint(swaggerui, url_prefix=SWAGGER_URL) - ### end swagger specific ### - #Run server - app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=1) - + app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=1) \ No newline at end of file From c8a50267e91e2f9629b52e8cb5c05b3dad96ac38 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 2 Jun 2020 20:24:42 +0200 Subject: [PATCH 005/172] add ASR tools and init ASR engine --- run.py | 5 +++ tools.py | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 119 insertions(+) create mode 100644 tools.py 
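PATCH 005 introduces the pykaldi-based engine in the new tools.py below. For orientation, here is a minimal, self-contained sketch of the standard pykaldi online-nnet3 decoding recipe that this code builds on; the model paths, tuning values and the wav.scp list are placeholders, not the service's actual configuration.

# Sketch only: decode every utterance listed in a Kaldi wav.scp with pykaldi.
# Paths ("online.conf", "final.mdl", "HCLG.fst", "words.txt", "wav.scp") and the
# tuning values are assumptions; the real service reads them from decode.cfg.
from kaldi.asr import NnetLatticeFasterOnlineRecognizer
from kaldi.decoder import LatticeFasterDecoderOptions
from kaldi.nnet3 import NnetSimpleLoopedComputationOptions
from kaldi.online2 import (OnlineEndpointConfig,
                           OnlineNnetFeaturePipelineConfig,
                           OnlineNnetFeaturePipelineInfo,
                           OnlineNnetFeaturePipeline)
from kaldi.util.options import ParseOptions
from kaldi.util.table import SequentialWaveReader

# Feature pipeline configuration (MFCC + i-vectors) read from an online.conf file
feat_opts = OnlineNnetFeaturePipelineConfig()
endpoint_opts = OnlineEndpointConfig()
po = ParseOptions("")
feat_opts.register(po)
endpoint_opts.register(po)
po.read_config_file("online.conf")
feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts)

# Decoder options; in the service these values come from the model's decode.cfg
decoder_opts = LatticeFasterDecoderOptions()
decoder_opts.beam = 13.0
decoder_opts.max_active = 7000
decodable_opts = NnetSimpleLoopedComputationOptions()
decodable_opts.acoustic_scale = 1.0
decodable_opts.frame_subsampling_factor = 3
decodable_opts.frames_per_chunk = 150

asr = NnetLatticeFasterOnlineRecognizer.from_files(
    "final.mdl", "HCLG.fst", "words.txt",
    decoder_opts=decoder_opts,
    decodable_opts=decodable_opts,
    endpoint_opts=endpoint_opts)

for key, wav in SequentialWaveReader("scp:wav.scp"):
    feat_pipeline = OnlineNnetFeaturePipeline(feat_info)
    asr.set_input_pipeline(feat_pipeline)
    feat_pipeline.accept_waveform(wav.samp_freq, wav.data()[0])
    feat_pipeline.input_finished()
    out = asr.decode()
    print(key, out["text"])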
diff --git a/run.py b/run.py index 88b6649..98cb255 100755 --- a/run.py +++ b/run.py @@ -4,6 +4,7 @@ from flask import Flask, request, abort, Response, json from flask_swagger_ui import get_swaggerui_blueprint from flask_cors import CORS +from tools import ASR import yaml, os app = Flask(__name__) @@ -15,6 +16,7 @@ CONFIG_FILES_PATH = '/opt/config' SERVICE_PORT=80 SWAGGER_URL='/api-doc' +asr = ASR(AM_PATH,LM_PATH, CONFIG_FILES_PATH) if not os.path.isdir(TEMP_FILE_PATH): @@ -64,5 +66,8 @@ def page_not_found(error): #start SwaggerUI swaggerUI() + #Run ASR engine + asr.run() + #Run server app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=1) \ No newline at end of file diff --git a/tools.py b/tools.py new file mode 100644 index 0000000..7a93dd4 --- /dev/null +++ b/tools.py @@ -0,0 +1,114 @@ +## Kaldi ASR decoder +from kaldi.asr import NnetLatticeFasterOnlineRecognizer +from kaldi.decoder import LatticeFasterDecoderOptions +from kaldi.nnet3 import NnetSimpleLoopedComputationOptions +from kaldi.online2 import (OnlineEndpointConfig, + OnlineIvectorExtractorAdaptationState, + OnlineNnetFeaturePipelineConfig, + OnlineNnetFeaturePipelineInfo, + OnlineNnetFeaturePipeline, + OnlineSilenceWeighting) +from kaldi.util.options import ParseOptions +from kaldi.util.table import SequentialWaveReader +from kaldi.matrix import Matrix, Vector +############## + +## word to CTM +from kaldi.lat.align import (WordBoundaryInfoNewOpts, + WordBoundaryInfo, + word_align_lattice) +from kaldi.lat.functions import compact_lattice_to_word_alignment +from kaldi.asr import NnetRecognizer +import kaldi.fstext as _fst +############## + +## other packages +import configparser, sys +############## + + + +class ASR: + def __init__(self, AM_PATH, LM_PATH, CONFIG_FILES_PATH): + self.AM_PATH = AM_PATH + self.LM_PATH = LM_PATH + self.CONFIG_FILES_PATH = CONFIG_FILES_PATH + + def run(self): + def loadConfig(self): + #get decoder parameters from "decode.cfg" + decoder_settings = configparser.ConfigParser() + decoder_settings.read(self.AM_PATH+'/decode.cfg') + self.DECODER_SYS = decoder_settings.get('decoder_params', 'decoder') + self.AM_FILE_PATH = decoder_settings.get('decoder_params', 'ampath') + self.DECODER_MINACT = decoder_settings.get('decoder_params', 'min_active') + self.DECODER_MAXACT = decoder_settings.get('decoder_params', 'max_active') + self.DECODER_BEAM = decoder_settings.get('decoder_params', 'beam') + self.DECODER_LATBEAM = decoder_settings.get('decoder_params', 'lattice_beam') + self.DECODER_ACWT = decoder_settings.get('decoder_params', 'acwt') + self.DECODER_FSF = decoder_settings.get('decoder_params', 'frame_subsampling_factor') + + #Prepare "online.conf" + self.AM_PATH=self.AM_PATH+"/"+self.AM_FILE_PATH + with open(self.AM_PATH+"/conf/online.conf") as f: + values = f.readlines() + with open(self.CONFIG_FILES_PATH+"/online.conf", 'w') as f: + for i in values: + f.write(i) + f.write("--ivector-extraction-config="+self.CONFIG_FILES_PATH+"/ivector_extractor.conf\n") + f.write("--mfcc-config="+self.AM_PATH+"/conf/mfcc.conf") + + #Prepare "ivector_extractor.conf" + with open(self.AM_PATH+"/conf/ivector_extractor.conf") as f: + values = f.readlines() + with open(self.CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: + for i in values: + f.write(i) + f.write("--splice-config="+self.AM_PATH+"/conf/splice.conf\n") + f.write("--cmvn-config="+self.AM_PATH+"/conf/online_cmvn.conf\n") + f.write("--lda-matrix="+self.AM_PATH+"/ivector_extractor/final.mat\n") + 
f.write("--global-cmvn-stats="+self.AM_PATH+"/ivector_extractor/global_cmvn.stats\n") + f.write("--diag-ubm="+self.AM_PATH+"/ivector_extractor/final.dubm\n") + f.write("--ivector-extractor="+self.AM_PATH+"/ivector_extractor/final.ie") + + # Define online feature pipeline + print("Load decoder config") + loadConfig(self) + feat_opts = OnlineNnetFeaturePipelineConfig() + endpoint_opts = OnlineEndpointConfig() + po = ParseOptions("") + feat_opts.register(po) + endpoint_opts.register(po) + po.read_config_file(self.AM_PATH+"/conf/online.conf") + feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) + + # Construct recognizer + print("Load Decoder model") + decoder_opts = LatticeFasterDecoderOptions() + decoder_opts.beam = float(self.DECODER_BEAM) + decoder_opts.max_active = int(self.DECODER_MAXACT) + decoder_opts.min_active = int(self.DECODER_MINACT) + decoder_opts.lattice_beam = float(self.DECODER_LATBEAM) + decodable_opts = NnetSimpleLoopedComputationOptions() + decodable_opts.acoustic_scale = float(self.DECODER_ACWT) + decodable_opts.frame_subsampling_factor = int(self.DECODER_FSF) + decodable_opts.frames_per_chunk = 150 + asr = NnetLatticeFasterOnlineRecognizer.from_files( + self.AM_PATH+"/final.mdl", self.LM_PATH+"/HCLG.fst", self.LM_PATH+"/words.txt", + decoder_opts=decoder_opts, + decodable_opts=decodable_opts, + endpoint_opts=endpoint_opts) + + + +class Audio: + def __init__(self): + print("start Audio") + + def readAudio(stream,type): + print(type) + + + def transformAudio(): + print("###") + \ No newline at end of file From cde6c9273b3c5da1c0a14256cba3958ea8fa92e6 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 3 Jun 2020 14:43:25 +0200 Subject: [PATCH 006/172] add audio management and add exception --- Dockerfile | 1 + docker-compose.yml | 2 +- run.py | 53 +++++++++++++++++++++++++++++++++++++++------- tools.py | 51 ++++++++++++++++++++++++++++++++------------ 4 files changed, 85 insertions(+), 22 deletions(-) diff --git a/Dockerfile b/Dockerfile index dc10fa9..a7d1a8e 100644 --- a/Dockerfile +++ b/Dockerfile @@ -113,6 +113,7 @@ RUN git clone --depth 1 https://github.com/pykaldi/pykaldi.git /pykaldi \ # Install main service packages RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml +RUN apt-get install -y libsox-fmt-all && pip3 install git+https://github.com/rabitt/pysox.git # Set environment variables ENV PATH /pykaldi/tools/kaldi/egs/wsj/s5/utils/:$PATH diff --git a/docker-compose.yml b/docker-compose.yml index 8c8e9aa..cdacdeb 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -5,7 +5,7 @@ services: stt-worker: container_name: stt-standalone-worker build: . 
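The Dockerfile change above installs libsox-fmt-all together with the pysox package so that uploaded audio can be converted in memory. A small sketch of the conversion that the new Audio.transform method (in the tools.py hunk further below) performs; the input file name and the 16 kHz target rate are assumptions:

import sox

def to_mono_16k(input_path, rate=16000):
    """Convert any sox-readable file to mono, 16-bit samples at the given rate."""
    tfm = sox.Transformer()
    tfm.set_output_format(rate=rate, bits=16, channels=1)
    # build_array runs sox and returns the converted samples as a numpy array
    return tfm.build_array(input_filepath=input_path)

samples = to_mono_16k("example.wav")  # hypothetical test file
print(samples.shape)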
- image: lintoai/linto-platform-stt-standalone-worker + image: lintoai/linto-platform-stt-standalone-worker:pykaldi volumes: - ${AM_PATH}:/opt/models/AM - ${LM_PATH}:/opt/models/LM diff --git a/run.py b/run.py index 98cb255..3ae4e97 100755 --- a/run.py +++ b/run.py @@ -4,8 +4,8 @@ from flask import Flask, request, abort, Response, json from flask_swagger_ui import get_swaggerui_blueprint from flask_cors import CORS -from tools import ASR -import yaml, os +from tools import ASR, Audio, Logger +import yaml, os, sox app = Flask(__name__) @@ -14,10 +14,13 @@ LM_PATH = '/opt/models/LM' TEMP_FILE_PATH = '/opt/tmp' CONFIG_FILES_PATH = '/opt/config' -SERVICE_PORT=80 -SWAGGER_URL='/api-doc' +SAVE_AUDIO = False +SERVICE_PORT = 80 +SWAGGER_URL = '/api-doc' asr = ASR(AM_PATH,LM_PATH, CONFIG_FILES_PATH) - +audio = Audio() +logASR = Logger(app,"ASR") +logAUDIO = Logger(app,"AUDIO") if not os.path.isdir(TEMP_FILE_PATH): os.mkdir(TEMP_FILE_PATH) @@ -47,27 +50,61 @@ def swaggerUI(): @app.route('/transcribe', methods=['POST']) def transcribe(): - return 'Test', 200 + try: + #get response content type + if request.headers.get('accept').lower() == 'application/json': + metadata = True + elif request.headers.get('accept').lower() == 'text/plain': + metadata = False + else: + raise ValueError('Not accepted header') + + #get input file + if 'file' in request.files.keys(): + file = request.files['file'] + file_path = TEMP_FILE_PATH+file.filename.lower() + file_type = file.content_type.rsplit('/', 1)[0] + file.save(file_path) + audio.transform(file_path) + else: + raise ValueError('No audio file was uploaded') + + return 'Test', 200 + except ValueError as error: + return str(error), 400 + except Exception as e: + app.logger.error(e) + return 'Server Error', 500 @app.route('/healthcheck', methods=['GET']) def check(): - return '1', 200 + return '', 200 # Rejected request handlers @app.errorhandler(405) -def page_not_found(error): +def method_not_allowed(error): return 'The method is not allowed for the requested URL', 405 @app.errorhandler(404) def page_not_found(error): return 'The requested URL was not found', 404 +@app.errorhandler(500) +def server_error(error): + app.logger.error(error) + return 'Server Error', 500 + if __name__ == '__main__': #start SwaggerUI swaggerUI() #Run ASR engine asr.run() + #Set Audio Sample Rate + audio.set_sample_rate(asr.get_sample_rate()) + #Set log messages + asr.set_logger(logASR) + audio.set_logger(logAUDIO) #Run server app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=1) \ No newline at end of file diff --git a/tools.py b/tools.py index 7a93dd4..a14475f 100644 --- a/tools.py +++ b/tools.py @@ -23,10 +23,20 @@ ############## ## other packages -import configparser, sys +import configparser, sys, sox ############## +class Logger: + def __init__(self,app,module): + self.app = app + self.module = module + + def error(self,msg): + self.app.logger.error("["+self.module+"] "+str(msg)) + + def info(self,msg): + self.app.logger.info("["+self.module+"] "+str(msg)) class ASR: def __init__(self, AM_PATH, LM_PATH, CONFIG_FILES_PATH): @@ -70,7 +80,7 @@ def loadConfig(self): f.write("--global-cmvn-stats="+self.AM_PATH+"/ivector_extractor/global_cmvn.stats\n") f.write("--diag-ubm="+self.AM_PATH+"/ivector_extractor/final.dubm\n") f.write("--ivector-extractor="+self.AM_PATH+"/ivector_extractor/final.ie") - + # Define online feature pipeline print("Load decoder config") loadConfig(self) @@ -80,8 +90,8 @@ def loadConfig(self): feat_opts.register(po) 
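The reworked /transcribe route above takes a multipart upload in a form field named file and picks the response format from the Accept header. A possible client-side call using the requests library; the host, port mapping and file name are assumptions (the container itself exposes port 80 by default):

import requests

# Hypothetical endpoint and audio file
with open("example.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:80/transcribe",
        headers={"Accept": "text/plain"},   # "application/json" asks for word metadata instead
        files={"file": f},                  # the multipart field must be named "file"
    )
print(resp.status_code, resp.text)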
endpoint_opts.register(po) po.read_config_file(self.AM_PATH+"/conf/online.conf") - feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) - + self.feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) + # Construct recognizer print("Load Decoder model") decoder_opts = LatticeFasterDecoderOptions() @@ -93,22 +103,37 @@ def loadConfig(self): decodable_opts.acoustic_scale = float(self.DECODER_ACWT) decodable_opts.frame_subsampling_factor = int(self.DECODER_FSF) decodable_opts.frames_per_chunk = 150 - asr = NnetLatticeFasterOnlineRecognizer.from_files( + self.asr = NnetLatticeFasterOnlineRecognizer.from_files( self.AM_PATH+"/final.mdl", self.LM_PATH+"/HCLG.fst", self.LM_PATH+"/words.txt", decoder_opts=decoder_opts, decodable_opts=decodable_opts, endpoint_opts=endpoint_opts) + def get_sample_rate(self): + return self.feat_info.mfcc_opts.frame_opts.samp_freq - + def set_logger(self,log): + self.log = log + class Audio: def __init__(self): - print("start Audio") - - def readAudio(stream,type): - print(type) + self.bit = 16 + self.channels = 1 + self.sr = -1 + + def set_sample_rate(self,sr): + self.sr = sr + def set_logger(self,log): + self.log = log - def transformAudio(): - print("###") - \ No newline at end of file + def transform(self,file_name): + try: + tfm = sox.Transformer() + tfm.set_output_format(rate=self.sr, + bits=self.bit, + channels=self.channels) + self.data = tfm.build_array(input_filepath=file_name) + except Exception as e: + self.log.error(e) + raise ValueError("The uploaded file format is not supported!!!") \ No newline at end of file From e09d76ad9d60529aff20987aa3b681fa61d16028 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 4 Jun 2020 13:01:46 +0200 Subject: [PATCH 007/172] add decode funtion to perform speech-to-text --- Dockerfile | 2 +- RELEASE.md | 4 ++-- run.py | 27 +++++++++++++++------------ tools.py | 30 ++++++++++++++++++++++++------ 4 files changed, 42 insertions(+), 21 deletions(-) diff --git a/Dockerfile b/Dockerfile index a7d1a8e..d881220 100644 --- a/Dockerfile +++ b/Dockerfile @@ -112,7 +112,7 @@ RUN git clone --depth 1 https://github.com/pykaldi/pykaldi.git /pykaldi \ rm -r openfst-*/* && mv openfsttmp/* openfst-*/ && rm -r openfsttmp # Install main service packages -RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml +RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml logger RUN apt-get install -y libsox-fmt-all && pip3 install git+https://github.com/rabitt/pysox.git # Set environment variables diff --git a/RELEASE.md b/RELEASE.md index 1d02d63..30d145e 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,2 +1,2 @@ -# 2.0.0 -- New ASR engine based on pykaldi package \ No newline at end of file +# 2.1.0 +- A fonctional offline ASR engine \ No newline at end of file diff --git a/run.py b/run.py index 3ae4e97..090fc22 100755 --- a/run.py +++ b/run.py @@ -5,9 +5,11 @@ from flask_swagger_ui import get_swaggerui_blueprint from flask_cors import CORS from tools import ASR, Audio, Logger -import yaml, os, sox +import yaml, os, sox, logging app = Flask(__name__) +app.logger.setLevel(logging.DEBUG) + # Main parameters AM_PATH = '/opt/models/AM' @@ -19,8 +21,8 @@ SWAGGER_URL = '/api-doc' asr = ASR(AM_PATH,LM_PATH, CONFIG_FILES_PATH) audio = Audio() -logASR = Logger(app,"ASR") -logAUDIO = Logger(app,"AUDIO") +asr.set_logger(Logger(app,"ASR")) +audio.set_logger(Logger(app,"AUDIO")) if not os.path.isdir(TEMP_FILE_PATH): os.mkdir(TEMP_FILE_PATH) @@ -48,6 +50,13 @@ def swaggerUI(): app.register_blueprint(swaggerui, 
url_prefix=SWAGGER_URL) ### end swagger specific ### +def getAudio(file): + file_path = TEMP_FILE_PATH+file.filename.lower() + file.save(file_path) + audio.transform(file_path) + if not SAVE_AUDIO: + os.remove(file_path) + @app.route('/transcribe', methods=['POST']) def transcribe(): try: @@ -62,14 +71,12 @@ def transcribe(): #get input file if 'file' in request.files.keys(): file = request.files['file'] - file_path = TEMP_FILE_PATH+file.filename.lower() - file_type = file.content_type.rsplit('/', 1)[0] - file.save(file_path) - audio.transform(file_path) + getAudio(file) + text = asr.decoder(audio) else: raise ValueError('No audio file was uploaded') - return 'Test', 200 + return text, 200 except ValueError as error: return str(error), 400 except Exception as e: @@ -97,14 +104,10 @@ def server_error(error): if __name__ == '__main__': #start SwaggerUI swaggerUI() - #Run ASR engine asr.run() #Set Audio Sample Rate audio.set_sample_rate(asr.get_sample_rate()) - #Set log messages - asr.set_logger(logASR) - audio.set_logger(logAUDIO) #Run server app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=1) \ No newline at end of file diff --git a/tools.py b/tools.py index a14475f..d2b04d8 100644 --- a/tools.py +++ b/tools.py @@ -23,12 +23,12 @@ ############## ## other packages -import configparser, sys, sox +import configparser, sys, sox, time ############## class Logger: - def __init__(self,app,module): + def __init__(self,app,module=""): self.app = app self.module = module @@ -82,18 +82,18 @@ def loadConfig(self): f.write("--ivector-extractor="+self.AM_PATH+"/ivector_extractor/final.ie") # Define online feature pipeline - print("Load decoder config") + self.log.info("Load decoder config") loadConfig(self) feat_opts = OnlineNnetFeaturePipelineConfig() endpoint_opts = OnlineEndpointConfig() po = ParseOptions("") feat_opts.register(po) endpoint_opts.register(po) - po.read_config_file(self.AM_PATH+"/conf/online.conf") + po.read_config_file(self.CONFIG_FILES_PATH+"/online.conf") self.feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) # Construct recognizer - print("Load Decoder model") + self.log.info("Load Decoder model") decoder_opts = LatticeFasterDecoderOptions() decoder_opts.beam = float(self.DECODER_BEAM) decoder_opts.max_active = int(self.DECODER_MAXACT) @@ -114,6 +114,21 @@ def get_sample_rate(self): def set_logger(self,log): self.log = log + + def decoder(self,audio): + try: + start_time = time.time() + feat_pipeline = OnlineNnetFeaturePipeline(self.feat_info) + self.asr.set_input_pipeline(feat_pipeline) + feat_pipeline.accept_waveform(audio.sr, audio.getDataKaldyVector()) + feat_pipeline.input_finished() + self.decode = self.asr.decode() + self.log.info("Decode time in seconds: %s" % (time.time() - start_time)) + except Exception as e: + self.log.error(e) + raise ValueError("Decoder failed to transcribe the input audio!!!") + else: + return self.decode["text"] class Audio: def __init__(self): @@ -136,4 +151,7 @@ def transform(self,file_name): self.data = tfm.build_array(input_filepath=file_name) except Exception as e: self.log.error(e) - raise ValueError("The uploaded file format is not supported!!!") \ No newline at end of file + raise ValueError("The uploaded file format is not supported!!!") + + def getDataKaldyVector(self): + return Vector(self.data) \ No newline at end of file From 8ec54b65289466ead42042310a0620631690353b Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 4 Jun 2020 13:15:57 +0200 Subject: [PATCH 008/172] replace class Logger 
by the logger package and configure it to show the lowest loggin level --- run.py | 12 ++++++------ tools.py | 21 ++++----------------- 2 files changed, 10 insertions(+), 23 deletions(-) diff --git a/run.py b/run.py index 090fc22..fb9b476 100755 --- a/run.py +++ b/run.py @@ -4,12 +4,14 @@ from flask import Flask, request, abort, Response, json from flask_swagger_ui import get_swaggerui_blueprint from flask_cors import CORS -from tools import ASR, Audio, Logger +from tools import ASR, Audio import yaml, os, sox, logging -app = Flask(__name__) -app.logger.setLevel(logging.DEBUG) +app = Flask("__stt-standelone-worker__") +# Set logger config +logger = logging.getLogger(__name__) +logging.basicConfig(level=logging.DEBUG) # Main parameters AM_PATH = '/opt/models/AM' @@ -21,8 +23,6 @@ SWAGGER_URL = '/api-doc' asr = ASR(AM_PATH,LM_PATH, CONFIG_FILES_PATH) audio = Audio() -asr.set_logger(Logger(app,"ASR")) -audio.set_logger(Logger(app,"AUDIO")) if not os.path.isdir(TEMP_FILE_PATH): os.mkdir(TEMP_FILE_PATH) @@ -67,7 +67,7 @@ def transcribe(): metadata = False else: raise ValueError('Not accepted header') - + #get input file if 'file' in request.files.keys(): file = request.files['file'] diff --git a/tools.py b/tools.py index d2b04d8..27e0db4 100644 --- a/tools.py +++ b/tools.py @@ -23,23 +23,12 @@ ############## ## other packages -import configparser, sys, sox, time +import configparser, sys, sox, time, logging ############## - -class Logger: - def __init__(self,app,module=""): - self.app = app - self.module = module - - def error(self,msg): - self.app.logger.error("["+self.module+"] "+str(msg)) - - def info(self,msg): - self.app.logger.info("["+self.module+"] "+str(msg)) - class ASR: def __init__(self, AM_PATH, LM_PATH, CONFIG_FILES_PATH): + self.log = logging.getLogger('__stt-standelone-worker__.ASR') self.AM_PATH = AM_PATH self.LM_PATH = LM_PATH self.CONFIG_FILES_PATH = CONFIG_FILES_PATH @@ -112,9 +101,6 @@ def loadConfig(self): def get_sample_rate(self): return self.feat_info.mfcc_opts.frame_opts.samp_freq - def set_logger(self,log): - self.log = log - def decoder(self,audio): try: start_time = time.time() @@ -129,9 +115,10 @@ def decoder(self,audio): raise ValueError("Decoder failed to transcribe the input audio!!!") else: return self.decode["text"] - + class Audio: def __init__(self): + self.log = logging.getLogger('__stt-standelone-worker__.Audio') self.bit = 16 self.channels = 1 self.sr = -1 From 11354add40408cca9ff1d24980290ec865c61df7 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Sun, 7 Jun 2020 19:52:21 +0200 Subject: [PATCH 009/172] add word timestamp and SttStandelone class to manage the hyperparam --- run.py | 11 +++++--- tools.py | 82 +++++++++++++++++++++++++++++++++++++++++++++----------- 2 files changed, 75 insertions(+), 18 deletions(-) diff --git a/run.py b/run.py index fb9b476..5bacc83 100755 --- a/run.py +++ b/run.py @@ -4,7 +4,7 @@ from flask import Flask, request, abort, Response, json from flask_swagger_ui import get_swaggerui_blueprint from flask_cors import CORS -from tools import ASR, Audio +from tools import ASR, Audio, SttStandelone import yaml, os, sox, logging app = Flask("__stt-standelone-worker__") @@ -32,6 +32,8 @@ # Environment parameters if 'SERVICE_PORT' in os.environ: SERVICE_PORT = os.environ['SERVICE_PORT'] +if 'SAVE_AUDIO' in os.environ: + SAVE_AUDIO = os.environ['SAVE_AUDIO'] if 'SWAGGER_PATH' not in os.environ: exit("You have to provide a 'SWAGGER_PATH'") SWAGGER_PATH = os.environ['SWAGGER_PATH'] @@ -61,6 +63,7 @@ def getAudio(file): def transcribe(): 
try: #get response content type + metadata = False if request.headers.get('accept').lower() == 'application/json': metadata = True elif request.headers.get('accept').lower() == 'text/plain': @@ -68,15 +71,17 @@ def transcribe(): else: raise ValueError('Not accepted header') + stt = SttStandelone(asr,metadata) + #get input file if 'file' in request.files.keys(): file = request.files['file'] getAudio(file) - text = asr.decoder(audio) + output = stt.run(audio,asr) else: raise ValueError('No audio file was uploaded') - return text, 200 + return output, 200 except ValueError as error: return str(error), 400 except Exception as e: diff --git a/tools.py b/tools.py index 27e0db4..818a5f6 100644 --- a/tools.py +++ b/tools.py @@ -17,7 +17,8 @@ from kaldi.lat.align import (WordBoundaryInfoNewOpts, WordBoundaryInfo, word_align_lattice) -from kaldi.lat.functions import compact_lattice_to_word_alignment +from kaldi.lat.functions import (compact_lattice_to_word_alignment, + compact_lattice_shortest_path) from kaldi.asr import NnetRecognizer import kaldi.fstext as _fst ############## @@ -40,12 +41,12 @@ def loadConfig(self): decoder_settings.read(self.AM_PATH+'/decode.cfg') self.DECODER_SYS = decoder_settings.get('decoder_params', 'decoder') self.AM_FILE_PATH = decoder_settings.get('decoder_params', 'ampath') - self.DECODER_MINACT = decoder_settings.get('decoder_params', 'min_active') - self.DECODER_MAXACT = decoder_settings.get('decoder_params', 'max_active') - self.DECODER_BEAM = decoder_settings.get('decoder_params', 'beam') - self.DECODER_LATBEAM = decoder_settings.get('decoder_params', 'lattice_beam') - self.DECODER_ACWT = decoder_settings.get('decoder_params', 'acwt') - self.DECODER_FSF = decoder_settings.get('decoder_params', 'frame_subsampling_factor') + self.DECODER_MINACT = int(decoder_settings.get('decoder_params', 'min_active')) + self.DECODER_MAXACT = int(decoder_settings.get('decoder_params', 'max_active')) + self.DECODER_BEAM = float(decoder_settings.get('decoder_params', 'beam')) + self.DECODER_LATBEAM = float(decoder_settings.get('decoder_params', 'lattice_beam')) + self.DECODER_ACWT = float(decoder_settings.get('decoder_params', 'acwt')) + self.DECODER_FSF = int(decoder_settings.get('decoder_params', 'frame_subsampling_factor')) #Prepare "online.conf" self.AM_PATH=self.AM_PATH+"/"+self.AM_FILE_PATH @@ -81,16 +82,22 @@ def loadConfig(self): po.read_config_file(self.CONFIG_FILES_PATH+"/online.conf") self.feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) + # Set metadata parameters + self.samp_freq = self.feat_info.mfcc_opts.frame_opts.samp_freq + self.frame_shift = self.feat_info.mfcc_opts.frame_opts.frame_shift_ms / 1000 + self.symbols = _fst.SymbolTable.read_text(self.LM_PATH+"/words.txt") + self.info = WordBoundaryInfo.from_file(WordBoundaryInfoNewOpts(),self.LM_PATH+"/word_boundary.int") + # Construct recognizer self.log.info("Load Decoder model") decoder_opts = LatticeFasterDecoderOptions() - decoder_opts.beam = float(self.DECODER_BEAM) - decoder_opts.max_active = int(self.DECODER_MAXACT) - decoder_opts.min_active = int(self.DECODER_MINACT) - decoder_opts.lattice_beam = float(self.DECODER_LATBEAM) + decoder_opts.beam = self.DECODER_BEAM + decoder_opts.max_active = self.DECODER_MAXACT + decoder_opts.min_active = self.DECODER_MINACT + decoder_opts.lattice_beam = self.DECODER_LATBEAM decodable_opts = NnetSimpleLoopedComputationOptions() - decodable_opts.acoustic_scale = float(self.DECODER_ACWT) - decodable_opts.frame_subsampling_factor = int(self.DECODER_FSF) + 
decodable_opts.acoustic_scale = self.DECODER_ACWT + decodable_opts.frame_subsampling_factor = self.DECODER_FSF decodable_opts.frames_per_chunk = 150 self.asr = NnetLatticeFasterOnlineRecognizer.from_files( self.AM_PATH+"/final.mdl", self.LM_PATH+"/HCLG.fst", self.LM_PATH+"/words.txt", @@ -109,13 +116,58 @@ def decoder(self,audio): feat_pipeline.accept_waveform(audio.sr, audio.getDataKaldyVector()) feat_pipeline.input_finished() self.decode = self.asr.decode() + self.text = self.decode['text'] self.log.info("Decode time in seconds: %s" % (time.time() - start_time)) except Exception as e: self.log.error(e) raise ValueError("Decoder failed to transcribe the input audio!!!") - else: - return self.decode["text"] + + def wordTimestamp(self): + try: + _fst.utils.scale_compact_lattice([[1.0, 0],[0, float(self.DECODER_ACWT)]], self.decode['lattice']) + bestPath = compact_lattice_shortest_path(self.decode['lattice']) + _fst.utils.scale_compact_lattice([[1.0, 0],[0, 1.0/float(self.DECODER_ACWT)]], bestPath) + bestLattice = word_align_lattice(bestPath, self.asr.transition_model, self.info, 0) + alignment = compact_lattice_to_word_alignment(bestLattice[1]) + words = _fst.indices_to_symbols(self.symbols, alignment[0]) + self.timestamps={ + "words":words, + "start":alignment[1], + "dur":alignment[2] + } + except Exception as e: + self.log.error(e) + raise ValueError("Decoder failed to create the word timestamps!!!") + +class SttStandelone: + def __init__(self,asr,metadata=False): + self.log = logging.getLogger('__stt-standelone-worker__.SttStandelone') + self.metadata = metadata + def run(self,audio,asr): + asr.decoder(audio) + if self.metadata: + asr.wordTimestamp() + self.formatOutput(asr.timestamps,asr.frame_shift, asr.DECODER_FSF) + return self.output + else: + return asr.text + + def formatOutput(self,timestamps,frame_shift, frame_subsampling): + self.output = {} + text = "" + self.output["words"] = [] + for i in range(len(timestamps["words"])): + if timestamps["words"][i] != "": + meta = {} + meta["word"] = timestamps["words"][i] + meta["begin"] = round(timestamps["start"][i] * frame_shift * frame_subsampling,2) + meta["end"] = round((timestamps["start"][i]+timestamps["dur"][i]) * frame_shift * frame_subsampling, 2) + self.output["words"].append(meta) + text += " "+meta["word"] + self.output["transcription"] = text + + class Audio: def __init__(self): self.log = logging.getLogger('__stt-standelone-worker__.Audio') From 4f11a64374376a3c223d88fff6c82f6c2d9da7c2 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Mon, 8 Jun 2020 16:29:19 +0200 Subject: [PATCH 010/172] adapt the asr engine with the multiprocess mode and set the number of processes as external parameter --- .envdefault | 3 +- run.py | 10 +++++- tools.py | 90 ++++++++++++++++++++++++++++++++--------------------- 3 files changed, 65 insertions(+), 38 deletions(-) diff --git a/.envdefault b/.envdefault index 2246e24..80acea5 100644 --- a/.envdefault +++ b/.envdefault @@ -1,3 +1,4 @@ AM_PATH=/path/to/acoustic/models/dir LM_PATH=/path/to/language/models/dir -SWAGGER_PATH=/path/to/swagger/file \ No newline at end of file +SWAGGER_PATH=/path/to/swagger/file +NBR_PROCESSES=1 \ No newline at end of file diff --git a/run.py b/run.py index 5bacc83..9810ff3 100755 --- a/run.py +++ b/run.py @@ -6,6 +6,7 @@ from flask_cors import CORS from tools import ASR, Audio, SttStandelone import yaml, os, sox, logging +from time import gmtime, strftime app = Flask("__stt-standelone-worker__") @@ -18,6 +19,7 @@ LM_PATH = '/opt/models/LM' TEMP_FILE_PATH = '/opt/tmp' 
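In the formatOutput helper above, word boundaries come back from the lattice aligner as frame indices, so they are converted to seconds by multiplying with the feature frame shift and the frame-subsampling factor. A short worked example with illustrative values only (10 ms frame shift, subsampling factor 3):

frame_shift = 0.01        # seconds, i.e. frame_shift_ms / 1000
frame_subsampling = 3     # decodable_opts.frame_subsampling_factor

start_frame, dur_frames = 120, 40  # hypothetical alignment output for one word
begin = round(start_frame * frame_shift * frame_subsampling, 2)               # 3.6 s
end = round((start_frame + dur_frames) * frame_shift * frame_subsampling, 2)  # 4.8 s
print(begin, end)  # 3.6 4.8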
CONFIG_FILES_PATH = '/opt/config' +NBR_PROCESSES = 1 SAVE_AUDIO = False SERVICE_PORT = 80 SWAGGER_URL = '/api-doc' @@ -34,6 +36,11 @@ SERVICE_PORT = os.environ['SERVICE_PORT'] if 'SAVE_AUDIO' in os.environ: SAVE_AUDIO = os.environ['SAVE_AUDIO'] +if 'NBR_PROCESSES' in os.environ: + if int(os.environ['NBR_PROCESSES']) > 0: + NBR_PROCESSES = int(os.environ['NBR_PROCESSES']) + else: + exit("You must to provide a positif number of processes 'NBR_PROCESSES'") if 'SWAGGER_PATH' not in os.environ: exit("You have to provide a 'SWAGGER_PATH'") SWAGGER_PATH = os.environ['SWAGGER_PATH'] @@ -62,6 +69,7 @@ def getAudio(file): @app.route('/transcribe', methods=['POST']) def transcribe(): try: + app.logger.info('[%s] New user entry on /transcribe' % (strftime("%d/%b/%d %H:%M:%S", gmtime()))) #get response content type metadata = False if request.headers.get('accept').lower() == 'application/json': @@ -115,4 +123,4 @@ def server_error(error): audio.set_sample_rate(asr.get_sample_rate()) #Run server - app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=1) \ No newline at end of file + app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=NBR_PROCESSES) \ No newline at end of file diff --git a/tools.py b/tools.py index 818a5f6..48063f5 100644 --- a/tools.py +++ b/tools.py @@ -1,6 +1,7 @@ ## Kaldi ASR decoder from kaldi.asr import NnetLatticeFasterOnlineRecognizer -from kaldi.decoder import LatticeFasterDecoderOptions +from kaldi.decoder import (LatticeFasterDecoderOptions, + LatticeFasterOnlineDecoder) from kaldi.nnet3 import NnetSimpleLoopedComputationOptions from kaldi.online2 import (OnlineEndpointConfig, OnlineIvectorExtractorAdaptationState, @@ -75,18 +76,16 @@ def loadConfig(self): self.log.info("Load decoder config") loadConfig(self) feat_opts = OnlineNnetFeaturePipelineConfig() - endpoint_opts = OnlineEndpointConfig() + self.endpoint_opts = OnlineEndpointConfig() po = ParseOptions("") feat_opts.register(po) - endpoint_opts.register(po) + self.endpoint_opts.register(po) po.read_config_file(self.CONFIG_FILES_PATH+"/online.conf") self.feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) # Set metadata parameters self.samp_freq = self.feat_info.mfcc_opts.frame_opts.samp_freq self.frame_shift = self.feat_info.mfcc_opts.frame_opts.frame_shift_ms / 1000 - self.symbols = _fst.SymbolTable.read_text(self.LM_PATH+"/words.txt") - self.info = WordBoundaryInfo.from_file(WordBoundaryInfoNewOpts(),self.LM_PATH+"/word_boundary.int") # Construct recognizer self.log.info("Load Decoder model") @@ -95,64 +94,83 @@ def loadConfig(self): decoder_opts.max_active = self.DECODER_MAXACT decoder_opts.min_active = self.DECODER_MINACT decoder_opts.lattice_beam = self.DECODER_LATBEAM - decodable_opts = NnetSimpleLoopedComputationOptions() - decodable_opts.acoustic_scale = self.DECODER_ACWT - decodable_opts.frame_subsampling_factor = self.DECODER_FSF - decodable_opts.frames_per_chunk = 150 - self.asr = NnetLatticeFasterOnlineRecognizer.from_files( - self.AM_PATH+"/final.mdl", self.LM_PATH+"/HCLG.fst", self.LM_PATH+"/words.txt", - decoder_opts=decoder_opts, - decodable_opts=decodable_opts, - endpoint_opts=endpoint_opts) + self.decodable_opts = NnetSimpleLoopedComputationOptions() + self.decodable_opts.acoustic_scale = self.DECODER_ACWT + self.decodable_opts.frame_subsampling_factor = self.DECODER_FSF + self.decodable_opts.frames_per_chunk = 150 + + # Load Acoustic and graph models and other files + self.transition_model, self.acoustic_model = 
NnetRecognizer.read_model(self.AM_PATH+"/final.mdl") + graph = _fst.read_fst_kaldi(self.LM_PATH+"/HCLG.fst") + self.decoder_graph = LatticeFasterOnlineDecoder(graph, decoder_opts) + self.symbols = _fst.SymbolTable.read_text(self.LM_PATH+"/words.txt") + self.info = WordBoundaryInfo.from_file(WordBoundaryInfoNewOpts(),self.LM_PATH+"/word_boundary.int") + del graph, decoder_opts def get_sample_rate(self): - return self.feat_info.mfcc_opts.frame_opts.samp_freq + return self.samp_freq - def decoder(self,audio): + def get_frames(self,feat_pipeline): + rows = feat_pipeline.num_frames_ready() + cols = feat_pipeline.dim() + frames = Matrix(rows,cols) + feat_pipeline.get_frames(range(rows),frames) + return frames[:,:self.feat_info.mfcc_opts.num_ceps], frames[:,self.feat_info.mfcc_opts.num_ceps:] + # return feats + ivectors + + def compute_feat(self,audio): + feat_pipeline = OnlineNnetFeaturePipeline(self.feat_info) + feat_pipeline.accept_waveform(audio.sr, audio.getDataKaldyVector()) + feat_pipeline.input_finished() + return feat_pipeline + + def decoder(self,feats): try: start_time = time.time() - feat_pipeline = OnlineNnetFeaturePipeline(self.feat_info) - self.asr.set_input_pipeline(feat_pipeline) - feat_pipeline.accept_waveform(audio.sr, audio.getDataKaldyVector()) - feat_pipeline.input_finished() - self.decode = self.asr.decode() - self.text = self.decode['text'] + asr = NnetLatticeFasterOnlineRecognizer(self.transition_model, self.acoustic_model, self.decoder_graph, + self.symbols, decodable_opts= self.decodable_opts, endpoint_opts=self.endpoint_opts) + asr.set_input_pipeline(feats) + decode = asr.decode() self.log.info("Decode time in seconds: %s" % (time.time() - start_time)) except Exception as e: self.log.error(e) raise ValueError("Decoder failed to transcribe the input audio!!!") + else: + return decode - def wordTimestamp(self): + def wordTimestamp(self,decode): try: - _fst.utils.scale_compact_lattice([[1.0, 0],[0, float(self.DECODER_ACWT)]], self.decode['lattice']) - bestPath = compact_lattice_shortest_path(self.decode['lattice']) + _fst.utils.scale_compact_lattice([[1.0, 0],[0, float(self.DECODER_ACWT)]], decode['lattice']) + bestPath = compact_lattice_shortest_path(decode['lattice']) _fst.utils.scale_compact_lattice([[1.0, 0],[0, 1.0/float(self.DECODER_ACWT)]], bestPath) - bestLattice = word_align_lattice(bestPath, self.asr.transition_model, self.info, 0) + bestLattice = word_align_lattice(bestPath, self.transition_model, self.info, 0) alignment = compact_lattice_to_word_alignment(bestLattice[1]) words = _fst.indices_to_symbols(self.symbols, alignment[0]) - self.timestamps={ + except Exception as e: + self.log.error(e) + raise ValueError("Decoder failed to create the word timestamps!!!") + else: + return { "words":words, "start":alignment[1], "dur":alignment[2] } - except Exception as e: - self.log.error(e) - raise ValueError("Decoder failed to create the word timestamps!!!") - + class SttStandelone: def __init__(self,asr,metadata=False): self.log = logging.getLogger('__stt-standelone-worker__.SttStandelone') self.metadata = metadata def run(self,audio,asr): - asr.decoder(audio) + feats = asr.compute_feat(audio) + decode = asr.decoder(feats) if self.metadata: - asr.wordTimestamp() - self.formatOutput(asr.timestamps,asr.frame_shift, asr.DECODER_FSF) + timestamps = asr.wordTimestamp(decode) + self.formatOutput(timestamps,asr.frame_shift, asr.decodable_opts.frame_subsampling_factor) return self.output else: - return asr.text - + return decode["text"] + def 
formatOutput(self,timestamps,frame_shift, frame_subsampling): self.output = {} text = "" From b4265164111cab48f156f24854cab3655951c5ef Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 10 Jun 2020 15:58:59 +0200 Subject: [PATCH 011/172] add Speaker diarization feature --- Dockerfile | 10 +- RELEASE.md | 6 +- document/swagger.yml | 13 ++ run.py | 35 +++- tools.py | 416 ++++++++++++++++++++++++++++++++++++++++--- 5 files changed, 442 insertions(+), 38 deletions(-) diff --git a/Dockerfile b/Dockerfile index d881220..6608943 100644 --- a/Dockerfile +++ b/Dockerfile @@ -111,14 +111,18 @@ RUN git clone --depth 1 https://github.com/pykaldi/pykaldi.git /pykaldi \ && cd /pykaldi/tools/kaldi/tools && mkdir openfsttmp && mv openfst-*/lib openfst-*/include openfst-*/bin openfsttmp && rm openfsttmp/lib/*.a openfsttmp/lib/*.la && \ rm -r openfst-*/* && mv openfsttmp/* openfst-*/ && rm -r openfsttmp +# Define the main folder +WORKDIR /usr/src/speech-to-text + # Install main service packages -RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml logger -RUN apt-get install -y libsox-fmt-all && pip3 install git+https://github.com/rabitt/pysox.git +RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml logger librosa webrtcvad scipy sklearn +RUN apt-get install -y libsox-fmt-all && pip3 install git+https://github.com/rabitt/pysox.git \ + && git clone https://github.com/irebai/pyBK.git /pykaldi/tools/pyBK \ + && cp /pykaldi/tools/pyBK/diarizationFunctions.py . # Set environment variables ENV PATH /pykaldi/tools/kaldi/egs/wsj/s5/utils/:$PATH -WORKDIR /usr/src/speech-to-text COPY tools.py . COPY run.py . diff --git a/RELEASE.md b/RELEASE.md index 30d145e..818a2d4 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,2 +1,4 @@ -# 2.1.0 -- A fonctional offline ASR engine \ No newline at end of file +# 2.2.0 +- Speaker diarization feature: pyBK package +- Mulithreading feature: Speech decoding and Speaker diarization processes +- Optional parameter: real number of speaker in the audio \ No newline at end of file diff --git a/document/swagger.yml b/document/swagger.yml index ebc5c08..57e818f 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -27,6 +27,19 @@ paths: description: "Audio File (wav, mp3, aiff, flac, ogg)" required: true type: "file" + - name: "nbrSpeaker" + in: "formData" + description: "Number of speakers in the audio" + required: false + type: "number" + default: 1 + - name: "speaker" + in: "formData" + description: "Do speaker diarization" + required: false + type: "string" + enum: [ "Yes", "No" ] + default: "No" responses: 200: description: Successfully transcribe the audio diff --git a/run.py b/run.py index 9810ff3..8ae7b76 100755 --- a/run.py +++ b/run.py @@ -4,7 +4,7 @@ from flask import Flask, request, abort, Response, json from flask_swagger_ui import get_swaggerui_blueprint from flask_cors import CORS -from tools import ASR, Audio, SttStandelone +from tools import ASR, Audio, SpeakerDiarization, SttStandelone import yaml, os, sox, logging from time import gmtime, strftime @@ -24,7 +24,6 @@ SERVICE_PORT = 80 SWAGGER_URL = '/api-doc' asr = ASR(AM_PATH,LM_PATH, CONFIG_FILES_PATH) -audio = Audio() if not os.path.isdir(TEMP_FILE_PATH): os.mkdir(TEMP_FILE_PATH) @@ -59,7 +58,7 @@ def swaggerUI(): app.register_blueprint(swaggerui, url_prefix=SWAGGER_URL) ### end swagger specific ### -def getAudio(file): +def getAudio(file,audio): file_path = TEMP_FILE_PATH+file.filename.lower() file.save(file_path) audio.transform(file_path) @@ -70,6 +69,10 @@ def 
getAudio(file): def transcribe(): try: app.logger.info('[%s] New user entry on /transcribe' % (strftime("%d/%b/%d %H:%M:%S", gmtime()))) + # create main objects + spk = SpeakerDiarization() + audio = Audio(asr.get_sample_rate()) + #get response content type metadata = False if request.headers.get('accept').lower() == 'application/json': @@ -79,13 +82,29 @@ def transcribe(): else: raise ValueError('Not accepted header') - stt = SttStandelone(asr,metadata) + #get speaker parameter + spkDiarization = False + if request.form.get('speaker') != None and (request.form.get('speaker').lower() == 'yes' or request.form.get('speaker').lower() == 'no'): + spkDiarization = True if request.form.get('speaker').lower() == 'yes' else False + #get number of speakers parameter + try: + if request.form.get('nbrSpeaker') != None and spkDiarization and int(request.form.get('nbrSpeaker')) > 0: + spk.set_maxNrSpeakers(int(request.form.get('nbrSpeaker'))) + elif request.form.get('nbrSpeaker') != None and spkDiarization: + raise ValueError('Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') + except Exception as e: + app.logger.error(e) + raise ValueError('Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') + else: + raise ValueError('Not accepted "speaker" field value (yes|no)') + stt = SttStandelone(metadata,spkDiarization) + #get input file if 'file' in request.files.keys(): file = request.files['file'] - getAudio(file) - output = stt.run(audio,asr) + getAudio(file,audio) + output = stt.run(audio,asr,spk) else: raise ValueError('No audio file was uploaded') @@ -119,8 +138,6 @@ def server_error(error): swaggerUI() #Run ASR engine asr.run() - #Set Audio Sample Rate - audio.set_sample_rate(asr.get_sample_rate()) #Run server - app.run(host='0.0.0.0', port=SERVICE_PORT, debug=True, threaded=False, processes=NBR_PROCESSES) \ No newline at end of file + app.run(host='0.0.0.0', port=SERVICE_PORT, debug=False, threaded=False, processes=NBR_PROCESSES) \ No newline at end of file diff --git a/tools.py b/tools.py index 48063f5..05c6c10 100644 --- a/tools.py +++ b/tools.py @@ -24,8 +24,19 @@ import kaldi.fstext as _fst ############## +## Speaker Diarization +from diarizationFunctions import * +import numpy as np +import librosa +from kaldi.ivector import (compute_vad_energy, + VadEnergyOptions) +from kaldi.feat.mfcc import Mfcc, MfccOptions +from kaldi.util.options import ParseOptions +############## + ## other packages import configparser, sys, sox, time, logging +from concurrent.futures import ThreadPoolExecutor ############## class ASR: @@ -127,6 +138,7 @@ def compute_feat(self,audio): def decoder(self,feats): try: start_time = time.time() + self.log.info("Start Decoding: %s" % (start_time)) asr = NnetLatticeFasterOnlineRecognizer(self.transition_model, self.acoustic_model, self.decoder_graph, self.symbols, decodable_opts= self.decodable_opts, endpoint_opts=self.endpoint_opts) asr.set_input_pipeline(feats) @@ -156,44 +168,399 @@ def wordTimestamp(self,decode): "dur":alignment[2] } +class SpeakerDiarization: + def __init__(self): + self.log = logging.getLogger('__stt-standelone-worker__.SPKDiarization') + + ### MFCC FEATURES PARAMETERS + self.frame_length_s=0.025 + self.frame_shift_s=0.01 + self.num_bins=40 + self.num_ceps=40 + self.low_freq=40 + self.high_freq=-200 + ##### + + ### VAD PARAMETERS + self.vad_ops = VadEnergyOptions() + self.vad_ops.vad_energy_mean_scale = 0.9 + self.vad_ops.vad_energy_threshold = 5 + #vad_ops.vad_frames_context = 2 + #vad_ops.vad_proportion_threshold = 0.12 + ##### + + ### Segment + 
self.seg_length = 100 # Window size in frames + self.seg_increment = 100 # Window increment after and before window in frames + self.seg_rate = 100 # Window shifting in frames + ##### + + ### KBM + self.minimumNumberOfInitialGaussians = 1024 # Minimum number of Gaussians in the initial pool + self.maximumKBMWindowRate = 50 # Maximum window rate for Gaussian computation + self.windowLength = 200 # Window length for computing Gaussians + self.kbmSize = 320 # Number of final Gaussian components in the KBM + self.useRelativeKBMsize = 1 # If set to 1, the KBM size is set as a proportion, given by "relKBMsize", of the pool size + self.relKBMsize = 0.3 # Relative KBM size if "useRelativeKBMsize = 1" (value between 0 and 1). + ###### + + ### BINARY_KEY + self.topGaussiansPerFrame = 5 # Number of top selected components per frame + self.bitsPerSegmentFactor = 0.2 # Percentage of bits set to 1 in the binary keys + ###### + + ### CLUSTERING + self.N_init = 16 # Number of initial clusters + self.linkage = 0 # Set to one to perform linkage clustering instead of clustering/reassignment + self.linkageCriterion = 'average' # Linkage criterion used if linkage==1 ('average', 'single', 'complete') + self.metric = 'cosine' # Similarity metric: 'cosine' for cumulative vectors, and 'jaccard' for binary keys + ###### + + ### CLUSTERING_SELECTION + self.metric_clusteringSelection = 'cosine' # Distance metric used in the selection of the output clustering solution ('jaccard','cosine') + self.bestClusteringCriterion = 'elbow' # Method employed for number of clusters selection. Can be either 'elbow' for an elbow criterion based on within-class sum of squares (WCSS) or 'spectral' for spectral clustering + self.sigma = 1 # Spectral clustering parameters, employed if bestClusteringCriterion == spectral + self.percentile = 40 + self.maxNrSpeakers = 16 # If known, max nr of speakers in a sesssion in the database. 
This is to limit the effect of changes in very small meaningless eigenvalues values generating huge eigengaps + ###### + + ### RESEGMENTATION + self.resegmentation = 1 # Set to 1 to perform re-segmentation + self.modelSize = 6 # Number of GMM components + self.nbIter = 10 # Number of expectation-maximization (EM) iterations + self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames + ###### + + def set_maxNrSpeakers(self,nbr): + self.maxNrSpeakers = nbr + + def compute_feat_Librosa(self,audio): + try: + self.log.info("Start feature extraction: %s" % (time.time())) + if audio.sr == 16000: + self.low_freq=20 + self.high_freq=7600 + data = audio.data/32768 + frame_length_inSample = self.frame_length_s * audio.sr + hop = int(self.frame_shift_s * audio.sr) + NFFT = int(2**np.ceil(np.log2(frame_length_inSample))) + mfccNumpy = librosa.feature.mfcc(y=data, + sr=audio.sr, + dct_type=2, + n_mfcc=self.num_ceps, + n_mels=self.num_bins, + n_fft=NFFT, + hop_length=hop, + fmin=self.low_freq, + fmax=self.high_freq).T + except Exception as e: + self.log.error(e) + raise ValueError("Speaker diarization failed when extracting features!!!") + else: + return mfccNumpy + + def compute_feat_KALDI(self,audio): + try: + self.log.info("Start feature extraction: %s" % (time.time())) + po = ParseOptions("") + mfcc_opts = MfccOptions() + mfcc_opts.use_energy = False + mfcc_opts.frame_opts.samp_freq = audio.sr + mfcc_opts.frame_opts.frame_length_ms = self.frame_length_s*1000 + mfcc_opts.frame_opts.frame_shift_ms = self.frame_shift_s*1000 + mfcc_opts.frame_opts.allow_downsample = False + mfcc_opts.mel_opts.num_bins = self.num_bins + mfcc_opts.mel_opts.low_freq = self.low_freq + mfcc_opts.mel_opts.high_freq = self.high_freq + mfcc_opts.num_ceps = self.num_ceps + mfcc_opts.register(po) + + # Create MFCC object and obtain sample frequency + mfccObj = Mfcc(mfcc_opts) + mfccKaldi = mfccObj.compute_features(audio.getDataKaldyVector(), audio.sr, 1.0) + except Exception as e: + self.log.error(e) + raise ValueError("Speaker diarization failed while extracting features!!!") + else: + return mfccKaldi + + def computeVAD_WEBRTC(self, audio): + try: + self.log.info("Start VAD: %s" % (time.time())) + data = audio.data/32768 + hop = 30 + va_framed = py_webrtcvad(data, fs=audio.sr, fs_vad=audio.sr, hoplength=hop, vad_mode=0) + segments = get_py_webrtcvad_segments(va_framed,audio.sr) + maskSAD = np.zeros([1,nFeatures]) + for seg in segments: + start=int(np.round(seg[0]/frame_shift_s)) + end=int(np.round(seg[1]/frame_shift_s)) + maskSAD[0][start:end]=1 + except Exception as e: + self.log.error(e) + raise ValueError("Speaker diarization failed while voice activity detection!!!") + else: + return maskSAD + + def computeVAD_KALDI(self, audio, feats=None): + try: + self.log.info("Start VAD: %s" % (time.time())) + vadStream = compute_vad_energy(self.vad_ops,feats) + vad = Vector(vadStream) + VAD = vad.numpy() + + ### segmentation + occurence=[] + value=[] + occurence.append(1) + value.append(VAD[0]) + + # compute the speech and non-speech frames + for i in range(1,len(VAD)): + if value[-1] == VAD[i]: + occurence[-1]+=1 + else: + occurence.append(1) + value.append(VAD[i]) + + # filter the speech and non-speech segments that are below 30 frames + i = 0 + while(i < len(occurence)): + if i != 0 and (occurence[i] < 30 or value[i-1] == value[i]): + occurence[i-1] += occurence[i] + del value[i] + del occurence[i] + else: + i+=1 + + # split if and only if the silence is above 50 frames + i = 0 + while(i < len(occurence)): + if 
i != 0 and ((occurence[i] < 30 and value[i] == 0.0) or value[i-1] == value[i]): + occurence[i-1] += occurence[i] + del value[i] + del occurence[i] + else: + i+=1 + + # compute VAD mask + maskSAD = np.zeros(len(VAD)) + start=0 + for i in range(len(occurence)): + if value[i] == 1.0: + end=start+occurence[i] + maskSAD[start:end] = 1 + start=end + else: + start += occurence[i] + + maskSAD = np.expand_dims(maskSAD, axis=0) + except ValueError as v: + self.log.error(v) + except Exception as e: + self.log.error(e) + raise ValueError("Speaker diarization failed while voice activity detection!!!") + else: + return maskSAD + + def run(self, audio, feats=None): + try: + def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): + numberOfSpeechFeatures = finalSegmentTable[-1,2].astype(int)+1 + solutionVector = np.zeros([1,numberOfSpeechFeatures]) + for i in np.arange(np.size(finalSegmentTable,0)): + solutionVector[0,np.arange(finalSegmentTable[i,1],finalSegmentTable[i,2]+1).astype(int)]=finalClusteringTable[i] + seg = np.empty([0,3]) + solutionDiff = np.diff(solutionVector)[0] + first = 0 + for i in np.arange(0,np.size(solutionDiff,0)): + if solutionDiff[i]: + last = i+1 + seg1 = (first)*frameshift + seg2 = (last-first)*frameshift + seg3 = solutionVector[0,last-1] + if seg.shape[0] != 0 and seg3 == seg[-1][2]: + seg[-1][1] += seg2 + elif seg3 and seg2 > 0.3: # and seg2 > 0.1 + seg = np.vstack((seg,[seg1,seg2,seg3])) + first = i+1 + last = np.size(solutionVector,1) + seg1 = (first-1)*frameshift + seg2 = (last-first+1)*frameshift + seg3 = solutionVector[0,last-1] + if seg3 == seg[-1][2]: + seg[-1][1] += seg2 + elif seg3 and seg2 > 0.3: # and seg2 > 0.1 + seg = np.vstack((seg,[seg1,seg2,seg3])) + seg = np.vstack((seg,[dur,-1,-1])) + seg[0][0]=0.0 + return seg + + + start_time = time.time() + self.log.info("Start Speaker Diarization: %s" % (start_time)) + if self.maxNrSpeakers == 1: + self.log.info("Speaker Diarization time in seconds: %s" % (time.time() - start_time)) + return [[0, audio.dur, 1], + [audio.dur, -1, -1]] + if feats == None: + feats = self.compute_feat_KALDI(audio) + nFeatures = feats.shape[0] + maskSAD = self.computeVAD_KALDI(audio,feats) + maskUEM = np.ones([1,nFeatures]) + + mask = np.logical_and(maskUEM,maskSAD) + mask = mask[0][0:nFeatures] + nSpeechFeatures=np.sum(mask) + speechMapping = np.zeros(nFeatures) + #you need to start the mapping from 1 and end it in the actual number of features independently of the indexing style + #so that we don't lose features on the way + speechMapping[np.nonzero(mask)] = np.arange(1,nSpeechFeatures+1) + data=feats[np.where(mask==1)] + del feats + + segmentTable=getSegmentTable(mask,speechMapping,self.seg_length,self.seg_increment,self.seg_rate) + numberOfSegments=np.size(segmentTable,0) + #create the KBM + #set the window rate in order to obtain "minimumNumberOfInitialGaussians" gaussians + if np.floor((nSpeechFeatures-self.windowLength)/self.minimumNumberOfInitialGaussians) < self.maximumKBMWindowRate: + windowRate = int(np.floor((np.size(data,0)-self.windowLength)/self.minimumNumberOfInitialGaussians)) + else: + windowRate = int(self.maximumKBMWindowRate) + + if windowRate == 0: + raise ValueError('The audio is to short in order to perform the speaker diarization!!!') + + poolSize = np.floor((nSpeechFeatures-self.windowLength)/windowRate) + if self.useRelativeKBMsize: + kbmSize = int(np.floor(poolSize*self.relKBMsize)) + else: + kbmSize = int(self.kbmSize) + + #Training pool of',int(poolSize),'gaussians with a rate 
of',int(windowRate),'frames' + kbm, gmPool = trainKBM(data,self.windowLength,windowRate,kbmSize) + + #'Selected',kbmSize,'gaussians from the pool' + Vg = getVgMatrix(data,gmPool,kbm,self.topGaussiansPerFrame) + + #'Computing binary keys for all segments... ' + segmentBKTable, segmentCVTable = getSegmentBKs(segmentTable, kbmSize, Vg, self.bitsPerSegmentFactor, speechMapping) + + #'Performing initial clustering... ' + initialClustering = np.digitize(np.arange(numberOfSegments),np.arange(0,numberOfSegments,numberOfSegments/self.N_init)) + + + #'Performing agglomerative clustering... ' + if self.linkage: + finalClusteringTable, k = performClusteringLinkage(segmentBKTable, segmentCVTable, self.N_init, self.linkageCriterion, self.metric) + else: + finalClusteringTable, k = performClustering(speechMapping, segmentTable, segmentBKTable, segmentCVTable, Vg, self.bitsPerSegmentFactor, kbmSize, self.N_init, initialClustering, self.metric) + + #'Selecting best clustering...' + if self.bestClusteringCriterion == 'elbow': + bestClusteringID = getBestClustering(self.metric_clusteringSelection, segmentBKTable, segmentCVTable, finalClusteringTable, k, self.maxNrSpeakers) + elif self.bestClusteringCriterion == 'spectral': + bestClusteringID = getSpectralClustering(self.metric_clusteringSelection,finalClusteringTable,self.N_init,segmentBKTable,segmentCVTable,k,self.sigma,self.percentile,self.maxNrSpeakers)+1 + + if self.resegmentation and np.size(np.unique(finalClusteringTable[:,bestClusteringID.astype(int)-1]),0)>1: + finalClusteringTableResegmentation,finalSegmentTable = performResegmentation(data,speechMapping, mask,finalClusteringTable[:,bestClusteringID.astype(int)-1],segmentTable,self.modelSize,self.nbIter,self.smoothWin,nSpeechFeatures) + seg = getSegments(self.frame_shift_s,finalSegmentTable, np.squeeze(finalClusteringTableResegmentation), audio.dur) + else: + seg = getSegmentationFile(self.frame_shift_s,segmentTable, finalClusteringTable[:,bestClusteringID.astype(int)-1]) + self.log.info("Speaker Diarization time in seconds: %s" % (time.time() - start_time)) + except ValueError as v: + self.log.info(v) + return [[0, audio.dur, 1], + [audio.dur, -1, -1]] + except Exception as e: + self.log.error(e) + raise ValueError("Speaker Diarization failed!!!") + else: + return seg + class SttStandelone: - def __init__(self,asr,metadata=False): + def __init__(self,metadata=False,spkDiarization=False): self.log = logging.getLogger('__stt-standelone-worker__.SttStandelone') self.metadata = metadata - - def run(self,audio,asr): + self.spkDiarization = spkDiarization + self.timestamp = True if self.metadata or self.spkDiarization else False + + def run(self,audio,asr,spk): feats = asr.compute_feat(audio) - decode = asr.decoder(feats) - if self.metadata: + mfcc, ivector = asr.get_frames(feats) + if self.spkDiarization: + with ThreadPoolExecutor(max_workers=2) as executor: + thrd1 = executor.submit(asr.decoder, feats) + thrd2 = executor.submit(spk.run, audio, mfcc) + decode = thrd1.result() + spkSeg = thrd2.result() + else: + decode = asr.decoder(feats) + spkSeg = [] + + if self.timestamp: timestamps = asr.wordTimestamp(decode) - self.formatOutput(timestamps,asr.frame_shift, asr.decodable_opts.frame_subsampling_factor) - return self.output + output = self.getOutput(timestamps,asr.frame_shift, asr.decodable_opts.frame_subsampling_factor,spkSeg) + if self.metadata: + return output + else: + return {"text":output["text"]} else: return decode["text"] - def formatOutput(self,timestamps,frame_shift, frame_subsampling): - 
self.output = {} - text = "" - self.output["words"] = [] - for i in range(len(timestamps["words"])): - if timestamps["words"][i] != "": - meta = {} - meta["word"] = timestamps["words"][i] - meta["begin"] = round(timestamps["start"][i] * frame_shift * frame_subsampling,2) - meta["end"] = round((timestamps["start"][i]+timestamps["dur"][i]) * frame_shift * frame_subsampling, 2) - self.output["words"].append(meta) - text += " "+meta["word"] - self.output["transcription"] = text + def getOutput(self,timestamps,frame_shift, frame_subsampling, spkSeg = []): + output = {} + if len(spkSeg) == 0: + text = "" + output["words"] = [] + for i in range(len(timestamps["words"])): + if timestamps["words"][i] != "": + meta = {} + meta["word"] = timestamps["words"][i] + meta["btime"] = round(timestamps["start"][i] * frame_shift * frame_subsampling,2) + meta["etime"] = round((timestamps["start"][i]+timestamps["dur"][i]) * frame_shift * frame_subsampling, 2) + output["words"].append(meta) + text += " "+meta["word"] + output["text"] = text + else: + output["speakers"] = [] + output["text"] = [] + j = 0 + newSpk = 1 + for i in range(len(timestamps["words"])): + if timestamps["words"][i] != "": + if newSpk: + speaker = {} + speaker["speaker_id"] = "spk_"+str(int(spkSeg[j][2])) + speaker["words"] = [] + txtSpk = speaker["speaker_id"]+":" + newSpk = 0 + word = {} + word["word"] = timestamps["words"][i] + word["btime"] = round(timestamps["start"][i] * frame_shift * frame_subsampling,2) + word["etime"] = round((timestamps["start"][i]+timestamps["dur"][i]) * frame_shift * frame_subsampling, 2) + speaker["words"].append(word) + txtSpk += " "+word["word"] + if word["etime"] > spkSeg[j+1][0]: + speaker["btime"] = speaker["words"][0]["btime"] + speaker["etime"] = speaker["words"][-1]["etime"] + output["speakers"].append(speaker) + output["text"].append(txtSpk) + newSpk = 1 + j += 1 + #add the last speaker to the output speakers + speaker["btime"] = speaker["words"][0]["btime"] + speaker["etime"] = speaker["words"][-1]["etime"] + output["speakers"].append(speaker) + output["text"].append(txtSpk) + return output class Audio: - def __init__(self): + def __init__(self,sr): self.log = logging.getLogger('__stt-standelone-worker__.Audio') self.bit = 16 self.channels = 1 - self.sr = -1 - - def set_sample_rate(self,sr): self.sr = sr def set_logger(self,log): @@ -206,6 +573,7 @@ def transform(self,file_name): bits=self.bit, channels=self.channels) self.data = tfm.build_array(input_filepath=file_name) + self.dur = len(self.data) / self.sr except Exception as e: self.log.error(e) raise ValueError("The uploaded file format is not supported!!!") From 0eb3129c529e8934a4bacc4471a9199f89acb264 Mon Sep 17 00:00:00 2001 From: Houpert Date: Fri, 3 Jul 2020 10:01:50 +0200 Subject: [PATCH 012/172] Update README.md --- README.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 47bd234..8e53305 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,17 @@ -# Automatic Speech Recognition - LinSTT +# Linto-Platform-Stt-Standalone-Worker +This service is mandatory in a LinTO platform stack as the main worker for speech to text toolkit. -## LinSTT Generally, Automatic Speech Recognition (ASR) is the task of recognition and translation of spoken language into text. Our ASR system takes advantages from the recent advances in machine learning technologies and in particular deep learning ones (TDNN, LSTM, attentation-based architecture). 
The core of our system consists of two main components: an acoustic model and a decoding graph. A high-performance ASR system relies on an accurate acoustic model as well as a perfect decoding graph. +## Usage +See documentation : [doc.linto.ai](https://doc.linto.ai) + +# Deploy + +With our proposed stack [linto-platform-stack](https://github.com/linto-ai/linto-platform-stack) + +# Develop ## Installation From 761156d314c1699e73014e027ebe6196d6c8d6e2 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 7 Jul 2020 16:29:44 +0200 Subject: [PATCH 013/172] change Swagger service into optional --- run.py | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/run.py b/run.py index 8ae7b76..643a019 100755 --- a/run.py +++ b/run.py @@ -23,6 +23,7 @@ SAVE_AUDIO = False SERVICE_PORT = 80 SWAGGER_URL = '/api-doc' +SWAGGER_PATH = '' asr = ASR(AM_PATH,LM_PATH, CONFIG_FILES_PATH) if not os.path.isdir(TEMP_FILE_PATH): @@ -40,9 +41,8 @@ NBR_PROCESSES = int(os.environ['NBR_PROCESSES']) else: exit("You must to provide a positif number of processes 'NBR_PROCESSES'") -if 'SWAGGER_PATH' not in os.environ: - exit("You have to provide a 'SWAGGER_PATH'") -SWAGGER_PATH = os.environ['SWAGGER_PATH'] +if 'SWAGGER_PATH' in os.environ: + SWAGGER_PATH = os.environ['SWAGGER_PATH'] def swaggerUI(): ### swagger specific ### @@ -135,9 +135,11 @@ def server_error(error): if __name__ == '__main__': #start SwaggerUI - swaggerUI() + if SWAGGER_PATH != '': + swaggerUI() + #Run ASR engine asr.run() #Run server - app.run(host='0.0.0.0', port=SERVICE_PORT, debug=False, threaded=False, processes=NBR_PROCESSES) \ No newline at end of file + app.run(host='0.0.0.0', port=SERVICE_PORT, debug=False, threaded=False, processes=NBR_PROCESSES) From d28e02195a828690fd440b91f308ae536827abae Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 8 Jul 2020 16:44:25 +0200 Subject: [PATCH 014/172] fix some bugs related to ASR model loading. 
add word_boundary file generation --- Jenkinsfile | 19 -------- RELEASE.md | 20 +++++++- docker-compose.yml | 4 +- run.py | 18 +++++--- tools.py | 112 ++++++++++++++++++++++++++++++--------------- 5 files changed, 106 insertions(+), 67 deletions(-) diff --git a/Jenkinsfile b/Jenkinsfile index 530e391..b4bdffc 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -47,24 +47,5 @@ pipeline { } } } - - stage('Docker build for pykaldi (unstable) branch'){ - when{ - branch 'pykaldi' - } - steps { - echo 'Publishing new Feature branch' - script { - image = docker.build(env.DOCKER_HUB_REPO) - VERSION = sh( - returnStdout: true, - script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" - ).trim() - docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - image.push('pykaldi') - } - } - } - } }// end stages } diff --git a/RELEASE.md b/RELEASE.md index 818a2d4..c7827d5 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,4 +1,22 @@ +# 2.2.1 +- Fix minor bugs +- put SWAGGER_PATH parameter as optional +- Generate the word_boundary file if it not exists + # 2.2.0 - Speaker diarization feature: pyBK package - Mulithreading feature: Speech decoding and Speaker diarization processes -- Optional parameter: real number of speaker in the audio \ No newline at end of file +- Optional parameter: real number of speaker in the audio + +# 2.0.0 +- Reimplement LinTO-Platform-stt-standalone-worker using Pykaldi package + +# 1.1.2 +- New features: + - Word timestamp computing + - Response type: plain/text: simple text output and application/json: the transcription and the words timestamp. + - Swagger: integrate swagger in the service using a python package + - Fix minor bugs + +# 1.0.0 +- First build of LinTO-Platform-stt-standalone-worker \ No newline at end of file diff --git a/docker-compose.yml b/docker-compose.yml index cdacdeb..08c14d0 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -5,7 +5,7 @@ services: stt-worker: container_name: stt-standalone-worker build: . 
- image: lintoai/linto-platform-stt-standalone-worker:pykaldi + image: lintoai/linto-platform-stt-standalone-worker:latest volumes: - ${AM_PATH}:/opt/models/AM - ${LM_PATH}:/opt/models/LM @@ -15,4 +15,4 @@ services: published: 8888 env_file: .env environment: - SWAGGER_PATH: /opt/swagger.yml \ No newline at end of file + SWAGGER_PATH: /opt/swagger.yml diff --git a/run.py b/run.py index 643a019..8a0f52d 100755 --- a/run.py +++ b/run.py @@ -134,12 +134,16 @@ def server_error(error): return 'Server Error', 500 if __name__ == '__main__': - #start SwaggerUI - if SWAGGER_PATH != '': - swaggerUI() + try: + #start SwaggerUI + if SWAGGER_PATH != '': + swaggerUI() - #Run ASR engine - asr.run() + #Run ASR engine + asr.run() - #Run server - app.run(host='0.0.0.0', port=SERVICE_PORT, debug=False, threaded=False, processes=NBR_PROCESSES) + #Run server + app.run(host='0.0.0.0', port=SERVICE_PORT, debug=False, threaded=False, processes=NBR_PROCESSES) + except Exception as e: + app.logger.error(e) + exit(e) \ No newline at end of file diff --git a/tools.py b/tools.py index 05c6c10..f8298e6 100644 --- a/tools.py +++ b/tools.py @@ -35,7 +35,7 @@ ############## ## other packages -import configparser, sys, sox, time, logging +import configparser, sys, os, re, sox, time, logging from concurrent.futures import ThreadPoolExecutor ############## @@ -82,41 +82,72 @@ def loadConfig(self): f.write("--global-cmvn-stats="+self.AM_PATH+"/ivector_extractor/global_cmvn.stats\n") f.write("--diag-ubm="+self.AM_PATH+"/ivector_extractor/final.dubm\n") f.write("--ivector-extractor="+self.AM_PATH+"/ivector_extractor/final.ie") - - # Define online feature pipeline - self.log.info("Load decoder config") - loadConfig(self) - feat_opts = OnlineNnetFeaturePipelineConfig() - self.endpoint_opts = OnlineEndpointConfig() - po = ParseOptions("") - feat_opts.register(po) - self.endpoint_opts.register(po) - po.read_config_file(self.CONFIG_FILES_PATH+"/online.conf") - self.feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) - - # Set metadata parameters - self.samp_freq = self.feat_info.mfcc_opts.frame_opts.samp_freq - self.frame_shift = self.feat_info.mfcc_opts.frame_opts.frame_shift_ms / 1000 + + #Prepare "word_boundary.int" if not exist + if not os.path.exists(self.LM_PATH+"/word_boundary.int"): + if os.path.exists(self.AM_PATH+"phones.txt"): + with open(self.AM_PATH+"phones.txt") as f: + phones = f.readlines() + + with open(self.LM_PATH+"/word_boundary.int", "w") as f: + for phone in phones: + phone = phone.strip() + phone = re.sub('^ .*','', phone) + phone = re.sub('^#\d+ .*','', phone) + if phone != '': + id = phone.split(' ')[1] + if '_I ' in phone: + f.write(id+" internal\n") + elif '_B ' in phone: + f.write(id+" begin\n") + elif '_E ' in phone: + f.write(id+" end\n") + elif '_S ' in phone: + f.write(id+" singleton\n") + else: + f.write(id+" nonword\n") - # Construct recognizer - self.log.info("Load Decoder model") - decoder_opts = LatticeFasterDecoderOptions() - decoder_opts.beam = self.DECODER_BEAM - decoder_opts.max_active = self.DECODER_MAXACT - decoder_opts.min_active = self.DECODER_MINACT - decoder_opts.lattice_beam = self.DECODER_LATBEAM - self.decodable_opts = NnetSimpleLoopedComputationOptions() - self.decodable_opts.acoustic_scale = self.DECODER_ACWT - self.decodable_opts.frame_subsampling_factor = self.DECODER_FSF - self.decodable_opts.frames_per_chunk = 150 + else: + raise ValueError('Neither word_boundary.int nor phones.txt exists!!!') - # Load Acoustic and graph models and other files - 
self.transition_model, self.acoustic_model = NnetRecognizer.read_model(self.AM_PATH+"/final.mdl") - graph = _fst.read_fst_kaldi(self.LM_PATH+"/HCLG.fst") - self.decoder_graph = LatticeFasterOnlineDecoder(graph, decoder_opts) - self.symbols = _fst.SymbolTable.read_text(self.LM_PATH+"/words.txt") - self.info = WordBoundaryInfo.from_file(WordBoundaryInfoNewOpts(),self.LM_PATH+"/word_boundary.int") - del graph, decoder_opts + try: + # Define online feature pipeline + self.log.info("Load decoder config") + loadConfig(self) + feat_opts = OnlineNnetFeaturePipelineConfig() + self.endpoint_opts = OnlineEndpointConfig() + po = ParseOptions("") + feat_opts.register(po) + self.endpoint_opts.register(po) + po.read_config_file(self.CONFIG_FILES_PATH+"/online.conf") + self.feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) + + # Set metadata parameters + self.samp_freq = self.feat_info.mfcc_opts.frame_opts.samp_freq + self.frame_shift = self.feat_info.mfcc_opts.frame_opts.frame_shift_ms / 1000 + + # Construct recognizer + self.log.info("Load Decoder model") + decoder_opts = LatticeFasterDecoderOptions() + decoder_opts.beam = self.DECODER_BEAM + decoder_opts.max_active = self.DECODER_MAXACT + decoder_opts.min_active = self.DECODER_MINACT + decoder_opts.lattice_beam = self.DECODER_LATBEAM + self.decodable_opts = NnetSimpleLoopedComputationOptions() + self.decodable_opts.acoustic_scale = self.DECODER_ACWT + self.decodable_opts.frame_subsampling_factor = self.DECODER_FSF + self.decodable_opts.frames_per_chunk = 150 + + # Load Acoustic and graph models and other files + self.transition_model, self.acoustic_model = NnetRecognizer.read_model(self.AM_PATH+"/final.mdl") + graph = _fst.read_fst_kaldi(self.LM_PATH+"/HCLG.fst") + self.decoder_graph = LatticeFasterOnlineDecoder(graph, decoder_opts) + self.symbols = _fst.SymbolTable.read_text(self.LM_PATH+"/words.txt") + self.info = WordBoundaryInfo.from_file(WordBoundaryInfoNewOpts(),self.LM_PATH+"/word_boundary.int") + del graph, decoder_opts + except Exception as e: + self.log.error(e) + raise ValueError("AM and LM loading failed!!! 
(see logs for more details)") def get_sample_rate(self): return self.samp_freq @@ -130,10 +161,15 @@ def get_frames(self,feat_pipeline): # return feats + ivectors def compute_feat(self,audio): - feat_pipeline = OnlineNnetFeaturePipeline(self.feat_info) - feat_pipeline.accept_waveform(audio.sr, audio.getDataKaldyVector()) - feat_pipeline.input_finished() - return feat_pipeline + try: + feat_pipeline = OnlineNnetFeaturePipeline(self.feat_info) + feat_pipeline.accept_waveform(audio.sr, audio.getDataKaldyVector()) + feat_pipeline.input_finished() + except Exception as e: + self.log.error(e) + raise ValueError("Feature extraction failed!!!") + else: + return feat_pipeline def decoder(self,feats): try: From 2bc6a96815ce388f105a89a743d07f53d3b9d0c3 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 8 Jul 2020 16:45:39 +0200 Subject: [PATCH 015/172] fix RELEASE description --- RELEASE.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RELEASE.md b/RELEASE.md index c7827d5..8712413 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,7 +1,7 @@ # 2.2.1 - Fix minor bugs - put SWAGGER_PATH parameter as optional -- Generate the word_boundary file if it not exists +- Generate the word_boundary file if it does not exist # 2.2.0 - Speaker diarization feature: pyBK package From ff80d207be010d011e0d76e69a424fa478276b3a Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 23 Jul 2020 12:24:03 +0200 Subject: [PATCH 016/172] add the generation of the offline image --- Jenkinsfile | 1 + 1 file changed, 1 insertion(+) diff --git a/Jenkinsfile b/Jenkinsfile index b4bdffc..d027c84 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -24,6 +24,7 @@ pipeline { docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { image.push("${VERSION}") image.push('latest') + image.push('offline') } } } From b09040714e8b02c5030681687b9ca6f6d756affc Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 14 Aug 2020 13:05:58 +0200 Subject: [PATCH 017/172] fix minor bugs: set speakerDiarization to False by default, and allow speaker diarization for audio longer than 3 seconds --- run.py | 3 ++- test/bonjour.wav | Bin 53810 -> 38496 bytes tools.py | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/run.py b/run.py index 8a0f52d..ecdbb18 100755 --- a/run.py +++ b/run.py @@ -96,7 +96,8 @@ def transcribe(): app.logger.error(e) raise ValueError('Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') else: - raise ValueError('Not accepted "speaker" field value (yes|no)') + if request.form.get('speaker') != None: + raise ValueError('Not accepted "speaker" field value (yes|no)') stt = SttStandelone(metadata,spkDiarization) diff --git a/test/bonjour.wav b/test/bonjour.wav index d82dff97144aaea6e86a655a7246d0a65343479c..f03944e35c448f2226923356f7208d0234a6419a 100644 GIT binary patch literal 38496 zcmaI-1(@5q)&>eYaSVyWoJ=y4WU$QLWt=iIGq=0U%#6Ft%sgdgp0?9rW)3quzRu2> zzdhf7?u{NxvgloFNtU#hlz~0Eb{&2g0ETrO+GXnOMJg@;02I=#Yez;_0{|7^0+S~! 
zn$VOi6U-q9N-#qH2qF~zZ@(tDCj8%}f~leQf_|v%pZ@<_tr{AF`2_h3)%`Oy82|JA z&z6Gaf9HkpLis`NYB;Qk1xx-}=KsD0YyNI6)K16`Que=dYEpxe4*i0$8kAsKO?pjk zFs#WBrTu^VA>?3=;E!18px90tSw--vS@oOYj^Zq;hzr7Kn#=5F!cMAmLL_s@q)1+N{s{mf3Z+{4ML5aepB(ErvJqXA_WMX zbiuHO&);SLMEnzt3Iy@Me_ai)!CJw#LvjpJ8*D$6T7wfr3gU$jLwtr%g7t$`hv@&4 zuMkc!1^f@sp;Ezm!QXGp5I-R)hERgO|36QN#}H*9Nrtd$I17a#w4hJ>4~HS#5Pdb= z1UU`z@EfrP&rjk7U5LjZeIa=Uv4X9J`1mcqAWje|_(1=`tQkEa-a`EN|CL;g^nzUZ z2_K0Pd}{DQ6#eEfgcWK(giwQ36V^~f4H8QJg8YPd3-J@;BgD~fKZF>hB}8FJHZ^EL zE@&h#!7%iM;-R-7=OKyw#tZdP$oKt+Gat!W&;_{+`XPFP<0&Y~;2xl66o#;VqlVD_ z?7tekAZMYG7?NHHD~RAF5xoB*)u09C9dtn|L*tG5Z}0rc6AXl0kl$ec1!)R$7wm!1 zeDoVFB)#7p1!JMNU{2_J2=!bDEjU_2at!eoLJIO$Bh66X{cqkvJO%j*JvB%nj%!Bl z@3C3acQlft8jK(!^aqC_{%Y_5;J?RW4S&C-i2p0q(05Str3a-F`p!bU)o>LILy`?i zlJ*~&hsH=~RE9=Eu&07Nc?mb+AzY;IA{>O1d>mxVN!UqeBSR}0y2&t@?Euf;}1fvVt-T{414E-+M_4g0wpRr8&q&kosU~ArV5H1nD!8E{Jda z*VmJD8Auw9BsB)IT!>rGzcTTYxWQfp0Xl#HD7jB&0b)P`$N@DF0mK5)Kz*PQPzPuU zGz0zu+5;VdPC!qfD-aL#0Qv$0fg!*!U^Flmm;g)$rU5g_a0VI90A`c%slY^FJefBN z7(|xp2lN4Yk+pgN&4I>1y?4#|fEhy!W^tx1}@0li6DM*)+7dBA*N zDX<>c1snqo1E+v1zN5nHnR@fQ(8Ccir?jUm zq+F+DQRvjx)W506sBftbDj#ePP5}3SSHU~rS1=DO1@)i-G=jxo7Wfl<44wzqfP=ug zAP6Q?_fv;cRn!8?HOfqifRY281V)mw&m;B{tq6Z0Auv9G1Rmf6aF73*zpKB*x5$V2 z&Uo8;Q#_MBI`;;51NRTtau>(-!+FBF+&SGj-nobjUppDD#V*{n%&qYhdtQ4l`!4to z;Cli$iB~`>C58GDJWX2y4S{9IBcvtz0BugcM~`4kW}If^F_c&@Y&13nYl}%R3YN|| z!05rS(QnYl(^>Rq=m^w<96{P5rSLi!fsaA0pbFY)T1%Q0JOOqGIpACBHflSng>s*= zl+uKf4QwKNy?rwkzzDZSSnztjU&H zmUfmfi`1gDjItzHWY)*lfi{8NYiBx@t`6?MJ(s-|zDRs@U@@@>m`LeC6@ZCg585qS z9P|(5fV#s=;2W?J?uML3k`O(Dq4m%)=tgu0x*lDEE=RYL@tJ5Bl#iAn=aG4c68Q|T zf`#xqXaWR6muc}dob0j7;9hVOI0qaH&H|@^Gr+arDsU0F1-uBp0kgsUfBWz#nZpA2 zQ|*+_l-a-$;%4A7zR$nOx5K;L^N)L-YnF4qW1D?}ZKL(RMQ843wwNxMrkRGC{x&6; zBF$&be9H^V4(oc`8v9wtJ?A$U(=*Te)>q^=;D&&UumQ=GMN~g^9mt|xpsAsCP$e`1 z*1;!`_UL0&L%&XM%6Q07V3Vs-QNb3<9PTQv&_*T7H?2>lJ$dx6`S2 zRNM8o@7C9rJ?6fq2FBio+4`$GrS7Blp!Sv)*LK&X>DKG}8hRSzO)JeMmccfc{j_tK zTk0+M-NF|Wa>@a!fHn!*0IxwtpaS|CdUM7Na&+y%9M~=<##+rPWOZV1Wxr#a*a(Nm z@v*De-`S_wL)jwsN7h6Z%sR_#!A!*ZW0{OjjAQf)v^Kg1c?YBLVCXCjrY!(n)cNE{ zTSO@$=do9WnphAxi>LU7{`$T_-i4kG?#r&P&T5CxF16RPHL&)yOg1kx%`;9loX{I| z!*mYqM{TN>q8qI<=x*qj8U`9inpT*fS;B2c?Qvv}toC&Bh2y5cL7*+Q80-z*gg+wZ z(aGd@@|>Z;wqqGsUFI5Q6>|{lB1_Ne#=gQXWb-*qIITF1IMEy?=R127yD6K>e#n}} zLRkk%UY=vkuyYKY-k!b@eTxwAAowOkg{IJw!Cv4OYJKV*ii)xp$R!#Qvjf+0ufL6d zuP@!p^N#YIa_d|w*9hkt$1(c@+b3(bCC&WK^x9}L6zMIx_PW#BwpvWHN!oJ1X)vfe;eIM{0L|aNSbrkIZlnQ@EuAllm9UdYa6H{q<{+~(wR3OGMGPdSG;6FF*5CHpFSI2&hOVRdGGWR7J1#2R5c z8F}<3^u_37goX@+FGCPComLFSgO|yXwTAK$Pyl}u4+HeTSp1b=MSdYx?>O%ZkH|CB zecJWS>2QQO+S!NLR#-1tewhhVE7LmTH$$x9q`sd1hwg-Ko9>pbQa3?w)L$~pB7110 zd4gqw^`?#L=;zFK4fO!N6aI+-G4P7g7yL|%g9jk9(C&0L<1(W=_5rIyeihZsPOL+$ zTvi;r7kf4PJUg5HiT#v)jeVBAoc$L$)-JNfu}sViW;5m+tQJqG&tB z2DgF_Lnmo0k#y|ioL~(F&?uLtIu?iF_dX1XNPrI0wyKrbS9R_Sjf;a)-bZ@{plm=<>)}P zEBXYPiL8Ln!|4zK&4ygGAGA&+w@1kJE021PI)qwC5mHe~f8Yhc02IW+Ky+Xo-WTWM z_52CG;l2#-Vee~CktfwX-;KHL&f`wZne0$G&e-?Z@@xgx9P1CuOY>x*lUdy8wD>!ss_6LIx)_wtl@r}+(jTA(qp3aC$wqMiYF((04zQ+IeJx!UwZ z#VCWm33ZWkK!ARi-jVSEYmQCB)GQxw7JCl+9*fW3Mt)~+SQ7G!jA4Gp(ijb~=gck4 zILu5>qpRrq(RHX1os4#($J6)IchdW#5%lxOWrU6dAS$ecn?gIGtI!r&5xF)`rAAZL zlubYiz#@7F%J8#(y>FTCniuu@+){T-cbJ>$n(u1i0$fs8v166PW>2yztn&23G0 zjXdK};}SzB!*fG>1K)VWSjT+V+}t+XW^i zlg~YHGz|tXP|s4<0c(Jp#PvWWUKiJTcldD6aCe?(sOzrNXj@>7vo=2AiI$ zTV1`oD!F1v#q7#KmAfhnD;8IbuUJ&!u4q)%s`{qZt6yn|vAnSyw2K`pU5PG@=d)+5 zhvVDq?c}@c8|#1QWB3P=Yut9+OXvf6#ADLE4eSf_34noBfeLc0U&Jc|)PRWGYyAXz zQNB|l+9=viS~_$Qo)2>oDl!leqVM1w7)Dw^yWt5?1Ly=a7rH=>l_u~4NC0Jkb3rq; z4@FL;QqNK9f%hp_sGF%%sIMr;C@eq}hz+#xzrb7jui?i7EdsBJZTM1OrQ7W+wU4oB 
zthFp=^ElH1<9?$^-&g0Wo>qOK>S{&H@_8j4ii`5Qi*^)?|I=l;yLHp;d$ZKdZv5)o|fJf-b4ko5sDu zZNdZjP53eV?YtD;dLEm%pX1}KVsB+BSTyDsh9AvBv~UqQo=4Hnga1%P6e+R5f5iLL zxz#zyKHDm_>^03aJk#gvcIj4YbE{@my|1h+rN1Rra!tB|C~57Z7oAG+_{^GXcF6U%& zvN)~T?KpSYZdNU}iPf4_#w=wPuZ2csjSXGJ}V{9DsR6|IEA=(5_9A;Jy31a>D@8kWk~#CU~nBfn`a z!a^S+UE$_XG^nB4fI)#a{)Qg5lWyN{8ERr0+}e88vZ|I95#=*Wu@Y_Z>O#11K|U}4 zY%VL;l9iqDF1>l`i{#Ns>>rO3`QM&@>-k;rt1NwY_L@RtS%gkzd+A+B90%Qq8{5a8 z$`z0!sxEgR=QaBRvngvD(~0%NtPBpemf3*ZLylw(W6$RjJh32I&``8Otdu^MoRS12 zplr76h5TsPM@72`uj-?wZ`6&L`?0jRgt+B#`{QQBt&fpMf7Mi}zJ*tXwUVtD=L&Z4 zv$#t*#q2otH*ytT$kwu_ayF8y@l_U&HGtWNF$I}G@_Qdx5P0a@;W^;$=VUpi+KyQm z<}OC6evEc?^@d8K;!OFA(oH4xOU4#mD-;&Y&hzB7%MxaEOkbD!G4)V(Vsd7sMq}r@ru1-~rQb(w* z5p~0lhFzAn6t@(d(GKT~(Ix_5Q=%B1p5Wy?y(mvk#$QPi+tLw=LIn>mfM+GaFN ztxi6lr1+Vbxc8eb@#2q?FQuIFjJkvRh1w8Y zMs-m?QI=8UlovoD@C#T>siYjGGHLtab!Z7A9y`a{%L(UxzgsX) zv|GGY>X9?Utr4L5r}~vTMU$#tqgo!mC#;Taj^v%Nl4s(~WOc*h>D|zuNIaZMn+Lw2 zQo-q@E}KQ_L$4{DD6J`GU<2_GPw-vwh}}ibkM;!X8uK}$(6Ci^MXRgcRlTliVdcrn z2~~k|PubPddL=uGHy0ky&&x^9x|@Ne|DF0QgrybAgn)fxoQSp=V!`jZK z`j(OQx9)Q9HBwVb2DVUkkn?+AY80ulsX!ypkn$7YQ6^Hhks5j;XZ@h_zy_?43L{ELYhj;<@UKI$U#H(@|5Z9-sol+bh1y zmPpg)54HQmA}%^=rM~bjS9WW2fhk*Y5k`ze;Q*Rs&6_r>Rx6tF#rg!L;eLc2Ehe zGp&f$lZHVt@HC_?x|?3j&|<}`IUJC8hX?Sx3v>d$Xqeb5UMFoR8!4Y3)<+>yrYaPQ zSVd{rn6O3iuCipw2eDEt5OIZ<_zZplm&skkK{(6UFIWl8+N4e(V0OT6GXnHfbSp9f z=0V578Pqwz`GC&9!Z+4i;C|wwyCyqNJ0y0OHP2Gl+{ENE;QDpCKI9xst>jcBmAx+I zmUs&%7QD}Ik=G;lVD`PN`PqS7apCzQM~SoSSk)6D7#lSy6i;h(UOGXRz;nPpdwyjIFAK4UtIn0E(mk&}ED%jNXje3=6sAJ<2F! z9K|j(cd-tzYja<5FYvDLcL`PtU7{8epCq2_*|4x{VLKG>6h9OticG~fvVT{H&69VL zjg#IMQ$$wh_v~^pZ8yp<_3fneIZPOrq1D&I~XZ8CkT?MzIM_IG7Ib}yn<)uAIZt=p= zg)a(E7knrvD9kC&EQ_z)Tm4nL*znDC(OSo0bslp&JTv_jfm{NjL{fW#FKBlm8PX5w zjTWNa=zZwV=r+2JKA%w+dxV|G09GdJDEkU$0rxtW&TGZ5El3d579A0{kQ7K_WKYPo zcB^8RVolgB`9b+RxkYwJ+CZ{Yv{l$#*ig8XzluAK-HMgL6f%WaEBYg(IkFa6g;c|T z!B1%qN$XoFHHOlZXonvlx#W6pc;>kCU2NAl*GuOiXNF^g{i)S!?qOp)j`aQh}sMQOqsRsc5dv)eSOjvfQ<|ay@gu z^q%n<@RGn|Ac``cdXqYww7DHd>Y?}01kxH(%IJwzVM{P2)`YYR%q4ZzMa*BUhHRW| z;WXk&1RCLIQGLm3X{eo#4PazVKV`pe}J3D zW{`V{w~Sr%L~s4_q#cGXA^GSM#%Amc z^AQVYEny{LCPpeF9YdHrW+by6)5)64E#)J^N+BZNCYdFBF0Yas<*9OE*gs+4!#Q=Y!Qe!A|y)~G$HZLa&Nou+eXduzk0 z&sM&zSWWJWKa}1qsa0~WX)m_%%db@72@vKQ{jkD)D?zmz- z>Au#1tK_&oO(~%;z&iO8Ud_>RhI7u6yXI-!t-MQo zK;YpY5lj$qBqOA?q*@6ssgP#MKFgG{juM7=swh(QP&h#V^6PU~v%9m8v(_^2GX|3S ztBllUUm=&_B3eUQGZ3WGD0-qzppU=I_sv`6ZRv~go%7!D-t&(3nmsREjhuCC11-;t zRffL$j=BcALpruWXWVTrFt@g>HeWSI8#@>ZwF+%oRqv`DRVmfUT0nnWKS+O8KhDs~ z_(<1NZ_zf?{m`8?FpL9CCFUpAEPJXm&SUbFdf$4v{@(sA_=NzS++&QRvS|u<4l z)17o3xtlz|7|qC~bI5P|0AnQ9pBc@r$7#m(bDQ&fk^B23UNrw1?>n!YzlU$&oB7Ft z`oegjj$E}~@&Dq#=GEnXAy?jP(sI~}@sZSYXQA)W5>gZFj}$>&Xew|4wG;3@(1Dmi z&Rk8&qna7`j=*Sqm*3<)?iP3&y1v-8mTsnlrrPFi?)}FR}%Tvn;tIvGM6lR)b zTw&H5r;|EsjwQibY@KCZV3t|lno~{vOgv+_v4gQ%?=ch@(~L$V!_?g|)KrJaMu&?+b@@GRI9-3kc+Ik-A-E|5-)^-smy`$UA#KLVF{ zH#k>1takD=z*+1X>uh3g<&-+Oc9ngO<+wfBe#>>hIl^|Vd2?$kvz4mFojK)V3$1YO`IFq?J=ZiBu<+oHwvOl%$|!&0zf zW*%F?o=<9b)7Vb_IzC147oW*_Eg)nj# z&w=8pujtRAGO!;ghX&F=Qhy>hq2|=9loP%LPYn13Z-eJk1_f68bNmbaX9911*Bra- zi<~Q6Oj})lcOS=-=wIR<;u`B;<>+p$bR-&oxav8ESk&&P_P)-Jc9tX6Wwn=(e6!8{ z?Fo9b&TlO+#@jAi9@zQzrb_yu!QRmq&~s zjDeAq29(v5-hdPMOnDjD20RAM&~vCBy$!3BHV?kVAmAp9Yb*}RK*Y>l+!h=Qdo6n{ zI+L@F*%^C+Jz_KYsq9|7?bun)YI3fd&cdOw+|lS%fP#tOF35atA9xh17I@&rGzNVI zdKOy55Cf-ZF8q zqiMsT#WW6c6Uw2_;>=^VWcpaEnd9hFnDtmspjz-GMkgkV&F8P=&Z58JF2Z2ec78YT z8KJQb@E+?*8yZNPdluOo+nWHcE*TNqy3SeVs45T`F_~u`|9|p{?ER)P7!&y`;J^W``J&}{&9@6MtIu#FxwU17{>|+ z-_y^*vO-p&A9UC9_3wjViz8lp+7eQ*bD%q=3GJY_Ng#px6@LblQf>fCh{os~=p!YG 
zae``w#WXWz0Dm#i7#+k?QPN0#wJ$o4v4_dy?!nH`C-J<2mgpYtF=!dfL?6kT%ZPy6 zaB{&_jBs+?IFU0%5@Q*BAF0FHN-Q99`JGTNH6J_cya3E)cf?)fnew*4UMfr*7qny3 zp1KYJt?{Xrd;TVP7xzq$!~Mm&JPG`765Is)Hq0^0)5T${kc@=x+Mcg}Mi!N0jr+t(3xSGDIeUgpvV zBJp<~4W%)D-Lr?fg&2%a1?N&a5IWi|uti`Zlnu5B?1N>I(LrUUe*SX0v4v6l82I}D2n z340aoIq?E9<9oo?@DA@Cs4==8??!V|9sVywgmt z=6i-3XSknRw%KOmVb-l4CO+K>8TL5Nn)hn?p1pR+z1Y#rb==j~I^HqI-rJsMN1U52 zS3SATW%hZVg|?=y$?m(ZE!GHUwST8)8D8!k;mP;A0=FnaVi=wX)kS^~IAc2$53m@$ z;C!Snr#oi}w2*1!bU+BEpM8YBk)215afG*%^O)HTzRo?voQo2IH0U$^HTxX;5U!8# zNIQOOMmT*lYyh>$bhLu-Qx_uRC;;3D;(J>J(ugshHOOA5)U^UX=eg(I=fCLN?Y!d< zP&N=_T#sF-*Xn5H;#p=JjyfAT`L^!v&Gu%N$Nqkvg%*`9-V|Zz!>%H=-_+iA4aSsp8GVEy5ydFC8HPZly^3mL1_uq zMc-0)QFB27Dx-U`MzqfmiZw*p%=fHQjKz$d?AeSLj3)f2=odsHoI@|BKjVbK-;qSY z7W5p9F*Sq`sRI>}Qe&gVlvBhzbeOl6{~=ZFE+giEPpv%yC;Tg2E#WVooAzS>&r{pf z3RvKvXsccI?Hw)Qo&zqs>7yynw87cY+R)t8y3sJlP;SL+71fdUedG`@`vR`rb}=#4 zH`aBIIN+Y_jit5>WCS$O4Dzht8N-EafyaZ1@Ky3uLC@eob&x~oDC`953u6pp3_TYe zk40cV82vcmER6kzwUX15)0x$Y+mX?p^%rM0_LR$F=FyJu!r1p|e%2;f7TAn#N7lK+ zi4i~}e1D)}lNnMNCyRL`woRw=a| z%I{XaH1(?Nq}^taR_P35bO$QW8&Y&-M%eAO=Xx-59ETG$Vk&Tew3B=Uw!${lO;6%_ zIOVL?oXh-Xyb)Z07tb0=T9yYed$V?Mhj9ydgN0)RazQ4a5Zn_T;xy$uML2hy)G0`3 zCi18A=5zb8x05z#BYhNko;Cq$LRFF$m<7;u|31%PceLlSbCuQLOtjUqR9n02PE{|k zG}o%C`soanld1++g_nIS8Bs8{Ag`n%?@0FN!v00?%Q{xODr2hox`xIDCfGzJ?SOlo zU)^0C1F5l;B4Pv6h2CK{=O1F#=alkFcq-m@VP8R%NGXmNofC1y2H_5Vq2Pk(sjyP? zUb01ElXR8J<+Y_PBrRoc#1q7F@g>1xzEHS8@PYM`)rGX|GFVEw2W-t41b?SI1@^lK z`X0E`Z0)SWb&GW|meG~=>Y@7N(zR7jDw)NL3XW%A&q&UmpQ+2p$&u!N%|BeyyVPB# z(C*O@%&saJDI(OokVK=9hi@4E?O~7 z4QjztzESS6?zZ-Rh7H=;75ysaS3D~@ULr5ukWb4Qka;?J5@{tooSd4%$d1X1EqGA= zvb=$AqWQLQfEBS9IkLU)0|H8Y+7lgWTf09TPCX|A0Uet50QAqQYj%` zDe?(eLbce*&*LPpxg05XJwKNx;tB*O`63=mG*jGJG)B;qtK+=qo?_l*ZpCVmbFvv8 zPu^>}>MAu)w!~Fcm*kcWFG$Glk*UbIm^>}XpH%tH@Ll;cIWay-m0p?Q&fQm}>#7d#CeagYgtnh|PU3g1HGvyhDD~zEyA-^cim9&s#i1v$}!e^rP!a<@Yq`NJ- zD7`G(Cfz6PE5)P^(Mn;0sIj0}Fo#z~Ujuf4uKA6&r8a?%aBEk(9)tinXec;%V!@v2*q&Zy~{qnbIY{SgY)=I|&*ThhjOP-2$u z7M~Fw5;*zm1$RUpglgdz(FNfL;bE~#@>TdkGDRfhJ!L&({Gb^K*qdRcnCM!0>GPaE znN=x~zgBkrnC<4LR(T3S_>A!~eLYH?N7 zFI}P8?BII`Q}ndm>@8eE?2tw%8${66R1K<#RB$Bw#C6GYfgjjc@~l3C)JZx)_27}z z7SK9SOJhJgX`>;6)Nmd$M{@_T3OH)kWKJP60I3Hw!-u(}%^IzwL{fSwdvN}U^y`@i zGur1I&okuDD7#QHM*FL}pZS2Ti`#==rerc+vSx{ANKdDN+6TC1;!0{i^dyHa@+vy0 z3uB4e(gvB0;u>#i{G~x;{dKiF$HZ%?^~~7-D$O=V(w_}hVKh|D`_sOD;MS~uydv=?7@2hkVd>kV#{L3&=Ak4?e};&;7}9#k$$_$k0Q(Rm-ek6z|TD z%if;JPTi5R=U2znkLih7g@p%7nrQ19epm<87KUw;ua)|xYSCTccJ4#&Bj!5hYVy8d zDS4T@F)bIUL($?R@sYkbugra$+~W_ngqw`oBNZ=-n-|D)h>Vr#hO|ZLW3oFG* zCQLCdufGn+L|QT*vc0@^g0$iYiL+q{0W;+p=DzuTCpUPfAbB zq31KoA63pZiySf>qedZ4(l*mYuts=X8jzj|yQdr+QJ`L_{u*gl7icW%Y7Ivd7U2$W z8kR46A#BJ8nU_!sO$nUzZ*UKDe6Y2*ela->D*eFfBNcZ`n-}PEE~oEF?VRLGy7beU zygKcD_WUBCvX`ljLyu=t=QG~1`Utj&TFB6_tIFHqT~vKFSmdXuFVW^$ckH-Y+v29j z>0{bOw^PfMLP=MCI>v=J5_7zO^S0%v*{J`b>s!^1JZXJd^0DARcGJubDV0eP-&-b* z|6cZ!mG&j8qR?E?-?YfpFz^7{!*Fm`2y(^OWLmjSQLM;`7!~nGE!Qxk-bHa@RB`QN z7_kc?>qnL;&xN%SzvE_O3*iW`D<0|ns;(x<1~O*;G| z;b-2Djj21+CHciA7q!{egB}k>itJ?3`PIVr(nQ$>MWW(v_)gVn6)p0srccb~m`QQp zOSyge_mg2toBeIZm>aV&#E`MVr zPEBO~*q4%>em#F{X*>NO>toMK%1n45Ya#En*eBhlI1_$f^;CU0a$D47tOEm>#`4mh3s;q0Gy5Y^}M%Fv&=R2)TyfBioZ)f7r1hMWnNFyCV%;r z__NM0dGf)uoUFe2gUhR`#WuEENBj-NG0t!s^FK(ulB}?fim%E=5vw91HMKPZ$=e_+ zVr((jV!Fn}N9|VkRnC>=2wt+&&`w}6zLLD-andr$SXo_M(YNGm;h5Y*S#jy&lx4rh z{B$MVPgPCrsvSWgMqP5ly^X9PP|3!yfW(3|j2W89-fx+>%p(9 zN#V(tQ#NG?bFvDCm;1C2Exk$otP`ydqZj8Txpyd*mnqgnbWuOi#7Aw9N{b#C&5mgv z-7RWQWQ_W5_};M9Qdl^UBg3+wbl|xEp<7^|V7BTORUIvTSIEkJpRp%Ro}BTkchbq^ zJ1LJc2Id|tDy;aRkGA*lTmt$*7qCQj51~zbOV(D=O}QchiO{J}t5VfF)T=dn)Xmg( 
z6(eG~vQJorEK9Ukz~)%7qi`M27I@|zLCTG8)KqUSw-%kso0MfrFHa3ik)+&CO-*Y| z-W%v#+_?Ojj%I!19EE37-Dr2_Nd9i|8d)Rdrid<@p_;5Hb+j;=5?vXki2^iT)Z&Q0 z6s=^lBnJgQIFZcbXkTzFQS4K@>RHbkmR5HyTUq4D?USWVYmxded3AEf6js`_%*VMA zMQ_U|>l#^1&NlcnYBMB?^^Ch&_)bz5Hd(nZ;)^;Bub_0?fkCxMLj?EVJ~3tFD%QD=f+0m}yEIm9jT^WQr}7mA)nCdcon+ zuxgt5xMPnukJ1sY#7=NfQ7cKRJSu!i#5Q%FS`;-)vmo-R#;3ljnjOAJxnA~EGDeus z;W0e(mte)PA^QKnAIwOWO1E}5&FMud)zMr zzi5paNvzxaw&EjFo_vF1xpJU#V|bs49pTL*-bWmb7$4p+JXZNbo+Qf=>xH#>1K2`r zFfs%@MeOm9cfYrvw0<=X(LvSw%FV@H3NGe#%wCxplX)nkZ{~@tEqUt-UzeV&jx{CO zpLvHN4BFmGWk?a*F3W_-_mqFv zt+32+?!xn_U6BH22X1>|d+}A-8~I9wEc{XUF4byP2lW^AQFT4_2^ALcN-9m6G>>^t$!c!peE2?+f?mZ_8Pf1!WG+ZnFh;y~78Ccaat>9e1LjuDF5pv22rkuHviWj506W9o{6u7XC22Q}}Y_y|BMz zRLMO-1os?ENWTj0rfPtH{Fl5vJU^Vxo%ieqZG$W`O$GWLq%?aKuPRuOlahgDG)Wts z+Be;l)h}0A#HwQHciP5zr%{GNhp}SrSN;ykZ7Cl1E^IK_Uq>S5tIn#rX=bVRtK1O- zl+Lg!S%LU3{#H&?=5XWzbcymVu+m@0ch^(T9Yb0Q*EsVXb?wb;=S}qt^6GPCSkcNn zeb%S6eQCnf%oHetlG8B%QRy+Q(9CwW4cw(ZL8i0TbH552h`Y#6%Ni+46+n2yh-nd* zRS#8HRSzO`5pR`0!yd@kVh&%&u7$DTrr>L$8P4-*yr8GE8+WIZH;X^GE$%fg0eOIWpI$$^umGzJ=zrL^_*;>fW0}o)seGg8j(D<^F1sqH zDfTJKNPT~aYO<=YnySLX%M^FW-BJPn4*NDe5`IZ32(0wwd#RrB9-eoDx0%0!{BtFT zSVWu(sPLcUUgNQ~j^0u}v$!oso-_+VJBlothsR9rZ=?P%~CNBYb%nRr*aZh&_kV6Uqj9;B;@Kdx4Yh zYVVrksr3H9(|}pvQ|Kbn2+c!Q!4ldLV!XGTou|K5(Y-J_r%l>FDIb!IN&Qnb>9?|r z3hGxltNU8+J70U35Nm03kYf6C);Z21{w>jY={5Ot<>!d*n&*+?$oHD}>apP-`76m< zULEW>`id$d?bR%Ilw-1^ja%Y6?JW$PptOdQ8LQYrZlo|)R4FVFp5bM%X^d60NqDlO zz@RUGSjf#9n&SPrKXK5vo{0(HpZpZ1Zpqq~S6!m2x@~A|yXg5AcmhtNBW$x^uylpe zuX06Aj@8z>T-#RbR$Pmi*_xfoX0ko}9;}YYG_ZGIzHg`Jj%$_sj9cKJg5M%-$ph&x zSvlNjkwg4cc1m6?kCjU#e+#d3Tyz9H>^*6nQF*fHe8%mc^F9xGH}ZAI=PzFEfBE|D zv`zxHgFr@qm;vLGmDt7dg*tNBm*J)fwRS&51rq+nqt5M@strQhxuQXx?5E_^e5co%X zn|qFU@A+pEFnAT&%KXUPD|=tb-^8eC@Y5C1ydj zt@v%uo|MUnXFk?@ed($FvF1_xho_$$ec^m_?sJ=z*zBPtdkk|O$MK&~9rg&`T!}z& zOgUc_sW};`)l@`fM=gztirNy%)8G+1mDA+~;t#x7b^^MOR!p43A9>#Cm~lq?HiEy?5-rOuVWtMfq8A zzaD)V{I>23^ijY2>ifU$2_N5hHvZkwFa1+L=k+b`Y3}W6PAP|9v%~p3@ljb@`50w_ zf=aH8ZNj%He<|063zaJ63dKKRgfv15314zGYz#|=9pDb?2BH~|2|S>fz*Xov=1i_d zG+dsfWNONz%AzmF^oeO0t=804^(QsXoxWHG@yx(y;1&1` zeu`ytE($u!wuZfn_@!LzNw5|>>O)#bNhrJ%c^7sP5`R~Oy-Ro}eo zMaiu^dRC)U{QKiC^WPJ%em?8-6nJcY6#FFZx$yOl4;hIUQby-aENyNWYj2Ck)3z`e z+(x2EnMvLte0PLXHD0qXvVhd8c15p`?i1~fnjhIfb4|5Q*&wX5#3Ja=ea>>wPrwHt zH&stPKvjYp!KTm)qzp~Q#;{s)=)4j9DZ-^fm+*!#MYvZG&2Pus$L`F0#CU=9qvcYZ zfqlL^ZmuiL<})8L?$oDOcdNw8`6XF}H}W^-p3i!dZcA;GT=4Usp96mM{ju}s(3E*; z4YG9w>N1bEmvy7Z67WKUu)5r%!sgNo@_CA};kP5+lbUoF^(Zw@eN6?bhK4UwNW*fa z_e8ORqnsja1$qt=Qhx=``5SqckpH(@<{stF@wE2Y@$SG#>PE1{G6k~)OAut!N*dLSw#?=Jo&5b$4cuWxtrUqIg2 zxkTE3b_BNId-3yryYGRwr>E4#a%{DUE%QuI_3gFyt18Oxl^rN)Su8I+m*1tJGEZ1= zuHZvay;5bxCY{l6+;+jmBG<{kpx%rGau#uM2ME3iRtRnjXA4>j5ApBvr93NV3TH5R zugHwfM&fB+ii%k3|Kdq(x$h8`YTd!6yu=hvH_jtz<(9h=njTzp=#mvyVt-v4&+j-0;Hhw?_` z&u`YD;Dv%m3yvz-+N>~tcmCUX^YY%zEyG)sOYo#>Rd`*vW$5PMO@Z$INxq@(NA7j* zW%wWKIy{Y47rY_dEoWEGj+|DJhjYHcTWPxk=lc7&)9ndfX>wY8ORUd<*#7?ecI~zH zOsKx8GFdjV|Xfuw$}?E+`e%~uxL>6jM85!;x(P>yEm3L z_4Yn<-tk`%dO4>tvOd~1FE8(eyv?~Aav#gBi8e+1MW2d%l2eEG8CT+889BadoMqN4 z-UF##$@lO+TkF(2sVA&u_NVSPe<(CJ=ZNT(ynp071rHZg6}((9sabzKCwzbI1CfgG zoX}pp?K#WW#`hSWSsRAuhTic$$2($|Cj-fb#0!a@$!4j^UN5_~JKBFu@Y(RL$g12Q z^PbOd*{m%8`25a!FGQcqsSZ6Cc*OUMz1Ulvd^Wx=HoI~A{;Bnk?yaevQN6!%NBQ`& zM@mmCxw+)p-2+Re?Os~)e#uWIx0G%w{jmJ9iuqMNYp&fp6KNp!$G%BS^e%N?^i}vD z3|o{PW$9?1a~ttcX7m`?2xy{fp{1?47jdmfGGmYpa)6RaKo@ zHM#1O%1M>S;oX2)%HtZo&Q&V=dxUcnf}^{i5@$ zbBcS1bFMoN?|F}~gZ6`VPaDq!*@)3OVbulRc)X7~G_fpR82>chJ6Vvb@ZPtl zyY>FP!NPFA$V+(t_m|xHxv%C1bI*7d8J=i<4X2+Pr)7 zE1HeXk4HyGehl3hnCqMELP#1jpaAgOsadQv0MB* 
z?^fq*|1*Kwz;nSVfeC?|{hNJ_$R^R;cb9LY??k-u*u!6lyagxXDU({SG4)t-dty{# zY+_Bat@oMT=(Y@2g=a-8_2A0`npO&6qwzw=kbTT54{&28-6o1JNSuzt~<|ZVRyFHd2Owit*@;*d#8P(B90x!Q3*#?TRL`Nf^ z=bV*e=L`wk;Vz+n1V;o``ez`@eGD81I)j}aZl14`zc_GZXlJ-4r!w+kba!-c?z-r7 zJi(Y3{T6Ro=SE|A13MXM7LDV{*4J{L##6uJ@h;ezcq?_JI~MxpEavlt~5BCpM2L}dk4ElqMg8lK%#$Dm+a4P4ok&?*n z$kxadyk-7m&ie4Zp>Klo16Sg0+#h^>;ZLXf7P@ur6n6!(pq%R)<+}#{^{Vd!Uy<({ zl>3o|;x^w3caS^8skD!`2UuOa`sBjIJMmHYOT&o9_x3m6H?6M!-X*ouYL-_|tv;tyjattc12D9n)>Sds>fC>t^BQGK*jy#*OgBy|FZnn3cIRv&1rkCsrzQ% z@Wy_zTjF*ymHbO;S87jcXKHZjwA9Gd8>!Y_)Jj-O?0mO_uY><#|4M(8e@LK9U_s#A z;QPTRLLGC4M_$9azbS8B{$0(A@t)k2W^3}tc>FYK$qJKz&g-{d^#TL10wy>QGr|HlA^RI6N+Ve|T?bMQCQI zHgrFp{~w-n9dZy{8+|MKWppv#^q3iGi8qg)57&iGL;j2NgNp(W;2o7Z-`mLMak@JP zZ$}p7c5*SJv~M?OmU0l{z)qJux%hEB5O>aL~UYaa!hKr*Ul=k9lDt-d?)Oz0;kBb^2_#(7D{{;xuy>*!}Ec>s@P= z)e@_si?#7e`v!Y7_WkX?cD;3yeKMXy|BL;#U2Ok3ZDp@J^PCyZP^Szib+tDjqf$R> z3-Tpx^;UcHy%j*xMfRu;RwwHyWF*Q*j+I-im^IvPvR}rkJRDh#+W5M_Hk;vp#MSO@ z_W=BRm)p@7^5y#apyns{MM%`v-H6$`!?)jqjo$)p+S#^UV=Y3ar^k`OX@J$;s`h@v z`$B`gZl0HV8P8}8Nu83aPkxjfnygB^ndq0;93L6ajZbJgv1xVe-dKlNFqQzCZDLLg z$%tZoVtr#L$BvB+i=7ZVGj>VroLHCG*|Ae%&0@6&-aTN&j*Fdy)~>Ohv5B#Hv9+-Z zWJF5F`ZP^!`o5`4d_nwAiMqs7$@5YhQ}=jf-kH|D$dNVI3c)9DuxH`P{l&;7aK3X1 zY-bz1V5uAS-HO;X)W6B!3BG=P;Dx|vfsX^zu{KW#ycAd)_$`nK>Y)dj&1L ziS-!XHU7qbzrPJ~4B7rXyrHxi@$Yp+h~p9Yu6C=O=bh^upEJuI4y~Po#BpC+4_L=q z-+O=cmZZ9*Rwo}wj!1S$4oE(p+?H&WIv%#^CBI0vO3p~s#>d2ujK`a{G@TnC6<-oR zBQY!SN@5(cEVzjs@yFsWuJ{)<&2C!XRE)QRhQ_~$m*DE~?fCuizr`<%caOKin|}X@ zw@ZvkBoe)oQ<4jk8*!)F1-T$TNzP9$OKwfp!?UKPI(b(kpH_*-JY;9VD%v13SwHJ@ zYmb!&oxT9?YT-)LLA6XrIH^e&rDV%K22Pd*dCu4pB$eMA0EFu{z!aI z{I>Xg@dx4~;yr-MfOy|{|M;NzjQFqd0g1_prHP8fmc&6|mQTkO`(4bEH09=6%}36?g{ zsYTzf?Mkc({gH>|bLSgpgYyHBxx*Re3`gFh8=d={51qx%I%k*j8@7v_N1dCIL+C4b z^G)_ptPPjrE$k1F^O}7ifWOYS(+yy`|n}Q9Ip195H7h|fhNrzL$a;4^^1anr zPumCV&Q4$FdgoMUIdZ9Wu@6`)tz}l!9%9$nu5*#Iz?tJbfcLAnAbQyL=hjT?VPvlx zX)UmJ;xC%Jtpmuyb^y8JMkAkHCu@`U8m@3Yz;`S3(S&bQQqU; z1aL3|ypgfuu@$y20L~ZK*Wel^WXG_od~1Dby^5@LE0IZNr{%YstP*Rr^$zUi z0c5?q4KfVDHS08FNSljf*h8($k#(<~m51%=)(u$W##&>nLC{%&RRsQ?_U^ztZg+TN z!SPg_ukbzw-%mo8vEI#i4)r!~8s5766;Zp<3n6!3$SQ^&S3s^+-V&6rkx^wG_HQ8{ z-o4(9-i6*--s#A0_msB)x$>GJgW$zhU#m5;=l$SK21*0bcA+-}bF4ycLUMHqxbu+N zZ;?0Cdk}NpeK<0n4!+yUwb$%CH`;v6iB|3zY5#CjG#X3tOxO^$N_<2)ikR4Q<3+ z%fR<~@Kfg zyOq$;7WhXYw73)3I@RDV52*D;KF2$($E=s&ssDlvt+%#WyR7Oob}7UfKdjLQ-$+>X zz;GuJSO)KS1F>nM^&tG>Z@~OQ_+MAV?E>VJv>-e4UG7PLtATDc3iD+$bK(wguoXER zH=%4umvulJj}Cj^z?T<8zE8Y)(C(Y?oY`2nXFTG&%u5c5P*^>^H9{U3LLy+DaA1G~jy9h9~5MZYSLP`~kT#&$iBkRWV;?f7s%M z$bNY>JaH6!d>rBtvro=NdBvJxJq^#i4;dzJLHl2!fh*FvGB3i|GlAsM$YtqRw8!0; z_cvTWmH@wM%v*t@-@xnl!0RjMegV*W7uufdy##BTn)ca8yoZ3zqwwCTz-BhE`3OD# z4Vmbrn_$)3A=8fZwh(zWw*s@DAnT84T?Y^S4jNgF-v7p^g|NAgf$hSyM{ET*wTNQ@ zNZk_N)6wc|ortWSXTj8RZfGk1fXo(0LzPG@)R4lNxIj_4olf$}jpI~lo7dqRWf zL6?^u#BGFiJ8boz*8d<rIh*@{Qs@B7j)&MEu zwI*$O>w(HPXl*w*s|1hAY1Kg)i_-|=kx$s@I*egNSOnZZ2Kw_*=EK@W3#&nU0}$Q; zS{0x}-7{-$3X(G0HF8{{micdY!OH2`-y*_%4Gk_#N6bYS&ISocMc_6p$f7nG~8o?Z$Ix)3&h0k#+8>_(s^-DNw3Y#5t&lym97xpNRC&4$FEVAV!sUjxDuFr zkCvZ+ViDx7gVZ*7&Vg^`VjBedO{lE|#>|+@EVW<2ci+X8$Skb2yfT`I@&uyiL~jz# zoTQnNBY<>k^z8sTC&B9e05tj_lJo=GECbSU z^pC)a@stsDAn+WBqd%foZ}jc~A36b?Fgl12rNA>+LF%2Bv;lZAqODDr?||R;=_toY z_aiXcfbWm#BOY%$=(pQ}FC#tkIICCHBiAxBNi&Btvo~vI?G$Pb0QD;LFNQ@jJ2B(w zX6zXQwt*8y@KSI^AFTtj%r>1$(+DCejt42s#_ZHAG1xXE zrw^JAqfdSs)dGw;G95YF1OH=Sx6})@!EDYFeT@c)Jm;cjh&-+QWJHA?K))ACPh3Tv z3RJsbd^>2T1-K(G%*oB1+RTU@!%;FFYhA=n{FxiJ#}dSu|6m+rZ)V@#hxo%-RFd}X zUxCR6Aioy2whH#W0ye!0Weo~)@<9(9fg3UV4Oo_795o;@P1Zif*k6Gw*DIsDjX2+u zzCPUr>!)A%fzJ^@fa@}|K{JvKM)qTt!D;`!7?yP2K@aQ-`(qsB3e@tT#4aS}dcv=h 
z#ykKC=?B5IlthG@XLfGJ8o9zL0oNtawYf^*s^kMY1!>Re2wyn~aiu$O?g`#{0Q(c6 zt)nq>0g$8zOaAL3*bVX81W9?dv@u=enu#@BVSY{{MVxAZbHjfugtp0F3Sh=`{E;Bd zHD20938gg^LOp*cfz~B5FrPRrjJ8d0&jmLvU{_p)+rSQvM4^4slNoDh5pwOsD9g22 zuaT&?QlKHWzV9GUvMTeMp*^%h`#dharHvkq8Cs<+EdpM-3NnT>|2lbMzIMsW&g|vn zkMW3h$GArykkSl#i=vL!D6;ZekJd?z`Opx02T=;%D?AhvF%cIbVMPL3EXaULsWs zAkqqWaV_cy+w7F~>|?PfX6?~(1mb8E)Oim>Pm>joa`2i_u8t^2DeQw+NQ__POx8(8 z5?+hRFG?_g6b6pIqfA0e_Ex5luhlcP;P`A}Sf}{otE=RIaJZT=1@iYa82W zZyE>bW5ijJ7OaT?EmUlR(aH=2vDX`m{K%B}Be83fG7u(>|Y{=7=#vd^0oZDdxPN;GHFX zN3K=rIlKp^#W`t<;F_gtKQ-DS9Z&LsBz=}1!1Y6l{L+KOFRm_vyTmYRp4bR7{A~w% zG`*QCjd(JDKY;g5Tz9#Oh^EC(MIRC$IRodV^;2840eT3r6`W}Sj5)LbVwRU71JUB@ zD>h2&lqeKROG28YY6{Q!`x&~Da6OR-B^syiQRg}{=N5i+FIp#m`a37Qlchv*M@tzb zu4ntN+!gRzn}v4`^a8H1yr*Earbq8jN3SMOmvxU3Ozf4{f8xo)FLl6s7M)q|Tg2X_ zmeD|Er#_9}(|5%3>C56dv;o0Gv`l17(WuTIQp8^5w?IU8@@zCH7;!6wo+3WNs}kNn z@tT6sfGYt1KEZWc?waJvZx8HBFi}KAcl2EPD{Yw(i*_v6xb$O224bZg)6;n@R;+*Jxq1=} zX;fCMNSpIY^d;Kzdn~+H;57rUs`M&>YcJ&%**S_RO6(>s{Ia)>65R<_;*I2+NHc;c zcVa&h9c4w5SZY>U_7oq{Xrwk|Vv%^6T%YJ_M6}KI3TZIn(kgjv#+b{fD}JOAg@1`) zG~qQnJ%=lfTsKixSz#GT$-mgYu712bWT8*7PCQ7~1V(4^3+haMQ;2sL65j+Tx$ZKn zqk6q)h;y^(FEz<^5BZ?ZjU^BfMb`Kv`$$YxO^d$O);KG@hJ|y8g^7PsXVjV?N9=?K zDbNyhHD^1=@kQyD}rJrYpQG+9fS+vS64-i`C`Ljztq2Q zrd*2*Y1j^KD3gaP4T(m)XA_CU&W)ZqFEt_e7!vE~XQB^dE20a|LyGdNSK@D6r^qEa z6zi0TB*?Nq?M1KM)W_utO=i-$#BS6&$r*VQWMthJI?}_mV1o1~GagXeL`y7K&ctp+ zcCm9~IV5lpI_!j=Y$*qIB3joYK|)5E-lQP>3h%KfK8udQWk6^DFW1Y+>ydN5$c&}kRTY9Qi6)BB@8hgmE z#4Pf~UWyUtmZC_}e#ELWl5r%Hi_NBn{bjTiPVS$97o(#XNn}xsYTI_ zp&`iYRyDzM+Lr8zi1g4a8A>i+O04li`ebq=vnvlGjUX!AiU*Rq#ul*#gQ4m|DVfM2 zaaB;I{}5d*f~cO!o;?*8K|}Zvu0_(!W5p;Nb*&Xd$fs~F_DwVd5wUIYe?dxdk`-E! z*6R_~j{25Zh4Ct}2=QGWQ+m<0$|BKHY6TnV!x}*^n|I}0bSnG`|JfW;ej;T!%%rY7 z$ew-VL%kWRQLV~8(@vxx$B1=`Y@#d1DN|%twPD$-JxjD9U6qE&b8VH#DAvkj@l};m zwLoM9BSDsPWzNiFiI#Fii^^^nj1LN*v7+e=VkO)XT`AeeY~x31WbPFu(Sy+k`>1Y2 zL!u?3BVVHrp<%W%R;AY7nKQ5l`!g08&k&mz>k-7$Y^U*~K173Rp)x;vNMD119&xYK z*iR)AixN$!K13@#Gxny|t@5ZQ1PRviL(2S^h);fmE3?(3bXRO-q(N5ElG#O8J!YGk zDHAVBC`D?GO$a8!hs-I6rwOGusy~@oWTJ$MfWg7^kR!P#CK0Y4*$AlYQg3FX7Z_=c z4UwYZOnFkO(jsV?UP4v-P&!%?YgJOww2V`l(rUECb3skU%UGetG1>mAwkMhq9E1 zqv9H0Iv@#JA zx}>aM;Usg*`S>yXnBKybv3wq}RX#Fz=HCj16QQTAJQE(xR%KG%Wa7jAh8H>Fd}`5$ z%SZbBk}BsA>a=S!BdH$x;i!Y6q}LHfREerwWhDCOUk2F14{2~~D04QAuIA3w)R7C} zFk32_UFl`+dCZT@$9^&&_d-Kv=ec~+tR3V|$H`t>4&Mtc)(dyKm3oy=XW+SVA;)at zSosQlq9>Y?b8gwgc!1RC97bc=)Qx-`nT-I?NR=NMCCA+AY@9{=$evrFBm5W|nWIe) zJ<4p&q;9NG=GEDRbKROb4?W7HuhNR7zZVUEPZ~o}#|y8@iR=yXMiMz92Rb8b3`cU5 zh7cR|yThdyBy}F0SEwsR^-7*o+cJ;L%VV(s8G42{(7E0JNxU*ilA`F{_LB%m)=JH zOzwDWEKFzP2sxI%T5t9;j{VHX_$KSjEF!g>bDNEV_EkJgyUvz5gYqT&LwoRCI5V@G z(Hz4bijTIKzCurQBehbabY+D07iz528rkcaj5QqVd|Iz9%8%9`t`DUsGx1Dk5-TwE zI*0ZUn%wG<;aZMNn~qT$W>%>Y3Yo_;Rw*9ZpY)8bz2g=u>G<0NUeRf^84l~iyG1>jK zR?v|%8I@V1y$;_CRzg#@Ixc%2_R|{O6A9rrv&Qt3*@T{2j-md0{R|zB6-~;K(31Ad zZ8p-RZf3|l%cP>TNaxT#(q^deoc&}@ZO!CB>a<<@%2s=uz4p>E!mIEg$DB!!F@3eg z&=G#JW!F|k`cQneJ##N}WYf<+lisW|{A5!2Z}mcjW3uTfmF(JVie?OZ@vEZ_?WgD2 z)J$(Z(srRm+_Xk0NI&hZ^J(wQzM03`pL_Xgjgde(%-)-RtS3!l>(WzcW{;CI!$US5 zIpdZ+C}VcJp=_ucTA8hf)@uK3+%lHy`s}B)4fjG(+LZdCoSOL$KN5;sFC%q4_i`-swU?RwaDLcd*Ae#19CK)I zZIiuhb#9J5+=?=LvIjp(g~w*Jjun0~TiA~u#aho~uUkVYlfJh8UY)jaZ$?Y)|JOfT z3hAx<>6y~cq?L)ZwCl`fpV>dV56`rfv+1$$FK3G5;Uig-IfniD)tNJo%yBlyI;+ej zTT(GJWZvJOowde(6|c+@X5Q>0E)R;kbCOIvj>G{}XalG&Ddtn+a%h_Ka+ lV?EEL_P>o}FKx@5)zIOQoSC{qH9#tvAEBspGh58@{{o1}kAnaJ literal 53810 zcmXV21$-3A)9s#F*CfPnceewAySwY*9PaLNIKkcB4tIC=!=2-hBLYcQ#=F0_`~8{U zPIh;Cdb+EtUcIX7rB%av^@{vWNc)=Y>h>KxDpv#{gyNW14R0nBB9I8uyT_;=7PPSv 
zH!%?p$w8t?9C4wQn^=gG6vI)9lpuvkR$PfE3Rk(zt+@9W|IWYhUp&bDoN)^=cz}Px zH$LtnnMpyk%|WuEPdj?aP72_goFq4iMZcLzSyGf_!F3Z!#vL#Dzx)l?c|1wyDf|O} z#h>sOIKRdFWd4;u<#+jCI1cke{5-#azdQL^ocHm|{5W6CXY+-80$;_q;?|brYyf^R4+u(Ww9Ig0pJ_h$p z!kwXSJMpo6KAtd%_u<3v>dIT-&UU;l&UtwOo}Fjq#dvAF%JAIy}5C4r9;xYIx zC*D=$6?k>r(F=F?z|o3#$1_{un~vze5{~@1BR?;}Yhyf(aHTiKJstDgf%!dzES~d^ zki>V$-9@}4hD4GKd`f}r(s(+q{R7G0N6Qa*|Ba{O+Q0l6|G~fT_xv;3B%_BXklhh} z6q4J|uS4FS(dH*si}P5lQz23WM}Cq(Dv>IrHyK2_kj|tN8AJM!p=2tVK@!O#G7D-m zh72ad$Vf7eOhCIvq&mq%@?w2!;!{`Bopd98aA!+=8bPLn?-`0dx}vXQWByk0@(0L5nHzEy5T~ZtO6~a+~)W_$9aF3Zt6tpoL zdWj_|klB5H5U*|3~H^5V>VLS=E3PxOtXXAwbVE5TW_LSXWf3v&n0^5z_6no6R zvL84COk#w)xDA#Yi^GM3F@xE;g&Ry^7S34&_wo$1`@()PKTBsRIDcW!*+-UyPkv_R zI?G_0VDmM3OW5^5V8<*z2eO`y^Af(9Z^vq$XQ&m^>tR$rX&_G5H%$e@9YbNp>1Zvrq^1P$!L{(KHh^Q<-pF`;UCb-_X@Kngd5x z8iDsEXhm9<=Ehwn>ZK~NQ6Jg`iG@bdJTw;HWkOGceu9PWA=}9kSmZ?5Rc+WqeAzuXVcMb^j z5E^t8T6Gei&*S}N{tkNc4w&^XeDwc&=L~q}kYA_p&XlCWr@jPoU4{qP3Z2}Bdmo{% z`@porVSjoEsI~`Iwhhla22I_Dr_F*loeQh$2doHL?l9={YB;3#_d0nQoBSJ|=^4m#1LQXyGFyceJcC(&hLz@k z532wl*aSA30E@|oarhvKCs@xtd<9>MJ7!|7m*DOVn8y~}c^7N@1ta~5IfrJThLJsh zkw1YqFoT~Jfdy8D9u0<#-+<*j#c`FKgRO3dZOz9>M*s~P!&_9wxeffqP>gsQGJr%M99Y6YNUIW#0%&W- zxB~D}KKLjB$doON2@Rm(A>Th6Pnr$;nT7in;ED4fvn3eW0?2hG`dkA<*h}{PZ}iW} zE8xHptiV3B8i}ztfZTGz4|4EL0+|&AAE=4Bv;>prjH^w-7m5KD1HjQYkjs5&*fmJ* z3B2WTSjjy|>kDx84P^Zlc$x`FRRKP<6c|h-UfFSu1s5hH3DSN6ZTuEyQfFWb$Dng( zp}8qogCyL42fYM<$e|wUfYY>xWClR?12LAq7;A5wdqbABA)V@IQyRGFg*@z#QgN`K zYWS`Qq}C3te#bo_q*w`=EXG=G0$QBHy6wU@(;%;@klGf=XbG+^g8mHxYRmu`h-19cW+r6%F%NrqtMe-Gnz8AvcQ=9~gaor9g6 zg6(Vvn^_FI9Rtf>1^YP%A9MqL>O5rm47lQeZj^z_1q~bv*eT zTCd^DU=NEhuHCSM%^1}myefD`0?xI-JsQCG*TcIu@Vg^{R3Tsc7uv1|YAy?-+zR}i z0|qh$%wiE=3!B`Dag7I}^#-Tu4zAJ=T)qbQdl`7}I35WWP=uERD-wB-1(^s2V8*+2 zmc~B7_kUpjvQ#F64}51onE>8X660wB&)*714c-K7sW*;6U;$(B%+cUgeZZWC@-E>~ z)&)aq1=i9VJf;io>Vj+2fx1h>qrQfj-vdg#gMF)ECRhM+DF}(zhZPONT2F-qj)B*i z0Q;GRb6Z>~3(E*`_g2v6HtBR<`s)vkmFf`1%<{x5}=Zvq-^ z!?6(0>4_)Q1u7MR9CE;d6d?IqXwX;0Hdgq{>_Dfo=z9XRaTl2OA=t|}+&3O0?1y`s zzzR!aJlPR}Mx&Q}7@s-Jn*6}dkI=V|@H_uN>+ZtFPvHCvmi_=1u^MYQ2%fAR{8=xs z;bHKBzk|=N2{dHHSPee zZLk9Hq}yN#E1@6L@oEk>oeg@?37QdN+jYZCb||!eGx{6^U1o0As$>7M=R+00JN%v_LVXA@$lYDG1eongZ*KTL?Mk> zNKu354I<*Xidn5hyAao&h4~zToxXug9zr%7aA!zUSEI#J*wtL1Maa`G$LA&ZW)ULr zy}*jwkY)fr+zQ{E2Oh2zv>;^LKY%sfzyc~mW10b%LlJO2^e_s#-8dXA zW=G#XZg4wBR}3R72pz8jly8rgU4g1SfcB*@uJ6FK$M8J|!m?Tg=_K;WK!AbJlbKk7 zxsc;N%=s$x?mKk(0VMk;R;L*+ik!l4$Rp%K?jVw9!5WtXrc~s`@i_|L$UvcF_KaO& z2iRt|maSwP**2W#uvzRlJBp|z1+hyOUJ!Rpgj`o(ea3-Hgt8eefK4@_Wo4j8A+&1% zO>2qI!@_Gi2VV%EvknxA~T*H0Ue7Y z64-)`{J_}^{ffgK^`ISnG0v5c#A%FqW|)Cw!8Hf&$cu3m!22vf)8@d~Ezq_n;9|GP zd1&LoaE!AeJepnLXt&4*l0mZ4;0+&OP#&RI{dOZIeM)4H9 z;1xK+4y{z`QYz$mK8odWM^h>M^?a$48j{c`wFaW3GSQDrmFTXa2^P;7s#5#vjU~>!#hQScb11edEi3_V|A~R&%^|+D26z< z2+d1H`WR95O4w?DAW{WbKp|L7SFHG?a7=Ro@xftoft&~*e*+l_by1C^0Ve~9piPLF zZHTG!B6c?8T#dG;Rj8Z(B0msaKSFz%L?LcABaYTcURs|1Mmy2IbP%0KhtY9#0Ubb_ z(o!@RqU~hz0)BiPENcj?up4l75IDp-u=-o%3;73eeh7gF0WIQ48s_)~X!r=$dkEg? 
zDm=+k*mZVbY~3(d2(it37$ZX-;3Bx^Q1n|CHd_yU4}o&3ZMl3@pFusV%ohvdp5cklwU|A5{%LQaig&&^;RMSx6MfEDH7 zO?tqV*5MgbaIGm=S7|VclJFM^cvl@GXa-x)6}}_HdqNo42=i?p_7dG7|BB$oR`L}- z_7u2n$V*;^rOkoQo(Y^zgimY^&tDw;(+v%@aSeH>pW!@HGW72hI}7G%M=mN4UMAR8 z1~OCG;1yKp<}S95jbxo!N0z|qv0AJOE5$0ZR;)Rz#2n1SqF6l3#3=seU@&(h5=pMI ztQqSL&v6NCHamDlV?4hX?7j<-;11%X|KRg4!Uybx?u9t&I%we*;KXV8EFHN0COm_K zU@3#)J3_2_HT=d_#O7CVcQUe=WpRYGuO9qHPspklBzPI|^gt@7|MNBo6vk9PM%f?Gk7c+L2f-PHlUSgVVWP7Q;BxK zF$9|4iZ-L|=xiYBA9O8{buz6O!xIfNg|weQj~~eqjC`jOYsPJckH!I5e*> zVt}&nq9w2jp_s2B&RZeLYq0&-_8#s7s&=&`3fYy#U8LN@F4A2C+K}Uc!9QT60Aps)!bwY;5FK?$}AhJ z0E_6uda^%op2k+A&1AIh0n2H{Iyg zXZyh3q1QEN-G}vuKib8%!y}wPM)?_gjaL$LA(||X*-U^}5g` zWiEJ!RJXHcLV761v$GK(6XlCXzD*mH5AJx!WNUTDgiJdor~lq--KRwU{%(_ z;&y`btObu=1Mhnu_<0X`q+`I(zhJK$Al;jIT|n!lIOc_0CSpYfA^z=%XZJ+>FbHEC zixKXEl~2d2tiq}dgx^---9ED0@XF^IW7)7Sg)xW0V6$_8S>wQDCx)~3L(s<#u!xsn zwC51Rp2E0~hU-_K!)SgBW1`^ye`8!5Fw#EgDHqncIDB6L@Y~K{#;gB_G5c`ee$3a$ zivm+SLT6_ndodiD?19L&g>qKo;k_=xPuYNjzX$wS1TMA<`hOko55WJeg+;Bz zy(178mqkPn4Q5sp{INcITts$*51s^fdkoyajd!aN0Zc^O#d!Y|Ic5|5U18c1*s_Qw z(oJ*~eA__SY-w5$42#eY@GTc$)i4p74S3WVn9&*;<{Gpk?pOvNxQrghaeyA9FX?ys zFa1IPrtj!sdXjFYE9g495}tB8`fN>m<2(fHZ49tx2kzL5;{@6!gWVm)`!jS4&NF~U zD{yS03(;;W9ZM(D)pQrVL3h(@^fC=liJFAWLZnazXSW~<0csX90q>lGN|WdZ`h?yC zSKN#FY@z4T|7(ox4!wwRZKA{JXj&h0Xob9;CxZbm0iG19-1Pynwm`O_4s5A3r~Dt_-g_|P z$9UZZJ2rzaq<~9&0^|OK2vKJO_(KppQv=WZ0ffAQIlaX@ky~Mv^^mz5i8&8IMxin6 z-OB$34qriEr@=!0#yrzd87qugP6e88#|U=|Z)BCgWO1 zWPoZz`sJbHjgSopRZ=@a@0LJTgCMPzhz5Fx(Ro8yvqES+8>qbzcmIx2ww6nUc;x@HXhFHgG+ZhaL?)p9z%i4;1Kw7-0-p&J=tz3%Tw^!2GFr#ya45D3;lV z+T=lG_d-$hBXFaK!0k8STq$_}65qWH$F?8v`4bq>doaqI@B};1QzH7AgVze6)Ir1@ zp?G)^yvR@>Pha?#PGI(J;1_CyvjHJK-w?d25SUeQU_>b#Wib9aV6n}^aY0jX*KY7B z-7w=W_`V`qlmr*cfp#U3g|cFIOftAr3~~WxWGjQv-gM}%4}QBi`c1Wvc+Al_|+Xf+1>Mg}%y24l;Q z9DOtBT4!iqM=+cL(B=MUKLD-g!v-I~)}F(nzrunkWM;y=L`W|JoHLZA4effc;&35P zW5!;FP<2~DPREXKVxeCur2GQ*{2Y=<#kjtPBkHS=$x)2$1m<%FsCg1veF2DZBP^?p zV7Z&Zcp5_D?U?;S*hq-WTtJllH%9aXQhkE4J;My2VRWH=DFl8cCu|^;=^hmJY%9PX z7D2aGhcng7VGS#xRcqk|*I^wKq1T;pv;s5l3oYq|_3n!GY=Lj8!&8-oCN+aTH35?k z#l*wm2PeUQE<+#Nkf}KU4sr(Pt$4yj^wkahgfd>O(7r6RtuFL%AT)S3p0pgTLRqC7 z(Epo=B<{j8u7%^?6p{&^w+fCrv^(OB&WO#s!@IWzTPY7;Tm)<-0c<84e7S-sRYsq5bpG^DaE|EOg{Ac!JQUYgqO7uz_?Op`BH_(R% z_~sX?Rnh;mNi%dYKkTCf&JOHndj{`v34N}HRg8w-H-+{$0dlv1uP%mBHAGZU7)ad@ zV_5?|3|02G;>jVl9Mar-VQ%mOn)n?xppd^i4_prQmIzy&hfz<1^@OStb1|1C81W(8 z^#Hp20m$u!mBs;!3WqZit)L^bu!ie_pWDH;}} z_+EfMRD;f>5j$dQI}Pnsx{7Ry1fhj{J-#HLr^DK-M(#sm37HW8|PHizakhF!D<-gQIFQv;tu^{e{$ zTM?0ND7TU;?1h}jcco*Fp=!-J$oLp6`6$j;aK3>x3+iub#A0GOv5h!MoF%RoPlz|fTjE3U8qU|m1NgjOTquqZn~2rK;$pOzLu9z`t#Dr0 zDl8QK5UL0z1TXRq$B}ht2f1aXU$7!Wkzsxg464aLBmY^AWf-@N1x61ezwwX0Lm#C# z(sk{TwnZDL<F-Mk`eVa(C5X(p|qI$L=xmyrLH z?nq~(InosAq(tQ&@+SF?{2h)ihtfn@sq9gbm8@!cb(slaz(Z5@okCUg@cfR0b#+N+z|cI#1oA-c%#C zDq1(~q?W2tJx;H!57YPSFLc2uhp{~|B3N}~49)fZZ7$IKApF2waI28bK0=QEA!Z~I3QLT|ksDdrC}d*`AOfp_Z0lmW z9kJhg`im05BxFIv=Me0|Pvp7+i0xkB+9^bP=MX)ff@a@Do!~O+1#cjq)pRDEg6MNR zGG0{>k4eb4-GVQh0n2O*c9t7hWJ_qc$yiir>06l1ia|dC#FhdwA%n|ko=Y%JMU$BVz#kyh{ zv5**pBa>(oQ-vSGYvH-@NVtK&Z-rD?@eARma8=kQOb|K)7aYPvJiQTQb{Kx!L+--2 z6y-0GUn<4k7&DCQ#xZEbJ#CVfS39J3QvZW4l~mH?^Kzm*82bH(+)VBw_mum|W90Sn zW%;x0RoW>Vm6u8$^-uMdnoXOhrD)^ymwJM6-6#k7+QFEfAYPpb-yV-R{TO{nBf|K! 
zUAQbH39^u1tS$Bv$H9Vki6_NZVv6WAMVV@ux|oKWR+vtk?wG!stmZ1_!RAEsIrDQf zu@tp*wXCFRAZy@ zz_77$Seq5l$xGPtSrk#&JzfxAh{6WEK6f04{UaOHG3ob82ecJaCf!LF*14TuC!vBE6L=xUsSl)7lA)O4k`vP4dnI!aRTaBxnr zWw29lYfuUHlrBi+5BlHe0;yi}j85p7nrrm9>+#wl$A6tMw;z=&EJ0rKP31#cugw zer7&k-fNyrxjWe4=|ppq_Ml60x7 zv{u%XSL%EClE?6zPt+ICx7*4!d5@e&DXkP#6#1dtS1F-p(lRu=VFsdI2ckWMWOkss z*q*I18X1vBEu%0-RnQo1{AaX6p5+igxM)_g}iMi=s`hE207^`s+#E&lVqe!gM8gFdH!g+E*1P#{-u zLGXOAYOq1DX|OMJX-sfW&^Qf}G-5A)m;_BIeGPCDuK*4fair-OjbHG46(*pl7=`CZd1Dpol6FFT9() zZN0_3@m{|t(wpod-YcHPp6;HKo}3<&hkDZ74xDp%8hKWDzIiHm$9NBWUwM|oJg?%=1uia@hKx__&`rN0QaG6sCleW|{} z{$c*t{`!HZfxf{QX`fV2_9&Z_GU_$8xprRDz_}9iPT*br^@_Sh|Dv4+6I-l})0%4q zHHY?(dKIkUp86kTIz!71q`G7rWNY{;a*YOsqNX0^8J3>bJhlV2lJ*VuGLFNJ($4M9 zAI@y9+OB!7ZLW(hpNqI(xOTdlx(+)BIU73jImISV^8J0(Z5!{lt@-08gT##SAC=^Y8`F1b_1R*M$f6|$Klm~ zYS*-_+Mn74tr6JND#$UHo~Tpfsxh49!w%994dP(EQf&z`6j^Ic#^pBKkPy zIF>k8Ip#SQJ3cvLoxeK|J3l##yMA|#aQ)_rbLDZlT<@GSon@R!j!TZ2j*gBIj{Od& zbBOc2GsT(LRnb+z_0`$M`Q1^}G0A?|cFX$K^2+?ulmX_Rib&`N-^V5!WAvfg0Clc% zMJ_Kdk&NK9V5MMmP!2qY7CaAJ3w#Xt0{(y=NDkZx><#n?6oNdL`>XnY_zw66`fB(J z_=@3Z;5+0iJG@K5qSQ0EkU2EpND>w zh5zbc3^m3X`;7a>dn4WO859-VDyTP&!w!ioh^V8GYbZl~^q{Z+xGb3tnMcEOowdf< zD%$Gf)!f$2HU&EM!Isb7)4tfg+5XY)ag=s+a!hiJarAZ6b=VxQ?33*^?Lm0CtF{xi z7dDsucl%a*n!N_DR&+FVd)eT z>pSdQ?fcU=)tBgd;8T1J{D=Hef#m^vaB@(PhDwj5lJa!Fx_>#FeH|<{FkH z%YJJq+dkV{TdK`uPqsa@t+kD}HM0%0O|qS`<+AUy$2h7%K8cQXj+KtF4zJ^iJ<(p? zF4-R2j@n+@y!Q6?d5~p3M}Nl^$79D`$3@2?M?uGI`!QPp%&Cy2m3e?^qj(Qou_hfy zo+0v2WM_u3jg7iJOELa1Qi3_$1jtFiF-o;Up-1-DCsa!@EVy|AOZV=X+=7e`fRH|J>QNarT!a_4^MHs^fjBIi1I zxh$~0a;^rh&aOQ!#bvnsuE(xBt{bjZuEs9GHQ8CxdDfBOm~7u|`vpeOz*5^>&s0RT z3vUSW9PFKu431Yso25ppN0d(B%! z3ycaB4a5h&`S<(h`KS74!OmX#;{)9T>jF0exr0-K7lNt5BG9BuQWklf{8erMEo!Fz zt+v&!Aa<*%kJb-^Y5${t)nDl!arWyGMk`~X@!ZITeH4vQGoFvgCL1a-$B?}phKl1$ z`cZJ2qRrIu-LlGRh1P6=jlBjNB=*+!Gxk`=IEUsK=6vQH;7W3pbKiAGdOV)X?gj4b z?vt*CuI50ePOh`AH?F+y9qugfbQ3*2J+(X*&og&b_fXeKXR0I3{?fL_TGaBxv|l_c zG@~uaAl{d)GM?*p{Wop9npr)lG*(o3s$5Ps$rq)mQZ-2q?hFnJ#s?1s+60;hbpJER zq8SkTx$j@!W?wU39$#T!7vDbLS>I(}im!!#vj2+zD^O`j;C>(nnACs4-jWrlEy-n- zL5O`%E2osNV4HbVk4n@6>R|N&P|B(K)%w~7&8#;BR%hs|jl78TB-R0WnA)hq{*FrL z3gFRD({l4s%T;Tltqqdb(;cafY|awS{LXsL>CPL@pUwo=W>t8)s5#Ybkj@YJxx5DXg$L3@>6z47x)y8}6oM}U?*gwez@^P0Uf#slrI@<6&xR29()t5F1?h- z$}i=dN*mbODOJE1do#S}y;HoMyz{&Rz5Tt0XSnC1`=EQhE5k9s-q{vm zt!Ta?Ruy#YTdd5pu-wLEy_z1-`f0mWL4A!GB!mC0MNHa39t+gY3rw=egQU$tSMV}) z=x=a?%g~(5zHh!{907Q-5&p$EF8fOaW(H0Lo&_w>x8A`skc>n6UHU0iMpV07{v-$G ztl*OE!H|9_QCN?%$XdNtOKH`#%Gw-o%6wo;L-jfOWxa**)+hiCDhU)iMH1-#=cVFmC19u*1 z$Vm4|cV*8JPbO~#Z$+=@-R0@!+3mjS3Id^uIQuy!+K1a_Sl3uqn5&xXVls9Sr6PJI z>?`=&P3EGa=!D>Mo+!5#($QM}V z{|Wv*75=`3ud%PSuf1=J?fh(fEjVhd?{J_O>Qe60XJ%> zEKqhSmlQ#*rcQ(IELEqg!_}$kDpgWjfje%{A`s7RF?wMa*jVI__K>=i3aXIBlw!JV zUSsKDwb_Q+hS_%77TbF`zB!T{ilc+Gk87fP8*HhqyP~_6yS7_&Uw2XWFi5DR+vjTO ze&kN@)b+SMf~UFXhUc>9l4q)Cn5VpFjJtqqsw3Uj$(m*!YpN_Z64ue1WGatmMU4;o zQvIzqT&=BK$H;d9lb=Wxq`Xp9sUFT>gExRYeS)=dG(`lPJ@_H83%Fc55cJRVXTY-` z@G)N{e@CFvT>mfsK-kooz>>hWz|TM*9v5pC=?o*6fcVa5{U4ckoC3sp@f%SlU1A6lKZ z;kMqkBKFpf+s?PZ;WF-jU9Viv;f+qZdbo8KM*FDH=mb!o!%zi z>)wEOx%ZmqtNXj_xO2Z_jlH(bWKA+hn=6>gh}nd?L@}=F-SiWB5o3xTrBU?^@{}i( zgYxs>)WFn0dSH6+68ug#>Aloho-enT4RD!Wl0*6xY%Qfp)1X-ecPXFKN-XA4(n_XRg`x5hVxovBWjYl745 z{LOL1KG1&LR>@Y+`q5n3^pCI}Ik%u78@#`C=_(Rg8K z3{^LFnI}{HvV=9&GZ_`hX47_YHgcb}NdaM_`Lk_=%gj=ryCaFp-thm$1SZZz1gAX~=gQr;MJwvnZR+)9gGmD^A7;3(RB0%U~tl_%`8+ zSjqG|cAu{mP7_(!WtuIdlGDamQbfFHT7m*l=*L>^kYh%NREh zbsyGi810C_7Qxe2*BcoX;BQSvhBjVLHOk=L(ME#4O#fZ~%lM02)ShdPwbt4!Hr|Nk ztMwe%9eYxpYa~*_*RecC8|;)hpigJH$avb&SjUF57JLY653gyWPXz4p)6B-dygQpf z&tl(1eXMArRJlGobKG|<$Fc2}~|RGQmn 
zwnlU05w-W^qN%l+nojGZ_z+YRBCy-~rpXXeSQheuRi*Pu20d!B(=SGXT9JPuSA;@h zC(?>N)-tm{*;OHvIR|^FuhHYQ1a?Z8E5-@y*%Eahto0_(C;TNC)S+CKAL~}}j)-b6 zcd8fF2!5HZ6^06#*$4fpw$sQVb~k0h6E+*Ob=_!*n%)Z2Kpw5v(5o3;gnJf2EUvlL znOX%>(-LW8T0Ny4D{i<TTuz^csI@juC2+&b+#e-)NX-o64EC8YlJn zER~lsJ>YfBc6vgs3!W&LC*zk!KWL2JSl7ksredTWDgh0RyQHO1(^5jHrVe7qd0F#M zaXzVldg2Z}2OViDB9;O*2-qIPF^?EE9P4Jwi{R{)yH>D#+9a}cD zDtcK_rLn3X<-lI7WMR2&o={x9C#gyyVY};=b-mFo zutB;@7no04I&&A>sdZK#QIDmX)k=JFNi|9>EM{?L6TR9*Wr_Z;xrnttEM>EFn~pM1 zG@rL+)2-@pEgHIWz*56;P@g9qREltqW4HaOsh9dJy@|ZdQaN&^W0Jl%I8iCc&swV5 zme5zmBqOu-2k{|CmtZYz910#%PS9u8H?{@*RiSfS7f=3IMm;S0iJ+;ms&W`GZ;_)wKsR}5>H8e0)Hw8_*K(R$8NDtaC30B z^rvveZnphR>#2>TQ&NBoH#fJ=BZv5YZN5K0uVH^+9YRiE18}5K(Y(T(NBn8DB{j81 z`X97}>AdBvQozVZhVo;k&Eizk2>h<5zuHI7CycS=uzQUPic6h_Us%^RuQK1}CAE&q z6>$yu%9fL3hKp3BN7x2BPHRg3GsV(KVVlxS&q|)t58?@$L1*zF%6+{SxrQB*N%RM= zsOQuEHMUyPMC{U4=1U3e6WL;EEtECZXpNPTY_!-_SVoVsXQ+R-(|@3b-gC?XdcA{K08ewP3P@R$t;U51;zwzDJmyq44M-QVS%gJ|m)lBTP&PXSrk621* zYNQy&u{Y$ZX&hDbj{Fh)Ty|PQ>@T)QmEEPUF;>yp)3A=Ur(O3BG&;`B#zb$&o>ywFm1gVVOU8Bf%wu84PS5X;i zj;hsCGE=B06ae2mWbEbnQ5kP2R7Qn44pHP+;R342pHSz!%3F{c@Dy&^lRAVVWTP>i zm0_oNB|`Z}<2-aElQG6WB6)(r~qux;rHN>90 zwy_esWM8mO^aUBqW7$C?2j9U~k!9FP@`k=+Em&y3Yi97eP$fMd;j}t@+dJNeCLw~m zN!kfLOhKWB7$w#sZFwuwoYoUR(0QbpOHj)mjT%^W8Xzy|8ZoP7qIop^hF=q? zWtmD)QG1oBlvs4X%`(JvQ9>BW&n?ZYw~?pKZ>?z=X1-{)S-m!st)ulk^0y&K^)%A!#KFjx{$Y4#s&DT4lSH_Wy#u+=(qta>mu=L+E za-)vX%~#KN)Azm_M4cB6|EvgG1rk#31HPqv33-yxH44Kx8 zs9X0}hTy*fa3h{SqU2K()E9_!?kY$C>ZkPykoPS7#>3zjXm>Fv<~8Loy)i{&X7Q*i zPe;x8gjKZv<{0fbgNk2o$9(59*GKmj&pmg0MDN4hqGzN>M^u{Vbw!Mdm>F3q)3|6` zbkV4l-UQbhYmDg#c457R-lg*5#_#Gbd83pbxSY}R*N>ECDQfDfUnkSDr#sXAX@8{a z=^HY#`s(>J2Omg7l;T=YKgU|4c2J3)!XA?~^qg=~{1QShF+pr0y2UoaIdX-)Hrg6V zh;p}}epN=Tt{jo8$p@rbQblQ*v;%Q!W<+cUP)BSVyzHx)@hGEMV5gKS?~_Z*1LV2N zXyho@Ni=vS5H00|L@sHEw6gjvtSSv=Gzvde$mhV$sV-t zMXjQm4Y8YTH1cYbY(;EwHimsl>DK?OBdtFybu6>ZZ%hSEv&HX1E$Cz__MJV$j>6s0 z-f!4r&{g}4INPi)QHCM%_RBV8S(nQ>knz}$s^17{x^xVY{3|I&?uEMPOF2z0pfpfc zAx0mKda_>?u~RES8>$`C-fDNX>zV|A`CV^gtTi?mbB%RIHvE<*3wAZ;z<$vr)FqN= zJK=y(LPYAq9BaveeCc=lE5|bDeAgED5zj1dwuo~Py(1e${g&xmCSRsknQmqJo~cgs z>FDT~W-*ImJh4w>pT=#9FOvCg{L`4{Ubp?8*p6Ps4(#7p7VU4Taqx98PO2A}o_->A zaB`*OMJXG9j!c`GzBD7h|5)HyaDdcC&aXUHYO1ZZO~!umT1+tyu_T)JnO>PDqCTH$ zS}jfx<_Xgf*(QqbX?Nb*P_;SgIAx67MJgHG;Cr67_t&jo*V6K)&r2)ttIp4Lze;B; z_xB8p_gC@{4m6TVC|0$pdL1#mp_=ub`c3_@Udxz_9d6WUfxQ{|Sh}&sc&MlAw~U?m zz3@ENk`)BXuc9S|`a&I4Z914@EuF3Jtxat8?L8f%oV$?WD(HFSY3E&ptmiaux_4E? zlZdhrJ5UoY8qq#tc0`YewceASrpT#NPd>K`71LqP%npa`ta+{I5+0M;{3|lWepF=r z>KXO9qGMOecWHqXBOM9$K%Td?lpvp$TPo$$mcYC5*ljXWpNV|I2P2AIHmt}mWHFZM zjkK!j8feKmWt$=>Po&(~!x1O>g5LsW)JvlS1p>tbtpalb{{(gi3rgLkJyKrTj2$m$ zmH(7>>hIb-eHtoyHSn9VlptoesSh;S@LE5nPpIBQ(UcRZ)Ci?rERCU zfvZGimXu)9 z*re#>k}0!N#CJ?n0Qyxn@> zV>_=S;%tOBG9q$IM7;>r`^GC{hd^a-M~}~K@kD#tqq;KRJT(2X+RkqzV;;a^fJ(P<1EVgCbqPXnwSL3^6ew4XcmiAfBW!B>!$NbFnI#TtN zb2hVxw3^}2>MN1bU;eQfv(jgzuTHO?PSQI3+?)D7_0OM`f3-`ilu^+?Gk8U=tDQ8i z^Gw)PcnCWR_Xu;u^I})zv|6FI{2h333rzS6dBb$t-H)i=TjBIwTRBfb_VSV194uZM$Sw9AjOn?t0!v5sxBz zM9Pu#qAo<;kD3-$JgQ<;{ip^}MWbFuzKDDnxh=9;M-zCTr7tR!GxUoCy4k%wJItz#E<;^rjlP?yP%U$hp(Z(GgSDV~gsZ4c~%9*ljU zSCO?HDrPc0GkL&+mf7MRU!5)8w>&4kpCZ~uxuZA4G>e@YyC$|)T)FsD@onSn@rUD5 z;yT2Si?1CwDSAa@eNO}DS?gSJ65poPl5N2`{@1=M83`Hrfdb3Y^Q4bXpPKGT|C}}? 
z{e8v@|FPf;`INd=Cp-t@&+XXP{2cq6XJE&(ARZGs3bDdb`~o@#TzCwF)oOR-0zt2T zb;h6R&C>!uAN)M>D>*G8BQ@hcpC@1nJ_dq*lyb@2hLID$8L5hWuBWkU z692agGQ`J?zD9fS#pk+&D&io<*#_+1?uTDhl%-jP%woK0syW%x-4?LlaOUtl@P3V0 z6}dF3Z>FD_6!^jcnZ{)DL~nx~JdNHJeLp%|Oy-z}(fOlWWZDwtipm~Qz!hOTU}`S( z>AYX}--De`^xyK&MulclKn`w@8>>&X0!BTi@S}8{Fi1Fn z-z)CI|HPM>uVrzpF|ss+^(NXrrHVXP`VlM{=V{+dJ;GJz z80oQ8LcT74Rs>jjL%opk2)U&bd^Y}f0tI{iL%Y{=@`=d8-ZTz@OIO7IM>lTI*Ex3sPrUb|w^!uzs0W#fM6+mb?8n&BaWmtx#8-%S$Nw4E zInEuI6x%6!p?A4swG^$O`e-N{1-{zl3pTXq;ElBp;T6O%D=GcGvy7+7qu+1&J&FKY#jeZj$*I* z01|~bDm!~+*o@~|Nv*5aR$pnXVs){z^&xQyt;O1=rsh7V-qg4KXaC|<+#|g`A`VA- zGmVaJ5fg~9#lDLXV_U?|h^-N8iS@J_jokCTXA`AZhznZ1; zl-sG9e>O-}Qy2d-r5{Xhm{Hc}^0my^m%)8;!3)wDrK~nt|H^E_Pt!SbCG!r`3oy;r zrZ>V5WS>(|{m*T5*I%hQm7l@o{;s}2#*_4^zp|t{Q%F^>4l&$J&WF7_>slf8RSm|H*@0AA%KmM%#CjYMU~(ZQFKxYumQn-r8FXU^GJ>s#1suSGV3vh4C1MKtvacYl|!>!xcj5vWVq zWAfNE#nQ=k-u}vQ#cQ$m3ZLJ;+Y+@(a?*cu(u>J5rFfsRTR_{?Yf=ZMK9YKN>dmPu zq;49}Gu76Vzf%lO@grGqlKH-m9gnQVP2LIjV%|m^`ja$tRLHK7y&(}Hnf}ZTdmmmp z@>bNn==((ft&53|9qbH_&xoa)%CyZ=0llI%(V8#4j(a8b`ebitBO=ZRkbMB5mli=lE zgE!4&J8N&?7*6c+P_K?&;~bUkq1L6A!sY{>itgZqR`Kba&136FUykS;UM1{$=-uCC zf35m8<@fB6(V@%#3<;ke`6Bvz?CrSc&ZqIApwxC(4)nDq>{vSMnPV!Crzwj4+1ZH4 zFJum54_juB+dUR+uYm`5hxxAMgS9j3*!R31c&GNQ>zlIKd9 zKGkvPID0_NfTjWE0~Vy(kg{Kjk;&61E9BoOQ97UF_6?Suo*@ZSXdFb!`D>}PE&tm!P- zOoQCHTm|s>JK`dv8%0%(I2HEqpKYPjLjL=mI%IIj+0Zs&m%>X#_KtcKy*Tzy+-qk; zte0j)SPgZhB?hJ~IT&tu`r+wynrgAn=O!pU3BS`;_DfG8E5sXKj9H$&?TeKOQYADX&y@*as| zywcjfxCg{`3(p=B`m@{5RX@A_EcP?+ug)R8{{)96il`F#A5j{8qngI_ic5~=>UK{w zcd$3}@nQY8P-6cipIFacpJ;+_I`66W&enV8G^TFuGeo2fa&|?tpC3CoW@c0@5sPu* zZ^M@VDH2*bWOB&V(D=~U(7AuIhv$xb88s*7d+avnA+xbID$BtE^o9}y)!!(n%vvNyKxMYkGj$!RWV zO6NJ?N|CV0IUaw-tk@*6r(@Oxho`8Y@jD*=uUi`C*n)wC8B(WMYoNq5W6$>0P#BWi0a+#TtuL&SSQYfoEK`yu;tdw+-1F$Z5%XYX>%Z@BMHzluZ=6iqyw zh^2LjZu({N8{xax=Y;nuud-fe92JNpoonxG?`qF&kFW*Ivll$dojpe(aT~s}WbiJA`Kme-PF# z+!?VcYF6~Vm|}6U_`;fD>*ZlR@K8cCydL8{8_Z3tv8<-M>~+2Rvd^T5qk#R1b+9GM zRKc^GwZ|Yvqb$)>&*L)3t%@xkTRJ9V^qk1d5pBXl!!Cp^2(uCUI6HE2bc@*VxS8>p zT;tre(9D05Luoe4fgRmx$!ZiqkK>7$31BC4UhudPyH+q2lnpo=&b^#W(rm@zx`dGu+7u3u0oJ?1ZO#Ysv?jo+12{q&A6GOBoZZh#3<>Io$ z<&EO@aP|B*M=QSQ%9QP69bS1Ip# z-d;Yvd{Ps;ddWA5UlYF-#795#TR`0Sd*4&OhkQ5t*7vRJJB&!+mcIRbFZ!mV_m7Dm z-{)J~=a<(qN1%R2CT3Q2_Nb+iGl>Cu99c8!W>nMY zFVQoJ{BgwAiXBH>(Yx5xai`;YI6pc!#cxa)<0?vyiSg{p_9Ev}HFk6UU{00KHyG{7 zi=FX9*$q{ROeF#A;fcl@zMGwkb8PXp5%y?%bH@V5dq*K8{v!6E9AP)lDzD34!SL{L zuf|>v9D^Jg97pYG?Q?DAY#D7H?42Lh_ttCHS=L0(mFeTwbLV4B|kGMkcw|J4AD=pdSau1NKR*%MG_@v~RaHuywU{x3#ksvE}D^ zsqL0+zpbaOw{4|u8?Oy*Ep07`C75NKVjEzqV9RCuYHepti*5JY{FZ(6jXXiD5w*j& zvmI<~n$VSt!JB=~C}$3&&w=>l>^5qZFeG6adcuBo zYCR{O?{I>ztAMM#s|UNBy12T#_Oi?9qw9q$il~j1c*D=S6R~Go^#zuaA48c|etAk_ z^`3=4qfFAIN~2FjnX;N)WXsA$o~-gjM>HU7Odcc0PG@%NZe_hJsU!AAX3|%uG}AyFJGvKbbwMH;H&SPxjL$WMfdy&W-q|-ox9)xXVPwcQ@?{Vh=)J zasa5hMoaj#H#rgpv%Bs#Jz7J*X5!DN&Q9M36HKClfmid zqCSMVf_0VM$2-b9rFSMr4tr1A0n1&_GFN_AG1rrX*9je64ct+zpH5}}`c`vW%X`)i zAG14m9eeBBuotz3r#W<2&HTYsj1^dx+nIq}jF*1W3U|!mjIc6sNwdWJ_Ihjm$KRB<2`5bGfS&6|Cm# zMlOtum{u#OJ1%5P+1jdZX5Hvli0BP@M{P9q?{n`|0BN@+Nw_9Ti{38!&w*M2xa#Tg=(&m zLZm3T5yLF4BpNB5=`9lSF0|htx?Vt@d__6zQ0t=Qx_!Sfz6ri77)VkXBi&MQa+PHFD24}KfUm?>MLgQz&w zjpV>mguLlOn9 z3^J_)gVInJ<1_oxOOYdR9jLj7oO!3Xc53oijHB(!KCzRrSOnf2~zP@cv5-qNfA?gE!p4z1^atp7iM%`u7}aJ_ zXCEWa%2h@x6j+`?TB zf+cZK*+4QDtN}%?A}co&nXA_I@3CZ>Vxp;-0JXIL| zZD15$c3 zzt(3wij&=;9_={@wiO~1ZhtZ?j6|AbGCw1OPBF5FeSiX=u+JemGO!2!lO4?WeB??6 zB%$)iC{uL}uuJts0=Z5g|Leqk(?ca*oKXh4>_tybf|_gvVC_1NOe^of!QHfBEXO!T zq$`AFq!LTnh&L#PgL1w0;RaZbjqh zO+L61{5lzoyTh-caOex>Zv`4|1M>Em`G36IAMQ=dwF;6u%HjD+9?#|ESf0e1*dpG0 
zh`rJO4}oLK7&nAGQ#r{a;sbqubEiOaDa=uhyL1EB%793Xj7(HL!Pe!V*jKEiXy*GZ z+R%3NpDM`t{ItCR$h`?laKcsDk>qhulrmRW0tu2L(R09ggBhW{NP;)eS{!+2o`J;s z!I6Dj_X+(;1Rtt~hBBJpGe%`Pc{19QWos7g-N*S(M#M~IrvuQf^uPwRG%NW-i|}p$ zd_R_69VN=`GFHt|IBP9!+`(~_$9Ap}%${mx(p2uigJcm=R^ALyS!HysHsqJ;0%8}1 znrx<{oHd&fZwC&~1@E3PSN`BjF8D|_N^;YVKlJ1Rp3Bq5PVJ}UW>q!4UG(WOe4#oB z$)Mqy(9wKo;1c$gD6}pAPfcDfRrvEohE{;~&#kE{TG>U-;-(Q z2rE2?pwXLPdJtR|2uh5Hj)%aXjk3DN2hEI z0%m1iZZQ&*LAuV2*Z}0~3^cZ5wEYU-pP;8pdEA9Ae|btUDl_4p@o+$A+L?&_@*nb%Iu)c zv*<;8FfTpQ)SD6gM7(!#kfS~LA&zSRZe-+_DDeCnSqqh|EP!4XhMRI5(%p;S?lRg> z$Oj{y%Zya*%q$M%KE3F3WA3digAb6UH@HfSrwUgd#u%z@)P5w;DQd&~hh*A^++NIF zHD|{1Az77qKY;lzM2o6`06l0`bF#?|LT>d!ewCyJex{qy(ndxu2uirhm?T5@Qr7y6 zMrOA^$ZqkmsxO9tba7y{+4LMeK{@R{@l7Z+ZD*X6-?|3(D$h6<=DI$_>TO1DcSJuZ zi#)9j7mk9{4|A=b^z{w$`~q~hnc3J+-(E02%B&X8tXmkdU(nwR=wLtepxTB-7}Xw( z{75K4+3*TN3rW!h{ouH|*c9W@e6OHkJOrQ5GBYEY`#NCkAMW*;QB$q7Oi13AoLdV@ zQk8&=jCD5Tzee*kwCX|+jYJDQj+W4c>n5iie#}TlC`VZjQ^8k1;Em&GR6Vg5vvXv{ zPO5=+-X9(D7^rrayM1MpQbRo@psTJ>$SUm49n9ArW@sXV&Z zk(IUr<8~j6Q3X8J?r6(%Z){Oji_*BU2F_^vFsuIjw*x#s2p*bDRJ1wP4|Ftk$VKCW3nV`1UC@A4|L4XtJ4@%PO=#Ke9q~ z+{!Z6eq(5&)uUdOJuMhxU! z@}JECQHOE=Wn5`5b2|nrv>`_+;u3RVLuX_}%OfLtb5;`=w*0Vhn6(=BgI9f%{$nS>GGfpd4+gQ;~{U zjxl&%?KujfKVS?p(C6yVNl$oY0CP5j%xS|7nffohzYw{39xA(m&2bP}yC0k5Ec$F> z@`e}2H&K$+pNwWVI>=ULza~_c2b@d}HKt@1vO-l6wBZLxcAQa|OAZayk2%d8yobxI z%&iX@23_>|4jR%KT78{gzBA^TXmtr@CXc~ye&_&h_~a>6_kkk@{KyMmSAquW8Ys~n zRNVsgZ3DmefLYt<$tY&5wlTZa83AS9R5eY>jt)qYqRezSym=n!I*MZ?QcpD)XBe|K z4@#WR3{2yj0nCE3M0Y^nR;|Vfcu^W5aZ8YSuMFDg2+qBXjK~PDR^*=Tjgd$SZro<1 zcOd~cqg}6~-;bc}htSI_difAkQ3W_*OE_2%N3S!2W_Bp+Ht4Yxv{$}Bwdb)s?o#UEsX~8%8aEf#Kf@7Jew_?Jcc3|DrHmG?M^miBfQ^hsq2>eMd^v}$c z5H6JezlIlH^ZE_m`NnT{xKcTZvolZGq0BePIc10K#L<|xcHurVko)@$j(iU-K4%2p z&<6Z$hLpCZy zc+l0ip$kZ|Hl;l+3=3%qNOuBRzm3N&xZ^cDdaQUea*|uJ1U@NU+mA;C7SkK7bY7Nq*psq!R zZn+3;ZwgvQL$XG;!AsU0O7CsxfWpr4yz7C7r~?*tarD&$w9glK06(C?TJUpc;~({y z(RT1eO|0vBd~Swb+ybq%HFI4XdddpgrRB^3yb4*N*QR{m2~^%qYtKS=2SKf4Q2H)I z4ocg53=OGHT~_Evl@zMbd)fFk!RbP^AtsAq)5vOa99@@ajEy z@CEquiJAHZ<;x!t3|B2D*XJ-ss5^IP#yB;lUmcxZ(kvEa}jS z(J4u8X;oS)O4ee{%51KsN^}R|)8)v)zKmrya(lkS$~|f1*bk!W(MGJ()8w{(;r3!I zGc&W|`}T}=1Mb<2yO+YI&&N@K>-J#wCUL)QaQP8NHwd5I9Ok(IHiUe%s^|8etaQpt zXU3=D$M-gDku=x=*|8UDK@nA$y;|Um?31VDgL;ZZ{eeiyL?A?RdQlJUpaSPr=ggwm zC)sIrGG0q^)Q1|nGb$Zvi8B3@bf*jpnbWPvwDi)@7)oCm+}N3-9D zcC(LMwac(1WpB1d-)IfJ^fPQw)p+cWrlOq6E#cQf@M%igAzeF^K3_n>oq{^|K@nF$ z(hzje)KHmfyOn^`TO0DM8i-ehPr|V&Trm*28;rJh6{OqGoo5?Z)D4-RjTVy}$*|tP zAW@z}ldsWI)6)AgXj_B$_aZdJAT%;%?Og#P&O%D=0K3kChj;LPeIs7uCpL|$!aPC4 z-ov4aPEFymwD8wQuB-ZR@8BpeH08v`wc9XCBhbNA({dC@JBR1-=yii>@e2F`C&1KD zJoFZFr21hYg@dvo-03`5--JJC3jgkfHmK^cb2)ztXn2)gel`&L6&})C#BzgBs*fH* z??f9m#>mDsQ@|BP=ucz%)`nvoS5_wF`MgtY#g)9%F_GCB&)g2ay$O=7 z7`&H*Q8Clfq_jN~S1ktOSLDj=Xv0A0YAt+s8hTYlquWr)8En8HsCuh$H(7!UIcF8|H^mbKd4dJtR z{4p}?FfEyiCMrMtP~){12-_L0ZU6|WdYh_)SPY3>1c{p-MEAj`?@M;UoJa`OTO7s} zcG0)*@O5tRq81d@0qLiTuFYwkaIzW?Rq0J@aDfFp_z5DsLn6MRCHo9La}J&f`TbQ> zuPpMX1oGk^MrIK0TZHsofu!Bb=qS7O3?yGg+EsxzWJKZw(4J(-o5aW$7gZ#_lc(~j z`xz20nByYH5pp#hYH$6+_W|lF&ku5HWUtzITfCb&Zgjosmxcu*_CsyWKG`E{ZFv4!}#s|2<>#fU&;LaT9)h z&bQZjearP@(HOK|krVDK0}83~cM{}HIR8FRZ&kZEknxeH;T3wCjqxiCzqDXPDw!w8;|Jaak z@;*zh{;lS@hx_lR9m@>)zrh%@i=2CnvtRHpWzavt)ztUPM*n1snCXptKcWDkPBRYG z0I$X;p?*W&;KNwW`R1YAgme)6G4a}$dUp0Kqi79`{C@pNZk`itOv-fXUM`Q_&=WF zZT^Z6+i6O~3a~1er{FaaYc46N&K7RSlE--Bo+9NgVWGa|`6P1XIuB|5AJEL-pz*&m zKHUH@l#6&Wh%*r6mG@{i=(U}DpXRRDK#~1e8#}So46@x_fy0E`T_@z?%b{ zvxXK2f-cKImKBW2Vy?2BevIOZeUKKNc-IeoViWy&k6ctewh-iR7>~!?Cm4iN_Twi! 
zKLO3ok~#c{fm4$iyOI1`m8yp`2R)dHPRvwS#ns55^TPtw6CDN=hGf@iN zLO#C$L+8n9*r>9H>l#{h4d$s7)T?^P={YA2bnJ(n`3Jgv2(3sb-DP;LR8v4Y`Vm7% z-wrp=0}IE)<^Q5l2cg^ShMsOiUGW@tc=0QA{S%IooqGccdjg#)KmKY1F`A=URzeFd z1{%qpE(e0tM2F0S2ALH*!Oi+%6uRkKYDxv8^RzvrUho{|! zl^BCe`H01HkvTjDdK?EuZZM;=IrbQjqs;Xtjv$a@H;BCvY+Mh@&&8)OiFsB91KAKW zX~lVP<0~3bE)cCYJl_;l7zW=?fZ~_niCK?zd;z|Did}sLUU|g(hrByw_;Am_(MJrP zQf-q!R*xp}@9}7SyWtJ_oEO4lLlBz~) zC7!FI^GUl`4bqfIVG^q{D)~6CJsdL_{ultywBi4%z&sXBWg7R}giaw}mwe1eL5f3Y zGkee~_QLf?>Bm)2-~#JSr{Mmb=n+CM)xulNS*Or59`m>kx9;Ow>-b)EoOc;LJY#4g zssONybJxSmYe0l`Xd?6J;}+xIYq{oO?k0^%H8f^}9e=Z`bO+U2F|!4baVhv;T0@u2&;K(s&Z-URK)XqcS1}V0X_m!7k&X2&7J!3x`4Zh;x2Kg%gobl?sb;gJIbi8XKYXK+d;;E z3)p>_89M~Buj0y+7$Mnls#nq+3Le5}j7Rd!LJ~~k%6&jw$pqErnZRA<(Ny^#??|7AP|9vvv;`Y#J|ocA z!1DIYN&$G+1^PaLOC_%_A+4XGi9AIod5k7<8qQQ@kTtv>0C6wy%}&GrF~QJ9n}Vke z!JvxxVT4nC(Q_JsPeQ1kJUW9&oef)h1{l8CK=f1S1CP;O@1Z|P!;z=sH5i@|K9YQp zm7EOyCI*>s0li`q_uR|fgE&s2<6P#uL*T^;aO5z0*>yZ5iubv|=UwQTbI=A=6Sg@T zTzL?*04SdeZLv7-GJ^S~@K~r8eRVK%5E|JO`nQ&2J4XoP(eX>nf=w;vL2EPe@9h@Jrgj>jLW`d8{r1^Q4E??_?8r=N@emu1we za^3vM3L$t79sx)XHBh3gjK ztW?M#kK1n4)Qbjd{d{#x&w_rGHjM%wCP`@$rfh$7_+H4y~^C) z=NnaST8)HJm37swnE?fD<}A@bFthsu4Luf%$N?thLQ{5t^{Jpwe*>dZ!H24HAB`u( zjV<;S?)im`jOH+5#jA$XeR`*wHZPEf-#ITOdTJUFG9O1-v?Z+-sNT;|5MclaGYZ^U z4Cc%O_hl!~f@7zGKeEpUf)cI4kPb+pT5xZ6+T=@n649P~$fHt*_FSKH`hYb(IlnWg zpqg4u(XHFV`MvloU9J~*7=}ce%atc_{i#TSt=xAjn06fW+X7-}C04Z{XVQv8$fix8 z*)e`ShIHA_V+Ar~8M5U*+VNW=UXq{-r^Az-gY}vU#D$cn(qSDQRjK39oSKgH$*ED1 z?=zd7hA#RP>&jF2^H|+hF_-TC)PooW^LaWaQR?;i~$51Ksu$y892>c#Z!&V-q3qg+4m~;-BmE{F#k9X4=e#?cR_Pk3{+F4v2&pC z6RvrkcRRV_0)vu=z*X(gT}wghZNV?q1TVp-jM%mTtfXWI1G4fdH_z$u_~yXk_4b5O z#U<7q#|ld}@TWEi&<6hNjgP(`|F6!G*Wl+|pn#$$+JNcp;M6wg$wk41++asqu9^)L zD+G#EL9Z+g78T=NWjuG*$K!Ifm- z)EmRkdxvZd_lSu4L)F9=yoP~ZAJJ8#z@Zp-YH+ijfsfMv*Dz8u;qC6Uurb#x0BRLO z_LK$FnsL8w^s5c6%ENf11*h_JR08Y9!R3-+M?mU}%*YwW_blI>0}YO#r{2MuxXTRe z2EFIem-dWqd;T{Tzrr>{3T@}#TH`v#T&iy3HUsOg!r||riU_pb-&o%XSl4Wk;7A4S zbcd#va7;y;mZl{uQZIC-c2MdBbbx`J)ea7=gRY~Res(A=8Vrdf4n*EU`Eyi< zErJ#Na1iDXI(B0CJsrN1yijIQ^r3tpiVtgmQOHpn7?lJ>@Wl(Hip0`dE1)an;aorT zbO&qB0sJQm+ErTq=Z7{Q3iAE{WrbTYV3d$ol{t4onZv-@=Egcg4JfT3Bc6$I&CZC| zWd>?8+9N>!w#-R8JX9MP!IN;pEATG`k~|AonTC7XXtC-X`?8j#`eje?Y&-=m-=b+e zLwakq&yO?H(dN2_y*U(wk|jG5&g=!A_2cts^rdNx~+g2@;WUwWGr^iE)I#K-cs-NNd81P#a{QS?(Vbf9XdR@&~xztV!R z1sGi~^pQBm3Y#4>IGuUm7sJ4l|7?o#eQu1`b+*1_(} z1I-M;Z!o}ogX-8_EC(#REf&j8(=H^S4;p@UeAerrgwg1i!$4`RY@5lJvKt!B%~j8X zYU%jw_S6K^E}~t|1pl?ZJQ8`)oTDCxR+6ftLsvjQOpczM5=>G(#4k|h2j=nvqb-Rl zKdov@D}o~uYEgwl$&Sv9Uw6Y|=>~;LHywjS>QAe7g0eqYsr!tam^^?V7N{6NE59R-ZX4(iZ*%4+pHx`s6T?3xW(9U9f*8Rpq z|AU#&)%1QXuTyA;{DM{ay(w2v1eU76cY;T{(60KtufcU{@!T1CHUwHxy}V{ndO_$% zaV0;Y=HKL`_d&aLL(g7(n;y!KW-Jto;RDU9=B8@vsm7nQae0BObH5JUrv+{5%P7b< z|F`m_>VPW-Ei=?v?EgNvVo2{&h78UEol6G$;j=VxA19ouRn`#b%*;C{t$9Y99-#|p zh2t}A_`~R1@xI35-4x2-g7(freX6r4B)W_~bOo$A0m|$KyY@qQn?M~!f2w}!Y0kNg zPVvL&sUpYR^w<|JQ&q=&(0K;vKOKnioY7Cfc6Kl>@1SBQ-ao|~vGI>Fa6k=YAp<#L z_}5MV-hhs%255ULMQ1C(Sd4qQc;qEg$ZqR0}0DYka@nBt6e2!%qv!U>*I8+*o zsz?unrbjYMit`u{*RzS7t?lX!f(evgZ@dmPHS%pLO zSu=uU@}2Yqvj(F31#-?##`=okh09)xoFXH<`sT0ew={iH|djnlG)sC`xD<|NNK#k49z=;&~MM@id-QPgZeB z^6fMNWfpL)$$USRtLo0Oe}b@1_R>4*HbK&YKrfrFI{kqU%|z>J z&aXjbXX(=h`nI0C9>6BJ&3GN*{&%3Wr_e^YaVJIR=w4YTeye11KHvA+3kz2SRSs>3eP6``iEHeGW5TrfzXPRYXl0X1kqD4 zyD$Kt8az3D7KF;ilr09VLA#YqK=BN*2GMwj0XcYbUwY~HtPhQsp*D2E%v)h!$ zbJaZ-j+)QxdOY&E*tnLkzc7>}IaH1J)wrH2!wzO1x?xGJXU-4fNjeABL>N&&X{c;p z-P{VhrxFo9>7js>SVxNWDT)o00=b!)s(_jJB?)on5o8$7j@{IPShwofNx94~v1ksN zGE-BwsCf_a_=G7FI}d(BUAv%-lf3SOE*BX7oH1BEN6-%z(1u5NejZ|3y{8>t@Bn>g 
z<^2v;(q&^+;VRy$6a4cSy2N3wq8NjfNMHHpCPL-Ck#&kU)w*E`Xr>UjYeP~cg4VPq z&6Zj?UtFI8E|kTqsIPV4&0;(lsvBPp)KbKc;$${r1swuYAH($taPd`m^|4_oC*^9H zkerG*O%I;ug})-1-9KRW-`x$0O?!y8DBbG=66dU86DcR{bcsPg zeXDBe*_f$va71C=$1rlb!XK{in_j%(+FDCF2rbWHCW84!E5;Fw&l50Ha%C%;?2Wk>N+Y^z6BpHlT|YS%P}uU8PMP|`?c!A+Y4AtIR|cpQq{x+Mtr4yM^fquJ#GLvOI1v_0W?(FVZ*K_iDC6Eo}tS7|$*fNyN`CcxYq|DRx<8gUY3bRTFzAs z^(1hL32yeMB{`tXQnXC_BeH|X`S{!r%~ZydfN(gq3O)Q1oWXxj{sQ8>0V8k1gNk$w=M15uZ2RZ%=t+8hk>79AckP~(@BA(F_>S4~0--F- zVH$8C()dm>1&a51#qU4B(63PKCp55++~XN`$2aImaR{GjgS75j=&47+?0IO6YZ*^P zncU&r@0_Vf96OjD57j8f?GqFdMXQrS36+`89vq6eS6t6xJcCnNi|EKK%d08B!hC&;o7NVnGED$7p7eC&~{z@EnzRP60c1?k4T*5mh@{92O#l;YC_^CI&ub_!oGzhwvF z5A!=R62GG+?lI~F2eIpLJA3Ch8^6CZzcGK|_(raH3spN3Sv>5icbcsh2igC^X>Ms~X;0nUqI{c^c{Mcf8oG)xm1i$&KYG}a9igerap**$rnHQJqNm+x>JQM` zj_^nRezaK0JI^H}WLLFj-D(T^%47Yf30S-`8LAf6Cg zF>v~n1-{74t8Ac8@T(-Ntd(Fea4ai9p^g&cF-i-}fP ziA3Ix&K=0}V$M}m!*X(AOhntBNM)$iWLi2&^wTMxkF&p?Y}-_zD_Btm30GgM8D0|t)bMF-eAdMeq^d_*=22Qy-r>1t%PoTfeY(c`kPaEF1zM? z)>&e$zbu1I%iP^OYs}@X%gK+E#*!Z{FJr!Bu40*vk1>WS3|{a#RS8`yCkG<(ODogiK~#|w^Y=3qGY3GZHlSHIxpKO7EAtbqj% z@Z@h5@f1{6cYqfr%U3Ys2aj7I#1&e*fjg*@`Z#j74+Z^(f`#qD@>;9}S>WFvU_X@s z@KfGEdTc`q4P?X>SKJ(3yB^-e*68DP(fq3!c19+oSRAu_59ufw9E$Y&g@pTrJbli$ zXYr++GdvMXk=e7MBt^B$_jQYN@8(Ja*(bY%j$;&w*O5vga)t-06Y7 zs9ng~4^#z>{tjG!kt->xa}yYF7+r5Qck{>MDr@LPSpJUU>wPkjKC^~fEjB+}9$OWoruj(n;2ff+`CGDU z{j^!_DeS53S?yWu1?{!$L+rEc>+BoIUH#GSO_rZbj;xM+j!KU5j@*tM)Gu%5Xy>Tq zsN<;UNaINBa5!uZN*UVU*kkt+;Xr^zerD z#2~1kCn(^=ZrO?+)`s=4#AKb>1^Vp=xa&JlN22t~2#Fy2h1>%NCa}t_fMqy2iDj%6TVVdvL~X>Q^Pi zrE=DA4s%YYRqM&qea5NE-KO}|97W@MlYuEP{(SrcDhfx%TM{zT^Lh!58HK^r^qkJf zTugYE@IB#Af}8r)ejFLd0@IXgaU)!F$i=>bTOgq-C4$j01_>S-HX zL9X58_&Y~sgGba-JIwD(sVh6!HOe*IHJNjFxsLJuGvv%mB+WDOavh->-A%F_WpWom zax_Cm^mli1cV;%*kUgj=>n-h&9aE8h$B2J05x0hR_t}6^{x{c_`v;@z4G43lQKr60d+L^ z*BB)AD5Up5Jgq^{TOgLQR(9ok9n5vRVpV8WvOZB14Op#|SF8kAPwoj}f6EUe4q1Dg zLdi4o+wH^JYD@G0WqKLOl}Cd>v!UTpJSOoxh+kT;j+&QV_;})|7aqzZiX)W~&8U^% z0H{3--`pvz>{aeDj8RwjFy!iRo?9b%TSEKgk>yQz-;kp*V>lZf-~{oHAz0+nw$qQjj?h(m&1npv^dz4|L|I3{gA(v-U!+Z|cT8qwcm=;{Y zrhm@apYa*!T8~-ny+~x_AvBN)%wK<6Bh8|rJGVQO%xll7oqLa&-U2`V$LAleNDh-b zvAZzs9f%(B5A(hhP3J6;nwO0^u`wfQn5#teJ2U-H!hFPg;=D5WngABh)oIeJkbRlCNzNAS&D-c6?;lkigv<6eUp!NI(CKx-L@PBb4q zYA;v2$?=GJQn`*8496{I;uU8+=ib5mM>S=RF*aBEbcq)I zBr-FO5x07%kjc1bglY@ngRKQ6YM*x=j^gwq1Nur1PZn0Rwa=*^^wt%Nb^!0&LJ{@R zY_t|$iX#vG&&-**_@*?z+&UcEG1iZ>M?sZK@H8An#%T}Bee|Dq=sU6?vtxg^#3mbq z^)(S2Y&+)FCvU5w(8?fpiedei<&g$is#QhVf?*um*M0|$>I^b&HJ0{b z>~+O)%5L9@U*!UiEl9Afe7gdUSqQ%b8J4E<4Jh8`1Df%BB=RrxFbjShFMM5E6aQ>{ zrzrL}hUKbk9!b!!Ke}_30bm-2feeL5Y3nt&-+zc|ORvoI?NnLfgV==Nlf+ z_b4M`!@Hm<7+9!s_YOn7wxCtj2<%yFY+RESnZwZ$ZRVHCKBD~6?fO}@bmol zHU2>PSrj9<$M7RAg{qce%gDo`cPkCw$x_Bdv4tP-u0`QFD2Y}!1}lCiy4n*w(AUtZ zk7N0t#y|atRgaH&td)gK8kE)Wl@}s@i2Sc)%wLDbxeT3i1$y;FbioeTt`*S&3lb?5 zgU9<{bSUk3%7%wZexz1>a|mB}JXy88(G#PI1zJsyv^ts?Dk+WiR@?A0G=NS*pfLG1 zs?d@_XrAi~8@iu)pZOyiZHU=q$wc1PhL+Km8I}o_ZWgDaoZyQlEuYD$c}PW2`5vL#-{WBgi&C5z8W_^*B9=!BbEO8+aIg1DmC$ zWiE9<9$0p8onXsZGHcr{r|<}DX1oXCFG$V^x>=jLkDmRI<0h+Q*NM2+4u@B)Slwqv z?-8xAmG!TsSjpPE(T8ZWCQw&RtYz))sLz~q07C}w>tJY3ac7gTr-yK^vMFdcVkz|H z!gz?1;fMSLy}g8b_MrVP#~vF>q)}I-V_VR!I_nHM4SX*{gi#s1s~wP zJ_HIjVcmNS`%rxR8@|>N4ZlXg)gQ4ZUh%EPxI$iNP&ujc6Vc1g;Ne?L_c_hgsv8H<;B0Lq|6KQ)y4y>3?_A$)$&{+KXYzI+NEbD)eBnHN`Ur{T3FBc1YK(fwW#Gg zIam}GrByU#Oic1+Cmia{v)5Gd8d8K%5nCX=a(Fc_>D3ULhF{F;m$X3*$rg$6?p45XK5Fo8E*6i z2MXX9EMUw=5yr0p`c@NWx-Z92WK&D#wJtt`^o)#l*j>Bk#-BFMT?SWGX>rcC)cp>Sy3J)K6w~}xAau;)&FAYbT%Tg6-lD~ zN)6CsE8~OvhiKJWyej%+GFrkKbe_GS^I_Vo460Wdv5WNarg7h|P>QkBMT7v6c%s_vl47h>2ViT|C0 
z29k?;$OXPD22gjr#K;}z?(dNS4On#;4p0BxGpM~6+V9d7+SVFdHaMjWQmiH1-@@QP z?ZPO?x5`%KgLF{*W*GCWENzOgR7_qe!wc0LteT3X*J_4#?(K&AwP$26o@e3Q9paMz z1EuyDEAxlItkYn_F^;R8xeo+Vrm$r^b{lK|_mH@Ek-=BdSRUfTzJn+G9S9tO#*>Wv zAUW|&Rx~pDR$vXI71>nlv4T;EHI5>z1f(It+JsCG#e4RQYwQAngjt(F#1lqz^jeTm z`ACGNlR%cSjD=!YXMxcR!LMQTXDC?KpO_S757@>(m67Z-(cphraR|j*YGD;78F3I= zS82d%N&|CUBf_B$t1T5-Q%Q{nJ_9ij%5|EXn1>2@v1$=7QH_>nXPqK7k3?ipNoJ1c zdLc&q_H(q2U}p3rH;A<#6?Y8d*+cr172*c=1x_lIulu26w&gq`N>B6MAGZTbun0Dwa-CE}-zkN@Qw%#W3tCS)Xg(J{fGlu8W&9S+;JTjZ zM#`Vkib$Ti+@T;=pkj4Ro?lqs%BgW3uCTJh(4So*8Ms1vu2}(JL=~=5mg^{Yi#WBr zu`)f#*fp^ne%lH{ZlYgn(B{|BFYQg2cS6xaZTN>Q)z;`=1Nh%~<5{2T7|}$9=tCjS zkk;0de~f^8r-0v+u>aPA8S6pe10eTS?y!TA+r%BVU_UGezh-mZ8YI9v&I;n}m3)>I z5dSL!iFUBG#P+QY)-)vcqyg_0x!#Fi{$T`)8{U?3w4*dSVtLMK%>Tw96*hABn-0MN8?&e|Hsktc1Dz(?p2+UD#qwmG-Asu@V+W{9D%;M z0F84YSKh@+@Bmsrm3yvdM7pz99mH?@xZ`9b%oJ9K7csiS_;r#YY36}&l40tlqA#vP zzh|H^WzPGEju}GREW~s?Kz2TcLKXe_24r_bvnLFzTN36gyEDIn*WT!?ihat2lrM`c zRwPe8P^c2taS0@LUF39Cw2KBvoTg|DqtG5Y@U8el-xT87&w_@0lYgD z`)fF7>Re^k)SklnpleMm07b)Vf4X7|3J||G1TLR|b|;j%25Kpz$Q!K4H`sVDu^jJX zIX(dkpQ77EvS#B;rq)!TOG%I=GuB!FSeOfIF&*!V7^|R3Sts_Gl!Hl;Sf#NyZRYP> z;|^%3oDZ9j7K*vk&Jg8V`J1U?F_<_V99+jQ<9V<6e(9tO7{@{Ms~^1@4oc1fKi7k# z|AF!+SP$3^f*;3fdcYlmLGJtD{aZeLVdXjrt;j`<6zmXaQ+SNXu%eDtgopYns7$yf;wpl7-GlodO#7}ssg)wQbK zk=5-QtfF*ccTHyyxg(#-ab9Y(7g32$eCfZr>w9qgm&uR#Gza#dM$5(hVvsOq^BY^lmSeVP8r=~`wko60 z1n#+z)-2#!(qpIdZUHi5C9PBh&?Q5s6Nf9($V2R%_Gv~UowVEJ61tzX+G*guGH+A? zf2x9YCDDZH@ToeH&DF4o3xHEqLCFqaS0~1zEzep_>dt8OK@yFn9WxCqno9o_qq3EG zIK)gSwpjU)6z#p8vC}@eod!w=89pmTNy~#LeR-2HJCbx7j}i1wqx>$4>Q<(J($Gb2$SbcirZpm0>@O*U_MFXXh99*`6JgQ{J zrVd6QDps6x=i)AR$C6GZKZ70T(>suDJ6z^g^h^Mg$I|LP%vc{{|Db;SXbnt*nfDoJffJB{e>{`W zS%W~$Pli6WfmY51ALsEr6lqe0d5S|fx&Sq2V8(Kyr6omTWaTJ~7QGO=cM{k>7Q|Er z4CT>yVW9X|^xAje`%gYU22q}HXpi?EEbSfO?@~}#I~d1-(BnbhFisVQIFSO8T z&Qc~1wdA3Zo8ugsy3knpf21=9an)^T=Npj-%55|mc`$^QP2$r8Z1xUFwu<=7v!hie zAhq8z!)MSIb{U@ViO@tpP@_Nmspt<;koFQDz%tTqi&wn<$GHpmRXd!uN~L^gJ)ykm zP}xziQXVMfz_<_ZDk4`=xV!m9z8z_&=eUM;RnG+#$3WTIDVG_IFphPdXx5@)-8Q1V zqtF+hvF3GySdhIu4zgbJ!);vtGHFl&&PBa^NHRRtngOE{R;}?#HXy?kc{0c2zI}&8QDsMn;jvV|eGgOxk1j))L zWy`98F4O@1rU}|hBf|o(54Kd{`(}LK8mZM8+24`5)IQzL=qOqPQIxsj%yV&N*~HSz zv^QK5OkO_O6xTS;gV*xhYfrs)bZP_+Gg2q$*><={c`p_*e{(_L`CyAyXlC+$5xlMV z1La-P+TUd4hjib68L?r=saE(vYk)wC+AqsPdpDc$sAb@P8RoSBxKS9*w-E1&8F*Ki zYn)@0UPBF=xW4vL&*vViLA|Nmd$=KG`x+}sjnV9t#ZmJziN_dZ-d`!9?C_@z1iy)O zew8a~J>u_r05un(LU|#^cp_L|3wD2FA3}Z47}jLxBFTbjbviWkrO;nPq}M2VJ{TF* z8f4YpLG}0+Qo(^G?aRm(WrTAuM_O-63(|Wd75?Vn&CLAeMG~ZDJ`18zC`0StoO*51 zI7g##C}W&r{FTK<$S{a+hx1t;I7JgGx1RDi9Kx#nj25Py#}Vi~pFx|~XlK$VAA%6t z9isPl@NeEkm)i)+El2ko4o%eJzRD_=1J8(dUnQZv+IvzT8W{x&%>kWe5nV6qSUS;N zbk(0kt3+WLQvlBV4bAmEh;oBFoW~=126TJGr_*59I*?-k6sKH1%Ks)WO;xU50V%0G z_^MH(JvQH=Eajro?woM!Rx_>8T39YTI;t|Fj9S_qk&SOs8>s6~Tb08tJ6uqPD=V{7 z5AM~U6~3X2(iFz4KSz6P<$Q3ADkxZ3D^r|Iq+yFFc3Au9GIHHayxN!-6ERWRh4UWi zqx`M%HH!nT!xbms;`LDaa*$vfuN$C#MU~3~f1mbiC)^jLyK>gv;kXPhDZ_)}Y&Y^y z4ISlmQ0~v+aKSR@SH3#sI6MYF9R@M>@%cagCEQuhLwOyx@hVLv5IP?V?~H@WCxNp) zIFvJ>2G*}KN%|8H;6*f;oqfq^;g^E&R8!9Agaj84^n@3bQD7o%U&*yr(>CFlVyV?n zW!oHprZtc=lzFl}M`vu%V%Vgkq3ym_QQP`nY|Ux z7{Cg62Jm_^ICO{}{>!TT5YIelFR7^@_ss)cjAAUJsSH&DI<^KzPrn4XdnDxW>(oyg5m0qXaL{w zOL(E(X;-NmX{#0YEZjpmbX#+$cIZ4^K`+JO)d9aM^Qd6x`Q`8!XveO9pOtcpt-^#G{Mn||8SP+oZc2gq-Z82|tP diff --git a/tools.py b/tools.py index f8298e6..168bb1a 100644 --- a/tools.py +++ b/tools.py @@ -435,7 +435,7 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): start_time = time.time() self.log.info("Start Speaker Diarization: %s" % (start_time)) - if self.maxNrSpeakers == 1: 
+ if self.maxNrSpeakers == 1 or audio.dur < 3: self.log.info("Speaker Diarization time in seconds: %s" % (time.time() - start_time)) return [[0, audio.dur, 1], [audio.dur, -1, -1]] From 51f562dc3cf27d8e588003058a3ceadbeca1d1d6 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Mon, 17 Aug 2020 14:44:43 +0200 Subject: [PATCH 018/172] update readme --- README.md | 46 ++++++++++++++++++++++++++++++++++++---------- 1 file changed, 36 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 8e53305..a2540a8 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ This service is mandatory in a LinTO platform stack as the main worker for speec Generally, Automatic Speech Recognition (ASR) is the task of recognition and translation of spoken language into text. Our ASR system takes advantages from the recent advances in machine learning technologies and in particular deep learning ones (TDNN, LSTM, attentation-based architecture). The core of our system consists of two main components: an acoustic model and a decoding graph. A high-performance ASR system relies on an accurate acoustic model as well as a perfect decoding graph. ## Usage -See documentation : [doc.linto.ai](https://doc.linto.ai) +See documentation : [doc.linto.ai](https://doc.linto.ai/#/services/linstt) # Deploy @@ -42,12 +42,12 @@ Or, download the pre-built image from docker-hub: docker pull lintoai/linto-platform-stt-standalone-worker:latest ``` -NB: You must install docker on your machine. +NOTE: You must install docker on your machine. ## Configuration -The LinSTT service that will be set-up here require KALDI models, the acoustic model and the decoding graph. Indeed, these models are not included in the repository; you must download them in order to run LinSTT. You can use our pre-trained models from here: [Downloads](https://doc.linto.ai/#/services/linstt_download). +The LinSTT service that will be set-up here require KALDI models, the acoustic model and the decoding graph. Indeed, these models are not included in the repository; you must download them in order to run LinSTT. You can use our pre-trained models from here: [linstt download](services/linstt_download). -### Outside LinTO-Platform-STT-Service-Manager +### Outside LinTO-Platform-STT-Service-Manager If you want to use our service alone without LinTO-Platform-STT-Service-Manager, you must `unzip` the files and put the extracted ones in the [shared storage](https://doc.linto.ai/#/infra?id=shared-storage). For example, @@ -72,13 +72,15 @@ mv AM_fr-FR ~/linto_shared/data mv DG_fr-FR_Small ~/linto_shared/data ``` -4- Configure the environment file `.env` included in this repository +4- Rename the default environment file `.envdefault` included in the repository `linto-platform-stt-standalone-worker` and configure it by providing the full path of the following parameters: AM_PATH=/full/path/to/linto_shared/data/AM_fr-FR LM_PATH=/full/path/to/linto_shared/data/DG_fr-FR_Small +5- If you want to use Swagger interface, you need to set the corresponding environment parameter: + SWAGGER_PATH=/full/path/to/swagger/file -NB: if you want to use the visual user interface of the service, you need also to configure the swagger file `document/swagger.yml` included in this repository. Specifically, in the section `host`, specify the adress of the machine in which the service is deployed. 
+NOTE: if you want to use the user interface of the service, you need also to configure the swagger file `document/swagger.yml` included in the repository `linto-platform-stt-standalone-worker`. Specifically, in the section `host`, specify the address of the machine in which the service is deployed. ### Using LinTO-Platform-STT-Service-Manager In case you want to use `LinTO-Platform-STT-Service-Manager`, you need to: @@ -87,9 +89,9 @@ In case you want to use `LinTO-Platform-STT-Service-Manager`, you need to: 2- Create a language model and upload the corresponding decoding graph -3- Configure the environmenet file of this service. +3- Configure the environment file of this service. -For more details, see configuration instruction in [LinTO - STT-Manager](https://doc.linto.ai/#/manager) +For more details, see instructions in [LinTO - STT-Manager](https://doc.linto.ai/#/services/stt_manager) ## Execute In order to run the service alone, you have only to execute: @@ -98,8 +100,9 @@ In order to run the service alone, you have only to execute: cd linto-platform-stt-standalone-worker docker-compose up ``` +Then you can acces it on [localhost:8888](localhost:8888) -To run and manager LinSTT under `LinTO-Platform-STT-Service-Manager` service, you need to create a service first and then to start it. See [LinTO - STT-Manager](services/manager?id=execute) +To run and manager LinSTT under `LinTO-Platform-STT-Service-Manager` service, you need to create a service first and then to start it. See [LinTO - STT-Manager](https://doc.linto.ai/#/services/stt_manager_how2use?id=how-to-use-it) Our service requires an audio file in `Waveform format`. It should has the following parameters: @@ -109,6 +112,8 @@ Our service requires an audio file in `Waveform format`. It should has the follo - microphone: any type - duration: <30 minutes +Other formats are also supported: mp3, aiff, flac, and ogg. + ### Run Example Applications To run an automated test go to the test folder @@ -122,5 +127,26 @@ And run the test script: ./test_deployment.sh ``` -Or use swagger interface to perform your personal test +Or use swagger interface to perform your personal test: localhost:8888/api-doc/ + + + + +#### ** /transcribe ** + +Convert a speech to text + +### Functionality +> `post`
+> Make a POST request +>> Arguments : +>> - **{File} file** : Audio file (file format: wav, mp3, aiff, flac, ogg) +>> - **{Integer} nbrSpeaker (optional)**: Number of speakers engaged in dialog +>> - **{String} speaker (optional)**: Do speaker diarization (yes|no) +> +>> Header : +>> - **{String} Accept**: response content type (text/plain|application/json) +> +> **{text|Json}** : Return the full transcription or a json object with metadata + From 04795c1cb6edac269642ff206a76ccf314e08d3a Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 1 Sep 2020 12:04:06 +0200 Subject: [PATCH 019/172] remove extrat words from transcription when using text/plain response --- Jenkinsfile | 1 + tools.py | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/Jenkinsfile b/Jenkinsfile index d027c84..5f464c5 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -44,6 +44,7 @@ pipeline { ).trim() docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { image.push('latest-unstable') + image.push('offline') } } } diff --git a/tools.py b/tools.py index 168bb1a..92b8dc1 100644 --- a/tools.py +++ b/tools.py @@ -542,7 +542,8 @@ def run(self,audio,asr,spk): else: return {"text":output["text"]} else: - return decode["text"] + text = re.sub(r"#nonterm:[^ ]* ", "", decode["text"]) + return text def getOutput(self,timestamps,frame_shift, frame_subsampling, spkSeg = []): output = {} From 460efb7d19f4f28c750baee619dfabe189fa5c99 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 25 Sep 2020 02:48:57 +0200 Subject: [PATCH 020/172] update stt worker offline --- .gitmodules | 6 + Dockerfile | 173 ++++------ pyBK | 1 + run.py | 148 ++++----- tools.py | 922 ++++++++++++++++++++++------------------------------ vosk-api | 1 + 6 files changed, 515 insertions(+), 736 deletions(-) create mode 100644 .gitmodules create mode 160000 pyBK mode change 100755 => 100644 run.py create mode 160000 vosk-api diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 0000000..9cea8d6 --- /dev/null +++ b/.gitmodules @@ -0,0 +1,6 @@ +[submodule "vosk-api"] + path = vosk-api + url = git@github.com:irebai/vosk-api.git +[submodule "pyBK"] + path = pyBK + url = git@github.com:irebai/pyBK.git diff --git a/Dockerfile b/Dockerfile index 6608943..604bdb7 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,128 +1,81 @@ -# Dockerfile for building PyKaldi image from Ubuntu 16.04 image FROM ubuntu:18.04 LABEL maintainer="irebai@linagora.com" -# Install necessary system packages -RUN apt-get update \ - && apt-get install -y \ - python3 \ +RUN apt-get update &&\ + apt-get install -y \ + python2.7 \ + python3 \ python3-pip \ - python2.7 \ - autoconf \ - automake \ - cmake \ - make \ - curl \ - g++ \ - git \ - graphviz \ - libatlas3-base \ - libtool \ - pkg-config \ - sox \ - subversion \ - bzip2 \ - unzip \ - wget \ - zlib1g-dev \ - ca-certificates \ - gfortran \ - patch \ - ffmpeg \ - nano && \ - ln -s /usr/bin/python3 /usr/bin/python && \ - ln -s /usr/bin/pip3 /usr/bin/pip + git \ + swig \ + nano \ + sox \ + automake wget unzip build-essential libtool zlib1g-dev locales libatlas-base-dev ca-certificates gfortran subversion &&\ + apt-get clean -# Install necessary Python packages (pykaldi dependencies) -RUN pip install --upgrade pip \ - numpy \ - setuptools \ - pyparsing \ - ninja +## Build kaldi and Clean installation (intel, openfst, src/*) +RUN git clone --depth 1 https://github.com/kaldi-asr/kaldi.git /opt/kaldi && \ + cd /opt/kaldi/tools && \ + ./extras/install_mkl.sh && \ + make -j $(nproc) && \ + cd /opt/kaldi/src && 
\ + ./configure --shared && \ + make depend -j $(nproc) && \ + make -j $(nproc) && \ + mkdir -p /opt/kaldi/src_ && \ + mv /opt/kaldi/src/base \ + /opt/kaldi/src/chain \ + /opt/kaldi/src/cudamatrix \ + /opt/kaldi/src/decoder \ + /opt/kaldi/src/feat \ + /opt/kaldi/src/fstext \ + /opt/kaldi/src/gmm \ + /opt/kaldi/src/hmm \ + /opt/kaldi/src/ivector \ + /opt/kaldi/src/kws \ + /opt/kaldi/src/lat \ + /opt/kaldi/src/lm \ + /opt/kaldi/src/matrix \ + /opt/kaldi/src/nnet \ + /opt/kaldi/src/nnet2 \ + /opt/kaldi/src/nnet3 \ + /opt/kaldi/src/online2 \ + /opt/kaldi/src/rnnlm \ + /opt/kaldi/src/sgmm2 \ + /opt/kaldi/src/transform \ + /opt/kaldi/src/tree \ + /opt/kaldi/src/util \ + /opt/kaldi/src/itf \ + /opt/kaldi/src/lib /opt/kaldi/src_ && \ + cd /opt/kaldi && rm -r src && mv src_ src && rm src/*/*.cc && rm src/*/*.o && rm src/*/*.so && \ + cd /opt/intel/mkl/lib && rm -f intel64/*.a intel64_lin/*.a && \ + cd /opt/kaldi/tools && mkdir openfst_ && mv openfst-*/lib openfst-*/include openfst-*/bin openfst_ && rm openfst_/lib/*.so* openfst_/lib/*.la && \ + rm -r openfst-*/* && mv openfst_/* openfst-*/ && rm -r openfst_ -## Install Protobuf, CLIF, Kaldi and PyKaldi and Clean installation -RUN git clone --depth 1 https://github.com/pykaldi/pykaldi.git /pykaldi \ - && cd /pykaldi/tools \ - && sed -i "s/make \-j4/make -j $(nproc)/g" ./install_kaldi.sh \ - && sed -i "s/\-j 2/-j $(nproc)/g" ./install_clif.sh \ - && sed -i "s/make \-j4/make -j $(nproc)/g" ./install_protobuf.sh \ - && ./check_dependencies.sh \ - && ./install_protobuf.sh \ - && ./install_clif.sh \ - && ./install_kaldi.sh \ - && cd /pykaldi \ - && python setup.py install \ - && rm -rf /pykaldi/CMakeLists.txt \ - /pykaldi/LICENSE \ - /pykaldi/README.md \ - /pykaldi/setup.cfg \ - /pykaldi/setup.py \ - /pykaldi/docker \ - /pykaldi/docs \ - /pykaldi/extras \ - /pykaldi/pykaldi.egg-info \ - /pykaldi/tests \ - /pykaldi/build/CMakeCache.txt \ - /pykaldi/build/bdist.linux-x86_64 \ - /pykaldi/build/build.ninja \ - /pykaldi/build/cmake_install.cmake \ - /pykaldi/build/docs \ - /pykaldi/build/kaldi \ - /pykaldi/build/lib \ - /pykaldi/build/rules.ninja \ - /pykaldi/tools/check_dependencies.sh \ - /pykaldi/tools/clif* \ - /pykaldi/tools/find_python_library.py \ - /pykaldi/tools/install_* \ - /pykaldi/tools/protobuf \ - /pykaldi/tools/use_namespace.sh \ - /pykaldi/tools/kaldi/COPYING \ - /pykaldi/tools/kaldi/INSTALL \ - /pykaldi/tools/kaldi/README.md \ - /pykaldi/tools/kaldi/egs \ - /pykaldi/tools/kaldi/misc \ - /pykaldi/tools/kaldi/scripts \ - /pykaldi/tools/kaldi/windows \ - && mkdir -p /pykaldi/tools/kaldi/src_/lib \ - && mv /pykaldi/tools/kaldi/src/base/libkaldi-base.so \ - /pykaldi/tools/kaldi/src/chain/libkaldi-chain.so \ - /pykaldi/tools/kaldi/src/cudamatrix/libkaldi-cudamatrix.so \ - /pykaldi/tools/kaldi/src/decoder/libkaldi-decoder.so \ - /pykaldi/tools/kaldi/src/feat/libkaldi-feat.so \ - /pykaldi/tools/kaldi/src/fstext/libkaldi-fstext.so \ - /pykaldi/tools/kaldi/src/gmm/libkaldi-gmm.so \ - /pykaldi/tools/kaldi/src/hmm/libkaldi-hmm.so \ - /pykaldi/tools/kaldi/src/ivector/libkaldi-ivector.so \ - /pykaldi/tools/kaldi/src/kws/libkaldi-kws.so \ - /pykaldi/tools/kaldi/src/lat/libkaldi-lat.so \ - /pykaldi/tools/kaldi/src/lm/libkaldi-lm.so \ - /pykaldi/tools/kaldi/src/matrix/libkaldi-matrix.so \ - /pykaldi/tools/kaldi/src/nnet/libkaldi-nnet.so \ - /pykaldi/tools/kaldi/src/nnet2/libkaldi-nnet2.so \ - /pykaldi/tools/kaldi/src/nnet3/libkaldi-nnet3.so \ - /pykaldi/tools/kaldi/src/online2/libkaldi-online2.so \ - /pykaldi/tools/kaldi/src/rnnlm/libkaldi-rnnlm.so \ - 
/pykaldi/tools/kaldi/src/sgmm2/libkaldi-sgmm2.so \ - /pykaldi/tools/kaldi/src/transform/libkaldi-transform.so \ - /pykaldi/tools/kaldi/src/tree/libkaldi-tree.so \ - /pykaldi/tools/kaldi/src/util/libkaldi-util.so \ - /pykaldi/tools/kaldi/src_/lib \ - && rm -rf /pykaldi/tools/kaldi/src && mv /pykaldi/tools/kaldi/src_ /pykaldi/tools/kaldi/src \ - && cd /pykaldi/tools/kaldi/tools && mkdir openfsttmp && mv openfst-*/lib openfst-*/include openfst-*/bin openfsttmp && rm openfsttmp/lib/*.a openfsttmp/lib/*.la && \ - rm -r openfst-*/* && mv openfsttmp/* openfst-*/ && rm -r openfsttmp +# Install pyBK (speaker diarization toolkit) +RUN apt install -y software-properties-common && wget https://apt.llvm.org/llvm.sh && chmod +x llvm.sh && ./llvm.sh 10 && \ + export LLVM_CONFIG=/usr/bin/llvm-config-10 && \ + pip3 install numpy && \ + pip3 install websockets && \ + pip3 install librosa webrtcvad scipy sklearn + +# build VOSK KALDI +COPY vosk-api /opt/vosk-api +RUN cd /opt/vosk-api/python && \ + export KALDI_ROOT=/opt/kaldi && \ + export KALDI_MKL=1 && \ + python3 setup.py install --user --single-version-externally-managed --root=/ # Define the main folder WORKDIR /usr/src/speech-to-text # Install main service packages -RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml logger librosa webrtcvad scipy sklearn -RUN apt-get install -y libsox-fmt-all && pip3 install git+https://github.com/rabitt/pysox.git \ - && git clone https://github.com/irebai/pyBK.git /pykaldi/tools/pyBK \ - && cp /pykaldi/tools/pyBK/diarizationFunctions.py . +RUN pip3 install flask flask-cors flask-swagger-ui gevent pyyaml # Set environment variables ENV PATH /pykaldi/tools/kaldi/egs/wsj/s5/utils/:$PATH +COPY pyBK/diarizationFunctions.py pyBK/diarizationFunctions.py COPY tools.py . COPY run.py . 
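For reference, a minimal client sketch for the `/transcribe` route exposed by the image built above, following the README documentation earlier in this series (service reachable on localhost:8888, multipart `file` field, optional `speaker`/`nbrSpeaker` fields, `Accept` header of `text/plain` or `application/json`). It uses the third-party `requests` library, which is not part of this repository, and `audio.wav` is only a placeholder file name:

```python
import requests

# Sketch of a client for the /transcribe endpoint described in the README above.
# Assumptions: the worker listens on localhost:8888 and "audio.wav" is any
# supported input (wav, mp3, aiff, flac, ogg per the README).
with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        "http://localhost:8888/transcribe",
        files={"file": ("audio.wav", audio_file, "audio/wav")},
        data={"speaker": "yes", "nbrSpeaker": "2"},  # optional diarization fields
        headers={"accept": "application/json"},      # or "text/plain" for raw text
    )

response.raise_for_status()
print(response.json())  # transcription plus word/speaker metadata
```

With `Accept: text/plain` the service returns only the transcription string, which is the branch cleaned up by the `#nonterm` substitution in patch 019 above.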
diff --git a/pyBK b/pyBK new file mode 160000 index 0000000..7738eb7 --- /dev/null +++ b/pyBK @@ -0,0 +1 @@ +Subproject commit 7738eb75dfc65438fbcd0eed9bb6a1f086b4bd6c diff --git a/run.py b/run.py old mode 100755 new mode 100644 index ecdbb18..a95cf47 --- a/run.py +++ b/run.py @@ -2,79 +2,36 @@ # -*- coding: utf-8 -*- from flask import Flask, request, abort, Response, json -from flask_swagger_ui import get_swaggerui_blueprint -from flask_cors import CORS -from tools import ASR, Audio, SpeakerDiarization, SttStandelone -import yaml, os, sox, logging +from vosk import Model, KaldiRecognizer +from tools import WorkerStreaming from time import gmtime, strftime +from gevent.pywsgi import WSGIServer + + + app = Flask("__stt-standelone-worker__") -# Set logger config -logger = logging.getLogger(__name__) -logging.basicConfig(level=logging.DEBUG) - -# Main parameters -AM_PATH = '/opt/models/AM' -LM_PATH = '/opt/models/LM' -TEMP_FILE_PATH = '/opt/tmp' -CONFIG_FILES_PATH = '/opt/config' -NBR_PROCESSES = 1 -SAVE_AUDIO = False -SERVICE_PORT = 80 -SWAGGER_URL = '/api-doc' -SWAGGER_PATH = '' -asr = ASR(AM_PATH,LM_PATH, CONFIG_FILES_PATH) - -if not os.path.isdir(TEMP_FILE_PATH): - os.mkdir(TEMP_FILE_PATH) -if not os.path.isdir(CONFIG_FILES_PATH): - os.mkdir(CONFIG_FILES_PATH) - -# Environment parameters -if 'SERVICE_PORT' in os.environ: - SERVICE_PORT = os.environ['SERVICE_PORT'] -if 'SAVE_AUDIO' in os.environ: - SAVE_AUDIO = os.environ['SAVE_AUDIO'] -if 'NBR_PROCESSES' in os.environ: - if int(os.environ['NBR_PROCESSES']) > 0: - NBR_PROCESSES = int(os.environ['NBR_PROCESSES']) - else: - exit("You must to provide a positif number of processes 'NBR_PROCESSES'") -if 'SWAGGER_PATH' in os.environ: - SWAGGER_PATH = os.environ['SWAGGER_PATH'] - -def swaggerUI(): - ### swagger specific ### - swagger_yml = yaml.load(open(SWAGGER_PATH, 'r'), Loader=yaml.Loader) - swaggerui = get_swaggerui_blueprint( - SWAGGER_URL, # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' - SWAGGER_PATH, - config={ # Swagger UI config overrides - 'app_name': "STT API Documentation", - 'spec': swagger_yml - } - ) - app.register_blueprint(swaggerui, url_prefix=SWAGGER_URL) - ### end swagger specific ### - -def getAudio(file,audio): - file_path = TEMP_FILE_PATH+file.filename.lower() - file.save(file_path) - audio.transform(file_path) - if not SAVE_AUDIO: - os.remove(file_path) - +# create WorkerStreaming object +worker = WorkerStreaming() + +# Load ASR models (acoustic model and decoding graph) +worker.log.info('Load acoustic model and decoding graph') +model = Model(worker.AM_PATH, worker.LM_PATH, + worker.CONFIG_FILES_PATH+"/online.conf") + + +# API @app.route('/transcribe', methods=['POST']) def transcribe(): try: - app.logger.info('[%s] New user entry on /transcribe' % (strftime("%d/%b/%d %H:%M:%S", gmtime()))) - # create main objects - spk = SpeakerDiarization() - audio = Audio(asr.get_sample_rate()) - - #get response content type - metadata = False + worker.log.info('[%s] New user entry on /transcribe' % + (strftime("%d/%b/%d %H:%M:%S", gmtime()))) + + metadata = worker.METADATA + nbrSpk = 10 + + # get response content type if request.headers.get('accept').lower() == 'application/json': metadata = True elif request.headers.get('accept').lower() == 'text/plain': @@ -82,69 +39,80 @@ def transcribe(): else: raise ValueError('Not accepted header') - #get speaker parameter + # get speaker parameter spkDiarization = False if request.form.get('speaker') != None and (request.form.get('speaker').lower() == 'yes' or 
request.form.get('speaker').lower() == 'no'): - spkDiarization = True if request.form.get('speaker').lower() == 'yes' else False - #get number of speakers parameter + spkDiarization = True if request.form.get( + 'speaker').lower() == 'yes' else False + # get number of speakers parameter try: if request.form.get('nbrSpeaker') != None and spkDiarization and int(request.form.get('nbrSpeaker')) > 0: - spk.set_maxNrSpeakers(int(request.form.get('nbrSpeaker'))) + nbrSpk = int(request.form.get('nbrSpeaker')) elif request.form.get('nbrSpeaker') != None and spkDiarization: - raise ValueError('Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') + raise ValueError( + 'Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') except Exception as e: - app.logger.error(e) - raise ValueError('Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') + worker.log.error(e) + raise ValueError( + 'Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') else: if request.form.get('speaker') != None: raise ValueError('Not accepted "speaker" field value (yes|no)') - stt = SttStandelone(metadata,spkDiarization) - - #get input file + # get input file if 'file' in request.files.keys(): file = request.files['file'] - getAudio(file,audio) - output = stt.run(audio,asr,spk) + worker.getAudio(file) + rec = KaldiRecognizer(model, worker.rate, metadata) + response = rec.Decode(worker.data) + if metadata: + obj = rec.GetMetadata() + data = json.loads(obj) + response = worker.process_metadata(data, spkDiarization, nbrSpk) else: raise ValueError('No audio file was uploaded') - return output, 200 + return response, 200 except ValueError as error: return str(error), 400 except Exception as e: - app.logger.error(e) + worker.log.error(e) return 'Server Error', 500 + @app.route('/healthcheck', methods=['GET']) def check(): return '', 200 # Rejected request handlers + + @app.errorhandler(405) def method_not_allowed(error): return 'The method is not allowed for the requested URL', 405 + @app.errorhandler(404) def page_not_found(error): return 'The requested URL was not found', 404 + @app.errorhandler(500) def server_error(error): - app.logger.error(error) + worker.log.error(error) return 'Server Error', 500 + if __name__ == '__main__': try: - #start SwaggerUI - if SWAGGER_PATH != '': - swaggerUI() + # start SwaggerUI + if worker.SWAGGER_PATH != '': + worker.swaggerUI(app) + # Run server - #Run ASR engine - asr.run() + http_server = WSGIServer(('', worker.SERVICE_PORT), app) + http_server.serve_forever() - #Run server - app.run(host='0.0.0.0', port=SERVICE_PORT, debug=False, threaded=False, processes=NBR_PROCESSES) except Exception as e: - app.logger.error(e) - exit(e) \ No newline at end of file + worker.log.error(e) + exit(e) diff --git a/tools.py b/tools.py index 92b8dc1..8cc3715 100644 --- a/tools.py +++ b/tools.py @@ -1,619 +1,469 @@ -## Kaldi ASR decoder -from kaldi.asr import NnetLatticeFasterOnlineRecognizer -from kaldi.decoder import (LatticeFasterDecoderOptions, - LatticeFasterOnlineDecoder) -from kaldi.nnet3 import NnetSimpleLoopedComputationOptions -from kaldi.online2 import (OnlineEndpointConfig, - OnlineIvectorExtractorAdaptationState, - OnlineNnetFeaturePipelineConfig, - OnlineNnetFeaturePipelineInfo, - OnlineNnetFeaturePipeline, - OnlineSilenceWeighting) -from kaldi.util.options import ParseOptions -from kaldi.util.table import SequentialWaveReader -from kaldi.matrix import Matrix, Vector -############## +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- -## word to CTM -from kaldi.lat.align import 
(WordBoundaryInfoNewOpts, - WordBoundaryInfo, - word_align_lattice) -from kaldi.lat.functions import (compact_lattice_to_word_alignment, - compact_lattice_shortest_path) -from kaldi.asr import NnetRecognizer -import kaldi.fstext as _fst +#  ASR +from vosk import Model, KaldiRecognizer ############## -## Speaker Diarization -from diarizationFunctions import * -import numpy as np +# Speaker Diarization +from pyBK.diarizationFunctions import * import librosa -from kaldi.ivector import (compute_vad_energy, - VadEnergyOptions) -from kaldi.feat.mfcc import Mfcc, MfccOptions -from kaldi.util.options import ParseOptions +import time +import webrtcvad ############## -## other packages -import configparser, sys, os, re, sox, time, logging -from concurrent.futures import ThreadPoolExecutor +# other packages +import configparser +import logging +import os +import re +import json +import yaml +import scipy.io.wavfile +import numpy as np +from flask_swagger_ui import get_swaggerui_blueprint ############## -class ASR: - def __init__(self, AM_PATH, LM_PATH, CONFIG_FILES_PATH): - self.log = logging.getLogger('__stt-standelone-worker__.ASR') - self.AM_PATH = AM_PATH - self.LM_PATH = LM_PATH - self.CONFIG_FILES_PATH = CONFIG_FILES_PATH + +class WorkerStreaming: + def __init__(self): + # Set logger config + self.log = logging.getLogger("__stt-standelone-worker-streaming__") + logging.basicConfig(level=logging.INFO) + + # Main parameters + self.AM_PATH = '/opt/models/AM' + self.LM_PATH = '/opt/models/LM' + self.TEMP_FILE_PATH = '/opt/tmp' + self.CONFIG_FILES_PATH = '/opt/config' + self.SAVE_AUDIO=False + self.SERVICE_PORT = 80 + self.NBR_THREADS = 100 + self.METADATA = True + self.SWAGGER_URL = '/api-doc' + self.SWAGGER_PATH = '' + + if not os.path.isdir(self.CONFIG_FILES_PATH): + os.mkdir(self.CONFIG_FILES_PATH) + + if not os.path.isdir(self.TEMP_FILE_PATH): + os.mkdir(self.TEMP_FILE_PATH) + + # Environment parameters + if 'NBR_THREADS' in os.environ: + if int(os.environ['NBR_THREADS']) > 0: + self.NBR_THREADS = int(os.environ['NBR_THREADS']) + else: + self.log.warning( + "You must to provide a positif number of threads 'NBR_THREADS'") + if 'SWAGGER_PATH' in os.environ: + self.SWAGGER_PATH = os.environ['SWAGGER_PATH'] + + + # start loading ASR configuration + self.log.info("Create the new config files") + self.loadConfig() + + + def swaggerUI(self, app): + ### swagger specific ### + swagger_yml = yaml.load(open(self.SWAGGER_PATH, 'r'), Loader=yaml.Loader) + swaggerui = get_swaggerui_blueprint( + self.SWAGGER_URL, # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' + self.SWAGGER_PATH, + config={ # Swagger UI config overrides + 'app_name': "STT API Documentation", + 'spec': swagger_yml + } + ) + app.register_blueprint(swaggerui, url_prefix=self.SWAGGER_URL) + ### end swagger specific ### + + + def getAudio(self,file): + file_path = self.TEMP_FILE_PATH+"/"+file.filename.lower() + file.save(file_path) + self.rate, self.data = scipy.io.wavfile.read(file_path) + + if not self.SAVE_AUDIO: + os.remove(file_path) - def run(self): - def loadConfig(self): - #get decoder parameters from "decode.cfg" - decoder_settings = configparser.ConfigParser() - decoder_settings.read(self.AM_PATH+'/decode.cfg') - self.DECODER_SYS = decoder_settings.get('decoder_params', 'decoder') - self.AM_FILE_PATH = decoder_settings.get('decoder_params', 'ampath') - self.DECODER_MINACT = int(decoder_settings.get('decoder_params', 'min_active')) - self.DECODER_MAXACT = int(decoder_settings.get('decoder_params', 'max_active')) - 
self.DECODER_BEAM = float(decoder_settings.get('decoder_params', 'beam')) - self.DECODER_LATBEAM = float(decoder_settings.get('decoder_params', 'lattice_beam')) - self.DECODER_ACWT = float(decoder_settings.get('decoder_params', 'acwt')) - self.DECODER_FSF = int(decoder_settings.get('decoder_params', 'frame_subsampling_factor')) - - #Prepare "online.conf" - self.AM_PATH=self.AM_PATH+"/"+self.AM_FILE_PATH - with open(self.AM_PATH+"/conf/online.conf") as f: - values = f.readlines() - with open(self.CONFIG_FILES_PATH+"/online.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--ivector-extraction-config="+self.CONFIG_FILES_PATH+"/ivector_extractor.conf\n") - f.write("--mfcc-config="+self.AM_PATH+"/conf/mfcc.conf") - - #Prepare "ivector_extractor.conf" - with open(self.AM_PATH+"/conf/ivector_extractor.conf") as f: - values = f.readlines() - with open(self.CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--splice-config="+self.AM_PATH+"/conf/splice.conf\n") - f.write("--cmvn-config="+self.AM_PATH+"/conf/online_cmvn.conf\n") - f.write("--lda-matrix="+self.AM_PATH+"/ivector_extractor/final.mat\n") - f.write("--global-cmvn-stats="+self.AM_PATH+"/ivector_extractor/global_cmvn.stats\n") - f.write("--diag-ubm="+self.AM_PATH+"/ivector_extractor/final.dubm\n") - f.write("--ivector-extractor="+self.AM_PATH+"/ivector_extractor/final.ie") - - #Prepare "word_boundary.int" if not exist - if not os.path.exists(self.LM_PATH+"/word_boundary.int"): - if os.path.exists(self.AM_PATH+"phones.txt"): - with open(self.AM_PATH+"phones.txt") as f: - phones = f.readlines() - - with open(self.LM_PATH+"/word_boundary.int", "w") as f: - for phone in phones: - phone = phone.strip() - phone = re.sub('^ .*','', phone) - phone = re.sub('^#\d+ .*','', phone) - if phone != '': - id = phone.split(' ')[1] - if '_I ' in phone: - f.write(id+" internal\n") - elif '_B ' in phone: - f.write(id+" begin\n") - elif '_E ' in phone: - f.write(id+" end\n") - elif '_S ' in phone: - f.write(id+" singleton\n") - else: - f.write(id+" nonword\n") + # re-create config files + def loadConfig(self): + # load decoder parameters from "decode.cfg" + decoder_settings = configparser.ConfigParser() + if os.path.exists(self.AM_PATH+'/decode.cfg') == False: + return False + decoder_settings.read(self.AM_PATH+'/decode.cfg') + + # Prepare "online.conf" + self.AM_PATH = self.AM_PATH+"/" + \ + decoder_settings.get('decoder_params', 'ampath') + with open(self.AM_PATH+"/conf/online.conf") as f: + values = f.readlines() + with open(self.CONFIG_FILES_PATH+"/online.conf", 'w') as f: + for i in values: + f.write(i) + f.write("--ivector-extraction-config=" + + self.CONFIG_FILES_PATH+"/ivector_extractor.conf\n") + f.write("--mfcc-config="+self.AM_PATH+"/conf/mfcc.conf\n") + f.write( + "--beam="+decoder_settings.get('decoder_params', 'beam')+"\n") + f.write( + "--lattice-beam="+decoder_settings.get('decoder_params', 'lattice_beam')+"\n") + f.write("--acoustic-scale=" + + decoder_settings.get('decoder_params', 'acwt')+"\n") + f.write( + "--min-active="+decoder_settings.get('decoder_params', 'min_active')+"\n") + f.write( + "--max-active="+decoder_settings.get('decoder_params', 'max_active')+"\n") + f.write("--frame-subsampling-factor="+decoder_settings.get( + 'decoder_params', 'frame_subsampling_factor')+"\n") + f.write("--endpoint.rule2.min-trailing-silence=0.5\n") + f.write("--endpoint.rule3.min-trailing-silence=1.0\n") + f.write("--endpoint.rule4.min-trailing-silence=2.0\n") + + # Prepare 
"ivector_extractor.conf" + with open(self.AM_PATH+"/conf/ivector_extractor.conf") as f: + values = f.readlines() + with open(self.CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: + for i in values: + f.write(i) + f.write("--splice-config="+self.AM_PATH+"/conf/splice.conf\n") + f.write("--cmvn-config="+self.AM_PATH + + "/conf/online_cmvn.conf\n") + f.write("--lda-matrix="+self.AM_PATH + + "/ivector_extractor/final.mat\n") + f.write("--global-cmvn-stats="+self.AM_PATH + + "/ivector_extractor/global_cmvn.stats\n") + f.write("--diag-ubm="+self.AM_PATH + + "/ivector_extractor/final.dubm\n") + f.write("--ivector-extractor="+self.AM_PATH + + "/ivector_extractor/final.ie") + + # Prepare "word_boundary.int" if not exist + if not os.path.exists(self.LM_PATH+"/word_boundary.int") and os.path.exists(self.AM_PATH+"/phones.txt"): + self.log.info("Create word_boundary.int based on phones.txt") + with open(self.AM_PATH+"/phones.txt") as f: + phones = f.readlines() + + with open(self.LM_PATH+"/word_boundary.int", "w") as f: + for phone in phones: + phone = phone.strip() + phone = re.sub('^ .*', '', phone) + phone = re.sub('^#\d+ .*', '', phone) + if phone != '': + id = phone.split(' ')[1] + if '_I ' in phone: + f.write(id+" internal\n") + elif '_B ' in phone: + f.write(id+" begin\n") + elif '_E ' in phone: + f.write(id+" end\n") + elif '_S ' in phone: + f.write(id+" singleton\n") + else: + f.write(id+" nonword\n") + + # TODO: metadata (timestamps, speakers, save audio) + # return at the end of streaming a json object including word-data, speaker-data + # (get frames after the end of decoding) + def process_metadata(self, metadata, spkDiarization, nbrSpk=10): + if metadata is not None and 'words' in metadata and 'features' in metadata: + if not spkDiarization: + del metadata['features'] + del metadata['segments'] + return metadata + + features = metadata['features'] + seg = metadata['segments'] if metadata['segments'] is not None else [] + feats = np.array(features) + feats = np.squeeze(feats) + mask = np.ones(shape=(feats.shape[0],)) + + for pos in seg: + mask[pos-30:pos]=0 + + spk = SpeakerDiarization() + spk.set_maxNrSpeakers(nbrSpk) + spkrs = spk.run(feats,mask) + + speaker = [] + i = 0 + text = "" + for word in metadata['words']: + if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: + text += word["word"] + " " else: - raise ValueError('Neither word_boundary.int nor phones.txt exists!!!') - - try: - # Define online feature pipeline - self.log.info("Load decoder config") - loadConfig(self) - feat_opts = OnlineNnetFeaturePipelineConfig() - self.endpoint_opts = OnlineEndpointConfig() - po = ParseOptions("") - feat_opts.register(po) - self.endpoint_opts.register(po) - po.read_config_file(self.CONFIG_FILES_PATH+"/online.conf") - self.feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) - - # Set metadata parameters - self.samp_freq = self.feat_info.mfcc_opts.frame_opts.samp_freq - self.frame_shift = self.feat_info.mfcc_opts.frame_opts.frame_shift_ms / 1000 - - # Construct recognizer - self.log.info("Load Decoder model") - decoder_opts = LatticeFasterDecoderOptions() - decoder_opts.beam = self.DECODER_BEAM - decoder_opts.max_active = self.DECODER_MAXACT - decoder_opts.min_active = self.DECODER_MINACT - decoder_opts.lattice_beam = self.DECODER_LATBEAM - self.decodable_opts = NnetSimpleLoopedComputationOptions() - self.decodable_opts.acoustic_scale = self.DECODER_ACWT - self.decodable_opts.frame_subsampling_factor = self.DECODER_FSF - self.decodable_opts.frames_per_chunk = 150 + 
speaker.append({'spk'+str(int(spkrs[i][2])) : text}) + i+=1 + text="" + speaker.append({'spk'+str(int(spkrs[i][2])) : text}) - # Load Acoustic and graph models and other files - self.transition_model, self.acoustic_model = NnetRecognizer.read_model(self.AM_PATH+"/final.mdl") - graph = _fst.read_fst_kaldi(self.LM_PATH+"/HCLG.fst") - self.decoder_graph = LatticeFasterOnlineDecoder(graph, decoder_opts) - self.symbols = _fst.SymbolTable.read_text(self.LM_PATH+"/words.txt") - self.info = WordBoundaryInfo.from_file(WordBoundaryInfoNewOpts(),self.LM_PATH+"/word_boundary.int") - del graph, decoder_opts - except Exception as e: - self.log.error(e) - raise ValueError("AM and LM loading failed!!! (see logs for more details)") - - def get_sample_rate(self): - return self.samp_freq - - def get_frames(self,feat_pipeline): - rows = feat_pipeline.num_frames_ready() - cols = feat_pipeline.dim() - frames = Matrix(rows,cols) - feat_pipeline.get_frames(range(rows),frames) - return frames[:,:self.feat_info.mfcc_opts.num_ceps], frames[:,self.feat_info.mfcc_opts.num_ceps:] - # return feats + ivectors - - def compute_feat(self,audio): - try: - feat_pipeline = OnlineNnetFeaturePipeline(self.feat_info) - feat_pipeline.accept_waveform(audio.sr, audio.getDataKaldyVector()) - feat_pipeline.input_finished() - except Exception as e: - self.log.error(e) - raise ValueError("Feature extraction failed!!!") - else: - return feat_pipeline - - def decoder(self,feats): - try: - start_time = time.time() - self.log.info("Start Decoding: %s" % (start_time)) - asr = NnetLatticeFasterOnlineRecognizer(self.transition_model, self.acoustic_model, self.decoder_graph, - self.symbols, decodable_opts= self.decodable_opts, endpoint_opts=self.endpoint_opts) - asr.set_input_pipeline(feats) - decode = asr.decode() - self.log.info("Decode time in seconds: %s" % (time.time() - start_time)) - except Exception as e: - self.log.error(e) - raise ValueError("Decoder failed to transcribe the input audio!!!") - else: - return decode - - def wordTimestamp(self,decode): - try: - _fst.utils.scale_compact_lattice([[1.0, 0],[0, float(self.DECODER_ACWT)]], decode['lattice']) - bestPath = compact_lattice_shortest_path(decode['lattice']) - _fst.utils.scale_compact_lattice([[1.0, 0],[0, 1.0/float(self.DECODER_ACWT)]], bestPath) - bestLattice = word_align_lattice(bestPath, self.transition_model, self.info, 0) - alignment = compact_lattice_to_word_alignment(bestLattice[1]) - words = _fst.indices_to_symbols(self.symbols, alignment[0]) - except Exception as e: - self.log.error(e) - raise ValueError("Decoder failed to create the word timestamps!!!") + metadata["speakers"]=speaker + + # vad = metadata['silweights'] + # weights = np.zeros(shape=(vad[len(vad)-2]+1,)) + # id = [] + # w = [] + # for i in range(0, len(vad), 2): + # id.append(vad[i]) + # w.append(vad[i+1]) + # weights[vad[i]] = vad[i+1] + # self.log.info(id) + # self.log.info(w) + # self.log.info(weights) + + del metadata['features'] + del metadata['segments'] + + return metadata else: - return { - "words":words, - "start":alignment[1], - "dur":alignment[2] - } + return {'speakers': [], 'text': '', 'words': []} + +# def process_metadata_conversation_manager(self, metadata): +# features = metadata['features'] +# seg = metadata['segments'] if metadata['segments'] is not None else [] +# feats = np.array(features) +# feats = np.squeeze(feats) +# mask = np.ones(shape=(feats.shape[0],)) +# +# for pos in seg: +# mask[pos-30:pos]=0 +# +# spk = SpeakerDiarization() +# spk.set_maxNrSpeakers(10) +# spkrs = 
spk.run(feats,mask) +# +# speakers = [] +# text = [] +# i = 0 +# text_ = "" +# words=[] +# if 'words' in metadata: +# for word in metadata['words']: +# if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: +# text_ += word["word"] + " " +# words.append(word) +# else: +# speaker = {} +# speaker["btime"]=words[0]["start"] +# speaker["etime"]=words[len(words)-1]["end"] +# speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) +# speaker["words"]=words +# +# text.append('spk'+str(int(spkrs[i][2]))+' : '+text_) +# speakers.append(speaker) +# +# words=[] +# text_="" +# i+=1 +# +# speaker = {} +# speaker["btime"]=words[0]["start"] +# speaker["etime"]=words[len(words)-1]["end"] +# speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) +# speaker["words"]=words +# +# text.append('spk'+str(int(spkrs[i][2]))+' : '+text_) +# speakers.append(speaker) +# return json.dumps({'speakers': speakers, 'text': text}) +# else: +# return json.dumps({'speakers': [], 'text': '', 'words': []}) + class SpeakerDiarization: def __init__(self): - self.log = logging.getLogger('__stt-standelone-worker__.SPKDiarization') - - ### MFCC FEATURES PARAMETERS - self.frame_length_s=0.025 - self.frame_shift_s=0.01 - self.num_bins=40 - self.num_ceps=40 - self.low_freq=40 - self.high_freq=-200 - ##### + self.log = logging.getLogger( + '__stt-standelone-worker__.SPKDiarization') - ### VAD PARAMETERS - self.vad_ops = VadEnergyOptions() - self.vad_ops.vad_energy_mean_scale = 0.9 - self.vad_ops.vad_energy_threshold = 5 - #vad_ops.vad_frames_context = 2 - #vad_ops.vad_proportion_threshold = 0.12 + # MFCC FEATURES PARAMETERS + self.frame_length_s = 0.025 + self.frame_shift_s = 0.01 + self.num_bins = 40 + self.num_ceps = 40 + self.low_freq = 40 + self.high_freq = -200 ##### - ### Segment - self.seg_length = 100 # Window size in frames - self.seg_increment = 100 # Window increment after and before window in frames - self.seg_rate = 100 # Window shifting in frames + # Segment + self.seg_length = 100 # Window size in frames + self.seg_increment = 100 # Window increment after and before window in frames + self.seg_rate = 100 # Window shifting in frames ##### - ### KBM - self.minimumNumberOfInitialGaussians = 1024 # Minimum number of Gaussians in the initial pool - self.maximumKBMWindowRate = 50 # Maximum window rate for Gaussian computation - self.windowLength = 200 # Window length for computing Gaussians - self.kbmSize = 320 # Number of final Gaussian components in the KBM - self.useRelativeKBMsize = 1 # If set to 1, the KBM size is set as a proportion, given by "relKBMsize", of the pool size - self.relKBMsize = 0.3 # Relative KBM size if "useRelativeKBMsize = 1" (value between 0 and 1). + # KBM + # Minimum number of Gaussians in the initial pool + self.minimumNumberOfInitialGaussians = 1024 + self.maximumKBMWindowRate = 50 # Maximum window rate for Gaussian computation + self.windowLength = 200 # Window length for computing Gaussians + self.kbmSize = 320 # Number of final Gaussian components in the KBM + # If set to 1, the KBM size is set as a proportion, given by "relKBMsize", of the pool size + self.useRelativeKBMsize = 1 + # Relative KBM size if "useRelativeKBMsize = 1" (value between 0 and 1). 
+ self.relKBMsize = 0.3 ###### - ### BINARY_KEY - self.topGaussiansPerFrame = 5 # Number of top selected components per frame - self.bitsPerSegmentFactor = 0.2 # Percentage of bits set to 1 in the binary keys + # BINARY_KEY + self.topGaussiansPerFrame = 5 # Number of top selected components per frame + self.bitsPerSegmentFactor = 0.2 # Percentage of bits set to 1 in the binary keys ###### - ### CLUSTERING - self.N_init = 16 # Number of initial clusters - self.linkage = 0 # Set to one to perform linkage clustering instead of clustering/reassignment - self.linkageCriterion = 'average' # Linkage criterion used if linkage==1 ('average', 'single', 'complete') - self.metric = 'cosine' # Similarity metric: 'cosine' for cumulative vectors, and 'jaccard' for binary keys + # CLUSTERING + self.N_init = 16 # Number of initial clusters + # Set to one to perform linkage clustering instead of clustering/reassignment + self.linkage = 0 + # Linkage criterion used if linkage==1 ('average', 'single', 'complete') + self.linkageCriterion = 'average' + # Similarity metric: 'cosine' for cumulative vectors, and 'jaccard' for binary keys + self.metric = 'cosine' ###### - ### CLUSTERING_SELECTION - self.metric_clusteringSelection = 'cosine' # Distance metric used in the selection of the output clustering solution ('jaccard','cosine') - self.bestClusteringCriterion = 'elbow' # Method employed for number of clusters selection. Can be either 'elbow' for an elbow criterion based on within-class sum of squares (WCSS) or 'spectral' for spectral clustering - self.sigma = 1 # Spectral clustering parameters, employed if bestClusteringCriterion == spectral + # CLUSTERING_SELECTION + # Distance metric used in the selection of the output clustering solution ('jaccard','cosine') + self.metric_clusteringSelection = 'cosine' + # Method employed for number of clusters selection. Can be either 'elbow' for an elbow criterion based on within-class sum of squares (WCSS) or 'spectral' for spectral clustering + self.bestClusteringCriterion = 'elbow' + self.sigma = 1 # Spectral clustering parameters, employed if bestClusteringCriterion == spectral self.percentile = 40 - self.maxNrSpeakers = 16 # If known, max nr of speakers in a sesssion in the database. This is to limit the effect of changes in very small meaningless eigenvalues values generating huge eigengaps + self.maxNrSpeakers = 16 # If known, max nr of speakers in a sesssion in the database. 
This is to limit the effect of changes in very small meaningless eigenvalues values generating huge eigengaps ###### - ### RESEGMENTATION - self.resegmentation = 1 # Set to 1 to perform re-segmentation - self.modelSize = 6 # Number of GMM components - self.nbIter = 10 # Number of expectation-maximization (EM) iterations - self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames + # RESEGMENTATION + self.resegmentation = 1 # Set to 1 to perform re-segmentation + self.modelSize = 6 # Number of GMM components + self.nbIter = 10 # Number of expectation-maximization (EM) iterations + self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames ###### - - def set_maxNrSpeakers(self,nbr): - self.maxNrSpeakers = nbr - - def compute_feat_Librosa(self,audio): - try: - self.log.info("Start feature extraction: %s" % (time.time())) - if audio.sr == 16000: - self.low_freq=20 - self.high_freq=7600 - data = audio.data/32768 - frame_length_inSample = self.frame_length_s * audio.sr - hop = int(self.frame_shift_s * audio.sr) - NFFT = int(2**np.ceil(np.log2(frame_length_inSample))) - mfccNumpy = librosa.feature.mfcc(y=data, - sr=audio.sr, - dct_type=2, - n_mfcc=self.num_ceps, - n_mels=self.num_bins, - n_fft=NFFT, - hop_length=hop, - fmin=self.low_freq, - fmax=self.high_freq).T - except Exception as e: - self.log.error(e) - raise ValueError("Speaker diarization failed when extracting features!!!") - else: - return mfccNumpy - - def compute_feat_KALDI(self,audio): - try: - self.log.info("Start feature extraction: %s" % (time.time())) - po = ParseOptions("") - mfcc_opts = MfccOptions() - mfcc_opts.use_energy = False - mfcc_opts.frame_opts.samp_freq = audio.sr - mfcc_opts.frame_opts.frame_length_ms = self.frame_length_s*1000 - mfcc_opts.frame_opts.frame_shift_ms = self.frame_shift_s*1000 - mfcc_opts.frame_opts.allow_downsample = False - mfcc_opts.mel_opts.num_bins = self.num_bins - mfcc_opts.mel_opts.low_freq = self.low_freq - mfcc_opts.mel_opts.high_freq = self.high_freq - mfcc_opts.num_ceps = self.num_ceps - mfcc_opts.register(po) - - # Create MFCC object and obtain sample frequency - mfccObj = Mfcc(mfcc_opts) - mfccKaldi = mfccObj.compute_features(audio.getDataKaldyVector(), audio.sr, 1.0) - except Exception as e: - self.log.error(e) - raise ValueError("Speaker diarization failed while extracting features!!!") - else: - return mfccKaldi - - def computeVAD_WEBRTC(self, audio): - try: - self.log.info("Start VAD: %s" % (time.time())) - data = audio.data/32768 - hop = 30 - va_framed = py_webrtcvad(data, fs=audio.sr, fs_vad=audio.sr, hoplength=hop, vad_mode=0) - segments = get_py_webrtcvad_segments(va_framed,audio.sr) - maskSAD = np.zeros([1,nFeatures]) - for seg in segments: - start=int(np.round(seg[0]/frame_shift_s)) - end=int(np.round(seg[1]/frame_shift_s)) - maskSAD[0][start:end]=1 - except Exception as e: - self.log.error(e) - raise ValueError("Speaker diarization failed while voice activity detection!!!") - else: - return maskSAD - - def computeVAD_KALDI(self, audio, feats=None): - try: - self.log.info("Start VAD: %s" % (time.time())) - vadStream = compute_vad_energy(self.vad_ops,feats) - vad = Vector(vadStream) - VAD = vad.numpy() - - ### segmentation - occurence=[] - value=[] - occurence.append(1) - value.append(VAD[0]) - - # compute the speech and non-speech frames - for i in range(1,len(VAD)): - if value[-1] == VAD[i]: - occurence[-1]+=1 - else: - occurence.append(1) - value.append(VAD[i]) - - # filter the speech and non-speech segments that are below 30 frames - i 
= 0 - while(i < len(occurence)): - if i != 0 and (occurence[i] < 30 or value[i-1] == value[i]): - occurence[i-1] += occurence[i] - del value[i] - del occurence[i] - else: - i+=1 - # split if and only if the silence is above 50 frames - i = 0 - while(i < len(occurence)): - if i != 0 and ((occurence[i] < 30 and value[i] == 0.0) or value[i-1] == value[i]): - occurence[i-1] += occurence[i] - del value[i] - del occurence[i] - else: - i+=1 - - # compute VAD mask - maskSAD = np.zeros(len(VAD)) - start=0 - for i in range(len(occurence)): - if value[i] == 1.0: - end=start+occurence[i] - maskSAD[start:end] = 1 - start=end - else: - start += occurence[i] - - maskSAD = np.expand_dims(maskSAD, axis=0) - except ValueError as v: - self.log.error(v) - except Exception as e: - self.log.error(e) - raise ValueError("Speaker diarization failed while voice activity detection!!!") - else: - return maskSAD + def set_maxNrSpeakers(self, nbr): + self.maxNrSpeakers = nbr - def run(self, audio, feats=None): + def run(self, feats, mask): try: def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): - numberOfSpeechFeatures = finalSegmentTable[-1,2].astype(int)+1 - solutionVector = np.zeros([1,numberOfSpeechFeatures]) - for i in np.arange(np.size(finalSegmentTable,0)): - solutionVector[0,np.arange(finalSegmentTable[i,1],finalSegmentTable[i,2]+1).astype(int)]=finalClusteringTable[i] - seg = np.empty([0,3]) + numberOfSpeechFeatures = finalSegmentTable[-1, 2].astype(int)+1 + solutionVector = np.zeros([1, numberOfSpeechFeatures]) + for i in np.arange(np.size(finalSegmentTable, 0)): + solutionVector[0, np.arange( + finalSegmentTable[i, 1], finalSegmentTable[i, 2]+1).astype(int)] = finalClusteringTable[i] + seg = np.empty([0, 3]) solutionDiff = np.diff(solutionVector)[0] first = 0 - for i in np.arange(0,np.size(solutionDiff,0)): + for i in np.arange(0, np.size(solutionDiff, 0)): if solutionDiff[i]: last = i+1 seg1 = (first)*frameshift seg2 = (last-first)*frameshift - seg3 = solutionVector[0,last-1] + seg3 = solutionVector[0, last-1] if seg.shape[0] != 0 and seg3 == seg[-1][2]: seg[-1][1] += seg2 - elif seg3 and seg2 > 0.3: # and seg2 > 0.1 - seg = np.vstack((seg,[seg1,seg2,seg3])) + elif seg3 and seg2 > 0.3: # and seg2 > 0.1 + seg = np.vstack((seg, [seg1, seg2, seg3])) first = i+1 - last = np.size(solutionVector,1) + last = np.size(solutionVector, 1) seg1 = (first-1)*frameshift seg2 = (last-first+1)*frameshift - seg3 = solutionVector[0,last-1] + seg3 = solutionVector[0, last-1] if seg3 == seg[-1][2]: seg[-1][1] += seg2 - elif seg3 and seg2 > 0.3: # and seg2 > 0.1 - seg = np.vstack((seg,[seg1,seg2,seg3])) - seg = np.vstack((seg,[dur,-1,-1])) - seg[0][0]=0.0 + elif seg3 and seg2 > 0.3: # and seg2 > 0.1 + seg = np.vstack((seg, [seg1, seg2, seg3])) + seg = np.vstack((seg, [dur, -1, -1])) + seg[0][0] = 0.0 return seg - start_time = time.time() - self.log.info("Start Speaker Diarization: %s" % (start_time)) - if self.maxNrSpeakers == 1 or audio.dur < 3: - self.log.info("Speaker Diarization time in seconds: %s" % (time.time() - start_time)) - return [[0, audio.dur, 1], - [audio.dur, -1, -1]] - if feats == None: - feats = self.compute_feat_KALDI(audio) + nFeatures = feats.shape[0] - maskSAD = self.computeVAD_KALDI(audio,feats) - maskUEM = np.ones([1,nFeatures]) + duration = nFeatures * self.frame_shift_s + + if duration < 5: + return [[0, duration, 1], + [duration, -1, -1]] - mask = np.logical_and(maskUEM,maskSAD) + maskSAD = mask + maskUEM = np.ones([1, nFeatures]) + + mask = np.logical_and(maskUEM, maskSAD) mask 
= mask[0][0:nFeatures] - nSpeechFeatures=np.sum(mask) + nSpeechFeatures = np.sum(mask) speechMapping = np.zeros(nFeatures) - #you need to start the mapping from 1 and end it in the actual number of features independently of the indexing style - #so that we don't lose features on the way - speechMapping[np.nonzero(mask)] = np.arange(1,nSpeechFeatures+1) - data=feats[np.where(mask==1)] + # you need to start the mapping from 1 and end it in the actual number of features independently of the indexing style + # so that we don't lose features on the way + speechMapping[np.nonzero(mask)] = np.arange(1, nSpeechFeatures+1) + data = feats[np.where(mask == 1)] del feats - segmentTable=getSegmentTable(mask,speechMapping,self.seg_length,self.seg_increment,self.seg_rate) - numberOfSegments=np.size(segmentTable,0) - #create the KBM - #set the window rate in order to obtain "minimumNumberOfInitialGaussians" gaussians + segmentTable = getSegmentTable( + mask, speechMapping, self.seg_length, self.seg_increment, self.seg_rate) + numberOfSegments = np.size(segmentTable, 0) + # create the KBM + # set the window rate in order to obtain "minimumNumberOfInitialGaussians" gaussians if np.floor((nSpeechFeatures-self.windowLength)/self.minimumNumberOfInitialGaussians) < self.maximumKBMWindowRate: - windowRate = int(np.floor((np.size(data,0)-self.windowLength)/self.minimumNumberOfInitialGaussians)) + windowRate = int(np.floor( + (np.size(data, 0)-self.windowLength)/self.minimumNumberOfInitialGaussians)) else: windowRate = int(self.maximumKBMWindowRate) - + if windowRate == 0: - raise ValueError('The audio is to short in order to perform the speaker diarization!!!') - + #self.log.info('The audio is to short in order to perform the speaker diarization!!!') + return [[0, duration, 1], + [duration, -1, -1]] + poolSize = np.floor((nSpeechFeatures-self.windowLength)/windowRate) - if self.useRelativeKBMsize: + if self.useRelativeKBMsize: kbmSize = int(np.floor(poolSize*self.relKBMsize)) else: kbmSize = int(self.kbmSize) - - #Training pool of',int(poolSize),'gaussians with a rate of',int(windowRate),'frames' - kbm, gmPool = trainKBM(data,self.windowLength,windowRate,kbmSize) - + + # Training pool of',int(poolSize),'gaussians with a rate of',int(windowRate),'frames' + kbm, gmPool = trainKBM( + data, self.windowLength, windowRate, kbmSize) + #'Selected',kbmSize,'gaussians from the pool' - Vg = getVgMatrix(data,gmPool,kbm,self.topGaussiansPerFrame) - + Vg = getVgMatrix(data, gmPool, kbm, self.topGaussiansPerFrame) + #'Computing binary keys for all segments... ' - segmentBKTable, segmentCVTable = getSegmentBKs(segmentTable, kbmSize, Vg, self.bitsPerSegmentFactor, speechMapping) - + segmentBKTable, segmentCVTable = getSegmentBKs( + segmentTable, kbmSize, Vg, self.bitsPerSegmentFactor, speechMapping) + #'Performing initial clustering... ' - initialClustering = np.digitize(np.arange(numberOfSegments),np.arange(0,numberOfSegments,numberOfSegments/self.N_init)) - - + initialClustering = np.digitize(np.arange(numberOfSegments), np.arange( + 0, numberOfSegments, numberOfSegments/self.N_init)) + #'Performing agglomerative clustering... 
' if self.linkage: - finalClusteringTable, k = performClusteringLinkage(segmentBKTable, segmentCVTable, self.N_init, self.linkageCriterion, self.metric) + finalClusteringTable, k = performClusteringLinkage( + segmentBKTable, segmentCVTable, self.N_init, self.linkageCriterion, self.metric) else: - finalClusteringTable, k = performClustering(speechMapping, segmentTable, segmentBKTable, segmentCVTable, Vg, self.bitsPerSegmentFactor, kbmSize, self.N_init, initialClustering, self.metric) + finalClusteringTable, k = performClustering( + speechMapping, segmentTable, segmentBKTable, segmentCVTable, Vg, self.bitsPerSegmentFactor, kbmSize, self.N_init, initialClustering, self.metric) #'Selecting best clustering...' if self.bestClusteringCriterion == 'elbow': - bestClusteringID = getBestClustering(self.metric_clusteringSelection, segmentBKTable, segmentCVTable, finalClusteringTable, k, self.maxNrSpeakers) + bestClusteringID = getBestClustering( + self.metric_clusteringSelection, segmentBKTable, segmentCVTable, finalClusteringTable, k, self.maxNrSpeakers) elif self.bestClusteringCriterion == 'spectral': - bestClusteringID = getSpectralClustering(self.metric_clusteringSelection,finalClusteringTable,self.N_init,segmentBKTable,segmentCVTable,k,self.sigma,self.percentile,self.maxNrSpeakers)+1 - - if self.resegmentation and np.size(np.unique(finalClusteringTable[:,bestClusteringID.astype(int)-1]),0)>1: - finalClusteringTableResegmentation,finalSegmentTable = performResegmentation(data,speechMapping, mask,finalClusteringTable[:,bestClusteringID.astype(int)-1],segmentTable,self.modelSize,self.nbIter,self.smoothWin,nSpeechFeatures) - seg = getSegments(self.frame_shift_s,finalSegmentTable, np.squeeze(finalClusteringTableResegmentation), audio.dur) + bestClusteringID = getSpectralClustering(self.metric_clusteringSelection, finalClusteringTable, + self.N_init, segmentBKTable, segmentCVTable, k, self.sigma, self.percentile, self.maxNrSpeakers)+1 + + if self.resegmentation and np.size(np.unique(finalClusteringTable[:, bestClusteringID.astype(int)-1]), 0) > 1: + finalClusteringTableResegmentation, finalSegmentTable = performResegmentation(data, speechMapping, mask, finalClusteringTable[:, bestClusteringID.astype( + int)-1], segmentTable, self.modelSize, self.nbIter, self.smoothWin, nSpeechFeatures) + seg = getSegments(self.frame_shift_s, finalSegmentTable, np.squeeze( + finalClusteringTableResegmentation), duration) else: - seg = getSegmentationFile(self.frame_shift_s,segmentTable, finalClusteringTable[:,bestClusteringID.astype(int)-1]) + return None + self.log.info("Speaker Diarization time in seconds: %s" % (time.time() - start_time)) except ValueError as v: self.log.info(v) - return [[0, audio.dur, 1], - [audio.dur, -1, -1]] + return [[0, duration, 1], + [duration, -1, -1]] except Exception as e: self.log.error(e) - raise ValueError("Speaker Diarization failed!!!") + return None else: return seg - -class SttStandelone: - def __init__(self,metadata=False,spkDiarization=False): - self.log = logging.getLogger('__stt-standelone-worker__.SttStandelone') - self.metadata = metadata - self.spkDiarization = spkDiarization - self.timestamp = True if self.metadata or self.spkDiarization else False - - def run(self,audio,asr,spk): - feats = asr.compute_feat(audio) - mfcc, ivector = asr.get_frames(feats) - if self.spkDiarization: - with ThreadPoolExecutor(max_workers=2) as executor: - thrd1 = executor.submit(asr.decoder, feats) - thrd2 = executor.submit(spk.run, audio, mfcc) - decode = thrd1.result() - spkSeg = 
thrd2.result() - else: - decode = asr.decoder(feats) - spkSeg = [] - - if self.timestamp: - timestamps = asr.wordTimestamp(decode) - output = self.getOutput(timestamps,asr.frame_shift, asr.decodable_opts.frame_subsampling_factor,spkSeg) - if self.metadata: - return output - else: - return {"text":output["text"]} - else: - text = re.sub(r"#nonterm:[^ ]* ", "", decode["text"]) - return text - - def getOutput(self,timestamps,frame_shift, frame_subsampling, spkSeg = []): - output = {} - if len(spkSeg) == 0: - text = "" - output["words"] = [] - for i in range(len(timestamps["words"])): - if timestamps["words"][i] != "": - meta = {} - meta["word"] = timestamps["words"][i] - meta["btime"] = round(timestamps["start"][i] * frame_shift * frame_subsampling,2) - meta["etime"] = round((timestamps["start"][i]+timestamps["dur"][i]) * frame_shift * frame_subsampling, 2) - output["words"].append(meta) - text += " "+meta["word"] - output["text"] = text - else: - output["speakers"] = [] - output["text"] = [] - j = 0 - newSpk = 1 - for i in range(len(timestamps["words"])): - if timestamps["words"][i] != "": - if newSpk: - speaker = {} - speaker["speaker_id"] = "spk_"+str(int(spkSeg[j][2])) - speaker["words"] = [] - txtSpk = speaker["speaker_id"]+":" - newSpk = 0 - word = {} - word["word"] = timestamps["words"][i] - word["btime"] = round(timestamps["start"][i] * frame_shift * frame_subsampling,2) - word["etime"] = round((timestamps["start"][i]+timestamps["dur"][i]) * frame_shift * frame_subsampling, 2) - speaker["words"].append(word) - txtSpk += " "+word["word"] - if word["etime"] > spkSeg[j+1][0]: - speaker["btime"] = speaker["words"][0]["btime"] - speaker["etime"] = speaker["words"][-1]["etime"] - output["speakers"].append(speaker) - output["text"].append(txtSpk) - newSpk = 1 - j += 1 - #add the last speaker to the output speakers - speaker["btime"] = speaker["words"][0]["btime"] - speaker["etime"] = speaker["words"][-1]["etime"] - output["speakers"].append(speaker) - output["text"].append(txtSpk) - return output - - -class Audio: - def __init__(self,sr): - self.log = logging.getLogger('__stt-standelone-worker__.Audio') - self.bit = 16 - self.channels = 1 - self.sr = sr - - def set_logger(self,log): - self.log = log - - def transform(self,file_name): - try: - tfm = sox.Transformer() - tfm.set_output_format(rate=self.sr, - bits=self.bit, - channels=self.channels) - self.data = tfm.build_array(input_filepath=file_name) - self.dur = len(self.data) / self.sr - except Exception as e: - self.log.error(e) - raise ValueError("The uploaded file format is not supported!!!") - - def getDataKaldyVector(self): - return Vector(self.data) \ No newline at end of file diff --git a/vosk-api b/vosk-api new file mode 160000 index 0000000..fec4a1a --- /dev/null +++ b/vosk-api @@ -0,0 +1 @@ +Subproject commit fec4a1ad76a3c2e66bad84acd5cead2070b3d1b6 From 58fbcecea9f25487ab73dfe36411aa58599a6b45 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 25 Sep 2020 03:16:45 +0200 Subject: [PATCH 021/172] update submodules --- .gitmodules | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.gitmodules b/.gitmodules index 9cea8d6..b131dc4 100644 --- a/.gitmodules +++ b/.gitmodules @@ -1,6 +1,6 @@ [submodule "vosk-api"] path = vosk-api - url = git@github.com:irebai/vosk-api.git + url = https://github.com/irebai/vosk-api.git [submodule "pyBK"] path = pyBK - url = git@github.com:irebai/pyBK.git + url = https://github.com/irebai/pyBK.git From bcb5cc208602ebad9877226cab46cacb60b2d965 Mon Sep 17 00:00:00 2001 From: Ilyes 
Rebai Date: Fri, 25 Sep 2020 14:05:33 +0200 Subject: [PATCH 022/172] add audio file exception --- tools.py | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/tools.py b/tools.py index 8cc3715..8f21607 100644 --- a/tools.py +++ b/tools.py @@ -81,13 +81,15 @@ def swaggerUI(self, app): def getAudio(self,file): - file_path = self.TEMP_FILE_PATH+"/"+file.filename.lower() - file.save(file_path) - self.rate, self.data = scipy.io.wavfile.read(file_path) - - if not self.SAVE_AUDIO: - os.remove(file_path) - + try: + file_path = self.TEMP_FILE_PATH+"/"+file.filename.lower() + file.save(file_path) + self.rate, self.data = scipy.io.wavfile.read(file_path) + if not self.SAVE_AUDIO: + os.remove(file_path) + except Exception as e: + raise ValueError('Unsupported audio file! Only WAVE format is supported.') + # re-create config files def loadConfig(self): # load decoder parameters from "decode.cfg" From a671b6cac69cde158631e4d6252ada5443b4adfc Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 30 Sep 2020 16:53:19 +0200 Subject: [PATCH 023/172] update the response format --- document/swagger.yml | 15 +--- run.py | 41 +++------- tools.py | 187 +++++++++++++++++++------------------------ 3 files changed, 92 insertions(+), 151 deletions(-) diff --git a/document/swagger.yml b/document/swagger.yml index 57e818f..8a93b7c 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -24,22 +24,9 @@ paths: parameters: - name: "file" in: "formData" - description: "Audio File (wav, mp3, aiff, flac, ogg)" + description: "Audio File - Waveform Format" required: true type: "file" - - name: "nbrSpeaker" - in: "formData" - description: "Number of speakers in the audio" - required: false - type: "number" - default: 1 - - name: "speaker" - in: "formData" - description: "Do speaker diarization" - required: false - type: "string" - enum: [ "Yes", "No" ] - default: "No" responses: 200: description: Successfully transcribe the audio diff --git a/run.py b/run.py index a95cf47..f4dc9c2 100644 --- a/run.py +++ b/run.py @@ -28,51 +28,30 @@ def transcribe(): worker.log.info('[%s] New user entry on /transcribe' % (strftime("%d/%b/%d %H:%M:%S", gmtime()))) - metadata = worker.METADATA - nbrSpk = 10 + is_metadata = False + nbrOfSpk = 10 # get response content type if request.headers.get('accept').lower() == 'application/json': - metadata = True + is_metadata = True elif request.headers.get('accept').lower() == 'text/plain': - metadata = False + is_metadata = False else: raise ValueError('Not accepted header') - # get speaker parameter - spkDiarization = False - if request.form.get('speaker') != None and (request.form.get('speaker').lower() == 'yes' or request.form.get('speaker').lower() == 'no'): - spkDiarization = True if request.form.get( - 'speaker').lower() == 'yes' else False - # get number of speakers parameter - try: - if request.form.get('nbrSpeaker') != None and spkDiarization and int(request.form.get('nbrSpeaker')) > 0: - nbrSpk = int(request.form.get('nbrSpeaker')) - elif request.form.get('nbrSpeaker') != None and spkDiarization: - raise ValueError( - 'Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') - except Exception as e: - worker.log.error(e) - raise ValueError( - 'Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') - else: - if request.form.get('speaker') != None: - raise ValueError('Not accepted "speaker" field value (yes|no)') - # get input file if 'file' in request.files.keys(): file = request.files['file'] worker.getAudio(file) - rec = KaldiRecognizer(model, 
worker.rate, metadata) - response = rec.Decode(worker.data) - if metadata: - obj = rec.GetMetadata() - data = json.loads(obj) - response = worker.process_metadata(data, spkDiarization, nbrSpk) + rec = KaldiRecognizer(model, worker.rate, is_metadata) + data_ = rec.Decode(worker.data) + if is_metadata: + data_ = rec.GetMetadata() + data = worker.get_response(data_, is_metadata, is_metadata, nbrOfSpk) else: raise ValueError('No audio file was uploaded') - return response, 200 + return data, 200 except ValueError as error: return str(error), 400 except Exception as e: diff --git a/tools.py b/tools.py index 8f21607..05fe391 100644 --- a/tools.py +++ b/tools.py @@ -39,7 +39,6 @@ def __init__(self): self.SAVE_AUDIO=False self.SERVICE_PORT = 80 self.NBR_THREADS = 100 - self.METADATA = True self.SWAGGER_URL = '/api-doc' self.SWAGGER_PATH = '' @@ -64,7 +63,6 @@ def __init__(self): self.log.info("Create the new config files") self.loadConfig() - def swaggerUI(self, app): ### swagger specific ### swagger_yml = yaml.load(open(self.SWAGGER_PATH, 'r'), Loader=yaml.Loader) @@ -79,7 +77,6 @@ def swaggerUI(self, app): app.register_blueprint(swaggerui, url_prefix=self.SWAGGER_URL) ### end swagger specific ### - def getAudio(self,file): try: file_path = self.TEMP_FILE_PATH+"/"+file.filename.lower() @@ -167,112 +164,90 @@ def loadConfig(self): else: f.write(id+" nonword\n") - # TODO: metadata (timestamps, speakers, save audio) - # return at the end of streaming a json object including word-data, speaker-data - # (get frames after the end of decoding) - def process_metadata(self, metadata, spkDiarization, nbrSpk=10): - if metadata is not None and 'words' in metadata and 'features' in metadata: - if not spkDiarization: - del metadata['features'] - del metadata['segments'] - return metadata - - - features = metadata['features'] - seg = metadata['segments'] if metadata['segments'] is not None else [] - feats = np.array(features) - feats = np.squeeze(feats) - mask = np.ones(shape=(feats.shape[0],)) - - for pos in seg: - mask[pos-30:pos]=0 - - spk = SpeakerDiarization() - spk.set_maxNrSpeakers(nbrSpk) - spkrs = spk.run(feats,mask) - - speaker = [] - i = 0 - text = "" - for word in metadata['words']: - if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: - text += word["word"] + " " - else: - speaker.append({'spk'+str(int(spkrs[i][2])) : text}) - i+=1 - text="" - speaker.append({'spk'+str(int(spkrs[i][2])) : text}) - - metadata["speakers"]=speaker - - # vad = metadata['silweights'] - # weights = np.zeros(shape=(vad[len(vad)-2]+1,)) - # id = [] - # w = [] - # for i in range(0, len(vad), 2): - # id.append(vad[i]) - # w.append(vad[i+1]) - # weights[vad[i]] = vad[i+1] - # self.log.info(id) - # self.log.info(w) - # self.log.info(weights) - - del metadata['features'] - del metadata['segments'] - - return metadata + # remove extra symbols + def parse_text(self, text): + text = re.sub(r"", "", text) # remove symbol + text = re.sub(r"#nonterm:[^ ]* ", "", text) # remove entity's mark + text = re.sub(r"' ", "'", text) # remove space after quote ' + text = re.sub(r" +", " ", text) # remove multiple spaces + text = text.strip() + return text + + # Postprocess response + def get_response(self, dataJson, is_metadata, is_spkDiarization, nbrOfSpk): + if dataJson is not None: + data = json.loads(dataJson) + if not is_metadata: + text = data['text'] # get text from response + return self.parse_text(text) + + elif 'words' in data and 'features' in data: + if is_spkDiarization: + # Get Features and spoken segments and clean data + 
features = data['features'] + seg = data['segments'] if data['segments'] is not None else [] + del data['features'] + del data['segments'] + + # Prepare the parameters for SpeakerDiarization input + feats = np.array(features) + feats = np.squeeze(feats) + mask = np.ones(shape=(feats.shape[0],)) + for pos in seg: + mask[pos-30:pos]=0 + + # Do speaker diarization and get speaker segments + spk = SpeakerDiarization() + spk.set_maxNrSpeakers(nbrOfSpk) + spkrs = spk.run(feats,mask) + + # Generate final output data + return self.process_output(data, spkrs) + + del data['features'] + del data['segments'] + return data + else: + return {'speakers': [], 'text': '', 'words': []} else: return {'speakers': [], 'text': '', 'words': []} -# def process_metadata_conversation_manager(self, metadata): -# features = metadata['features'] -# seg = metadata['segments'] if metadata['segments'] is not None else [] -# feats = np.array(features) -# feats = np.squeeze(feats) -# mask = np.ones(shape=(feats.shape[0],)) -# -# for pos in seg: -# mask[pos-30:pos]=0 -# -# spk = SpeakerDiarization() -# spk.set_maxNrSpeakers(10) -# spkrs = spk.run(feats,mask) -# -# speakers = [] -# text = [] -# i = 0 -# text_ = "" -# words=[] -# if 'words' in metadata: -# for word in metadata['words']: -# if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: -# text_ += word["word"] + " " -# words.append(word) -# else: -# speaker = {} -# speaker["btime"]=words[0]["start"] -# speaker["etime"]=words[len(words)-1]["end"] -# speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) -# speaker["words"]=words -# -# text.append('spk'+str(int(spkrs[i][2]))+' : '+text_) -# speakers.append(speaker) -# -# words=[] -# text_="" -# i+=1 -# -# speaker = {} -# speaker["btime"]=words[0]["start"] -# speaker["etime"]=words[len(words)-1]["end"] -# speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) -# speaker["words"]=words -# -# text.append('spk'+str(int(spkrs[i][2]))+' : '+text_) -# speakers.append(speaker) -# return json.dumps({'speakers': speakers, 'text': text}) -# else: -# return json.dumps({'speakers': [], 'text': '', 'words': []}) + + # return a json object including word-data, speaker-data + def process_output(self, data, spkrs): + speakers = [] + text = [] + i = 0 + text_ = "" + words=[] + for word in data['words']: + if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: + text_ += word["word"] + " " + words.append(word) + else: + speaker = {} + speaker["start"]=words[0]["start"] + speaker["end"]=words[len(words)-1]["end"] + speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) + speaker["words"]=words + + text.append('spk'+str(int(spkrs[i][2]))+' : '+ self.parse_text(text_)) + speakers.append(speaker) + + words=[word] + text_=word["word"] + " " + i+=1 + + speaker = {} + speaker["start"]=words[0]["start"] + speaker["end"]=words[len(words)-1]["end"] + speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) + speaker["words"]=words + + text.append('spk'+str(int(spkrs[i][2]))+' : '+ self.parse_text(text_)) + speakers.append(speaker) + + return {'speakers': speakers, 'text': text} class SpeakerDiarization: From bc939cf3f119a12d04833d76c50b557ea7de1e6c Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 1 Oct 2020 14:01:43 +0200 Subject: [PATCH 024/172] update installation and remove healthcheck API --- Dockerfile | 3 --- Jenkinsfile | 2 -- run.py | 5 ----- 3 files changed, 10 deletions(-) diff --git a/Dockerfile b/Dockerfile index 604bdb7..1c6f518 100644 --- a/Dockerfile +++ b/Dockerfile @@ -72,9 +72,6 @@ WORKDIR /usr/src/speech-to-text # Install main service packages RUN 
pip3 install flask flask-cors flask-swagger-ui gevent pyyaml -# Set environment variables -ENV PATH /pykaldi/tools/kaldi/egs/wsj/s5/utils/:$PATH - COPY pyBK/diarizationFunctions.py pyBK/diarizationFunctions.py COPY tools.py . COPY run.py . diff --git a/Jenkinsfile b/Jenkinsfile index 5f464c5..b4bdffc 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -24,7 +24,6 @@ pipeline { docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { image.push("${VERSION}") image.push('latest') - image.push('offline') } } } @@ -44,7 +43,6 @@ pipeline { ).trim() docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { image.push('latest-unstable') - image.push('offline') } } } diff --git a/run.py b/run.py index f4dc9c2..b548209 100644 --- a/run.py +++ b/run.py @@ -59,13 +59,8 @@ def transcribe(): return 'Server Error', 500 -@app.route('/healthcheck', methods=['GET']) -def check(): - return '', 200 # Rejected request handlers - - @app.errorhandler(405) def method_not_allowed(error): return 'The method is not allowed for the requested URL', 405 From c07e43a4e6002d2269cafb793c350e60a5cf2f2d Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 2 Oct 2020 19:21:29 +0200 Subject: [PATCH 025/172] update audio file reader and change the production server --- Dockerfile | 2 +- run.py | 13 ++++++------- tools.py | 14 ++++++++------ 3 files changed, 15 insertions(+), 14 deletions(-) diff --git a/Dockerfile b/Dockerfile index 6608943..6fa563b 100644 --- a/Dockerfile +++ b/Dockerfile @@ -115,7 +115,7 @@ RUN git clone --depth 1 https://github.com/pykaldi/pykaldi.git /pykaldi \ WORKDIR /usr/src/speech-to-text # Install main service packages -RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml logger librosa webrtcvad scipy sklearn +RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml logger librosa webrtcvad scipy sklearn gevent RUN apt-get install -y libsox-fmt-all && pip3 install git+https://github.com/rabitt/pysox.git \ && git clone https://github.com/irebai/pyBK.git /pykaldi/tools/pyBK \ && cp /pykaldi/tools/pyBK/diarizationFunctions.py . 
diff --git a/run.py b/run.py index ecdbb18..1195d28 100755 --- a/run.py +++ b/run.py @@ -7,12 +7,13 @@ from tools import ASR, Audio, SpeakerDiarization, SttStandelone import yaml, os, sox, logging from time import gmtime, strftime +from gevent.pywsgi import WSGIServer app = Flask("__stt-standelone-worker__") # Set logger config logger = logging.getLogger(__name__) -logging.basicConfig(level=logging.DEBUG) +logging.basicConfig(level=logging.INFO) # Main parameters AM_PATH = '/opt/models/AM' @@ -61,7 +62,7 @@ def swaggerUI(): def getAudio(file,audio): file_path = TEMP_FILE_PATH+file.filename.lower() file.save(file_path) - audio.transform(file_path) + audio.read_audio(file_path) if not SAVE_AUDIO: os.remove(file_path) @@ -116,10 +117,6 @@ def transcribe(): app.logger.error(e) return 'Server Error', 500 -@app.route('/healthcheck', methods=['GET']) -def check(): - return '', 200 - # Rejected request handlers @app.errorhandler(405) def method_not_allowed(error): @@ -144,7 +141,9 @@ def server_error(error): asr.run() #Run server - app.run(host='0.0.0.0', port=SERVICE_PORT, debug=False, threaded=False, processes=NBR_PROCESSES) + app.logger.info('Server ready for transcription...') + http_server = WSGIServer(('', SERVICE_PORT), app) + http_server.serve_forever() except Exception as e: app.logger.error(e) exit(e) \ No newline at end of file diff --git a/tools.py b/tools.py index 168bb1a..d021a58 100644 --- a/tools.py +++ b/tools.py @@ -37,6 +37,7 @@ ## other packages import configparser, sys, os, re, sox, time, logging from concurrent.futures import ThreadPoolExecutor +import scipy.io.wavfile ############## class ASR: @@ -602,13 +603,14 @@ def __init__(self,sr): def set_logger(self,log): self.log = log - def transform(self,file_name): + def read_audio(self, audio): try: - tfm = sox.Transformer() - tfm.set_output_format(rate=self.sr, - bits=self.bit, - channels=self.channels) - self.data = tfm.build_array(input_filepath=file_name) + data, sr = librosa.load(audio,sr=None) + if sr != self.sr: + self.log.info('Resample audio file: '+str(sr)+'Hz -> '+str(self.sr)+'Hz') + data = librosa.resample(data, sr, self.sr) + data = (data * 32767).astype(np.int16) + self.data = data self.dur = len(self.data) / self.sr except Exception as e: self.log.error(e) From 25b0e0d236b85e74cd8c5c0f8c1f97f41fa71e95 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Sat, 3 Oct 2020 15:52:23 +0200 Subject: [PATCH 026/172] clean code --- .envdefault | 3 +- Dockerfile | 5 +- document/swagger.yml | 15 +- run.py | 101 +----- tools.py | 811 ++++++++++++++++++++++--------------------- 5 files changed, 438 insertions(+), 497 deletions(-) diff --git a/.envdefault b/.envdefault index 80acea5..2246e24 100644 --- a/.envdefault +++ b/.envdefault @@ -1,4 +1,3 @@ AM_PATH=/path/to/acoustic/models/dir LM_PATH=/path/to/language/models/dir -SWAGGER_PATH=/path/to/swagger/file -NBR_PROCESSES=1 \ No newline at end of file +SWAGGER_PATH=/path/to/swagger/file \ No newline at end of file diff --git a/Dockerfile b/Dockerfile index 6fa563b..5e9f2fe 100644 --- a/Dockerfile +++ b/Dockerfile @@ -115,9 +115,8 @@ RUN git clone --depth 1 https://github.com/pykaldi/pykaldi.git /pykaldi \ WORKDIR /usr/src/speech-to-text # Install main service packages -RUN pip3 install flask flask-cors flask-swagger-ui configparser pyyaml logger librosa webrtcvad scipy sklearn gevent -RUN apt-get install -y libsox-fmt-all && pip3 install git+https://github.com/rabitt/pysox.git \ - && git clone https://github.com/irebai/pyBK.git /pykaldi/tools/pyBK \ +RUN pip3 install flask 
flask-cors flask-swagger-ui pyyaml librosa gevent +RUN git clone https://github.com/irebai/pyBK.git /pykaldi/tools/pyBK \ && cp /pykaldi/tools/pyBK/diarizationFunctions.py . # Set environment variables diff --git a/document/swagger.yml b/document/swagger.yml index 57e818f..e763d3b 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -24,22 +24,9 @@ paths: parameters: - name: "file" in: "formData" - description: "Audio File (wav, mp3, aiff, flac, ogg)" + description: "Audio File (wav, mp3, flac, ogg)" required: true type: "file" - - name: "nbrSpeaker" - in: "formData" - description: "Number of speakers in the audio" - required: false - type: "number" - default: 1 - - name: "speaker" - in: "formData" - description: "Do speaker diarization" - required: false - type: "string" - enum: [ "Yes", "No" ] - default: "No" responses: 200: description: Successfully transcribe the audio diff --git a/run.py b/run.py index 1195d28..e485961 100755 --- a/run.py +++ b/run.py @@ -2,77 +2,24 @@ # -*- coding: utf-8 -*- from flask import Flask, request, abort, Response, json -from flask_swagger_ui import get_swaggerui_blueprint -from flask_cors import CORS -from tools import ASR, Audio, SpeakerDiarization, SttStandelone -import yaml, os, sox, logging +from tools import ASR, SttStandelone from time import gmtime, strftime from gevent.pywsgi import WSGIServer +import os app = Flask("__stt-standelone-worker__") -# Set logger config -logger = logging.getLogger(__name__) -logging.basicConfig(level=logging.INFO) +stt = SttStandelone() -# Main parameters -AM_PATH = '/opt/models/AM' -LM_PATH = '/opt/models/LM' -TEMP_FILE_PATH = '/opt/tmp' -CONFIG_FILES_PATH = '/opt/config' -NBR_PROCESSES = 1 -SAVE_AUDIO = False -SERVICE_PORT = 80 -SWAGGER_URL = '/api-doc' -SWAGGER_PATH = '' -asr = ASR(AM_PATH,LM_PATH, CONFIG_FILES_PATH) +# Load ASR models (acoustic model and decoding graph) +stt.log.info('Load acoustic model and decoding graph') +asr = ASR(stt.AM_PATH, stt.LM_PATH, stt.CONFIG_FILES_PATH) -if not os.path.isdir(TEMP_FILE_PATH): - os.mkdir(TEMP_FILE_PATH) -if not os.path.isdir(CONFIG_FILES_PATH): - os.mkdir(CONFIG_FILES_PATH) -# Environment parameters -if 'SERVICE_PORT' in os.environ: - SERVICE_PORT = os.environ['SERVICE_PORT'] -if 'SAVE_AUDIO' in os.environ: - SAVE_AUDIO = os.environ['SAVE_AUDIO'] -if 'NBR_PROCESSES' in os.environ: - if int(os.environ['NBR_PROCESSES']) > 0: - NBR_PROCESSES = int(os.environ['NBR_PROCESSES']) - else: - exit("You must to provide a positif number of processes 'NBR_PROCESSES'") -if 'SWAGGER_PATH' in os.environ: - SWAGGER_PATH = os.environ['SWAGGER_PATH'] - -def swaggerUI(): - ### swagger specific ### - swagger_yml = yaml.load(open(SWAGGER_PATH, 'r'), Loader=yaml.Loader) - swaggerui = get_swaggerui_blueprint( - SWAGGER_URL, # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' - SWAGGER_PATH, - config={ # Swagger UI config overrides - 'app_name': "STT API Documentation", - 'spec': swagger_yml - } - ) - app.register_blueprint(swaggerui, url_prefix=SWAGGER_URL) - ### end swagger specific ### - -def getAudio(file,audio): - file_path = TEMP_FILE_PATH+file.filename.lower() - file.save(file_path) - audio.read_audio(file_path) - if not SAVE_AUDIO: - os.remove(file_path) - @app.route('/transcribe', methods=['POST']) def transcribe(): try: - app.logger.info('[%s] New user entry on /transcribe' % (strftime("%d/%b/%d %H:%M:%S", gmtime()))) - # create main objects - spk = SpeakerDiarization() - audio = Audio(asr.get_sample_rate()) + stt.log.info('[%s] New user entry on /transcribe' % 
(strftime("%d/%b/%d %H:%M:%S", gmtime()))) #get response content type metadata = False @@ -83,30 +30,11 @@ def transcribe(): else: raise ValueError('Not accepted header') - #get speaker parameter - spkDiarization = False - if request.form.get('speaker') != None and (request.form.get('speaker').lower() == 'yes' or request.form.get('speaker').lower() == 'no'): - spkDiarization = True if request.form.get('speaker').lower() == 'yes' else False - #get number of speakers parameter - try: - if request.form.get('nbrSpeaker') != None and spkDiarization and int(request.form.get('nbrSpeaker')) > 0: - spk.set_maxNrSpeakers(int(request.form.get('nbrSpeaker'))) - elif request.form.get('nbrSpeaker') != None and spkDiarization: - raise ValueError('Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') - except Exception as e: - app.logger.error(e) - raise ValueError('Not accepted "nbrSpeaker" field value (nbrSpeaker>0)') - else: - if request.form.get('speaker') != None: - raise ValueError('Not accepted "speaker" field value (yes|no)') - - stt = SttStandelone(metadata,spkDiarization) - #get input file if 'file' in request.files.keys(): file = request.files['file'] - getAudio(file,audio) - output = stt.run(audio,asr,spk) + stt.read_audio(file,asr.get_sample_rate()) + output = stt.run(asr, metadata) else: raise ValueError('No audio file was uploaded') @@ -133,16 +61,13 @@ def server_error(error): if __name__ == '__main__': try: - #start SwaggerUI - if SWAGGER_PATH != '': - swaggerUI() - - #Run ASR engine - asr.run() + # start SwaggerUI + if os.path.exists(stt.SWAGGER_PATH): + stt.swaggerUI(app) #Run server app.logger.info('Server ready for transcription...') - http_server = WSGIServer(('', SERVICE_PORT), app) + http_server = WSGIServer(('', stt.SERVICE_PORT), app) http_server.serve_forever() except Exception as e: app.logger.error(e) diff --git a/tools.py b/tools.py index d021a58..cfb6117 100644 --- a/tools.py +++ b/tools.py @@ -1,4 +1,4 @@ -## Kaldi ASR decoder +# Kaldi ASR decoder from kaldi.asr import NnetLatticeFasterOnlineRecognizer from kaldi.decoder import (LatticeFasterDecoderOptions, LatticeFasterOnlineDecoder) @@ -14,17 +14,17 @@ from kaldi.matrix import Matrix, Vector ############## -## word to CTM +# word to CTM from kaldi.lat.align import (WordBoundaryInfoNewOpts, - WordBoundaryInfo, - word_align_lattice) + WordBoundaryInfo, + word_align_lattice) from kaldi.lat.functions import (compact_lattice_to_word_alignment, compact_lattice_shortest_path) from kaldi.asr import NnetRecognizer import kaldi.fstext as _fst ############## -## Speaker Diarization +# Speaker Diarization from diarizationFunctions import * import numpy as np import librosa @@ -34,191 +34,152 @@ from kaldi.util.options import ParseOptions ############## -## other packages -import configparser, sys, os, re, sox, time, logging -from concurrent.futures import ThreadPoolExecutor -import scipy.io.wavfile +# other packages +import configparser, sys, os, re, time, logging, yaml +from flask_swagger_ui import get_swaggerui_blueprint ############## + class ASR: def __init__(self, AM_PATH, LM_PATH, CONFIG_FILES_PATH): self.log = logging.getLogger('__stt-standelone-worker__.ASR') self.AM_PATH = AM_PATH self.LM_PATH = LM_PATH self.CONFIG_FILES_PATH = CONFIG_FILES_PATH - - def run(self): - def loadConfig(self): - #get decoder parameters from "decode.cfg" - decoder_settings = configparser.ConfigParser() - decoder_settings.read(self.AM_PATH+'/decode.cfg') - self.DECODER_SYS = decoder_settings.get('decoder_params', 'decoder') - self.AM_FILE_PATH = 
decoder_settings.get('decoder_params', 'ampath') - self.DECODER_MINACT = int(decoder_settings.get('decoder_params', 'min_active')) - self.DECODER_MAXACT = int(decoder_settings.get('decoder_params', 'max_active')) - self.DECODER_BEAM = float(decoder_settings.get('decoder_params', 'beam')) - self.DECODER_LATBEAM = float(decoder_settings.get('decoder_params', 'lattice_beam')) - self.DECODER_ACWT = float(decoder_settings.get('decoder_params', 'acwt')) - self.DECODER_FSF = int(decoder_settings.get('decoder_params', 'frame_subsampling_factor')) - - #Prepare "online.conf" - self.AM_PATH=self.AM_PATH+"/"+self.AM_FILE_PATH - with open(self.AM_PATH+"/conf/online.conf") as f: - values = f.readlines() - with open(self.CONFIG_FILES_PATH+"/online.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--ivector-extraction-config="+self.CONFIG_FILES_PATH+"/ivector_extractor.conf\n") - f.write("--mfcc-config="+self.AM_PATH+"/conf/mfcc.conf") - - #Prepare "ivector_extractor.conf" - with open(self.AM_PATH+"/conf/ivector_extractor.conf") as f: - values = f.readlines() - with open(self.CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--splice-config="+self.AM_PATH+"/conf/splice.conf\n") - f.write("--cmvn-config="+self.AM_PATH+"/conf/online_cmvn.conf\n") - f.write("--lda-matrix="+self.AM_PATH+"/ivector_extractor/final.mat\n") - f.write("--global-cmvn-stats="+self.AM_PATH+"/ivector_extractor/global_cmvn.stats\n") - f.write("--diag-ubm="+self.AM_PATH+"/ivector_extractor/final.dubm\n") - f.write("--ivector-extractor="+self.AM_PATH+"/ivector_extractor/final.ie") - - #Prepare "word_boundary.int" if not exist - if not os.path.exists(self.LM_PATH+"/word_boundary.int"): - if os.path.exists(self.AM_PATH+"phones.txt"): - with open(self.AM_PATH+"phones.txt") as f: - phones = f.readlines() - - with open(self.LM_PATH+"/word_boundary.int", "w") as f: - for phone in phones: - phone = phone.strip() - phone = re.sub('^ .*','', phone) - phone = re.sub('^#\d+ .*','', phone) - if phone != '': - id = phone.split(' ')[1] - if '_I ' in phone: - f.write(id+" internal\n") - elif '_B ' in phone: - f.write(id+" begin\n") - elif '_E ' in phone: - f.write(id+" end\n") - elif '_S ' in phone: - f.write(id+" singleton\n") - else: - f.write(id+" nonword\n") - - else: - raise ValueError('Neither word_boundary.int nor phones.txt exists!!!') + self.LoadModels() + def LoadModels(self): try: # Define online feature pipeline - self.log.info("Load decoder config") - loadConfig(self) - feat_opts = OnlineNnetFeaturePipelineConfig() - self.endpoint_opts = OnlineEndpointConfig() po = ParseOptions("") - feat_opts.register(po) + + decoder_opts = LatticeFasterDecoderOptions() + self.endpoint_opts = OnlineEndpointConfig() + self.decodable_opts = NnetSimpleLoopedComputationOptions() + feat_opts = OnlineNnetFeaturePipelineConfig() + + + decoder_opts.register(po) self.endpoint_opts.register(po) + self.decodable_opts.register(po) + feat_opts.register(po) + po.read_config_file(self.CONFIG_FILES_PATH+"/online.conf") - self.feat_info = OnlineNnetFeaturePipelineInfo.from_config(feat_opts) - + self.feat_info = OnlineNnetFeaturePipelineInfo.from_config( + feat_opts) + # Set metadata parameters self.samp_freq = self.feat_info.mfcc_opts.frame_opts.samp_freq self.frame_shift = self.feat_info.mfcc_opts.frame_opts.frame_shift_ms / 1000 + self.acwt = self.decodable_opts.acoustic_scale - # Construct recognizer - self.log.info("Load Decoder model") - decoder_opts = LatticeFasterDecoderOptions() - decoder_opts.beam = 
self.DECODER_BEAM - decoder_opts.max_active = self.DECODER_MAXACT - decoder_opts.min_active = self.DECODER_MINACT - decoder_opts.lattice_beam = self.DECODER_LATBEAM - self.decodable_opts = NnetSimpleLoopedComputationOptions() - self.decodable_opts.acoustic_scale = self.DECODER_ACWT - self.decodable_opts.frame_subsampling_factor = self.DECODER_FSF - self.decodable_opts.frames_per_chunk = 150 - # Load Acoustic and graph models and other files - self.transition_model, self.acoustic_model = NnetRecognizer.read_model(self.AM_PATH+"/final.mdl") + self.transition_model, self.acoustic_model = NnetRecognizer.read_model( + self.AM_PATH+"/final.mdl") graph = _fst.read_fst_kaldi(self.LM_PATH+"/HCLG.fst") - self.decoder_graph = LatticeFasterOnlineDecoder(graph, decoder_opts) - self.symbols = _fst.SymbolTable.read_text(self.LM_PATH+"/words.txt") - self.info = WordBoundaryInfo.from_file(WordBoundaryInfoNewOpts(),self.LM_PATH+"/word_boundary.int") + self.decoder_graph = LatticeFasterOnlineDecoder( + graph, decoder_opts) + self.symbols = _fst.SymbolTable.read_text( + self.LM_PATH+"/words.txt") + self.info = WordBoundaryInfo.from_file( + WordBoundaryInfoNewOpts(), self.LM_PATH+"/word_boundary.int") + + + self.asr = NnetLatticeFasterOnlineRecognizer(self.transition_model, self.acoustic_model, self.decoder_graph, + self.symbols, decodable_opts=self.decodable_opts, endpoint_opts=self.endpoint_opts) del graph, decoder_opts except Exception as e: self.log.error(e) - raise ValueError("AM and LM loading failed!!! (see logs for more details)") + raise ValueError( + "AM and LM loading failed!!! (see logs for more details)") def get_sample_rate(self): return self.samp_freq - def get_frames(self,feat_pipeline): + def get_frames(self, feat_pipeline): rows = feat_pipeline.num_frames_ready() cols = feat_pipeline.dim() - frames = Matrix(rows,cols) - feat_pipeline.get_frames(range(rows),frames) - return frames[:,:self.feat_info.mfcc_opts.num_ceps], frames[:,self.feat_info.mfcc_opts.num_ceps:] + frames = Matrix(rows, cols) + feat_pipeline.get_frames(range(rows), frames) + return frames[:, :self.feat_info.mfcc_opts.num_ceps], frames[:, self.feat_info.mfcc_opts.num_ceps:] # return feats + ivectors - - def compute_feat(self,audio): + + def compute_feat(self, wav): try: feat_pipeline = OnlineNnetFeaturePipeline(self.feat_info) - feat_pipeline.accept_waveform(audio.sr, audio.getDataKaldyVector()) + feat_pipeline.accept_waveform(self.samp_freq, wav) feat_pipeline.input_finished() except Exception as e: self.log.error(e) raise ValueError("Feature extraction failed!!!") else: return feat_pipeline - - def decoder(self,feats): + + def decoder(self, feats): try: start_time = time.time() self.log.info("Start Decoding: %s" % (start_time)) - asr = NnetLatticeFasterOnlineRecognizer(self.transition_model, self.acoustic_model, self.decoder_graph, - self.symbols, decodable_opts= self.decodable_opts, endpoint_opts=self.endpoint_opts) - asr.set_input_pipeline(feats) - decode = asr.decode() - self.log.info("Decode time in seconds: %s" % (time.time() - start_time)) + self.asr.set_input_pipeline(feats) + decode = self.asr.decode() + self.log.info("Decode time in seconds: %s" % + (time.time() - start_time)) except Exception as e: self.log.error(e) raise ValueError("Decoder failed to transcribe the input audio!!!") else: return decode - - def wordTimestamp(self,decode): + + def wordTimestamp(self, text, lattice, frame_shift, frame_subsampling): try: - _fst.utils.scale_compact_lattice([[1.0, 0],[0, float(self.DECODER_ACWT)]], decode['lattice']) - 
bestPath = compact_lattice_shortest_path(decode['lattice']) - _fst.utils.scale_compact_lattice([[1.0, 0],[0, 1.0/float(self.DECODER_ACWT)]], bestPath) - bestLattice = word_align_lattice(bestPath, self.transition_model, self.info, 0) + _fst.utils.scale_compact_lattice( + [[1.0, 0], [0, float(self.acwt)]], lattice) + bestPath = compact_lattice_shortest_path(lattice) + _fst.utils.scale_compact_lattice( + [[1.0, 0], [0, 1.0/float(self.acwt)]], bestPath) + bestLattice = word_align_lattice( + bestPath, self.transition_model, self.info, 0) alignment = compact_lattice_to_word_alignment(bestLattice[1]) words = _fst.indices_to_symbols(self.symbols, alignment[0]) + start = alignment[1] + dur = alignment[2] + + output = {} + output["words"] = [] + for i in range(len(words)): + meta = {} + meta["word"] = words[i] + meta["start"] = round(start[i] * frame_shift * frame_subsampling, 2) + meta["end"] = round((start[i]+dur[i]) * frame_shift * frame_subsampling, 2) + output["words"].append(meta) + text += " "+meta["word"] + output["text"] = text + except Exception as e: self.log.error(e) raise ValueError("Decoder failed to create the word timestamps!!!") else: - return { - "words":words, - "start":alignment[1], - "dur":alignment[2] - } + return output + class SpeakerDiarization: - def __init__(self): - self.log = logging.getLogger('__stt-standelone-worker__.SPKDiarization') - - ### MFCC FEATURES PARAMETERS - self.frame_length_s=0.025 - self.frame_shift_s=0.01 - self.num_bins=40 - self.num_ceps=40 - self.low_freq=40 - self.high_freq=-200 + def __init__(self, sample_rate): + self.log = logging.getLogger( + '__stt-standelone-worker__.SPKDiarization') + + # MFCC FEATURES PARAMETERS + self.sr = sample_rate + self.frame_length_s = 0.025 + self.frame_shift_s = 0.01 + self.num_bins = 40 + self.num_ceps = 40 + self.low_freq = 40 + self.high_freq = -200 + if self.sr == 16000: + self.low_freq = 20 + self.high_freq = 7600 ##### - ### VAD PARAMETERS + # VAD PARAMETERS self.vad_ops = VadEnergyOptions() self.vad_ops.vad_energy_mean_scale = 0.9 self.vad_ops.vad_energy_threshold = 5 @@ -226,83 +187,62 @@ def __init__(self): #vad_ops.vad_proportion_threshold = 0.12 ##### - ### Segment - self.seg_length = 100 # Window size in frames - self.seg_increment = 100 # Window increment after and before window in frames - self.seg_rate = 100 # Window shifting in frames + # Segment + self.seg_length = 100 # Window size in frames + self.seg_increment = 100 # Window increment after and before window in frames + self.seg_rate = 100 # Window shifting in frames ##### - ### KBM - self.minimumNumberOfInitialGaussians = 1024 # Minimum number of Gaussians in the initial pool - self.maximumKBMWindowRate = 50 # Maximum window rate for Gaussian computation - self.windowLength = 200 # Window length for computing Gaussians - self.kbmSize = 320 # Number of final Gaussian components in the KBM - self.useRelativeKBMsize = 1 # If set to 1, the KBM size is set as a proportion, given by "relKBMsize", of the pool size - self.relKBMsize = 0.3 # Relative KBM size if "useRelativeKBMsize = 1" (value between 0 and 1). 
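+        # Note: because useRelativeKBMsize = 1 above, the fixed kbmSize = 320 is effectively
+        # ignored and run() recomputes the KBM size as floor(poolSize * relKBMsize)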
+ # KBM + # Minimum number of Gaussians in the initial pool + self.minimumNumberOfInitialGaussians = 1024 + self.maximumKBMWindowRate = 50 # Maximum window rate for Gaussian computation + self.windowLength = 200 # Window length for computing Gaussians + self.kbmSize = 320 # Number of final Gaussian components in the KBM + # If set to 1, the KBM size is set as a proportion, given by "relKBMsize", of the pool size + self.useRelativeKBMsize = 1 + # Relative KBM size if "useRelativeKBMsize = 1" (value between 0 and 1). + self.relKBMsize = 0.3 ###### - ### BINARY_KEY - self.topGaussiansPerFrame = 5 # Number of top selected components per frame - self.bitsPerSegmentFactor = 0.2 # Percentage of bits set to 1 in the binary keys + # BINARY_KEY + self.topGaussiansPerFrame = 5 # Number of top selected components per frame + self.bitsPerSegmentFactor = 0.2 # Percentage of bits set to 1 in the binary keys ###### - ### CLUSTERING - self.N_init = 16 # Number of initial clusters - self.linkage = 0 # Set to one to perform linkage clustering instead of clustering/reassignment - self.linkageCriterion = 'average' # Linkage criterion used if linkage==1 ('average', 'single', 'complete') - self.metric = 'cosine' # Similarity metric: 'cosine' for cumulative vectors, and 'jaccard' for binary keys + # CLUSTERING + self.N_init = 16 # Number of initial clusters + # Set to one to perform linkage clustering instead of clustering/reassignment + self.linkage = 0 + # Linkage criterion used if linkage==1 ('average', 'single', 'complete') + self.linkageCriterion = 'average' + # Similarity metric: 'cosine' for cumulative vectors, and 'jaccard' for binary keys + self.metric = 'cosine' ###### - ### CLUSTERING_SELECTION - self.metric_clusteringSelection = 'cosine' # Distance metric used in the selection of the output clustering solution ('jaccard','cosine') - self.bestClusteringCriterion = 'elbow' # Method employed for number of clusters selection. Can be either 'elbow' for an elbow criterion based on within-class sum of squares (WCSS) or 'spectral' for spectral clustering - self.sigma = 1 # Spectral clustering parameters, employed if bestClusteringCriterion == spectral + # CLUSTERING_SELECTION + # Distance metric used in the selection of the output clustering solution ('jaccard','cosine') + self.metric_clusteringSelection = 'cosine' + # Method employed for number of clusters selection. Can be either 'elbow' for an elbow criterion based on within-class sum of squares (WCSS) or 'spectral' for spectral clustering + self.bestClusteringCriterion = 'elbow' + self.sigma = 1 # Spectral clustering parameters, employed if bestClusteringCriterion == spectral self.percentile = 40 - self.maxNrSpeakers = 16 # If known, max nr of speakers in a sesssion in the database. This is to limit the effect of changes in very small meaningless eigenvalues values generating huge eigengaps + self.maxNrSpeakers = 10 # If known, max nr of speakers in a sesssion in the database. 
This is to limit the effect of changes in very small meaningless eigenvalues values generating huge eigengaps ###### - ### RESEGMENTATION - self.resegmentation = 1 # Set to 1 to perform re-segmentation - self.modelSize = 6 # Number of GMM components - self.nbIter = 10 # Number of expectation-maximization (EM) iterations - self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames + # RESEGMENTATION + self.resegmentation = 1 # Set to 1 to perform re-segmentation + self.modelSize = 6 # Number of GMM components + self.nbIter = 10 # Number of expectation-maximization (EM) iterations + self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames ###### - - def set_maxNrSpeakers(self,nbr): - self.maxNrSpeakers = nbr - - def compute_feat_Librosa(self,audio): - try: - self.log.info("Start feature extraction: %s" % (time.time())) - if audio.sr == 16000: - self.low_freq=20 - self.high_freq=7600 - data = audio.data/32768 - frame_length_inSample = self.frame_length_s * audio.sr - hop = int(self.frame_shift_s * audio.sr) - NFFT = int(2**np.ceil(np.log2(frame_length_inSample))) - mfccNumpy = librosa.feature.mfcc(y=data, - sr=audio.sr, - dct_type=2, - n_mfcc=self.num_ceps, - n_mels=self.num_bins, - n_fft=NFFT, - hop_length=hop, - fmin=self.low_freq, - fmax=self.high_freq).T - except Exception as e: - self.log.error(e) - raise ValueError("Speaker diarization failed when extracting features!!!") - else: - return mfccNumpy - def compute_feat_KALDI(self,audio): + def compute_feat_KALDI(self, wav): try: - self.log.info("Start feature extraction: %s" % (time.time())) po = ParseOptions("") mfcc_opts = MfccOptions() mfcc_opts.use_energy = False - mfcc_opts.frame_opts.samp_freq = audio.sr + mfcc_opts.frame_opts.samp_freq = self.sr mfcc_opts.frame_opts.frame_length_ms = self.frame_length_s*1000 mfcc_opts.frame_opts.frame_shift_ms = self.frame_shift_s*1000 mfcc_opts.frame_opts.allow_downsample = False @@ -311,51 +251,33 @@ def compute_feat_KALDI(self,audio): mfcc_opts.mel_opts.high_freq = self.high_freq mfcc_opts.num_ceps = self.num_ceps mfcc_opts.register(po) - + # Create MFCC object and obtain sample frequency mfccObj = Mfcc(mfcc_opts) - mfccKaldi = mfccObj.compute_features(audio.getDataKaldyVector(), audio.sr, 1.0) + mfccKaldi = mfccObj.compute_features(wav, self.sr, 1.0) except Exception as e: self.log.error(e) - raise ValueError("Speaker diarization failed while extracting features!!!") + raise ValueError( + "Speaker diarization failed while extracting features!!!") else: return mfccKaldi - - def computeVAD_WEBRTC(self, audio): - try: - self.log.info("Start VAD: %s" % (time.time())) - data = audio.data/32768 - hop = 30 - va_framed = py_webrtcvad(data, fs=audio.sr, fs_vad=audio.sr, hoplength=hop, vad_mode=0) - segments = get_py_webrtcvad_segments(va_framed,audio.sr) - maskSAD = np.zeros([1,nFeatures]) - for seg in segments: - start=int(np.round(seg[0]/frame_shift_s)) - end=int(np.round(seg[1]/frame_shift_s)) - maskSAD[0][start:end]=1 - except Exception as e: - self.log.error(e) - raise ValueError("Speaker diarization failed while voice activity detection!!!") - else: - return maskSAD - - def computeVAD_KALDI(self, audio, feats=None): + + def computeVAD_KALDI(self, feats): try: - self.log.info("Start VAD: %s" % (time.time())) - vadStream = compute_vad_energy(self.vad_ops,feats) + vadStream = compute_vad_energy(self.vad_ops, feats) vad = Vector(vadStream) VAD = vad.numpy() - - ### segmentation - occurence=[] - value=[] + + #  segmentation + occurence = [] + value = [] 
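+            # The loop below run-length encodes the frame-level VAD decisions: value[i] holds the
+            # speech/non-speech label (1.0/0.0) of run i and occurence[i] its length in frames,
+            # e.g. VAD = [1, 1, 1, 0, 0, 1] -> value = [1, 0, 1], occurence = [3, 2, 1]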
occurence.append(1) value.append(VAD[0]) # compute the speech and non-speech frames - for i in range(1,len(VAD)): + for i in range(1, len(VAD)): if value[-1] == VAD[i]: - occurence[-1]+=1 + occurence[-1] += 1 else: occurence.append(1) value.append(VAD[i]) @@ -368,7 +290,7 @@ def computeVAD_KALDI(self, audio, feats=None): del value[i] del occurence[i] else: - i+=1 + i += 1 # split if and only if the silence is above 50 frames i = 0 @@ -378,16 +300,16 @@ def computeVAD_KALDI(self, audio, feats=None): del value[i] del occurence[i] else: - i+=1 - + i += 1 + # compute VAD mask maskSAD = np.zeros(len(VAD)) - start=0 + start = 0 for i in range(len(occurence)): if value[i] == 1.0: - end=start+occurence[i] + end = start+occurence[i] maskSAD[start:end] = 1 - start=end + start = end else: start += occurence[i] @@ -396,225 +318,334 @@ def computeVAD_KALDI(self, audio, feats=None): self.log.error(v) except Exception as e: self.log.error(e) - raise ValueError("Speaker diarization failed while voice activity detection!!!") + raise ValueError( + "Speaker diarization failed while voice activity detection!!!") else: return maskSAD - def run(self, audio, feats=None): + def run(self, wav, dur, feats=None): try: def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): - numberOfSpeechFeatures = finalSegmentTable[-1,2].astype(int)+1 - solutionVector = np.zeros([1,numberOfSpeechFeatures]) - for i in np.arange(np.size(finalSegmentTable,0)): - solutionVector[0,np.arange(finalSegmentTable[i,1],finalSegmentTable[i,2]+1).astype(int)]=finalClusteringTable[i] - seg = np.empty([0,3]) + numberOfSpeechFeatures = finalSegmentTable[-1, 2].astype(int)+1 + solutionVector = np.zeros([1, numberOfSpeechFeatures]) + for i in np.arange(np.size(finalSegmentTable, 0)): + solutionVector[0, np.arange( + finalSegmentTable[i, 1], finalSegmentTable[i, 2]+1).astype(int)] = finalClusteringTable[i] + seg = np.empty([0, 3]) solutionDiff = np.diff(solutionVector)[0] first = 0 - for i in np.arange(0,np.size(solutionDiff,0)): + for i in np.arange(0, np.size(solutionDiff, 0)): if solutionDiff[i]: last = i+1 seg1 = (first)*frameshift seg2 = (last-first)*frameshift - seg3 = solutionVector[0,last-1] + seg3 = solutionVector[0, last-1] if seg.shape[0] != 0 and seg3 == seg[-1][2]: seg[-1][1] += seg2 - elif seg3 and seg2 > 0.3: # and seg2 > 0.1 - seg = np.vstack((seg,[seg1,seg2,seg3])) + elif seg3 and seg2 > 0.3: # and seg2 > 0.1 + seg = np.vstack((seg, [seg1, seg2, seg3])) first = i+1 - last = np.size(solutionVector,1) + last = np.size(solutionVector, 1) seg1 = (first-1)*frameshift seg2 = (last-first+1)*frameshift - seg3 = solutionVector[0,last-1] + seg3 = solutionVector[0, last-1] if seg3 == seg[-1][2]: seg[-1][1] += seg2 - elif seg3 and seg2 > 0.3: # and seg2 > 0.1 - seg = np.vstack((seg,[seg1,seg2,seg3])) - seg = np.vstack((seg,[dur,-1,-1])) - seg[0][0]=0.0 + elif seg3 and seg2 > 0.3: # and seg2 > 0.1 + seg = np.vstack((seg, [seg1, seg2, seg3])) + seg = np.vstack((seg, [dur, -1, -1])) + seg[0][0] = 0.0 return seg - + start_time = time.time() self.log.info("Start Speaker Diarization: %s" % (start_time)) - if self.maxNrSpeakers == 1 or audio.dur < 3: - self.log.info("Speaker Diarization time in seconds: %s" % (time.time() - start_time)) - return [[0, audio.dur, 1], - [audio.dur, -1, -1]] + if self.maxNrSpeakers == 1 or dur < 5: + self.log.info("Speaker Diarization time in seconds: %s" % + (time.time() - start_time)) + return [[0, dur, 1], + [dur, -1, -1]] if feats == None: - feats = self.compute_feat_KALDI(audio) + feats = 
self.compute_feat_KALDI(wav) nFeatures = feats.shape[0] - maskSAD = self.computeVAD_KALDI(audio,feats) - maskUEM = np.ones([1,nFeatures]) + maskSAD = self.computeVAD_KALDI(feats) + maskUEM = np.ones([1, nFeatures]) - mask = np.logical_and(maskUEM,maskSAD) + mask = np.logical_and(maskUEM, maskSAD) mask = mask[0][0:nFeatures] - nSpeechFeatures=np.sum(mask) + nSpeechFeatures = np.sum(mask) speechMapping = np.zeros(nFeatures) - #you need to start the mapping from 1 and end it in the actual number of features independently of the indexing style - #so that we don't lose features on the way - speechMapping[np.nonzero(mask)] = np.arange(1,nSpeechFeatures+1) - data=feats[np.where(mask==1)] + # you need to start the mapping from 1 and end it in the actual number of features independently of the indexing style + # so that we don't lose features on the way + speechMapping[np.nonzero(mask)] = np.arange(1, nSpeechFeatures+1) + data = feats[np.where(mask == 1)] del feats - segmentTable=getSegmentTable(mask,speechMapping,self.seg_length,self.seg_increment,self.seg_rate) - numberOfSegments=np.size(segmentTable,0) - #create the KBM - #set the window rate in order to obtain "minimumNumberOfInitialGaussians" gaussians + segmentTable = getSegmentTable( + mask, speechMapping, self.seg_length, self.seg_increment, self.seg_rate) + numberOfSegments = np.size(segmentTable, 0) + # create the KBM + # set the window rate in order to obtain "minimumNumberOfInitialGaussians" gaussians if np.floor((nSpeechFeatures-self.windowLength)/self.minimumNumberOfInitialGaussians) < self.maximumKBMWindowRate: - windowRate = int(np.floor((np.size(data,0)-self.windowLength)/self.minimumNumberOfInitialGaussians)) + windowRate = int(np.floor( + (np.size(data, 0)-self.windowLength)/self.minimumNumberOfInitialGaussians)) else: windowRate = int(self.maximumKBMWindowRate) - + if windowRate == 0: - raise ValueError('The audio is to short in order to perform the speaker diarization!!!') - + raise ValueError( + 'The audio is to short in order to perform the speaker diarization!!!') + poolSize = np.floor((nSpeechFeatures-self.windowLength)/windowRate) - if self.useRelativeKBMsize: + if self.useRelativeKBMsize: kbmSize = int(np.floor(poolSize*self.relKBMsize)) else: kbmSize = int(self.kbmSize) - - #Training pool of',int(poolSize),'gaussians with a rate of',int(windowRate),'frames' - kbm, gmPool = trainKBM(data,self.windowLength,windowRate,kbmSize) - + + # Training pool of',int(poolSize),'gaussians with a rate of',int(windowRate),'frames' + kbm, gmPool = trainKBM( + data, self.windowLength, windowRate, kbmSize) + #'Selected',kbmSize,'gaussians from the pool' - Vg = getVgMatrix(data,gmPool,kbm,self.topGaussiansPerFrame) - + Vg = getVgMatrix(data, gmPool, kbm, self.topGaussiansPerFrame) + #'Computing binary keys for all segments... ' - segmentBKTable, segmentCVTable = getSegmentBKs(segmentTable, kbmSize, Vg, self.bitsPerSegmentFactor, speechMapping) - + segmentBKTable, segmentCVTable = getSegmentBKs( + segmentTable, kbmSize, Vg, self.bitsPerSegmentFactor, speechMapping) + #'Performing initial clustering... ' - initialClustering = np.digitize(np.arange(numberOfSegments),np.arange(0,numberOfSegments,numberOfSegments/self.N_init)) - - + initialClustering = np.digitize(np.arange(numberOfSegments), np.arange( + 0, numberOfSegments, numberOfSegments/self.N_init)) + #'Performing agglomerative clustering... 
' if self.linkage: - finalClusteringTable, k = performClusteringLinkage(segmentBKTable, segmentCVTable, self.N_init, self.linkageCriterion, self.metric) + finalClusteringTable, k = performClusteringLinkage( + segmentBKTable, segmentCVTable, self.N_init, self.linkageCriterion, self.metric) else: - finalClusteringTable, k = performClustering(speechMapping, segmentTable, segmentBKTable, segmentCVTable, Vg, self.bitsPerSegmentFactor, kbmSize, self.N_init, initialClustering, self.metric) + finalClusteringTable, k = performClustering( + speechMapping, segmentTable, segmentBKTable, segmentCVTable, Vg, self.bitsPerSegmentFactor, kbmSize, self.N_init, initialClustering, self.metric) #'Selecting best clustering...' if self.bestClusteringCriterion == 'elbow': - bestClusteringID = getBestClustering(self.metric_clusteringSelection, segmentBKTable, segmentCVTable, finalClusteringTable, k, self.maxNrSpeakers) + bestClusteringID = getBestClustering( + self.metric_clusteringSelection, segmentBKTable, segmentCVTable, finalClusteringTable, k, self.maxNrSpeakers) elif self.bestClusteringCriterion == 'spectral': - bestClusteringID = getSpectralClustering(self.metric_clusteringSelection,finalClusteringTable,self.N_init,segmentBKTable,segmentCVTable,k,self.sigma,self.percentile,self.maxNrSpeakers)+1 - - if self.resegmentation and np.size(np.unique(finalClusteringTable[:,bestClusteringID.astype(int)-1]),0)>1: - finalClusteringTableResegmentation,finalSegmentTable = performResegmentation(data,speechMapping, mask,finalClusteringTable[:,bestClusteringID.astype(int)-1],segmentTable,self.modelSize,self.nbIter,self.smoothWin,nSpeechFeatures) - seg = getSegments(self.frame_shift_s,finalSegmentTable, np.squeeze(finalClusteringTableResegmentation), audio.dur) + bestClusteringID = getSpectralClustering(self.metric_clusteringSelection, finalClusteringTable, + self.N_init, segmentBKTable, segmentCVTable, k, self.sigma, self.percentile, self.maxNrSpeakers)+1 + + if self.resegmentation and np.size(np.unique(finalClusteringTable[:, bestClusteringID.astype(int)-1]), 0) > 1: + finalClusteringTableResegmentation, finalSegmentTable = performResegmentation(data, speechMapping, mask, finalClusteringTable[:, bestClusteringID.astype( + int)-1], segmentTable, self.modelSize, self.nbIter, self.smoothWin, nSpeechFeatures) + seg = getSegments(self.frame_shift_s, finalSegmentTable, np.squeeze(finalClusteringTableResegmentation), dur) else: - seg = getSegmentationFile(self.frame_shift_s,segmentTable, finalClusteringTable[:,bestClusteringID.astype(int)-1]) - self.log.info("Speaker Diarization time in seconds: %s" % (time.time() - start_time)) + seg = getSegmentationFile( + self.frame_shift_s, segmentTable, finalClusteringTable[:, bestClusteringID.astype(int)-1]) + self.log.info("Speaker Diarization time in seconds: %s" % + (time.time() - start_time)) except ValueError as v: self.log.info(v) - return [[0, audio.dur, 1], - [audio.dur, -1, -1]] + return [[0, dur, 1], + [dur, -1, -1]] except Exception as e: self.log.error(e) raise ValueError("Speaker Diarization failed!!!") else: return seg - + + class SttStandelone: - def __init__(self,metadata=False,spkDiarization=False): - self.log = logging.getLogger('__stt-standelone-worker__.SttStandelone') - self.metadata = metadata - self.spkDiarization = spkDiarization - self.timestamp = True if self.metadata or self.spkDiarization else False - - def run(self,audio,asr,spk): - feats = asr.compute_feat(audio) - mfcc, ivector = asr.get_frames(feats) - if self.spkDiarization: - with 
ThreadPoolExecutor(max_workers=2) as executor: - thrd1 = executor.submit(asr.decoder, feats) - thrd2 = executor.submit(spk.run, audio, mfcc) - decode = thrd1.result() - spkSeg = thrd2.result() - else: - decode = asr.decoder(feats) - spkSeg = [] - - if self.timestamp: - timestamps = asr.wordTimestamp(decode) - output = self.getOutput(timestamps,asr.frame_shift, asr.decodable_opts.frame_subsampling_factor,spkSeg) - if self.metadata: - return output - else: - return {"text":output["text"]} - else: - return decode["text"] + def __init__(self): + self.log = logging.getLogger("__stt-standelone-worker-streaming__") + logging.basicConfig(level=logging.INFO) - def getOutput(self,timestamps,frame_shift, frame_subsampling, spkSeg = []): - output = {} - if len(spkSeg) == 0: - text = "" - output["words"] = [] - for i in range(len(timestamps["words"])): - if timestamps["words"][i] != "": - meta = {} - meta["word"] = timestamps["words"][i] - meta["btime"] = round(timestamps["start"][i] * frame_shift * frame_subsampling,2) - meta["etime"] = round((timestamps["start"][i]+timestamps["dur"][i]) * frame_shift * frame_subsampling, 2) - output["words"].append(meta) - text += " "+meta["word"] - output["text"] = text - else: - output["speakers"] = [] - output["text"] = [] - j = 0 - newSpk = 1 - for i in range(len(timestamps["words"])): - if timestamps["words"][i] != "": - if newSpk: - speaker = {} - speaker["speaker_id"] = "spk_"+str(int(spkSeg[j][2])) - speaker["words"] = [] - txtSpk = speaker["speaker_id"]+":" - newSpk = 0 - word = {} - word["word"] = timestamps["words"][i] - word["btime"] = round(timestamps["start"][i] * frame_shift * frame_subsampling,2) - word["etime"] = round((timestamps["start"][i]+timestamps["dur"][i]) * frame_shift * frame_subsampling, 2) - speaker["words"].append(word) - txtSpk += " "+word["word"] - if word["etime"] > spkSeg[j+1][0]: - speaker["btime"] = speaker["words"][0]["btime"] - speaker["etime"] = speaker["words"][-1]["etime"] - output["speakers"].append(speaker) - output["text"].append(txtSpk) - newSpk = 1 - j += 1 - #add the last speaker to the output speakers - speaker["btime"] = speaker["words"][0]["btime"] - speaker["etime"] = speaker["words"][-1]["etime"] - output["speakers"].append(speaker) - output["text"].append(txtSpk) - return output - - -class Audio: - def __init__(self,sr): - self.log = logging.getLogger('__stt-standelone-worker__.Audio') - self.bit = 16 - self.channels = 1 - self.sr = sr - - def set_logger(self,log): - self.log = log - - def read_audio(self, audio): + # Main parameters + self.AM_PATH = '/opt/models/AM' + self.LM_PATH = '/opt/models/LM' + self.TEMP_FILE_PATH = '/opt/tmp' + self.CONFIG_FILES_PATH = '/opt/config' + self.SAVE_AUDIO = False + self.SERVICE_PORT = 80 + self.SWAGGER_URL = '/api-doc' + self.SWAGGER_PATH = None + + if not os.path.isdir(self.TEMP_FILE_PATH): + os.mkdir(self.TEMP_FILE_PATH) + if not os.path.isdir(self.CONFIG_FILES_PATH): + os.mkdir(self.CONFIG_FILES_PATH) + + # Environment parameters + if 'SERVICE_PORT' in os.environ: + self.SERVICE_PORT = os.environ['SERVICE_PORT'] + if 'SAVE_AUDIO' in os.environ: + self.SAVE_AUDIO = os.environ['SAVE_AUDIO'] + if 'SWAGGER_PATH' in os.environ: + self.SWAGGER_PATH = os.environ['SWAGGER_PATH'] + + self.loadConfig() + + def loadConfig(self): + # get decoder parameters from "decode.cfg" + decoder_settings = configparser.ConfigParser() + if not os.path.exists(self.AM_PATH+'/decode.cfg'): + return False + decoder_settings.read(self.AM_PATH+'/decode.cfg') + + # Prepare "online.conf" + self.AM_PATH = 
self.AM_PATH+"/" + \ + decoder_settings.get('decoder_params', 'ampath') + with open(self.AM_PATH+"/conf/online.conf") as f: + values = f.readlines() + with open(self.CONFIG_FILES_PATH+"/online.conf", 'w') as f: + for i in values: + f.write(i) + f.write("--ivector-extraction-config=" + + self.CONFIG_FILES_PATH+"/ivector_extractor.conf\n") + f.write("--mfcc-config="+self.AM_PATH+"/conf/mfcc.conf\n") + f.write( + "--beam="+decoder_settings.get('decoder_params', 'beam')+"\n") + f.write( + "--lattice-beam="+decoder_settings.get('decoder_params', 'lattice_beam')+"\n") + f.write("--acoustic-scale=" + + decoder_settings.get('decoder_params', 'acwt')+"\n") + f.write( + "--min-active="+decoder_settings.get('decoder_params', 'min_active')+"\n") + f.write( + "--max-active="+decoder_settings.get('decoder_params', 'max_active')+"\n") + f.write("--frame-subsampling-factor="+decoder_settings.get( + 'decoder_params', 'frame_subsampling_factor')+"\n") + + # Prepare "ivector_extractor.conf" + with open(self.AM_PATH+"/conf/ivector_extractor.conf") as f: + values = f.readlines() + with open(self.CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: + for i in values: + f.write(i) + f.write("--splice-config="+self.AM_PATH+"/conf/splice.conf\n") + f.write("--cmvn-config="+self.AM_PATH + + "/conf/online_cmvn.conf\n") + f.write("--lda-matrix="+self.AM_PATH + + "/ivector_extractor/final.mat\n") + f.write("--global-cmvn-stats="+self.AM_PATH + + "/ivector_extractor/global_cmvn.stats\n") + f.write("--diag-ubm="+self.AM_PATH + + "/ivector_extractor/final.dubm\n") + f.write("--ivector-extractor="+self.AM_PATH + + "/ivector_extractor/final.ie") + + # Prepare "word_boundary.int" if not exist + if not os.path.exists(self.LM_PATH+"/word_boundary.int") and os.path.exists(self.AM_PATH+"phones.txt"): + with open(self.AM_PATH+"phones.txt") as f: + phones = f.readlines() + + with open(self.LM_PATH+"/word_boundary.int", "w") as f: + for phone in phones: + phone = phone.strip() + phone = re.sub('^ .*', '', phone) + phone = re.sub('^#\d+ .*', '', phone) + if phone != '': + id = phone.split(' ')[1] + if '_I ' in phone: + f.write(id+" internal\n") + elif '_B ' in phone: + f.write(id+" begin\n") + elif '_E ' in phone: + f.write(id+" end\n") + elif '_S ' in phone: + f.write(id+" singleton\n") + else: + f.write(id+" nonword\n") + + def swaggerUI(self, app): + ### swagger specific ### + swagger_yml = yaml.load( + open(self.SWAGGER_PATH, 'r'), Loader=yaml.Loader) + swaggerui = get_swaggerui_blueprint( + # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' + self.SWAGGER_URL, + self.SWAGGER_PATH, + config={ # Swagger UI config overrides + 'app_name': "STT API Documentation", + 'spec': swagger_yml + } + ) + app.register_blueprint(swaggerui, url_prefix=self.SWAGGER_URL) + ### end swagger specific ### + + def read_audio(self, file, sample_rate): + file_path = self.TEMP_FILE_PATH+file.filename.lower() + file.save(file_path) try: - data, sr = librosa.load(audio,sr=None) - if sr != self.sr: - self.log.info('Resample audio file: '+str(sr)+'Hz -> '+str(self.sr)+'Hz') - data = librosa.resample(data, sr, self.sr) + data, sr = librosa.load(file_path, sr=None) + if sr != sample_rate: + self.log.info('Resample audio file: '+str(sr) + + 'Hz -> '+str(sample_rate)+'Hz') + data = librosa.resample(data, sr, sample_rate) data = (data * 32767).astype(np.int16) - self.data = data - self.dur = len(self.data) / self.sr + self.dur = len(data) / sample_rate + self.data = Vector(data) + + if not self.SAVE_AUDIO: + os.remove(file_path) except 
Exception as e: self.log.error(e) raise ValueError("The uploaded file format is not supported!!!") - - def getDataKaldyVector(self): - return Vector(self.data) \ No newline at end of file + + def run(self, asr, metadata): + feats = asr.compute_feat(self.data) + mfcc, ivector = asr.get_frames(feats) + decode = asr.decoder(feats) + if metadata: + spk = SpeakerDiarization(asr.get_sample_rate()) + spkSeg = spk.run(self.data, self.dur, mfcc) + data = asr.wordTimestamp(decode["text"], decode['lattice'], asr.frame_shift, asr.decodable_opts.frame_subsampling_factor) + output = self.process_output(data, spkSeg) + return output + else: + return self.parse_text(decode["text"]) + + + # return a json object including word-data, speaker-data + def process_output(self, data, spkrs): + speakers = [] + text = [] + i = 0 + text_ = "" + words=[] + for word in data['words']: + if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: + text_ += word["word"] + " " + words.append(word) + else: + speaker = {} + speaker["start"]=words[0]["start"] + speaker["end"]=words[len(words)-1]["end"] + speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) + speaker["words"]=words + + text.append('spk'+str(int(spkrs[i][2]))+' : '+ self.parse_text(text_)) + speakers.append(speaker) + + words=[word] + text_=word["word"] + " " + i+=1 + + speaker = {} + speaker["start"]=words[0]["start"] + speaker["end"]=words[len(words)-1]["end"] + speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) + speaker["words"]=words + + text.append('spk'+str(int(spkrs[i][2]))+' : '+ self.parse_text(text_)) + speakers.append(speaker) + + return {'speakers': speakers, 'text': text} + + # remove extra symbols + def parse_text(self, text): + text = re.sub(r"", "", text) # remove symbol + text = re.sub(r"#nonterm:[^ ]* ", "", text) # remove entity's mark + text = re.sub(r"", "", text) # remove + text = re.sub(r"' ", "'", text) # remove space after quote ' + text = re.sub(r" +", " ", text) # remove multiple spaces + text = text.strip() + return text From f6a3659028fe77a6874dceec685e17a5eaa1bb0d Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Sat, 3 Oct 2020 15:53:48 +0200 Subject: [PATCH 027/172] update readme --- README.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/README.md b/README.md index a2540a8..fb69cc2 100644 --- a/README.md +++ b/README.md @@ -140,9 +140,7 @@ Convert a speech to text > `post`
> Make a POST request >> Arguments : ->> - **{File} file** : Audio file (file format: wav, mp3, aiff, flac, ogg) ->> - **{Integer} nbrSpeaker (optional)**: Number of speakers engaged in dialog ->> - **{String} speaker (optional)**: Do speaker diarization (yes|no) +>> - **{File} file** : Audio file (file format: wav, mp3, flac, ogg) > >> Header : >> - **{String} Accept**: response content type (text/plain|application/json) From 37b48db04370f8485384ca4c5fb5ee0a4c4c44d1 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 13 Oct 2020 11:53:03 +0200 Subject: [PATCH 028/172] change audio file loader and saved filename --- tools.py | 81 +++++++++++++++++++++++++++++++------------------------- 1 file changed, 45 insertions(+), 36 deletions(-) diff --git a/tools.py b/tools.py index 05fe391..5d925bf 100644 --- a/tools.py +++ b/tools.py @@ -14,12 +14,13 @@ # other packages import configparser +import librosa import logging import os import re +import uuid import json import yaml -import scipy.io.wavfile import numpy as np from flask_swagger_ui import get_swaggerui_blueprint ############## @@ -36,7 +37,7 @@ def __init__(self): self.LM_PATH = '/opt/models/LM' self.TEMP_FILE_PATH = '/opt/tmp' self.CONFIG_FILES_PATH = '/opt/config' - self.SAVE_AUDIO=False + self.SAVE_AUDIO = False self.SERVICE_PORT = 80 self.NBR_THREADS = 100 self.SWAGGER_URL = '/api-doc' @@ -58,16 +59,17 @@ def __init__(self): if 'SWAGGER_PATH' in os.environ: self.SWAGGER_PATH = os.environ['SWAGGER_PATH'] - # start loading ASR configuration self.log.info("Create the new config files") self.loadConfig() def swaggerUI(self, app): ### swagger specific ### - swagger_yml = yaml.load(open(self.SWAGGER_PATH, 'r'), Loader=yaml.Loader) + swagger_yml = yaml.load( + open(self.SWAGGER_PATH, 'r'), Loader=yaml.Loader) swaggerui = get_swaggerui_blueprint( - self.SWAGGER_URL, # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' + # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' + self.SWAGGER_URL, self.SWAGGER_PATH, config={ # Swagger UI config overrides 'app_name': "STT API Documentation", @@ -77,15 +79,20 @@ def swaggerUI(self, app): app.register_blueprint(swaggerui, url_prefix=self.SWAGGER_URL) ### end swagger specific ### - def getAudio(self,file): + def getAudio(self, file): + filename = str(uuid.uuid4()) + file_path = self.TEMP_FILE_PATH+"/"+filename + file.save(file_path) try: - file_path = self.TEMP_FILE_PATH+"/"+file.filename.lower() - file.save(file_path) - self.rate, self.data = scipy.io.wavfile.read(file_path) + data, sr = librosa.load(file_path) + self.data = (data * 32767).astype(np.int16) + self.rate = sr + except Exception as e: + self.log.error(e) + raise ValueError("The uploaded file format is not supported!!!") + finally: if not self.SAVE_AUDIO: os.remove(file_path) - except Exception as e: - raise ValueError('Unsupported audio file! 
Only WAVE format is supported.') # re-create config files def loadConfig(self): @@ -166,10 +173,10 @@ def loadConfig(self): # remove extra symbols def parse_text(self, text): - text = re.sub(r"", "", text) # remove symbol - text = re.sub(r"#nonterm:[^ ]* ", "", text) # remove entity's mark - text = re.sub(r"' ", "'", text) # remove space after quote ' - text = re.sub(r" +", " ", text) # remove multiple spaces + text = re.sub(r"", "", text) # remove symbol + text = re.sub(r"#nonterm:[^ ]* ", "", text) # remove entity's mark + text = re.sub(r"' ", "'", text) # remove space after quote ' + text = re.sub(r" +", " ", text) # remove multiple spaces text = text.strip() return text @@ -178,7 +185,7 @@ def get_response(self, dataJson, is_metadata, is_spkDiarization, nbrOfSpk): if dataJson is not None: data = json.loads(dataJson) if not is_metadata: - text = data['text'] # get text from response + text = data['text'] # get text from response return self.parse_text(text) elif 'words' in data and 'features' in data: @@ -194,12 +201,12 @@ def get_response(self, dataJson, is_metadata, is_spkDiarization, nbrOfSpk): feats = np.squeeze(feats) mask = np.ones(shape=(feats.shape[0],)) for pos in seg: - mask[pos-30:pos]=0 + mask[pos-30:pos] = 0 # Do speaker diarization and get speaker segments spk = SpeakerDiarization() spk.set_maxNrSpeakers(nbrOfSpk) - spkrs = spk.run(feats,mask) + spkrs = spk.run(feats, mask) # Generate final output data return self.process_output(data, spkrs) @@ -212,39 +219,40 @@ def get_response(self, dataJson, is_metadata, is_spkDiarization, nbrOfSpk): else: return {'speakers': [], 'text': '', 'words': []} - # return a json object including word-data, speaker-data + def process_output(self, data, spkrs): speakers = [] text = [] i = 0 text_ = "" - words=[] + words = [] for word in data['words']: if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: - text_ += word["word"] + " " + text_ += word["word"] + " " words.append(word) else: speaker = {} - speaker["start"]=words[0]["start"] - speaker["end"]=words[len(words)-1]["end"] - speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) - speaker["words"]=words + speaker["start"] = words[0]["start"] + speaker["end"] = words[len(words)-1]["end"] + speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) + speaker["words"] = words - text.append('spk'+str(int(spkrs[i][2]))+' : '+ self.parse_text(text_)) + text.append( + 'spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) speakers.append(speaker) - words=[word] - text_=word["word"] + " " - i+=1 + words = [word] + text_ = word["word"] + " " + i += 1 speaker = {} - speaker["start"]=words[0]["start"] - speaker["end"]=words[len(words)-1]["end"] - speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) - speaker["words"]=words + speaker["start"] = words[0]["start"] + speaker["end"] = words[len(words)-1]["end"] + speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) + speaker["words"] = words - text.append('spk'+str(int(spkrs[i][2]))+' : '+ self.parse_text(text_)) + text.append('spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) speakers.append(speaker) return {'speakers': speakers, 'text': text} @@ -362,7 +370,7 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): maskSAD = mask maskUEM = np.ones([1, nFeatures]) - + mask = np.logical_and(maskUEM, maskSAD) mask = mask[0][0:nFeatures] nSpeechFeatures = np.sum(mask) @@ -434,7 +442,8 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): else: return None - self.log.info("Speaker Diarization time in seconds: %s" % 
(time.time() - start_time)) + self.log.info("Speaker Diarization time in seconds: %s" % + (time.time() - start_time)) except ValueError as v: self.log.info(v) return [[0, duration, 1], From f6f9da2fc5a43aa5e911273916a85ff132f63a05 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 13 Oct 2020 11:56:24 +0200 Subject: [PATCH 029/172] update swagger --- document/swagger.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/document/swagger.yml b/document/swagger.yml index 8a93b7c..e763d3b 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -24,7 +24,7 @@ paths: parameters: - name: "file" in: "formData" - description: "Audio File - Waveform Format" + description: "Audio File (wav, mp3, flac, ogg)" required: true type: "file" responses: From 3782a1210daf02034cfef1dabca2c984f2c23bf3 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 13 Oct 2020 12:15:04 +0200 Subject: [PATCH 030/172] change audio file loader and saved filename --- document/swagger.yml | 2 +- tools.py | 11 ++++++----- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/document/swagger.yml b/document/swagger.yml index e763d3b..3db05a0 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -24,7 +24,7 @@ paths: parameters: - name: "file" in: "formData" - description: "Audio File (wav, mp3, flac, ogg)" + description: "Audio File (wav, mp3, flac, ogg, wma, m4a)" required: true type: "file" responses: diff --git a/tools.py b/tools.py index cfb6117..3129d99 100644 --- a/tools.py +++ b/tools.py @@ -35,7 +35,7 @@ ############## # other packages -import configparser, sys, os, re, time, logging, yaml +import configparser, sys, os, re, time, logging, yaml, uuid from flask_swagger_ui import get_swaggerui_blueprint ############## @@ -572,7 +572,8 @@ def swaggerUI(self, app): ### end swagger specific ### def read_audio(self, file, sample_rate): - file_path = self.TEMP_FILE_PATH+file.filename.lower() + filename = str(uuid.uuid4()) + file_path = self.TEMP_FILE_PATH+"/"+filename file.save(file_path) try: data, sr = librosa.load(file_path, sr=None) @@ -583,12 +584,12 @@ def read_audio(self, file, sample_rate): data = (data * 32767).astype(np.int16) self.dur = len(data) / sample_rate self.data = Vector(data) - - if not self.SAVE_AUDIO: - os.remove(file_path) except Exception as e: self.log.error(e) raise ValueError("The uploaded file format is not supported!!!") + finally: + if not self.SAVE_AUDIO: + os.remove(file_path) def run(self, asr, metadata): feats = asr.compute_feat(self.data) From 43958a74a45116c59f19a26a0b140f5eb8dca42a Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 13 Oct 2020 12:29:17 +0200 Subject: [PATCH 031/172] install ffmpeg to extend the supported audio format --- Dockerfile | 3 ++- document/swagger.yml | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/Dockerfile b/Dockerfile index 1c6f518..e3bd38a 100644 --- a/Dockerfile +++ b/Dockerfile @@ -70,7 +70,8 @@ RUN cd /opt/vosk-api/python && \ WORKDIR /usr/src/speech-to-text # Install main service packages -RUN pip3 install flask flask-cors flask-swagger-ui gevent pyyaml +RUN pip3 install flask flask-cors flask-swagger-ui gevent pyyaml && \ + apt-get install -y ffmpeg COPY pyBK/diarizationFunctions.py pyBK/diarizationFunctions.py COPY tools.py . 
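Note on the ffmpeg dependency added in the Dockerfile above: the worker hands uploads to librosa.load(), which decodes WAV, FLAC and OGG natively through soundfile and falls back to audioread for compressed formats such as mp3, wma or m4a; that fallback needs a system decoder, which is what the ffmpeg package provides. A minimal sketch of the resulting loading path, with load_as_pcm16 as an illustrative helper name that is not part of the worker code:

    import librosa
    import numpy as np

    def load_as_pcm16(file_path, target_sr=16000):
        # librosa decodes any format ffmpeg understands, resamples to
        # target_sr and downmixes to mono, returning float32 in [-1, 1]
        data, _ = librosa.load(file_path, sr=target_sr, mono=True)
        # scale to 16-bit PCM, the sample format the recognizer expects
        return (data * 32767).astype(np.int16)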
diff --git a/document/swagger.yml b/document/swagger.yml index e763d3b..3db05a0 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -24,7 +24,7 @@ paths: parameters: - name: "file" in: "formData" - description: "Audio File (wav, mp3, flac, ogg)" + description: "Audio File (wav, mp3, flac, ogg, wma, m4a)" required: true type: "file" responses: From a23a19725b28287a44de69caaba06bbccda9d010 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 3 Nov 2020 17:24:20 +0100 Subject: [PATCH 032/172] update worker and fix some bugs --- Dockerfile | 8 +-- document/swagger.yml | 2 +- run.py | 10 ++-- tools.py | 133 +++++++++++++++++++++++++++---------------- vosk-api | 2 +- 5 files changed, 96 insertions(+), 59 deletions(-) diff --git a/Dockerfile b/Dockerfile index e3bd38a..c8e95cd 100644 --- a/Dockerfile +++ b/Dockerfile @@ -59,6 +59,10 @@ RUN apt install -y software-properties-common && wget https://apt.llvm.org/llvm. pip3 install websockets && \ pip3 install librosa webrtcvad scipy sklearn +# Install main service packages +RUN pip3 install flask flask-cors flask-swagger-ui gevent pyyaml && \ + apt-get install -y ffmpeg + # build VOSK KALDI COPY vosk-api /opt/vosk-api RUN cd /opt/vosk-api/python && \ @@ -69,10 +73,6 @@ RUN cd /opt/vosk-api/python && \ # Define the main folder WORKDIR /usr/src/speech-to-text -# Install main service packages -RUN pip3 install flask flask-cors flask-swagger-ui gevent pyyaml && \ - apt-get install -y ffmpeg - COPY pyBK/diarizationFunctions.py pyBK/diarizationFunctions.py COPY tools.py . COPY run.py . diff --git a/document/swagger.yml b/document/swagger.yml index 3db05a0..b52b52c 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -24,7 +24,7 @@ paths: parameters: - name: "file" in: "formData" - description: "Audio File (wav, mp3, flac, ogg, wma, m4a)" + description: "Audio File - Waveform Audio File Format is required. 
Best configuration (16KHz, 16b, mono)" required: true type: "file" responses: diff --git a/run.py b/run.py index b548209..59bc5b1 100644 --- a/run.py +++ b/run.py @@ -19,7 +19,7 @@ worker.log.info('Load acoustic model and decoding graph') model = Model(worker.AM_PATH, worker.LM_PATH, worker.CONFIG_FILES_PATH+"/online.conf") - +spkModel = None # API @app.route('/transcribe', methods=['POST']) @@ -43,11 +43,13 @@ def transcribe(): if 'file' in request.files.keys(): file = request.files['file'] worker.getAudio(file) - rec = KaldiRecognizer(model, worker.rate, is_metadata) - data_ = rec.Decode(worker.data) + rec = KaldiRecognizer(model, spkModel, worker.rate, False) + rec.AcceptWaveform(worker.data) + data_ = rec.FinalResult() if is_metadata: data_ = rec.GetMetadata() - data = worker.get_response(data_, is_metadata, is_metadata, nbrOfSpk) + data = worker.get_response(data_, is_metadata, nbrOfSpk) + worker.clean() else: raise ValueError('No audio file was uploaded') diff --git a/tools.py b/tools.py index 5d925bf..958502f 100644 --- a/tools.py +++ b/tools.py @@ -22,6 +22,7 @@ import json import yaml import numpy as np +from scipy.io import wavfile from flask_swagger_ui import get_swaggerui_blueprint ############## @@ -81,18 +82,20 @@ def swaggerUI(self, app): def getAudio(self, file): filename = str(uuid.uuid4()) - file_path = self.TEMP_FILE_PATH+"/"+filename - file.save(file_path) + self.file_path = self.TEMP_FILE_PATH+"/"+filename + file.save(self.file_path) try: - data, sr = librosa.load(file_path) - self.data = (data * 32767).astype(np.int16) - self.rate = sr + self.rate, self.data = wavfile.read(self.file_path) + # if stereo file, convert to mono by computing the mean of the channels + if len(self.data.shape) == 2 and self.data.shape[1] == 2: + self.data = np.mean(self.data, axis=1, dtype=np.int16) except Exception as e: self.log.error(e) raise ValueError("The uploaded file format is not supported!!!") - finally: - if not self.SAVE_AUDIO: - os.remove(file_path) + + def clean(self): + if not self.SAVE_AUDIO: + os.remove(self.file_path) # re-create config files def loadConfig(self): @@ -125,9 +128,6 @@ def loadConfig(self): "--max-active="+decoder_settings.get('decoder_params', 'max_active')+"\n") f.write("--frame-subsampling-factor="+decoder_settings.get( 'decoder_params', 'frame_subsampling_factor')+"\n") - f.write("--endpoint.rule2.min-trailing-silence=0.5\n") - f.write("--endpoint.rule3.min-trailing-silence=1.0\n") - f.write("--endpoint.rule4.min-trailing-silence=2.0\n") # Prepare "ivector_extractor.conf" with open(self.AM_PATH+"/conf/ivector_extractor.conf") as f: @@ -181,39 +181,23 @@ def parse_text(self, text): return text # Postprocess response - def get_response(self, dataJson, is_metadata, is_spkDiarization, nbrOfSpk): + def get_response(self, dataJson, is_metadata, nbrOfSpk): if dataJson is not None: data = json.loads(dataJson) if not is_metadata: text = data['text'] # get text from response return self.parse_text(text) - elif 'words' in data and 'features' in data: - if is_spkDiarization: - # Get Features and spoken segments and clean data - features = data['features'] - seg = data['segments'] if data['segments'] is not None else [] - del data['features'] - del data['segments'] - - # Prepare the parameters for SpeakerDiarization input - feats = np.array(features) - feats = np.squeeze(feats) - mask = np.ones(shape=(feats.shape[0],)) - for pos in seg: - mask[pos-30:pos] = 0 - - # Do speaker diarization and get speaker segments - spk = SpeakerDiarization() - 
spk.set_maxNrSpeakers(nbrOfSpk) - spkrs = spk.run(feats, mask) - - # Generate final output data - return self.process_output(data, spkrs) - - del data['features'] - del data['segments'] - return data + elif 'words' in data: + # Do speaker diarization and get speaker segments + spk = SpeakerDiarization() + spk.set_maxNrSpeakers(nbrOfSpk) + spkrs = spk.run(self.file_path) + + # Generate final output data + return self.process_output(data, spkrs) + elif 'text' in data: + return {'speakers': [], 'text': data['text'], 'words': []} else: return {'speakers': [], 'text': '', 'words': []} else: @@ -228,6 +212,8 @@ def process_output(self, data, spkrs): text_ = "" words = [] for word in data['words']: + if i+1 == len(spkrs): + continue if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: text_ += word["word"] + " " words.append(word) @@ -266,10 +252,8 @@ def __init__(self): # MFCC FEATURES PARAMETERS self.frame_length_s = 0.025 self.frame_shift_s = 0.01 - self.num_bins = 40 - self.num_ceps = 40 - self.low_freq = 40 - self.high_freq = -200 + self.num_bins = 30 + self.num_ceps = 30 ##### # Segment @@ -321,11 +305,57 @@ def __init__(self): self.nbIter = 10 # Number of expectation-maximization (EM) iterations self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames ###### + + def compute_feat_Librosa(self,audioFile): + try: + self.data, self.sr = librosa.load(audioFile,sr=None) + frame_length_inSample = self.frame_length_s * self.sr + hop = int(self.frame_shift_s * self.sr) + NFFT = int(2**np.ceil(np.log2(frame_length_inSample))) + if self.sr >= 16000: + mfccNumpy = librosa.feature.mfcc(y=self.data, + sr=self.sr, + dct_type=2, + n_mfcc=self.num_ceps, + n_mels=self.num_bins, + n_fft=NFFT, + hop_length=hop, + fmin=20, + fmax=7600).T + else: + mfccNumpy = librosa.feature.mfcc(y=self.data, + sr=self.sr, + dct_type=2, + n_mfcc=self.num_ceps, + n_mels=self.num_bins, + n_fft=NFFT, + hop_length=hop).T + + except Exception as e: + self.log.error(e) + raise ValueError("Speaker diarization failed when extracting features!!!") + else: + return mfccNumpy + + def computeVAD_WEBRTC(self, data, sr, nFeatures): + try: + va_framed = py_webrtcvad(data, fs=sr, fs_vad=sr, hoplength=30, vad_mode=0) + segments = get_py_webrtcvad_segments(va_framed,sr) + maskSAD = np.zeros([1,nFeatures]) + for seg in segments: + start=int(np.round(seg[0]/self.frame_shift_s)) + end=int(np.round(seg[1]/self.frame_shift_s)) + maskSAD[0][start:end]=1 + except Exception as e: + self.log.error(e) + raise ValueError("Speaker diarization failed while voice activity detection!!!") + else: + return maskSAD def set_maxNrSpeakers(self, nbr): self.maxNrSpeakers = nbr - def run(self, feats, mask): + def run(self, audioFile): try: def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): numberOfSpeechFeatures = finalSegmentTable[-1, 2].astype(int)+1 @@ -361,6 +391,9 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): start_time = time.time() + self.log.info('Start Speaker diarization') + + feats = self.compute_feat_Librosa(audioFile) nFeatures = feats.shape[0] duration = nFeatures * self.frame_shift_s @@ -368,7 +401,7 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): return [[0, duration, 1], [duration, -1, -1]] - maskSAD = mask + maskSAD = self.computeVAD_WEBRTC(self.data, self.sr, nFeatures) maskUEM = np.ones([1, nFeatures]) mask = np.logical_and(maskUEM, maskSAD) @@ -440,16 +473,18 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): seg 
= getSegments(self.frame_shift_s, finalSegmentTable, np.squeeze( finalClusteringTableResegmentation), duration) else: - return None + return [[0, duration, 1], + [duration, -1, -1]] - self.log.info("Speaker Diarization time in seconds: %s" % - (time.time() - start_time)) + self.log.info("Speaker Diarization time in seconds: %d" % + int(time.time() - start_time)) except ValueError as v: - self.log.info(v) + self.log.error(v) return [[0, duration, 1], [duration, -1, -1]] except Exception as e: self.log.error(e) - return None + return [[0, duration, 1], + [duration, -1, -1]] else: return seg diff --git a/vosk-api b/vosk-api index fec4a1a..a38506d 160000 --- a/vosk-api +++ b/vosk-api @@ -1 +1 @@ -Subproject commit fec4a1ad76a3c2e66bad84acd5cead2070b3d1b6 +Subproject commit a38506d69460438d7f2b074470e72ef0ab973bf0 From ad9d5db2b795e42093e3233a39323e32156aef79 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 3 Nov 2020 17:32:13 +0100 Subject: [PATCH 033/172] remove extra parameters --- run.py | 5 ++--- tools.py | 9 +++------ 2 files changed, 5 insertions(+), 9 deletions(-) diff --git a/run.py b/run.py index 59bc5b1..6ecfa3f 100644 --- a/run.py +++ b/run.py @@ -29,7 +29,6 @@ def transcribe(): (strftime("%d/%b/%d %H:%M:%S", gmtime()))) is_metadata = False - nbrOfSpk = 10 # get response content type if request.headers.get('accept').lower() == 'application/json': @@ -43,12 +42,12 @@ def transcribe(): if 'file' in request.files.keys(): file = request.files['file'] worker.getAudio(file) - rec = KaldiRecognizer(model, spkModel, worker.rate, False) + rec = KaldiRecognizer(model, spkModel, worker.rate, worker.ONLINE) rec.AcceptWaveform(worker.data) data_ = rec.FinalResult() if is_metadata: data_ = rec.GetMetadata() - data = worker.get_response(data_, is_metadata, nbrOfSpk) + data = worker.get_response(data_, is_metadata) worker.clean() else: raise ValueError('No audio file was uploaded') diff --git a/tools.py b/tools.py index 958502f..0cfe1f8 100644 --- a/tools.py +++ b/tools.py @@ -43,6 +43,7 @@ def __init__(self): self.NBR_THREADS = 100 self.SWAGGER_URL = '/api-doc' self.SWAGGER_PATH = '' + self.ONLINE = False if not os.path.isdir(self.CONFIG_FILES_PATH): os.mkdir(self.CONFIG_FILES_PATH) @@ -181,7 +182,7 @@ def parse_text(self, text): return text # Postprocess response - def get_response(self, dataJson, is_metadata, nbrOfSpk): + def get_response(self, dataJson, is_metadata): if dataJson is not None: data = json.loads(dataJson) if not is_metadata: @@ -191,7 +192,6 @@ def get_response(self, dataJson, is_metadata, nbrOfSpk): elif 'words' in data: # Do speaker diarization and get speaker segments spk = SpeakerDiarization() - spk.set_maxNrSpeakers(nbrOfSpk) spkrs = spk.run(self.file_path) # Generate final output data @@ -296,7 +296,7 @@ def __init__(self): self.bestClusteringCriterion = 'elbow' self.sigma = 1 # Spectral clustering parameters, employed if bestClusteringCriterion == spectral self.percentile = 40 - self.maxNrSpeakers = 16 # If known, max nr of speakers in a sesssion in the database. This is to limit the effect of changes in very small meaningless eigenvalues values generating huge eigengaps + self.maxNrSpeakers = 10 # If known, max nr of speakers in a sesssion in the database. 
This is to limit the effect of changes in very small meaningless eigenvalues values generating huge eigengaps ###### # RESEGMENTATION @@ -352,9 +352,6 @@ def computeVAD_WEBRTC(self, data, sr, nFeatures): else: return maskSAD - def set_maxNrSpeakers(self, nbr): - self.maxNrSpeakers = nbr - def run(self, audioFile): try: def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): From d13326ec61dd9947323cabc6d08c0f5ace2e1a96 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 3 Nov 2020 17:35:01 +0100 Subject: [PATCH 034/172] update README --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 8e53305..45c75f7 100644 --- a/README.md +++ b/README.md @@ -55,14 +55,14 @@ If you want to use our service alone without LinTO-Platform-STT-Service-Manager, ```bash wget https://dl.linto.ai/downloads/model-distribution/acoustic-models/fr-FR/linSTT_AM_fr-FR_v1.0.0.zip -wget https://dl.linto.ai/downloads/model-distribution/decoding-graphs/LVCSR/fr-FR/decoding_graph_fr-FR_Small_v1.0.0.zip +wget https://dl.linto.ai/downloads/model-distribution/decoding-graphs/LVCSR/fr-FR/decoding_graph_fr-FR_Small_v1.1.0.zip ``` 2- Uncompress both files ```bash unzip linSTT_AM_fr-FR_v1.0.0.zip -d AM_fr-FR -unzip decoding_graph_fr-FR_Small_v1.0.0.zip -d DG_fr-FR_Small +unzip decoding_graph_fr-FR_Small_v1.1.0.zip -d DG_fr-FR_Small ``` 3- Move the uncompressed files into the shared storage directory From e21264e36b1a7f00f8f16d7d1e9be45aea79b2ce Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 3 Nov 2020 17:45:06 +0100 Subject: [PATCH 035/172] update README --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index fb69cc2..43df31e 100644 --- a/README.md +++ b/README.md @@ -55,14 +55,14 @@ If you want to use our service alone without LinTO-Platform-STT-Service-Manager, ```bash wget https://dl.linto.ai/downloads/model-distribution/acoustic-models/fr-FR/linSTT_AM_fr-FR_v1.0.0.zip -wget https://dl.linto.ai/downloads/model-distribution/decoding-graphs/LVCSR/fr-FR/decoding_graph_fr-FR_Small_v1.0.0.zip +wget https://dl.linto.ai/downloads/model-distribution/decoding-graphs/LVCSR/fr-FR/decoding_graph_fr-FR_Small_v1.1.0.zip ``` 2- Uncompress both files ```bash unzip linSTT_AM_fr-FR_v1.0.0.zip -d AM_fr-FR -unzip decoding_graph_fr-FR_Small_v1.0.0.zip -d DG_fr-FR_Small +unzip decoding_graph_fr-FR_Small_v1.1.0.zip -d DG_fr-FR_Small ``` 3- Move the uncompressed files into the shared storage directory From d4b127f19f9cef1cbd11cb18f2168a796b0a939a Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 6 Nov 2020 15:47:34 +0100 Subject: [PATCH 036/172] add confidence score feature --- run.py | 5 +++-- tools.py | 4 ++-- vosk-api | 2 +- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/run.py b/run.py index 6ecfa3f..822a650 100644 --- a/run.py +++ b/run.py @@ -3,7 +3,7 @@ from flask import Flask, request, abort, Response, json from vosk import Model, KaldiRecognizer -from tools import WorkerStreaming +from tools import Worker from time import gmtime, strftime from gevent.pywsgi import WSGIServer @@ -13,7 +13,7 @@ app = Flask("__stt-standelone-worker__") # create WorkerStreaming object -worker = WorkerStreaming() +worker = Worker() # Load ASR models (acoustic model and decoding graph) worker.log.info('Load acoustic model and decoding graph') @@ -45,6 +45,7 @@ def transcribe(): rec = KaldiRecognizer(model, spkModel, worker.rate, worker.ONLINE) rec.AcceptWaveform(worker.data) data_ = rec.FinalResult() + 
worker.log.info(rec.uttConfidence()) if is_metadata: data_ = rec.GetMetadata() data = worker.get_response(data_, is_metadata) diff --git a/tools.py b/tools.py index 0cfe1f8..39e2700 100644 --- a/tools.py +++ b/tools.py @@ -27,10 +27,10 @@ ############## -class WorkerStreaming: +class Worker: def __init__(self): # Set logger config - self.log = logging.getLogger("__stt-standelone-worker-streaming__") + self.log = logging.getLogger("__stt-standelone-worker__") logging.basicConfig(level=logging.INFO) # Main parameters diff --git a/vosk-api b/vosk-api index a38506d..7f555e4 160000 --- a/vosk-api +++ b/vosk-api @@ -1 +1 @@ -Subproject commit a38506d69460438d7f2b074470e72ef0ab973bf0 +Subproject commit 7f555e464c1d6b16233354491868f46d009c453c From 088cbb55e347042dec3604c4fab5db1bb84a5fa5 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 12 Jan 2021 13:28:12 +0100 Subject: [PATCH 037/172] fix some bugs: response error when generating the speaker information --- docker-compose.yml | 2 +- pyBK | 2 +- tools.py | 79 +++++++++++++++++++++++++--------------------- 3 files changed, 45 insertions(+), 38 deletions(-) diff --git a/docker-compose.yml b/docker-compose.yml index 08c14d0..f7da7db 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -5,7 +5,7 @@ services: stt-worker: container_name: stt-standalone-worker build: . - image: lintoai/linto-platform-stt-standalone-worker:latest + image: lintoai/linto-platform-stt-standalone-worker:latest-unstable volumes: - ${AM_PATH}:/opt/models/AM - ${LM_PATH}:/opt/models/LM diff --git a/pyBK b/pyBK index 7738eb7..1e5dc7d 160000 --- a/pyBK +++ b/pyBK @@ -1 +1 @@ -Subproject commit 7738eb75dfc65438fbcd0eed9bb6a1f086b4bd6c +Subproject commit 1e5dc7de4e0a7d43a44152a68beca0699c14fd4c diff --git a/tools.py b/tools.py index 39e2700..92c91ed 100644 --- a/tools.py +++ b/tools.py @@ -206,42 +206,49 @@ def get_response(self, dataJson, is_metadata): # return a json object including word-data, speaker-data def process_output(self, data, spkrs): - speakers = [] - text = [] - i = 0 - text_ = "" - words = [] - for word in data['words']: - if i+1 == len(spkrs): - continue - if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: - text_ += word["word"] + " " - words.append(word) - else: - speaker = {} - speaker["start"] = words[0]["start"] - speaker["end"] = words[len(words)-1]["end"] - speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) - speaker["words"] = words - - text.append( - 'spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) - speakers.append(speaker) - - words = [word] - text_ = word["word"] + " " - i += 1 - - speaker = {} - speaker["start"] = words[0]["start"] - speaker["end"] = words[len(words)-1]["end"] - speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) - speaker["words"] = words - - text.append('spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) - speakers.append(speaker) - - return {'speakers': speakers, 'text': text} + try: + speakers = [] + text = [] + i = 0 + text_ = "" + words = [] + for word in data['words']: + if i+1 == len(spkrs): + continue + if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: + text_ += word["word"] + " " + words.append(word) + elif len(words) != 0: + speaker = {} + speaker["start"] = words[0]["start"] + speaker["end"] = words[len(words)-1]["end"] + speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) + speaker["words"] = words + + text.append( + 'spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) + speakers.append(speaker) + + words = [word] + text_ = word["word"] + " " + i += 1 + else: + words = [word] + 
text_ = word["word"] + " " + i += 1 + + speaker = {} + speaker["start"] = words[0]["start"] + speaker["end"] = words[len(words)-1]["end"] + speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) + speaker["words"] = words + + text.append('spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) + speakers.append(speaker) + + return {'speakers': speakers, 'text': text} + except: + return { 'data': data, 'spks': spkrs } class SpeakerDiarization: From 05651b8adddf128fa702ce1f1cdda82be8cfed99 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 12 Jan 2021 15:14:23 +0100 Subject: [PATCH 038/172] add the confidence score to the response json content --- run.py | 4 +-- tools.py | 75 +++++++++++++++++++++++++++++++------------------------- 2 files changed, 44 insertions(+), 35 deletions(-) diff --git a/run.py b/run.py index 822a650..5a7fb25 100644 --- a/run.py +++ b/run.py @@ -45,10 +45,10 @@ def transcribe(): rec = KaldiRecognizer(model, spkModel, worker.rate, worker.ONLINE) rec.AcceptWaveform(worker.data) data_ = rec.FinalResult() - worker.log.info(rec.uttConfidence()) + confidence = rec.uttConfidence() if is_metadata: data_ = rec.GetMetadata() - data = worker.get_response(data_, is_metadata) + data = worker.get_response(data_, confidence, is_metadata) worker.clean() else: raise ValueError('No audio file was uploaded') diff --git a/tools.py b/tools.py index 92c91ed..8844e48 100644 --- a/tools.py +++ b/tools.py @@ -182,9 +182,10 @@ def parse_text(self, text): return text # Postprocess response - def get_response(self, dataJson, is_metadata): + def get_response(self, dataJson, confidence, is_metadata): if dataJson is not None: data = json.loads(dataJson) + data['conf'] = confidence if not is_metadata: text = data['text'] # get text from response return self.parse_text(text) @@ -197,11 +198,11 @@ def get_response(self, dataJson, is_metadata): # Generate final output data return self.process_output(data, spkrs) elif 'text' in data: - return {'speakers': [], 'text': data['text'], 'words': []} + return {'speakers': [], 'text': data['text'], 'confidence-score': data['conf'], 'words': []} else: - return {'speakers': [], 'text': '', 'words': []} + return {'speakers': [], 'text': '', 'confidence-score': 0, 'words': []} else: - return {'speakers': [], 'text': '', 'words': []} + return {'speakers': [], 'text': '', 'confidence-score': 0, 'words': []} # return a json object including word-data, speaker-data @@ -243,12 +244,13 @@ def process_output(self, data, spkrs): speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) speaker["words"] = words - text.append('spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) + text.append('spk'+str(int(spkrs[i][2])) + + ' : ' + self.parse_text(text_)) speakers.append(speaker) - return {'speakers': speakers, 'text': text} + return {'speakers': speakers, 'text': text, 'confidence-score': data['conf']} except: - return { 'data': data, 'spks': spkrs } + return {'text': data['text'], 'words': data['words'], 'confidence-score': data['conf'], 'spks': []} class SpeakerDiarization: @@ -312,50 +314,57 @@ def __init__(self): self.nbIter = 10 # Number of expectation-maximization (EM) iterations self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames ###### - - def compute_feat_Librosa(self,audioFile): + + def compute_feat_Librosa(self, audioFile): try: - self.data, self.sr = librosa.load(audioFile,sr=None) + self.data, self.sr = librosa.load(audioFile, sr=None) frame_length_inSample = self.frame_length_s * self.sr hop = int(self.frame_shift_s * self.sr) 
NFFT = int(2**np.ceil(np.log2(frame_length_inSample))) if self.sr >= 16000: mfccNumpy = librosa.feature.mfcc(y=self.data, - sr=self.sr, - dct_type=2, - n_mfcc=self.num_ceps, - n_mels=self.num_bins, - n_fft=NFFT, - hop_length=hop, - fmin=20, - fmax=7600).T + sr=self.sr, + dct_type=2, + n_mfcc=self.num_ceps, + n_mels=self.num_bins, + n_fft=NFFT, + hop_length=hop, + fmin=20, + fmax=7600).T else: mfccNumpy = librosa.feature.mfcc(y=self.data, - sr=self.sr, - dct_type=2, - n_mfcc=self.num_ceps, - n_mels=self.num_bins, - n_fft=NFFT, - hop_length=hop).T + sr=self.sr, + dct_type=2, + n_mfcc=self.num_ceps, + n_mels=self.num_bins, + n_fft=NFFT, + hop_length=hop).T except Exception as e: self.log.error(e) - raise ValueError("Speaker diarization failed when extracting features!!!") + raise ValueError( + "Speaker diarization failed when extracting features!!!") else: return mfccNumpy def computeVAD_WEBRTC(self, data, sr, nFeatures): try: - va_framed = py_webrtcvad(data, fs=sr, fs_vad=sr, hoplength=30, vad_mode=0) - segments = get_py_webrtcvad_segments(va_framed,sr) - maskSAD = np.zeros([1,nFeatures]) + if sr not in [8000, 16000, 32000, 48000]: + data = librosa.resample(data, sr, 16000) + sr = 16000 + + va_framed = py_webrtcvad( + data, fs=sr, fs_vad=sr, hoplength=30, vad_mode=0) + segments = get_py_webrtcvad_segments(va_framed, sr) + maskSAD = np.zeros([1, nFeatures]) for seg in segments: - start=int(np.round(seg[0]/self.frame_shift_s)) - end=int(np.round(seg[1]/self.frame_shift_s)) - maskSAD[0][start:end]=1 + start = int(np.round(seg[0]/self.frame_shift_s)) + end = int(np.round(seg[1]/self.frame_shift_s)) + maskSAD[0][start:end] = 1 except Exception as e: self.log.error(e) - raise ValueError("Speaker diarization failed while voice activity detection!!!") + raise ValueError( + "Speaker diarization failed while voice activity detection!!!") else: return maskSAD @@ -478,7 +487,7 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): finalClusteringTableResegmentation), duration) else: return [[0, duration, 1], - [duration, -1, -1]] + [duration, -1, -1]] self.log.info("Speaker Diarization time in seconds: %d" % int(time.time() - start_time)) From 19054c4a4dc3275c44064b859b7d7e7c0a3c611c Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 12 Jan 2021 15:51:39 +0100 Subject: [PATCH 039/172] update RELEASE --- RELEASE.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/RELEASE.md b/RELEASE.md index 8712413..e190830 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,8 @@ +# 3.1.1 +- Change Pykaldi with vosk-API (no python wrapper for decoding function, no extrat packages during installation, c++ implementation based on kaldi functions) +- New feature: Compute a confidence score per transcription +- Fix minor bugs + # 2.2.1 - Fix minor bugs - put SWAGGER_PATH parameter as optional From 49e9528735359309122d8da220f094092e622a3a Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 12 Jan 2021 16:19:43 +0100 Subject: [PATCH 040/172] update worker --- .gitmodules | 6 + Dockerfile | 180 ++++------- Jenkinsfile | 1 - RELEASE.md | 5 + document/swagger.yml | 2 +- pyBK | 1 + run.py | 70 ++-- tools.py | 749 +++++++++++++++++-------------------------- vosk-api | 1 + 9 files changed, 426 insertions(+), 589 deletions(-) create mode 100644 .gitmodules create mode 160000 pyBK mode change 100755 => 100644 run.py create mode 160000 vosk-api diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 0000000..b131dc4 --- /dev/null +++ b/.gitmodules @@ -0,0 +1,6 @@ +[submodule "vosk-api"] + 
path = vosk-api + url = https://github.com/irebai/vosk-api.git +[submodule "pyBK"] + path = pyBK + url = https://github.com/irebai/pyBK.git diff --git a/Dockerfile b/Dockerfile index 5e9f2fe..c8e95cd 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,127 +1,79 @@ -# Dockerfile for building PyKaldi image from Ubuntu 16.04 image FROM ubuntu:18.04 LABEL maintainer="irebai@linagora.com" -# Install necessary system packages -RUN apt-get update \ - && apt-get install -y \ - python3 \ +RUN apt-get update &&\ + apt-get install -y \ + python2.7 \ + python3 \ python3-pip \ - python2.7 \ - autoconf \ - automake \ - cmake \ - make \ - curl \ - g++ \ - git \ - graphviz \ - libatlas3-base \ - libtool \ - pkg-config \ - sox \ - subversion \ - bzip2 \ - unzip \ - wget \ - zlib1g-dev \ - ca-certificates \ - gfortran \ - patch \ - ffmpeg \ - nano && \ - ln -s /usr/bin/python3 /usr/bin/python && \ - ln -s /usr/bin/pip3 /usr/bin/pip + git \ + swig \ + nano \ + sox \ + automake wget unzip build-essential libtool zlib1g-dev locales libatlas-base-dev ca-certificates gfortran subversion &&\ + apt-get clean -# Install necessary Python packages (pykaldi dependencies) -RUN pip install --upgrade pip \ - numpy \ - setuptools \ - pyparsing \ - ninja +## Build kaldi and Clean installation (intel, openfst, src/*) +RUN git clone --depth 1 https://github.com/kaldi-asr/kaldi.git /opt/kaldi && \ + cd /opt/kaldi/tools && \ + ./extras/install_mkl.sh && \ + make -j $(nproc) && \ + cd /opt/kaldi/src && \ + ./configure --shared && \ + make depend -j $(nproc) && \ + make -j $(nproc) && \ + mkdir -p /opt/kaldi/src_ && \ + mv /opt/kaldi/src/base \ + /opt/kaldi/src/chain \ + /opt/kaldi/src/cudamatrix \ + /opt/kaldi/src/decoder \ + /opt/kaldi/src/feat \ + /opt/kaldi/src/fstext \ + /opt/kaldi/src/gmm \ + /opt/kaldi/src/hmm \ + /opt/kaldi/src/ivector \ + /opt/kaldi/src/kws \ + /opt/kaldi/src/lat \ + /opt/kaldi/src/lm \ + /opt/kaldi/src/matrix \ + /opt/kaldi/src/nnet \ + /opt/kaldi/src/nnet2 \ + /opt/kaldi/src/nnet3 \ + /opt/kaldi/src/online2 \ + /opt/kaldi/src/rnnlm \ + /opt/kaldi/src/sgmm2 \ + /opt/kaldi/src/transform \ + /opt/kaldi/src/tree \ + /opt/kaldi/src/util \ + /opt/kaldi/src/itf \ + /opt/kaldi/src/lib /opt/kaldi/src_ && \ + cd /opt/kaldi && rm -r src && mv src_ src && rm src/*/*.cc && rm src/*/*.o && rm src/*/*.so && \ + cd /opt/intel/mkl/lib && rm -f intel64/*.a intel64_lin/*.a && \ + cd /opt/kaldi/tools && mkdir openfst_ && mv openfst-*/lib openfst-*/include openfst-*/bin openfst_ && rm openfst_/lib/*.so* openfst_/lib/*.la && \ + rm -r openfst-*/* && mv openfst_/* openfst-*/ && rm -r openfst_ -## Install Protobuf, CLIF, Kaldi and PyKaldi and Clean installation -RUN git clone --depth 1 https://github.com/pykaldi/pykaldi.git /pykaldi \ - && cd /pykaldi/tools \ - && sed -i "s/make \-j4/make -j $(nproc)/g" ./install_kaldi.sh \ - && sed -i "s/\-j 2/-j $(nproc)/g" ./install_clif.sh \ - && sed -i "s/make \-j4/make -j $(nproc)/g" ./install_protobuf.sh \ - && ./check_dependencies.sh \ - && ./install_protobuf.sh \ - && ./install_clif.sh \ - && ./install_kaldi.sh \ - && cd /pykaldi \ - && python setup.py install \ - && rm -rf /pykaldi/CMakeLists.txt \ - /pykaldi/LICENSE \ - /pykaldi/README.md \ - /pykaldi/setup.cfg \ - /pykaldi/setup.py \ - /pykaldi/docker \ - /pykaldi/docs \ - /pykaldi/extras \ - /pykaldi/pykaldi.egg-info \ - /pykaldi/tests \ - /pykaldi/build/CMakeCache.txt \ - /pykaldi/build/bdist.linux-x86_64 \ - /pykaldi/build/build.ninja \ - /pykaldi/build/cmake_install.cmake \ - /pykaldi/build/docs \ - /pykaldi/build/kaldi \ - 
/pykaldi/build/lib \ - /pykaldi/build/rules.ninja \ - /pykaldi/tools/check_dependencies.sh \ - /pykaldi/tools/clif* \ - /pykaldi/tools/find_python_library.py \ - /pykaldi/tools/install_* \ - /pykaldi/tools/protobuf \ - /pykaldi/tools/use_namespace.sh \ - /pykaldi/tools/kaldi/COPYING \ - /pykaldi/tools/kaldi/INSTALL \ - /pykaldi/tools/kaldi/README.md \ - /pykaldi/tools/kaldi/egs \ - /pykaldi/tools/kaldi/misc \ - /pykaldi/tools/kaldi/scripts \ - /pykaldi/tools/kaldi/windows \ - && mkdir -p /pykaldi/tools/kaldi/src_/lib \ - && mv /pykaldi/tools/kaldi/src/base/libkaldi-base.so \ - /pykaldi/tools/kaldi/src/chain/libkaldi-chain.so \ - /pykaldi/tools/kaldi/src/cudamatrix/libkaldi-cudamatrix.so \ - /pykaldi/tools/kaldi/src/decoder/libkaldi-decoder.so \ - /pykaldi/tools/kaldi/src/feat/libkaldi-feat.so \ - /pykaldi/tools/kaldi/src/fstext/libkaldi-fstext.so \ - /pykaldi/tools/kaldi/src/gmm/libkaldi-gmm.so \ - /pykaldi/tools/kaldi/src/hmm/libkaldi-hmm.so \ - /pykaldi/tools/kaldi/src/ivector/libkaldi-ivector.so \ - /pykaldi/tools/kaldi/src/kws/libkaldi-kws.so \ - /pykaldi/tools/kaldi/src/lat/libkaldi-lat.so \ - /pykaldi/tools/kaldi/src/lm/libkaldi-lm.so \ - /pykaldi/tools/kaldi/src/matrix/libkaldi-matrix.so \ - /pykaldi/tools/kaldi/src/nnet/libkaldi-nnet.so \ - /pykaldi/tools/kaldi/src/nnet2/libkaldi-nnet2.so \ - /pykaldi/tools/kaldi/src/nnet3/libkaldi-nnet3.so \ - /pykaldi/tools/kaldi/src/online2/libkaldi-online2.so \ - /pykaldi/tools/kaldi/src/rnnlm/libkaldi-rnnlm.so \ - /pykaldi/tools/kaldi/src/sgmm2/libkaldi-sgmm2.so \ - /pykaldi/tools/kaldi/src/transform/libkaldi-transform.so \ - /pykaldi/tools/kaldi/src/tree/libkaldi-tree.so \ - /pykaldi/tools/kaldi/src/util/libkaldi-util.so \ - /pykaldi/tools/kaldi/src_/lib \ - && rm -rf /pykaldi/tools/kaldi/src && mv /pykaldi/tools/kaldi/src_ /pykaldi/tools/kaldi/src \ - && cd /pykaldi/tools/kaldi/tools && mkdir openfsttmp && mv openfst-*/lib openfst-*/include openfst-*/bin openfsttmp && rm openfsttmp/lib/*.a openfsttmp/lib/*.la && \ - rm -r openfst-*/* && mv openfsttmp/* openfst-*/ && rm -r openfsttmp - -# Define the main folder -WORKDIR /usr/src/speech-to-text +# Install pyBK (speaker diarization toolkit) +RUN apt install -y software-properties-common && wget https://apt.llvm.org/llvm.sh && chmod +x llvm.sh && ./llvm.sh 10 && \ + export LLVM_CONFIG=/usr/bin/llvm-config-10 && \ + pip3 install numpy && \ + pip3 install websockets && \ + pip3 install librosa webrtcvad scipy sklearn # Install main service packages -RUN pip3 install flask flask-cors flask-swagger-ui pyyaml librosa gevent -RUN git clone https://github.com/irebai/pyBK.git /pykaldi/tools/pyBK \ - && cp /pykaldi/tools/pyBK/diarizationFunctions.py . +RUN pip3 install flask flask-cors flask-swagger-ui gevent pyyaml && \ + apt-get install -y ffmpeg -# Set environment variables -ENV PATH /pykaldi/tools/kaldi/egs/wsj/s5/utils/:$PATH +# build VOSK KALDI +COPY vosk-api /opt/vosk-api +RUN cd /opt/vosk-api/python && \ + export KALDI_ROOT=/opt/kaldi && \ + export KALDI_MKL=1 && \ + python3 setup.py install --user --single-version-externally-managed --root=/ + +# Define the main folder +WORKDIR /usr/src/speech-to-text +COPY pyBK/diarizationFunctions.py pyBK/diarizationFunctions.py COPY tools.py . COPY run.py . 
diff --git a/Jenkinsfile b/Jenkinsfile index d027c84..b4bdffc 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -24,7 +24,6 @@ pipeline { docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { image.push("${VERSION}") image.push('latest') - image.push('offline') } } } diff --git a/RELEASE.md b/RELEASE.md index 8712413..e190830 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,8 @@ +# 3.1.1 +- Change Pykaldi with vosk-API (no python wrapper for decoding function, no extrat packages during installation, c++ implementation based on kaldi functions) +- New feature: Compute a confidence score per transcription +- Fix minor bugs + # 2.2.1 - Fix minor bugs - put SWAGGER_PATH parameter as optional diff --git a/document/swagger.yml b/document/swagger.yml index 3db05a0..b52b52c 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -24,7 +24,7 @@ paths: parameters: - name: "file" in: "formData" - description: "Audio File (wav, mp3, flac, ogg, wma, m4a)" + description: "Audio File - Waveform Audio File Format is required. Best configuration (16KHz, 16b, mono)" required: true type: "file" responses: diff --git a/pyBK b/pyBK new file mode 160000 index 0000000..1e5dc7d --- /dev/null +++ b/pyBK @@ -0,0 +1 @@ +Subproject commit 1e5dc7de4e0a7d43a44152a68beca0699c14fd4c diff --git a/run.py b/run.py old mode 100755 new mode 100644 index e485961..48cf57d --- a/run.py +++ b/run.py @@ -2,73 +2,95 @@ # -*- coding: utf-8 -*- from flask import Flask, request, abort, Response, json -from tools import ASR, SttStandelone +from vosk import Model, KaldiRecognizer +from tools import Worker from time import gmtime, strftime from gevent.pywsgi import WSGIServer import os +from gevent.pywsgi import WSGIServer + + + app = Flask("__stt-standelone-worker__") -stt = SttStandelone() +# create WorkerStreaming object +worker = Worker() # Load ASR models (acoustic model and decoding graph) -stt.log.info('Load acoustic model and decoding graph') -asr = ASR(stt.AM_PATH, stt.LM_PATH, stt.CONFIG_FILES_PATH) - +worker.log.info('Load acoustic model and decoding graph') +model = Model(worker.AM_PATH, worker.LM_PATH, + worker.CONFIG_FILES_PATH+"/online.conf") +spkModel = None +# API @app.route('/transcribe', methods=['POST']) def transcribe(): try: - stt.log.info('[%s] New user entry on /transcribe' % (strftime("%d/%b/%d %H:%M:%S", gmtime()))) - - #get response content type - metadata = False + worker.log.info('[%s] New user entry on /transcribe' % + (strftime("%d/%b/%d %H:%M:%S", gmtime()))) + + is_metadata = False + + # get response content type if request.headers.get('accept').lower() == 'application/json': - metadata = True + is_metadata = True elif request.headers.get('accept').lower() == 'text/plain': - metadata = False + is_metadata = False else: raise ValueError('Not accepted header') - #get input file + # get input file if 'file' in request.files.keys(): file = request.files['file'] - stt.read_audio(file,asr.get_sample_rate()) - output = stt.run(asr, metadata) + worker.getAudio(file) + rec = KaldiRecognizer(model, spkModel, worker.rate, worker.ONLINE) + rec.AcceptWaveform(worker.data) + data_ = rec.FinalResult() + confidence = rec.uttConfidence() + if is_metadata: + data_ = rec.GetMetadata() + data = worker.get_response(data_, confidence, is_metadata) + worker.clean() else: raise ValueError('No audio file was uploaded') - return output, 200 + return data, 200 except ValueError as error: return str(error), 400 except Exception as e: - app.logger.error(e) + worker.log.error(e) return 'Server Error', 
500 + + # Rejected request handlers @app.errorhandler(405) def method_not_allowed(error): return 'The method is not allowed for the requested URL', 405 + @app.errorhandler(404) def page_not_found(error): return 'The requested URL was not found', 404 + @app.errorhandler(500) def server_error(error): - app.logger.error(error) + worker.log.error(error) return 'Server Error', 500 + if __name__ == '__main__': try: # start SwaggerUI - if os.path.exists(stt.SWAGGER_PATH): - stt.swaggerUI(app) + if worker.SWAGGER_PATH != '': + worker.swaggerUI(app) + # Run server - #Run server - app.logger.info('Server ready for transcription...') - http_server = WSGIServer(('', stt.SERVICE_PORT), app) + http_server = WSGIServer(('', worker.SERVICE_PORT), app) http_server.serve_forever() + except Exception as e: - app.logger.error(e) - exit(e) \ No newline at end of file + worker.log.error(e) + exit(e) diff --git a/tools.py b/tools.py index 3129d99..8844e48 100644 --- a/tools.py +++ b/tools.py @@ -1,190 +1,268 @@ -# Kaldi ASR decoder -from kaldi.asr import NnetLatticeFasterOnlineRecognizer -from kaldi.decoder import (LatticeFasterDecoderOptions, - LatticeFasterOnlineDecoder) -from kaldi.nnet3 import NnetSimpleLoopedComputationOptions -from kaldi.online2 import (OnlineEndpointConfig, - OnlineIvectorExtractorAdaptationState, - OnlineNnetFeaturePipelineConfig, - OnlineNnetFeaturePipelineInfo, - OnlineNnetFeaturePipeline, - OnlineSilenceWeighting) -from kaldi.util.options import ParseOptions -from kaldi.util.table import SequentialWaveReader -from kaldi.matrix import Matrix, Vector -############## +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- -# word to CTM -from kaldi.lat.align import (WordBoundaryInfoNewOpts, - WordBoundaryInfo, - word_align_lattice) -from kaldi.lat.functions import (compact_lattice_to_word_alignment, - compact_lattice_shortest_path) -from kaldi.asr import NnetRecognizer -import kaldi.fstext as _fst +#  ASR +from vosk import Model, KaldiRecognizer ############## # Speaker Diarization -from diarizationFunctions import * -import numpy as np +from pyBK.diarizationFunctions import * import librosa -from kaldi.ivector import (compute_vad_energy, - VadEnergyOptions) -from kaldi.feat.mfcc import Mfcc, MfccOptions -from kaldi.util.options import ParseOptions +import time +import webrtcvad ############## # other packages -import configparser, sys, os, re, time, logging, yaml, uuid +import configparser +import librosa +import logging +import os +import re +import uuid +import json +import yaml +import numpy as np +from scipy.io import wavfile from flask_swagger_ui import get_swaggerui_blueprint ############## -class ASR: - def __init__(self, AM_PATH, LM_PATH, CONFIG_FILES_PATH): - self.log = logging.getLogger('__stt-standelone-worker__.ASR') - self.AM_PATH = AM_PATH - self.LM_PATH = LM_PATH - self.CONFIG_FILES_PATH = CONFIG_FILES_PATH - self.LoadModels() - - def LoadModels(self): - try: - # Define online feature pipeline - po = ParseOptions("") - - decoder_opts = LatticeFasterDecoderOptions() - self.endpoint_opts = OnlineEndpointConfig() - self.decodable_opts = NnetSimpleLoopedComputationOptions() - feat_opts = OnlineNnetFeaturePipelineConfig() - - - decoder_opts.register(po) - self.endpoint_opts.register(po) - self.decodable_opts.register(po) - feat_opts.register(po) - - po.read_config_file(self.CONFIG_FILES_PATH+"/online.conf") - self.feat_info = OnlineNnetFeaturePipelineInfo.from_config( - feat_opts) - - # Set metadata parameters - self.samp_freq = self.feat_info.mfcc_opts.frame_opts.samp_freq - 
self.frame_shift = self.feat_info.mfcc_opts.frame_opts.frame_shift_ms / 1000 - self.acwt = self.decodable_opts.acoustic_scale - - # Load Acoustic and graph models and other files - self.transition_model, self.acoustic_model = NnetRecognizer.read_model( - self.AM_PATH+"/final.mdl") - graph = _fst.read_fst_kaldi(self.LM_PATH+"/HCLG.fst") - self.decoder_graph = LatticeFasterOnlineDecoder( - graph, decoder_opts) - self.symbols = _fst.SymbolTable.read_text( - self.LM_PATH+"/words.txt") - self.info = WordBoundaryInfo.from_file( - WordBoundaryInfoNewOpts(), self.LM_PATH+"/word_boundary.int") - - - self.asr = NnetLatticeFasterOnlineRecognizer(self.transition_model, self.acoustic_model, self.decoder_graph, - self.symbols, decodable_opts=self.decodable_opts, endpoint_opts=self.endpoint_opts) - del graph, decoder_opts - except Exception as e: - self.log.error(e) - raise ValueError( - "AM and LM loading failed!!! (see logs for more details)") +class Worker: + def __init__(self): + # Set logger config + self.log = logging.getLogger("__stt-standelone-worker__") + logging.basicConfig(level=logging.INFO) - def get_sample_rate(self): - return self.samp_freq + # Main parameters + self.AM_PATH = '/opt/models/AM' + self.LM_PATH = '/opt/models/LM' + self.TEMP_FILE_PATH = '/opt/tmp' + self.CONFIG_FILES_PATH = '/opt/config' + self.SAVE_AUDIO = False + self.SERVICE_PORT = 80 + self.NBR_THREADS = 100 + self.SWAGGER_URL = '/api-doc' + self.SWAGGER_PATH = '' + self.ONLINE = False - def get_frames(self, feat_pipeline): - rows = feat_pipeline.num_frames_ready() - cols = feat_pipeline.dim() - frames = Matrix(rows, cols) - feat_pipeline.get_frames(range(rows), frames) - return frames[:, :self.feat_info.mfcc_opts.num_ceps], frames[:, self.feat_info.mfcc_opts.num_ceps:] - # return feats + ivectors + if not os.path.isdir(self.CONFIG_FILES_PATH): + os.mkdir(self.CONFIG_FILES_PATH) - def compute_feat(self, wav): - try: - feat_pipeline = OnlineNnetFeaturePipeline(self.feat_info) - feat_pipeline.accept_waveform(self.samp_freq, wav) - feat_pipeline.input_finished() - except Exception as e: - self.log.error(e) - raise ValueError("Feature extraction failed!!!") - else: - return feat_pipeline + if not os.path.isdir(self.TEMP_FILE_PATH): + os.mkdir(self.TEMP_FILE_PATH) + + # Environment parameters + if 'NBR_THREADS' in os.environ: + if int(os.environ['NBR_THREADS']) > 0: + self.NBR_THREADS = int(os.environ['NBR_THREADS']) + else: + self.log.warning( + "You must to provide a positif number of threads 'NBR_THREADS'") + if 'SWAGGER_PATH' in os.environ: + self.SWAGGER_PATH = os.environ['SWAGGER_PATH'] + + # start loading ASR configuration + self.log.info("Create the new config files") + self.loadConfig() + + def swaggerUI(self, app): + ### swagger specific ### + swagger_yml = yaml.load( + open(self.SWAGGER_PATH, 'r'), Loader=yaml.Loader) + swaggerui = get_swaggerui_blueprint( + # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' + self.SWAGGER_URL, + self.SWAGGER_PATH, + config={ # Swagger UI config overrides + 'app_name': "STT API Documentation", + 'spec': swagger_yml + } + ) + app.register_blueprint(swaggerui, url_prefix=self.SWAGGER_URL) + ### end swagger specific ### - def decoder(self, feats): + def getAudio(self, file): + filename = str(uuid.uuid4()) + self.file_path = self.TEMP_FILE_PATH+"/"+filename + file.save(self.file_path) try: - start_time = time.time() - self.log.info("Start Decoding: %s" % (start_time)) - self.asr.set_input_pipeline(feats) - decode = self.asr.decode() - self.log.info("Decode time in 
seconds: %s" % - (time.time() - start_time)) + self.rate, self.data = wavfile.read(self.file_path) + # if stereo file, convert to mono by computing the mean of the channels + if len(self.data.shape) == 2 and self.data.shape[1] == 2: + self.data = np.mean(self.data, axis=1, dtype=np.int16) except Exception as e: self.log.error(e) - raise ValueError("Decoder failed to transcribe the input audio!!!") + raise ValueError("The uploaded file format is not supported!!!") + + def clean(self): + if not self.SAVE_AUDIO: + os.remove(self.file_path) + + # re-create config files + def loadConfig(self): + # load decoder parameters from "decode.cfg" + decoder_settings = configparser.ConfigParser() + if os.path.exists(self.AM_PATH+'/decode.cfg') == False: + return False + decoder_settings.read(self.AM_PATH+'/decode.cfg') + + # Prepare "online.conf" + self.AM_PATH = self.AM_PATH+"/" + \ + decoder_settings.get('decoder_params', 'ampath') + with open(self.AM_PATH+"/conf/online.conf") as f: + values = f.readlines() + with open(self.CONFIG_FILES_PATH+"/online.conf", 'w') as f: + for i in values: + f.write(i) + f.write("--ivector-extraction-config=" + + self.CONFIG_FILES_PATH+"/ivector_extractor.conf\n") + f.write("--mfcc-config="+self.AM_PATH+"/conf/mfcc.conf\n") + f.write( + "--beam="+decoder_settings.get('decoder_params', 'beam')+"\n") + f.write( + "--lattice-beam="+decoder_settings.get('decoder_params', 'lattice_beam')+"\n") + f.write("--acoustic-scale=" + + decoder_settings.get('decoder_params', 'acwt')+"\n") + f.write( + "--min-active="+decoder_settings.get('decoder_params', 'min_active')+"\n") + f.write( + "--max-active="+decoder_settings.get('decoder_params', 'max_active')+"\n") + f.write("--frame-subsampling-factor="+decoder_settings.get( + 'decoder_params', 'frame_subsampling_factor')+"\n") + + # Prepare "ivector_extractor.conf" + with open(self.AM_PATH+"/conf/ivector_extractor.conf") as f: + values = f.readlines() + with open(self.CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: + for i in values: + f.write(i) + f.write("--splice-config="+self.AM_PATH+"/conf/splice.conf\n") + f.write("--cmvn-config="+self.AM_PATH + + "/conf/online_cmvn.conf\n") + f.write("--lda-matrix="+self.AM_PATH + + "/ivector_extractor/final.mat\n") + f.write("--global-cmvn-stats="+self.AM_PATH + + "/ivector_extractor/global_cmvn.stats\n") + f.write("--diag-ubm="+self.AM_PATH + + "/ivector_extractor/final.dubm\n") + f.write("--ivector-extractor="+self.AM_PATH + + "/ivector_extractor/final.ie") + + # Prepare "word_boundary.int" if not exist + if not os.path.exists(self.LM_PATH+"/word_boundary.int") and os.path.exists(self.AM_PATH+"/phones.txt"): + self.log.info("Create word_boundary.int based on phones.txt") + with open(self.AM_PATH+"/phones.txt") as f: + phones = f.readlines() + + with open(self.LM_PATH+"/word_boundary.int", "w") as f: + for phone in phones: + phone = phone.strip() + phone = re.sub('^ .*', '', phone) + phone = re.sub('^#\d+ .*', '', phone) + if phone != '': + id = phone.split(' ')[1] + if '_I ' in phone: + f.write(id+" internal\n") + elif '_B ' in phone: + f.write(id+" begin\n") + elif '_E ' in phone: + f.write(id+" end\n") + elif '_S ' in phone: + f.write(id+" singleton\n") + else: + f.write(id+" nonword\n") + + # remove extra symbols + def parse_text(self, text): + text = re.sub(r"", "", text) # remove symbol + text = re.sub(r"#nonterm:[^ ]* ", "", text) # remove entity's mark + text = re.sub(r"' ", "'", text) # remove space after quote ' + text = re.sub(r" +", " ", text) # remove multiple spaces + text = 
text.strip() + return text + + # Postprocess response + def get_response(self, dataJson, confidence, is_metadata): + if dataJson is not None: + data = json.loads(dataJson) + data['conf'] = confidence + if not is_metadata: + text = data['text'] # get text from response + return self.parse_text(text) + + elif 'words' in data: + # Do speaker diarization and get speaker segments + spk = SpeakerDiarization() + spkrs = spk.run(self.file_path) + + # Generate final output data + return self.process_output(data, spkrs) + elif 'text' in data: + return {'speakers': [], 'text': data['text'], 'confidence-score': data['conf'], 'words': []} + else: + return {'speakers': [], 'text': '', 'confidence-score': 0, 'words': []} else: - return decode + return {'speakers': [], 'text': '', 'confidence-score': 0, 'words': []} + + # return a json object including word-data, speaker-data - def wordTimestamp(self, text, lattice, frame_shift, frame_subsampling): + def process_output(self, data, spkrs): try: - _fst.utils.scale_compact_lattice( - [[1.0, 0], [0, float(self.acwt)]], lattice) - bestPath = compact_lattice_shortest_path(lattice) - _fst.utils.scale_compact_lattice( - [[1.0, 0], [0, 1.0/float(self.acwt)]], bestPath) - bestLattice = word_align_lattice( - bestPath, self.transition_model, self.info, 0) - alignment = compact_lattice_to_word_alignment(bestLattice[1]) - words = _fst.indices_to_symbols(self.symbols, alignment[0]) - start = alignment[1] - dur = alignment[2] - - output = {} - output["words"] = [] - for i in range(len(words)): - meta = {} - meta["word"] = words[i] - meta["start"] = round(start[i] * frame_shift * frame_subsampling, 2) - meta["end"] = round((start[i]+dur[i]) * frame_shift * frame_subsampling, 2) - output["words"].append(meta) - text += " "+meta["word"] - output["text"] = text + speakers = [] + text = [] + i = 0 + text_ = "" + words = [] + for word in data['words']: + if i+1 == len(spkrs): + continue + if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: + text_ += word["word"] + " " + words.append(word) + elif len(words) != 0: + speaker = {} + speaker["start"] = words[0]["start"] + speaker["end"] = words[len(words)-1]["end"] + speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) + speaker["words"] = words + + text.append( + 'spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) + speakers.append(speaker) + + words = [word] + text_ = word["word"] + " " + i += 1 + else: + words = [word] + text_ = word["word"] + " " + i += 1 - except Exception as e: - self.log.error(e) - raise ValueError("Decoder failed to create the word timestamps!!!") - else: - return output + speaker = {} + speaker["start"] = words[0]["start"] + speaker["end"] = words[len(words)-1]["end"] + speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) + speaker["words"] = words + + text.append('spk'+str(int(spkrs[i][2])) + + ' : ' + self.parse_text(text_)) + speakers.append(speaker) + + return {'speakers': speakers, 'text': text, 'confidence-score': data['conf']} + except: + return {'text': data['text'], 'words': data['words'], 'confidence-score': data['conf'], 'spks': []} class SpeakerDiarization: - def __init__(self, sample_rate): + def __init__(self): self.log = logging.getLogger( '__stt-standelone-worker__.SPKDiarization') - # MFCC FEATURES PARAMETERS - self.sr = sample_rate + # MFCC FEATURES PARAMETERS self.frame_length_s = 0.025 self.frame_shift_s = 0.01 - self.num_bins = 40 - self.num_ceps = 40 - self.low_freq = 40 - self.high_freq = -200 - if self.sr == 16000: - self.low_freq = 20 - self.high_freq = 7600 - ##### - - # 
VAD PARAMETERS - self.vad_ops = VadEnergyOptions() - self.vad_ops.vad_energy_mean_scale = 0.9 - self.vad_ops.vad_energy_threshold = 5 - #vad_ops.vad_frames_context = 2 - #vad_ops.vad_proportion_threshold = 0.12 + self.num_bins = 30 + self.num_ceps = 30 ##### # Segment @@ -237,85 +315,52 @@ def __init__(self, sample_rate): self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames ###### - def compute_feat_KALDI(self, wav): + def compute_feat_Librosa(self, audioFile): try: - po = ParseOptions("") - mfcc_opts = MfccOptions() - mfcc_opts.use_energy = False - mfcc_opts.frame_opts.samp_freq = self.sr - mfcc_opts.frame_opts.frame_length_ms = self.frame_length_s*1000 - mfcc_opts.frame_opts.frame_shift_ms = self.frame_shift_s*1000 - mfcc_opts.frame_opts.allow_downsample = False - mfcc_opts.mel_opts.num_bins = self.num_bins - mfcc_opts.mel_opts.low_freq = self.low_freq - mfcc_opts.mel_opts.high_freq = self.high_freq - mfcc_opts.num_ceps = self.num_ceps - mfcc_opts.register(po) - - # Create MFCC object and obtain sample frequency - mfccObj = Mfcc(mfcc_opts) - mfccKaldi = mfccObj.compute_features(wav, self.sr, 1.0) + self.data, self.sr = librosa.load(audioFile, sr=None) + frame_length_inSample = self.frame_length_s * self.sr + hop = int(self.frame_shift_s * self.sr) + NFFT = int(2**np.ceil(np.log2(frame_length_inSample))) + if self.sr >= 16000: + mfccNumpy = librosa.feature.mfcc(y=self.data, + sr=self.sr, + dct_type=2, + n_mfcc=self.num_ceps, + n_mels=self.num_bins, + n_fft=NFFT, + hop_length=hop, + fmin=20, + fmax=7600).T + else: + mfccNumpy = librosa.feature.mfcc(y=self.data, + sr=self.sr, + dct_type=2, + n_mfcc=self.num_ceps, + n_mels=self.num_bins, + n_fft=NFFT, + hop_length=hop).T + except Exception as e: self.log.error(e) raise ValueError( - "Speaker diarization failed while extracting features!!!") + "Speaker diarization failed when extracting features!!!") else: - return mfccKaldi + return mfccNumpy - def computeVAD_KALDI(self, feats): + def computeVAD_WEBRTC(self, data, sr, nFeatures): try: - vadStream = compute_vad_energy(self.vad_ops, feats) - vad = Vector(vadStream) - VAD = vad.numpy() - - #  segmentation - occurence = [] - value = [] - occurence.append(1) - value.append(VAD[0]) - - # compute the speech and non-speech frames - for i in range(1, len(VAD)): - if value[-1] == VAD[i]: - occurence[-1] += 1 - else: - occurence.append(1) - value.append(VAD[i]) - - # filter the speech and non-speech segments that are below 30 frames - i = 0 - while(i < len(occurence)): - if i != 0 and (occurence[i] < 30 or value[i-1] == value[i]): - occurence[i-1] += occurence[i] - del value[i] - del occurence[i] - else: - i += 1 - - # split if and only if the silence is above 50 frames - i = 0 - while(i < len(occurence)): - if i != 0 and ((occurence[i] < 30 and value[i] == 0.0) or value[i-1] == value[i]): - occurence[i-1] += occurence[i] - del value[i] - del occurence[i] - else: - i += 1 - - # compute VAD mask - maskSAD = np.zeros(len(VAD)) - start = 0 - for i in range(len(occurence)): - if value[i] == 1.0: - end = start+occurence[i] - maskSAD[start:end] = 1 - start = end - else: - start += occurence[i] - - maskSAD = np.expand_dims(maskSAD, axis=0) - except ValueError as v: - self.log.error(v) + if sr not in [8000, 16000, 32000, 48000]: + data = librosa.resample(data, sr, 16000) + sr = 16000 + + va_framed = py_webrtcvad( + data, fs=sr, fs_vad=sr, hoplength=30, vad_mode=0) + segments = get_py_webrtcvad_segments(va_framed, sr) + maskSAD = np.zeros([1, nFeatures]) + for seg in segments: + start 
= int(np.round(seg[0]/self.frame_shift_s)) + end = int(np.round(seg[1]/self.frame_shift_s)) + maskSAD[0][start:end] = 1 except Exception as e: self.log.error(e) raise ValueError( @@ -323,7 +368,7 @@ def computeVAD_KALDI(self, feats): else: return maskSAD - def run(self, wav, dur, feats=None): + def run(self, audioFile): try: def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): numberOfSpeechFeatures = finalSegmentTable[-1, 2].astype(int)+1 @@ -357,18 +402,19 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): seg[0][0] = 0.0 return seg - start_time = time.time() - self.log.info("Start Speaker Diarization: %s" % (start_time)) - if self.maxNrSpeakers == 1 or dur < 5: - self.log.info("Speaker Diarization time in seconds: %s" % - (time.time() - start_time)) - return [[0, dur, 1], - [dur, -1, -1]] - if feats == None: - feats = self.compute_feat_KALDI(wav) + + self.log.info('Start Speaker diarization') + + feats = self.compute_feat_Librosa(audioFile) nFeatures = feats.shape[0] - maskSAD = self.computeVAD_KALDI(feats) + duration = nFeatures * self.frame_shift_s + + if duration < 5: + return [[0, duration, 1], + [duration, -1, -1]] + + maskSAD = self.computeVAD_WEBRTC(self.data, self.sr, nFeatures) maskUEM = np.ones([1, nFeatures]) mask = np.logical_and(maskUEM, maskSAD) @@ -393,8 +439,9 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): windowRate = int(self.maximumKBMWindowRate) if windowRate == 0: - raise ValueError( - 'The audio is to short in order to perform the speaker diarization!!!') + #self.log.info('The audio is to short in order to perform the speaker diarization!!!') + return [[0, duration, 1], + [duration, -1, -1]] poolSize = np.floor((nSpeechFeatures-self.windowLength)/windowRate) if self.useRelativeKBMsize: @@ -436,217 +483,21 @@ def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): if self.resegmentation and np.size(np.unique(finalClusteringTable[:, bestClusteringID.astype(int)-1]), 0) > 1: finalClusteringTableResegmentation, finalSegmentTable = performResegmentation(data, speechMapping, mask, finalClusteringTable[:, bestClusteringID.astype( int)-1], segmentTable, self.modelSize, self.nbIter, self.smoothWin, nSpeechFeatures) - seg = getSegments(self.frame_shift_s, finalSegmentTable, np.squeeze(finalClusteringTableResegmentation), dur) + seg = getSegments(self.frame_shift_s, finalSegmentTable, np.squeeze( + finalClusteringTableResegmentation), duration) else: - seg = getSegmentationFile( - self.frame_shift_s, segmentTable, finalClusteringTable[:, bestClusteringID.astype(int)-1]) - self.log.info("Speaker Diarization time in seconds: %s" % - (time.time() - start_time)) + return [[0, duration, 1], + [duration, -1, -1]] + + self.log.info("Speaker Diarization time in seconds: %d" % + int(time.time() - start_time)) except ValueError as v: - self.log.info(v) - return [[0, dur, 1], - [dur, -1, -1]] + self.log.error(v) + return [[0, duration, 1], + [duration, -1, -1]] except Exception as e: self.log.error(e) - raise ValueError("Speaker Diarization failed!!!") + return [[0, duration, 1], + [duration, -1, -1]] else: return seg - - -class SttStandelone: - def __init__(self): - self.log = logging.getLogger("__stt-standelone-worker-streaming__") - logging.basicConfig(level=logging.INFO) - - # Main parameters - self.AM_PATH = '/opt/models/AM' - self.LM_PATH = '/opt/models/LM' - self.TEMP_FILE_PATH = '/opt/tmp' - self.CONFIG_FILES_PATH = '/opt/config' - self.SAVE_AUDIO = False - self.SERVICE_PORT = 80 - 
self.SWAGGER_URL = '/api-doc' - self.SWAGGER_PATH = None - - if not os.path.isdir(self.TEMP_FILE_PATH): - os.mkdir(self.TEMP_FILE_PATH) - if not os.path.isdir(self.CONFIG_FILES_PATH): - os.mkdir(self.CONFIG_FILES_PATH) - - # Environment parameters - if 'SERVICE_PORT' in os.environ: - self.SERVICE_PORT = os.environ['SERVICE_PORT'] - if 'SAVE_AUDIO' in os.environ: - self.SAVE_AUDIO = os.environ['SAVE_AUDIO'] - if 'SWAGGER_PATH' in os.environ: - self.SWAGGER_PATH = os.environ['SWAGGER_PATH'] - - self.loadConfig() - - def loadConfig(self): - # get decoder parameters from "decode.cfg" - decoder_settings = configparser.ConfigParser() - if not os.path.exists(self.AM_PATH+'/decode.cfg'): - return False - decoder_settings.read(self.AM_PATH+'/decode.cfg') - - # Prepare "online.conf" - self.AM_PATH = self.AM_PATH+"/" + \ - decoder_settings.get('decoder_params', 'ampath') - with open(self.AM_PATH+"/conf/online.conf") as f: - values = f.readlines() - with open(self.CONFIG_FILES_PATH+"/online.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--ivector-extraction-config=" + - self.CONFIG_FILES_PATH+"/ivector_extractor.conf\n") - f.write("--mfcc-config="+self.AM_PATH+"/conf/mfcc.conf\n") - f.write( - "--beam="+decoder_settings.get('decoder_params', 'beam')+"\n") - f.write( - "--lattice-beam="+decoder_settings.get('decoder_params', 'lattice_beam')+"\n") - f.write("--acoustic-scale=" + - decoder_settings.get('decoder_params', 'acwt')+"\n") - f.write( - "--min-active="+decoder_settings.get('decoder_params', 'min_active')+"\n") - f.write( - "--max-active="+decoder_settings.get('decoder_params', 'max_active')+"\n") - f.write("--frame-subsampling-factor="+decoder_settings.get( - 'decoder_params', 'frame_subsampling_factor')+"\n") - - # Prepare "ivector_extractor.conf" - with open(self.AM_PATH+"/conf/ivector_extractor.conf") as f: - values = f.readlines() - with open(self.CONFIG_FILES_PATH+"/ivector_extractor.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--splice-config="+self.AM_PATH+"/conf/splice.conf\n") - f.write("--cmvn-config="+self.AM_PATH + - "/conf/online_cmvn.conf\n") - f.write("--lda-matrix="+self.AM_PATH + - "/ivector_extractor/final.mat\n") - f.write("--global-cmvn-stats="+self.AM_PATH + - "/ivector_extractor/global_cmvn.stats\n") - f.write("--diag-ubm="+self.AM_PATH + - "/ivector_extractor/final.dubm\n") - f.write("--ivector-extractor="+self.AM_PATH + - "/ivector_extractor/final.ie") - - # Prepare "word_boundary.int" if not exist - if not os.path.exists(self.LM_PATH+"/word_boundary.int") and os.path.exists(self.AM_PATH+"phones.txt"): - with open(self.AM_PATH+"phones.txt") as f: - phones = f.readlines() - - with open(self.LM_PATH+"/word_boundary.int", "w") as f: - for phone in phones: - phone = phone.strip() - phone = re.sub('^ .*', '', phone) - phone = re.sub('^#\d+ .*', '', phone) - if phone != '': - id = phone.split(' ')[1] - if '_I ' in phone: - f.write(id+" internal\n") - elif '_B ' in phone: - f.write(id+" begin\n") - elif '_E ' in phone: - f.write(id+" end\n") - elif '_S ' in phone: - f.write(id+" singleton\n") - else: - f.write(id+" nonword\n") - - def swaggerUI(self, app): - ### swagger specific ### - swagger_yml = yaml.load( - open(self.SWAGGER_PATH, 'r'), Loader=yaml.Loader) - swaggerui = get_swaggerui_blueprint( - # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' - self.SWAGGER_URL, - self.SWAGGER_PATH, - config={ # Swagger UI config overrides - 'app_name': "STT API Documentation", - 'spec': swagger_yml - } - ) - app.register_blueprint(swaggerui, 
url_prefix=self.SWAGGER_URL) - ### end swagger specific ### - - def read_audio(self, file, sample_rate): - filename = str(uuid.uuid4()) - file_path = self.TEMP_FILE_PATH+"/"+filename - file.save(file_path) - try: - data, sr = librosa.load(file_path, sr=None) - if sr != sample_rate: - self.log.info('Resample audio file: '+str(sr) + - 'Hz -> '+str(sample_rate)+'Hz') - data = librosa.resample(data, sr, sample_rate) - data = (data * 32767).astype(np.int16) - self.dur = len(data) / sample_rate - self.data = Vector(data) - except Exception as e: - self.log.error(e) - raise ValueError("The uploaded file format is not supported!!!") - finally: - if not self.SAVE_AUDIO: - os.remove(file_path) - - def run(self, asr, metadata): - feats = asr.compute_feat(self.data) - mfcc, ivector = asr.get_frames(feats) - decode = asr.decoder(feats) - if metadata: - spk = SpeakerDiarization(asr.get_sample_rate()) - spkSeg = spk.run(self.data, self.dur, mfcc) - data = asr.wordTimestamp(decode["text"], decode['lattice'], asr.frame_shift, asr.decodable_opts.frame_subsampling_factor) - output = self.process_output(data, spkSeg) - return output - else: - return self.parse_text(decode["text"]) - - - # return a json object including word-data, speaker-data - def process_output(self, data, spkrs): - speakers = [] - text = [] - i = 0 - text_ = "" - words=[] - for word in data['words']: - if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: - text_ += word["word"] + " " - words.append(word) - else: - speaker = {} - speaker["start"]=words[0]["start"] - speaker["end"]=words[len(words)-1]["end"] - speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) - speaker["words"]=words - - text.append('spk'+str(int(spkrs[i][2]))+' : '+ self.parse_text(text_)) - speakers.append(speaker) - - words=[word] - text_=word["word"] + " " - i+=1 - - speaker = {} - speaker["start"]=words[0]["start"] - speaker["end"]=words[len(words)-1]["end"] - speaker["speaker_id"]='spk'+str(int(spkrs[i][2])) - speaker["words"]=words - - text.append('spk'+str(int(spkrs[i][2]))+' : '+ self.parse_text(text_)) - speakers.append(speaker) - - return {'speakers': speakers, 'text': text} - - # remove extra symbols - def parse_text(self, text): - text = re.sub(r"", "", text) # remove symbol - text = re.sub(r"#nonterm:[^ ]* ", "", text) # remove entity's mark - text = re.sub(r"", "", text) # remove - text = re.sub(r"' ", "'", text) # remove space after quote ' - text = re.sub(r" +", " ", text) # remove multiple spaces - text = text.strip() - return text diff --git a/vosk-api b/vosk-api new file mode 160000 index 0000000..7f555e4 --- /dev/null +++ b/vosk-api @@ -0,0 +1 @@ +Subproject commit 7f555e464c1d6b16233354491868f46d009c453c From a78f891fc1cb76e2a55d369af086b5853da90376 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 12 Jan 2021 16:21:23 +0100 Subject: [PATCH 041/172] update README --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 43df31e..bd11978 100644 --- a/README.md +++ b/README.md @@ -140,7 +140,8 @@ Convert a speech to text > `post`
> Make a POST request >> Arguments : ->> - **{File} file** : Audio file (file format: wav, mp3, flac, ogg) +>> - **{File} file** Audio File - Waveform Audio File Format is required + > >> Header : >> - **{String} Accept**: response content type (text/plain|application/json) From 5b9ebb6a41be19d628fbe1cef3c0c47bad9ca16c Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Mon, 22 Feb 2021 13:37:40 +0100 Subject: [PATCH 042/172] update --- docker-compose.yml | 2 +- run.py | 8 -------- 2 files changed, 1 insertion(+), 9 deletions(-) diff --git a/docker-compose.yml b/docker-compose.yml index f7da7db..08c14d0 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -5,7 +5,7 @@ services: stt-worker: container_name: stt-standalone-worker build: . - image: lintoai/linto-platform-stt-standalone-worker:latest-unstable + image: lintoai/linto-platform-stt-standalone-worker:latest volumes: - ${AM_PATH}:/opt/models/AM - ${LM_PATH}:/opt/models/LM diff --git a/run.py b/run.py index c40e4f0..8f594d3 100644 --- a/run.py +++ b/run.py @@ -6,14 +6,6 @@ from tools import Worker from time import gmtime, strftime from gevent.pywsgi import WSGIServer -import os - -from gevent.pywsgi import WSGIServer - - - -from gevent.pywsgi import WSGIServer - app = Flask("__stt-standelone-worker__") From ed50dde63ec61d2ea633cbc06e805274c6c92dcc Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Mon, 22 Feb 2021 13:38:33 +0100 Subject: [PATCH 043/172] update --- run.py | 1 - 1 file changed, 1 deletion(-) diff --git a/run.py b/run.py index 5a7fb25..ea77af2 100644 --- a/run.py +++ b/run.py @@ -5,7 +5,6 @@ from vosk import Model, KaldiRecognizer from tools import Worker from time import gmtime, strftime - from gevent.pywsgi import WSGIServer From cf8192af0b1059f0266a29c92615553acac8d362 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Mon, 22 Feb 2021 13:39:06 +0100 Subject: [PATCH 044/172] update --- run.py | 1 - 1 file changed, 1 deletion(-) diff --git a/run.py b/run.py index ea77af2..8f594d3 100644 --- a/run.py +++ b/run.py @@ -8,7 +8,6 @@ from gevent.pywsgi import WSGIServer - app = Flask("__stt-standelone-worker__") # create WorkerStreaming object From ae038127505d4172a147a673a913eb11a7e749dd Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Mon, 22 Feb 2021 14:50:48 +0100 Subject: [PATCH 045/172] update README --- .envdefault | 2 +- README.md | 99 +++++++++++++++++++++++++---------------------------- 2 files changed, 48 insertions(+), 53 deletions(-) diff --git a/.envdefault b/.envdefault index 2246e24..130f6ef 100644 --- a/.envdefault +++ b/.envdefault @@ -1,3 +1,3 @@ AM_PATH=/path/to/acoustic/models/dir LM_PATH=/path/to/language/models/dir -SWAGGER_PATH=/path/to/swagger/file \ No newline at end of file +SWAGGER_PATH=./document/swagger.yml \ No newline at end of file diff --git a/README.md b/README.md index bd11978..a966856 100644 --- a/README.md +++ b/README.md @@ -5,12 +5,21 @@ This service is mandatory in a LinTO platform stack as the main worker for speec Generally, Automatic Speech Recognition (ASR) is the task of recognition and translation of spoken language into text. Our ASR system takes advantages from the recent advances in machine learning technologies and in particular deep learning ones (TDNN, LSTM, attentation-based architecture). The core of our system consists of two main components: an acoustic model and a decoding graph. A high-performance ASR system relies on an accurate acoustic model as well as a perfect decoding graph. 
## Usage -See documentation : [doc.linto.ai](https://doc.linto.ai/#/services/linstt) +See documentation : [doc.linto.ai](https://doc.linto.ai) # Deploy With our proposed stack [linto-platform-stack](https://github.com/linto-ai/linto-platform-stack) +# Hardware requirements +In order to install and run this service, you need to have at least: + * 5Go available on your hard drive for the installation, and + * 500Mo/3Go/7Go of RAM memory available for models loading and decoding. The size depends mainly on the choosed decoding model (small, medium or big). + +While there is no specific minimal requirement on the CPU, speech recognition is a computationally task. + +**`—The better your hardware performance, the lower your decoding time—`** + # Develop ## Installation @@ -20,6 +29,7 @@ To start the LinSTT service on your local machine or your cloud, you need first ```bash git clone https://github.com/linto-ai/linto-platform-stt-standalone-worker +git submodule update --init cd linto-platform-stt-standalone-worker mv .envdefault .env ``` @@ -27,7 +37,7 @@ mv .envdefault .env Then, to build the docker image, execute: ```bash -docker build -t lintoai/linto-platform-stt-standalone-worker . +docker build -t lintoai/linto-platform-stt-standalone-worker:latest . ``` Or by docker-compose, by using: @@ -42,16 +52,12 @@ Or, download the pre-built image from docker-hub: docker pull lintoai/linto-platform-stt-standalone-worker:latest ``` -NOTE: You must install docker on your machine. +NB: You must install docker and docker-compose on your machine. ## Configuration -The LinSTT service that will be set-up here require KALDI models, the acoustic model and the decoding graph. Indeed, these models are not included in the repository; you must download them in order to run LinSTT. You can use our pre-trained models from here: [linstt download](services/linstt_download). +The LinSTT service that will be set-up here require KALDI models, the acoustic model and the decoding graph. Indeed, these models are not included in the repository; you must download them in order to run LinSTT. You can use our pre-trained models from here: [Downloads](https://doc.linto.ai/#/services/linstt_download). -### Outside LinTO-Platform-STT-Service-Manager - -If you want to use our service alone without LinTO-Platform-STT-Service-Manager, you must `unzip` the files and put the extracted ones in the [shared storage](https://doc.linto.ai/#/infra?id=shared-storage). For example, - -1- Download the French acoustic model and the small decoding graph +1- Download the French acoustic model and the small decoding graph (linstt.v1). You can download the latest version for optimal performance and you should make sure that you have the hardware requirement in terms of RAM. 
```bash wget https://dl.linto.ai/downloads/model-distribution/acoustic-models/fr-FR/linSTT_AM_fr-FR_v1.0.0.zip @@ -68,41 +74,31 @@ unzip decoding_graph_fr-FR_Small_v1.1.0.zip -d DG_fr-FR_Small 3- Move the uncompressed files into the shared storage directory ```bash -mv AM_fr-FR ~/linto_shared/data -mv DG_fr-FR_Small ~/linto_shared/data +mkdir ~/linstt_model_storage +mv AM_fr-FR ~/linstt_model_storage +mv DG_fr-FR ~/linstt_model_storage ``` -4- Rename the default environment file `.envdefault` included in the repository `linto-platform-stt-standalone-worker` and configure it by providing the full path of the following parameters: - - AM_PATH=/full/path/to/linto_shared/data/AM_fr-FR - LM_PATH=/full/path/to/linto_shared/data/DG_fr-FR_Small +4- Configure the environment file `.env` included in this repository -5- If you want to use Swagger interface, you need to set the corresponding environment parameter: - SWAGGER_PATH=/full/path/to/swagger/file + AM_PATH=~/linstt_model_storage/AM_fr-FR + LM_PATH=~/linstt_model_storage/DG_fr-FR -NOTE: if you want to use the user interface of the service, you need also to configure the swagger file `document/swagger.yml` included in the repository `linto-platform-stt-standalone-worker`. Specifically, in the section `host`, specify the address of the machine in which the service is deployed. - -### Using LinTO-Platform-STT-Service-Manager -In case you want to use `LinTO-Platform-STT-Service-Manager`, you need to: - -1- Create an acoustic model and upload the approriate file - -2- Create a language model and upload the corresponding decoding graph - -3- Configure the environment file of this service. - -For more details, see instructions in [LinTO - STT-Manager](https://doc.linto.ai/#/services/stt_manager) +NB: if you want to use the visual user interface of the service, you need also to configure the swagger file `document/swagger.yml` included in this repository. Specifically, in the section `host`, specify the adress of the machine in which the service is deployed. ## Execute -In order to run the service alone, you have only to execute: +In order to run the service, you have only to execute: ```bash cd linto-platform-stt-standalone-worker -docker-compose up +docker run -p 8888:80 -v /full/path/to/linstt_model_storage/AM_fr-FR:/opt/models/AM -v /full/path/to/linstt_model_storage/DG_fr-FR:/opt/models/LM -v /full/path/to/linto-platform-stt-standalone-worker/document/swagger.yml:/opt/swagger.yml -e SWAGGER_PATH="/opt/swagger.yml" lintoai/linto-platform-stt-standalone-worker:latest ``` -Then you can acces it on [localhost:8888](localhost:8888) -To run and manager LinSTT under `LinTO-Platform-STT-Service-Manager` service, you need to create a service first and then to start it. See [LinTO - STT-Manager](https://doc.linto.ai/#/services/stt_manager_how2use?id=how-to-use-it) +or simply by executing: +```bash +cd linto-platform-stt-standalone-worker +docker-compose up +``` Our service requires an audio file in `Waveform format`. It should has the following parameters: @@ -112,27 +108,10 @@ Our service requires an audio file in `Waveform format`. It should has the follo - microphone: any type - duration: <30 minutes -Other formats are also supported: mp3, aiff, flac, and ogg. 
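For reference, here is a minimal client sketch for the `/transcribe` route documented below. It assumes the worker is reachable on `localhost:8888` (as in the `docker run` example above) and that the `requests` package is available on the client side; `audio.wav` is only a placeholder file name.

```python
# Sketch of a /transcribe call; "audio.wav" is a placeholder for any 16kHz, 16-bit mono WAV.
import requests

with open("audio.wav", "rb") as audio:
    response = requests.post(
        "http://localhost:8888/transcribe",
        files={"file": ("audio.wav", audio, "audio/wav")},
        headers={"Accept": "application/json"},  # "text/plain" returns the raw transcription only
    )

response.raise_for_status()
print(response.json())
```

With `Accept: application/json` the body follows the shape built in `tools.py` (`speakers`, `text`, `confidence-score`, `words`); with `Accept: text/plain` only the cleaned transcription string is returned.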
- -### Run Example Applications -To run an automated test go to the test folder - -```bash -cd linto-platform-stt-standalone-worker/test -``` - -And run the test script: - -```bash -./test_deployment.sh -``` - -Or use swagger interface to perform your personal test: localhost:8888/api-doc/ - - +### API -#### ** /transcribe ** +#### /transcribe Convert a speech to text @@ -149,3 +128,19 @@ Convert a speech to text > **{text|Json}** : Return the full transcription or a json object with metadata + + +### Run Example Applications +To run an automated test, go to the test folder: + +```bash +cd linto-platform-stt-standalone-worker/test +``` + +And run the test script: + +```bash +./test_deployment.sh +``` + +To run personal test, you can use swagger interface: `localhost:8888/api-doc/` \ No newline at end of file From 8674379d28993e919ea5462d9481d6ced4d2c641 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Mon, 22 Feb 2021 14:54:13 +0100 Subject: [PATCH 046/172] update README --- .envdefault | 3 +- README.md | 80 ++++++++++++++++++++++++++++++++++------------------- 2 files changed, 52 insertions(+), 31 deletions(-) diff --git a/.envdefault b/.envdefault index 80acea5..e997778 100644 --- a/.envdefault +++ b/.envdefault @@ -1,4 +1,3 @@ AM_PATH=/path/to/acoustic/models/dir LM_PATH=/path/to/language/models/dir -SWAGGER_PATH=/path/to/swagger/file -NBR_PROCESSES=1 \ No newline at end of file +SWAGGER_PATH=./document/swagger.yml diff --git a/README.md b/README.md index 45c75f7..3270b8c 100644 --- a/README.md +++ b/README.md @@ -11,6 +11,17 @@ See documentation : [doc.linto.ai](https://doc.linto.ai) With our proposed stack [linto-platform-stack](https://github.com/linto-ai/linto-platform-stack) +# Hardware requirements +In order to install and run this service, you need to have at least: + +* 5Go available on your hard drive for the installation, and + +* 500Mo/3Go/7Go of RAM memory available for models loading and decoding. The size depends mainly on the choosed decoding model (small, medium or big). + +While there is no specific minimal requirement on the CPU, speech recognition is a computationally task. + +**`—The better your hardware performance, the lower your decoding time—`** + # Develop ## Installation @@ -20,6 +31,7 @@ To start the LinSTT service on your local machine or your cloud, you need first ```bash git clone https://github.com/linto-ai/linto-platform-stt-standalone-worker +git submodule update --init cd linto-platform-stt-standalone-worker mv .envdefault .env ``` @@ -27,7 +39,7 @@ mv .envdefault .env Then, to build the docker image, execute: ```bash -docker build -t lintoai/linto-platform-stt-standalone-worker . +docker build -t lintoai/linto-platform-stt-standalone-worker:latest . ``` Or by docker-compose, by using: @@ -42,16 +54,12 @@ Or, download the pre-built image from docker-hub: docker pull lintoai/linto-platform-stt-standalone-worker:latest ``` -NB: You must install docker on your machine. +NB: You must install docker and docker-compose on your machine. ## Configuration The LinSTT service that will be set-up here require KALDI models, the acoustic model and the decoding graph. Indeed, these models are not included in the repository; you must download them in order to run LinSTT. You can use our pre-trained models from here: [Downloads](https://doc.linto.ai/#/services/linstt_download). 
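The settings that the worker actually reads from the acoustic-model package are defined in `loadConfig()` (see `tools.py` in this series): a `decode.cfg` file at the root of the AM directory, holding a single `[decoder_params]` section. The sketch below only illustrates that layout — the key names come from `loadConfig()`, but every value is a placeholder, not a recommended setting.

```python
# Illustrative parse of the decode.cfg consumed by loadConfig(); all values are placeholders.
import configparser

EXAMPLE_DECODE_CFG = """
[decoder_params]
ampath = am
beam = 10.0
lattice_beam = 4.0
acwt = 1.0
min_active = 200
max_active = 7000
frame_subsampling_factor = 3
"""

cfg = configparser.ConfigParser()
cfg.read_string(EXAMPLE_DECODE_CFG)
print(cfg.get("decoder_params", "beam"))  # -> "10.0"
```

`ampath` names the sub-directory of the AM package that holds `conf/` and `ivector_extractor/`; the remaining keys are copied into the `online.conf` that the worker generates at startup.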
-### Outside LinTO-Platform-STT-Service-Manager - -If you want to use our service alone without LinTO-Platform-STT-Service-Manager, you must `unzip` the files and put the extracted ones in the [shared storage](https://doc.linto.ai/#/infra?id=shared-storage). For example, - -1- Download the French acoustic model and the small decoding graph +1- Download the French acoustic model and the small decoding graph (linstt.v1). You can download the latest version for optimal performance and you should make sure that you have the hardware requirement in terms of RAM. ```bash wget https://dl.linto.ai/downloads/model-distribution/acoustic-models/fr-FR/linSTT_AM_fr-FR_v1.0.0.zip @@ -68,38 +76,31 @@ unzip decoding_graph_fr-FR_Small_v1.1.0.zip -d DG_fr-FR_Small 3- Move the uncompressed files into the shared storage directory ```bash -mv AM_fr-FR ~/linto_shared/data -mv DG_fr-FR_Small ~/linto_shared/data +mkdir ~/linstt_model_storage +mv AM_fr-FR ~/linstt_model_storage +mv DG_fr-FR ~/linstt_model_storage ``` 4- Configure the environment file `.env` included in this repository - AM_PATH=/full/path/to/linto_shared/data/AM_fr-FR - LM_PATH=/full/path/to/linto_shared/data/DG_fr-FR_Small - + AM_PATH=~/linstt_model_storage/AM_fr-FR + LM_PATH=~/linstt_model_storage/DG_fr-FR NB: if you want to use the visual user interface of the service, you need also to configure the swagger file `document/swagger.yml` included in this repository. Specifically, in the section `host`, specify the adress of the machine in which the service is deployed. -### Using LinTO-Platform-STT-Service-Manager -In case you want to use `LinTO-Platform-STT-Service-Manager`, you need to: - -1- Create an acoustic model and upload the approriate file - -2- Create a language model and upload the corresponding decoding graph - -3- Configure the environmenet file of this service. - -For more details, see configuration instruction in [LinTO - STT-Manager](https://doc.linto.ai/#/manager) - ## Execute -In order to run the service alone, you have only to execute: +In order to run the service, you have only to execute: ```bash cd linto-platform-stt-standalone-worker -docker-compose up +docker run -p 8888:80 -v /full/path/to/linstt_model_storage/AM_fr-FR:/opt/models/AM -v /full/path/to/linstt_model_storage/DG_fr-FR:/opt/models/LM -v /full/path/to/linto-platform-stt-standalone-worker/document/swagger.yml:/opt/swagger.yml -e SWAGGER_PATH="/opt/swagger.yml" lintoai/linto-platform-stt-standalone-worker:latest ``` -To run and manager LinSTT under `LinTO-Platform-STT-Service-Manager` service, you need to create a service first and then to start it. See [LinTO - STT-Manager](services/manager?id=execute) +or simply by executing: +```bash +cd linto-platform-stt-standalone-worker +docker-compose up +``` Our service requires an audio file in `Waveform format`. It should has the following parameters: @@ -109,8 +110,30 @@ Our service requires an audio file in `Waveform format`. It should has the follo - microphone: any type - duration: <30 minutes +### API + + +#### /transcribe + +Convert a speech to text + +### Functionality +> `post`
+> Make a POST request +>> Arguments : +>> - **{File} file** Audio File - Waveform Audio File Format is required + +> +>> Header : +>> - **{String} Accept**: response content type (text/plain|application/json) +> +> **{text|Json}** : Return the full transcription or a json object with metadata + + + + ### Run Example Applications -To run an automated test go to the test folder +To run an automated test, go to the test folder: ```bash cd linto-platform-stt-standalone-worker/test @@ -122,5 +145,4 @@ And run the test script: ./test_deployment.sh ``` -Or use swagger interface to perform your personal test - +To run personal test, you can use swagger interface: `localhost:8888/api-doc/` \ No newline at end of file From 31db0d0bdcd2336da05bf334f0becd12a3543993 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Mon, 22 Feb 2021 14:54:50 +0100 Subject: [PATCH 047/172] update README --- README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index a966856..3270b8c 100644 --- a/README.md +++ b/README.md @@ -13,8 +13,10 @@ With our proposed stack [linto-platform-stack](https://github.com/linto-ai/linto # Hardware requirements In order to install and run this service, you need to have at least: - * 5Go available on your hard drive for the installation, and - * 500Mo/3Go/7Go of RAM memory available for models loading and decoding. The size depends mainly on the choosed decoding model (small, medium or big). + +* 5Go available on your hard drive for the installation, and + +* 500Mo/3Go/7Go of RAM memory available for models loading and decoding. The size depends mainly on the choosed decoding model (small, medium or big). While there is no specific minimal requirement on the CPU, speech recognition is a computationally task. From a7c5cd5bd28519b4e38b32684a891e9f3e04e9a3 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 4 Mar 2021 11:36:21 +0100 Subject: [PATCH 048/172] update README --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3270b8c..419dcd6 100644 --- a/README.md +++ b/README.md @@ -31,8 +31,8 @@ To start the LinSTT service on your local machine or your cloud, you need first ```bash git clone https://github.com/linto-ai/linto-platform-stt-standalone-worker -git submodule update --init cd linto-platform-stt-standalone-worker +git submodule update --init mv .envdefault .env ``` From a86fb9e23707db29fbe33f1e55867b2188247e4a Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 26 Mar 2021 13:57:02 +0100 Subject: [PATCH 049/172] update docker compose config --- docker-compose.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docker-compose.yml b/docker-compose.yml index f7da7db..08c14d0 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -5,7 +5,7 @@ services: stt-worker: container_name: stt-standalone-worker build: . 
- image: lintoai/linto-platform-stt-standalone-worker:latest-unstable + image: lintoai/linto-platform-stt-standalone-worker:latest volumes: - ${AM_PATH}:/opt/models/AM - ${LM_PATH}:/opt/models/LM From 57e110ac1dc3b222d23e8e772071eeeab8120950 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 26 Mar 2021 13:59:29 +0100 Subject: [PATCH 050/172] update env file --- .envdefault | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.envdefault b/.envdefault index e997778..130f6ef 100644 --- a/.envdefault +++ b/.envdefault @@ -1,3 +1,3 @@ AM_PATH=/path/to/acoustic/models/dir LM_PATH=/path/to/language/models/dir -SWAGGER_PATH=./document/swagger.yml +SWAGGER_PATH=./document/swagger.yml \ No newline at end of file From 000758b8807e2d23b7fff62be21b6c8fe5a5aedb Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 26 Mar 2021 14:00:58 +0100 Subject: [PATCH 051/172] update readme --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3270b8c..419dcd6 100644 --- a/README.md +++ b/README.md @@ -31,8 +31,8 @@ To start the LinSTT service on your local machine or your cloud, you need first ```bash git clone https://github.com/linto-ai/linto-platform-stt-standalone-worker -git submodule update --init cd linto-platform-stt-standalone-worker +git submodule update --init mv .envdefault .env ``` From ff279c53b3cd011cd7537f6d286e2412319a860b Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 31 Mar 2021 11:14:26 +0200 Subject: [PATCH 052/172] remove speaker diarization from the linstt functions. add speaker diarization punctuation services dependencies for stt service --- .envdefault | 9 +- Dockerfile | 11 +- docker-compose.yml | 7 + docker-entrypoint.sh | 33 ++++ run.py | 51 ++++-- tools.py | 396 ++++++++++++++++--------------------------- wait-for-it.sh | 184 ++++++++++++++++++++ 7 files changed, 427 insertions(+), 264 deletions(-) create mode 100755 docker-entrypoint.sh create mode 100755 wait-for-it.sh diff --git a/.envdefault b/.envdefault index 130f6ef..8cc601e 100644 --- a/.envdefault +++ b/.envdefault @@ -1,3 +1,10 @@ AM_PATH=/path/to/acoustic/models/dir LM_PATH=/path/to/language/models/dir -SWAGGER_PATH=./document/swagger.yml \ No newline at end of file +SWAGGER_PATH=./document/swagger.yml + +# dependent services config +PUCTUATION_HOST=text-punctuation-worker-host-name +PUCTUATION_PORT=8080 +PUCTUATION_ROUTE="/api/route/path/" +SPEAKER_DIARIZATION_HOST=speaker-diarization-worker-host-name +SPEAKER_DIARIZATION_PORT=80 \ No newline at end of file diff --git a/Dockerfile b/Dockerfile index c8e95cd..ee79f37 100644 --- a/Dockerfile +++ b/Dockerfile @@ -70,13 +70,18 @@ RUN cd /opt/vosk-api/python && \ export KALDI_MKL=1 && \ python3 setup.py install --user --single-version-externally-managed --root=/ +# Install curl for healthcheck +RUN apt-get install -y curl + # Define the main folder WORKDIR /usr/src/speech-to-text COPY pyBK/diarizationFunctions.py pyBK/diarizationFunctions.py -COPY tools.py . -COPY run.py . 
+COPY tools.py run.py docker-entrypoint.sh wait-for-it.sh ./ EXPOSE 80 -CMD python3 ./run.py \ No newline at end of file +HEALTHCHECK CMD curl http://localhost/healthcheck || exit 1 + +# Entrypoint handles the passed arguments +ENTRYPOINT ["./docker-entrypoint.sh"] \ No newline at end of file diff --git a/docker-compose.yml b/docker-compose.yml index 08c14d0..d4baa12 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -16,3 +16,10 @@ services: env_file: .env environment: SWAGGER_PATH: /opt/swagger.yml + networks: + - linstt-net + +networks: + internal: + linstt-net: + external: true \ No newline at end of file diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh new file mode 100755 index 0000000..3555b4b --- /dev/null +++ b/docker-entrypoint.sh @@ -0,0 +1,33 @@ +#!/bin/bash +set -e + +max_attempts=3 +delay=5 + +for retry in $(seq 1 $max_attempts); do + echo "Waiting punctuation service... [attempt=$retry]" + punctuation_state=1 + ./wait-for-it.sh $PUCTUATION_HOST:$PUCTUATION_PORT --timeout=$delay || punctuation_state=0 +done + +if [ $punctuation_state == 1 ]; then + echo "$PUCTUATION_HOST:$PUCTUATION_PORT is up" +else + echo "punctuation service is not runninig" +fi + +for retry in $(seq 1 $max_attempts); do + echo "Waiting speaker diarization service... [attempt=$retry]" + spkdiarization_state=1 + ./wait-for-it.sh $SPEAKER_DIARIZATION_HOST:$SPEAKER_DIARIZATION_PORT --timeout=$delay || spkdiarization_state=0 +done + +if [ $spkdiarization_state == 1 ]; then + echo "$SPEAKER_DIARIZATION_HOST:$SPEAKER_DIARIZATION_PORT is up" +else + echo "speaker diarization service is not runninig" +fi + +echo "RUNNING service" + +python3 ./run.py --puctuation $punctuation_state --speaker_diarization $spkdiarization_state \ No newline at end of file diff --git a/run.py b/run.py index 8f594d3..7cba003 100644 --- a/run.py +++ b/run.py @@ -3,15 +3,18 @@ from flask import Flask, request, abort, Response, json from vosk import Model, KaldiRecognizer -from tools import Worker +from tools import Worker, SpeakerDiarization, Punctuation from time import gmtime, strftime from gevent.pywsgi import WSGIServer - +import argparse +import os app = Flask("__stt-standelone-worker__") -# create WorkerStreaming object +# instantiate services worker = Worker() +punctuation = Punctuation() +speakerdiarization = SpeakerDiarization() # Load ASR models (acoustic model and decoding graph) worker.log.info('Load acoustic model and decoding graph') @@ -19,7 +22,20 @@ worker.CONFIG_FILES_PATH+"/online.conf") spkModel = None +def decode(is_metadata): + rec = KaldiRecognizer(model, spkModel, worker.rate, worker.ONLINE) + rec.AcceptWaveform(worker.data) + data = rec.FinalResult() + confidence = rec.uttConfidence() + if is_metadata: + data = rec.GetMetadata() + return data, confidence + # API +@app.route('/healthcheck', methods=['GET']) +def healthcheck(): + return "1", 200 + @app.route('/transcribe', methods=['POST']) def transcribe(): try: @@ -40,18 +56,15 @@ def transcribe(): if 'file' in request.files.keys(): file = request.files['file'] worker.getAudio(file) - rec = KaldiRecognizer(model, spkModel, worker.rate, worker.ONLINE) - rec.AcceptWaveform(worker.data) - data_ = rec.FinalResult() - confidence = rec.uttConfidence() - if is_metadata: - data_ = rec.GetMetadata() - data = worker.get_response(data_, confidence, is_metadata) + data, confidence = decode(is_metadata) + spk = speakerdiarization.get(worker.file_path) + trans = worker.get_response(data, spk, confidence, is_metadata) + response = punctuation.get(trans) 
worker.clean() else: raise ValueError('No audio file was uploaded') - return data, 200 + return response, 200 except ValueError as error: return str(error), 400 except Exception as e: @@ -79,6 +92,22 @@ def server_error(error): if __name__ == '__main__': try: + parser = argparse.ArgumentParser() + parser.add_argument( + '--puctuation', + type=int, + help='punctuation service status', + default=0) + parser.add_argument( + '--speaker_diarization', + type=int, + help='speaker diarization service status', + default=0) + args = parser.parse_args() + + punctuation.setParam(True if args.puctuation else False) + speakerdiarization.setParam(True if args.speaker_diarization else False) + # start SwaggerUI if worker.SWAGGER_PATH != '': worker.swaggerUI(app) diff --git a/tools.py b/tools.py index 8844e48..286490b 100644 --- a/tools.py +++ b/tools.py @@ -24,13 +24,14 @@ import numpy as np from scipy.io import wavfile from flask_swagger_ui import get_swaggerui_blueprint +import requests ############## class Worker: def __init__(self): # Set logger config - self.log = logging.getLogger("__stt-standelone-worker__") + self.log = logging.getLogger("__stt-standelone-worker__.Worker") logging.basicConfig(level=logging.INFO) # Main parameters @@ -40,7 +41,6 @@ def __init__(self): self.CONFIG_FILES_PATH = '/opt/config' self.SAVE_AUDIO = False self.SERVICE_PORT = 80 - self.NBR_THREADS = 100 self.SWAGGER_URL = '/api-doc' self.SWAGGER_PATH = '' self.ONLINE = False @@ -52,12 +52,9 @@ def __init__(self): os.mkdir(self.TEMP_FILE_PATH) # Environment parameters - if 'NBR_THREADS' in os.environ: - if int(os.environ['NBR_THREADS']) > 0: - self.NBR_THREADS = int(os.environ['NBR_THREADS']) - else: - self.log.warning( - "You must to provide a positif number of threads 'NBR_THREADS'") + if 'SAVE_AUDIO' in os.environ: + self.SAVE_AUDIO = True if os.environ['SAVE_AUDIO'].lower( + ) == "true" else False if 'SWAGGER_PATH' in os.environ: self.SWAGGER_PATH = os.environ['SWAGGER_PATH'] @@ -182,7 +179,7 @@ def parse_text(self, text): return text # Postprocess response - def get_response(self, dataJson, confidence, is_metadata): + def get_response(self, dataJson, speakers, confidence, is_metadata): if dataJson is not None: data = json.loads(dataJson) data['conf'] = confidence @@ -191,12 +188,12 @@ def get_response(self, dataJson, confidence, is_metadata): return self.parse_text(text) elif 'words' in data: - # Do speaker diarization and get speaker segments - spk = SpeakerDiarization() - spkrs = spk.run(self.file_path) + if speakers is not None: + # Generate final output data + return self.process_output_v2(data, speakers) + else: + return {'speakers': [], 'text': data['text'], 'confidence-score': data['conf'], 'words': data['words']} - # Generate final output data - return self.process_output(data, spkrs) elif 'text' in data: return {'speakers': [], 'text': data['text'], 'confidence-score': data['conf'], 'words': []} else: @@ -205,7 +202,6 @@ def get_response(self, dataJson, confidence, is_metadata): return {'speakers': [], 'text': '', 'confidence-score': 0, 'words': []} # return a json object including word-data, speaker-data - def process_output(self, data, spkrs): try: speakers = [] @@ -252,252 +248,154 @@ def process_output(self, data, spkrs): except: return {'text': data['text'], 'words': data['words'], 'confidence-score': data['conf'], 'spks': []} - -class SpeakerDiarization: - def __init__(self): - self.log = logging.getLogger( - '__stt-standelone-worker__.SPKDiarization') - - # MFCC FEATURES PARAMETERS - self.frame_length_s = 
0.025 - self.frame_shift_s = 0.01 - self.num_bins = 30 - self.num_ceps = 30 - ##### - - # Segment - self.seg_length = 100 # Window size in frames - self.seg_increment = 100 # Window increment after and before window in frames - self.seg_rate = 100 # Window shifting in frames - ##### - - # KBM - # Minimum number of Gaussians in the initial pool - self.minimumNumberOfInitialGaussians = 1024 - self.maximumKBMWindowRate = 50 # Maximum window rate for Gaussian computation - self.windowLength = 200 # Window length for computing Gaussians - self.kbmSize = 320 # Number of final Gaussian components in the KBM - # If set to 1, the KBM size is set as a proportion, given by "relKBMsize", of the pool size - self.useRelativeKBMsize = 1 - # Relative KBM size if "useRelativeKBMsize = 1" (value between 0 and 1). - self.relKBMsize = 0.3 - ###### - - # BINARY_KEY - self.topGaussiansPerFrame = 5 # Number of top selected components per frame - self.bitsPerSegmentFactor = 0.2 # Percentage of bits set to 1 in the binary keys - ###### - - # CLUSTERING - self.N_init = 16 # Number of initial clusters - # Set to one to perform linkage clustering instead of clustering/reassignment - self.linkage = 0 - # Linkage criterion used if linkage==1 ('average', 'single', 'complete') - self.linkageCriterion = 'average' - # Similarity metric: 'cosine' for cumulative vectors, and 'jaccard' for binary keys - self.metric = 'cosine' - ###### - - # CLUSTERING_SELECTION - # Distance metric used in the selection of the output clustering solution ('jaccard','cosine') - self.metric_clusteringSelection = 'cosine' - # Method employed for number of clusters selection. Can be either 'elbow' for an elbow criterion based on within-class sum of squares (WCSS) or 'spectral' for spectral clustering - self.bestClusteringCriterion = 'elbow' - self.sigma = 1 # Spectral clustering parameters, employed if bestClusteringCriterion == spectral - self.percentile = 40 - self.maxNrSpeakers = 10 # If known, max nr of speakers in a sesssion in the database. 
This is to limit the effect of changes in very small meaningless eigenvalues values generating huge eigengaps - ###### - - # RESEGMENTATION - self.resegmentation = 1 # Set to 1 to perform re-segmentation - self.modelSize = 6 # Number of GMM components - self.nbIter = 10 # Number of expectation-maximization (EM) iterations - self.smoothWin = 100 # Size of the likelihood smoothing window in nb of frames - ###### - - def compute_feat_Librosa(self, audioFile): + # return a json object including word-data, speaker-data + def process_output_v2(self, data, spkrs): try: - self.data, self.sr = librosa.load(audioFile, sr=None) - frame_length_inSample = self.frame_length_s * self.sr - hop = int(self.frame_shift_s * self.sr) - NFFT = int(2**np.ceil(np.log2(frame_length_inSample))) - if self.sr >= 16000: - mfccNumpy = librosa.feature.mfcc(y=self.data, - sr=self.sr, - dct_type=2, - n_mfcc=self.num_ceps, - n_mels=self.num_bins, - n_fft=NFFT, - hop_length=hop, - fmin=20, - fmax=7600).T - else: - mfccNumpy = librosa.feature.mfcc(y=self.data, - sr=self.sr, - dct_type=2, - n_mfcc=self.num_ceps, - n_mels=self.num_bins, - n_fft=NFFT, - hop_length=hop).T + speakers = [] + text = [] + i = 0 + text_ = "" + words = [] - except Exception as e: - self.log.error(e) - raise ValueError( - "Speaker diarization failed when extracting features!!!") - else: - return mfccNumpy + for word in data['words']: + if i+1 == len(spkrs): + continue + if i+1 < len(spkrs) and word["end"] < spkrs[i+1]["seg_begin"]: + text_ += word["word"] + " " + words.append(word) + elif len(words) != 0: + speaker = {} + speaker["start"] = words[0]["start"] + speaker["end"] = words[len(words)-1]["end"] + speaker["speaker_id"] = str(spkrs[i]["spk_id"]) + speaker["words"] = words - def computeVAD_WEBRTC(self, data, sr, nFeatures): - try: - if sr not in [8000, 16000, 32000, 48000]: - data = librosa.resample(data, sr, 16000) - sr = 16000 - - va_framed = py_webrtcvad( - data, fs=sr, fs_vad=sr, hoplength=30, vad_mode=0) - segments = get_py_webrtcvad_segments(va_framed, sr) - maskSAD = np.zeros([1, nFeatures]) - for seg in segments: - start = int(np.round(seg[0]/self.frame_shift_s)) - end = int(np.round(seg[1]/self.frame_shift_s)) - maskSAD[0][start:end] = 1 + text.append( + str(spkrs[i]["spk_id"])+' : ' + self.parse_text(text_)) + speakers.append(speaker) + + words = [word] + text_ = word["word"] + " " + i += 1 + else: + words = [word] + text_ = word["word"] + " " + i += 1 + + speaker = {} + speaker["start"] = words[0]["start"] + speaker["end"] = words[len(words)-1]["end"] + speaker["speaker_id"] = str(spkrs[i]["spk_id"]) + speaker["words"] = words + + text.append(str(spkrs[i]["spk_id"]) + + ' : ' + self.parse_text(text_)) + speakers.append(speaker) + + return {'speakers': speakers, 'text': text, 'confidence-score': data['conf']} except Exception as e: self.log.error(e) - raise ValueError( - "Speaker diarization failed while voice activity detection!!!") - else: - return maskSAD + return {'text': data['text'], 'words': data['words'], 'confidence-score': data['conf'], 'spks': []} - def run(self, audioFile): - try: - def getSegments(frameshift, finalSegmentTable, finalClusteringTable, dur): - numberOfSpeechFeatures = finalSegmentTable[-1, 2].astype(int)+1 - solutionVector = np.zeros([1, numberOfSpeechFeatures]) - for i in np.arange(np.size(finalSegmentTable, 0)): - solutionVector[0, np.arange( - finalSegmentTable[i, 1], finalSegmentTable[i, 2]+1).astype(int)] = finalClusteringTable[i] - seg = np.empty([0, 3]) - solutionDiff = np.diff(solutionVector)[0] - 
first = 0 - for i in np.arange(0, np.size(solutionDiff, 0)): - if solutionDiff[i]: - last = i+1 - seg1 = (first)*frameshift - seg2 = (last-first)*frameshift - seg3 = solutionVector[0, last-1] - if seg.shape[0] != 0 and seg3 == seg[-1][2]: - seg[-1][1] += seg2 - elif seg3 and seg2 > 0.3: # and seg2 > 0.1 - seg = np.vstack((seg, [seg1, seg2, seg3])) - first = i+1 - last = np.size(solutionVector, 1) - seg1 = (first-1)*frameshift - seg2 = (last-first+1)*frameshift - seg3 = solutionVector[0, last-1] - if seg3 == seg[-1][2]: - seg[-1][1] += seg2 - elif seg3 and seg2 > 0.3: # and seg2 > 0.1 - seg = np.vstack((seg, [seg1, seg2, seg3])) - seg = np.vstack((seg, [dur, -1, -1])) - seg[0][0] = 0.0 - return seg - - start_time = time.time() - - self.log.info('Start Speaker diarization') - - feats = self.compute_feat_Librosa(audioFile) - nFeatures = feats.shape[0] - duration = nFeatures * self.frame_shift_s - - if duration < 5: - return [[0, duration, 1], - [duration, -1, -1]] - - maskSAD = self.computeVAD_WEBRTC(self.data, self.sr, nFeatures) - maskUEM = np.ones([1, nFeatures]) - - mask = np.logical_and(maskUEM, maskSAD) - mask = mask[0][0:nFeatures] - nSpeechFeatures = np.sum(mask) - speechMapping = np.zeros(nFeatures) - # you need to start the mapping from 1 and end it in the actual number of features independently of the indexing style - # so that we don't lose features on the way - speechMapping[np.nonzero(mask)] = np.arange(1, nSpeechFeatures+1) - data = feats[np.where(mask == 1)] - del feats - - segmentTable = getSegmentTable( - mask, speechMapping, self.seg_length, self.seg_increment, self.seg_rate) - numberOfSegments = np.size(segmentTable, 0) - # create the KBM - # set the window rate in order to obtain "minimumNumberOfInitialGaussians" gaussians - if np.floor((nSpeechFeatures-self.windowLength)/self.minimumNumberOfInitialGaussians) < self.maximumKBMWindowRate: - windowRate = int(np.floor( - (np.size(data, 0)-self.windowLength)/self.minimumNumberOfInitialGaussians)) - else: - windowRate = int(self.maximumKBMWindowRate) - if windowRate == 0: - #self.log.info('The audio is to short in order to perform the speaker diarization!!!') - return [[0, duration, 1], - [duration, -1, -1]] +class SpeakerDiarization: + def __init__(self): + self.SPEAKER_DIARIZATION_ISON = False + self.SPEAKER_DIARIZATION_HOST = None + self.SPEAKER_DIARIZATION_PORT = None + self.url = None + self.log = logging.getLogger( + "__stt-standelone-worker__.SpeakerDiarization") + logging.basicConfig(level=logging.INFO) - poolSize = np.floor((nSpeechFeatures-self.windowLength)/windowRate) - if self.useRelativeKBMsize: - kbmSize = int(np.floor(poolSize*self.relKBMsize)) + def setParam(self, SPEAKER_DIARIZATION_ISON): + self.SPEAKER_DIARIZATION_ISON = SPEAKER_DIARIZATION_ISON + if self.SPEAKER_DIARIZATION_ISON: + self.SPEAKER_DIARIZATION_HOST = os.environ['SPEAKER_DIARIZATION_HOST'] + self.SPEAKER_DIARIZATION_PORT = os.environ['SPEAKER_DIARIZATION_PORT'] + self.url = "http://"+self.SPEAKER_DIARIZATION_HOST + \ + ":"+self.SPEAKER_DIARIZATION_PORT+"/" + self.log.info(self.url) if self.url is not None else self.log.warn( + "The Speaker Diarization service is not running!") + + def get(self, audio_path): + try: + if self.SPEAKER_DIARIZATION_ISON: + file = open(audio_path, 'rb') + result = requests.post(self.url, files={'file': file}) + if result.status_code != 200: + raise ValueError(result.text) + + speakers = json.loads(result.text) + speakers = speakers["segments"] + + last_spk = { + 'seg_begin': speakers[len(speakers) - 1]["seg_end"] + 
10, + 'seg_end': -1, + 'spk_id': -1, + 'seg_id': -1, + } + speakers.append(last_spk) + + return speakers else: - kbmSize = int(self.kbmSize) - - # Training pool of',int(poolSize),'gaussians with a rate of',int(windowRate),'frames' - kbm, gmPool = trainKBM( - data, self.windowLength, windowRate, kbmSize) + raise ValueError('Service is OFF') + except Exception as e: + self.log.error(str(e)) + return None + except ValueError as error: + self.log.error(str(error)) + return None - #'Selected',kbmSize,'gaussians from the pool' - Vg = getVgMatrix(data, gmPool, kbm, self.topGaussiansPerFrame) - #'Computing binary keys for all segments... ' - segmentBKTable, segmentCVTable = getSegmentBKs( - segmentTable, kbmSize, Vg, self.bitsPerSegmentFactor, speechMapping) +class Punctuation: + def __init__(self): + self.PUCTUATION_ISON = False + self.PUCTUATION_HOST = None + self.PUCTUATION_PORT = None + self.PUCTUATION_ROUTE = None + self.url = None + self.log = logging.getLogger("__stt-standelone-worker__.Punctuation") + logging.basicConfig(level=logging.INFO) - #'Performing initial clustering... ' - initialClustering = np.digitize(np.arange(numberOfSegments), np.arange( - 0, numberOfSegments, numberOfSegments/self.N_init)) + def setParam(self, PUCTUATION_ISON): + self.PUCTUATION_ISON = PUCTUATION_ISON + if self.PUCTUATION_ISON: + self.PUCTUATION_HOST = os.environ['PUCTUATION_HOST'] + self.PUCTUATION_PORT = os.environ['PUCTUATION_PORT'] + self.PUCTUATION_ROUTE = os.environ['PUCTUATION_ROUTE'] + self.PUCTUATION_ROUTE = re.sub('^/','',self.PUCTUATION_ROUTE) + self.PUCTUATION_ROUTE = re.sub('"|\'','',self.PUCTUATION_ROUTE) + self.url = "http://"+self.PUCTUATION_HOST+":"+self.PUCTUATION_PORT+"/"+self.PUCTUATION_ROUTE + self.log.info(self.url) if self.url is not None else self.log.warn( + "The Punctuation service is not running!") + + def get(self, text): + try: + if self.PUCTUATION_ISON: + if isinstance(text, dict): + text_punc = [] + for utterance in text['text']: + data = utterance.split(':') + result = requests.post(self.url, data=data[1].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) + if result.status_code != 200: + raise ValueError(result.text) + + text_punc.append(data[0]+": "+result.text.encode('latin-1').decode('utf-8')) + text['text'] = text_punc + return text + else: + result = requests.post(self.url, data=text.encode('utf-8'), headers={'content-type': 'application/octet-stream'}) + if result.status_code != 200: + raise ValueError(result.text.encode('latin-1').decode('utf-8')) - #'Performing agglomerative clustering... ' - if self.linkage: - finalClusteringTable, k = performClusteringLinkage( - segmentBKTable, segmentCVTable, self.N_init, self.linkageCriterion, self.metric) + return result.text else: - finalClusteringTable, k = performClustering( - speechMapping, segmentTable, segmentBKTable, segmentCVTable, Vg, self.bitsPerSegmentFactor, kbmSize, self.N_init, initialClustering, self.metric) - - #'Selecting best clustering...' 
- if self.bestClusteringCriterion == 'elbow': - bestClusteringID = getBestClustering( - self.metric_clusteringSelection, segmentBKTable, segmentCVTable, finalClusteringTable, k, self.maxNrSpeakers) - elif self.bestClusteringCriterion == 'spectral': - bestClusteringID = getSpectralClustering(self.metric_clusteringSelection, finalClusteringTable, - self.N_init, segmentBKTable, segmentCVTable, k, self.sigma, self.percentile, self.maxNrSpeakers)+1 - - if self.resegmentation and np.size(np.unique(finalClusteringTable[:, bestClusteringID.astype(int)-1]), 0) > 1: - finalClusteringTableResegmentation, finalSegmentTable = performResegmentation(data, speechMapping, mask, finalClusteringTable[:, bestClusteringID.astype( - int)-1], segmentTable, self.modelSize, self.nbIter, self.smoothWin, nSpeechFeatures) - seg = getSegments(self.frame_shift_s, finalSegmentTable, np.squeeze( - finalClusteringTableResegmentation), duration) - else: - return [[0, duration, 1], - [duration, -1, -1]] - - self.log.info("Speaker Diarization time in seconds: %d" % - int(time.time() - start_time)) - except ValueError as v: - self.log.error(v) - return [[0, duration, 1], - [duration, -1, -1]] + raise ValueError('Service is OFF') except Exception as e: - self.log.error(e) - return [[0, duration, 1], - [duration, -1, -1]] - else: - return seg + self.log.error(str(e)) + return text + except ValueError as error: + self.log.error(str(error)) + return text + diff --git a/wait-for-it.sh b/wait-for-it.sh new file mode 100755 index 0000000..ea66f79 --- /dev/null +++ b/wait-for-it.sh @@ -0,0 +1,184 @@ +#!/usr/bin/env bash +# Use this script to test if a given TCP host/port are available + +WAITFORIT_cmdname=${0##*/} + +echoerr() { if [[ $WAITFORIT_QUIET -ne 1 ]]; then echo "$@" 1>&2; fi } + +usage() +{ + cat << USAGE >&2 +Usage: + $WAITFORIT_cmdname host:port [-s] [-t timeout] [-- command args] + -h HOST | --host=HOST Host or IP under test + -p PORT | --port=PORT TCP port under test + Alternatively, you specify the host and port as host:port + -s | --strict Only execute subcommand if the test succeeds + -q | --quiet Don't output any status messages + -t TIMEOUT | --timeout=TIMEOUT + Timeout in seconds, zero for no timeout + -- COMMAND ARGS Execute command with args after the test finishes +USAGE + exit 1 +} + +wait_for() +{ + if [[ $WAITFORIT_TIMEOUT -gt 0 ]]; then + echoerr "$WAITFORIT_cmdname: waiting $WAITFORIT_TIMEOUT seconds for $WAITFORIT_HOST:$WAITFORIT_PORT" + else + echoerr "$WAITFORIT_cmdname: waiting for $WAITFORIT_HOST:$WAITFORIT_PORT without a timeout" + fi + WAITFORIT_start_ts=$(date +%s) + while : + do + if [[ $WAITFORIT_ISBUSY -eq 1 ]]; then + nc -z $WAITFORIT_HOST $WAITFORIT_PORT + WAITFORIT_result=$? + else + (echo > /dev/tcp/$WAITFORIT_HOST/$WAITFORIT_PORT) >/dev/null 2>&1 + WAITFORIT_result=$? 
+ fi + if [[ $WAITFORIT_result -eq 0 ]]; then + WAITFORIT_end_ts=$(date +%s) + echoerr "$WAITFORIT_cmdname: $WAITFORIT_HOST:$WAITFORIT_PORT is available after $((WAITFORIT_end_ts - WAITFORIT_start_ts)) seconds" + break + fi + sleep 1 + done + return $WAITFORIT_result +} + +wait_for_wrapper() +{ + # In order to support SIGINT during timeout: http://unix.stackexchange.com/a/57692 + if [[ $WAITFORIT_QUIET -eq 1 ]]; then + timeout $WAITFORIT_BUSYTIMEFLAG $WAITFORIT_TIMEOUT $0 --quiet --child --host=$WAITFORIT_HOST --port=$WAITFORIT_PORT --timeout=$WAITFORIT_TIMEOUT & + else + timeout $WAITFORIT_BUSYTIMEFLAG $WAITFORIT_TIMEOUT $0 --child --host=$WAITFORIT_HOST --port=$WAITFORIT_PORT --timeout=$WAITFORIT_TIMEOUT & + fi + WAITFORIT_PID=$! + trap "kill -INT -$WAITFORIT_PID" INT + wait $WAITFORIT_PID + WAITFORIT_RESULT=$? + if [[ $WAITFORIT_RESULT -ne 0 ]]; then + echoerr "$WAITFORIT_cmdname: timeout occurred after waiting $WAITFORIT_TIMEOUT seconds for $WAITFORIT_HOST:$WAITFORIT_PORT" + fi + return $WAITFORIT_RESULT +} + +# process arguments +while [[ $# -gt 0 ]] +do + case "$1" in + *:* ) + WAITFORIT_hostport=(${1//:/ }) + WAITFORIT_HOST=${WAITFORIT_hostport[0]} + WAITFORIT_PORT=${WAITFORIT_hostport[1]} + shift 1 + ;; + --child) + WAITFORIT_CHILD=1 + shift 1 + ;; + -q | --quiet) + WAITFORIT_QUIET=1 + shift 1 + ;; + -s | --strict) + WAITFORIT_STRICT=1 + shift 1 + ;; + -h) + WAITFORIT_HOST="$2" + if [[ $WAITFORIT_HOST == "" ]]; then break; fi + shift 2 + ;; + --host=*) + WAITFORIT_HOST="${1#*=}" + shift 1 + ;; + -p) + WAITFORIT_PORT="$2" + if [[ $WAITFORIT_PORT == "" ]]; then break; fi + shift 2 + ;; + --port=*) + WAITFORIT_PORT="${1#*=}" + shift 1 + ;; + -t) + WAITFORIT_TIMEOUT="$2" + if [[ $WAITFORIT_TIMEOUT == "" ]]; then break; fi + shift 2 + ;; + --timeout=*) + WAITFORIT_TIMEOUT="${1#*=}" + shift 1 + ;; + --) + shift + WAITFORIT_CLI=("$@") + break + ;; + --help) + usage + ;; + *) + echoerr "Unknown argument: $1" + usage + ;; + esac +done + +if [[ "$WAITFORIT_HOST" == "" || "$WAITFORIT_PORT" == "" ]]; then + echoerr "Error: you need to provide a host and port to test." + usage +fi + +WAITFORIT_TIMEOUT=${WAITFORIT_TIMEOUT:-15} +WAITFORIT_STRICT=${WAITFORIT_STRICT:-0} +WAITFORIT_CHILD=${WAITFORIT_CHILD:-0} +WAITFORIT_QUIET=${WAITFORIT_QUIET:-0} + +# Check to see if timeout is from busybox? +WAITFORIT_TIMEOUT_PATH=$(type -p timeout) +WAITFORIT_TIMEOUT_PATH=$(realpath $WAITFORIT_TIMEOUT_PATH 2>/dev/null || readlink -f $WAITFORIT_TIMEOUT_PATH) + +WAITFORIT_BUSYTIMEFLAG="" +if [[ $WAITFORIT_TIMEOUT_PATH =~ "busybox" ]]; then + WAITFORIT_ISBUSY=1 + # Check if busybox timeout uses -t flag + # (recent Alpine versions don't support -t anymore) + if timeout &>/dev/stdout | grep -q -e '-t '; then + WAITFORIT_BUSYTIMEFLAG="-t" + fi +else + WAITFORIT_ISBUSY=0 +fi + +if [[ $WAITFORIT_CHILD -gt 0 ]]; then + wait_for + WAITFORIT_RESULT=$? + exit $WAITFORIT_RESULT +else + if [[ $WAITFORIT_TIMEOUT -gt 0 ]]; then + wait_for_wrapper + WAITFORIT_RESULT=$? + else + wait_for + WAITFORIT_RESULT=$? 
+ fi +fi + +if [[ $WAITFORIT_CLI != "" ]]; then + echo $WAITFORIT_RESULT + echo $WAITFORIT_STRICT + if [[ $WAITFORIT_RESULT -ne 0 && $WAITFORIT_STRICT -eq 1 ]]; then + echoerr "$WAITFORIT_cmdname: strict mode, refusing to execute subprocess" + exit $WAITFORIT_RESULT + fi + exec "${WAITFORIT_CLI[@]}" +else + exit $WAITFORIT_RESULT +fi \ No newline at end of file From d20f01ac01dd1ec820e679eaecede2ca4972824f Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 31 Mar 2021 12:35:50 +0200 Subject: [PATCH 053/172] update Dockerfile --- Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Dockerfile b/Dockerfile index ee79f37..b0f03f5 100644 --- a/Dockerfile +++ b/Dockerfile @@ -71,7 +71,7 @@ RUN cd /opt/vosk-api/python && \ python3 setup.py install --user --single-version-externally-managed --root=/ # Install curl for healthcheck -RUN apt-get install -y curl +RUN apt-get update && apt-get install -y curl # Define the main folder WORKDIR /usr/src/speech-to-text From 53116ee7f7428e41a6736c245028a0cd2c8a43a1 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 31 Mar 2021 16:12:27 +0200 Subject: [PATCH 054/172] fix text punctuation error and clean entrypoint code --- docker-entrypoint.sh | 14 ++------------ tools.py | 22 +++++++++++++--------- 2 files changed, 15 insertions(+), 21 deletions(-) diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh index 3555b4b..6a826a7 100755 --- a/docker-entrypoint.sh +++ b/docker-entrypoint.sh @@ -8,26 +8,16 @@ for retry in $(seq 1 $max_attempts); do echo "Waiting punctuation service... [attempt=$retry]" punctuation_state=1 ./wait-for-it.sh $PUCTUATION_HOST:$PUCTUATION_PORT --timeout=$delay || punctuation_state=0 + if [ $punctuation_state == 1 ]; then break; fi done -if [ $punctuation_state == 1 ]; then - echo "$PUCTUATION_HOST:$PUCTUATION_PORT is up" -else - echo "punctuation service is not runninig" -fi - for retry in $(seq 1 $max_attempts); do echo "Waiting speaker diarization service... 
[attempt=$retry]" spkdiarization_state=1 ./wait-for-it.sh $SPEAKER_DIARIZATION_HOST:$SPEAKER_DIARIZATION_PORT --timeout=$delay || spkdiarization_state=0 + if [ $spkdiarization_state == 1 ]; then break; fi done -if [ $spkdiarization_state == 1 ]; then - echo "$SPEAKER_DIARIZATION_HOST:$SPEAKER_DIARIZATION_PORT is up" -else - echo "speaker diarization service is not runninig" -fi - echo "RUNNING service" python3 ./run.py --puctuation $punctuation_state --speaker_diarization $spkdiarization_state \ No newline at end of file diff --git a/tools.py b/tools.py index 286490b..02af229 100644 --- a/tools.py +++ b/tools.py @@ -374,15 +374,19 @@ def get(self, text): try: if self.PUCTUATION_ISON: if isinstance(text, dict): - text_punc = [] - for utterance in text['text']: - data = utterance.split(':') - result = requests.post(self.url, data=data[1].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) - if result.status_code != 200: - raise ValueError(result.text) - - text_punc.append(data[0]+": "+result.text.encode('latin-1').decode('utf-8')) - text['text'] = text_punc + if isinstance(text['text'], list): + text_punc = [] + for utterance in text['text']: + data = utterance.split(':') + result = requests.post(self.url, data=data[1].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) + if result.status_code != 200: + raise ValueError(result.text) + + text_punc.append(data[0]+": "+result.text.encode('latin-1').decode('utf-8')) + text['text'] = text_punc + else: + result = requests.post(self.url, data=text['text'].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) + text['text'] = result.text.encode('latin-1').decode('utf-8') return text else: result = requests.post(self.url, data=text.encode('utf-8'), headers={'content-type': 'application/octet-stream'}) From 7437b6b23575a70279d77d072f25dbb73a7e714b Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 31 Mar 2021 16:50:15 +0200 Subject: [PATCH 055/172] update swagger param --- tools.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools.py b/tools.py index 02af229..36ab763 100644 --- a/tools.py +++ b/tools.py @@ -57,6 +57,8 @@ def __init__(self): ) == "true" else False if 'SWAGGER_PATH' in os.environ: self.SWAGGER_PATH = os.environ['SWAGGER_PATH'] + if 'SWAGGER_URL' in os.environ: + self.SWAGGER_URL = os.environ['SWAGGER_URL'] # start loading ASR configuration self.log.info("Create the new config files") From e6766319c6b07aa22d023836a7ee900b8486cbfa Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 31 Mar 2021 17:07:36 +0200 Subject: [PATCH 056/172] add prefix to swagger ui --- tools.py | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/tools.py b/tools.py index 36ab763..c02859f 100644 --- a/tools.py +++ b/tools.py @@ -42,6 +42,7 @@ def __init__(self): self.SAVE_AUDIO = False self.SERVICE_PORT = 80 self.SWAGGER_URL = '/api-doc' + self.SWAGGER_PREFIX = '' self.SWAGGER_PATH = '' self.ONLINE = False @@ -57,8 +58,8 @@ def __init__(self): ) == "true" else False if 'SWAGGER_PATH' in os.environ: self.SWAGGER_PATH = os.environ['SWAGGER_PATH'] - if 'SWAGGER_URL' in os.environ: - self.SWAGGER_URL = os.environ['SWAGGER_URL'] + if 'SWAGGER_PREFIX' in os.environ: + self.SWAGGER_PREFIX = os.environ['SWAGGER_PREFIX'] # start loading ASR configuration self.log.info("Create the new config files") @@ -70,7 +71,7 @@ def swaggerUI(self, app): open(self.SWAGGER_PATH, 'r'), Loader=yaml.Loader) swaggerui = get_swaggerui_blueprint( # Swagger UI static files will be 
mapped to '{SWAGGER_URL}/dist/' - self.SWAGGER_URL, + self.SWAGGER_PREFIX+self.SWAGGER_URL, self.SWAGGER_PATH, config={ # Swagger UI config overrides 'app_name': "STT API Documentation", From ad3cc3ae07fe85d274eb68085f79720b16aabc14 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 1 Apr 2021 16:19:55 +0200 Subject: [PATCH 057/172] remove healthcheck --- Dockerfile | 2 -- 1 file changed, 2 deletions(-) diff --git a/Dockerfile b/Dockerfile index b0f03f5..3aa8b30 100644 --- a/Dockerfile +++ b/Dockerfile @@ -81,7 +81,5 @@ COPY tools.py run.py docker-entrypoint.sh wait-for-it.sh ./ EXPOSE 80 -HEALTHCHECK CMD curl http://localhost/healthcheck || exit 1 - # Entrypoint handles the passed arguments ENTRYPOINT ["./docker-entrypoint.sh"] \ No newline at end of file From 59a81b4d4c90bfbb8427b07564110de3ebb7c1ca Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 14 Apr 2021 14:28:18 +0200 Subject: [PATCH 058/172] fix services call --- tools.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools.py b/tools.py index c02859f..b8c4f9a 100644 --- a/tools.py +++ b/tools.py @@ -342,7 +342,7 @@ def get(self, audio_path): return speakers else: - raise ValueError('Service is OFF') + return None except Exception as e: self.log.error(str(e)) return None @@ -398,7 +398,7 @@ def get(self, text): return result.text else: - raise ValueError('Service is OFF') + return text except Exception as e: self.log.error(str(e)) return text From fec438fcd135b5d15fdc6069dfce9eef3ee2454b Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 28 Apr 2021 15:57:38 +0200 Subject: [PATCH 059/172] add a new response type to activate/deactivate speaker diarization --- run.py | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/run.py b/run.py index 7cba003..3c698b8 100644 --- a/run.py +++ b/run.py @@ -43,12 +43,17 @@ def transcribe(): (strftime("%d/%b/%d %H:%M:%S", gmtime()))) is_metadata = False + do_spk = True # get response content type if request.headers.get('accept').lower() == 'application/json': is_metadata = True + elif request.headers.get('accept').lower() == 'application/json-nospk': + is_metadata = True + do_spk = False elif request.headers.get('accept').lower() == 'text/plain': is_metadata = False + do_spk = False else: raise ValueError('Not accepted header') @@ -57,7 +62,9 @@ def transcribe(): file = request.files['file'] worker.getAudio(file) data, confidence = decode(is_metadata) - spk = speakerdiarization.get(worker.file_path) + spk = None + if do_spk: + spk = speakerdiarization.get(worker.file_path) trans = worker.get_response(data, spk, confidence, is_metadata) response = punctuation.get(trans) worker.clean() From fc060f5120bb9b9da48f156311ab8b41164b709e Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 28 Apr 2021 17:15:50 +0200 Subject: [PATCH 060/172] fix punctuation text encode --- tools.py | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/tools.py b/tools.py index b8c4f9a..6e34818 100644 --- a/tools.py +++ b/tools.py @@ -381,20 +381,21 @@ def get(self, text): text_punc = [] for utterance in text['text']: data = utterance.split(':') + self.log.info(data[1].strip()) result = requests.post(self.url, data=data[1].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) if result.status_code != 200: raise ValueError(result.text) - text_punc.append(data[0]+": "+result.text.encode('latin-1').decode('utf-8')) + text_punc.append(data[0]+": "+result.text) text['text'] = text_punc else: result = requests.post(self.url, 
data=text['text'].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) - text['text'] = result.text.encode('latin-1').decode('utf-8') + text['text'] = result.text return text else: result = requests.post(self.url, data=text.encode('utf-8'), headers={'content-type': 'application/octet-stream'}) if result.status_code != 200: - raise ValueError(result.text.encode('latin-1').decode('utf-8')) + raise ValueError(result.text) return result.text else: From 349d3d9e0f4bed6bdc08d4e2a6e3a3e509ec910a Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Wed, 28 Apr 2021 17:17:32 +0200 Subject: [PATCH 061/172] remove verification message --- tools.py | 1 - 1 file changed, 1 deletion(-) diff --git a/tools.py b/tools.py index 6e34818..6a3324d 100644 --- a/tools.py +++ b/tools.py @@ -381,7 +381,6 @@ def get(self, text): text_punc = [] for utterance in text['text']: data = utterance.split(':') - self.log.info(data[1].strip()) result = requests.post(self.url, data=data[1].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) if result.status_code != 200: raise ValueError(result.text) From 4d58d1a84486f528d8cc494baff4a7f028d7b38f Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 11 May 2021 15:36:32 +0200 Subject: [PATCH 062/172] clean memory --- run.py | 1 + tools.py | 1 + 2 files changed, 2 insertions(+) diff --git a/run.py b/run.py index 3c698b8..cfec1ab 100644 --- a/run.py +++ b/run.py @@ -29,6 +29,7 @@ def decode(is_metadata): confidence = rec.uttConfidence() if is_metadata: data = rec.GetMetadata() + del rec return data, confidence # API diff --git a/tools.py b/tools.py index 6a3324d..226f6cc 100644 --- a/tools.py +++ b/tools.py @@ -97,6 +97,7 @@ def getAudio(self, file): def clean(self): if not self.SAVE_AUDIO: os.remove(self.file_path) + del self.data # re-create config files def loadConfig(self): From aa957e4601977ead3002c807dcf4aa492a2f530e Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 26 Aug 2021 21:03:20 +0200 Subject: [PATCH 063/172] update linstt worker --- .gitmodules | 6 ------ Dockerfile | 20 ++++---------------- docker-entrypoint.sh | 8 +++++++- pyBK | 1 - requirements.txt | 7 +++++++ run.py | 26 ++++++++++++++++++-------- tools.py | 20 ++++---------------- vosk-api | 1 - 8 files changed, 40 insertions(+), 49 deletions(-) delete mode 100644 .gitmodules delete mode 160000 pyBK create mode 100644 requirements.txt delete mode 160000 vosk-api diff --git a/.gitmodules b/.gitmodules deleted file mode 100644 index b131dc4..0000000 --- a/.gitmodules +++ /dev/null @@ -1,6 +0,0 @@ -[submodule "vosk-api"] - path = vosk-api - url = https://github.com/irebai/vosk-api.git -[submodule "pyBK"] - path = pyBK - url = https://github.com/irebai/pyBK.git diff --git a/Dockerfile b/Dockerfile index 3aa8b30..3283b9c 100644 --- a/Dockerfile +++ b/Dockerfile @@ -52,31 +52,19 @@ RUN git clone --depth 1 https://github.com/kaldi-asr/kaldi.git /opt/kaldi && \ cd /opt/kaldi/tools && mkdir openfst_ && mv openfst-*/lib openfst-*/include openfst-*/bin openfst_ && rm openfst_/lib/*.so* openfst_/lib/*.la && \ rm -r openfst-*/* && mv openfst_/* openfst-*/ && rm -r openfst_ -# Install pyBK (speaker diarization toolkit) -RUN apt install -y software-properties-common && wget https://apt.llvm.org/llvm.sh && chmod +x llvm.sh && ./llvm.sh 10 && \ - export LLVM_CONFIG=/usr/bin/llvm-config-10 && \ - pip3 install numpy && \ - pip3 install websockets && \ - pip3 install librosa webrtcvad scipy sklearn - -# Install main service packages -RUN pip3 install flask flask-cors 
flask-swagger-ui gevent pyyaml && \ - apt-get install -y ffmpeg +# Install python dependencies +COPY requirements.txt ./ +RUN pip3 install --no-cache-dir -r requirements.txt # build VOSK KALDI -COPY vosk-api /opt/vosk-api -RUN cd /opt/vosk-api/python && \ +RUN git clone --depth 1 https://github.com/irebai/vosk-api.git /opt/vosk-api && cd /opt/vosk-api/python && \ export KALDI_ROOT=/opt/kaldi && \ export KALDI_MKL=1 && \ python3 setup.py install --user --single-version-externally-managed --root=/ -# Install curl for healthcheck -RUN apt-get update && apt-get install -y curl - # Define the main folder WORKDIR /usr/src/speech-to-text -COPY pyBK/diarizationFunctions.py pyBK/diarizationFunctions.py COPY tools.py run.py docker-entrypoint.sh wait-for-it.sh ./ EXPOSE 80 diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh index 6a826a7..8ca4752 100755 --- a/docker-entrypoint.sh +++ b/docker-entrypoint.sh @@ -4,20 +4,26 @@ set -e max_attempts=3 delay=5 +punctuation_state=0 +if [[ ! -z $PUCTUATION_HOST && ! -z $PUCTUATION_PORT ]]; then for retry in $(seq 1 $max_attempts); do echo "Waiting punctuation service... [attempt=$retry]" punctuation_state=1 ./wait-for-it.sh $PUCTUATION_HOST:$PUCTUATION_PORT --timeout=$delay || punctuation_state=0 if [ $punctuation_state == 1 ]; then break; fi done +fi +spkdiarization_state=0 +if [[ ! -z $SPEAKER_DIARIZATION_HOST && ! -z $SPEAKER_DIARIZATION_PORT ]]; then for retry in $(seq 1 $max_attempts); do echo "Waiting speaker diarization service... [attempt=$retry]" spkdiarization_state=1 ./wait-for-it.sh $SPEAKER_DIARIZATION_HOST:$SPEAKER_DIARIZATION_PORT --timeout=$delay || spkdiarization_state=0 if [ $spkdiarization_state == 1 ]; then break; fi done +fi -echo "RUNNING service" +echo "Start service" python3 ./run.py --puctuation $punctuation_state --speaker_diarization $spkdiarization_state \ No newline at end of file diff --git a/pyBK b/pyBK deleted file mode 160000 index 1e5dc7d..0000000 --- a/pyBK +++ /dev/null @@ -1 +0,0 @@ -Subproject commit 1e5dc7de4e0a7d43a44152a68beca0699c14fd4c diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..86cc88c --- /dev/null +++ b/requirements.txt @@ -0,0 +1,7 @@ +flask>=1.1.2 +flask-cors>=3.0.10 +flask-swagger-ui>=3.36.0 +gevent>=21.8.0 +pyyaml>=5.4.1 +wavio>=0.0.4 +requests>=2.26.0 \ No newline at end of file diff --git a/run.py b/run.py index cfec1ab..fdcb819 100644 --- a/run.py +++ b/run.py @@ -23,13 +23,19 @@ spkModel = None def decode(is_metadata): - rec = KaldiRecognizer(model, spkModel, worker.rate, worker.ONLINE) - rec.AcceptWaveform(worker.data) - data = rec.FinalResult() - confidence = rec.uttConfidence() + if is_metadata and len(worker.data) / worker.rate > 30 : + recognizer = KaldiRecognizer(model, spkModel, worker.rate, is_metadata, True) + for i in range(0, len(worker.data), int(worker.rate/4)): + if recognizer.AcceptWaveform(worker.data[i:i + int(worker.rate/4)]): + recognizer.Result() + else: + recognizer = KaldiRecognizer(model, None, worker.rate, is_metadata, False) + recognizer.AcceptWaveform(worker.data) + + data = recognizer.FinalResult() + confidence = recognizer.uttConfidence() if is_metadata: - data = rec.GetMetadata() - del rec + data = recognizer.GetMetadata() return data, confidence # API @@ -40,7 +46,7 @@ def healthcheck(): @app.route('/transcribe', methods=['POST']) def transcribe(): try: - worker.log.info('[%s] New user entry on /transcribe' % + worker.log.info('[%s] Transcribe request received' % (strftime("%d/%b/%d %H:%M:%S", gmtime()))) is_metadata = False @@ 
-62,19 +68,23 @@ def transcribe(): if 'file' in request.files.keys(): file = request.files['file'] worker.getAudio(file) + worker.log.info("Start decoding [Audio duration={}(s)]".format(str(int(len(worker.data) / worker.rate)))) data, confidence = decode(is_metadata) + worker.log.info("Decoding complete") spk = None if do_spk: spk = speakerdiarization.get(worker.file_path) trans = worker.get_response(data, spk, confidence, is_metadata) response = punctuation.get(trans) worker.clean() + worker.log.info("... Complete") else: raise ValueError('No audio file was uploaded') return response, 200 except ValueError as error: - return str(error), 400 + worker.log.error(e) + return 'Server Error', 400 except Exception as e: worker.log.error(e) return 'Server Error', 500 diff --git a/tools.py b/tools.py index 226f6cc..8238ecc 100644 --- a/tools.py +++ b/tools.py @@ -1,20 +1,7 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -#  ASR -from vosk import Model, KaldiRecognizer -############## - -# Speaker Diarization -from pyBK.diarizationFunctions import * -import librosa -import time -import webrtcvad -############## - -# other packages import configparser -import librosa import logging import os import re @@ -22,10 +9,9 @@ import json import yaml import numpy as np -from scipy.io import wavfile +import wavio from flask_swagger_ui import get_swaggerui_blueprint import requests -############## class Worker: @@ -86,7 +72,9 @@ def getAudio(self, file): self.file_path = self.TEMP_FILE_PATH+"/"+filename file.save(self.file_path) try: - self.rate, self.data = wavfile.read(self.file_path) + file_content = wavio.read(self.file_path) + self.rate = file_content.rate + self.data = file_content.data # if stereo file, convert to mono by computing the mean of the channels if len(self.data.shape) == 2 and self.data.shape[1] == 2: self.data = np.mean(self.data, axis=1, dtype=np.int16) diff --git a/vosk-api b/vosk-api deleted file mode 160000 index 7f555e4..0000000 --- a/vosk-api +++ /dev/null @@ -1 +0,0 @@ -Subproject commit 7f555e464c1d6b16233354491868f46d009c453c From 44888b82eeb905c3498b365f7b9db240d3d4eab9 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 27 Aug 2021 10:32:07 +0200 Subject: [PATCH 064/172] remove speaker information from response --- tools.py | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools.py b/tools.py index 8238ecc..5a6eaae 100644 --- a/tools.py +++ b/tools.py @@ -184,14 +184,14 @@ def get_response(self, dataJson, speakers, confidence, is_metadata): # Generate final output data return self.process_output_v2(data, speakers) else: - return {'speakers': [], 'text': data['text'], 'confidence-score': data['conf'], 'words': data['words']} + return {'text': data['text'], 'confidence-score': data['conf'], 'words': data['words']} elif 'text' in data: - return {'speakers': [], 'text': data['text'], 'confidence-score': data['conf'], 'words': []} + return {'text': data['text'], 'confidence-score': data['conf'], 'words': []} else: - return {'speakers': [], 'text': '', 'confidence-score': 0, 'words': []} + return {'text': '', 'confidence-score': 0, 'words': []} else: - return {'speakers': [], 'text': '', 'confidence-score': 0, 'words': []} + return {'text': '', 'confidence-score': 0, 'words': []} # return a json object including word-data, speaker-data def process_output(self, data, spkrs): From 0e9843c89f6beeacae091374627cc726c2a86d86 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 27 Aug 2021 10:49:46 +0200 Subject: [PATCH 065/172] update linstt --- run.py | 2 +- 1 
file changed, 1 insertion(+), 1 deletion(-) diff --git a/run.py b/run.py index fdcb819..a4bef94 100644 --- a/run.py +++ b/run.py @@ -23,7 +23,7 @@ spkModel = None def decode(is_metadata): - if is_metadata and len(worker.data) / worker.rate > 30 : + if is_metadata and len(worker.data) / worker.rate > 1800 : recognizer = KaldiRecognizer(model, spkModel, worker.rate, is_metadata, True) for i in range(0, len(worker.data), int(worker.rate/4)): if recognizer.AcceptWaveform(worker.data[i:i + int(worker.rate/4)]): From 30af1fc86b20064c3154021a3c0732e3ae5cf7fe Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 31 Aug 2021 22:37:52 +0200 Subject: [PATCH 066/172] add new features/roots --- document/swagger.yml | 37 +++++++++++++++++++++- run.py | 74 ++++++++++++++++++++++++++++++++++---------- tools.py | 25 +++++++++------ 3 files changed, 109 insertions(+), 27 deletions(-) diff --git a/document/swagger.yml b/document/swagger.yml index b52b52c..4426218 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -7,7 +7,7 @@ info: schemes: - http -host: localhost:8888 +host: 127.0.0.1:8888 basePath: / paths: @@ -30,3 +30,38 @@ paths: responses: 200: description: Successfully transcribe the audio + 400: + description: Request error + 500: + description: Server error + /transcription/{PID}: + get: + tags: + - "Speech-To-Text API" + summary: Perform Speech-to-Text + consumes: + - "multipart/form-data" + produces: + - "application/json" + - "text/plain" + parameters: + - name: "PID" + in: "path" + description: "PID of a transcribe request" + required: true + type: "string" + responses: + 200: + description: Get transcription + 400: + description: Invalid PID + /get/pids: + get: + tags: + - "Speech-To-Text API" + summary: Get PIDs + produces: + - "application/json" + responses: + 200: + description: Get list of PIDs \ No newline at end of file diff --git a/run.py b/run.py index a4bef94..22d9677 100644 --- a/run.py +++ b/run.py @@ -8,9 +8,13 @@ from gevent.pywsgi import WSGIServer import argparse import os +import _thread +import uuid app = Flask("__stt-standelone-worker__") +max_duration = 1800 + # instantiate services worker = Worker() punctuation = Punctuation() @@ -23,7 +27,7 @@ spkModel = None def decode(is_metadata): - if is_metadata and len(worker.data) / worker.rate > 1800 : + if is_metadata and len(worker.data) / worker.rate > max_duration : recognizer = KaldiRecognizer(model, spkModel, worker.rate, is_metadata, True) for i in range(0, len(worker.data), int(worker.rate/4)): if recognizer.AcceptWaveform(worker.data[i:i + int(worker.rate/4)]): @@ -38,11 +42,44 @@ def decode(is_metadata): data = recognizer.GetMetadata() return data, confidence +def processing(is_metadata, do_spk, audio_buffer, file_path=None): + try: + worker.log.info("Start decoding") + data, confidence = decode(is_metadata) + worker.log.info("Decoding complete") + worker.log.info("Post Processing ...") + spk = None + if do_spk: + spk = speakerdiarization.get(audio_buffer) + trans = worker.get_response(data, spk, confidence, is_metadata) + response = punctuation.get(trans) + worker.log.info("... 
Complete") + if file_path is not None: + with open(file_path, 'w') as outfile: + json.dump(response, outfile) + else: + return response + except Exception as e: + worker.log.error(e) + exit(1) + # API @app.route('/healthcheck', methods=['GET']) def healthcheck(): return "1", 200 +@app.route('/transcription/', methods=['GET']) +def transcription(PID): + file_path = worker.TRANS_FILES_PATH + "/" + str(PID) + if os.path.exists(file_path): + return json.load(open(file_path,)), 200 + else: + return "PID {} is invalid".format(str(PID)), 400 + +@app.route('/get/pids', methods=['GET']) +def get(): + return json.load(open(worker.TRANS_FILES_PATH + "/pids.json")), 200 + @app.route('/transcribe', methods=['POST']) def transcribe(): try: @@ -65,26 +102,29 @@ def transcribe(): raise ValueError('Not accepted header') # get input file - if 'file' in request.files.keys(): - file = request.files['file'] - worker.getAudio(file) - worker.log.info("Start decoding [Audio duration={}(s)]".format(str(int(len(worker.data) / worker.rate)))) - data, confidence = decode(is_metadata) - worker.log.info("Decoding complete") - spk = None - if do_spk: - spk = speakerdiarization.get(worker.file_path) - trans = worker.get_response(data, spk, confidence, is_metadata) - response = punctuation.get(trans) - worker.clean() - worker.log.info("... Complete") - else: + if 'file' not in request.files.keys(): raise ValueError('No audio file was uploaded') + audio_buffer = request.files['file'].read() + worker.getAudio(audio_buffer) + duration = int(len(worker.data) / worker.rate) + if duration > max_duration: + filename = str(uuid.uuid4()) + file_path = worker.TRANS_FILES_PATH + "/" + filename + + pids = json.load(open(worker.TRANS_FILES_PATH + "/pids.json")) + pids['pids'].append({'pid':filename, 'time':strftime("%d/%b/%d %H:%M:%S", gmtime())}) + with open(worker.TRANS_FILES_PATH + "/pids.json", 'w') as pids_file: + json.dump(pids, pids_file) + + _thread.start_new_thread(processing, (is_metadata, do_spk, audio_buffer, file_path,)) + return "The approximate decoding time is {} seconds. 
Use this PID={} to get the transcription after decoding.".format(str(int(duration*0.33)), filename), 200 + response = processing(is_metadata, do_spk, audio_buffer) + return response, 200 - except ValueError as error: + except ValueError as e: worker.log.error(e) - return 'Server Error', 400 + return str(e), 400 except Exception as e: worker.log.error(e) return 'Server Error', 500 diff --git a/tools.py b/tools.py index 5a6eaae..0450885 100644 --- a/tools.py +++ b/tools.py @@ -4,6 +4,7 @@ import configparser import logging import os +import io import re import uuid import json @@ -24,6 +25,7 @@ def __init__(self): self.AM_PATH = '/opt/models/AM' self.LM_PATH = '/opt/models/LM' self.TEMP_FILE_PATH = '/opt/tmp' + self.TRANS_FILES_PATH = '/opt/trans' self.CONFIG_FILES_PATH = '/opt/config' self.SAVE_AUDIO = False self.SERVICE_PORT = 80 @@ -38,6 +40,12 @@ def __init__(self): if not os.path.isdir(self.TEMP_FILE_PATH): os.mkdir(self.TEMP_FILE_PATH) + if not os.path.isdir(self.TRANS_FILES_PATH): + os.mkdir(self.TRANS_FILES_PATH) + + with open(self.TRANS_FILES_PATH + "/pids.json", 'w') as outfile: + json.dump({'pids':[]}, outfile) + # Environment parameters if 'SAVE_AUDIO' in os.environ: self.SAVE_AUDIO = True if os.environ['SAVE_AUDIO'].lower( @@ -68,11 +76,8 @@ def swaggerUI(self, app): ### end swagger specific ### def getAudio(self, file): - filename = str(uuid.uuid4()) - self.file_path = self.TEMP_FILE_PATH+"/"+filename - file.save(self.file_path) try: - file_content = wavio.read(self.file_path) + file_content = wavio.read(io.BytesIO(file)) self.rate = file_content.rate self.data = file_content.data # if stereo file, convert to mono by computing the mean of the channels @@ -82,10 +87,12 @@ def getAudio(self, file): self.log.error(e) raise ValueError("The uploaded file format is not supported!!!") - def clean(self): - if not self.SAVE_AUDIO: - os.remove(self.file_path) - del self.data + def saveFile(self, file): + if self.SAVE_AUDIO: + filename = str(uuid.uuid4()) + self.file_path = self.TEMP_FILE_PATH+"/"+filename + file.save(self.file_path) + # re-create config files def loadConfig(self): @@ -184,7 +191,7 @@ def get_response(self, dataJson, speakers, confidence, is_metadata): # Generate final output data return self.process_output_v2(data, speakers) else: - return {'text': data['text'], 'confidence-score': data['conf'], 'words': data['words']} + return {'text': self.parse_text(data['text']), 'confidence-score': data['conf'], 'words': data['words']} elif 'text' in data: return {'text': data['text'], 'confidence-score': data['conf'], 'words': []} From 86fc204df63d87cf6831011f8970fa1248d4e59b Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Tue, 31 Aug 2021 22:52:58 +0200 Subject: [PATCH 067/172] rename PID to jobid --- document/swagger.yml | 14 +++++++------- run.py | 24 ++++++++++++------------ tools.py | 4 ++-- 3 files changed, 21 insertions(+), 21 deletions(-) diff --git a/document/swagger.yml b/document/swagger.yml index 4426218..d169694 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -34,7 +34,7 @@ paths: description: Request error 500: description: Server error - /transcription/{PID}: + /transcription/{jobid}: get: tags: - "Speech-To-Text API" @@ -45,23 +45,23 @@ paths: - "application/json" - "text/plain" parameters: - - name: "PID" + - name: "jobid" in: "path" - description: "PID of a transcribe request" + description: "jobid of a transcribe request" required: true type: "string" responses: 200: description: Get transcription 400: - description: Invalid PID - /get/pids: + 
description: Invalid jobid + /get/jobids: get: tags: - "Speech-To-Text API" - summary: Get PIDs + summary: Get jobids produces: - "application/json" responses: 200: - description: Get list of PIDs \ No newline at end of file + description: Get list of jobids \ No newline at end of file diff --git a/run.py b/run.py index 22d9677..2a039c4 100644 --- a/run.py +++ b/run.py @@ -68,17 +68,17 @@ def processing(is_metadata, do_spk, audio_buffer, file_path=None): def healthcheck(): return "1", 200 -@app.route('/transcription/', methods=['GET']) -def transcription(PID): - file_path = worker.TRANS_FILES_PATH + "/" + str(PID) +@app.route('/transcription/', methods=['GET']) +def transcription(jobid): + file_path = worker.TRANS_FILES_PATH + "/" + str(jobid) if os.path.exists(file_path): return json.load(open(file_path,)), 200 else: - return "PID {} is invalid".format(str(PID)), 400 + return "jobid {} is invalid".format(str(jobid)), 400 -@app.route('/get/pids', methods=['GET']) +@app.route('/get/jobids', methods=['GET']) def get(): - return json.load(open(worker.TRANS_FILES_PATH + "/pids.json")), 200 + return json.load(open(worker.TRANS_FILES_PATH + "/jobids.json")), 200 @app.route('/transcribe', methods=['POST']) def transcribe(): @@ -109,16 +109,16 @@ def transcribe(): worker.getAudio(audio_buffer) duration = int(len(worker.data) / worker.rate) if duration > max_duration: - filename = str(uuid.uuid4()) - file_path = worker.TRANS_FILES_PATH + "/" + filename + jobid = str(uuid.uuid4()) + file_path = worker.TRANS_FILES_PATH + "/" + jobid - pids = json.load(open(worker.TRANS_FILES_PATH + "/pids.json")) - pids['pids'].append({'pid':filename, 'time':strftime("%d/%b/%d %H:%M:%S", gmtime())}) - with open(worker.TRANS_FILES_PATH + "/pids.json", 'w') as pids_file: + pids = json.load(open(worker.TRANS_FILES_PATH + "/jobids.json")) + pids['jobids'].append({'jobid':jobid, 'time':strftime("%d/%b/%d %H:%M:%S", gmtime())}) + with open(worker.TRANS_FILES_PATH + "/jobids.json", 'w') as pids_file: json.dump(pids, pids_file) _thread.start_new_thread(processing, (is_metadata, do_spk, audio_buffer, file_path,)) - return "The approximate decoding time is {} seconds. Use this PID={} to get the transcription after decoding.".format(str(int(duration*0.33)), filename), 200 + return "The approximate decoding time is {} seconds. 
Use this jobid={} to get the transcription after decoding.".format(str(int(duration*0.33)), jobid), 200 response = processing(is_metadata, do_spk, audio_buffer) return response, 200 diff --git a/tools.py b/tools.py index 0450885..d9353d7 100644 --- a/tools.py +++ b/tools.py @@ -43,8 +43,8 @@ def __init__(self): if not os.path.isdir(self.TRANS_FILES_PATH): os.mkdir(self.TRANS_FILES_PATH) - with open(self.TRANS_FILES_PATH + "/pids.json", 'w') as outfile: - json.dump({'pids':[]}, outfile) + with open(self.TRANS_FILES_PATH + "/jobids.json", 'w') as outfile: + json.dump({'jobids':[]}, outfile) # Environment parameters if 'SAVE_AUDIO' in os.environ: From 701e4e2046c68b0b97579751e7301623e947a176 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 2 Sep 2021 17:12:40 +0200 Subject: [PATCH 068/172] fix audio format to send to the service of speaker diarization --- tools.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/tools.py b/tools.py index d9353d7..6bee1bc 100644 --- a/tools.py +++ b/tools.py @@ -317,11 +317,10 @@ def setParam(self, SPEAKER_DIARIZATION_ISON): self.log.info(self.url) if self.url is not None else self.log.warn( "The Speaker Diarization service is not running!") - def get(self, audio_path): + def get(self, audio_buffer): try: if self.SPEAKER_DIARIZATION_ISON: - file = open(audio_path, 'rb') - result = requests.post(self.url, files={'file': file}) + result = requests.post(self.url, files={'file': audio_buffer}) if result.status_code != 200: raise ValueError(result.text) From 2df35e63aa96ac530fd77042372c38b67669b4e7 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 2 Sep 2021 20:51:34 +0200 Subject: [PATCH 069/172] update README --- README.md | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 64 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 419dcd6..aae3c2d 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,10 @@ This service is mandatory in a LinTO platform stack as the main worker for speec Generally, Automatic Speech Recognition (ASR) is the task of recognition and translation of spoken language into text. Our ASR system takes advantages from the recent advances in machine learning technologies and in particular deep learning ones (TDNN, LSTM, attentation-based architecture). The core of our system consists of two main components: an acoustic model and a decoding graph. A high-performance ASR system relies on an accurate acoustic model as well as a perfect decoding graph. +**NB**: The service works as follows: +* If the audio's duration is less that 30 minutes, the service will return the transcription after decoding. +* Otherwise, the server will return a **jobid** that could be used to get the transcription after decoding using the API **`/transcription/{jobid}`**. + ## Usage See documentation : [doc.linto.ai](https://doc.linto.ai) @@ -27,7 +31,7 @@ While there is no specific minimal requirement on the CPU, speech recognition is ## Installation ### Packaged in Docker -To start the LinSTT service on your local machine or your cloud, you need first to download the source code and set the environment file, as follows: +To start the STT worker on your local machine or your cloud, you need first to download the source code and set the environment file, as follows: ```bash git clone https://github.com/linto-ai/linto-platform-stt-standalone-worker @@ -57,7 +61,7 @@ docker pull lintoai/linto-platform-stt-standalone-worker:latest NB: You must install docker and docker-compose on your machine. 
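A minimal client-side sketch of the flow described above (short recordings return the transcription directly, long recordings return a `jobid` to poll). It assumes the worker is reachable on `localhost:8888`, as in the test script, and that the Python `requests` package is installed; adapt names and paths to your deployment:

```python
import requests

BASE = "http://localhost:8888"   # assumed address; adjust to your deployment

# Short recording: the transcription comes back directly
with open("bonjour.wav", "rb") as f:
    resp = requests.post(
        f"{BASE}/transcribe",
        files={"file": ("bonjour.wav", f, "audio/wav")},
        headers={"accept": "application/json"},   # use "text/plain" for raw text only
    )
print(resp.status_code, resp.text)

# Long recording: the worker answers with a jobid; poll until the result is ready
jobid = "copy-the-jobid-from-the-response-above"   # placeholder value
poll = requests.get(f"{BASE}/transcription/{jobid}")
print(poll.status_code, poll.text)   # 400 while decoding, 200 with the transcription once done

# GET {BASE}/jobids lists the identifiers the worker has handed out so far
```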
## Configuration -The LinSTT service that will be set-up here require KALDI models, the acoustic model and the decoding graph. Indeed, these models are not included in the repository; you must download them in order to run LinSTT. You can use our pre-trained models from here: [Downloads](https://doc.linto.ai/#/services/linstt_download). +The STT worker that will be set-up here require KALDI models, the acoustic model and the decoding graph. Indeed, these models are not included in the repository; you must download them in order to run LinSTT. You can use our pre-trained models from here: [Downloads](https://doc.linto.ai/#/services/linstt_download). 1- Download the French acoustic model and the small decoding graph (linstt.v1). You can download the latest version for optimal performance and you should make sure that you have the hardware requirement in terms of RAM. @@ -83,8 +87,10 @@ mv DG_fr-FR ~/linstt_model_storage 4- Configure the environment file `.env` included in this repository +```bash AM_PATH=~/linstt_model_storage/AM_fr-FR LM_PATH=~/linstt_model_storage/DG_fr-FR +``` NB: if you want to use the visual user interface of the service, you need also to configure the swagger file `document/swagger.yml` included in this repository. Specifically, in the section `host`, specify the adress of the machine in which the service is deployed. @@ -129,6 +135,32 @@ Convert a speech to text > > **{text|Json}** : Return the full transcription or a json object with metadata + +#### /transcription/{jobid} + +Get the transcription using the jobid + +### Functionality +> `get`
+> Make a GET request +>> Arguments : +>> - **{String} jobid** jobid - An identifier used to find the corresponding transcription +> +> **{text|Json}** : Return the transcription + + +#### /jobids + +List of the transcription jobids + +### Functionality +> `get`
+> Make a GET request +>> Arguments : +>> - no arguments +> +> **{Json}** : Return a json object with jobids + @@ -145,4 +177,33 @@ And run the test script: ./test_deployment.sh ``` -To run personal test, you can use swagger interface: `localhost:8888/api-doc/` \ No newline at end of file +To run personal test, you can use swagger interface: `localhost:8888/api-doc/` + + +### Extrat metadata +If you would like to have a transcription with speaker information and punctuation marks, it's possible thanks to our open-source services: + +* Speaker diarization worker: https://github.com/linto-ai/linto-platform-speaker-diarization-worker +* Text punctuation worker: https://github.com/linto-ai/linto-platform-text-punctuation-worker + +To do that, you need first to start either the speaker or punctuation service or you can start both if it's necessary. **Please read the documentation to know how to install, configure, and start these services.** + +Once the services are on, you need to configure the STT worker as follows: + +1- Edit the environment file `.env` as follows: + +* if you started the punctuation worker, the following variables should be used + +```bash + PUCTUATION_HOST=text-punctuation-worker-host-name + PUCTUATION_PORT=worker-port-example-80 + PUCTUATION_ROUTE=/api/route/path/ +``` +* if you started the speaker diarization worker, the following variables should be used + +```bash + SPEAKER_DIARIZATION_HOST=speaker-diarization-worker-host-name + SPEAKER_DIARIZATION_PORT=worker-port-example-80 +``` + +2- Start the service using the same command described in section **Execute** \ No newline at end of file From 5058f436d0967b74be1736b845299766ac9ca773 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Thu, 2 Sep 2021 20:57:23 +0200 Subject: [PATCH 070/172] remove extra parameters --- .envdefault | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/.envdefault b/.envdefault index 8cc601e..130f6ef 100644 --- a/.envdefault +++ b/.envdefault @@ -1,10 +1,3 @@ AM_PATH=/path/to/acoustic/models/dir LM_PATH=/path/to/language/models/dir -SWAGGER_PATH=./document/swagger.yml - -# dependent services config -PUCTUATION_HOST=text-punctuation-worker-host-name -PUCTUATION_PORT=8080 -PUCTUATION_ROUTE="/api/route/path/" -SPEAKER_DIARIZATION_HOST=speaker-diarization-worker-host-name -SPEAKER_DIARIZATION_PORT=80 \ No newline at end of file +SWAGGER_PATH=./document/swagger.yml \ No newline at end of file From 2d5a26e64899d9634f3c7f0f6a79fb47413030f2 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 3 Sep 2021 09:55:38 +0200 Subject: [PATCH 071/172] update the response --- run.py | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/run.py b/run.py index 2a039c4..8adbabe 100644 --- a/run.py +++ b/run.py @@ -13,7 +13,7 @@ app = Flask("__stt-standelone-worker__") -max_duration = 1800 +max_duration = 10 # instantiate services worker = Worker() @@ -118,7 +118,13 @@ def transcribe(): json.dump(pids, pids_file) _thread.start_new_thread(processing, (is_metadata, do_spk, audio_buffer, file_path,)) - return "The approximate decoding time is {} seconds. 
Use this jobid={} to get the transcription after decoding.".format(str(int(duration*0.33)), jobid), 200 + estdur = str(int(duration*0.33)) + response = { + 'jobid': jobid, + 'decoding_time': '~' + estdur + ' seconds', + 'message': "Use the jobid to get the transcription after decoding", + } + return response, 200 response = processing(is_metadata, do_spk, audio_buffer) return response, 200 From 6e96e339046ddd04c73681ba0a3d35da04b34654 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 3 Sep 2021 12:17:43 +0200 Subject: [PATCH 072/172] update the decoding function --- document/swagger.yml | 4 ++-- run.py | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/document/swagger.yml b/document/swagger.yml index d169694..a1ed3e7 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -7,7 +7,7 @@ info: schemes: - http -host: 127.0.0.1:8888 +host: localhost:8888 basePath: / paths: @@ -55,7 +55,7 @@ paths: description: Get transcription 400: description: Invalid jobid - /get/jobids: + /jobids: get: tags: - "Speech-To-Text API" diff --git a/run.py b/run.py index 8adbabe..e095af8 100644 --- a/run.py +++ b/run.py @@ -13,7 +13,7 @@ app = Flask("__stt-standelone-worker__") -max_duration = 10 +max_duration = 1800 # instantiate services worker = Worker() @@ -27,7 +27,7 @@ spkModel = None def decode(is_metadata): - if is_metadata and len(worker.data) / worker.rate > max_duration : + if len(worker.data) / worker.rate > max_duration : recognizer = KaldiRecognizer(model, spkModel, worker.rate, is_metadata, True) for i in range(0, len(worker.data), int(worker.rate/4)): if recognizer.AcceptWaveform(worker.data[i:i + int(worker.rate/4)]): @@ -76,7 +76,7 @@ def transcription(jobid): else: return "jobid {} is invalid".format(str(jobid)), 400 -@app.route('/get/jobids', methods=['GET']) +@app.route('/jobids', methods=['GET']) def get(): return json.load(open(worker.TRANS_FILES_PATH + "/jobids.json")), 200 @@ -118,7 +118,7 @@ def transcribe(): json.dump(pids, pids_file) _thread.start_new_thread(processing, (is_metadata, do_spk, audio_buffer, file_path,)) - estdur = str(int(duration*0.33)) + estdur = str(int(duration*0.3)) if is_metadata else str(int(duration*0.18)) response = { 'jobid': jobid, 'decoding_time': '~' + estdur + ' seconds', From 14604313f33f1c1c5056abb518bc51b03293cdf4 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 3 Sep 2021 12:31:48 +0200 Subject: [PATCH 073/172] update README --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index aae3c2d..bbeb319 100644 --- a/README.md +++ b/README.md @@ -123,7 +123,7 @@ Our service requires an audio file in `Waveform format`. It should has the follo Convert a speech to text -### Functionality +#### Functionality > `post`
> Make a POST request >> Arguments : @@ -140,7 +140,7 @@ Convert a speech to text Get the transcription using the jobid -### Functionality +#### Functionality > `get`
> Make a GET request >> Arguments : @@ -153,7 +153,7 @@ Get the transcription using the jobid List of the transcription jobids -### Functionality +#### Functionality > `get`
> Make a GET request >> Arguments : @@ -180,7 +180,7 @@ And run the test script: To run personal test, you can use swagger interface: `localhost:8888/api-doc/` -### Extrat metadata +### Additional metadata If you would like to have a transcription with speaker information and punctuation marks, it's possible thanks to our open-source services: * Speaker diarization worker: https://github.com/linto-ai/linto-platform-speaker-diarization-worker From 04535cac1c1f8c23a20bb7915f742b5ef5d8b654 Mon Sep 17 00:00:00 2001 From: Ilyes Rebai Date: Fri, 1 Oct 2021 00:30:14 +0200 Subject: [PATCH 074/172] fix response format --- run.py | 6 ++- tools.py | 120 ++++++++++++++++++------------------------------------- 2 files changed, 44 insertions(+), 82 deletions(-) diff --git a/run.py b/run.py index e095af8..7a47547 100644 --- a/run.py +++ b/run.py @@ -50,8 +50,12 @@ def processing(is_metadata, do_spk, audio_buffer, file_path=None): worker.log.info("Post Processing ...") spk = None if do_spk: - spk = speakerdiarization.get(audio_buffer) + spk = speakerdiarization.get(audio_buffer, int(len(worker.data) / worker.rate)) trans = worker.get_response(data, spk, confidence, is_metadata) + + if trans is None: + raise ValueError('Transcription error') + response = punctuation.get(trans) worker.log.info("... Complete") if file_path is not None: diff --git a/tools.py b/tools.py index 6bee1bc..e285a66 100644 --- a/tools.py +++ b/tools.py @@ -182,73 +182,17 @@ def get_response(self, dataJson, speakers, confidence, is_metadata): if dataJson is not None: data = json.loads(dataJson) data['conf'] = confidence - if not is_metadata: - text = data['text'] # get text from response - return self.parse_text(text) - - elif 'words' in data: - if speakers is not None: + if 'text' in data: + if not is_metadata: + text = data['text'] # get text from response + return self.parse_text(text) + elif 'words' in data and len(data['words']) > 0: # Generate final output data - return self.process_output_v2(data, speakers) - else: - return {'text': self.parse_text(data['text']), 'confidence-score': data['conf'], 'words': data['words']} - - elif 'text' in data: - return {'text': data['text'], 'confidence-score': data['conf'], 'words': []} - else: - return {'text': '', 'confidence-score': 0, 'words': []} - else: - return {'text': '', 'confidence-score': 0, 'words': []} + return self.process_output(data, speakers) + return None # return a json object including word-data, speaker-data def process_output(self, data, spkrs): - try: - speakers = [] - text = [] - i = 0 - text_ = "" - words = [] - for word in data['words']: - if i+1 == len(spkrs): - continue - if i+1 < len(spkrs) and word["end"] < spkrs[i+1][0]: - text_ += word["word"] + " " - words.append(word) - elif len(words) != 0: - speaker = {} - speaker["start"] = words[0]["start"] - speaker["end"] = words[len(words)-1]["end"] - speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) - speaker["words"] = words - - text.append( - 'spk'+str(int(spkrs[i][2]))+' : ' + self.parse_text(text_)) - speakers.append(speaker) - - words = [word] - text_ = word["word"] + " " - i += 1 - else: - words = [word] - text_ = word["word"] + " " - i += 1 - - speaker = {} - speaker["start"] = words[0]["start"] - speaker["end"] = words[len(words)-1]["end"] - speaker["speaker_id"] = 'spk'+str(int(spkrs[i][2])) - speaker["words"] = words - - text.append('spk'+str(int(spkrs[i][2])) + - ' : ' + self.parse_text(text_)) - speakers.append(speaker) - - return {'speakers': speakers, 'text': text, 'confidence-score': data['conf']} - 
except: - return {'text': data['text'], 'words': data['words'], 'confidence-score': data['conf'], 'spks': []} - - # return a json object including word-data, speaker-data - def process_output_v2(self, data, spkrs): try: speakers = [] text = [] @@ -256,6 +200,9 @@ def process_output_v2(self, data, spkrs): text_ = "" words = [] + # Capitalize first word + data['words'][0]['word'] = data['words'][0]['word'].capitalize() + for word in data['words']: if i+1 == len(spkrs): continue @@ -265,7 +212,7 @@ def process_output_v2(self, data, spkrs): elif len(words) != 0: speaker = {} speaker["start"] = words[0]["start"] - speaker["end"] = words[len(words)-1]["end"] + speaker["end"] = words[-1]["end"] speaker["speaker_id"] = str(spkrs[i]["spk_id"]) speaker["words"] = words @@ -281,9 +228,13 @@ def process_output_v2(self, data, spkrs): text_ = word["word"] + " " i += 1 + if i == 0: + words = data['words'] + text_ = data['text'].capitalize() + speaker = {} speaker["start"] = words[0]["start"] - speaker["end"] = words[len(words)-1]["end"] + speaker["end"] = words[-1]["end"] speaker["speaker_id"] = str(spkrs[i]["spk_id"]) speaker["words"] = words @@ -294,7 +245,7 @@ def process_output_v2(self, data, spkrs): return {'speakers': speakers, 'text': text, 'confidence-score': data['conf']} except Exception as e: self.log.error(e) - return {'text': data['text'], 'words': data['words'], 'confidence-score': data['conf'], 'spks': []} + return {'text': data['text'], 'words': data['words'], 'confidence-score': data['conf'], 'speakers': []} class SpeakerDiarization: @@ -317,7 +268,14 @@ def setParam(self, SPEAKER_DIARIZATION_ISON): self.log.info(self.url) if self.url is not None else self.log.warn( "The Speaker Diarization service is not running!") - def get(self, audio_buffer): + def get(self, audio_buffer, duration): + emptyReturn = [{ + "seg_id":1, + "spk_id":"spk1", + "seg_begin":0, + "seg_end":duration, + }] + try: if self.SPEAKER_DIARIZATION_ISON: result = requests.post(self.url, files={'file': audio_buffer}) @@ -337,13 +295,13 @@ def get(self, audio_buffer): return speakers else: - return None + return emptyReturn except Exception as e: self.log.error(str(e)) - return None + return emptyReturn except ValueError as error: self.log.error(str(error)) - return None + return emptyReturn class Punctuation: @@ -368,24 +326,24 @@ def setParam(self, PUCTUATION_ISON): self.log.info(self.url) if self.url is not None else self.log.warn( "The Punctuation service is not running!") - def get(self, text): + def get(self, obj): try: if self.PUCTUATION_ISON: - if isinstance(text, dict): - if isinstance(text['text'], list): + if isinstance(obj, dict): + if isinstance(obj['text'], list): text_punc = [] - for utterance in text['text']: + for utterance in obj['text']: data = utterance.split(':') result = requests.post(self.url, data=data[1].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) if result.status_code != 200: raise ValueError(result.text) text_punc.append(data[0]+": "+result.text) - text['text'] = text_punc + obj['text-punc'] = text_punc else: - result = requests.post(self.url, data=text['text'].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) - text['text'] = result.text - return text + result = requests.post(self.url, data=obj['text'].strip().encode('utf-8'), headers={'content-type': 'application/octet-stream'}) + obj['text-punc'] = result.text + return obj else: result = requests.post(self.url, data=text.encode('utf-8'), headers={'content-type': 
'application/octet-stream'}) if result.status_code != 200: @@ -393,11 +351,11 @@ def get(self, text): return result.text else: - return text + return obj except Exception as e: self.log.error(str(e)) - return text + return obj except ValueError as error: self.log.error(str(error)) - return text + return obj From 2f48a88da82ba4aa8371670a5133af1e1dace9b2 Mon Sep 17 00:00:00 2001 From: Houpert Date: Wed, 16 Feb 2022 16:25:55 +0100 Subject: [PATCH 075/172] Update Dockerfile --- Dockerfile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Dockerfile b/Dockerfile index 8aae3ae..f7ec8bb 100644 --- a/Dockerfile +++ b/Dockerfile @@ -42,7 +42,7 @@ RUN git clone -b vosk --single-branch https://github.com/alphacep/kaldi /opt/kal fi \ && sed -i 's:-msse -msse2:-msse -msse2:g' kaldi.mk \ && sed -i 's: -O1 : -O3 :g' kaldi.mk \ - && make -j 32 online2 lm rnnlm + && make -j $(nproc) online2 lm rnnlm # Install python dependencies COPY requirements.txt ./ @@ -69,4 +69,4 @@ ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" HEALTHCHECK CMD ./healthcheck.sh -ENTRYPOINT ["./docker-entrypoint.sh"] \ No newline at end of file +ENTRYPOINT ["./docker-entrypoint.sh"] From f6fbe703fe237fe6dd6dd775cd7c188c726cd6d5 Mon Sep 17 00:00:00 2001 From: Rudy Baraglia Date: Wed, 9 Mar 2022 09:57:29 +0000 Subject: [PATCH 076/172] 3.3.0 Vosk rebase and streaming. See RELEASE.md --- .envdefault | 17 +++++ .gitignore | 4 +- Dockerfile | 6 +- README.md | 133 ++++++++++++++++++++++------------- RELEASE.md | 7 ++ celery_app/tasks.py | 10 +-- docker-entrypoint.sh | 30 +++++++- http_server/ingress.py | 20 +++++- lin_to_vosk.py | 86 ++++++++++++++++++++++ requirements.txt | 2 + stt/processing/__init__.py | 20 ++---- stt/processing/decoding.py | 18 ++--- stt/processing/model.py | 81 --------------------- stt/processing/streaming.py | 107 ++++++++++++++++++++++++++++ websocket/__init__.py | 0 websocket/websocketserver.py | 21 ++++++ 16 files changed, 398 insertions(+), 164 deletions(-) create mode 100644 .envdefault create mode 100755 lin_to_vosk.py delete mode 100644 stt/processing/model.py create mode 100644 stt/processing/streaming.py create mode 100644 websocket/__init__.py create mode 100644 websocket/websocketserver.py diff --git a/.envdefault b/.envdefault new file mode 100644 index 0000000..33a394c --- /dev/null +++ b/.envdefault @@ -0,0 +1,17 @@ +# SERVING PARAMETERS +SERVICE_MODE=http +MODEL_TYPE=lin + +# HTTP PARAMETERS +ENABLE_STREAMING=true + +# TASK PARAMETERS +SERVICE_NAME=stt +SERVICES_BROKER=redis://192.168.0.1:6379 +BROKER_PASS=password + +# WEBSOCKET PARAMETERS +STREAMING_PORT=80 + +# CONCURRENCY +CONCURRENCY=2 \ No newline at end of file diff --git a/.gitignore b/.gitignore index d2b976b..8ad21e4 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,3 @@ -start_container.sh \ No newline at end of file +start_container.sh +.env +test/* \ No newline at end of file diff --git a/Dockerfile b/Dockerfile index f7ec8bb..af7f731 100644 --- a/Dockerfile +++ b/Dockerfile @@ -49,9 +49,9 @@ COPY requirements.txt ./ RUN pip install --no-cache-dir -r requirements.txt # Install Custom Vosk API -RUN git clone --depth 1 https://github.com/linto-ai/linto-vosk-api.git /opt/vosk-api && cd /opt/vosk-api/python && \ +RUN git clone --depth 1 https://github.com/alphacep/vosk-api /opt/vosk-api && cd /opt/vosk-api/python && \ cd /opt/vosk-api/src \ - && KALDI_MKL=$KALDI_MKL KALDI_ROOT=/opt/kaldi make -j 32 \ + && KALDI_MKL=$KALDI_MKL KALDI_ROOT=/opt/kaldi make -j $(nproc) \ && cd /opt/vosk-api/python \ && python3 ./setup.py install 
@@ -60,8 +60,10 @@ WORKDIR /usr/src/app COPY stt /usr/src/app/stt COPY celery_app /usr/src/app/celery_app COPY http_server /usr/src/app/http_server +COPY websocket /usr/src/app/websocket COPY document /usr/src/app/document COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ +COPY lin_to_vosk.py /usr/src/app/lin_to_vosk.py RUN mkdir -p /var/log/supervisor/ diff --git a/README.md b/README.md index bf1b4a8..aa711f7 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,6 @@ # LINTO-PLATFORM-STT LinTO-platform-stt is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack). -The STT-worker is configured with an acoustic model and a language model to perform Speech-To-Text tasks with high efficiency. - LinTO-platform-stt can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. ## Pre-requisites @@ -10,66 +8,101 @@ LinTO-platform-stt can either be used as a standalone transcription service or d ### Hardware To run the transcription models you'll need: * At least 7Go of disk space to build the docker image. -* 500MB-3GB-7GB of RAM depending on the model used (small-medium-large). +* Up to 7GB of RAM depending on the model used. * One CPU per worker. Inference time scales on CPU performances. ### Model -The transcription service relies on 2 models: -* An acoustic model. -* A language model (or decoding graph). +LinTO-Platform-STT accepts two kinds of models: +* LinTO Acoustic and Languages models. +* Vosk models. -We provide some models on [dl.linto.ai](https://dl.linto.ai/downloads/model-distribution/). +We provide home-cured models (v2) on [dl.linto.ai](https://doc.linto.ai/#/services/linstt_download). +Or you can also use Vosk models available [here](https://alphacephei.com/vosk/models). ### Docker The transcription service requires docker up and running. ### (micro-service) Service broker and shared folder -The STT only entry point in job mode are tasks posted on a message broker. Supported message broker are RabbitMQ, Redis, Amazon SQS. -On addition, as to prevent large audio from transiting through the message broker, STT-Worker use a shared storage folder. +The STT only entry point in task mode are tasks posted on a message broker. Supported message broker are RabbitMQ, Redis, Amazon SQS. +On addition, as to prevent large audio from transiting through the message broker, STT-Worker use a shared storage folder (SHARED_FOLDER). ## Deploy linto-platform-stt -linto-platform-stt can be deployed three ways: -* As a standalone transcription service through an HTTP API. -* As a micro-service connected to a message broker. -**1- First step is to build the image:** +**1- First step is to build or pull the image:** ```bash git clone https://github.com/linto-ai/linto-platform-stt.git cd linto-platform-stt docker build . -t linto-platform-stt:latest ``` +or + +```bash +docker pull lintoai/linto-platform-stt +``` **2- Download the models** -Have the acoustic and language model ready at AM_PATH and LM_PATH. +Have the acoustic and language model ready at AM_PATH and LM_PATH if you are using LinTO models. If you are using a Vosk model, have it ready at MODEL. -### HTTP API +**3- Fill the .env** + +```bash +cp .envdefault .env +``` + +| PARAMETER | DESCRIPTION | EXEMPLE | +|---|---|---| +| SERVING_MODE | STT serving mode see [Serving mode](#serving-mode) | http\|task\|websocket | +| MODEL_TYPE | Type of STT model used. 
| lin\|vosk | +| ENABLE_STREAMING | Using http serving mode, enable the /streaming websocket route | true\|false | +| SERVICE_NAME | Using the task mode, set the queue's name for task processing | my-stt | +| SERVICE_BROKER | Using the task mode, URL of the message broker | redis://my-broker:6379 | +| BROKER_PASS | Using the task mode, broker password | my-password | +| STREAMING_PORT | Using the websocket mode, the listening port for ingoing WS connexions. | 80 | +| CONCURRENCY | Maximum number of parallel requests | >1 | + +### Serving mode +![Serving Modes](https://i.ibb.co/qrtv3Z6/platform-stt.png) + +STT can be use three ways: +* Through an [HTTP API](#http-server) using the **http**'s mode. +* Through a [message broker](#micro-service-within-linto-platform-stack) using the **task**'s mode. +* Through a [websocket server](#websocket-server) **websocket**'s mode. + +Mode is specified using the .env value or environment variable ```SERVING_MODE```. +```bash +SERVING_MODE=http +``` +### HTTP Server +The HTTP serving mode deploys a HTTP server and a swagger-ui to allow transcription request on a dedicated route. + +The SERVING_MODE value in the .env should be set to ```http```. ```bash docker run --rm \ -p HOST_SERVING_PORT:80 \ --v AM_PATH:/opt/models/AM \ --v LM_PATH:/opt/models/LM \ ---env SERVICE_NAME=stt \ ---env LANGUAGE=en_US \ ---env SERVICE_MODE=http \ ---env CONCURRENCY=10 \ +-v AM_PATH:/opt/AM \ +-v LM_PATH:/opt/LM \ +--env-file .env \ linto-platform-stt:latest ``` -This will run a container providing an http API binded on the host HOST_SERVING_PORT port. +This will run a container providing an [HTTP API](#http-api) binded on the host HOST_SERVING_PORT port. **Parameters:** | Variables | Description | Example | |:-|:-|:-| | HOST_SERVING_PORT | Host serving port | 80 | -| AM_PATH | Path to the acoustic model | /my/path/to/models/AM_fr-FR_v2.2.0 | -| LM_PATH | Path to the language model | /my/path/to/models/AM_fr-FR_v2.2.0 | -| LANGUAGE | Language code as a BCP-47 code | en-US, fr_FR, ... | -| CONCURRENCY | Number of worker (1 worker = 1 cpu) | 4 | +| AM_PATH | Path to the acoustic model on the host machine mounted to /opt/AM | /my/path/to/models/AM_fr-FR_v2.2.0 | +| LM_PATH | Path to the language model on the host machine mounted to /opt/LM | /my/path/to/models/fr-FR_big-v2.2.0 | +| MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model | ### Micro-service within LinTO-Platform stack +The HTTP serving mode connect a celery worker to a message broker. + +The SERVING_MODE value in the .env should be set to ```task```. + >LinTO-platform-stt can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for transcription task on a message broker. >LinTO-platform-stt in task mode is not intended to be launch manually. 
>However, if you intent to connect it to your custom message's broker here are the parameters: @@ -81,35 +114,29 @@ docker run --rm \ -v AM_PATH:/opt/models/AM \ -v LM_PATH:/opt/models/LM \ -v SHARED_AUDIO_FOLDER:/opt/audio \ ---env SERVICES_BROKER=MY_SERVICE_BROKER \ ---env BROKER_PASS=MY_BROKER_PASS \ ---env SERVICE_NAME=stt \ ---env LANGUAGE=en_US \ ---env SERVICE_MODE=task \ ---env CONCURRENCY=10 \ -linstt:dev +--env-file .env \ +linto-platform-stt:latest ``` **Parameters:** | Variables | Description | Example | |:-|:-|:-| -| AM_PATH | Path to the acoustic model | /my/path/to/models/AM_fr-FR_v2.2.0 | -| LM_PATH | Path to the language model | /my/path/to/models/AM_fr-FR_v2.2.0 | -| SERVICES_BROKER | Service broker uri | redis://my_redis_broker:6379 | -| BROKER_PASS | Service broker password (Leave empty if there is no password) | my_password | -| SERVICE_NAME* | Transcription service name | my_stt | -| LANGUAGE | Transcription language | en-US | -| CONCURRENCY | Number of worker (1 worker = 1 cpu) | [ 1 -> numberOfCPU] | +| AM_PATH | Path to the acoustic model on the host machine mounted to /opt/AM | /my/path/to/models/AM_fr-FR_v2.2.0 | +| LM_PATH | Path to the language model on the host machine mounted to /opt/LM | /my/path/to/models/fr-FR_big-v2.2.0 | +| MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model | +| SHARED_AUDIO_FOLDER | Shared audio folder mounted to /opt/audio | /my/path/to/models/vosk-model | -(* SERVICE NAME needs to be the same as the linto-platform-transcription-service if used.) +### Websocket Server +Websocket server's mode deploy a streaming transcription service only. -## Usages +The SERVING_MODE value in the .env should be set to ```websocket```. -### HTTP API +Usage is the same as the [http streaming API](#/streaming) +## Usages +### HTTP API #### /healthcheck - Returns the state of the API Method: GET @@ -117,12 +144,11 @@ Method: GET Returns "1" if healthcheck passes. #### /transcribe - Transcription API * Method: POST * Response content: text/plain or application/json -* File: An Wave f ile 16b 16Khz +* File: An Wave file 16b 16Khz Return the transcripted text using "text/plain" or a json object when using "application/json" structure as followed: ```json @@ -136,8 +162,20 @@ Return the transcripted text using "text/plain" or a json object when using "app } ``` +#### /streaming +The /streaming route is accessible if the ENABLE_STREAMING environment variable is set to true. + +The route accepts websocket connexions. Exchanges are structured as followed: +1. Client send a json {"config": {"sample_rate":16000}}. +2. Client send audio chunk (go to 3- ) or {"eof" : 1} (go to 5-). +3. Server send either a partial result {"partial" : "this is a "} or a final result {"text": "this is a transcription"}. +4. Back to 2- +5. Server send a final result and close the connexion. + +> Connexion will be closed and the worker will be freed if no chunk are received for 10s. + #### /docs -The /docs route offers a OpenAPI/swagger interface. +The /docs route offers a OpenAPI/swagger interface. ### Through the message broker @@ -169,6 +207,7 @@ On a successfull transcription the returned object is a json object structured a * The word field contains each word with their time stamp and individual confidence. (Empty if with_metadata=False) * The confidence field contains the overall confidence for the transcription. 
(0.0 if with_metadata=False) + ## Test ### Curl You can test you http API using curl: diff --git a/RELEASE.md b/RELEASE.md index e6f14a9..0a146d5 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,10 @@ +# 3.3.0 +- Added optional streaming route to the http serving mode +- Added serving mode: websocket +- Added Dynamic model conversion allowing to use either Vosk Models or Linagora AM/LM models +- Changer Vosk dependency to alphacep/vosk +- Updated README.md + # 3.2.1 - Repository total rework. The goal being to have a simple transcription service embeddable within a micro-service infrastructure. - Changed repository name from linto-platform-stt-standalone-worker to linto-platform-stt. diff --git a/celery_app/tasks.py b/celery_app/tasks.py index aaf5975..6921a0c 100644 --- a/celery_app/tasks.py +++ b/celery_app/tasks.py @@ -1,15 +1,17 @@ import os +import asyncio from stt import logger from stt.processing import model from celery_app.celeryapp import celery from stt.processing.utils import load_wave -from stt.processing.decoding import decode +from stt.processing import decode @celery.task(name="transcribe_task") def transcribe_task(file_name: str, with_metadata: bool): - """ transcribe_task do a synchronous call to the transcribe worker API """ + """ transcribe_task """ logger.info("Received transcription task for {}".format(file_name)) + # Load wave file_path = os.path.join("/opt/audio", file_name) try: @@ -25,6 +27,4 @@ def transcribe_task(file_name: str, with_metadata: bool): logger.error("Failed to decode: {}".format(e)) raise Exception("Failed to decode {}".format(file_path)) - return result - - \ No newline at end of file + return result \ No newline at end of file diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh index 16c25d2..212b145 100755 --- a/docker-entrypoint.sh +++ b/docker-entrypoint.sh @@ -3,10 +3,29 @@ set -ea echo "RUNNING STT" +# Check model +echo "Checking model format ..." +if [ -z "$MODEL_TYPE" ] +then + echo "Model type not specified, expecting Vosk Model" + export MODEL_TYPE=vosk +fi + +if [ "$MODEL_TYPE" = "vosk" ] +then + echo "Using Vosk format's model" + +elif [ "$MODEL_TYPE" = "lin" ] +then + echo "Processing model ... " + ./lin_to_vosk.py +else + echo "Unknown model type $MODEL_TYPE. Assuming vosk model" +fi # Launch parameters, environement variables and dependencies check if [ -z "$SERVICE_MODE" ] then - echo "ERROR: Must specify a serving mode: [ http | task ]" + echo "ERROR: Must specify a serving mode: [ http | task | websocket ]" exit -1 else if [ "$SERVICE_MODE" = "http" ] @@ -18,11 +37,16 @@ else if [[ -z "$SERVICES_BROKER" ]] then echo "ERROR: SERVICES_BROKER variable not specified, cannot start celery worker." 
- return -1 + exit -1 fi /usr/src/app/wait-for-it.sh $(echo $SERVICES_BROKER | cut -d'/' -f 3) --timeout=20 --strict -- echo " $SERVICES_BROKER (Service Broker) is up" echo "RUNNING STT CELERY WORKER" - celery --app=celery_app.celeryapp worker -Ofair -n ${SERVICE_NAME}_worker@%h --queues=${SERVICE_NAME} -c ${CONCURRENCY} + celery --app=celery_app.celeryapp worker -Ofair --queues=${SERVICE_NAME} -c ${CONCURRENCY} -n ${SERVICE_NAME}_worker@%h + + elif [ "$SERVICE_MODE" == "websocket" ] + then + echo "Running Websocket server on port ${STREAMING_PORT:=80}" + python websocket/websocketserver.py else echo "ERROR: Wrong serving command: $1" exit -1 diff --git a/http_server/ingress.py b/http_server/ingress.py index c43a353..69b6e47 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -6,19 +6,33 @@ import json from flask import Flask, request, abort, Response, json +from flask_sock import Sock from serving import GunicornServing from confparser import createParser from swagger import setupSwaggerUI -from stt.processing import model -from stt.processing import decode, formatAudio +from stt.processing import model, decode, formatAudio +from stt.processing.streaming import ws_streaming + app = Flask("__stt-standalone-worker__") +app.config["JSON_AS_ASCII"] = False +app.config["JSON_SORT_KEYS"] = False logging.basicConfig(format='%(asctime)s %(name)s %(levelname)s: %(message)s', datefmt='%d/%m/%Y %H:%M:%S') logger = logging.getLogger("__stt-standalone-worker__") +# If websocket streaming route is enabled +if os.environ.get('ENABLE_STREAMING', False) in [True, "true", 1]: + logger.info("Init websocket serving ...") + sock = Sock(app) + logger.info("Streaming is enabled") + + @sock.route('/streaming') + def streaming(ws): + ws_streaming(ws, model) + @app.route('/healthcheck', methods=['GET']) def healthcheck(): return json.dumps({"healthcheck": "OK"}), 200 @@ -98,7 +112,7 @@ def server_error(error): logger.warning("Could not setup swagger: {}".format(str(e))) serving = GunicornServing(app, {'bind': '{}:{}'.format("0.0.0.0", args.service_port), - 'workers': args.workers,}) + 'workers': args.workers, 'timeout': 3600}) logger.info(args) try: serving.run() diff --git a/lin_to_vosk.py b/lin_to_vosk.py new file mode 100755 index 0000000..1df8ee7 --- /dev/null +++ b/lin_to_vosk.py @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +import os +import re +import configparser + +#LANGUAGE_MODEL_PATH= "/home/rbaraglia/training_ground/STT/fr-FR_Big_v2.2.0" +#ACOUSTIC_MODEL_PATH= "/home/rbaraglia/training_ground/STT/AM_fr-FR_v2.2.0" +#TARGET_PATH= "/home/rbaraglia/training_ground/STT/generated_model" + +LANGUAGE_MODEL_PATH="/opt/LM" +ACOUSTIC_MODEL_PATH="/opt/AM" +TARGET_PATH="/opt/model" + +def lin_to_vosk_format(am_path: str, lm_path: str, target_path: str): + os.mkdir(target_path) + # Create directory structure + print("Create directory structure") + for subfolder in ["am", "conf", "graph", "ivector", "rescore"]: + os.mkdir(os.path.join(target_path, subfolder)) + + # Populate am directory + # final.mdl + print("Populate am directory") + for f in ["final.mdl"]: + print(f) + os.symlink(os.path.join(am_path, f), + os.path.join(target_path, "am", f)) + + # Populate conf directory + print("Populate conf directory") + print("mfcc.conf") + os.symlink(os.path.join(am_path, "conf", "mfcc.conf"), + os.path.join(target_path, "conf", "mfcc.conf")) + + print("model.conf") + with open(os.path.join(target_path, "conf", "model.conf"), 'w') as f: + f.write("--min-active=200\n") + f.write("--max-active=7000\n") + 
f.write("--beam=13.0\n") + f.write("--lattice-beam=6.0\n") + f.write("--acoustic-scale=1.0\n") + f.write("--frame-subsampling-factor=3\n") + f.write("--endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10\n") + f.write("--endpoint.rule2.min-trailing-silence=0.5\n") + f.write("--endpoint.rule3.min-trailing-silence=1.0\n") + f.write("--endpoint.rule4.min-trailing-silence=2.0\n") + + # Populate graph directory + print("Populate graph directory") + for f in ["HCLG.fst", "words.txt"]: + print(f) + os.symlink(os.path.join(lm_path, f), + os.path.join(target_path, "graph", f)) + + print("phones.txt") + os.symlink(os.path.join(am_path, "phones.txt"), + os.path.join(target_path, "graph", "phones.txt")) + + # Populate graph/phones directory + os.mkdir(os.path.join(target_path, "graph", "phones")) + + print("Populate graph/phones directory") + + print("word_boundary.int") + os.symlink(os.path.join(lm_path, "word_boundary.int"), + os.path.join(target_path, "graph", "phones", "word_boundary.int")) + + # Populate ivector directory + print("Populate graph/phones directory") + for f in ["final.dubm", "final.ie", "final.mat", "global_cmvn.stats", "online_cmvn.conf"]: + print(f) + os.symlink(os.path.join(am_path, "ivector_extractor", f), + os.path.join(target_path, "ivector", f)) + + print("splice.conf") + with open(os.path.join(am_path, "ivector_extractor", "splice_opts"), 'r') as in_f: + with open(os.path.join(target_path, "ivector", "splice.conf"), 'w') as out_f: + for param in in_f.read().split(" "): + out_f.write(f"{param}\n") + + # Populate rescore + # ? + +if __name__ == "__main__": + lin_to_vosk_format(ACOUSTIC_MODEL_PATH, LANGUAGE_MODEL_PATH, TARGET_PATH) + diff --git a/requirements.txt b/requirements.txt index cb9cd90..c863c6a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -3,7 +3,9 @@ numpy>=1.18.5 flask>=1.1.2 flask-cors>=3.0.10 flask-swagger-ui>=3.36.0 +flask-sock gunicorn pyyaml>=5.4.1 wavio>=0.0.4 requests>=2.26.0 +websockets \ No newline at end of file diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 701499d..d8c095e 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -1,29 +1,23 @@ import os from time import time +from vosk import Model + from stt import logger -from stt.processing.model import prepare, loadModel from stt.processing.decoding import decode from stt.processing.utils import load_wave, formatAudio +#from stt.processing.model import loadModel -# Model locations (should be mounted) -AM_PATH='/opt/models/AM' -LM_PATH='/opt/models/LM' -CONF_PATH='/opt/config' +__all__ = ["model", "logger", "decode", "load_wave", "formatAudio"] -# Prepare Model -logger.debug("Setting folders and configuration files") -try: - prepare(AM_PATH, LM_PATH, CONF_PATH) -except Exception as e: - logger.error("Could not prepare service: {}".format(str(e))) - exit(-1) +# Model locations (should be mounted) +MODEL_PATH="/opt/model" # Load ASR models (acoustic model and decoding graph) logger.info('Loading acoustic model and decoding graph ...') start = time() try: - model = loadModel(AM_PATH, LM_PATH, os.path.join(CONF_PATH, "online.conf")) + model = Model(MODEL_PATH) except Exception as e: raise Exception("Failed to load transcription model: {}".format(str(e))) exit(-1) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 50c6532..fec7bec 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -1,13 +1,14 @@ import json +import re from vosk import KaldiRecognizer, Model def decode(audio_data: bytes, model: Model, 
sampling_rate: int, with_metadata: bool) -> dict: ''' Transcribe the audio data using the vosk library with the defined model.''' - result = {'text':'', 'words':[], 'confidence-score': 0.0} + result = {'text':'', 'confidence-score': 0.0, 'words':[]} - recognizer = KaldiRecognizer(model, sampling_rate, False) - recognizer.SetMaxAlternatives(1) + recognizer = KaldiRecognizer(model, sampling_rate) + recognizer.SetMaxAlternatives(0) # Set confidence per words recognizer.SetWords(with_metadata) recognizer.AcceptWaveform(audio_data) @@ -20,10 +21,9 @@ def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: b except Exception: return result - result['text'] = decoder_result['text'].strip() - if 'words' in decoder_result: - result['words'] = decoder_result['words'] - if 'confidence' in decoder_result: - result['confidence-score'] = decoder_result['confidence'] - + result["text"] = re.sub(" " , "", decoder_result["text"]) + if "word" in decoder_result: + result["words"] = [w for w in decoder_result["result"] if w["word"] != ""] + if "confidence" in decoder_result: + result["confidence-score"] = sum([w["conf"] for w in words]) / len(words) return result \ No newline at end of file diff --git a/stt/processing/model.py b/stt/processing/model.py deleted file mode 100644 index 866037b..0000000 --- a/stt/processing/model.py +++ /dev/null @@ -1,81 +0,0 @@ -import os -import re -import configparser - -from vosk import Model - -def prepare(am_path:str, lm_path:str, config_path:str): - ''' Prepare folder and configuration files needed for the model usage ''' - - if not os.path.isdir(config_path): - os.mkdir(config_path) - - # load decoder parameters from "decode.cfg" - decoder_settings = configparser.ConfigParser() - if not os.path.exists(am_path+'/decode.cfg'): - raise FileNotFoundError("decode.cfg file is missing") - - decoder_settings.read(am_path+'/decode.cfg') - - # Prepare "online.conf" - with open(am_path+"/conf/online.conf") as f: - values = f.readlines() - with open(config_path+"/online.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--ivector-extraction-config=" + - config_path+"/ivector_extractor.conf\n") - f.write("--mfcc-config=" + os.path.join(am_path, "conf/mfcc.conf") + "\n") - f.write("--beam=" + decoder_settings.get('decoder_params', 'beam') + "\n") - f.write("--lattice-beam=" + decoder_settings.get('decoder_params', 'lattice_beam')+"\n") - f.write("--acoustic-scale=" + decoder_settings.get('decoder_params', 'acwt') + "\n") - f.write("--min-active=" + decoder_settings.get('decoder_params', 'min_active') + "\n") - f.write("--max-active=" + decoder_settings.get('decoder_params', 'max_active') + "\n") - f.write("--frame-subsampling-factor=" + decoder_settings.get('decoder_params', 'frame_subsampling_factor') + "\n") - - # Prepare "ivector_extractor.conf" - with open(am_path+"/conf/ivector_extractor.conf") as f: - values = f.readlines() - with open(config_path+"/ivector_extractor.conf", 'w') as f: - for i in values: - f.write(i) - f.write("--splice-config="+am_path+"/conf/splice.conf\n") - f.write("--cmvn-config="+am_path + - "/conf/online_cmvn.conf\n") - f.write("--lda-matrix="+am_path + - "/ivector_extractor/final.mat\n") - f.write("--global-cmvn-stats="+am_path + - "/ivector_extractor/global_cmvn.stats\n") - f.write("--diag-ubm="+am_path + - "/ivector_extractor/final.dubm\n") - f.write("--ivector-extractor="+am_path + - "/ivector_extractor/final.ie") - - # Prepare "word_boundary.int" if not exist - if not os.path.exists(lm_path+"/word_boundary.int") and 
os.path.exists(am_path+"/phones.txt"): - print("Create word_boundary.int based on phones.txt") - with open(am_path+"/phones.txt", 'r') as f: - phones = f.readlines() - - with open(lm_path+"/word_boundary.int", "w") as f: - for phone in phones: - phone = phone.strip() - phone = re.sub('^ .*', '', phone) - phone = re.sub('^#\d+ .*', '', phone) - if phone != '': - id = phone.split(' ')[1] - if '_I ' in phone: - f.write(id+" internal\n") - elif '_B ' in phone: - f.write(id+" begin\n") - elif '_E ' in phone: - f.write(id+" end\n") - elif '_S ' in phone: - f.write(id+" singleton\n") - else: - f.write(id+" nonword\n") - -def loadModel(am_path: str, lm_path: str, config_path: str) -> Model: - """ Load STT model """ - print("MODEL" , os.path.join(config_path, "online.conf")) - return Model(am_path,lm_path, config_path) diff --git a/stt/processing/streaming.py b/stt/processing/streaming.py new file mode 100644 index 0000000..5e36b8c --- /dev/null +++ b/stt/processing/streaming.py @@ -0,0 +1,107 @@ +import json +import re +from typing import Union + +from websockets.legacy.server import WebSocketServerProtocol +from simple_websocket.ws import Server as WSServer +from vosk import KaldiRecognizer, Model + +from stt import logger + +async def wssDecode(ws: WebSocketServerProtocol, model: Model): + """ Async Decode function endpoint """ + # Wait for config + res = await ws.recv() + + # Parse config + try: + config = json.loads(res)["config"] + sample_rate = config["sample_rate"] + except Exception as e : + logger.error("Failed to read stream configuration") + await ws.close(reason="Failed to load configuration") + + # Recognizer + try: + recognizer = KaldiRecognizer(model, sample_rate) + except Exception as e: + logger.error("Failed to load recognizer") + await ws.close(reason="Failed to load recognizer") + + # Wait for chunks + while True: + try: + # Client data + message = await ws.recv() + if message is None or message == "": # Timeout + ws.close() + except Exception as e: + print("Connection closed by client: {}".format(str(e))) + break + + # End frame + if "eof" in str(message): + ret = recognizer.FinalResult() + await ws.send(json.dumps(ret)) + await ws.close(reason="End of stream") + break + + # Audio chunk + if recognizer.AcceptWaveform(message): + ret = recognizer.Result() # Result seems to not work properly + await ws.send(ret) + + else: + ret = recognizer.PartialResult() + last_utterance = ret + await ws.send(ret) + +def ws_streaming(ws: WSServer, model: Model): + """ Sync Decode function endpoint""" + # Wait for config + res = ws.receive(timeout=10) + + # Timeout + if res is None: + pass + + # Parse config + try: + config = json.loads(res)["config"] + sample_rate = config["sample_rate"] + except Exception as e : + logger.error("Failed to read stream configuration") + ws.close() + + # Recognizer + try: + recognizer = KaldiRecognizer(model, sample_rate) + except Exception as e: + logger.error("Failed to load recognizer") + ws.close() + + # Wait for chunks + while True: + try: + # Client data + message = ws.receive(timeout=10) + if message is None: # Timeout + ws.close() + except Exception: + print("Connection closed by client") + break + # End frame + if "eof" in str(message): + ret = recognizer.FinalResult() + ws.send(json.dumps(re.sub(" ", "", ret))) + ws.close() + break + # Audio chunk + print("Received chunk") + if recognizer.AcceptWaveform(message): + ret = recognizer.Result() + ws.send(re.sub(" ", "", ret)) + + else: + ret = recognizer.PartialResult() + ws.send(re.sub(" ", "", ret)) \ No 
newline at end of file diff --git a/websocket/__init__.py b/websocket/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/websocket/websocketserver.py b/websocket/websocketserver.py new file mode 100644 index 0000000..eb3f9f2 --- /dev/null +++ b/websocket/websocketserver.py @@ -0,0 +1,21 @@ +import os +import asyncio + +import websockets + +from stt.processing import model +from stt.processing.streaming import wssDecode + +async def _fun_wrapper(ws): + """ Wrap wssDecode function to add STT Model reference """ + return await wssDecode(ws, model) + +async def WSServer(port: int): + """ Launch the websocket server """ + async with websockets.serve(_fun_wrapper, "0.0.0.0", serving_port): + await asyncio.Future() + +if __name__ == "__main__": + serving_port = os.environ.get("STREAMING_PORT", 80) + asyncio.run(WSServer(serving_port)) + \ No newline at end of file From 36dfd29f967dd300d6beff34180807e1729f2717 Mon Sep 17 00:00:00 2001 From: Rudy Baraglia Date: Wed, 9 Mar 2022 10:30:31 +0000 Subject: [PATCH 077/172] Fixed key --- stt/processing/decoding.py | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index fec7bec..a072741 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -20,10 +20,8 @@ def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: b decoder_result = json.loads(decoder_result_raw) except Exception: return result - result["text"] = re.sub(" " , "", decoder_result["text"]) - if "word" in decoder_result: + if "result" in decoder_result: result["words"] = [w for w in decoder_result["result"] if w["word"] != ""] - if "confidence" in decoder_result: - result["confidence-score"] = sum([w["conf"] for w in words]) / len(words) + result["confidence-score"] = sum([w["conf"] for w in result["words"]]) / len(result["words"]) return result \ No newline at end of file From 07032f9d81f2d785cecc67dd5875ff6083d9cfa5 Mon Sep 17 00:00:00 2001 From: Rudy Baraglia Date: Wed, 9 Mar 2022 10:39:11 +0000 Subject: [PATCH 078/172] Fixed division by zero --- stt/processing/decoding.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index a072741..3a7d33e 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -23,5 +23,6 @@ def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: b result["text"] = re.sub(" " , "", decoder_result["text"]) if "result" in decoder_result: result["words"] = [w for w in decoder_result["result"] if w["word"] != ""] - result["confidence-score"] = sum([w["conf"] for w in result["words"]]) / len(result["words"]) + if len(result["words"]): + result["confidence-score"] = sum([w["conf"] for w in result["words"]]) / len(result["words"]) return result \ No newline at end of file From 0cf497d14368615e85bda56c64f9c8c30ae2e333 Mon Sep 17 00:00:00 2001 From: Rudy Baraglia Date: Thu, 1 Sep 2022 13:35:05 +0000 Subject: [PATCH 079/172] Auto styling and linting --- .gitignore | 2 +- Makefile | 13 ++++++ celery_app/celeryapp.py | 22 ++++++---- celery_app/tasks.py | 12 +++--- http_server/confparser.py | 80 +++++++++++++++--------------------- http_server/ingress.py | 80 +++++++++++++++++++++--------------- http_server/serving.py | 11 +++-- http_server/swagger.py | 14 +++---- stt/__init__.py | 8 ++-- stt/processing/__init__.py | 11 ++--- stt/processing/decoding.py | 15 ++++--- stt/processing/streaming.py | 42 ++++++++++--------- 
stt/processing/utils.py | 9 ++-- websocket/websocketserver.py | 3 +- 14 files changed, 174 insertions(+), 148 deletions(-) create mode 100644 Makefile diff --git a/.gitignore b/.gitignore index c556a27..0b8d9ad 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,3 @@ start_container.sh -.env +.env* test/* diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..71be1a8 --- /dev/null +++ b/Makefile @@ -0,0 +1,13 @@ +.DEFAULT_GOAL := help + +target_dirs := stt http_server celery_app + +help: + @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}' + +style: ## update code style. + black -l 100 ${target_dirs} + isort ${target_dirs} + +lint: ## run pylint linter. + pylint ${target_dirs} diff --git a/celery_app/celeryapp.py b/celery_app/celeryapp.py index 5f1d96e..d4a5cb4 100644 --- a/celery_app/celeryapp.py +++ b/celery_app/celeryapp.py @@ -1,26 +1,30 @@ import os + from celery import Celery from stt import logger -celery = Celery(__name__, include=['celery_app.tasks']) +celery = Celery(__name__, include=["celery_app.tasks"]) service_name = os.environ.get("SERVICE_NAME") broker_url = os.environ.get("SERVICES_BROKER") if os.environ.get("BROKER_PASS", False): - components = broker_url.split('//') + components = broker_url.split("//") broker_url = f'{components[0]}//:{os.environ.get("BROKER_PASS")}@{components[1]}' celery.conf.broker_url = "{}/0".format(broker_url) celery.conf.result_backend = "{}/1".format(broker_url) -celery.conf.update( - result_expires=3600, - task_acks_late=True, - task_track_started = True) +celery.conf.update(result_expires=3600, task_acks_late=True, task_track_started=True) # Queues celery.conf.update( - {'task_routes': { - 'transcribe_task' : {'queue': service_name},} + { + "task_routes": { + "transcribe_task": {"queue": service_name}, + } } ) -logger.info("Celery configured for broker located at {} with service name {}".format(broker_url, service_name)) \ No newline at end of file +logger.info( + "Celery configured for broker located at {} with service name {}".format( + broker_url, service_name + ) +) diff --git a/celery_app/tasks.py b/celery_app/tasks.py index 6921a0c..f2a2b08 100644 --- a/celery_app/tasks.py +++ b/celery_app/tasks.py @@ -1,15 +1,15 @@ -import os import asyncio +import os -from stt import logger -from stt.processing import model from celery_app.celeryapp import celery +from stt import logger +from stt.processing import decode, model from stt.processing.utils import load_wave -from stt.processing import decode + @celery.task(name="transcribe_task") def transcribe_task(file_name: str, with_metadata: bool): - """ transcribe_task """ + """transcribe_task""" logger.info("Received transcription task for {}".format(file_name)) # Load wave @@ -27,4 +27,4 @@ def transcribe_task(file_name: str, with_metadata: bool): logger.error("Failed to decode: {}".format(e)) raise Exception("Failed to decode {}".format(file_path)) - return result \ No newline at end of file + return result diff --git a/http_server/confparser.py b/http_server/confparser.py index 4c2171d..f676e1a 100644 --- a/http_server/confparser.py +++ b/http_server/confparser.py @@ -1,68 +1,52 @@ -import os import argparse +import os __all__ = ["createParser"] + def createParser() -> argparse.ArgumentParser: parser = argparse.ArgumentParser() - + # SERVICE parser.add_argument( - '--service_name', + "--service_name", type=str, - help='Service Name', - default=os.environ.get('SERVICE_NAME', 'stt')) + help="Service 
Name", + default=os.environ.get("SERVICE_NAME", "stt"), + ) # MODELS + parser.add_argument("--am_path", type=str, help="Acoustic Model Path", default="/opt/models/AM") + parser.add_argument("--lm_path", type=str, help="Decoding graph path", default="/opt/models/LM") parser.add_argument( - '--am_path', - type=str, - help='Acoustic Model Path', - default='/opt/models/AM') - parser.add_argument( - '--lm_path', - type=str, - help='Decoding graph path', - default='/opt/models/LM') - parser.add_argument( - '--config_path', - type=str, - help='Configuration files path', - default='/opt/config') - - #GUNICORN - parser.add_argument( - '--service_port', - type=int, - help='Service port', - default=80) + "--config_path", type=str, help="Configuration files path", default="/opt/config" + ) + + # GUNICORN + parser.add_argument("--service_port", type=int, help="Service port", default=80) parser.add_argument( - '--workers', + "--workers", type=int, help="Number of Gunicorn workers (default=CONCURRENCY + 1)", - default=int(os.environ.get('CONCURRENCY', 1)) + 1) - - #SWAGGER - parser.add_argument( - '--swagger_url', - type=str, - help='Swagger interface url', - default='/docs') + default=int(os.environ.get("CONCURRENCY", 1)) + 1, + ) + + # SWAGGER + parser.add_argument("--swagger_url", type=str, help="Swagger interface url", default="/docs") parser.add_argument( - '--swagger_prefix', + "--swagger_prefix", type=str, - help='Swagger prefix', - default=os.environ.get('SWAGGER_PREFIX', '')) + help="Swagger prefix", + default=os.environ.get("SWAGGER_PREFIX", ""), + ) parser.add_argument( - '--swagger_path', + "--swagger_path", type=str, - help='Swagger file path', - default=os.environ.get('SWAGGER_PATH', '/usr/src/app/document/swagger.yml')) - - #MISC - parser.add_argument( - '--debug', - action='store_true', - help='Display debug logs') + help="Swagger file path", + default=os.environ.get("SWAGGER_PATH", "/usr/src/app/document/swagger.yml"), + ) + + # MISC + parser.add_argument("--debug", action="store_true", help="Display debug logs") - return parser \ No newline at end of file + return parser diff --git a/http_server/ingress.py b/http_server/ingress.py index 69b6e47..ffe21ce 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -1,78 +1,81 @@ #!/usr/bin/env python3 +import json +import logging import os from time import time -import logging -import json -from flask import Flask, request, abort, Response, json +from confparser import createParser +from flask import Flask, Response, abort, json, request from flask_sock import Sock - from serving import GunicornServing -from confparser import createParser from swagger import setupSwaggerUI -from stt.processing import model, decode, formatAudio +from stt.processing import decode, formatAudio, model from stt.processing.streaming import ws_streaming - app = Flask("__stt-standalone-worker__") app.config["JSON_AS_ASCII"] = False app.config["JSON_SORT_KEYS"] = False -logging.basicConfig(format='%(asctime)s %(name)s %(levelname)s: %(message)s', datefmt='%d/%m/%Y %H:%M:%S') +logging.basicConfig( + format="%(asctime)s %(name)s %(levelname)s: %(message)s", datefmt="%d/%m/%Y %H:%M:%S" +) logger = logging.getLogger("__stt-standalone-worker__") # If websocket streaming route is enabled -if os.environ.get('ENABLE_STREAMING', False) in [True, "true", 1]: +if os.environ.get("ENABLE_STREAMING", False) in [True, "true", 1]: logger.info("Init websocket serving ...") sock = Sock(app) logger.info("Streaming is enabled") - @sock.route('/streaming') + 
@sock.route("/streaming") def streaming(ws): ws_streaming(ws, model) -@app.route('/healthcheck', methods=['GET']) + +@app.route("/healthcheck", methods=["GET"]) def healthcheck(): return json.dumps({"healthcheck": "OK"}), 200 -@app.route("/oas_docs", methods=['GET']) + +@app.route("/oas_docs", methods=["GET"]) def oas_docs(): return "Not Implemented", 501 -@app.route('/transcribe', methods=['POST']) + +@app.route("/transcribe", methods=["POST"]) def transcribe(): try: - logger.info('Transcribe request received') + logger.info("Transcribe request received") # get response content type - logger.debug(request.headers.get('accept').lower()) - if request.headers.get('accept').lower() == 'application/json': + logger.debug(request.headers.get("accept").lower()) + if request.headers.get("accept").lower() == "application/json": join_metadata = True - elif request.headers.get('accept').lower() == 'text/plain': + elif request.headers.get("accept").lower() == "text/plain": join_metadata = False else: - raise ValueError('Not accepted header') + raise ValueError("Not accepted header") logger.debug("Metadata: {}".format(join_metadata)) # get input file - if 'file' in request.files.keys(): - file_buffer = request.files['file'].read() + if "file" in request.files.keys(): + file_buffer = request.files["file"].read() audio_data, sampling_rate = formatAudio(file_buffer) start_t = time() - + # Transcription transcription = decode(audio_data, model, sampling_rate, join_metadata) logger.debug("Transcription complete (t={}s)".format(time() - start_t)) - + logger.debug("... Complete") - + else: - raise ValueError('No audio file was uploaded') + raise ValueError("No audio file was uploaded") if join_metadata: - return json.dumps(transcription,ensure_ascii=False) , 200 + return json.dumps(transcription, ensure_ascii=False), 200 else: return transcription["text"], 200 return response, 200 @@ -81,23 +84,26 @@ def transcribe(): return str(error), 400 except Exception as e: logger.error(e) - return 'Server Error: {}'.format(str(e)), 500 + return "Server Error: {}".format(str(e)), 500 + -# Rejected request handlers @app.errorhandler(405) def method_not_allowed(error): - return 'The method is not allowed for the requested URL', 405 + return "The method is not allowed for the requested URL", 405 + @app.errorhandler(404) def page_not_found(error): - return 'The requested URL was not found', 404 + return "The requested URL was not found", 404 + @app.errorhandler(500) def server_error(error): logger.error(error) - return 'Server Error', 500 + return "Server Error", 500 -if __name__ == '__main__': + +if __name__ == "__main__": logger.info("Startup...") parser = createParser() @@ -110,9 +116,15 @@ def server_error(error): logger.debug("Swagger UI set.") except Exception as e: logger.warning("Could not setup swagger: {}".format(str(e))) - - serving = GunicornServing(app, {'bind': '{}:{}'.format("0.0.0.0", args.service_port), - 'workers': args.workers, 'timeout': 3600}) + + serving = GunicornServing( + app, + { + "bind": "{}:{}".format("0.0.0.0", args.service_port), + "workers": args.workers, + "timeout": 3600, + }, + ) logger.info(args) try: serving.run() diff --git a/http_server/serving.py b/http_server/serving.py index 076f34d..d2dd7e8 100644 --- a/http_server/serving.py +++ b/http_server/serving.py @@ -1,17 +1,20 @@ import gunicorn.app.base -class GunicornServing(gunicorn.app.base.BaseApplication): +class GunicornServing(gunicorn.app.base.BaseApplication): def __init__(self, app, options=None): self.options = options or {} 
self.application = app super().__init__() def load_config(self): - config = {key: value for key, value in self.options.items() - if key in self.cfg.settings and value is not None} + config = { + key: value + for key, value in self.options.items() + if key in self.cfg.settings and value is not None + } for key, value in config.items(): self.cfg.set(key.lower(), value) def load(self): - return self.application \ No newline at end of file + return self.application diff --git a/http_server/swagger.py b/http_server/swagger.py index c0af319..fe58685 100644 --- a/http_server/swagger.py +++ b/http_server/swagger.py @@ -1,17 +1,17 @@ import yaml from flask_swagger_ui import get_swaggerui_blueprint + def setupSwaggerUI(app, args): - '''Setup Swagger UI within the app''' - swagger_yml = yaml.load( - open(args.swagger_path, 'r'), Loader=yaml.Loader) + """Setup Swagger UI within the app""" + swagger_yml = yaml.load(open(args.swagger_path, "r"), Loader=yaml.Loader) swaggerui = get_swaggerui_blueprint( # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' args.swagger_prefix + args.swagger_url, args.swagger_path, config={ # Swagger UI config overrides - 'app_name': "LinTO Platform STT", - 'spec': swagger_yml - } + "app_name": "LinTO Platform STT", + "spec": swagger_yml, + }, ) - app.register_blueprint(swaggerui, url_prefix=args.swagger_url) \ No newline at end of file + app.register_blueprint(swaggerui, url_prefix=args.swagger_url) diff --git a/stt/__init__.py b/stt/__init__.py index 8e6dc75..a624077 100644 --- a/stt/__init__.py +++ b/stt/__init__.py @@ -1,5 +1,7 @@ -import os import logging +import os -logging.basicConfig(format='%(asctime)s %(name)s %(levelname)s: %(message)s', datefmt='%d/%m/%Y %H:%M:%S') -logger = logging.getLogger("__stt__") \ No newline at end of file +logging.basicConfig( + format="%(asctime)s %(name)s %(levelname)s: %(message)s", datefmt="%d/%m/%Y %H:%M:%S" +) +logger = logging.getLogger("__stt__") diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index d8c095e..d1a29db 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -5,20 +5,21 @@ from stt import logger from stt.processing.decoding import decode -from stt.processing.utils import load_wave, formatAudio -#from stt.processing.model import loadModel +from stt.processing.utils import formatAudio, load_wave + +# from stt.processing.model import loadModel __all__ = ["model", "logger", "decode", "load_wave", "formatAudio"] # Model locations (should be mounted) -MODEL_PATH="/opt/model" +MODEL_PATH = "/opt/model" # Load ASR models (acoustic model and decoding graph) -logger.info('Loading acoustic model and decoding graph ...') +logger.info("Loading acoustic model and decoding graph ...") start = time() try: model = Model(MODEL_PATH) except Exception as e: raise Exception("Failed to load transcription model: {}".format(str(e))) exit(-1) -logger.info('Acoustic model and decoding graph loaded. (t={}s)'.format(time() - start)) +logger.info("Acoustic model and decoding graph loaded. 
(t={}s)".format(time() - start)) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 3a7d33e..9908ba0 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -3,12 +3,13 @@ from vosk import KaldiRecognizer, Model + def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: bool) -> dict: - ''' Transcribe the audio data using the vosk library with the defined model.''' - result = {'text':'', 'confidence-score': 0.0, 'words':[]} + """Transcribe the audio data using the vosk library with the defined model.""" + result = {"text": "", "confidence-score": 0.0, "words": []} recognizer = KaldiRecognizer(model, sampling_rate) - recognizer.SetMaxAlternatives(0) # Set confidence per words + recognizer.SetMaxAlternatives(0) # Set confidence per words recognizer.SetWords(with_metadata) recognizer.AcceptWaveform(audio_data) @@ -20,9 +21,11 @@ def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: b decoder_result = json.loads(decoder_result_raw) except Exception: return result - result["text"] = re.sub(" " , "", decoder_result["text"]) + result["text"] = re.sub(" ", "", decoder_result["text"]) if "result" in decoder_result: result["words"] = [w for w in decoder_result["result"] if w["word"] != ""] if len(result["words"]): - result["confidence-score"] = sum([w["conf"] for w in result["words"]]) / len(result["words"]) - return result \ No newline at end of file + result["confidence-score"] = sum([w["conf"] for w in result["words"]]) / len( + result["words"] + ) + return result diff --git a/stt/processing/streaming.py b/stt/processing/streaming.py index 5e36b8c..36f9eca 100644 --- a/stt/processing/streaming.py +++ b/stt/processing/streaming.py @@ -2,43 +2,44 @@ import re from typing import Union -from websockets.legacy.server import WebSocketServerProtocol from simple_websocket.ws import Server as WSServer from vosk import KaldiRecognizer, Model +from websockets.legacy.server import WebSocketServerProtocol + +from stt import logger -from stt import logger async def wssDecode(ws: WebSocketServerProtocol, model: Model): - """ Async Decode function endpoint """ + """Async Decode function endpoint""" # Wait for config res = await ws.recv() - + # Parse config try: config = json.loads(res)["config"] sample_rate = config["sample_rate"] - except Exception as e : + except Exception as e: logger.error("Failed to read stream configuration") await ws.close(reason="Failed to load configuration") - + # Recognizer - try: + try: recognizer = KaldiRecognizer(model, sample_rate) except Exception as e: logger.error("Failed to load recognizer") await ws.close(reason="Failed to load recognizer") - + # Wait for chunks - while True: + while True: try: # Client data message = await ws.recv() - if message is None or message == "": # Timeout + if message is None or message == "": # Timeout ws.close() except Exception as e: print("Connection closed by client: {}".format(str(e))) break - + # End frame if "eof" in str(message): ret = recognizer.FinalResult() @@ -48,16 +49,17 @@ async def wssDecode(ws: WebSocketServerProtocol, model: Model): # Audio chunk if recognizer.AcceptWaveform(message): - ret = recognizer.Result() # Result seems to not work properly + ret = recognizer.Result() # Result seems to not work properly await ws.send(ret) - + else: ret = recognizer.PartialResult() last_utterance = ret await ws.send(ret) + def ws_streaming(ws: WSServer, model: Model): - """ Sync Decode function endpoint""" + """Sync Decode function endpoint""" # Wait 
for config res = ws.receive(timeout=10) @@ -69,7 +71,7 @@ def ws_streaming(ws: WSServer, model: Model): try: config = json.loads(res)["config"] sample_rate = config["sample_rate"] - except Exception as e : + except Exception as e: logger.error("Failed to read stream configuration") ws.close() @@ -81,11 +83,11 @@ def ws_streaming(ws: WSServer, model: Model): ws.close() # Wait for chunks - while True: + while True: try: # Client data message = ws.receive(timeout=10) - if message is None: # Timeout + if message is None: # Timeout ws.close() except Exception: print("Connection closed by client") @@ -95,13 +97,13 @@ def ws_streaming(ws: WSServer, model: Model): ret = recognizer.FinalResult() ws.send(json.dumps(re.sub(" ", "", ret))) ws.close() - break + break # Audio chunk print("Received chunk") if recognizer.AcceptWaveform(message): ret = recognizer.Result() ws.send(re.sub(" ", "", ret)) - + else: ret = recognizer.PartialResult() - ws.send(re.sub(" ", "", ret)) \ No newline at end of file + ws.send(re.sub(" ", "", ret)) diff --git a/stt/processing/utils.py b/stt/processing/utils.py index 016716d..642f427 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -1,16 +1,17 @@ import io import wavio -from numpy import squeeze, int16 +from numpy import int16, squeeze + def load_wave(file_path): - ''' Formats audio from a wavFile buffer to a bytebuffer''' + """Formats audio from a wavFile buffer to a bytebuffer""" audio = squeeze(wavio.read(file_path).data) return audio.tobytes() def formatAudio(file_buffer): - ''' Formats audio from a wavFile buffer to a numpy array for processing.''' + """Formats audio from a wavFile buffer to a numpy array for processing.""" file_buffer_io = io.BytesIO(file_buffer) file_content = wavio.read(file_buffer_io) # if stereo file, convert to mono by computing the mean over the channels @@ -21,4 +22,4 @@ def formatAudio(file_buffer): data = mean(data, axis=1, dtype=int16) return data.tobytes(), file_content.rate else: - raise Exception("Audio Format not supported.") \ No newline at end of file + raise Exception("Audio Format not supported.") diff --git a/websocket/websocketserver.py b/websocket/websocketserver.py index eb3f9f2..9f1f683 100644 --- a/websocket/websocketserver.py +++ b/websocket/websocketserver.py @@ -6,10 +6,12 @@ from stt.processing import model from stt.processing.streaming import wssDecode + async def _fun_wrapper(ws): """ Wrap wssDecode function to add STT Model reference """ return await wssDecode(ws, model) + async def WSServer(port: int): """ Launch the websocket server """ async with websockets.serve(_fun_wrapper, "0.0.0.0", serving_port): @@ -18,4 +20,3 @@ async def WSServer(port: int): if __name__ == "__main__": serving_port = os.environ.get("STREAMING_PORT", 80) asyncio.run(WSServer(serving_port)) - \ No newline at end of file From 4cf9be25e4a0150fd288a6162e87a5744318aee6 Mon Sep 17 00:00:00 2001 From: HOUPERT Date: Fri, 2 Sep 2022 14:15:33 +0200 Subject: [PATCH 080/172] Add github action for dockerhub description --- .github/workflows/dockerhub-description.yml | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) create mode 100644 .github/workflows/dockerhub-description.yml diff --git a/.github/workflows/dockerhub-description.yml b/.github/workflows/dockerhub-description.yml new file mode 100644 index 0000000..0367b21 --- /dev/null +++ b/.github/workflows/dockerhub-description.yml @@ -0,0 +1,20 @@ +name: Update Docker Hub Description +on: + push: + branches: + - master + paths: + - README.md + - 
.github/workflows/dockerhub-description.yml +jobs: + dockerHubDescription: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: Docker Hub Description + uses: peter-evans/dockerhub-description@v3 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_PASSWORD }} + repository: lintoai/linto-platform-stt + readme-filepath: ./README.md From 8d8346d26c7f418780b15218dab13b9be8ca120d Mon Sep 17 00:00:00 2001 From: Rudy Baraglia Date: Mon, 12 Sep 2022 09:23:45 +0000 Subject: [PATCH 081/172] 3.3.1: Fixes and style --- README.md | 4 +- RELEASE.md | 5 +++ celery_app/celeryapp.py | 8 ++-- celery_app/tasks.py | 14 +++---- http_server/confparser.py | 5 ++- http_server/ingress.py | 33 ++++++++-------- http_server/swagger.py | 3 +- lin_to_vosk.py | 74 ++++++++++++++++++++++-------------- stt/__init__.py | 3 +- stt/processing/__init__.py | 9 ++--- stt/processing/decoding.py | 6 +-- stt/processing/streaming.py | 24 ++++++------ stt/processing/utils.py | 5 +-- websocket/websocketserver.py | 7 ++-- 14 files changed, 112 insertions(+), 88 deletions(-) diff --git a/README.md b/README.md index aa711f7..7fd1fa0 100644 --- a/README.md +++ b/README.md @@ -111,8 +111,8 @@ You need a message broker up and running at MY_SERVICE_BROKER. ```bash docker run --rm \ --v AM_PATH:/opt/models/AM \ --v LM_PATH:/opt/models/LM \ +-v AM_PATH:/opt/AM \ +-v LM_PATH:/opt/LM \ -v SHARED_AUDIO_FOLDER:/opt/audio \ --env-file .env \ linto-platform-stt:latest diff --git a/RELEASE.md b/RELEASE.md index 0a146d5..2626e10 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,8 @@ +# 3.3.1 +- Fixed lin_to_vosk throwing an error on a already existing container. +- Corrected an error on the README regarding mounting model volumes. +- Code styling (PEP 8) + # 3.3.0 - Added optional streaming route to the http serving mode - Added serving mode: websocket diff --git a/celery_app/celeryapp.py b/celery_app/celeryapp.py index d4a5cb4..e04d73b 100644 --- a/celery_app/celeryapp.py +++ b/celery_app/celeryapp.py @@ -10,8 +10,8 @@ if os.environ.get("BROKER_PASS", False): components = broker_url.split("//") broker_url = f'{components[0]}//:{os.environ.get("BROKER_PASS")}@{components[1]}' -celery.conf.broker_url = "{}/0".format(broker_url) -celery.conf.result_backend = "{}/1".format(broker_url) +celery.conf.broker_url = f"{broker_url}/0" +celery.conf.result_backend = f"{broker_url}/1" celery.conf.update(result_expires=3600, task_acks_late=True, task_track_started=True) # Queues @@ -24,7 +24,5 @@ ) logger.info( - "Celery configured for broker located at {} with service name {}".format( - broker_url, service_name - ) + f"Celery configured for broker located at {broker_url} with service name {service_name}" ) diff --git a/celery_app/tasks.py b/celery_app/tasks.py index f2a2b08..ce2ca4d 100644 --- a/celery_app/tasks.py +++ b/celery_app/tasks.py @@ -10,21 +10,21 @@ @celery.task(name="transcribe_task") def transcribe_task(file_name: str, with_metadata: bool): """transcribe_task""" - logger.info("Received transcription task for {}".format(file_name)) + logger.info(f"Received transcription task for {file_name}") # Load wave file_path = os.path.join("/opt/audio", file_name) try: file_content = load_wave(file_path) - except Exception as e: - logger.error("Failed to load ressource: {}".format(e)) - raise Exception("Could not open ressource {}".format(file_path)) + except Exception as err: + logger.error(f"Failed to load ressource: {repr(err)}") + raise Exception(f"Could not open ressource {file_path}") from err # 
Decode try: result = decode(file_content, model, 16000, with_metadata) - except Exception as e: - logger.error("Failed to decode: {}".format(e)) - raise Exception("Failed to decode {}".format(file_path)) + except Exception as err: + logger.error(f"Failed to decode: {repr(err)}") + raise Exception(f"Failed to decode {file_path}") from err return result diff --git a/http_server/confparser.py b/http_server/confparser.py index f676e1a..2396d71 100644 --- a/http_server/confparser.py +++ b/http_server/confparser.py @@ -19,7 +19,10 @@ def createParser() -> argparse.ArgumentParser: parser.add_argument("--am_path", type=str, help="Acoustic Model Path", default="/opt/models/AM") parser.add_argument("--lm_path", type=str, help="Decoding graph path", default="/opt/models/LM") parser.add_argument( - "--config_path", type=str, help="Configuration files path", default="/opt/config" + "--config_path", + type=str, + help="Configuration files path", + default="/opt/config", ) # GUNICORN diff --git a/http_server/ingress.py b/http_server/ingress.py index ffe21ce..5a9c661 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -19,7 +19,8 @@ app.config["JSON_SORT_KEYS"] = False logging.basicConfig( - format="%(asctime)s %(name)s %(levelname)s: %(message)s", datefmt="%d/%m/%Y %H:%M:%S" + format="%(asctime)s %(name)s %(levelname)s: %(message)s", + datefmt="%d/%m/%Y %H:%M:%S", ) logger = logging.getLogger("__stt-standalone-worker__") @@ -30,8 +31,8 @@ logger.info("Streaming is enabled") @sock.route("/streaming") - def streaming(ws): - ws_streaming(ws, model) + def streaming(web_socket): + ws_streaming(web_socket, model) @app.route("/healthcheck", methods=["GET"]) @@ -76,24 +77,22 @@ def transcribe(): if join_metadata: return json.dumps(transcription, ensure_ascii=False), 200 - else: - return transcription["text"], 200 - return response, 200 + return transcription["text"], 200 except ValueError as error: return str(error), 400 - except Exception as e: - logger.error(e) - return "Server Error: {}".format(str(e)), 500 + except Exception as error: + logger.error(error) + return "Server Error: {}".format(str(error)), 500 @app.errorhandler(405) -def method_not_allowed(error): +def method_not_allowed(_): return "The method is not allowed for the requested URL", 405 @app.errorhandler(404) -def page_not_found(error): +def page_not_found(_): return "The requested URL was not found", 404 @@ -114,13 +113,13 @@ def server_error(error): if args.swagger_path is not None: setupSwaggerUI(app, args) logger.debug("Swagger UI set.") - except Exception as e: - logger.warning("Could not setup swagger: {}".format(str(e))) + except Exception as err: + logger.warning("Could not setup swagger: {}".format(str(err))) serving = GunicornServing( app, { - "bind": "{}:{}".format("0.0.0.0", args.service_port), + "bind": f"0.0.0.0:{args.service_port}", "workers": args.workers, "timeout": 3600, }, @@ -130,7 +129,7 @@ def server_error(error): serving.run() except KeyboardInterrupt: logger.info("Process interrupted by user") - except Exception as e: - logger.error(str(e)) + except Exception as err: + logger.error(str(err)) logger.critical("Service is shut down (Error)") - exit(e) + exit(err) diff --git a/http_server/swagger.py b/http_server/swagger.py index fe58685..a9b93d0 100644 --- a/http_server/swagger.py +++ b/http_server/swagger.py @@ -4,7 +4,8 @@ def setupSwaggerUI(app, args): """Setup Swagger UI within the app""" - swagger_yml = yaml.load(open(args.swagger_path, "r"), Loader=yaml.Loader) + with open(args.swagger_path, "r") as 
yml_file: + swagger_yml = yaml.load(yml_file, Loader=yaml.Loader) swaggerui = get_swaggerui_blueprint( # Swagger UI static files will be mapped to '{SWAGGER_URL}/dist/' args.swagger_prefix + args.swagger_url, diff --git a/lin_to_vosk.py b/lin_to_vosk.py index 9c8d513..62025a0 100755 --- a/lin_to_vosk.py +++ b/lin_to_vosk.py @@ -1,35 +1,42 @@ #!/usr/bin/env python3 +import configparser import os import re -import configparser -LANGUAGE_MODEL_PATH="/opt/LM" -ACOUSTIC_MODEL_PATH="/opt/AM" -TARGET_PATH="/opt/model" +LANGUAGE_MODEL_PATH = "/opt/LM" +ACOUSTIC_MODEL_PATH = "/opt/AM" +TARGET_PATH = "/opt/model" + def lin_to_vosk_format(am_path: str, lm_path: str, target_path: str): + if os.path.exists(target_path): + print( + "Target model folder already exist, assuming model has already been converted. Skipping..." + ) + return os.mkdir(target_path) # Create directory structure print("Create directory structure") for subfolder in ["am", "conf", "graph", "ivector", "rescore"]: os.mkdir(os.path.join(target_path, subfolder)) - + # Populate am directory # final.mdl print("Populate am directory") for f in ["final.mdl"]: print(f) - os.symlink(os.path.join(am_path, f), - os.path.join(target_path, "am", f)) + os.symlink(os.path.join(am_path, f), os.path.join(target_path, "am", f)) # Populate conf directory print("Populate conf directory") print("mfcc.conf") - os.symlink(os.path.join(am_path, "conf", "mfcc.conf"), - os.path.join(target_path, "conf", "mfcc.conf")) - + os.symlink( + os.path.join(am_path, "conf", "mfcc.conf"), + os.path.join(target_path, "conf", "mfcc.conf"), + ) + print("model.conf") - with open(os.path.join(target_path, "conf", "model.conf"), 'w') as f: + with open(os.path.join(target_path, "conf", "model.conf"), "w") as f: f.write("--min-active=200\n") f.write("--max-active=7000\n") f.write("--beam=13.0\n") @@ -45,38 +52,49 @@ def lin_to_vosk_format(am_path: str, lm_path: str, target_path: str): print("Populate graph directory") for f in ["HCLG.fst", "words.txt"]: print(f) - os.symlink(os.path.join(lm_path, f), - os.path.join(target_path, "graph", f)) + os.symlink(os.path.join(lm_path, f), os.path.join(target_path, "graph", f)) print("phones.txt") - os.symlink(os.path.join(am_path, "phones.txt"), - os.path.join(target_path, "graph", "phones.txt")) - + os.symlink( + os.path.join(am_path, "phones.txt"), + os.path.join(target_path, "graph", "phones.txt"), + ) + # Populate graph/phones directory os.mkdir(os.path.join(target_path, "graph", "phones")) - + print("Populate graph/phones directory") - + print("word_boundary.int") - os.symlink(os.path.join(lm_path, "word_boundary.int"), - os.path.join(target_path, "graph", "phones", "word_boundary.int")) - + os.symlink( + os.path.join(lm_path, "word_boundary.int"), + os.path.join(target_path, "graph", "phones", "word_boundary.int"), + ) + # Populate ivector directory print("Populate graph/phones directory") - for f in ["final.dubm", "final.ie", "final.mat", "global_cmvn.stats", "online_cmvn.conf"]: + for f in [ + "final.dubm", + "final.ie", + "final.mat", + "global_cmvn.stats", + "online_cmvn.conf", + ]: print(f) - os.symlink(os.path.join(am_path, "ivector_extractor", f), - os.path.join(target_path, "ivector", f)) - + os.symlink( + os.path.join(am_path, "ivector_extractor", f), + os.path.join(target_path, "ivector", f), + ) + print("splice.conf") - with open(os.path.join(am_path, "ivector_extractor", "splice_opts"), 'r') as in_f: - with open(os.path.join(target_path, "ivector", "splice.conf"), 'w') as out_f: + with open(os.path.join(am_path, 
"ivector_extractor", "splice_opts"), "r") as in_f: + with open(os.path.join(target_path, "ivector", "splice.conf"), "w") as out_f: for param in in_f.read().split(" "): out_f.write(f"{param}\n") # Populate rescore # ? + if __name__ == "__main__": lin_to_vosk_format(ACOUSTIC_MODEL_PATH, LANGUAGE_MODEL_PATH, TARGET_PATH) - diff --git a/stt/__init__.py b/stt/__init__.py index a624077..73c3a1a 100644 --- a/stt/__init__.py +++ b/stt/__init__.py @@ -2,6 +2,7 @@ import os logging.basicConfig( - format="%(asctime)s %(name)s %(levelname)s: %(message)s", datefmt="%d/%m/%Y %H:%M:%S" + format="%(asctime)s %(name)s %(levelname)s: %(message)s", + datefmt="%d/%m/%Y %H:%M:%S", ) logger = logging.getLogger("__stt__") diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index d1a29db..2a3eca5 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -1,4 +1,5 @@ import os +import sys from time import time from vosk import Model @@ -7,8 +8,6 @@ from stt.processing.decoding import decode from stt.processing.utils import formatAudio, load_wave -# from stt.processing.model import loadModel - __all__ = ["model", "logger", "decode", "load_wave", "formatAudio"] # Model locations (should be mounted) @@ -19,7 +18,7 @@ start = time() try: model = Model(MODEL_PATH) -except Exception as e: - raise Exception("Failed to load transcription model: {}".format(str(e))) - exit(-1) +except Exception as err: + raise Exception("Failed to load transcription model: {}".format(str(err))) from err + sys.exit(-1) logger.info("Acoustic model and decoding graph loaded. (t={}s)".format(time() - start)) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 9908ba0..2e1fb7c 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -15,8 +15,8 @@ def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: b recognizer.AcceptWaveform(audio_data) try: decoder_result_raw = recognizer.FinalResult() - except Exception as e: - raise Exception("Failed to decode") + except Exception as err: + raise Exception("Failed to decode") from err try: decoder_result = json.loads(decoder_result_raw) except Exception: @@ -24,7 +24,7 @@ def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: b result["text"] = re.sub(" ", "", decoder_result["text"]) if "result" in decoder_result: result["words"] = [w for w in decoder_result["result"] if w["word"] != ""] - if len(result["words"]): + if result["words"]: result["confidence-score"] = sum([w["conf"] for w in result["words"]]) / len( result["words"] ) diff --git a/stt/processing/streaming.py b/stt/processing/streaming.py index 36f9eca..28274b8 100644 --- a/stt/processing/streaming.py +++ b/stt/processing/streaming.py @@ -58,10 +58,10 @@ async def wssDecode(ws: WebSocketServerProtocol, model: Model): await ws.send(ret) -def ws_streaming(ws: WSServer, model: Model): +def ws_streaming(websocket_server: WSServer, model: Model): """Sync Decode function endpoint""" # Wait for config - res = ws.receive(timeout=10) + res = websocket_server.receive(timeout=10) # Timeout if res is None: @@ -71,39 +71,39 @@ def ws_streaming(ws: WSServer, model: Model): try: config = json.loads(res)["config"] sample_rate = config["sample_rate"] - except Exception as e: + except Exception: logger.error("Failed to read stream configuration") - ws.close() + websocket_server.close() # Recognizer try: recognizer = KaldiRecognizer(model, sample_rate) - except Exception as e: + except Exception: logger.error("Failed to load recognizer") - 
ws.close() + websocket_server.close() # Wait for chunks while True: try: # Client data - message = ws.receive(timeout=10) + message = websocket_server.receive(timeout=10) if message is None: # Timeout - ws.close() + websocket_server.close() except Exception: print("Connection closed by client") break # End frame if "eof" in str(message): ret = recognizer.FinalResult() - ws.send(json.dumps(re.sub(" ", "", ret))) - ws.close() + websocket_server.send(json.dumps(re.sub(" ", "", ret))) + websocket_server.close() break # Audio chunk print("Received chunk") if recognizer.AcceptWaveform(message): ret = recognizer.Result() - ws.send(re.sub(" ", "", ret)) + websocket_server.send(re.sub(" ", "", ret)) else: ret = recognizer.PartialResult() - ws.send(re.sub(" ", "", ret)) + websocket_server.send(re.sub(" ", "", ret)) diff --git a/stt/processing/utils.py b/stt/processing/utils.py index 642f427..d003fc8 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -1,7 +1,7 @@ import io import wavio -from numpy import int16, squeeze +from numpy import int16, squeeze, mean def load_wave(file_path): @@ -21,5 +21,4 @@ def formatAudio(file_buffer): elif file_content.data.shape[1] == 2: data = mean(data, axis=1, dtype=int16) return data.tobytes(), file_content.rate - else: - raise Exception("Audio Format not supported.") + raise Exception("Audio Format not supported.") diff --git a/websocket/websocketserver.py b/websocket/websocketserver.py index 9f1f683..81e035b 100644 --- a/websocket/websocketserver.py +++ b/websocket/websocketserver.py @@ -1,5 +1,5 @@ -import os import asyncio +import os import websockets @@ -8,15 +8,16 @@ async def _fun_wrapper(ws): - """ Wrap wssDecode function to add STT Model reference """ + """Wrap wssDecode function to add STT Model reference""" return await wssDecode(ws, model) async def WSServer(port: int): - """ Launch the websocket server """ + """Launch the websocket server""" async with websockets.serve(_fun_wrapper, "0.0.0.0", serving_port): await asyncio.Future() + if __name__ == "__main__": serving_port = os.environ.get("STREAMING_PORT", 80) asyncio.run(WSServer(serving_port)) From 7394acc0b5f407620d68fc20ba77787fbd82509f Mon Sep 17 00:00:00 2001 From: rbaraglia Date: Mon, 3 Oct 2022 14:34:41 +0200 Subject: [PATCH 082/172] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index aa711f7..33c6a34 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ cp .envdefault .env ### Serving mode ![Serving Modes](https://i.ibb.co/qrtv3Z6/platform-stt.png) -STT can be use three ways: +STT can be used three ways: * Through an [HTTP API](#http-server) using the **http**'s mode. * Through a [message broker](#micro-service-within-linto-platform-stack) using the **task**'s mode. * Through a [websocket server](#websocket-server) **websocket**'s mode. From 45c59cf831e8e194cb2bcb30f9369b8d5c60d813 Mon Sep 17 00:00:00 2001 From: Rudy Baraglia Date: Thu, 27 Oct 2022 08:47:34 +0000 Subject: [PATCH 083/172] Update README.md --- README.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 7fd1fa0..f76955c 100644 --- a/README.md +++ b/README.md @@ -53,7 +53,7 @@ cp .envdefault .env | PARAMETER | DESCRIPTION | EXEMPLE | |---|---|---| -| SERVING_MODE | STT serving mode see [Serving mode](#serving-mode) | http\|task\|websocket | +| SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http\|task\|websocket | | MODEL_TYPE | Type of STT model used. 
| lin\|vosk | | ENABLE_STREAMING | Using http serving mode, enable the /streaming websocket route | true\|false | | SERVICE_NAME | Using the task mode, set the queue's name for task processing | my-stt | @@ -72,12 +72,12 @@ STT can be use three ways: Mode is specified using the .env value or environment variable ```SERVING_MODE```. ```bash -SERVING_MODE=http +SERVICE_MODE=http ``` ### HTTP Server The HTTP serving mode deploys a HTTP server and a swagger-ui to allow transcription request on a dedicated route. -The SERVING_MODE value in the .env should be set to ```http```. +The SERVICE_MODE value in the .env should be set to ```http```. ```bash docker run --rm \ @@ -101,7 +101,7 @@ This will run a container providing an [HTTP API](#http-api) binded on the host ### Micro-service within LinTO-Platform stack The HTTP serving mode connect a celery worker to a message broker. -The SERVING_MODE value in the .env should be set to ```task```. +The SERVICE_MODE value in the .env should be set to ```task```. >LinTO-platform-stt can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for transcription task on a message broker. >LinTO-platform-stt in task mode is not intended to be launch manually. @@ -130,7 +130,7 @@ linto-platform-stt:latest ### Websocket Server Websocket server's mode deploy a streaming transcription service only. -The SERVING_MODE value in the .env should be set to ```websocket```. +The SERVICE_MODE value in the .env should be set to ```websocket```. Usage is the same as the [http streaming API](#/streaming) From 9abd1744362947adcfc5f6e98ccc8e1021ac9448 Mon Sep 17 00:00:00 2001 From: Rudy Baraglia Date: Thu, 27 Oct 2022 09:32:26 +0000 Subject: [PATCH 084/172] Fix broken models link --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f76955c..06e2ef0 100644 --- a/README.md +++ b/README.md @@ -16,7 +16,7 @@ LinTO-Platform-STT accepts two kinds of models: * LinTO Acoustic and Languages models. * Vosk models. -We provide home-cured models (v2) on [dl.linto.ai](https://doc.linto.ai/#/services/linstt_download). +We provide home-cured models (v2) on [dl.linto.ai](https://doc.linto.ai/docs/developpers/apis/ASR/models). Or you can also use Vosk models available [here](https://alphacephei.com/vosk/models). ### Docker From 88999116f95a0cfd4f4408f788a31aa970566b45 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 21 Dec 2022 10:41:55 +0100 Subject: [PATCH 085/172] Fix stereo to mono conversion --- RELEASE.md | 3 +++ stt/processing/utils.py | 4 +++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/RELEASE.md b/RELEASE.md index 2626e10..9966250 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,6 @@ +# 3.3.2 +- Fixed use of stereo audio in http serving mode + # 3.3.1 - Fixed lin_to_vosk throwing an error on a already existing container. - Corrected an error on the README regarding mounting model volumes. 
diff --git a/stt/processing/utils.py b/stt/processing/utils.py index d003fc8..b81cc5d 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -19,6 +19,8 @@ def formatAudio(file_buffer): if file_content.data.shape[1] == 1: data = squeeze(file_content.data) elif file_content.data.shape[1] == 2: - data = mean(data, axis=1, dtype=int16) + data = mean(file_content.data, axis=1, dtype=int16) + else: + raise Exception("Audio Format not supported.") return data.tobytes(), file_content.rate raise Exception("Audio Format not supported.") From d929283f51b4a101ead26e9de462463dbc9eec71 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 22 Dec 2022 13:15:07 +0100 Subject: [PATCH 086/172] First version of STT platform with OpenAI Whisper (and SpeechBrain for word alignment) --- .envdefault | 9 +- Dockerfile | 44 +--- Jenkinsfile | 21 ++ README.md | 96 ++++----- RELEASE.md | 3 + celery_app/tasks.py | 8 +- docker-entrypoint.sh | 19 +- http_server/ingress.py | 17 +- lin_to_vosk.py | 100 --------- load_alignment_model.py | 79 +++++++ requirements.txt | 8 +- stt/processing/__init__.py | 46 +++-- stt/processing/alignment_model.py | 66 ++++++ stt/processing/decoding.py | 331 +++++++++++++++++++++++++++--- stt/processing/load_model.py | 62 ++++++ stt/processing/streaming.py | 109 ---------- stt/processing/utils.py | 48 +++-- stt/processing/word_alignment.py | 169 +++++++++++++++ websocket/websocketserver.py | 23 --- 19 files changed, 844 insertions(+), 414 deletions(-) delete mode 100755 lin_to_vosk.py create mode 100644 load_alignment_model.py create mode 100644 stt/processing/alignment_model.py create mode 100644 stt/processing/load_model.py delete mode 100644 stt/processing/streaming.py create mode 100644 stt/processing/word_alignment.py delete mode 100644 websocket/websocketserver.py diff --git a/.envdefault b/.envdefault index 33a394c..61a57bd 100644 --- a/.envdefault +++ b/.envdefault @@ -1,17 +1,12 @@ # SERVING PARAMETERS SERVICE_MODE=http -MODEL_TYPE=lin - -# HTTP PARAMETERS -ENABLE_STREAMING=true +MODEL_TYPE=/opt/model.pt +LANGUAGE=fr # TASK PARAMETERS SERVICE_NAME=stt SERVICES_BROKER=redis://192.168.0.1:6379 BROKER_PASS=password -# WEBSOCKET PARAMETERS -STREAMING_PORT=80 - # CONCURRENCY CONCURRENCY=2 \ No newline at end of file diff --git a/Dockerfile b/Dockerfile index bdf65c0..4761b3d 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,5 +1,5 @@ FROM python:3.9 -LABEL maintainer="irebai@linagora.com, rbaraglia@linagora.com" +LABEL maintainer="jlouradour@linagora.com" ARG KALDI_MKL @@ -11,6 +11,7 @@ RUN apt-get update && \ unzip \ xz-utils \ sox \ + ffmpeg \ g++ \ make \ cmake \ @@ -20,40 +21,20 @@ RUN apt-get update && \ autoconf \ libtool \ pkg-config \ - ca-certificates \ - && rm -rf /var/lib/apt/lists/* + ca-certificates -# Build vosk-kaldi -RUN git clone -b vosk --single-branch https://github.com/alphacep/kaldi /opt/kaldi \ - && cd /opt/kaldi/tools \ - && sed -i 's:status=0:exit 0:g' extras/check_dependencies.sh \ - && sed -i 's:--enable-ngram-fsts:--enable-ngram-fsts --disable-bin:g' Makefile \ - && make -j $(nproc) openfst cub \ - && if [ "x$KALDI_MKL" != "x1" ] ; then \ - extras/install_openblas_clapack.sh; \ - else \ - extras/install_mkl.sh; \ - fi \ - && cd /opt/kaldi/src \ - && if [ "x$KALDI_MKL" != "x1" ] ; then \ - ./configure --mathlib=OPENBLAS_CLAPACK --shared; \ - else \ - ./configure --mathlib=MKL --shared; \ - fi \ - && sed -i 's:-msse -msse2:-msse -msse2:g' kaldi.mk \ - && sed -i 's: -O1 : -O3 :g' kaldi.mk \ - && make -j $(nproc) online2 lm rnnlm +RUN rm -rf 
/var/lib/apt/lists/* # Install python dependencies COPY requirements.txt ./ -RUN pip install --no-cache-dir -r requirements.txt +RUN pip install --force-reinstall --no-cache-dir -r requirements.txt -# Install Custom Vosk API -RUN git clone --depth 1 https://github.com/alphacep/vosk-api /opt/vosk-api && cd /opt/vosk-api/python && \ - cd /opt/vosk-api/src \ - && KALDI_MKL=$KALDI_MKL KALDI_ROOT=/opt/kaldi make -j $(nproc) \ - && cd /opt/vosk-api/python \ - && python3 ./setup.py install +# Download alignment model +COPY load_alignment_model.py ./ +RUN python3 load_alignment_model.py + +# Cleaning +RUN rm requirements.txt load_alignment_model.py WORKDIR /usr/src/app @@ -63,9 +44,6 @@ COPY http_server /usr/src/app/http_server COPY websocket /usr/src/app/websocket COPY document /usr/src/app/document COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ -COPY lin_to_vosk.py /usr/src/app/lin_to_vosk.py - -RUN mkdir -p /var/log/supervisor/ ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" diff --git a/Jenkinsfile b/Jenkinsfile index 95e42b0..572c1c5 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -47,5 +47,26 @@ pipeline { } } } + + // stage('Docker build for whisper branch'){ + // when{ + // branch 'feature/whisper' + // } + // steps { + // echo 'Publishing whisper' + // script { + // image = docker.build(env.DOCKER_HUB_REPO) + // VERSION = sh( + // returnStdout: true, + // script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" + // ).trim() + + // docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { + // image.push("${VERSION}") + // image.push('whisper') + // } + // } + // } + // } }// end stages } \ No newline at end of file diff --git a/README.md b/README.md index ec70060..50f03a8 100644 --- a/README.md +++ b/README.md @@ -7,17 +7,26 @@ LinTO-platform-stt can either be used as a standalone transcription service or d ### Hardware To run the transcription models you'll need: -* At least 7Go of disk space to build the docker image. +* At least 8Go of disk space to build the docker image. * Up to 7GB of RAM depending on the model used. * One CPU per worker. Inference time scales on CPU performances. ### Model -LinTO-Platform-STT accepts two kinds of models: -* LinTO Acoustic and Languages models. -* Vosk models. - -We provide home-cured models (v2) on [dl.linto.ai](https://doc.linto.ai/docs/developpers/apis/ASR/models). -Or you can also use Vosk models available [here](https://alphacephei.com/vosk/models). +LinTO-Platform-STT accepts one Whisper models in the PyTorch format. 
+ +You can download mutli-lingual models with the following links: +* tiny: "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt +* base: https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt +* small: https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt +* medium: https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt +* large-v1: https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt +* large-v2: https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt + +Models specialized for English can also be found: +* tiny.en: "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt +* base.en: https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt +* small.en: https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt +* medium.en: https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt ### Docker The transcription service requires docker up and running. @@ -39,11 +48,14 @@ or ```bash docker pull lintoai/linto-platform-stt -``` +``` with the following links **2- Download the models** -Have the acoustic and language model ready at AM_PATH and LM_PATH if you are using LinTO models. If you are using a Vosk model, have it ready at MODEL. +Have the Whisper model file ready at ASR_PATH. + +You can downloaded with the links mentioned above, if you don't have already a Whisper model. +If you already used Whisper in the past, you may have models in ~/.cache/whisper. **3- Fill the .env** @@ -54,12 +66,10 @@ cp .envdefault .env | PARAMETER | DESCRIPTION | EXEMPLE | |---|---|---| | SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http\|task\|websocket | -| MODEL_TYPE | Type of STT model used. | lin\|vosk | -| ENABLE_STREAMING | Using http serving mode, enable the /streaming websocket route | true\|false | +| MODEL_TYPE | Path to the model or type of model used. | ASR_PATH\|small\|medium\|large-v1\|... | | SERVICE_NAME | Using the task mode, set the queue's name for task processing | my-stt | | SERVICE_BROKER | Using the task mode, URL of the message broker | redis://my-broker:6379 | | BROKER_PASS | Using the task mode, broker password | my-password | -| STREAMING_PORT | Using the websocket mode, the listening port for ingoing WS connexions. | 80 | | CONCURRENCY | Maximum number of parallel requests | >1 | ### Serving mode @@ -82,8 +92,7 @@ The SERVICE_MODE value in the .env should be set to ```http```. 
```bash docker run --rm \ -p HOST_SERVING_PORT:80 \ --v AM_PATH:/opt/AM \ --v LM_PATH:/opt/LM \ +-v ASR_PATH:/opt/model.pt \ --env-file .env \ linto-platform-stt:latest ``` @@ -94,9 +103,7 @@ This will run a container providing an [HTTP API](#http-api) binded on the host | Variables | Description | Example | |:-|:-|:-| | HOST_SERVING_PORT | Host serving port | 80 | -| AM_PATH | Path to the acoustic model on the host machine mounted to /opt/AM | /my/path/to/models/AM_fr-FR_v2.2.0 | -| LM_PATH | Path to the language model on the host machine mounted to /opt/LM | /my/path/to/models/fr-FR_big-v2.2.0 | -| MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model | +| ASR_PATH | (Optional) Path to the Whisper model on the host machine to /opt/model.pt | /my/path/to/models/medium.pt | ### Micro-service within LinTO-Platform stack The HTTP serving mode connect a celery worker to a message broker. @@ -111,8 +118,7 @@ You need a message broker up and running at MY_SERVICE_BROKER. ```bash docker run --rm \ --v AM_PATH:/opt/AM \ --v LM_PATH:/opt/LM \ +-v ASR_PATH:/opt/model.pt \ -v SHARED_AUDIO_FOLDER:/opt/audio \ --env-file .env \ linto-platform-stt:latest @@ -121,19 +127,10 @@ linto-platform-stt:latest **Parameters:** | Variables | Description | Example | |:-|:-|:-| -| AM_PATH | Path to the acoustic model on the host machine mounted to /opt/AM | /my/path/to/models/AM_fr-FR_v2.2.0 | -| LM_PATH | Path to the language model on the host machine mounted to /opt/LM | /my/path/to/models/fr-FR_big-v2.2.0 | -| MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model | +| ASR_PATH | (Optional) Path to the Whisper model on the host machine to /opt/model.pt | /my/path/to/models/medium.pt | | SHARED_AUDIO_FOLDER | Shared audio folder mounted to /opt/audio | /my/path/to/models/vosk-model | -### Websocket Server -Websocket server's mode deploy a streaming transcription service only. - -The SERVICE_MODE value in the .env should be set to ```websocket```. - -Usage is the same as the [http streaming API](#/streaming) - ## Usages ### HTTP API #### /healthcheck @@ -153,27 +150,20 @@ Transcription API Return the transcripted text using "text/plain" or a json object when using "application/json" structure as followed: ```json { - "text" : "This is the transcription", - "words" : [ - {"word":"This", "start": 0.123, "end": 0.453, "conf": 0.9}, - ... - ] - "confidence-score": 0.879 + "text" : "This is the transcription as text", + "words": [ + { + "word" : "This", + "start": 0.0, + "end": 0.124, + "conf": 0.82341 + }, + ... + ], + "confidence-score": 0.879 } ``` -#### /streaming -The /streaming route is accessible if the ENABLE_STREAMING environment variable is set to true. - -The route accepts websocket connexions. Exchanges are structured as followed: -1. Client send a json {"config": {"sample_rate":16000}}. -2. Client send audio chunk (go to 3- ) or {"eof" : 1} (go to 5-). -3. Server send either a partial result {"partial" : "this is a "} or a final result {"text": "this is a transcription"}. -4. Back to 2- -5. Server send a final result and close the connexion. - -> Connexion will be closed and the worker will be freed if no chunk are received for 10s. - #### /docs The /docs route offers a OpenAPI/swagger interface. 
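For instance, assuming the service is bound to host port 8080 and `audio.wav` is a local WAV file (both are placeholders), a transcription request can be sent to the /transcribe route with curl:

```bash
curl -X POST "http://localhost:8080/transcribe" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav;type=audio/wav"
```

Using `accept: text/plain` instead returns the transcription as raw text, without word timestamps or confidence scores.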
@@ -189,17 +179,17 @@ STT-Worker accepts requests with the following arguments: On a successfull transcription the returned object is a json object structured as follow: ```json { - "text" : "this is the transcription as text", + "text" : "This is the transcription as text", "words": [ { - "word" : "this", + "word" : "This", "start": 0.0, "end": 0.124, - "conf": 1.0 + "conf": 0.82341 }, ... ], - "confidence-score": "" + "confidence-score": 0.879 } ``` @@ -220,5 +210,5 @@ This project is developped under the AGPLv3 License (see LICENSE). ## Acknowlegment. -* [Vosk, speech recognition toolkit](https://alphacephei.com/vosk/). -* [Kaldi Speech Recognition Toolkit](https://github.com/kaldi-asr/kaldi) +* [OpenAI Whisper](https://github.com/openai/whisper) +* [SpeechBrain](https://github.com/speechbrain/speechbrain). diff --git a/RELEASE.md b/RELEASE.md index 2626e10..a569376 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,6 @@ +# 4.0.0 +- Integration of Whisper + # 3.3.1 - Fixed lin_to_vosk throwing an error on a already existing container. - Corrected an error on the README regarding mounting model volumes. diff --git a/celery_app/tasks.py b/celery_app/tasks.py index ce2ca4d..3b7251f 100644 --- a/celery_app/tasks.py +++ b/celery_app/tasks.py @@ -3,8 +3,8 @@ from celery_app.celeryapp import celery from stt import logger -from stt.processing import decode, model -from stt.processing.utils import load_wave +from stt.processing import decode, model, alignment_model +from stt.processing.utils import load_audiofile @celery.task(name="transcribe_task") @@ -15,14 +15,14 @@ def transcribe_task(file_name: str, with_metadata: bool): # Load wave file_path = os.path.join("/opt/audio", file_name) try: - file_content = load_wave(file_path) + file_content = load_audiofile(file_path) except Exception as err: logger.error(f"Failed to load ressource: {repr(err)}") raise Exception(f"Could not open ressource {file_path}") from err # Decode try: - result = decode(file_content, model, 16000, with_metadata) + result = decode(file_content, model, alignment_model, with_metadata) except Exception as err: logger.error(f"Failed to decode: {repr(err)}") raise Exception(f"Failed to decode {file_path}") from err diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh index 212b145..4d67cca 100755 --- a/docker-entrypoint.sh +++ b/docker-entrypoint.sh @@ -7,21 +7,10 @@ echo "RUNNING STT" echo "Checking model format ..." if [ -z "$MODEL_TYPE" ] then - echo "Model type not specified, expecting Vosk Model" - export MODEL_TYPE=vosk + echo "Model type not specified, choosing Whisper medium model" + export MODEL_TYPE=medium fi -if [ "$MODEL_TYPE" = "vosk" ] -then - echo "Using Vosk format's model" - -elif [ "$MODEL_TYPE" = "lin" ] -then - echo "Processing model ... " - ./lin_to_vosk.py -else - echo "Unknown model type $MODEL_TYPE. 
Assuming vosk model" -fi # Launch parameters, environement variables and dependencies check if [ -z "$SERVICE_MODE" ] then @@ -43,10 +32,6 @@ else echo "RUNNING STT CELERY WORKER" celery --app=celery_app.celeryapp worker -Ofair --queues=${SERVICE_NAME} -c ${CONCURRENCY} -n ${SERVICE_NAME}_worker@%h - elif [ "$SERVICE_MODE" == "websocket" ] - then - echo "Running Websocket server on port ${STREAMING_PORT:=80}" - python websocket/websocketserver.py else echo "ERROR: Wrong serving command: $1" exit -1 diff --git a/http_server/ingress.py b/http_server/ingress.py index 5a9c661..6ccd090 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -11,8 +11,7 @@ from serving import GunicornServing from swagger import setupSwaggerUI -from stt.processing import decode, formatAudio, model -from stt.processing.streaming import ws_streaming +from stt.processing import decode, load_wave_buffer, model, alignment_model app = Flask("__stt-standalone-worker__") app.config["JSON_AS_ASCII"] = False @@ -24,16 +23,6 @@ ) logger = logging.getLogger("__stt-standalone-worker__") -# If websocket streaming route is enabled -if os.environ.get("ENABLE_STREAMING", False) in [True, "true", 1]: - logger.info("Init websocket serving ...") - sock = Sock(app) - logger.info("Streaming is enabled") - - @sock.route("/streaming") - def streaming(web_socket): - ws_streaming(web_socket, model) - @app.route("/healthcheck", methods=["GET"]) def healthcheck(): @@ -63,11 +52,11 @@ def transcribe(): # get input file if "file" in request.files.keys(): file_buffer = request.files["file"].read() - audio_data, sampling_rate = formatAudio(file_buffer) + audio_data = load_wave_buffer(file_buffer) start_t = time() # Transcription - transcription = decode(audio_data, model, sampling_rate, join_metadata) + transcription = decode(audio_data, model, alignment_model, join_metadata) logger.debug("Transcription complete (t={}s)".format(time() - start_t)) logger.debug("... Complete") diff --git a/lin_to_vosk.py b/lin_to_vosk.py deleted file mode 100755 index 62025a0..0000000 --- a/lin_to_vosk.py +++ /dev/null @@ -1,100 +0,0 @@ -#!/usr/bin/env python3 -import configparser -import os -import re - -LANGUAGE_MODEL_PATH = "/opt/LM" -ACOUSTIC_MODEL_PATH = "/opt/AM" -TARGET_PATH = "/opt/model" - - -def lin_to_vosk_format(am_path: str, lm_path: str, target_path: str): - if os.path.exists(target_path): - print( - "Target model folder already exist, assuming model has already been converted. Skipping..." 
- ) - return - os.mkdir(target_path) - # Create directory structure - print("Create directory structure") - for subfolder in ["am", "conf", "graph", "ivector", "rescore"]: - os.mkdir(os.path.join(target_path, subfolder)) - - # Populate am directory - # final.mdl - print("Populate am directory") - for f in ["final.mdl"]: - print(f) - os.symlink(os.path.join(am_path, f), os.path.join(target_path, "am", f)) - - # Populate conf directory - print("Populate conf directory") - print("mfcc.conf") - os.symlink( - os.path.join(am_path, "conf", "mfcc.conf"), - os.path.join(target_path, "conf", "mfcc.conf"), - ) - - print("model.conf") - with open(os.path.join(target_path, "conf", "model.conf"), "w") as f: - f.write("--min-active=200\n") - f.write("--max-active=7000\n") - f.write("--beam=13.0\n") - f.write("--lattice-beam=6.0\n") - f.write("--acoustic-scale=1.0\n") - f.write("--frame-subsampling-factor=3\n") - f.write("--endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10\n") - f.write("--endpoint.rule2.min-trailing-silence=0.5\n") - f.write("--endpoint.rule3.min-trailing-silence=1.0\n") - f.write("--endpoint.rule4.min-trailing-silence=2.0\n") - - # Populate graph directory - print("Populate graph directory") - for f in ["HCLG.fst", "words.txt"]: - print(f) - os.symlink(os.path.join(lm_path, f), os.path.join(target_path, "graph", f)) - - print("phones.txt") - os.symlink( - os.path.join(am_path, "phones.txt"), - os.path.join(target_path, "graph", "phones.txt"), - ) - - # Populate graph/phones directory - os.mkdir(os.path.join(target_path, "graph", "phones")) - - print("Populate graph/phones directory") - - print("word_boundary.int") - os.symlink( - os.path.join(lm_path, "word_boundary.int"), - os.path.join(target_path, "graph", "phones", "word_boundary.int"), - ) - - # Populate ivector directory - print("Populate graph/phones directory") - for f in [ - "final.dubm", - "final.ie", - "final.mat", - "global_cmvn.stats", - "online_cmvn.conf", - ]: - print(f) - os.symlink( - os.path.join(am_path, "ivector_extractor", f), - os.path.join(target_path, "ivector", f), - ) - - print("splice.conf") - with open(os.path.join(am_path, "ivector_extractor", "splice_opts"), "r") as in_f: - with open(os.path.join(target_path, "ivector", "splice.conf"), "w") as out_f: - for param in in_f.read().split(" "): - out_f.write(f"{param}\n") - - # Populate rescore - # ? 
- - -if __name__ == "__main__": - lin_to_vosk_format(ACOUSTIC_MODEL_PATH, LANGUAGE_MODEL_PATH, TARGET_PATH) diff --git a/load_alignment_model.py b/load_alignment_model.py new file mode 100644 index 0000000..0cf6087 --- /dev/null +++ b/load_alignment_model.py @@ -0,0 +1,79 @@ +import os +import urllib.request +import zipfile + +import huggingface_hub +import speechbrain as sb +import requests + + +def load_alignment_model(name, download_root = "/opt"): + if name.startswith("linSTT"): + destdir = os.path.join(download_root, name) + if not os.path.exists(destdir): + # Download model + url = f"https://dl.linto.ai/downloads/model-distribution/acoustic-models/fr-FR/{name}.zip" + destzip = destdir+".zip" + if not os.path.exists(destzip): + print("Downloading", url, "into", destdir) + os.makedirs(download_root, exist_ok=True) + urllib.request.urlretrieve(url, destzip) + with zipfile.ZipFile(destzip, 'r') as z: + os.makedirs(destdir, exist_ok=True) + z.extractall(destdir) + assert os.path.isdir(destdir) + os.remove(destzip) + else: + destdir = name + load_speechbrain_model(destdir, download_root = download_root) + +def load_speechbrain_model(source, device = None, download_root = "/opt"): + + if os.path.isdir(source): + yaml_file = os.path.join(source, "hyperparams.yaml") + assert os.path.isfile(yaml_file), f"Hyperparams file {yaml_file} not found" + else: + try: + yaml_file = huggingface_hub.hf_hub_download(repo_id=source, filename="hyperparams.yaml", cache_dir = os.path.join(download_root, "huggingface/hub")) + except requests.exceptions.HTTPError: + yaml_file = None + + overrides = make_yaml_overrides(yaml_file, {"save_path": os.path.join(download_root, "speechbrain")}) + savedir = os.path.join(download_root, "speechbrain") + try: + model = sb.pretrained.EncoderASR.from_hparams(source = source, savedir = savedir, overrides = overrides) + except ValueError: + model = sb.pretrained.EncoderDecoderASR.from_hparams(source = source, savedir = savedir, overrides = overrides) + return model + +def make_yaml_overrides(yaml_file, key_values): + """ + return a dictionary of overrides to be used with speechbrain + yaml_file: path to yaml file + key_values: dict of key values to override + """ + if yaml_file is None: return None + + override = {} + with open(yaml_file, "r") as f: + parent = None + for line in f: + if line.strip() == "": + parent = None + elif line == line.lstrip(): + if ":" in line: + parent = line.split(":")[0].strip() + if parent in key_values: + override[parent] = key_values[parent] + elif ":" in line: + child = line.strip().split(":")[0].strip() + if child in key_values: + override[parent] = override.get(parent, {}) | {child: key_values[child]} + return override + + +if __name__ == "__main__": + + import sys + assert len(sys.argv) in [1, 2], f"Usage: {sys.argv[0]} " + load_alignment_model(sys.argv[1] if len(sys.argv) > 1 else "linSTT_speechbrain_fr-FR_v1.0.0") diff --git a/requirements.txt b/requirements.txt index 132bdfc..a93dc9f 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,11 +1,13 @@ celery[redis,auth,msgpack]>=4.4.7 -numpy>=1.18.5 flask>=1.1.2 flask-cors>=3.0.10 -flask-swagger-ui>=3.36.0 flask-sock +flask-swagger-ui>=3.36.0 gunicorn +num2words pyyaml>=5.4.1 -wavio>=0.0.4 requests>=2.26.0 +speechbrain +wavio>=0.0.4 websockets +git+https://github.com/openai/whisper.git \ No newline at end of file diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 2a3eca5..a4d6182 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -1,24 
+1,46 @@ import os -import sys +import logging from time import time -from vosk import Model +import torch +import whisper from stt import logger -from stt.processing.decoding import decode -from stt.processing.utils import formatAudio, load_wave +from stt.processing.decoding import decode, get_default_language +from stt.processing.utils import load_wave_buffer, load_audiofile -__all__ = ["model", "logger", "decode", "load_wave", "formatAudio"] +from .load_model import load_whisper_model, load_speechbrain_model -# Model locations (should be mounted) -MODEL_PATH = "/opt/model" +__all__ = ["logger", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] -# Load ASR models (acoustic model and decoding graph) -logger.info("Loading acoustic model and decoding graph ...") +# Set logger level +logger.setLevel(logging.INFO) + +# Set device +device = os.environ.get("DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu") +try: + device = torch.device(device) +except Exception as err: + raise Exception("Failed to set device: {}".format(str(err))) from err + +# Check language +available_languages = [k for k,v in whisper.tokenizer.LANGUAGES.items()] + [None] +if get_default_language() not in available_languages: + raise RuntimeError(f"Langaue {get_default_language()} is not available. Available languages are: {available_languages}") + +# Load ASR model +model_type = os.environ.get("MODEL_TYPE", "medium") +logger.info(f"Loading Whisper model {model_type} ({'local' if os.path.isfile(model_type) else 'remote'})...") start = time() try: - model = Model(MODEL_PATH) + model = load_whisper_model(model_type, device = device) except Exception as err: raise Exception("Failed to load transcription model: {}".format(str(err))) from err - sys.exit(-1) -logger.info("Acoustic model and decoding graph loaded. (t={}s)".format(time() - start)) +logger.info("Model loaded. (t={}s)".format(time() - start)) + +# Load alignment model +alignment_model_type = os.environ.get("ALIGNMENT_MODEL_TYPE", "/opt/linSTT_speechbrain_fr-FR_v1.0.0") +logger.info(f"Loading alignment model...") +start = time() +alignment_model = load_speechbrain_model(alignment_model_type, device = device, download_root = "/opt") +logger.info("Alignment Model loaded. (t={}s)".format(time() - start)) diff --git a/stt/processing/alignment_model.py b/stt/processing/alignment_model.py new file mode 100644 index 0000000..f6d52c8 --- /dev/null +++ b/stt/processing/alignment_model.py @@ -0,0 +1,66 @@ +import math +import torch +import torch.nn.utils.rnn as rnn_utils + +from stt import logger + +def speechbrain_get_vocab(model): + tokenizer = model.tokenizer + labels = [{'':" ", ' ⁇ ':""}.get(i,i).lower() for i in tokenizer.decode([[i] for i in range(tokenizer.get_piece_size())])] + blank_id = labels.index("") + return labels, blank_id + + +# The following limit is to handle the corner Case of too long audio segment (which is better to split it to avoid memory overflow). +# But it is 2240400 / 16000 Hz ~ 140 seconds, which should not happen for segments detected by Whisper (usually one sentence). +# Also note that Whisper works with 30 seconds segment, so there is chance that this limit is never reached. 
+MAX_LEN = 2240400 + +def speechbrain_compute_log_probas(model, audios, max_len = MAX_LEN): + # Single audio + if not isinstance(audios, list): + audios = [audios] + log_probas = speechbrain_compute_log_probas(model, audios, max_len = max_len) + return log_probas[0] + + # Batch of audios (can occur when max_len is reached) + assert len(audios) > 0, "audios must be a non-empty list" + if not isinstance(audios[0], torch.Tensor): + audios = [torch.from_numpy(a) for a in audios] + if max([len(a) for a in audios]) > max_len: + # Split audios into chunks of max_len + batch_size = len(audios) + chunks = [] + i_audio = [] + for a in audios: + chunks.extend([a[i:min(i+max_len, len(a))] for i in range(0, len(a), max_len)]) + i_audio.append(len(chunks)) + if len(chunks) > 1: + logger.warning("Audio too long, splitting into {} chunks for alignment".format(len(chunks))) + # Decode chunks of audio and concatenate results + log_probas = [[] for i in range(len(audios))] + for i in range(0, len(chunks), batch_size): + chunk = chunks[i:min(i+batch_size, len(chunks))] + log_probas_tmp = speechbrain_compute_log_probas(model, chunk) + for j in range(i,i+len(chunk)): + k = 0 + while j >= i_audio[k]: + k += 1 + log_probas[k].append(log_probas_tmp[j-i]) + log_probas = [torch.cat(p, dim = 0) for p in log_probas] + log_probas, wav_lens = pack_sequences(log_probas, device = model.device) + else: + batch, wav_lens = pack_sequences(audios, device = model.device) + log_probas = model.forward(batch, wav_lens) + + log_probas = torch.log_softmax(log_probas, dim=-1) + return log_probas + +def pack_sequences(tensors, device = "cpu"): + if len(tensors) == 1: + return tensors[0].unsqueeze(0).to(device), torch.Tensor([1.]).to(device) + tensor = rnn_utils.pad_sequence(tensors, batch_first=True) + wav_lens = [len(x) for x in tensors] + maxwav_lens = max(wav_lens) + wav_lens = torch.Tensor([l/maxwav_lens for l in wav_lens]) + return tensor.to(device), wav_lens.to(device) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 2e1fb7c..7290af4 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -1,31 +1,316 @@ -import json +import os + +import whisper +from whisper.audio import SAMPLE_RATE + +import math +import numpy as np +import torch + import re +import string +from num2words import num2words + +from stt import logger +from .word_alignment import compute_alignment -from vosk import KaldiRecognizer, Model +# TODO: understand and remove this limitations +torch.set_num_threads(1) +def get_default_language(): + return os.environ.get("LANGUAGE", None) -def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: bool) -> dict: - """Transcribe the audio data using the vosk library with the defined model.""" +def decode(audio: torch.Tensor, + model: whisper.model.Whisper, + alignment_model: "Any", + with_word_timestamps: bool, + language: str = None, + beam_size: int = None, + no_speech_threshold: float = 0.6, + logprob_threshold: float = -1.0, + compression_ratio_threshold: float = 2.4, + normalize_text_as_words = False, + ) -> dict: + """Transcribe the audio data using Whisper with the defined model.""" result = {"text": "", "confidence-score": 0.0, "words": []} - recognizer = KaldiRecognizer(model, sampling_rate) - recognizer.SetMaxAlternatives(0) # Set confidence per words - recognizer.SetWords(with_metadata) + fp16 = model.device != torch.device("cpu") + + if language is None: + language = get_default_language() + + logger.info(f"Transcribing audio with language 
{language}...") + + whisper_res = model.transcribe(audio, + language = language, + fp16 = fp16, + temperature = 0.0, # For deterministic results + beam_size = beam_size, + no_speech_threshold = no_speech_threshold, + logprob_threshold = logprob_threshold, + compression_ratio_threshold = compression_ratio_threshold + ) + + text = whisper_res["text"].strip() + if normalize_text_as_words: + text = normalize_text(text, language) + text = remove_punctuation(text) + segments = whisper_res["segments"] + + result["text"] = text + result["confidence-score"] = np.exp(np.array([r["avg_logprob"] for r in segments])).mean() + if not with_word_timestamps: + if not normalize_text_as_words: + text = normalize_text(text, language) + text = remove_punctuation(text) + result["words"] = text.split() + else: + # Compute word timestamps + result["words"] = [] + max_t = audio.shape[0] + for segment in segments: + offset = segment["start"] + start = min(max_t, round(segment["start"] * SAMPLE_RATE)) + end = min(max_t, round(segment["end"] * SAMPLE_RATE)) + sub_audio = audio[start:end] + sub_text = segment["text"] + sub_text = normalize_text(sub_text, language) + sub_text = remove_punctuation(sub_text) + labels, emission, trellis, segments, word_segments = compute_alignment(sub_audio, sub_text, alignment_model) + ratio = len(sub_audio) / (trellis.size(0) * SAMPLE_RATE) + sub_words = sub_text.split() + assert len(sub_words) == len(word_segments), f"Unexpected number of words: {len(sub_words)} != {len(word_segments)}" + for word, segment in zip(sub_words, word_segments): + result["words"].append({ + "word": word, + "start": segment.start * ratio + offset, + "end": segment.end * ratio + offset, + "conf": segment.score, + }) - recognizer.AcceptWaveform(audio_data) - try: - decoder_result_raw = recognizer.FinalResult() - except Exception as err: - raise Exception("Failed to decode") from err - try: - decoder_result = json.loads(decoder_result_raw) - except Exception: - return result - result["text"] = re.sub(" ", "", decoder_result["text"]) - if "result" in decoder_result: - result["words"] = [w for w in decoder_result["result"] if w["word"] != ""] - if result["words"]: - result["confidence-score"] = sum([w["conf"] for w in result["words"]]) / len( - result["words"] - ) return result + + +custom_punctuations = string.punctuation.replace("'", "").replace("-", "") + +def remove_punctuation(text: str) -> str: + # Remove all punctuation except apostrophe + return text.translate(str.maketrans("", "", custom_punctuations)) + +_whitespace_re = re.compile(r'[^\S\r\n]+') + +def collapse_whitespace(text): + return re.sub(_whitespace_re, ' ', text).strip() + + +def normalize_text(text: str, lang: str) -> str: + """ Transform digits into characters... 
""" + + # Roman digits + if re.search(r"[IVX]", text): + if lang == "en": + digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(st|nd|rd|th)?\b", text) + digits = ["".join(d) for d in digits] + elif lang == "fr": + digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(ème|eme|e|er|ère)?\b", text) + digits = ["".join(d) for d in digits] + else: + digits = [] + if digits: + digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) + for s in digits: + filtered = re.sub("[a-z]", "", s) + ordinal = filtered != s + digit = romanToDecimal(filtered) + v = undigit(str(digit), lang=lang, to= "ordinal" if ordinal else "cardinal") + text = re.sub(r"\b" + s + r"\b", v, text) + + # Ordinal digits + if lang == "en": + digits = re.findall(r"\b\d*1(?:st)|\d*2(?:nd)|\d*3(?:rd)|\d+(?:th)\b", text) + elif lang == "fr": + digits = re.findall(r"\b1(?:ère|ere|er|re|r)|2(?:nd|nde)|\d+(?:ème|eme|e)\b", text) + else: + logger.warn(f"Language {lang} not supported for normalization. Some words might be mis-localized.") + digits = [] + if digits: + digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) + for digit in digits: + word = undigit(re.findall(r"\d+", digit)[0], to= "ordinal", lang = lang) + text = re.sub(r'\b'+str(digit)+r'\b', word, text) + + # Cardinal digits + digits = re.findall(r"(?:\-?\b[\d/]*\d+(?: \d\d\d)+\b)|(?:\-?\d[/\d]*)",text) + digits = list(map(lambda s: s.strip(r"[/ ]"), digits)) + digits = list(set(digits)) + digits = digits + flatten([c.split() for c in digits if " " in c]) + digits = digits + flatten([c.split("/") for c in digits if "/" in c]) + digits = sorted(digits, reverse=True, key=lambda x: (len(x), x)) + for digit in digits: + digitf = re.sub("/+", "/", digit) + if not digitf: + continue + numslash = len(re.findall("/", digitf)) + if numslash == 0: + word = undigit(digitf, lang = lang) + elif numslash == 1: # Fraction or date + i = digitf.index("/") + is_date = False + if len(digitf[i+1:]) == 2: + try: + first = int(digitf[:i]) + second = int(digitf[i+1:]) + is_date = first > 0 and first < 32 and second > 0 and second < 13 + except: pass + if is_date: + first = undigit(digitf[:i].lstrip("0"), lang = lang) + if first == "un": first = "premier" + second = _int_to_month[second] + else: + first = undigit(digitf[:i], lang = lang) + second = undigit(digitf[i+1:], to="denominator", lang = lang) + if float(digitf[:i]) > 2. and second[-1] != "s": + second += "s" + word = first + " " + second + elif numslash == 2: # Maybe a date + i1 = digitf.index("/") + i2 = digitf.index("/", i1+1) + is_date = False + if len(digitf[i1+1:i2]) == 2 and len(digitf[i2+1:]) == 4: + try: + first = int(digitf[:i1]) + second = int(digitf[i1+1:i2]) + third = int(digitf[i2+1:]) + is_date = first > 0 and first < 32 and second > 0 and second < 13 and third > 1000 + except: pass + third = undigit(digitf[i2+1:], lang = lang) + if is_date: + first = undigit(digitf[:i1].lstrip("0"), lang = lang) + if first == "un": first = "premier" + second = _int_to_month.get(lang, {}).get(int(digitf[i1+1:i2]), digitf[i1+1:i2]) + word = " ".join([first, second, third]) + else: + word = " / ".join([undigit(s, lang = lang) for s in digitf.split('/')]) + else: + word = " / ".join([undigit(s, lang = lang) for s in digitf.split('/')]) + # Replace + if " " in digit: + text = re.sub(r'\b'+str(digit)+r'\b', " "+word+" ", text) + else: + text = re.sub(str(digit), " "+word+" ", text) + + # TODO: symbols (currencies...) 
+ + return collapse_whitespace(text) + +def undigit(str, lang, to="cardinal"): + str = re.sub(" ","", str) + if to == "denominator": + assert lang == "fr" + if str == "2": return "demi" + if str == "3": return "tiers" + if str == "4": return "quart" + to = "ordinal" + if str.startswith("0") and to == "cardinal": + numZeros = len(re.findall(r"0+", str)[0]) + if numZeros < len(str): + return numZeros * (my_num2words(0, lang=lang, to="cardinal")+" ") + my_num2words(float(str), lang=lang, to=to) + return my_num2words(float(str), lang=lang, to=to) + + +def my_num2words(x, lang, to = "cardinal", orig = ""): + """ + Bugfix for num2words + """ + try: + if lang == "fr" and to == "ordinal": + return num2words(x, lang=lang, to=to).replace("vingtsième", "vingtième") + else: + return num2words(x, lang=lang, to=to) + except OverflowError: + if x == math.inf: # ! + return " ".join(my_num2words(xi, lang=lang, to=to) for xi in orig) + if x == -math.inf: # ! + return "moins " + my_num2words(-x, lang=lang, to=to, orig=orig.replace("-" , "")) + # TODO: print a warning + return my_num2words(x//10, lang=lang, to=to) + +def flatten(l): + """ + flatten a list of lists + """ + return [item for sublist in l for item in sublist] + +_int_to_month = { + "fr": { + 1: "janvier", + 2: "février", + 3: "mars", + 4: "avril", + 5: "mai", + 6: "juin", + 7: "juillet", + 8: "août", + 9: "septembre", + 10: "octobre", + 11: "novembre", + 12: "décembre", + }, + "en": { + 1: "january", + 2: "february", + 3: "march", + 4: "april", + 5: "may", + 6: "june", + 7: "july", + 8: "august", + 9: "september", + 10: "october", + 11: "november", + 12: "december", + } +} + + +def romanToDecimal(str): + def value(r): + if (r == 'I'): + return 1 + if (r == 'V'): + return 5 + if (r == 'X'): + return 10 + if (r == 'L'): + return 50 + if (r == 'C'): + return 100 + if (r == 'D'): + return 500 + if (r == 'M'): + return 1000 + return -1 + + res = 0 + i = 0 + while (i < len(str)): + # Getting value of symbol s[i] + s1 = value(str[i]) + if (i + 1 < len(str)): + # Getting value of symbol s[i + 1] + s2 = value(str[i + 1]) + # Comparing both values + if (s1 >= s2): + # Value of current symbol is greater + # or equal to the next symbol + res = res + s1 + i = i + 1 + else: + # Value of current symbol is greater + # or equal to the next symbol + res = res + s2 - s1 + i = i + 2 + else: + res = res + s1 + i = i + 1 + return res diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py new file mode 100644 index 0000000..da5d98c --- /dev/null +++ b/stt/processing/load_model.py @@ -0,0 +1,62 @@ +import whisper + +import os +import requests +import huggingface_hub +import speechbrain as sb + +def load_whisper_model(model_type_or_file, device = "cpu", download_root = "/opt"): + + model = whisper.load_model(model_type_or_file, device = device, download_root = os.path.join(download_root, "whisper")) + + model.eval() + model.requires_grad_(False) + return model + +def load_speechbrain_model(source, device = "cpu", download_root = "/opt"): + + if os.path.isdir(source): + yaml_file = os.path.join(source, "hyperparams.yaml") + assert os.path.isfile(yaml_file), f"Hyperparams file {yaml_file} not found" + else: + try: + yaml_file = huggingface_hub.hf_hub_download(repo_id=source, filename="hyperparams.yaml", cache_dir = os.path.join(download_root, "huggingface/hub")) + except requests.exceptions.HTTPError: + yaml_file = None + overrides = make_yaml_overrides(yaml_file, {"save_path": os.path.join(download_root, "speechbrain")}) + + savedir = 
os.path.join(download_root, "speechbrain") + try: + model = sb.pretrained.EncoderASR.from_hparams(source = source, run_opts= {"device": device}, savedir = savedir, overrides = overrides) + except ValueError: + model = sb.pretrained.EncoderDecoderASR.from_hparams(source = source, run_opts= {"device": device}, savedir = savedir, overrides = overrides) + + model.train(False) + model.requires_grad_(False) + return model + + +def make_yaml_overrides(yaml_file, key_values): + """ + return a dictionary of overrides to be used with speechbrain (hyperyaml files) + yaml_file: path to yaml file + key_values: dict of key values to override + """ + if yaml_file is None: return None + + override = {} + with open(yaml_file, "r") as f: + parent = None + for line in f: + if line.strip() == "": + parent = None + elif line == line.lstrip(): + if ":" in line: + parent = line.split(":")[0].strip() + if parent in key_values: + override[parent] = key_values[parent] + elif ":" in line: + child = line.strip().split(":")[0].strip() + if child in key_values: + override[parent] = override.get(parent, {}) | {child: key_values[child]} + return override diff --git a/stt/processing/streaming.py b/stt/processing/streaming.py deleted file mode 100644 index 28274b8..0000000 --- a/stt/processing/streaming.py +++ /dev/null @@ -1,109 +0,0 @@ -import json -import re -from typing import Union - -from simple_websocket.ws import Server as WSServer -from vosk import KaldiRecognizer, Model -from websockets.legacy.server import WebSocketServerProtocol - -from stt import logger - - -async def wssDecode(ws: WebSocketServerProtocol, model: Model): - """Async Decode function endpoint""" - # Wait for config - res = await ws.recv() - - # Parse config - try: - config = json.loads(res)["config"] - sample_rate = config["sample_rate"] - except Exception as e: - logger.error("Failed to read stream configuration") - await ws.close(reason="Failed to load configuration") - - # Recognizer - try: - recognizer = KaldiRecognizer(model, sample_rate) - except Exception as e: - logger.error("Failed to load recognizer") - await ws.close(reason="Failed to load recognizer") - - # Wait for chunks - while True: - try: - # Client data - message = await ws.recv() - if message is None or message == "": # Timeout - ws.close() - except Exception as e: - print("Connection closed by client: {}".format(str(e))) - break - - # End frame - if "eof" in str(message): - ret = recognizer.FinalResult() - await ws.send(json.dumps(ret)) - await ws.close(reason="End of stream") - break - - # Audio chunk - if recognizer.AcceptWaveform(message): - ret = recognizer.Result() # Result seems to not work properly - await ws.send(ret) - - else: - ret = recognizer.PartialResult() - last_utterance = ret - await ws.send(ret) - - -def ws_streaming(websocket_server: WSServer, model: Model): - """Sync Decode function endpoint""" - # Wait for config - res = websocket_server.receive(timeout=10) - - # Timeout - if res is None: - pass - - # Parse config - try: - config = json.loads(res)["config"] - sample_rate = config["sample_rate"] - except Exception: - logger.error("Failed to read stream configuration") - websocket_server.close() - - # Recognizer - try: - recognizer = KaldiRecognizer(model, sample_rate) - except Exception: - logger.error("Failed to load recognizer") - websocket_server.close() - - # Wait for chunks - while True: - try: - # Client data - message = websocket_server.receive(timeout=10) - if message is None: # Timeout - websocket_server.close() - except Exception: - 
print("Connection closed by client") - break - # End frame - if "eof" in str(message): - ret = recognizer.FinalResult() - websocket_server.send(json.dumps(re.sub(" ", "", ret))) - websocket_server.close() - break - # Audio chunk - print("Received chunk") - if recognizer.AcceptWaveform(message): - ret = recognizer.Result() - websocket_server.send(re.sub(" ", "", ret)) - - else: - ret = recognizer.PartialResult() - websocket_server.send(re.sub(" ", "", ret)) diff --git a/stt/processing/utils.py b/stt/processing/utils.py index d003fc8..6956161 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -1,24 +1,40 @@ import io - import wavio -from numpy import int16, squeeze, mean +import os +import numpy as np +import torch +import torchaudio +import whisper +def conform_audio(audio, sample_rate = 16_000): + if sample_rate != whisper.audio.SAMPLE_RATE: + # Down or Up sample to the right sampling rate + audio = torchaudio.transforms.Resample(sample_rate, whisper.audio.SAMPLE_RATE)(audio) + if audio.shape[0] > 1: + # Stereo to mono + # audio = torchaudio.transforms.DownmixMono()(audio, channels_first = True) + audio = audio.mean(0) + else: + audio = audio.squeeze(0) + return audio -def load_wave(file_path): - """Formats audio from a wavFile buffer to a bytebuffer""" - audio = squeeze(wavio.read(file_path).data) - return audio.tobytes() +def load_audiofile(path): + if not os.path.isfile(path): + raise RuntimeError("File not found: %s" % path) + elif not os.access(path, os.R_OK): + raise RuntimeError("Missing reading permission for: %s" % path) + # audio, sample_rate = torchaudio.load(path) + # return conform_audio(audio, sample_rate) + audio = whisper.load_audio(path) + audio = torch.from_numpy(audio) + return audio -def formatAudio(file_buffer): - """Formats audio from a wavFile buffer to a numpy array for processing.""" +def load_wave_buffer(file_buffer): + """ Formats audio from a wavFile buffer to a torch array for processing. 
""" file_buffer_io = io.BytesIO(file_buffer) file_content = wavio.read(file_buffer_io) - # if stereo file, convert to mono by computing the mean over the channels - if file_content.data.ndim == 2: - if file_content.data.shape[1] == 1: - data = squeeze(file_content.data) - elif file_content.data.shape[1] == 2: - data = mean(data, axis=1, dtype=int16) - return data.tobytes(), file_content.rate - raise Exception("Audio Format not supported.") + sample_rate = file_content.rate + audio = torch.from_numpy(file_content.data.astype(np.float32)/32768) + audio = audio.transpose(0,1) + return conform_audio(audio, sample_rate) diff --git a/stt/processing/word_alignment.py b/stt/processing/word_alignment.py new file mode 100644 index 0000000..974e528 --- /dev/null +++ b/stt/processing/word_alignment.py @@ -0,0 +1,169 @@ +import unicodedata +from dataclasses import dataclass +import torch + +from stt import logger +from .alignment_model import speechbrain_compute_log_probas as compute_log_probas +from .alignment_model import speechbrain_get_vocab as get_vocab + + +def compute_alignment(audio, transcript, model): + """ Compute the alignment of the audio and a transcript, for a given model that returns log-probabilities on the charset defined the transcript.""" + + emission = compute_log_probas(model, audio) + labels, blank_id = get_vocab(model) + labels = labels[:emission.shape[1]] + dictionary = {c: i for i, c in enumerate(labels)} + + tokens = [loose_get_char_index(dictionary, c, blank_id) for c in transcript] + tokens = [i for i in tokens if i is not None] + + trellis = get_trellis(emission, tokens, blank_id = blank_id) + + path = backtrack(trellis, emission, tokens, blank_id = blank_id) + + segments = merge_repeats(transcript, path) + + word_segments = merge_words(segments) + + return labels, emission, trellis, segments, word_segments + +def loose_get_char_index(dictionary, c, default): + i = dictionary.get(c, None) + if i is None: + other_char = list(set([c.lower(), c.upper(), transliterate(c), transliterate(c).lower(), transliterate(c).upper()])) + for c2 in other_char: + i = dictionary.get(c2, None) + if i is not None: + break + if i is None: + logger.warn("Cannot find label " + " / ".join(list(set([c] + other_char)))) + i = default + return i + +def transliterate(c): + # Transliterates a character to its closest ASCII equivalent. + # For example, "é" becomes "e". + # This is useful for converting Vietnamese text to ASCII. + # See https://stackoverflow.com/a/517974/446579 + return unicodedata.normalize("NFKD", c).encode("ascii", "ignore").decode("ascii") + +def get_trellis(emission, tokens, blank_id=0, use_max = False): + num_frame = emission.size(0) + num_tokens = len(tokens) + + # Trellis has extra diemsions for both time axis and tokens. + # The extra dim for tokens represents (start-of-sentence) + # The extra dim for time axis is for simplification of the code. 
+ trellis = torch.empty((num_frame + 1, num_tokens + 1)).to(emission.device) + trellis[0, 0] = 0 + trellis[1:, 0] = torch.cumsum(emission[:, blank_id], 0) + trellis[0, -num_tokens:] = -float("inf") + trellis[-num_tokens:, 0] = float("inf") + + for t in range(num_frame): + trellis[t + 1, 1:] = torch.maximum( + # Score for staying at the same token + trellis[t, 1:] + emission[t, blank_id], + torch.maximum(trellis[t, 1:] + emission[t, tokens], + # Score for changing to the next token + trellis[t, :-1] + emission[t, tokens]) + ) if use_max else torch.logaddexp( + trellis[t, 1:] + emission[t, blank_id], + torch.logaddexp(trellis[t, 1:] + emission[t, tokens], + trellis[t, :-1] + emission[t, tokens]) + ) + return trellis + +@dataclass +class Point: + token_index: int + time_index: int + score: float + + +def backtrack(trellis, emission, tokens, blank_id=0): + # Note: + # j and t are indices for trellis, which has extra dimensions + # for time and tokens at the beginning. + # When referring to time frame index `T` in trellis, + # the corresponding index in emission is `T-1`. + # Similarly, when referring to token index `J` in trellis, + # the corresponding index in transcript is `J-1`. + j = trellis.size(1) - 1 + t_start = torch.argmax(trellis[:, j]).item() + + path = [] + for t in range(t_start, 0, -1): + # 1. Figure out if the current position was stay or change + # Note (again): + # `emission[J-1]` is the emission at time frame `J` of trellis dimension. + # Score for token staying the same from time frame J-1 to T. + stayed = trellis[t - 1, j] + emission[t - 1, blank_id] + # Score for token changing from C-1 at T-1 to J at T. + changed = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]] + + # 2. Store the path with frame-wise probability. + prob = emission[t - 1, tokens[j - 1] if changed > stayed else 0].exp().item() + # Return token index and time index in non-trellis coordinate. + path.append(Point(j - 1, t - 1, prob)) + + # 3. 
Update the token + if changed > stayed: + j -= 1 + if j == 0: + break + else: + raise ValueError("Failed to align") + return path[::-1] + + +# Merge the labels +@dataclass +class Segment: + label: str + start: int + end: int + score: float + + def __repr__(self): + return f"{self.label}\t({self.score:4.2f}): [{self.start:5d}, {self.end:5d})" + + @property + def length(self): + return self.end - self.start + + +def merge_repeats(transcript, path): + i1, i2 = 0, 0 + segments = [] + while i1 < len(path): + while i2 < len(path) and path[i1].token_index == path[i2].token_index: + i2 += 1 + score = sum(path[k].score for k in range(i1, i2)) / (i2 - i1) + segments.append( + Segment( + transcript[path[i1].token_index], + path[i1].time_index, + path[i2 - 1].time_index + 1, + score, + ) + ) + i1 = i2 + return segments + +def merge_words(segments, separator=" "): + words = [] + i1, i2 = 0, 0 + while i1 < len(segments): + if i2 >= len(segments) or segments[i2].label == separator: + if i1 != i2: + segs = segments[i1:i2] + word = "".join([seg.label for seg in segs]) + score = sum(seg.score * seg.length for seg in segs) / sum(seg.length for seg in segs) + words.append(Segment(word, segments[i1].start, segments[i2 - 1].end, score)) + i1 = i2 + 1 + i2 = i1 + else: + i2 += 1 + return words \ No newline at end of file diff --git a/websocket/websocketserver.py b/websocket/websocketserver.py deleted file mode 100644 index 81e035b..0000000 --- a/websocket/websocketserver.py +++ /dev/null @@ -1,23 +0,0 @@ -import asyncio -import os - -import websockets - -from stt.processing import model -from stt.processing.streaming import wssDecode - - -async def _fun_wrapper(ws): - """Wrap wssDecode function to add STT Model reference""" - return await wssDecode(ws, model) - - -async def WSServer(port: int): - """Launch the websocket server""" - async with websockets.serve(_fun_wrapper, "0.0.0.0", serving_port): - await asyncio.Future() - - -if __name__ == "__main__": - serving_port = os.environ.get("STREAMING_PORT", 80) - asyncio.run(WSServer(serving_port)) From 0135916b55eb06a5285da8626d275e66a9538374 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 22 Dec 2022 15:57:58 +0100 Subject: [PATCH 087/172] Rename MODEL_TYPE -> MODEL. Document LANGUAGE --- .envdefault | 2 +- README.md | 25 ++++++++++++++++++++++--- docker-entrypoint.sh | 4 ++-- stt/processing/__init__.py | 2 +- 4 files changed, 26 insertions(+), 7 deletions(-) diff --git a/.envdefault b/.envdefault index 61a57bd..4452be3 100644 --- a/.envdefault +++ b/.envdefault @@ -1,6 +1,6 @@ # SERVING PARAMETERS SERVICE_MODE=http -MODEL_TYPE=/opt/model.pt +MODEL=/opt/model.pt LANGUAGE=fr # TASK PARAMETERS diff --git a/README.md b/README.md index 50f03a8..a15b330 100644 --- a/README.md +++ b/README.md @@ -66,12 +66,32 @@ cp .envdefault .env | PARAMETER | DESCRIPTION | EXEMPLE | |---|---|---| | SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http\|task\|websocket | -| MODEL_TYPE | Path to the model or type of model used. | ASR_PATH\|small\|medium\|large-v1\|... | +| MODEL | Path to the model or type of model used. | ASR_PATH\|small\|medium\|large-v1\|... | +| LANGUAGE | (Optional) Language to recognize | fr\|en\|... 
| | SERVICE_NAME | Using the task mode, set the queue's name for task processing | my-stt | | SERVICE_BROKER | Using the task mode, URL of the message broker | redis://my-broker:6379 | | BROKER_PASS | Using the task mode, broker password | my-password | | CONCURRENCY | Maximum number of parallel requests | >1 | +The language is a code of two or three letters. The list of languages supported by Whisper are: +``` +af(afrikaans), am(amharic), ar(arabic), as(assamese), az(azerbaijani), +ba(bashkir), be(belarusian), bg(bulgarian), bn(bengali), bo(tibetan), br(breton), bs(bosnian), +ca(catalan), cs(czech), cy(welsh), da(danish), de(german), el(greek), en(english), es(spanish), +et(estonian), eu(basque), fa(persian), fi(finnish), fo(faroese), fr(french), gl(galician), +gu(gujarati), ha(hausa), haw(hawaiian), he(hebrew), hi(hindi), hr(croatian), ht(haitian creole), +hu(hungarian), hy(armenian), id(indonesian), is(icelandic), it(italian), ja(japanese), +jw(javanese), ka(georgian), kk(kazakh), km(khmer), kn(kannada), ko(korean), la(latin), +lb(luxembourgish), ln(lingala), lo(lao), lt(lithuanian), lv(latvian), mg(malagasy), mi(maori), +mk(macedonian), ml(malayalam), mn(mongolian), mr(marathi), ms(malay), mt(maltese), my(myanmar), +ne(nepali), nl(dutch), nn(nynorsk), no(norwegian), oc(occitan), pa(punjabi), pl(polish), +ps(pashto), pt(portuguese), ro(romanian), ru(russian), sa(sanskrit), sd(sindhi), si(sinhala), +sk(slovak), sl(slovenian), sn(shona), so(somali), sq(albanian), sr(serbian), su(sundanese), +sv(swedish), sw(swahili), ta(tamil), te(telugu), tg(tajik), th(thai), tk(turkmen), tl(tagalog), +tr(turkish), tt(tatar), uk(ukrainian), ur(urdu), uz(uzbek), vi(vietnamese), yi(yiddish), +yo(yoruba), zh(chinese) +``` + ### Serving mode ![Serving Modes](https://i.ibb.co/qrtv3Z6/platform-stt.png) @@ -122,9 +142,8 @@ docker run --rm \ -v SHARED_AUDIO_FOLDER:/opt/audio \ --env-file .env \ linto-platform-stt:latest -``` +```| LANGUAGE | (Optional) Language to recognize | fr\|en\|... | -**Parameters:** | Variables | Description | Example | |:-|:-|:-| | ASR_PATH | (Optional) Path to the Whisper model on the host machine to /opt/model.pt | /my/path/to/models/medium.pt | diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh index 4d67cca..5014d8f 100755 --- a/docker-entrypoint.sh +++ b/docker-entrypoint.sh @@ -5,10 +5,10 @@ echo "RUNNING STT" # Check model echo "Checking model format ..." -if [ -z "$MODEL_TYPE" ] +if [ -z "$MODEL" ] then echo "Model type not specified, choosing Whisper medium model" - export MODEL_TYPE=medium + export MODEL=medium fi # Launch parameters, environement variables and dependencies check diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index a4d6182..19d7f0e 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -29,7 +29,7 @@ raise RuntimeError(f"Langaue {get_default_language()} is not available. Available languages are: {available_languages}") # Load ASR model -model_type = os.environ.get("MODEL_TYPE", "medium") +model_type = os.environ.get("MODEL", "medium") logger.info(f"Loading Whisper model {model_type} ({'local' if os.path.isfile(model_type) else 'remote'})...") start = time() try: From 7291f87e4e74fcee4bf6d83f6e2f1dbaf3794d92 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 22 Dec 2022 17:18:10 +0100 Subject: [PATCH 088/172] Isolate everything related to text normalization in one place. 
And implement stuff for symbols --- stt/processing/decoding.py | 235 +---------------------- stt/processing/text_normalize.py | 316 +++++++++++++++++++++++++++++++ stt/processing/utils.py | 6 + stt/processing/word_alignment.py | 50 +++-- 4 files changed, 358 insertions(+), 249 deletions(-) create mode 100644 stt/processing/text_normalize.py diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 7290af4..dc8d95a 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -3,16 +3,12 @@ import whisper from whisper.audio import SAMPLE_RATE -import math import numpy as np import torch -import re -import string -from num2words import num2words - from stt import logger from .word_alignment import compute_alignment +from .text_normalize import remove_punctuation, normalize_text # TODO: understand and remove this limitations torch.set_num_threads(1) @@ -76,10 +72,14 @@ def decode(audio: torch.Tensor, sub_text = segment["text"] sub_text = normalize_text(sub_text, language) sub_text = remove_punctuation(sub_text) + if not sub_text: + logger.warn(f"Lost text in segment {segment['start']}-{segment['end']}") + continue labels, emission, trellis, segments, word_segments = compute_alignment(sub_audio, sub_text, alignment_model) ratio = len(sub_audio) / (trellis.size(0) * SAMPLE_RATE) sub_words = sub_text.split() - assert len(sub_words) == len(word_segments), f"Unexpected number of words: {len(sub_words)} != {len(word_segments)}" + assert len(sub_words) == len(word_segments), \ + f"Unexpected number of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}" for word, segment in zip(sub_words, word_segments): result["words"].append({ "word": word, @@ -91,226 +91,3 @@ def decode(audio: torch.Tensor, return result -custom_punctuations = string.punctuation.replace("'", "").replace("-", "") - -def remove_punctuation(text: str) -> str: - # Remove all punctuation except apostrophe - return text.translate(str.maketrans("", "", custom_punctuations)) - -_whitespace_re = re.compile(r'[^\S\r\n]+') - -def collapse_whitespace(text): - return re.sub(_whitespace_re, ' ', text).strip() - - -def normalize_text(text: str, lang: str) -> str: - """ Transform digits into characters... """ - - # Roman digits - if re.search(r"[IVX]", text): - if lang == "en": - digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(st|nd|rd|th)?\b", text) - digits = ["".join(d) for d in digits] - elif lang == "fr": - digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(ème|eme|e|er|ère)?\b", text) - digits = ["".join(d) for d in digits] - else: - digits = [] - if digits: - digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) - for s in digits: - filtered = re.sub("[a-z]", "", s) - ordinal = filtered != s - digit = romanToDecimal(filtered) - v = undigit(str(digit), lang=lang, to= "ordinal" if ordinal else "cardinal") - text = re.sub(r"\b" + s + r"\b", v, text) - - # Ordinal digits - if lang == "en": - digits = re.findall(r"\b\d*1(?:st)|\d*2(?:nd)|\d*3(?:rd)|\d+(?:th)\b", text) - elif lang == "fr": - digits = re.findall(r"\b1(?:ère|ere|er|re|r)|2(?:nd|nde)|\d+(?:ème|eme|e)\b", text) - else: - logger.warn(f"Language {lang} not supported for normalization. 
Some words might be mis-localized.") - digits = [] - if digits: - digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) - for digit in digits: - word = undigit(re.findall(r"\d+", digit)[0], to= "ordinal", lang = lang) - text = re.sub(r'\b'+str(digit)+r'\b', word, text) - - # Cardinal digits - digits = re.findall(r"(?:\-?\b[\d/]*\d+(?: \d\d\d)+\b)|(?:\-?\d[/\d]*)",text) - digits = list(map(lambda s: s.strip(r"[/ ]"), digits)) - digits = list(set(digits)) - digits = digits + flatten([c.split() for c in digits if " " in c]) - digits = digits + flatten([c.split("/") for c in digits if "/" in c]) - digits = sorted(digits, reverse=True, key=lambda x: (len(x), x)) - for digit in digits: - digitf = re.sub("/+", "/", digit) - if not digitf: - continue - numslash = len(re.findall("/", digitf)) - if numslash == 0: - word = undigit(digitf, lang = lang) - elif numslash == 1: # Fraction or date - i = digitf.index("/") - is_date = False - if len(digitf[i+1:]) == 2: - try: - first = int(digitf[:i]) - second = int(digitf[i+1:]) - is_date = first > 0 and first < 32 and second > 0 and second < 13 - except: pass - if is_date: - first = undigit(digitf[:i].lstrip("0"), lang = lang) - if first == "un": first = "premier" - second = _int_to_month[second] - else: - first = undigit(digitf[:i], lang = lang) - second = undigit(digitf[i+1:], to="denominator", lang = lang) - if float(digitf[:i]) > 2. and second[-1] != "s": - second += "s" - word = first + " " + second - elif numslash == 2: # Maybe a date - i1 = digitf.index("/") - i2 = digitf.index("/", i1+1) - is_date = False - if len(digitf[i1+1:i2]) == 2 and len(digitf[i2+1:]) == 4: - try: - first = int(digitf[:i1]) - second = int(digitf[i1+1:i2]) - third = int(digitf[i2+1:]) - is_date = first > 0 and first < 32 and second > 0 and second < 13 and third > 1000 - except: pass - third = undigit(digitf[i2+1:], lang = lang) - if is_date: - first = undigit(digitf[:i1].lstrip("0"), lang = lang) - if first == "un": first = "premier" - second = _int_to_month.get(lang, {}).get(int(digitf[i1+1:i2]), digitf[i1+1:i2]) - word = " ".join([first, second, third]) - else: - word = " / ".join([undigit(s, lang = lang) for s in digitf.split('/')]) - else: - word = " / ".join([undigit(s, lang = lang) for s in digitf.split('/')]) - # Replace - if " " in digit: - text = re.sub(r'\b'+str(digit)+r'\b', " "+word+" ", text) - else: - text = re.sub(str(digit), " "+word+" ", text) - - # TODO: symbols (currencies...) - - return collapse_whitespace(text) - -def undigit(str, lang, to="cardinal"): - str = re.sub(" ","", str) - if to == "denominator": - assert lang == "fr" - if str == "2": return "demi" - if str == "3": return "tiers" - if str == "4": return "quart" - to = "ordinal" - if str.startswith("0") and to == "cardinal": - numZeros = len(re.findall(r"0+", str)[0]) - if numZeros < len(str): - return numZeros * (my_num2words(0, lang=lang, to="cardinal")+" ") + my_num2words(float(str), lang=lang, to=to) - return my_num2words(float(str), lang=lang, to=to) - - -def my_num2words(x, lang, to = "cardinal", orig = ""): - """ - Bugfix for num2words - """ - try: - if lang == "fr" and to == "ordinal": - return num2words(x, lang=lang, to=to).replace("vingtsième", "vingtième") - else: - return num2words(x, lang=lang, to=to) - except OverflowError: - if x == math.inf: # ! - return " ".join(my_num2words(xi, lang=lang, to=to) for xi in orig) - if x == -math.inf: # ! 
- return "moins " + my_num2words(-x, lang=lang, to=to, orig=orig.replace("-" , "")) - # TODO: print a warning - return my_num2words(x//10, lang=lang, to=to) - -def flatten(l): - """ - flatten a list of lists - """ - return [item for sublist in l for item in sublist] - -_int_to_month = { - "fr": { - 1: "janvier", - 2: "février", - 3: "mars", - 4: "avril", - 5: "mai", - 6: "juin", - 7: "juillet", - 8: "août", - 9: "septembre", - 10: "octobre", - 11: "novembre", - 12: "décembre", - }, - "en": { - 1: "january", - 2: "february", - 3: "march", - 4: "april", - 5: "may", - 6: "june", - 7: "july", - 8: "august", - 9: "september", - 10: "october", - 11: "november", - 12: "december", - } -} - - -def romanToDecimal(str): - def value(r): - if (r == 'I'): - return 1 - if (r == 'V'): - return 5 - if (r == 'X'): - return 10 - if (r == 'L'): - return 50 - if (r == 'C'): - return 100 - if (r == 'D'): - return 500 - if (r == 'M'): - return 1000 - return -1 - - res = 0 - i = 0 - while (i < len(str)): - # Getting value of symbol s[i] - s1 = value(str[i]) - if (i + 1 < len(str)): - # Getting value of symbol s[i + 1] - s2 = value(str[i + 1]) - # Comparing both values - if (s1 >= s2): - # Value of current symbol is greater - # or equal to the next symbol - res = res + s1 - i = i + 1 - else: - # Value of current symbol is greater - # or equal to the next symbol - res = res + s2 - s1 - i = i + 2 - else: - res = res + s1 - i = i + 1 - return res diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py new file mode 100644 index 0000000..5ba3358 --- /dev/null +++ b/stt/processing/text_normalize.py @@ -0,0 +1,316 @@ +import math +import re +#import string +import unicodedata +from num2words import num2words + +from stt import logger +from .utils import flatten + +_punctuations = '!"#$%&()*+,/:;<=>?@[\\]^_`{|}~«»¿' # string.punctuation, plus Whisper specific "«»¿", minus apostrophe "'", dash "-", and dot "." (which will be processed as special) + +def remove_punctuation(text: str) -> str: + text = text.translate(str.maketrans("", "", _punctuations)) + # We don't remove dots inside words (e.g. "ab@gmail.com") + text = re.sub(r"\.(\s)",r"\1", text+" ").strip() + return collapse_whitespace(text) + +_whitespace_re = re.compile(r'[^\S\r\n]+') + +def collapse_whitespace(text): + return re.sub(_whitespace_re, ' ', text).strip() + +def transliterate(c): + # Transliterates a character to its closest ASCII equivalent. + # Example: transliterate("à ß œ fl") = "a ss oe fl" + c = re.sub("œ", "oe", c) + c = re.sub("æ", "ae", c) + c = re.sub("Œ", "OE", c) + c = re.sub("Æ", "AE", c) + c = re.sub("ß", "ss", c) + return unicodedata.normalize("NFKD", c).encode("ascii", "ignore").decode("ascii") + + +def normalize_text(text: str, lang: str) -> str: + """ Transform digits into characters... """ + + # Reorder currencies (1,20€ -> 1 € 20) + coma = "," if lang in ["fr"] else "\." 
+ for c in _currencies: + if c in text: + text = re.sub(r"\b(\d+)" + coma + r"(\d+)\s*" + c, r"\1 " + c + r" \2", text) + + # Roman digits + if re.search(r"[IVX]", text): + if lang == "en": + digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|st|nd|rd|th)?\b", text) + digits = ["".join(d) for d in digits] + elif lang == "fr": + digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|ème|eme|e|er|ère)?\b", text) + digits = ["".join(d) for d in digits] + else: + digits = [] + if digits: + digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) + for s in digits: + filtered = re.sub("[a-z]", "", s) + ordinal = filtered != s + digit = roman_to_decimal(filtered) + v = undigit(str(digit), lang=lang, to= "ordinal" if ordinal else "cardinal") + text = re.sub(r"\b" + s + r"\b", v, text) + + # Ordinal digits + if lang == "en": + digits = re.findall(r"\b\d*1(?:st)|\d*2(?:nd)|\d*3(?:rd)|\d+(?:º|th)\b", text) + elif lang == "fr": + digits = re.findall(r"\b1(?:ère|ere|er|re|r)|2(?:nd|nde)|\d+(?:º|ème|eme|e)\b", text) + else: + logger.warn(f"Language {lang} not supported for normalization. Some words might be mis-localized.") + digits = [] + if digits: + digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) + for digit in digits: + word = undigit(re.findall(r"\d+", digit)[0], to= "ordinal", lang = lang) + text = re.sub(r'\b'+str(digit)+r'\b', word, text) + + # Cardinal digits + digits = re.findall(r"(?:\-?\b[\d/]*\d+(?: \d\d\d)+\b)|(?:\-?\d[/\d]*)",text) + digits = list(map(lambda s: s.strip(r"[/ ]"), digits)) + digits = list(set(digits)) + digits = digits + flatten([c.split() for c in digits if " " in c]) + digits = digits + flatten([c.split("/") for c in digits if "/" in c]) + digits = sorted(digits, reverse=True, key=lambda x: (len(x), x)) + for digit in digits: + digitf = re.sub("/+", "/", digit) + if not digitf: + continue + numslash = len(re.findall("/", digitf)) + if numslash == 0: + word = undigit(digitf, lang = lang) + elif numslash == 1: # Fraction or date + i = digitf.index("/") + is_date = False + if len(digitf[i+1:]) == 2: + try: + first = int(digitf[:i]) + second = int(digitf[i+1:]) + is_date = first > 0 and first < 32 and second > 0 and second < 13 + except: pass + if is_date: + first = digitf[:i].lstrip("0") + use_ordinal = (lang == "fr" and first == "1") or (lang != "fr" and first[-1] in ["1", "2", "3"]) + first = undigit(first, lang = lang, to="ordinal" if use_ordinal else "cardinal") + second = _int_to_month[second] + else: + first = undigit(digitf[:i], lang = lang) + second = undigit(digitf[i+1:], to="denominator", lang = lang) + if float(digitf[:i]) > 2. 
and second[-1] != "s": + second += "s" + word = first + " " + second + elif numslash == 2: # Maybe a date + i1 = digitf.index("/") + i2 = digitf.index("/", i1+1) + is_date = False + if len(digitf[i1+1:i2]) == 2 and len(digitf[i2+1:]) == 4: + try: + first = int(digitf[:i1]) + second = int(digitf[i1+1:i2]) + third = int(digitf[i2+1:]) + is_date = first > 0 and first < 32 and second > 0 and second < 13 and third > 1000 + except: pass + third = undigit(digitf[i2+1:], lang = lang) + if is_date: + first = digitf[:i].lstrip("0") + use_ordinal = (lang == "fr" and first == "1") or (lang != "fr" and first[-1] in ["1", "2", "3"]) + first = undigit(first, lang = lang, to="ordinal" if use_ordinal else "cardinal") + second = _int_to_month.get(lang, {}).get(int(digitf[i1+1:i2]), digitf[i1+1:i2]) + word = " ".join([first, second, third]) + else: + word = " / ".join([undigit(s, lang = lang) for s in digitf.split('/')]) + else: + word = " / ".join([undigit(s, lang = lang) for s in digitf.split('/')]) + if " " in digit: + text = re.sub(r'\b'+str(digit)+r'\b', " "+word+" ", text) + else: + text = re.sub(str(digit), " "+word+" ", text) + + # Symbols (currencies, percent...) + symbol_table = _symbol_to_word.get(lang, {}) + for k, v in symbol_table.items(): + text = re.sub(k, " "+v+" ", text) + + return collapse_whitespace(text) + +def undigit(str, lang, to="cardinal"): + str = re.sub(" ","", str) + if to == "denominator": + assert lang == "fr" + if str == "2": return "demi" + if str == "3": return "tiers" + if str == "4": return "quart" + to = "ordinal" + if str.startswith("0") and to == "cardinal": + numZeros = len(re.findall(r"0+", str)[0]) + if numZeros < len(str): + return numZeros * (my_num2words(0, lang=lang, to="cardinal")+" ") + my_num2words(float(str), lang=lang, to=to) + return my_num2words(float(str), lang=lang, to=to) + + +def my_num2words(x, lang, to = "cardinal", orig = ""): + """ + Bugfix for num2words + """ + try: + if lang == "fr" and to == "ordinal": + return num2words(x, lang=lang, to=to).replace("vingtsième", "vingtième") + else: + return num2words(x, lang=lang, to=to) + except OverflowError: + if x == math.inf: # ! + return " ".join(my_num2words(xi, lang=lang, to=to) for xi in orig) + if x == -math.inf: # ! 
+ return "moins " + my_num2words(-x, lang=lang, to=to, orig=orig.replace("-" , "")) + # TODO: print a warning + return my_num2words(x//10, lang=lang, to=to) + +def roman_to_decimal(str): + def value(r): + if (r == 'I'): + return 1 + if (r == 'V'): + return 5 + if (r == 'X'): + return 10 + if (r == 'L'): + return 50 + if (r == 'C'): + return 100 + if (r == 'D'): + return 500 + if (r == 'M'): + return 1000 + return -1 + + res = 0 + i = 0 + while (i < len(str)): + s1 = value(str[i]) + if (i + 1 < len(str)): + s2 = value(str[i + 1]) + if (s1 >= s2): + # Value of current symbol is greater or equal to the next symbol + res = res + s1 + i = i + 1 + else: + # Value of current symbol is greater or equal to the next symbol + res = res + s2 - s1 + i = i + 2 + else: + res = res + s1 + i = i + 1 + return res + +_int_to_month = { + "fr": { + 1: "janvier", + 2: "février", + 3: "mars", + 4: "avril", + 5: "mai", + 6: "juin", + 7: "juillet", + 8: "août", + 9: "septembre", + 10: "octobre", + 11: "novembre", + 12: "décembre", + }, + "en": { + 1: "january", + 2: "february", + 3: "march", + 4: "april", + 5: "may", + 6: "june", + 7: "july", + 8: "august", + 9: "september", + 10: "october", + 11: "november", + 12: "december", + } +} + +_currencies = ["€", "$", "£", "¥"] + +_symbol_to_word = { + "fr": { + "%": "pour cents", + "÷": "divisé par", + "\*": "fois", # ? + "×": "fois", + "±": "plus ou moins", + "\+": "plus", + "&": "et", + "@": "arobase", + "m²": "mètres carrés", + "m³": "mètres cubes", + "²": "au carré", + "³": "au cube", + "¼": "un quart", + "½": "un demi", + "¾": "trois quarts", + "§": "section", + "°C": "degrés Celsius", + "°F": "degrés Fahrenheit", + "°K": "kelvins", + "°": "degrés", + "€": "euros", + "¢": "cents", + "\$": "dollars", + "£": "livres", + "¥": "yens", + # Below: not in Whisper tokens + #"₩": "wons", + #"₽": "roubles", + #"₹": "roupies", + #"₺": "liras", + #"₪": "shekels", + #"₴": "hryvnias", + #"₮": "tugriks", + #"℃": "degrés Celsius", + #"℉": "degrés Fahrenheit", + # "Ω": "ohms", + # "Ω": "ohms", + # "K": "kelvins", + # "ℓ": "litres", + }, + "en": { + "%": "percent", + "÷": "divided by", + "\*": "times", # ? 
+ "×": "times", + "±": "plus or minus", + "\+": "plus", + "&": "and", + "@": "at", + "m²": "square meters", + "m³": "cubic meters", + "²": "squared", + "³": "cubed", + "¼": "one quarter", + "½": "one half", + "¾": "three quarters", + "§": "section", + "°C": "degrees Celsius", + "°F": "degrees Fahrenheit", + "°K": "kelvins", + "°": "degrees", + "€": "euros", + "¢": "cents", + "\$": "dollars", + "£": "pounds", + "¥": "yens", + } +} + diff --git a/stt/processing/utils.py b/stt/processing/utils.py index 6956161..1e35c91 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -38,3 +38,9 @@ def load_wave_buffer(file_buffer): audio = torch.from_numpy(file_content.data.astype(np.float32)/32768) audio = audio.transpose(0,1) return conform_audio(audio, sample_rate) + +def flatten(l): + """ + flatten a list of lists + """ + return [item for sublist in l for item in sublist] diff --git a/stt/processing/word_alignment.py b/stt/processing/word_alignment.py index 974e528..2180d0d 100644 --- a/stt/processing/word_alignment.py +++ b/stt/processing/word_alignment.py @@ -1,10 +1,11 @@ -import unicodedata from dataclasses import dataclass import torch from stt import logger from .alignment_model import speechbrain_compute_log_probas as compute_log_probas from .alignment_model import speechbrain_get_vocab as get_vocab +from .utils import flatten +from .text_normalize import transliterate, remove_punctuation def compute_alignment(audio, transcript, model): @@ -13,10 +14,12 @@ def compute_alignment(audio, transcript, model): emission = compute_log_probas(model, audio) labels, blank_id = get_vocab(model) labels = labels[:emission.shape[1]] + labels[blank_id] = " " dictionary = {c: i for i, c in enumerate(labels)} - tokens = [loose_get_char_index(dictionary, c, blank_id) for c in transcript] - tokens = [i for i in tokens if i is not None] + tokens = [loose_get_char_index(dictionary, c) for c in transcript] + tokens = flatten(tokens) + transcript = "".join([labels[i][0] for i in tokens]) # Make sure transcript has the same length as tokens (could be different because of transliteration "œ" -> "oe") trellis = get_trellis(emission, tokens, blank_id = blank_id) @@ -28,25 +31,32 @@ def compute_alignment(audio, transcript, model): return labels, emission, trellis, segments, word_segments -def loose_get_char_index(dictionary, c, default): - i = dictionary.get(c, None) +def loose_get_char_index(dictionary, c): + i = dictionary.get(c, None) + if i is None: + # Try with alternative versions of the character + tc = transliterate(c) + other_char = list(set([c.lower(), c.upper(), tc, tc.lower(), tc.upper()])) + for c2 in other_char: + i = dictionary.get(c2, None) + if i is not None: + i = [i] + break + # Some transliterated versions may correspond to multiple characters if i is None: - other_char = list(set([c.lower(), c.upper(), transliterate(c), transliterate(c).lower(), transliterate(c).upper()])) for c2 in other_char: - i = dictionary.get(c2, None) - if i is not None: - break - if i is None: - logger.warn("Cannot find label " + " / ".join(list(set([c] + other_char)))) - i = default - return i - -def transliterate(c): - # Transliterates a character to its closest ASCII equivalent. - # For example, "é" becomes "e". - # This is useful for converting Vietnamese text to ASCII. 
- # See https://stackoverflow.com/a/517974/446579 - return unicodedata.normalize("NFKD", c).encode("ascii", "ignore").decode("ascii") + if len(c2) > 1: + candidate = [dictionary[c3] for c3 in c2 if c3 in dictionary] + if len(candidate) > 0 and (i is None or len(candidate) > len(i)): + i = candidate + # If still not found + if i is None: + logger.warn("Cannot find label " + " / ".join(list(set([c] + other_char)))) + i = [] # [default] # Could be [] ... + else: + i = [i] + return i + def get_trellis(emission, tokens, blank_id=0, use_max = False): num_frame = emission.size(0) From d09d704c62586a5d6e56a6fdc6064e0ee5006918 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 22 Dec 2022 18:32:05 +0100 Subject: [PATCH 089/172] set logging level at the right place --- http_server/ingress.py | 5 ++++- stt/processing/__init__.py | 3 --- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/http_server/ingress.py b/http_server/ingress.py index 6ccd090..db739d4 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -12,6 +12,7 @@ from swagger import setupSwaggerUI from stt.processing import decode, load_wave_buffer, model, alignment_model +from stt import logger as stt_logger app = Flask("__stt-standalone-worker__") app.config["JSON_AS_ASCII"] = False @@ -96,7 +97,9 @@ def server_error(error): parser = createParser() args = parser.parse_args() - logger.setLevel(logging.DEBUG if args.debug else logging.INFO) + logger_level = logging.DEBUG if args.debug else logging.INFO + logger.setLevel(logger_level) + stt_logger.setLevel(logger_level) try: # Setup SwaggerUI if args.swagger_path is not None: diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 19d7f0e..dc3a6a6 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -13,9 +13,6 @@ __all__ = ["logger", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] -# Set logger level -logger.setLevel(logging.INFO) - # Set device device = os.environ.get("DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu") try: From e186990f14f3bc643e8d0a3cd4b9cf2896198433 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 22 Dec 2022 18:44:18 +0100 Subject: [PATCH 090/172] Robustness to corner cases (no transcription, too long transcription from Whisper, emojis...) 
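
A minimal illustration of the new behaviour (the interactive session below is
hypothetical and not part of this patch; it only relies on the remove_emoji
helper and the alignment fallback introduced in the diff):

    >>> from stt.processing.text_normalize import remove_emoji
    >>> remove_emoji("bonjour 😀👍").strip()
    'bonjour'

When the number of aligned word segments does not match the number of words,
the decoder now logs a warning and falls back to the aligner's own word labels
instead of failing on an assertion, and transcripts with more characters than
emission frames are shrunk before alignment.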
--- stt/processing/decoding.py | 47 ++++++++++++++++++++++---------- stt/processing/text_normalize.py | 4 +++ stt/processing/word_alignment.py | 34 +++++++++++++++++++---- 3 files changed, 64 insertions(+), 21 deletions(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index dc8d95a..48d1b66 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -8,7 +8,7 @@ from stt import logger from .word_alignment import compute_alignment -from .text_normalize import remove_punctuation, normalize_text +from .text_normalize import remove_punctuation, normalize_text, remove_emoji # TODO: understand and remove this limitations torch.set_num_threads(1) @@ -26,6 +26,7 @@ def decode(audio: torch.Tensor, logprob_threshold: float = -1.0, compression_ratio_threshold: float = 2.4, normalize_text_as_words = False, + remove_punctuation_from_words = False, ) -> dict: """Transcribe the audio data using Whisper with the defined model.""" result = {"text": "", "confidence-score": 0.0, "words": []} @@ -47,18 +48,23 @@ def decode(audio: torch.Tensor, compression_ratio_threshold = compression_ratio_threshold ) - text = whisper_res["text"].strip() + text = whisper_res["text"] + text = remove_emoji(text).strip() if normalize_text_as_words: text = normalize_text(text, language) - text = remove_punctuation(text) + if remove_punctuation_from_words: + text = remove_punctuation(text) segments = whisper_res["segments"] + if language is None: + language = whisper_res["language"] result["text"] = text - result["confidence-score"] = np.exp(np.array([r["avg_logprob"] for r in segments])).mean() + result["confidence-score"] = np.exp(np.array([r["avg_logprob"] for r in segments])).mean() if len(segments) else 0.0 if not with_word_timestamps: if not normalize_text_as_words: text = normalize_text(text, language) - text = remove_punctuation(text) + if remove_punctuation_from_words: + text = remove_punctuation(text) result["words"] = text.split() else: # Compute word timestamps @@ -70,23 +76,34 @@ def decode(audio: torch.Tensor, end = min(max_t, round(segment["end"] * SAMPLE_RATE)) sub_audio = audio[start:end] sub_text = segment["text"] + logger.debug(f"Aligning text: {sub_text}") + sub_text = remove_emoji(sub_text).strip() sub_text = normalize_text(sub_text, language) - sub_text = remove_punctuation(sub_text) + if remove_punctuation_from_words: + sub_text = remove_punctuation(sub_text) if not sub_text: logger.warn(f"Lost text in segment {segment['start']}-{segment['end']}") continue labels, emission, trellis, segments, word_segments = compute_alignment(sub_audio, sub_text, alignment_model) ratio = len(sub_audio) / (trellis.size(0) * SAMPLE_RATE) sub_words = sub_text.split() - assert len(sub_words) == len(word_segments), \ - f"Unexpected number of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}" - for word, segment in zip(sub_words, word_segments): - result["words"].append({ - "word": word, - "start": segment.start * ratio + offset, - "end": segment.end * ratio + offset, - "conf": segment.score, - }) + if len(sub_words) == len(word_segments): + for word, segment in zip(sub_words, word_segments): + result["words"].append({ + "word": word, + "start": segment.start * ratio + offset, + "end": segment.end * ratio + offset, + "conf": segment.score, + }) + else: + logger.warn(f"Alignment failed. 
Results might differ on some words.\nNumber of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}") + for segment in word_segments: + result["words"].append({ + "word": segment.label, + "start": segment.start * ratio + offset, + "end": segment.end * ratio + offset, + "conf": segment.score, + }) return result diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index 5ba3358..af9fdbd 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -30,6 +30,10 @@ def transliterate(c): c = re.sub("ß", "ss", c) return unicodedata.normalize("NFKD", c).encode("ascii", "ignore").decode("ascii") +def remove_emoji(text): + # Remove emojis + return re.sub(r"[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+", "", text) + def normalize_text(text: str, lang: str) -> str: """ Transform digits into characters... """ diff --git a/stt/processing/word_alignment.py b/stt/processing/word_alignment.py index 2180d0d..34bacf0 100644 --- a/stt/processing/word_alignment.py +++ b/stt/processing/word_alignment.py @@ -1,3 +1,6 @@ +""" +source: https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html +""" from dataclasses import dataclass import torch @@ -5,7 +8,7 @@ from .alignment_model import speechbrain_compute_log_probas as compute_log_probas from .alignment_model import speechbrain_get_vocab as get_vocab from .utils import flatten -from .text_normalize import transliterate, remove_punctuation +from .text_normalize import transliterate def compute_alignment(audio, transcript, model): @@ -17,9 +20,24 @@ def compute_alignment(audio, transcript, model): labels[blank_id] = " " dictionary = {c: i for i, c in enumerate(labels)} - tokens = [loose_get_char_index(dictionary, c) for c in transcript] + default = labels.index("-") if "-" in labels else None + tokens = [loose_get_char_index(dictionary, c, default) for c in transcript] tokens = flatten(tokens) - transcript = "".join([labels[i][0] for i in tokens]) # Make sure transcript has the same length as tokens (could be different because of transliteration "œ" -> "oe") + + num_emissions = emission.shape[0] + num_repetitions = count_repetitions(tokens) + if len(tokens) + num_repetitions > num_emissions: + # It will be impossible to find a path... + # It can happen when Whisper is lost in a loop (ex: "Ha ha ha ha ...") + logger.warn(f"Got too many characters from Whisper. 
Shrinking to the first characters.") + tokens = tokens[:num_emissions] + num_repetitions = count_repetitions(tokens) + while len(tokens) + num_repetitions > num_emissions: + tokens = tokens[:-1] + num_repetitions = count_repetitions(tokens) + + # Make sure transcript has the same length as tokens (it could be different just because of transliteration "œ" -> "oe") + transcript = "".join([labels[i][0] for i in tokens]) trellis = get_trellis(emission, tokens, blank_id = blank_id) @@ -31,7 +49,10 @@ def compute_alignment(audio, transcript, model): return labels, emission, trellis, segments, word_segments -def loose_get_char_index(dictionary, c): +def count_repetitions(tokens): + return sum([a==b for a,b in zip(tokens[1:], tokens[:-1])]) + +def loose_get_char_index(dictionary, c, default = None): i = dictionary.get(c, None) if i is None: # Try with alternative versions of the character @@ -52,7 +73,7 @@ def loose_get_char_index(dictionary, c): # If still not found if i is None: logger.warn("Cannot find label " + " / ".join(list(set([c] + other_char)))) - i = [] # [default] # Could be [] ... + i = [default] if default is not None else [] else: i = [i] return i @@ -124,7 +145,8 @@ def backtrack(trellis, emission, tokens, blank_id=0): if j == 0: break else: - raise ValueError("Failed to align") + logger.warn(f"Failed to align {len(tokens)} tokens") + return path return path[::-1] From f2d33d570e9f27b9bfe28b7e6b1ca05585469580 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 22 Dec 2022 18:50:57 +0100 Subject: [PATCH 091/172] remove unused stuff --- http_server/confparser.py | 18 ------------------ 1 file changed, 18 deletions(-) diff --git a/http_server/confparser.py b/http_server/confparser.py index 2396d71..d296dbb 100644 --- a/http_server/confparser.py +++ b/http_server/confparser.py @@ -7,24 +7,6 @@ def createParser() -> argparse.ArgumentParser: parser = argparse.ArgumentParser() - # SERVICE - parser.add_argument( - "--service_name", - type=str, - help="Service Name", - default=os.environ.get("SERVICE_NAME", "stt"), - ) - - # MODELS - parser.add_argument("--am_path", type=str, help="Acoustic Model Path", default="/opt/models/AM") - parser.add_argument("--lm_path", type=str, help="Decoding graph path", default="/opt/models/LM") - parser.add_argument( - "--config_path", - type=str, - help="Configuration files path", - default="/opt/config", - ) - # GUNICORN parser.add_argument("--service_port", type=int, help="Service port", default=80) parser.add_argument( From 3b6a839586a56eeba64e131c29751976affcb760 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 22 Dec 2022 19:08:42 +0100 Subject: [PATCH 092/172] Less cryptic warning --- stt/processing/word_alignment.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/stt/processing/word_alignment.py b/stt/processing/word_alignment.py index 34bacf0..fc16af4 100644 --- a/stt/processing/word_alignment.py +++ b/stt/processing/word_alignment.py @@ -72,7 +72,7 @@ def loose_get_char_index(dictionary, c, default = None): i = candidate # If still not found if i is None: - logger.warn("Cannot find label " + " / ".join(list(set([c] + other_char)))) + logger.warn("Character not correctly handled by alignment model: '" + "' / '".join(list(set([c] + other_char))) + "'") i = [default] if default is not None else [] else: i = [i] From b4bcd9a982062e04cf6f89f21686d7430621022b Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 22 Dec 2022 21:27:21 +0100 Subject: [PATCH 093/172] PEP8 formatting --- stt/processing/__init__.py | 26 +++-- 
stt/processing/alignment_model.py | 29 +++--- stt/processing/decoding.py | 54 +++++----- stt/processing/load_model.py | 32 +++--- stt/processing/text_normalize.py | 159 +++++++++++++++++++----------- stt/processing/utils.py | 10 +- stt/processing/word_alignment.py | 45 +++++---- 7 files changed, 220 insertions(+), 135 deletions(-) diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index dc3a6a6..81bd784 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -11,33 +11,41 @@ from .load_model import load_whisper_model, load_speechbrain_model -__all__ = ["logger", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] +__all__ = ["logger", "decode", "model", "alignment_model", + "load_audiofile", "load_wave_buffer"] # Set device -device = os.environ.get("DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu") +device = os.environ.get( + "DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu") try: device = torch.device(device) except Exception as err: raise Exception("Failed to set device: {}".format(str(err))) from err # Check language -available_languages = [k for k,v in whisper.tokenizer.LANGUAGES.items()] + [None] +available_languages = [ + k for k, v in whisper.tokenizer.LANGUAGES.items()] + [None] if get_default_language() not in available_languages: - raise RuntimeError(f"Langaue {get_default_language()} is not available. Available languages are: {available_languages}") + raise RuntimeError( + f"Language {get_default_language()} is not available. Available languages are: {available_languages}") # Load ASR model model_type = os.environ.get("MODEL", "medium") -logger.info(f"Loading Whisper model {model_type} ({'local' if os.path.isfile(model_type) else 'remote'})...") +logger.info( + f"Loading Whisper model {model_type} ({'local' if os.path.isfile(model_type) else 'remote'})...") start = time() try: - model = load_whisper_model(model_type, device = device) + model = load_whisper_model(model_type, device=device) except Exception as err: - raise Exception("Failed to load transcription model: {}".format(str(err))) from err + raise Exception( + "Failed to load transcription model: {}".format(str(err))) from err logger.info("Model loaded. (t={}s)".format(time() - start)) # Load alignment model -alignment_model_type = os.environ.get("ALIGNMENT_MODEL_TYPE", "/opt/linSTT_speechbrain_fr-FR_v1.0.0") +alignment_model_type = os.environ.get( + "ALIGNMENT_MODEL_TYPE", "/opt/linSTT_speechbrain_fr-FR_v1.0.0") logger.info(f"Loading alignment model...") start = time() -alignment_model = load_speechbrain_model(alignment_model_type, device = device, download_root = "/opt") +alignment_model = load_speechbrain_model( + alignment_model_type, device=device, download_root="/opt") logger.info("Alignment Model loaded. 
(t={}s)".format(time() - start)) diff --git a/stt/processing/alignment_model.py b/stt/processing/alignment_model.py index f6d52c8..309b7af 100644 --- a/stt/processing/alignment_model.py +++ b/stt/processing/alignment_model.py @@ -4,9 +4,11 @@ from stt import logger + def speechbrain_get_vocab(model): tokenizer = model.tokenizer - labels = [{'':" ", ' ⁇ ':""}.get(i,i).lower() for i in tokenizer.decode([[i] for i in range(tokenizer.get_piece_size())])] + labels = [{'': " ", ' ⁇ ': ""}.get(i, i).lower() for i in tokenizer.decode( + [[i] for i in range(tokenizer.get_piece_size())])] blank_id = labels.index("") return labels, blank_id @@ -14,13 +16,15 @@ def speechbrain_get_vocab(model): # The following limit is to handle the corner Case of too long audio segment (which is better to split it to avoid memory overflow). # But it is 2240400 / 16000 Hz ~ 140 seconds, which should not happen for segments detected by Whisper (usually one sentence). # Also note that Whisper works with 30 seconds segment, so there is chance that this limit is never reached. -MAX_LEN = 2240400 +MAX_LEN = 2240400 + -def speechbrain_compute_log_probas(model, audios, max_len = MAX_LEN): +def speechbrain_compute_log_probas(model, audios, max_len=MAX_LEN): # Single audio if not isinstance(audios, list): audios = [audios] - log_probas = speechbrain_compute_log_probas(model, audios, max_len = max_len) + log_probas = speechbrain_compute_log_probas( + model, audios, max_len=max_len) return log_probas[0] # Batch of audios (can occur when max_len is reached) @@ -33,30 +37,33 @@ def speechbrain_compute_log_probas(model, audios, max_len = MAX_LEN): chunks = [] i_audio = [] for a in audios: - chunks.extend([a[i:min(i+max_len, len(a))] for i in range(0, len(a), max_len)]) + chunks.extend([a[i:min(i+max_len, len(a))] + for i in range(0, len(a), max_len)]) i_audio.append(len(chunks)) if len(chunks) > 1: - logger.warning("Audio too long, splitting into {} chunks for alignment".format(len(chunks))) + logger.warning( + "Audio too long, splitting into {} chunks for alignment".format(len(chunks))) # Decode chunks of audio and concatenate results log_probas = [[] for i in range(len(audios))] for i in range(0, len(chunks), batch_size): chunk = chunks[i:min(i+batch_size, len(chunks))] log_probas_tmp = speechbrain_compute_log_probas(model, chunk) - for j in range(i,i+len(chunk)): + for j in range(i, i+len(chunk)): k = 0 while j >= i_audio[k]: k += 1 log_probas[k].append(log_probas_tmp[j-i]) - log_probas = [torch.cat(p, dim = 0) for p in log_probas] - log_probas, wav_lens = pack_sequences(log_probas, device = model.device) + log_probas = [torch.cat(p, dim=0) for p in log_probas] + log_probas, wav_lens = pack_sequences(log_probas, device=model.device) else: - batch, wav_lens = pack_sequences(audios, device = model.device) + batch, wav_lens = pack_sequences(audios, device=model.device) log_probas = model.forward(batch, wav_lens) log_probas = torch.log_softmax(log_probas, dim=-1) return log_probas -def pack_sequences(tensors, device = "cpu"): + +def pack_sequences(tensors, device="cpu"): if len(tensors) == 1: return tensors[0].unsqueeze(0).to(device), torch.Tensor([1.]).to(device) tensor = rnn_utils.pad_sequence(tensors, batch_first=True) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 48d1b66..f257fb1 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -13,21 +13,23 @@ # TODO: understand and remove this limitations torch.set_num_threads(1) + def get_default_language(): return 
os.environ.get("LANGUAGE", None) + def decode(audio: torch.Tensor, - model: whisper.model.Whisper, - alignment_model: "Any", - with_word_timestamps: bool, - language: str = None, - beam_size: int = None, - no_speech_threshold: float = 0.6, - logprob_threshold: float = -1.0, - compression_ratio_threshold: float = 2.4, - normalize_text_as_words = False, - remove_punctuation_from_words = False, - ) -> dict: + model: whisper.model.Whisper, + alignment_model: "Any", + with_word_timestamps: bool, + language: str = None, + beam_size: int = None, + no_speech_threshold: float = 0.6, + logprob_threshold: float = -1.0, + compression_ratio_threshold: float = 2.4, + normalize_text_as_words=False, + remove_punctuation_from_words=False, + ) -> dict: """Transcribe the audio data using Whisper with the defined model.""" result = {"text": "", "confidence-score": 0.0, "words": []} @@ -39,14 +41,14 @@ def decode(audio: torch.Tensor, logger.info(f"Transcribing audio with language {language}...") whisper_res = model.transcribe(audio, - language = language, - fp16 = fp16, - temperature = 0.0, # For deterministic results - beam_size = beam_size, - no_speech_threshold = no_speech_threshold, - logprob_threshold = logprob_threshold, - compression_ratio_threshold = compression_ratio_threshold - ) + language=language, + fp16=fp16, + temperature=0.0, # For deterministic results + beam_size=beam_size, + no_speech_threshold=no_speech_threshold, + logprob_threshold=logprob_threshold, + compression_ratio_threshold=compression_ratio_threshold + ) text = whisper_res["text"] text = remove_emoji(text).strip() @@ -59,7 +61,8 @@ def decode(audio: torch.Tensor, language = whisper_res["language"] result["text"] = text - result["confidence-score"] = np.exp(np.array([r["avg_logprob"] for r in segments])).mean() if len(segments) else 0.0 + result["confidence-score"] = np.exp(np.array([r["avg_logprob"] + for r in segments])).mean() if len(segments) else 0.0 if not with_word_timestamps: if not normalize_text_as_words: text = normalize_text(text, language) @@ -82,9 +85,11 @@ def decode(audio: torch.Tensor, if remove_punctuation_from_words: sub_text = remove_punctuation(sub_text) if not sub_text: - logger.warn(f"Lost text in segment {segment['start']}-{segment['end']}") + logger.warn( + f"Lost text in segment {segment['start']}-{segment['end']}") continue - labels, emission, trellis, segments, word_segments = compute_alignment(sub_audio, sub_text, alignment_model) + labels, emission, trellis, segments, word_segments = compute_alignment( + sub_audio, sub_text, alignment_model) ratio = len(sub_audio) / (trellis.size(0) * SAMPLE_RATE) sub_words = sub_text.split() if len(sub_words) == len(word_segments): @@ -96,7 +101,8 @@ def decode(audio: torch.Tensor, "conf": segment.score, }) else: - logger.warn(f"Alignment failed. Results might differ on some words.\nNumber of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}") + logger.warn( + f"Alignment failed. 
Results might differ on some words.\nNumber of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}") for segment in word_segments: result["words"].append({ "word": segment.label, @@ -106,5 +112,3 @@ def decode(audio: torch.Tensor, }) return result - - diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index da5d98c..27fdf9a 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -5,31 +5,39 @@ import huggingface_hub import speechbrain as sb -def load_whisper_model(model_type_or_file, device = "cpu", download_root = "/opt"): - model = whisper.load_model(model_type_or_file, device = device, download_root = os.path.join(download_root, "whisper")) +def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): + + model = whisper.load_model(model_type_or_file, device=device, + download_root=os.path.join(download_root, "whisper")) model.eval() model.requires_grad_(False) return model -def load_speechbrain_model(source, device = "cpu", download_root = "/opt"): - + +def load_speechbrain_model(source, device="cpu", download_root="/opt"): + if os.path.isdir(source): yaml_file = os.path.join(source, "hyperparams.yaml") - assert os.path.isfile(yaml_file), f"Hyperparams file {yaml_file} not found" + assert os.path.isfile( + yaml_file), f"Hyperparams file {yaml_file} not found" else: try: - yaml_file = huggingface_hub.hf_hub_download(repo_id=source, filename="hyperparams.yaml", cache_dir = os.path.join(download_root, "huggingface/hub")) + yaml_file = huggingface_hub.hf_hub_download( + repo_id=source, filename="hyperparams.yaml", cache_dir=os.path.join(download_root, "huggingface/hub")) except requests.exceptions.HTTPError: yaml_file = None - overrides = make_yaml_overrides(yaml_file, {"save_path": os.path.join(download_root, "speechbrain")}) + overrides = make_yaml_overrides( + yaml_file, {"save_path": os.path.join(download_root, "speechbrain")}) savedir = os.path.join(download_root, "speechbrain") try: - model = sb.pretrained.EncoderASR.from_hparams(source = source, run_opts= {"device": device}, savedir = savedir, overrides = overrides) + model = sb.pretrained.EncoderASR.from_hparams( + source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides) except ValueError: - model = sb.pretrained.EncoderDecoderASR.from_hparams(source = source, run_opts= {"device": device}, savedir = savedir, overrides = overrides) + model = sb.pretrained.EncoderDecoderASR.from_hparams( + source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides) model.train(False) model.requires_grad_(False) @@ -42,7 +50,8 @@ def make_yaml_overrides(yaml_file, key_values): yaml_file: path to yaml file key_values: dict of key values to override """ - if yaml_file is None: return None + if yaml_file is None: + return None override = {} with open(yaml_file, "r") as f: @@ -58,5 +67,6 @@ def make_yaml_overrides(yaml_file, key_values): elif ":" in line: child = line.strip().split(":")[0].strip() if child in key_values: - override[parent] = override.get(parent, {}) | {child: key_values[child]} + override[parent] = override.get(parent, {}) | { + child: key_values[child]} return override diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index af9fdbd..7e2f6fb 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -1,25 +1,30 @@ import math import re -#import string +# import string import unicodedata from num2words 
import num2words from stt import logger from .utils import flatten -_punctuations = '!"#$%&()*+,/:;<=>?@[\\]^_`{|}~«»¿' # string.punctuation, plus Whisper specific "«»¿", minus apostrophe "'", dash "-", and dot "." (which will be processed as special) +# string.punctuation, plus Whisper specific "«»¿", minus apostrophe "'", dash "-", and dot "." (which will be processed as special) +_punctuations = '!"#$%&()*+,/:;<=>?@[\\]^_`{|}~«»¿' + def remove_punctuation(text: str) -> str: text = text.translate(str.maketrans("", "", _punctuations)) # We don't remove dots inside words (e.g. "ab@gmail.com") - text = re.sub(r"\.(\s)",r"\1", text+" ").strip() + text = re.sub(r"\.(\s)", r"\1", text+" ").strip() return collapse_whitespace(text) + _whitespace_re = re.compile(r'[^\S\r\n]+') + def collapse_whitespace(text): return re.sub(_whitespace_re, ' ', text).strip() + def transliterate(c): # Transliterates a character to its closest ASCII equivalent. # Example: transliterate("à ß œ fl") = "a ss oe fl" @@ -30,6 +35,7 @@ def transliterate(c): c = re.sub("ß", "ss", c) return unicodedata.normalize("NFKD", c).encode("ascii", "ignore").decode("ascii") + def remove_emoji(text): # Remove emojis return re.sub(r"[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+", "", text) @@ -42,43 +48,54 @@ def normalize_text(text: str, lang: str) -> str: coma = "," if lang in ["fr"] else "\." for c in _currencies: if c in text: - text = re.sub(r"\b(\d+)" + coma + r"(\d+)\s*" + c, r"\1 " + c + r" \2", text) - + text = re.sub(r"\b(\d+)" + coma + r"(\d+)\s*" + + c, r"\1 " + c + r" \2", text) + # Roman digits if re.search(r"[IVX]", text): if lang == "en": - digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|st|nd|rd|th)?\b", text) + digits = re.findall( + r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|st|nd|rd|th)?\b", text) digits = ["".join(d) for d in digits] elif lang == "fr": - digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|ème|eme|e|er|ère)?\b", text) + digits = re.findall( + r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|ème|eme|e|er|ère)?\b", text) digits = ["".join(d) for d in digits] else: digits = [] if digits: - digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) + digits = sorted(list(set(digits)), reverse=True, + key=lambda x: (len(x), x)) for s in digits: filtered = re.sub("[a-z]", "", s) ordinal = filtered != s digit = roman_to_decimal(filtered) - v = undigit(str(digit), lang=lang, to= "ordinal" if ordinal else "cardinal") + v = undigit(str(digit), lang=lang, + to="ordinal" if ordinal else "cardinal") text = re.sub(r"\b" + s + r"\b", v, text) # Ordinal digits if lang == "en": - digits = re.findall(r"\b\d*1(?:st)|\d*2(?:nd)|\d*3(?:rd)|\d+(?:º|th)\b", text) + digits = re.findall( + r"\b\d*1(?:st)|\d*2(?:nd)|\d*3(?:rd)|\d+(?:º|th)\b", text) elif lang == "fr": - digits = re.findall(r"\b1(?:ère|ere|er|re|r)|2(?:nd|nde)|\d+(?:º|ème|eme|e)\b", text) + digits = re.findall( + r"\b1(?:ère|ere|er|re|r)|2(?:nd|nde)|\d+(?:º|ème|eme|e)\b", text) else: - logger.warn(f"Language {lang} not supported for normalization. Some words might be mis-localized.") + logger.warn( + f"Language {lang} not supported for normalization. 
Some words might be mis-localized.") digits = [] if digits: - digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) + digits = sorted(list(set(digits)), reverse=True, + key=lambda x: (len(x), x)) for digit in digits: - word = undigit(re.findall(r"\d+", digit)[0], to= "ordinal", lang = lang) + word = undigit(re.findall(r"\d+", digit) + [0], to="ordinal", lang=lang) text = re.sub(r'\b'+str(digit)+r'\b', word, text) # Cardinal digits - digits = re.findall(r"(?:\-?\b[\d/]*\d+(?: \d\d\d)+\b)|(?:\-?\d[/\d]*)",text) + digits = re.findall( + r"(?:\-?\b[\d/]*\d+(?: \d\d\d)+\b)|(?:\-?\d[/\d]*)", text) digits = list(map(lambda s: s.strip(r"[/ ]"), digits)) digits = list(set(digits)) digits = digits + flatten([c.split() for c in digits if " " in c]) @@ -90,8 +107,8 @@ def normalize_text(text: str, lang: str) -> str: continue numslash = len(re.findall("/", digitf)) if numslash == 0: - word = undigit(digitf, lang = lang) - elif numslash == 1: # Fraction or date + word = undigit(digitf, lang=lang) + elif numslash == 1: # Fraction or date i = digitf.index("/") is_date = False if len(digitf[i+1:]) == 2: @@ -99,19 +116,22 @@ def normalize_text(text: str, lang: str) -> str: first = int(digitf[:i]) second = int(digitf[i+1:]) is_date = first > 0 and first < 32 and second > 0 and second < 13 - except: pass + except: + pass if is_date: first = digitf[:i].lstrip("0") - use_ordinal = (lang == "fr" and first == "1") or (lang != "fr" and first[-1] in ["1", "2", "3"]) - first = undigit(first, lang = lang, to="ordinal" if use_ordinal else "cardinal") + use_ordinal = (lang == "fr" and first == "1") or ( + lang != "fr" and first[-1] in ["1", "2", "3"]) + first = undigit(first, lang=lang, + to="ordinal" if use_ordinal else "cardinal") second = _int_to_month[second] else: - first = undigit(digitf[:i], lang = lang) - second = undigit(digitf[i+1:], to="denominator", lang = lang) + first = undigit(digitf[:i], lang=lang) + second = undigit(digitf[i+1:], to="denominator", lang=lang) if float(digitf[:i]) > 2. 
and second[-1] != "s": second += "s" word = first + " " + second - elif numslash == 2: # Maybe a date + elif numslash == 2: # Maybe a date i1 = digitf.index("/") i2 = digitf.index("/", i1+1) is_date = False @@ -121,18 +141,24 @@ def normalize_text(text: str, lang: str) -> str: second = int(digitf[i1+1:i2]) third = int(digitf[i2+1:]) is_date = first > 0 and first < 32 and second > 0 and second < 13 and third > 1000 - except: pass - third = undigit(digitf[i2+1:], lang = lang) + except: + pass + third = undigit(digitf[i2+1:], lang=lang) if is_date: first = digitf[:i].lstrip("0") - use_ordinal = (lang == "fr" and first == "1") or (lang != "fr" and first[-1] in ["1", "2", "3"]) - first = undigit(first, lang = lang, to="ordinal" if use_ordinal else "cardinal") - second = _int_to_month.get(lang, {}).get(int(digitf[i1+1:i2]), digitf[i1+1:i2]) + use_ordinal = (lang == "fr" and first == "1") or ( + lang != "fr" and first[-1] in ["1", "2", "3"]) + first = undigit(first, lang=lang, + to="ordinal" if use_ordinal else "cardinal") + second = _int_to_month.get(lang, {}).get( + int(digitf[i1+1:i2]), digitf[i1+1:i2]) word = " ".join([first, second, third]) else: - word = " / ".join([undigit(s, lang = lang) for s in digitf.split('/')]) + word = " / ".join([undigit(s, lang=lang) + for s in digitf.split('/')]) else: - word = " / ".join([undigit(s, lang = lang) for s in digitf.split('/')]) + word = " / ".join([undigit(s, lang=lang) + for s in digitf.split('/')]) if " " in digit: text = re.sub(r'\b'+str(digit)+r'\b', " "+word+" ", text) else: @@ -145,37 +171,52 @@ def normalize_text(text: str, lang: str) -> str: return collapse_whitespace(text) + def undigit(str, lang, to="cardinal"): - str = re.sub(" ","", str) + str = re.sub(" ", "", str) if to == "denominator": - assert lang == "fr" - if str == "2": return "demi" - if str == "3": return "tiers" - if str == "4": return "quart" + if lang == "fr": + if str == "2": + return "demi" + if str == "3": + return "tiers" + if str == "4": + return "quart" + elif lang == "en": + if str == "2": + return "half" + if str == "4": + return "quarter" + elif lang == "es": + if str == "2": + return "mitad" + if str == "3": + return "tercio" to = "ordinal" if str.startswith("0") and to == "cardinal": numZeros = len(re.findall(r"0+", str)[0]) if numZeros < len(str): - return numZeros * (my_num2words(0, lang=lang, to="cardinal")+" ") + my_num2words(float(str), lang=lang, to=to) - return my_num2words(float(str), lang=lang, to=to) + return numZeros * (robust_num2words(0, lang=lang)+" ") + robust_num2words(float(str), lang=lang, to=to) + return robust_num2words(float(str), lang=lang, to=to) -def my_num2words(x, lang, to = "cardinal", orig = ""): +def robust_num2words(x, lang, to="cardinal", orig=""): """ Bugfix for num2words """ try: + res = num2words(x, lang=lang, to=to) if lang == "fr" and to == "ordinal": - return num2words(x, lang=lang, to=to).replace("vingtsième", "vingtième") - else: - return num2words(x, lang=lang, to=to) + res = res.replace("vingtsième", "vingtième") + return res except OverflowError: - if x == math.inf: # ! - return " ".join(my_num2words(xi, lang=lang, to=to) for xi in orig) - if x == -math.inf: # ! - return "moins " + my_num2words(-x, lang=lang, to=to, orig=orig.replace("-" , "")) + if x == math.inf: # ! + return " ".join(robust_num2words(xi, lang=lang, to=to) for xi in orig) + if x == -math.inf: # ! 
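# The parsed value overflowed to negative infinity: verbalize the positive
# part recursively and prepend "moins" (French for "minus") -- note that this
# fallback wording is French-specific.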
+ return "moins " + robust_num2words(-x, lang=lang, to=to, orig=orig.replace("-", "")) # TODO: print a warning - return my_num2words(x//10, lang=lang, to=to) + return robust_num2words(x//10, lang=lang, to=to) + def roman_to_decimal(str): def value(r): @@ -214,6 +255,7 @@ def value(r): i = i + 1 return res + _int_to_month = { "fr": { 1: "janvier", @@ -251,10 +293,10 @@ def value(r): "fr": { "%": "pour cents", "÷": "divisé par", - "\*": "fois", # ? + "\*": "fois", # ? "×": "fois", "±": "plus ou moins", - "\+": "plus", + "\+": "plus", "&": "et", "@": "arobase", "m²": "mètres carrés", @@ -275,15 +317,15 @@ def value(r): "£": "livres", "¥": "yens", # Below: not in Whisper tokens - #"₩": "wons", - #"₽": "roubles", - #"₹": "roupies", - #"₺": "liras", - #"₪": "shekels", - #"₴": "hryvnias", - #"₮": "tugriks", - #"℃": "degrés Celsius", - #"℉": "degrés Fahrenheit", + # "₩": "wons", + # "₽": "roubles", + # "₹": "roupies", + # "₺": "liras", + # "₪": "shekels", + # "₴": "hryvnias", + # "₮": "tugriks", + # "℃": "degrés Celsius", + # "℉": "degrés Fahrenheit", # "Ω": "ohms", # "Ω": "ohms", # "K": "kelvins", @@ -292,7 +334,7 @@ def value(r): "en": { "%": "percent", "÷": "divided by", - "\*": "times", # ? + "\*": "times", # ? "×": "times", "±": "plus or minus", "\+": "plus", @@ -317,4 +359,3 @@ def value(r): "¥": "yens", } } - diff --git a/stt/processing/utils.py b/stt/processing/utils.py index 1e35c91..5ff706a 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -6,10 +6,12 @@ import torchaudio import whisper -def conform_audio(audio, sample_rate = 16_000): + +def conform_audio(audio, sample_rate=16_000): if sample_rate != whisper.audio.SAMPLE_RATE: # Down or Up sample to the right sampling rate - audio = torchaudio.transforms.Resample(sample_rate, whisper.audio.SAMPLE_RATE)(audio) + audio = torchaudio.transforms.Resample( + sample_rate, whisper.audio.SAMPLE_RATE)(audio) if audio.shape[0] > 1: # Stereo to mono # audio = torchaudio.transforms.DownmixMono()(audio, channels_first = True) @@ -18,6 +20,7 @@ def conform_audio(audio, sample_rate = 16_000): audio = audio.squeeze(0) return audio + def load_audiofile(path): if not os.path.isfile(path): raise RuntimeError("File not found: %s" % path) @@ -36,9 +39,10 @@ def load_wave_buffer(file_buffer): file_content = wavio.read(file_buffer_io) sample_rate = file_content.rate audio = torch.from_numpy(file_content.data.astype(np.float32)/32768) - audio = audio.transpose(0,1) + audio = audio.transpose(0, 1) return conform_audio(audio, sample_rate) + def flatten(l): """ flatten a list of lists diff --git a/stt/processing/word_alignment.py b/stt/processing/word_alignment.py index fc16af4..7dd0c8f 100644 --- a/stt/processing/word_alignment.py +++ b/stt/processing/word_alignment.py @@ -29,7 +29,8 @@ def compute_alignment(audio, transcript, model): if len(tokens) + num_repetitions > num_emissions: # It will be impossible to find a path... # It can happen when Whisper is lost in a loop (ex: "Ha ha ha ha ...") - logger.warn(f"Got too many characters from Whisper. Shrinking to the first characters.") + logger.warn( + f"Got too many characters from Whisper. 
Shrinking to the first characters.") tokens = tokens[:num_emissions] num_repetitions = count_repetitions(tokens) while len(tokens) + num_repetitions > num_emissions: @@ -39,25 +40,28 @@ def compute_alignment(audio, transcript, model): # Make sure transcript has the same length as tokens (it could be different just because of transliteration "œ" -> "oe") transcript = "".join([labels[i][0] for i in tokens]) - trellis = get_trellis(emission, tokens, blank_id = blank_id) + trellis = get_trellis(emission, tokens, blank_id=blank_id) + + path = backtrack(trellis, emission, tokens, blank_id=blank_id) - path = backtrack(trellis, emission, tokens, blank_id = blank_id) - segments = merge_repeats(transcript, path) word_segments = merge_words(segments) return labels, emission, trellis, segments, word_segments + def count_repetitions(tokens): - return sum([a==b for a,b in zip(tokens[1:], tokens[:-1])]) + return sum([a == b for a, b in zip(tokens[1:], tokens[:-1])]) + -def loose_get_char_index(dictionary, c, default = None): +def loose_get_char_index(dictionary, c, default=None): i = dictionary.get(c, None) if i is None: # Try with alternative versions of the character tc = transliterate(c) - other_char = list(set([c.lower(), c.upper(), tc, tc.lower(), tc.upper()])) + other_char = list( + set([c.lower(), c.upper(), tc, tc.lower(), tc.upper()])) for c2 in other_char: i = dictionary.get(c2, None) if i is not None: @@ -67,19 +71,21 @@ def loose_get_char_index(dictionary, c, default = None): if i is None: for c2 in other_char: if len(c2) > 1: - candidate = [dictionary[c3] for c3 in c2 if c3 in dictionary] + candidate = [dictionary[c3] + for c3 in c2 if c3 in dictionary] if len(candidate) > 0 and (i is None or len(candidate) > len(i)): i = candidate # If still not found if i is None: - logger.warn("Character not correctly handled by alignment model: '" + "' / '".join(list(set([c] + other_char))) + "'") + logger.warn("Character not correctly handled by alignment model: '" + + "' / '".join(list(set([c] + other_char))) + "'") i = [default] if default is not None else [] else: i = [i] return i -def get_trellis(emission, tokens, blank_id=0, use_max = False): +def get_trellis(emission, tokens, blank_id=0, use_max=False): num_frame = emission.size(0) num_tokens = len(tokens) @@ -97,15 +103,16 @@ def get_trellis(emission, tokens, blank_id=0, use_max = False): # Score for staying at the same token trellis[t, 1:] + emission[t, blank_id], torch.maximum(trellis[t, 1:] + emission[t, tokens], - # Score for changing to the next token - trellis[t, :-1] + emission[t, tokens]) + # Score for changing to the next token + trellis[t, :-1] + emission[t, tokens]) ) if use_max else torch.logaddexp( trellis[t, 1:] + emission[t, blank_id], torch.logaddexp(trellis[t, 1:] + emission[t, tokens], - trellis[t, :-1] + emission[t, tokens]) + trellis[t, :-1] + emission[t, tokens]) ) return trellis + @dataclass class Point: token_index: int @@ -135,7 +142,8 @@ def backtrack(trellis, emission, tokens, blank_id=0): changed = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]] # 2. Store the path with frame-wise probability. - prob = emission[t - 1, tokens[j - 1] if changed > stayed else 0].exp().item() + prob = emission[t - 1, tokens[j - 1] + if changed > stayed else 0].exp().item() # Return token index and time index in non-trellis coordinate. 
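# Each step walks one frame back in time (t -> t-1); the token index j is only
# decremented when the "changed" transition scores higher than staying on the
# same token, i.e. when this frame is taken as the emission of token j-1.
# `prob` keeps the frame-wise probability of that token (or of the symbol at
# index 0, treated as the blank, when the path stayed in place).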
path.append(Point(j - 1, t - 1, prob)) @@ -184,6 +192,7 @@ def merge_repeats(transcript, path): i1 = i2 return segments + def merge_words(segments, separator=" "): words = [] i1, i2 = 0, 0 @@ -192,10 +201,12 @@ def merge_words(segments, separator=" "): if i1 != i2: segs = segments[i1:i2] word = "".join([seg.label for seg in segs]) - score = sum(seg.score * seg.length for seg in segs) / sum(seg.length for seg in segs) - words.append(Segment(word, segments[i1].start, segments[i2 - 1].end, score)) + score = sum(seg.score * seg.length for seg in segs) / \ + sum(seg.length for seg in segs) + words.append( + Segment(word, segments[i1].start, segments[i2 - 1].end, score)) i1 = i2 + 1 i2 = i1 else: i2 += 1 - return words \ No newline at end of file + return words From f1c3aaa60d49b393882569f7d7f321d5cfffadeb Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 3 Jan 2023 08:42:47 +0100 Subject: [PATCH 094/172] Support more model types for word alignment (transformers, torchaudio) --- requirements.txt | 1 + stt/processing/__init__.py | 23 ++--- stt/processing/alignment_model.py | 143 ++++++++++++++++++++++++++++-- stt/processing/decoding.py | 7 +- stt/processing/load_model.py | 67 ++++++++++++++ stt/processing/text_normalize.py | 2 + stt/processing/word_alignment.py | 6 +- 7 files changed, 224 insertions(+), 25 deletions(-) diff --git a/requirements.txt b/requirements.txt index a93dc9f..c4e4fd4 100644 --- a/requirements.txt +++ b/requirements.txt @@ -8,6 +8,7 @@ num2words pyyaml>=5.4.1 requests>=2.26.0 speechbrain +transformers wavio>=0.0.4 websockets git+https://github.com/openai/whisper.git \ No newline at end of file diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 81bd784..70b6695 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -6,14 +6,17 @@ import whisper from stt import logger -from stt.processing.decoding import decode, get_default_language +from stt.processing.decoding import decode, get_language from stt.processing.utils import load_wave_buffer, load_audiofile -from .load_model import load_whisper_model, load_speechbrain_model +from .load_model import load_whisper_model, load_alignment_model, get_alignment_model, get_model_type __all__ = ["logger", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] +# Set informative log +logger.setLevel(logging.INFO) + # Set device device = os.environ.get( "DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu") @@ -23,11 +26,12 @@ raise Exception("Failed to set device: {}".format(str(err))) from err # Check language +language = get_language() available_languages = [ k for k, v in whisper.tokenizer.LANGUAGES.items()] + [None] -if get_default_language() not in available_languages: +if language not in available_languages: raise RuntimeError( - f"Language {get_default_language()} is not available. Available languages are: {available_languages}") + f"Language {get_language()} is not available. Available languages are: {available_languages}") # Load ASR model model_type = os.environ.get("MODEL", "medium") @@ -42,10 +46,9 @@ logger.info("Model loaded. 
(t={}s)".format(time() - start)) # Load alignment model -alignment_model_type = os.environ.get( - "ALIGNMENT_MODEL_TYPE", "/opt/linSTT_speechbrain_fr-FR_v1.0.0") -logger.info(f"Loading alignment model...") +alignment_model_name = get_alignment_model(language) +logger.info(f"Loading alignment model {alignment_model_name} ({'local' if os.path.isfile(alignment_model_name) else 'remote'})...") start = time() -alignment_model = load_speechbrain_model( - alignment_model_type, device=device, download_root="/opt") -logger.info("Alignment Model loaded. (t={}s)".format(time() - start)) +alignment_model = load_alignment_model( + alignment_model_name, device=device, download_root="/opt") +logger.info(f"Alignment Model of type {get_model_type(alignment_model)} loaded. (t={time() - start}s)") diff --git a/stt/processing/alignment_model.py b/stt/processing/alignment_model.py index 309b7af..b6ef333 100644 --- a/stt/processing/alignment_model.py +++ b/stt/processing/alignment_model.py @@ -3,32 +3,90 @@ import torch.nn.utils.rnn as rnn_utils from stt import logger +from .load_model import get_model_type +import whisper -def speechbrain_get_vocab(model): +################################################################################ +# Get list of labes (and blank_id) from model + + +def get_vocab(model): + type = get_model_type(model) + if type == "speechbrain": + labels, blank_id = get_vocab_speechbrain(model) + elif type == "transformers": + labels, blank_id = get_vocab_transformers(model) + else: + labels, blank_id = get_vocab_torchaudio(model) + assert isinstance(labels, list) and min( + [isinstance(l, str) for l in labels]), "labels must be a list of strings" + return norm_labels(labels, blank_id), blank_id + + +def get_vocab_speechbrain(model): tokenizer = model.tokenizer - labels = [{'': " ", ' ⁇ ': ""}.get(i, i).lower() for i in tokenizer.decode( + # Is this general enough? + labels = [{'': " ", ' ⁇ ': ""}.get(i, i) for i in tokenizer.decode( [[i] for i in range(tokenizer.get_piece_size())])] blank_id = labels.index("") return labels, blank_id +def get_vocab_torchaudio(model_and_labels): + _, labels = model_and_labels + labels = list(labels) + # WTF : blank_id = labels.index("-") ...? Is it general enough? + blank_id = 0 + return labels, blank_id + + +def get_vocab_transformers(model_and_processor): + _, processor = model_and_processor + labels_dict = dict((v, k) + for k, v in processor.tokenizer.get_vocab().items()) + labels = [labels_dict[i] for i in range(len(labels_dict))] + blank_id = labels.index("") + return labels, blank_id + + +def norm_labels(labels, blank_id): + labels[blank_id] = "" + return [l if l != "|" else " " for l in labels] + +################################################################################ +# Compute log-probabilities from model + + # The following limit is to handle the corner Case of too long audio segment (which is better to split it to avoid memory overflow). # But it is 2240400 / 16000 Hz ~ 140 seconds, which should not happen for segments detected by Whisper (usually one sentence). # Also note that Whisper works with 30 seconds segment, so there is chance that this limit is never reached. 
MAX_LEN = 2240400 -def speechbrain_compute_log_probas(model, audios, max_len=MAX_LEN): +def compute_logprobas(model, audios, max_len=MAX_LEN): + # Single audio if not isinstance(audios, list): audios = [audios] - log_probas = speechbrain_compute_log_probas( - model, audios, max_len=max_len) - return log_probas[0] + logits = compute_logprobas(model, audios, max_len=max_len) + return logits[0] # Batch of audios (can occur when max_len is reached) assert len(audios) > 0, "audios must be a non-empty list" + + type = get_model_type(model) + if type == "speechbrain": + logits = compute_logits_speechbrain(model, audios, max_len) + elif type == "transformers": + logits = compute_logits_transformers(model, audios, max_len) + else: + logits = compute_logits_torchaudio(model, audios, max_len) + + return torch.log_softmax(logits, dim=-1) + + +def compute_logits_speechbrain(model, audios, max_len): if not isinstance(audios[0], torch.Tensor): audios = [torch.from_numpy(a) for a in audios] if max([len(a) for a in audios]) > max_len: @@ -47,7 +105,7 @@ def speechbrain_compute_log_probas(model, audios, max_len=MAX_LEN): log_probas = [[] for i in range(len(audios))] for i in range(0, len(chunks), batch_size): chunk = chunks[i:min(i+batch_size, len(chunks))] - log_probas_tmp = speechbrain_compute_log_probas(model, chunk) + log_probas_tmp = compute_logits_speechbrain(model, chunk) for j in range(i, i+len(chunk)): k = 0 while j >= i_audio[k]: @@ -59,8 +117,7 @@ def speechbrain_compute_log_probas(model, audios, max_len=MAX_LEN): batch, wav_lens = pack_sequences(audios, device=model.device) log_probas = model.forward(batch, wav_lens) - log_probas = torch.log_softmax(log_probas, dim=-1) - return log_probas + return log_probas.cpu().detach() def pack_sequences(tensors, device="cpu"): @@ -71,3 +128,71 @@ def pack_sequences(tensors, device="cpu"): maxwav_lens = max(wav_lens) wav_lens = torch.Tensor([l/maxwav_lens for l in wav_lens]) return tensor.to(device), wav_lens.to(device) + + +def compute_logits_transformers(model_and_processor, audios, max_len): + + model, processor = model_and_processor + + # can be different from processor.feature_extractor.sampling_rate + sample_rate = whisper.audio.SAMPLE_RATE + device = model.device + + audios = [audio.numpy() for audio in audios] + processed_batch = processor(audios, sampling_rate=sample_rate) + + padded_batch = processor.pad( + processed_batch, + padding=True, + max_length=None, + pad_to_multiple_of=None, + return_tensors="pt", + ) + + l = padded_batch.input_values.shape[1] + + with torch.inference_mode(): + if l > max_len: + # Split batch in smaller chunks + logger.warning( + "Audio too long, splitting into {} chunks for alignment".format(math.ceil(l / max_len))) + logits = [] + for i in range(0, l, max_len): + j = min(i + max_len, l) + logits.append(model(padded_batch.input_values[:, i:j].to(device), + attention_mask=padded_batch.attention_mask[:, i:j].to(device)).logits) + logits = torch.cat(logits, dim=1) + else: + logits = model(padded_batch.input_values.to(device), + attention_mask=padded_batch.attention_mask.to(device)).logits + + return logits.cpu().detach() + + +def compute_logits_torchaudio(model_and_labels, audios, max_len): + # TODO: factorize with compute_logits_transformers, and add support for batch of audios + + model, _ = model_and_labels + + all_logits = [] + + with torch.inference_mode(): + for audio in audios: + l = len(audio) + if l > max_len: + # Split audio in smaller chunks + logger.warning( + "Audio too long, splitting into {} chunks for 
alignment".format(math.ceil(l / max_len))) + logits = [] + for i in range(0, l, max_len): + j = min(i + max_len, l) + logits.append(model(audio[i:j].unsqueeze(0))[0]) + logits = torch.cat(logits, dim=1) + else: + logits, _ = model(audio.unsqueeze(0)) + + all_logits.append(logits.cpu().detach()) + + assert len(all_logits) == 1 # TODO: support batch of audios + + return all_logits[0] diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index f257fb1..bb56d0f 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -9,12 +9,13 @@ from stt import logger from .word_alignment import compute_alignment from .text_normalize import remove_punctuation, normalize_text, remove_emoji +from .load_model import load_alignment_model, get_alignment_model # TODO: understand and remove this limitations torch.set_num_threads(1) -def get_default_language(): +def get_language(): return os.environ.get("LANGUAGE", None) @@ -36,7 +37,7 @@ def decode(audio: torch.Tensor, fp16 = model.device != torch.device("cpu") if language is None: - language = get_default_language() + language = get_language() logger.info(f"Transcribing audio with language {language}...") @@ -59,6 +60,8 @@ def decode(audio: torch.Tensor, segments = whisper_res["segments"] if language is None: language = whisper_res["language"] + if alignment_model is None: + alignment_model = load_alignment_model(get_alignment_model(language), device=model.device) result["text"] = text result["confidence-score"] = np.exp(np.array([r["avg_logprob"] diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 27fdf9a..4add720 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -4,6 +4,29 @@ import requests import huggingface_hub import speechbrain as sb +import transformers +import torchaudio + +# Source: https://github.com/m-bain/whisperX (in whisperx/transcribe.py) +ALIGNMENT_MODELS = { + "fr": "/opt/linSTT_speechbrain_fr-FR_v1.0.0", + # "fr": "VOXPOPULI_ASR_BASE_10K_FR", + "en": "WAV2VEC2_ASR_BASE_960H", + # "en": "jonatasgrosman/wav2vec2-large-xlsr-53-english", + "de": "VOXPOPULI_ASR_BASE_10K_DE", + "es": "VOXPOPULI_ASR_BASE_10K_ES", + "it": "VOXPOPULI_ASR_BASE_10K_IT", + "nl": "jonatasgrosman/wav2vec2-large-xlsr-53-dutch", + "ja": "jonatasgrosman/wav2vec2-large-xlsr-53-japanese", + "zh": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn", + "uk": "Yehor/wav2vec2-xls-r-300m-uk-with-small-lm", +} + + +def get_alignment_model(language): + source = os.environ.get("ALIGNMENT_MODEL") + if not source: + return ALIGNMENT_MODELS.get(language, ALIGNMENT_MODELS["fr"]) def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): @@ -16,6 +39,20 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): return model +def load_alignment_model(source, device="cpu", download_root="/opt"): + + if source in torchaudio.pipelines.__all__: + return load_torchaudio_model(source, device=device, download_root=download_root) + try: + return load_transformers_model(source, device=device, download_root=download_root) + except Exception as err1: + try: + return load_speechbrain_model(source, device=device, download_root=download_root) + except Exception as err2: + raise Exception( + f"Failed to load alignment model:\n<<< transformers <<<\n{str(err1)}\n<<< speechbrain <<<\n{str(err2)}") from err2 + + def load_speechbrain_model(source, device="cpu", download_root="/opt"): if os.path.isdir(source): @@ -44,6 +81,36 @@ def load_speechbrain_model(source, device="cpu", 
download_root="/opt"): return model +def load_transformers_model(source, device="cpu", download_root="/opt"): + + model = transformers.Wav2Vec2ForCTC.from_pretrained(source).to(device) + processor = transformers.Wav2Vec2Processor.from_pretrained(source) + + model.eval() + model.requires_grad_(False) + return model, processor + + +def load_torchaudio_model(source, device="cpu", download_root="/opt"): + + bundle = torchaudio.pipelines.__dict__[source] + model = bundle.get_model().to(device) + labels = bundle.get_labels() + + model.eval() + model.requires_grad_(False) + return model, labels + + +def get_model_type(model): + if not isinstance(model, tuple): + return "speechbrain" + assert len(model) == 2, "Invalid model type" + if isinstance(model[0], transformers.Wav2Vec2ForCTC): + return "transformers" + return "torchaudio" + + def make_yaml_overrides(yaml_file, key_values): """ return a dictionary of overrides to be used with speechbrain (hyperyaml files) diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index 7e2f6fb..6065eaa 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -169,6 +169,8 @@ def normalize_text(text: str, lang: str) -> str: for k, v in symbol_table.items(): text = re.sub(k, " "+v+" ", text) + text = re.sub(r" \.",".", text) + return collapse_whitespace(text) diff --git a/stt/processing/word_alignment.py b/stt/processing/word_alignment.py index 7dd0c8f..4e32bdc 100644 --- a/stt/processing/word_alignment.py +++ b/stt/processing/word_alignment.py @@ -5,8 +5,7 @@ import torch from stt import logger -from .alignment_model import speechbrain_compute_log_probas as compute_log_probas -from .alignment_model import speechbrain_get_vocab as get_vocab +from .alignment_model import compute_logprobas, get_vocab from .utils import flatten from .text_normalize import transliterate @@ -14,10 +13,9 @@ def compute_alignment(audio, transcript, model): """ Compute the alignment of the audio and a transcript, for a given model that returns log-probabilities on the charset defined the transcript.""" - emission = compute_log_probas(model, audio) + emission = compute_logprobas(model, audio) labels, blank_id = get_vocab(model) labels = labels[:emission.shape[1]] - labels[blank_id] = " " dictionary = {c: i for i, c in enumerate(labels)} default = labels.index("-") if "-" in labels else None From 766a5d574aab68995d786caa99524a90307abd3c Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 3 Jan 2023 09:21:14 +0100 Subject: [PATCH 095/172] ignore temporary files --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 0b8d9ad..c7b414a 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ start_container.sh .env* test/* +tmp* \ No newline at end of file From 5da1a65f63d7f0466717ee71ecad2436328c261f Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 3 Jan 2023 10:36:04 +0100 Subject: [PATCH 096/172] ensure that word timestamps are increasing --- stt/processing/decoding.py | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index bb56d0f..ed0eef1 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -76,6 +76,17 @@ def decode(audio: torch.Tensor, # Compute word timestamps result["words"] = [] max_t = audio.shape[0] + + # Ensure that the segments start / end time are increasing + # (because there is no guarantee with Whisper) + previous_start = 0.0 + for segment in segments: + if segment["start"] < 
previous_start: + segment["start"] = previous_start + if segment["end"] <= segment["start"]: + segment["end"] = segment["start"] + 1.0 + previous_start = segment["end"] + for segment in segments: offset = segment["start"] start = min(max_t, round(segment["start"] * SAMPLE_RATE)) From b968cfb1487b72fed6208d0d4a6b1ef455851174 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 3 Jan 2023 13:05:12 +0100 Subject: [PATCH 097/172] Allow to have unspecied language (that can change from one segment to another) --- stt/processing/__init__.py | 19 +++++++++---------- stt/processing/decoding.py | 14 +++++++++++--- stt/processing/load_model.py | 33 ++++++++++++++++++++++++--------- 3 files changed, 44 insertions(+), 22 deletions(-) diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 70b6695..0119461 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -1,6 +1,5 @@ import os import logging -from time import time import torch import whisper @@ -35,20 +34,20 @@ # Load ASR model model_type = os.environ.get("MODEL", "medium") -logger.info( - f"Loading Whisper model {model_type} ({'local' if os.path.isfile(model_type) else 'remote'})...") -start = time() +logger.info(f"Loading Whisper model {model_type} ({'local' if os.path.exists(model_type) else 'remote'})...") try: model = load_whisper_model(model_type, device=device) except Exception as err: raise Exception( "Failed to load transcription model: {}".format(str(err))) from err -logger.info("Model loaded. (t={}s)".format(time() - start)) # Load alignment model alignment_model_name = get_alignment_model(language) -logger.info(f"Loading alignment model {alignment_model_name} ({'local' if os.path.isfile(alignment_model_name) else 'remote'})...") -start = time() -alignment_model = load_alignment_model( - alignment_model_name, device=device, download_root="/opt") -logger.info(f"Alignment Model of type {get_model_type(alignment_model)} loaded. 
(t={time() - start}s)") +if alignment_model_name: + logger.info( + f"Loading alignment model {alignment_model_name} ({'local' if os.path.exists(alignment_model_name) else 'remote'})...") + alignment_model = load_alignment_model( + alignment_model_name, device=device, download_root="/opt") +else: + logger.info("No alignment model preloaded") + alignment_model = {} # Alignement model(s) will be loaded on the fly diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index ed0eef1..abbdd38 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -60,8 +60,16 @@ def decode(audio: torch.Tensor, segments = whisper_res["segments"] if language is None: language = whisper_res["language"] - if alignment_model is None: - alignment_model = load_alignment_model(get_alignment_model(language), device=model.device) + logger.info(f"Detected language: {language}") + if isinstance(alignment_model, dict): + # Load alignment model on the fly + if language not in alignment_model: + alignment_model_name = get_alignment_model(language) + logger.info(f"Loading alignment model {alignment_model_name} ({'local' if os.path.exists(alignment_model_name) else 'remote'})...") + alignment_model[language] = load_alignment_model(alignment_model_name, device=model.device, download_root="/opt") + spec_alignment_model = alignment_model[language] + else: + spec_alignment_model = alignment_model result["text"] = text result["confidence-score"] = np.exp(np.array([r["avg_logprob"] @@ -103,7 +111,7 @@ def decode(audio: torch.Tensor, f"Lost text in segment {segment['start']}-{segment['end']}") continue labels, emission, trellis, segments, word_segments = compute_alignment( - sub_audio, sub_text, alignment_model) + sub_audio, sub_text, spec_alignment_model) ratio = len(sub_audio) / (trellis.size(0) * SAMPLE_RATE) sub_words = sub_text.split() if len(sub_words) == len(word_segments): diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 4add720..addbc04 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -7,6 +7,9 @@ import transformers import torchaudio +import time +from stt import logger + # Source: https://github.com/m-bain/whisperX (in whisperx/transcribe.py) ALIGNMENT_MODELS = { "fr": "/opt/linSTT_speechbrain_fr-FR_v1.0.0", @@ -26,31 +29,43 @@ def get_alignment_model(language): source = os.environ.get("ALIGNMENT_MODEL") if not source: - return ALIGNMENT_MODELS.get(language, ALIGNMENT_MODELS["fr"]) + return ALIGNMENT_MODELS.get(language, None) def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): + start = time.time() + model = whisper.load_model(model_type_or_file, device=device, download_root=os.path.join(download_root, "whisper")) model.eval() model.requires_grad_(False) + + logger.info("Whisper Model loaded. 
(t={}s)".format(time.time() - start)) + return model def load_alignment_model(source, device="cpu", download_root="/opt"): + start = time.time() + if source in torchaudio.pipelines.__all__: - return load_torchaudio_model(source, device=device, download_root=download_root) - try: - return load_transformers_model(source, device=device, download_root=download_root) - except Exception as err1: + model = load_torchaudio_model(source, device=device, download_root=download_root) + else: try: - return load_speechbrain_model(source, device=device, download_root=download_root) - except Exception as err2: - raise Exception( - f"Failed to load alignment model:\n<<< transformers <<<\n{str(err1)}\n<<< speechbrain <<<\n{str(err2)}") from err2 + model = load_transformers_model(source, device=device, download_root=download_root) + except Exception as err1: + try: + model = load_speechbrain_model(source, device=device, download_root=download_root) + except Exception as err2: + raise Exception( + f"Failed to load alignment model:\n<<< transformers <<<\n{str(err1)}\n<<< speechbrain <<<\n{str(err2)}") from err2 + + logger.info(f"Alignment Model of type {get_model_type(model)} loaded. (t={time.time() - start}s)") + + return model def load_speechbrain_model(source, device="cpu", download_root="/opt"): From 939b7576902344139b6567f0924a867dacfdf4c5 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 3 Jan 2023 13:54:12 +0100 Subject: [PATCH 098/172] Update README --- .envdefault | 1 + README.md | 82 +++++++++++++++++++++++++++++++++++++---------------- 2 files changed, 59 insertions(+), 24 deletions(-) diff --git a/.envdefault b/.envdefault index 4452be3..617f4ae 100644 --- a/.envdefault +++ b/.envdefault @@ -1,6 +1,7 @@ # SERVING PARAMETERS SERVICE_MODE=http MODEL=/opt/model.pt +#ALIGNMENT_MODEL=/opt/linSTT_speechbrain_fr-FR_v1.0.0 LANGUAGE=fr # TASK PARAMETERS diff --git a/README.md b/README.md index a15b330..0b27eb5 100644 --- a/README.md +++ b/README.md @@ -12,21 +12,23 @@ To run the transcription models you'll need: * One CPU per worker. Inference time scales on CPU performances. ### Model -LinTO-Platform-STT accepts one Whisper models in the PyTorch format. 
- -You can download mutli-lingual models with the following links: -* tiny: "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt -* base: https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt -* small: https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt -* medium: https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt -* large-v1: https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt -* large-v2: https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt - -Models specialized for English can also be found: -* tiny.en: "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt -* base.en: https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt -* small.en: https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt -* medium.en: https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt +LinTO-Platform-STT works with two models: +* A Whisper model to perform Automatic Speech Recognition, which must be in the PyTorch format. +* A wav2vec model to perform word alignment, which can be in the format of SpeechBrain, HuggingFace's Transformers or TorchAudio + +The wav2vec model can be specified either +* with a string corresponding to a `torchaudio` pipeline (e.g. "WAV2VEC2_ASR_BASE_960H") or +* with a string corresponding to a HuggingFace repository of a wav2vec model (e.g. "jonatasgrosman/wav2vec2-large-xlsr-53-english"), or +* with a path corresponding to a folder with a SpeechBrain model + +Default models are provided for the following languages: +* French (fr) +* English (en) +* Spanish (es) +* German (de) +* Dutch (nl) +* Japanese (ja) +* Chinese (zh) ### Docker The transcription service requires docker up and running. @@ -48,15 +50,30 @@ or ```bash docker pull lintoai/linto-platform-stt -``` with the following links +``` **2- Download the models** Have the Whisper model file ready at ASR_PATH. -You can downloaded with the links mentioned above, if you don't have already a Whisper model. If you already used Whisper in the past, you may have models in ~/.cache/whisper. 
+You can download mutli-lingual Whisper models with the following links: +* tiny: "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt +* base: https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt +* small: https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt +* medium: https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt +* large-v1: https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt +* large-v2: https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt + +Whisper models specialized for English can also be found here: +* tiny.en: "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt +* base.en: https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt +* small.en: https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt +* medium.en: https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt + +If may also want to download a specific wav2vec model for word alignment. + **3- Fill the .env** ```bash @@ -65,8 +82,9 @@ cp .envdefault .env | PARAMETER | DESCRIPTION | EXEMPLE | |---|---|---| -| SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http\|task\|websocket | -| MODEL | Path to the model or type of model used. | ASR_PATH\|small\|medium\|large-v1\|... | +| SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http\|task | +| MODEL | Path to the Whisper model, or type of Whisper model used. | ASR_PATH\|small\|medium\|large-v1\|... | +| ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | WAV2VEC_PATH\|jonatasgrosman/wav2vec2-large-xlsr-53-english\|WAV2VEC2_ASR_BASE_960H | | LANGUAGE | (Optional) Language to recognize | fr\|en\|... | | SERVICE_NAME | Using the task mode, set the queue's name for task processing | my-stt | | SERVICE_BROKER | Using the task mode, URL of the message broker | redis://my-broker:6379 | @@ -95,10 +113,9 @@ yo(yoruba), zh(chinese) ### Serving mode ![Serving Modes](https://i.ibb.co/qrtv3Z6/platform-stt.png) -STT can be used three ways: +STT can be used in two ways: * Through an [HTTP API](#http-server) using the **http**'s mode. * Through a [message broker](#micro-service-within-linto-platform-stack) using the **task**'s mode. -* Through a [websocket server](#websocket-server) **websocket**'s mode. Mode is specified using the .env value or environment variable ```SERVING_MODE```. ```bash @@ -119,11 +136,20 @@ linto-platform-stt:latest This will run a container providing an [HTTP API](#http-api) binded on the host HOST_SERVING_PORT port. +You may also want to mount your cache folder CACHE_PATH (e.g. "~/.cache") ```-v CACHE_PATH:/root/.cache``` +in order to avoid downloading models each time. 
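Once the container is running, the service can also be called from Python. A minimal client sketch, assuming the container is bound to port 8080 on localhost (HOST_SERVING_PORT=8080) and that `audio.wav` stands in for your own recording; the `/transcribe` route, the `file` form field and the accept headers are the ones expected by the HTTP ingress:

```python
# Minimal client sketch for the http serving mode. The port (8080) and the
# audio file name are placeholders; adjust them to your own deployment.
import requests

with open("audio.wav", "rb") as audio_file:
    response = requests.post(
        "http://localhost:8080/transcribe",
        files={"file": audio_file},
        headers={"accept": "application/json"},  # "text/plain" returns the transcript only
    )

response.raise_for_status()
result = response.json()
print(result["text"])   # full transcript
print(result["words"])  # word-level timestamps and confidence scores
```

With `accept: text/plain` the same request returns only the raw transcript, which can be handier for quick checks.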
+ +Also if you want to specifiy a custom alignment model already downloaded in a folder WAV2VEC_PATH, +you can add option ```-v WAV2VEC_PATH:/opt/wav2vec``` and environment variable ```ALIGNMENT_MODEL=/opt/wav2vec```. + + **Parameters:** | Variables | Description | Example | |:-|:-|:-| | HOST_SERVING_PORT | Host serving port | 80 | -| ASR_PATH | (Optional) Path to the Whisper model on the host machine to /opt/model.pt | /my/path/to/models/medium.pt | +| ASR_PATH | Path to the Whisper model on the host machine mounted to /opt/model.pt | /my/path/to/models/medium.pt | +| CACHE_PATH | (Optional) Path to a folder to download wav2vec alignment models when relevant | /home/username/.cache | +| WAV2VEC_PATH | (Optional) Path to a folder to a custom wav2vec alignment model | /my/path/to/models/wav2vec | ### Micro-service within LinTO-Platform stack The HTTP serving mode connect a celery worker to a message broker. @@ -142,12 +168,20 @@ docker run --rm \ -v SHARED_AUDIO_FOLDER:/opt/audio \ --env-file .env \ linto-platform-stt:latest -```| LANGUAGE | (Optional) Language to recognize | fr\|en\|... | +``` + +You may also want to mount your cache folder CACHE_PATH (e.g. "~/.cache") ```-v CACHE_PATH:/root/.cache``` +in order to avoid downloading models each time. + +Also if you want to specifiy a custom alignment model already downloaded in a folder WAV2VEC_PATH, +you can add option ```-v WAV2VEC_PATH:/opt/wav2vec``` and environment variable ```ALIGNMENT_MODEL=/opt/wav2vec```. | Variables | Description | Example | |:-|:-|:-| -| ASR_PATH | (Optional) Path to the Whisper model on the host machine to /opt/model.pt | /my/path/to/models/medium.pt | | SHARED_AUDIO_FOLDER | Shared audio folder mounted to /opt/audio | /my/path/to/models/vosk-model | +| ASR_PATH | Path to the Whisper model on the host machine mounted to /opt/model.pt | /my/path/to/models/medium.pt | +| CACHE_PATH | (Optional) Path to a folder to download wav2vec alignment models when relevant | /home/username/.cache | +| WAV2VEC_PATH | (Optional) Path to a folder to a custom wav2vec alignment model | /my/path/to/models/wav2vec | ## Usages From af5f8211843044a642d56a58d7ddfa06ba343f2a Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 3 Jan 2023 13:54:45 +0100 Subject: [PATCH 099/172] cosm --- load_alignment_model.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/load_alignment_model.py b/load_alignment_model.py index 0cf6087..7ca700e 100644 --- a/load_alignment_model.py +++ b/load_alignment_model.py @@ -1,4 +1,5 @@ -import os +import os +import shutil import urllib.request import zipfile @@ -14,7 +15,9 @@ def load_alignment_model(name, download_root = "/opt"): # Download model url = f"https://dl.linto.ai/downloads/model-distribution/acoustic-models/fr-FR/{name}.zip" destzip = destdir+".zip" - if not os.path.exists(destzip): + if os.path.exists(os.path.basename(destzip)): + shutil.move(os.path.basename(destzip), destzip) + if not os.path.exists(destzip): print("Downloading", url, "into", destdir) os.makedirs(download_root, exist_ok=True) urllib.request.urlretrieve(url, destzip) From bef7a48f16cbac4dc40b405364e65e98ce43c78a Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 6 Jan 2023 17:25:59 +0100 Subject: [PATCH 100/172] improve logs, readme, update comment --- README.md | 16 +++++++++------- stt/processing/__init__.py | 2 ++ stt/processing/decoding.py | 2 +- stt/processing/word_alignment.py | 8 ++++++-- 4 files changed, 18 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 0b27eb5..4d997ba 
100644 --- a/README.md +++ b/README.md @@ -82,14 +82,14 @@ cp .envdefault .env | PARAMETER | DESCRIPTION | EXEMPLE | |---|---|---| -| SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http\|task | -| MODEL | Path to the Whisper model, or type of Whisper model used. | ASR_PATH\|small\|medium\|large-v1\|... | -| ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | WAV2VEC_PATH\|jonatasgrosman/wav2vec2-large-xlsr-53-english\|WAV2VEC2_ASR_BASE_960H | -| LANGUAGE | (Optional) Language to recognize | fr\|en\|... | +| SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http \| task | +| MODEL | Path to the Whisper model, or type of Whisper model used. | \ \| medium \| large-v1 \| ... | +| ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | \ \| WAV2VEC2_ASR_BASE_960H \| jonatasgrosman/wav2vec2-large-xlsr-53-english \| ... | +| LANGUAGE | (Optional) Language to recognize | fr \| en \| ... | | SERVICE_NAME | Using the task mode, set the queue's name for task processing | my-stt | | SERVICE_BROKER | Using the task mode, URL of the message broker | redis://my-broker:6379 | | BROKER_PASS | Using the task mode, broker password | my-password | -| CONCURRENCY | Maximum number of parallel requests | >1 | +| CONCURRENCY | Maximum number of parallel requests | 3 | The language is a code of two or three letters. The list of languages supported by Whisper are: ``` @@ -142,11 +142,10 @@ in order to avoid downloading models each time. Also if you want to specifiy a custom alignment model already downloaded in a folder WAV2VEC_PATH, you can add option ```-v WAV2VEC_PATH:/opt/wav2vec``` and environment variable ```ALIGNMENT_MODEL=/opt/wav2vec```. - **Parameters:** | Variables | Description | Example | |:-|:-|:-| -| HOST_SERVING_PORT | Host serving port | 80 | +| HOST_SERVING_PORT | Host serving port | 8080 | | ASR_PATH | Path to the Whisper model on the host machine mounted to /opt/model.pt | /my/path/to/models/medium.pt | | CACHE_PATH | (Optional) Path to a folder to download wav2vec alignment models when relevant | /home/username/.cache | | WAV2VEC_PATH | (Optional) Path to a folder to a custom wav2vec alignment model | /my/path/to/models/wav2vec | @@ -176,6 +175,7 @@ in order to avoid downloading models each time. Also if you want to specifiy a custom alignment model already downloaded in a folder WAV2VEC_PATH, you can add option ```-v WAV2VEC_PATH:/opt/wav2vec``` and environment variable ```ALIGNMENT_MODEL=/opt/wav2vec```. +**Parameters:** | Variables | Description | Example | |:-|:-|:-| | SHARED_AUDIO_FOLDER | Shared audio folder mounted to /opt/audio | /my/path/to/models/vosk-model | @@ -265,3 +265,5 @@ This project is developped under the AGPLv3 License (see LICENSE). * [OpenAI Whisper](https://github.com/openai/whisper) * [SpeechBrain](https://github.com/speechbrain/speechbrain). 
+* [TorchAudio](https://github.com/pytorch/audio) +* [HuggingFace Transformers](https://github.com/huggingface/transformers) \ No newline at end of file diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 0119461..ac22286 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -23,6 +23,7 @@ device = torch.device(device) except Exception as err: raise Exception("Failed to set device: {}".format(str(err))) from err +logger.info(f"Using device {device}") # Check language language = get_language() @@ -31,6 +32,7 @@ if language not in available_languages: raise RuntimeError( f"Language {get_language()} is not available. Available languages are: {available_languages}") +logger.info(f"Using language {language}") # Load ASR model model_type = os.environ.get("MODEL", "medium") diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index abbdd38..f16dd0b 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -11,7 +11,7 @@ from .text_normalize import remove_punctuation, normalize_text, remove_emoji from .load_model import load_alignment_model, get_alignment_model -# TODO: understand and remove this limitations +# This is to avoid hanging in a multi-threaded environment torch.set_num_threads(1) diff --git a/stt/processing/word_alignment.py b/stt/processing/word_alignment.py index 4e32bdc..ba94a14 100644 --- a/stt/processing/word_alignment.py +++ b/stt/processing/word_alignment.py @@ -9,6 +9,7 @@ from .utils import flatten from .text_normalize import transliterate +_unknown_chars = [] def compute_alignment(audio, transcript, model): """ Compute the alignment of the audio and a transcript, for a given model that returns log-probabilities on the charset defined the transcript.""" @@ -54,6 +55,7 @@ def count_repetitions(tokens): def loose_get_char_index(dictionary, c, default=None): + global _unknown_chars i = dictionary.get(c, None) if i is None: # Try with alternative versions of the character @@ -75,8 +77,10 @@ def loose_get_char_index(dictionary, c, default=None): i = candidate # If still not found if i is None: - logger.warn("Character not correctly handled by alignment model: '" + - "' / '".join(list(set([c] + other_char))) + "'") + if c not in _unknown_chars: + logger.warn("Character not correctly handled by alignment model: '" + + "' / '".join(list(set([c] + other_char))) + "'") + _unknown_chars.append(c) i = [default] if default is not None else [] else: i = [i] From d60feecfb1852f0e77933e7f76fed82a3259732e Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 6 Jan 2023 17:46:34 +0100 Subject: [PATCH 101/172] Make it work with GPU. 
Note: CUDA multiprocessing needs "spawn" start method, and gunicorn cannot achieve this --- http_server/ingress.py | 42 +++++++++++++++++++++++--------------- http_server/serving.py | 24 +++++++++++++++++++++- requirements.txt | 1 + stt/processing/__init__.py | 6 +++--- 4 files changed, 52 insertions(+), 21 deletions(-) diff --git a/http_server/ingress.py b/http_server/ingress.py index db739d4..ce12e53 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -2,16 +2,15 @@ import json import logging -import os -from time import time +import time from confparser import createParser from flask import Flask, Response, abort, json, request from flask_sock import Sock -from serving import GunicornServing +from serving import GeventServing, GunicornServing from swagger import setupSwaggerUI -from stt.processing import decode, load_wave_buffer, model, alignment_model +from stt.processing import decode, load_wave_buffer, model, alignment_model, use_gpu from stt import logger as stt_logger app = Flask("__stt-standalone-worker__") @@ -41,29 +40,29 @@ def transcribe(): logger.info("Transcribe request received") # get response content type - logger.debug(request.headers.get("accept").lower()) + # logger.debug(request.headers.get("accept").lower()) if request.headers.get("accept").lower() == "application/json": join_metadata = True elif request.headers.get("accept").lower() == "text/plain": join_metadata = False else: raise ValueError("Not accepted header") - logger.debug("Metadata: {}".format(join_metadata)) + # logger.debug("Metadata: {}".format(join_metadata)) # get input file - if "file" in request.files.keys(): - file_buffer = request.files["file"].read() - audio_data = load_wave_buffer(file_buffer) - start_t = time() + if "file" not in request.files.keys(): + raise ValueError("No audio file was uploaded") - # Transcription - transcription = decode(audio_data, model, alignment_model, join_metadata) - logger.debug("Transcription complete (t={}s)".format(time() - start_t)) + file_buffer = request.files["file"].read() + audio_data = load_wave_buffer(file_buffer) + start_t = time.time() - logger.debug("... 
Complete") + # Transcription + transcription = decode( + audio_data, model, alignment_model, join_metadata) + logger.debug("Transcription complete (t={}s)".format(time.time() - start_t)) - else: - raise ValueError("No audio file was uploaded") + logger.debug(f"END {id}: {time.time()}") if join_metadata: return json.dumps(transcription, ensure_ascii=False), 200 @@ -108,7 +107,16 @@ def server_error(error): except Exception as err: logger.warning("Could not setup swagger: {}".format(str(err))) - serving = GunicornServing( + logger.info(f"Using {args.workers} workers") + + if use_gpu: + serving_type = GeventServing + logger.debug("Serving with gevent") + else: + serving_type = GunicornServing + logger.debug("Serving with gunicorn") + + serving = serving_type( app, { "bind": f"0.0.0.0:{args.service_port}", diff --git a/http_server/serving.py b/http_server/serving.py index d2dd7e8..773c463 100644 --- a/http_server/serving.py +++ b/http_server/serving.py @@ -1,5 +1,7 @@ import gunicorn.app.base - +import gevent.pywsgi +import gevent.monkey +gevent.monkey.patch_all() class GunicornServing(gunicorn.app.base.BaseApplication): def __init__(self, app, options=None): @@ -18,3 +20,23 @@ def load_config(self): def load(self): return self.application + +class GeventServing(): + + def __init__(self, app, options=None): + self.options = options or {} + self.application = app + + def run(self): + bind = self.options.get('bind', "0.0.0.0:8080") + workers = self.options.get('workers', 1) + listener = bind.split(':') + try: + assert len(listener) == 2 + listener = (listener[0], int(listener[1])) + except: + print(f"Invalid bind address {bind}") + + server = gevent.pywsgi.WSGIServer(listener, self.application, spawn = workers) + server.serve_forever() + diff --git a/requirements.txt b/requirements.txt index c4e4fd4..6b9b488 100644 --- a/requirements.txt +++ b/requirements.txt @@ -3,6 +3,7 @@ flask>=1.1.2 flask-cors>=3.0.10 flask-sock flask-swagger-ui>=3.36.0 +gevent gunicorn num2words pyyaml>=5.4.1 diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index ac22286..492aded 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -10,19 +10,19 @@ from .load_model import load_whisper_model, load_alignment_model, get_alignment_model, get_model_type -__all__ = ["logger", "decode", "model", "alignment_model", +__all__ = ["logger", "use_gpu", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] # Set informative log logger.setLevel(logging.INFO) # Set device -device = os.environ.get( - "DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu") +device = os.environ.get("DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu") try: device = torch.device(device) except Exception as err: raise Exception("Failed to set device: {}".format(str(err))) from err +use_gpu = device.type == "cuda" logger.info(f"Using device {device}") # Check language From 3759066abea268ded87333663946d01bc3f8bf94 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 6 Jan 2023 18:30:12 +0100 Subject: [PATCH 102/172] fix failure on GPU with torchaudio models --- stt/processing/alignment_model.py | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/stt/processing/alignment_model.py b/stt/processing/alignment_model.py index b6ef333..a4669a1 100644 --- a/stt/processing/alignment_model.py +++ b/stt/processing/alignment_model.py @@ -174,6 +174,12 @@ def compute_logits_torchaudio(model_and_labels, audios, max_len): model, _ = model_and_labels + # Get the device where is 
running the model + device = "cpu" + for p in model.parameters(): + device = p.device + break + all_logits = [] with torch.inference_mode(): @@ -186,10 +192,10 @@ def compute_logits_torchaudio(model_and_labels, audios, max_len): logits = [] for i in range(0, l, max_len): j = min(i + max_len, l) - logits.append(model(audio[i:j].unsqueeze(0))[0]) + logits.append(model(audio[i:j].unsqueeze(0).to(device))[0]) logits = torch.cat(logits, dim=1) else: - logits, _ = model(audio.unsqueeze(0)) + logits, _ = model(audio.unsqueeze(0).to(device)) all_logits.append(logits.cpu().detach()) From 4c6b2b3af5c74228a96df84e6baaed93643fb43a Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 6 Jan 2023 18:34:53 +0100 Subject: [PATCH 103/172] give up linstt model for french. Add more HuggingFace wav2vec models to support alignment in more languages --- .envdefault | 5 ++- Dockerfile | 11 +---- load_alignment_model.py | 82 ------------------------------------ stt/processing/load_model.py | 24 +++++++++-- 4 files changed, 24 insertions(+), 98 deletions(-) delete mode 100644 load_alignment_model.py diff --git a/.envdefault b/.envdefault index 617f4ae..ce8ca21 100644 --- a/.envdefault +++ b/.envdefault @@ -1,8 +1,9 @@ # SERVING PARAMETERS SERVICE_MODE=http -MODEL=/opt/model.pt -#ALIGNMENT_MODEL=/opt/linSTT_speechbrain_fr-FR_v1.0.0 LANGUAGE=fr +MODEL=/opt/model.pt +#ALIGNMENT_MODEL=/opt/alignment_model +#DEVICE=cuda:0 # TASK PARAMETERS SERVICE_NAME=stt diff --git a/Dockerfile b/Dockerfile index 4761b3d..844f7ac 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,8 +1,6 @@ FROM python:3.9 LABEL maintainer="jlouradour@linagora.com" -ARG KALDI_MKL - RUN apt-get update && \ apt-get install -y --no-install-recommends \ wget \ @@ -27,14 +25,7 @@ RUN rm -rf /var/lib/apt/lists/* # Install python dependencies COPY requirements.txt ./ -RUN pip install --force-reinstall --no-cache-dir -r requirements.txt - -# Download alignment model -COPY load_alignment_model.py ./ -RUN python3 load_alignment_model.py - -# Cleaning -RUN rm requirements.txt load_alignment_model.py +RUN pip install --force-reinstall --no-cache-dir -r requirements.txt && rm requirements.txt WORKDIR /usr/src/app diff --git a/load_alignment_model.py b/load_alignment_model.py deleted file mode 100644 index 7ca700e..0000000 --- a/load_alignment_model.py +++ /dev/null @@ -1,82 +0,0 @@ -import os -import shutil -import urllib.request -import zipfile - -import huggingface_hub -import speechbrain as sb -import requests - - -def load_alignment_model(name, download_root = "/opt"): - if name.startswith("linSTT"): - destdir = os.path.join(download_root, name) - if not os.path.exists(destdir): - # Download model - url = f"https://dl.linto.ai/downloads/model-distribution/acoustic-models/fr-FR/{name}.zip" - destzip = destdir+".zip" - if os.path.exists(os.path.basename(destzip)): - shutil.move(os.path.basename(destzip), destzip) - if not os.path.exists(destzip): - print("Downloading", url, "into", destdir) - os.makedirs(download_root, exist_ok=True) - urllib.request.urlretrieve(url, destzip) - with zipfile.ZipFile(destzip, 'r') as z: - os.makedirs(destdir, exist_ok=True) - z.extractall(destdir) - assert os.path.isdir(destdir) - os.remove(destzip) - else: - destdir = name - load_speechbrain_model(destdir, download_root = download_root) - -def load_speechbrain_model(source, device = None, download_root = "/opt"): - - if os.path.isdir(source): - yaml_file = os.path.join(source, "hyperparams.yaml") - assert os.path.isfile(yaml_file), f"Hyperparams file {yaml_file} not found" - else: - 
try: - yaml_file = huggingface_hub.hf_hub_download(repo_id=source, filename="hyperparams.yaml", cache_dir = os.path.join(download_root, "huggingface/hub")) - except requests.exceptions.HTTPError: - yaml_file = None - - overrides = make_yaml_overrides(yaml_file, {"save_path": os.path.join(download_root, "speechbrain")}) - savedir = os.path.join(download_root, "speechbrain") - try: - model = sb.pretrained.EncoderASR.from_hparams(source = source, savedir = savedir, overrides = overrides) - except ValueError: - model = sb.pretrained.EncoderDecoderASR.from_hparams(source = source, savedir = savedir, overrides = overrides) - return model - -def make_yaml_overrides(yaml_file, key_values): - """ - return a dictionary of overrides to be used with speechbrain - yaml_file: path to yaml file - key_values: dict of key values to override - """ - if yaml_file is None: return None - - override = {} - with open(yaml_file, "r") as f: - parent = None - for line in f: - if line.strip() == "": - parent = None - elif line == line.lstrip(): - if ":" in line: - parent = line.split(":")[0].strip() - if parent in key_values: - override[parent] = key_values[parent] - elif ":" in line: - child = line.strip().split(":")[0].strip() - if child in key_values: - override[parent] = override.get(parent, {}) | {child: key_values[child]} - return override - - -if __name__ == "__main__": - - import sys - assert len(sys.argv) in [1, 2], f"Usage: {sys.argv[0]} " - load_alignment_model(sys.argv[1] if len(sys.argv) > 1 else "linSTT_speechbrain_fr-FR_v1.0.0") diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index addbc04..7fae195 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -10,19 +10,35 @@ import time from stt import logger -# Source: https://github.com/m-bain/whisperX (in whisperx/transcribe.py) +# Sources: +# * https://github.com/m-bain/whisperX (in whisperx/transcribe.py) +# * https://pytorch.org/audio/stable/pipelines.html +# * https://huggingface.co/jonatasgrosman + ALIGNMENT_MODELS = { - "fr": "/opt/linSTT_speechbrain_fr-FR_v1.0.0", - # "fr": "VOXPOPULI_ASR_BASE_10K_FR", "en": "WAV2VEC2_ASR_BASE_960H", # "en": "jonatasgrosman/wav2vec2-large-xlsr-53-english", + "fr": "VOXPOPULI_ASR_BASE_10K_FR", + # "fr": "jonatasgrosman/wav2vec2-large-xlsr-53-french", "de": "VOXPOPULI_ASR_BASE_10K_DE", + # "de": "jonatasgrosman/wav2vec2-large-xlsr-53-german", "es": "VOXPOPULI_ASR_BASE_10K_ES", + # "it": "jonatasgrosman/wav2vec2-large-xlsr-53-spanish", "it": "VOXPOPULI_ASR_BASE_10K_IT", + # "it": "jonatasgrosman/wav2vec2-large-xlsr-53-italian", + "pt": "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese", "nl": "jonatasgrosman/wav2vec2-large-xlsr-53-dutch", + "pl": "jonatasgrosman/wav2vec2-large-xlsr-53-polish", + "fi": "jonatasgrosman/wav2vec2-large-xlsr-53-finnish", + "hu": "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian", + "el": "jonatasgrosman/wav2vec2-large-xlsr-53-greek", + "fa": "jonatasgrosman/wav2vec2-large-xlsr-53-persian", + "ar": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic", + "ru": "jonatasgrosman/wav2vec2-large-xlsr-53-russian", + "uk": "Yehor/wav2vec2-xls-r-300m-uk-with-small-lm", "ja": "jonatasgrosman/wav2vec2-large-xlsr-53-japanese", "zh": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn", - "uk": "Yehor/wav2vec2-xls-r-300m-uk-with-small-lm", + "vi": "nguyenvulebinh/wav2vec2-base-vietnamese-250h", } From 3353c70e40751aaefe0729d89c236da6d6836602 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 6 Jan 2023 18:59:10 +0100 Subject: [PATCH 104/172] cosm --- 
stt/processing/text_normalize.py | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index 6065eaa..7621199 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -44,6 +44,7 @@ def remove_emoji(text): def normalize_text(text: str, lang: str) -> str: """ Transform digits into characters... """ + # Reorder currencies (1,20€ -> 1 € 20) coma = "," if lang in ["fr"] else "\." for c in _currencies: @@ -62,12 +63,14 @@ def normalize_text(text: str, lang: str) -> str: r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|ème|eme|e|er|ère)?\b", text) digits = ["".join(d) for d in digits] else: - digits = [] + digits = re.findall( + r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})\b", text) + digits = ["".join(d) for d in digits] if digits: digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) for s in digits: - filtered = re.sub("[a-z]", "", s) + filtered = re.sub("[a-zèº]", "", s) ordinal = filtered != s digit = roman_to_decimal(filtered) v = undigit(str(digit), lang=lang, @@ -83,7 +86,7 @@ def normalize_text(text: str, lang: str) -> str: r"\b1(?:ère|ere|er|re|r)|2(?:nd|nde)|\d+(?:º|ème|eme|e)\b", text) else: logger.warn( - f"Language {lang} not supported for normalization. Some words might be mis-localized.") + f"Language {lang} not supported for some normalization. Some words might be mis-localized.") digits = [] if digits: digits = sorted(list(set(digits)), reverse=True, From 01d3f57cf997b6b3a480424d4f9eca977d380f40 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 6 Jan 2023 18:59:30 +0100 Subject: [PATCH 105/172] some wav2vec models do not use attention mask --- stt/processing/alignment_model.py | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/stt/processing/alignment_model.py b/stt/processing/alignment_model.py index a4669a1..8a7c39f 100644 --- a/stt/processing/alignment_model.py +++ b/stt/processing/alignment_model.py @@ -151,6 +151,8 @@ def compute_logits_transformers(model_and_processor, audios, max_len): l = padded_batch.input_values.shape[1] + use_mask = hasattr(padded_batch, "attention_mask") + with torch.inference_mode(): if l > max_len: # Split batch in smaller chunks @@ -159,12 +161,17 @@ def compute_logits_transformers(model_and_processor, audios, max_len): logits = [] for i in range(0, l, max_len): j = min(i + max_len, l) - logits.append(model(padded_batch.input_values[:, i:j].to(device), + if use_mask: + logits.append(model(padded_batch.input_values[:, i:j].to(device), attention_mask=padded_batch.attention_mask[:, i:j].to(device)).logits) + else: + logits.append(model(padded_batch.input_values[:, i:j].to(device)).logits) logits = torch.cat(logits, dim=1) - else: + elif use_mask: logits = model(padded_batch.input_values.to(device), attention_mask=padded_batch.attention_mask.to(device)).logits + else: + logits = model(padded_batch.input_values.to(device)).logits return logits.cpu().detach() From 76838ab0666587e15b6d5eba3b8284f56ca31b75 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 6 Jan 2023 19:09:14 +0100 Subject: [PATCH 106/172] glue the words inside a segment --- stt/processing/decoding.py | 32 ++++++++++++++++++++++---------- 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index f16dd0b..5353bcd 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -115,22 +115,34 @@ def decode(audio: torch.Tensor, ratio = len(sub_audio) / 
(trellis.size(0) * SAMPLE_RATE) sub_words = sub_text.split() if len(sub_words) == len(word_segments): - for word, segment in zip(sub_words, word_segments): + for word, seg in zip(sub_words, word_segments): result["words"].append({ "word": word, - "start": segment.start * ratio + offset, - "end": segment.end * ratio + offset, - "conf": segment.score, + "start": seg.start * ratio + offset, + "end": seg.end * ratio + offset, + "conf": seg.score, }) else: logger.warn( - f"Alignment failed. Results might differ on some words.\nNumber of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}") - for segment in word_segments: + f"Alignment failed. Some words might be mis-rendered.\nNumber of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}") + for seg in word_segments: result["words"].append({ - "word": segment.label, - "start": segment.start * ratio + offset, - "end": segment.end * ratio + offset, - "conf": segment.score, + "word": seg.label, + "start": seg.start * ratio + offset, + "end": seg.end * ratio + offset, + "conf": seg.score, }) + # Glue the words inside a segment + previous_start = offset + words = result["words"] + for i, word in enumerate(words): + if i == 0: + word["start"] = segment["start"] + else: + word["start"] = words[i-1]["end"] + if i == len(words) - 1: + word["end"] = segment["end"] + else: + word["end"] = .5 * (words[i+1]["start"] + word["end"]) return result From affd53682881ade4edac24f1d638ed47f09a6892 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 9 Jan 2023 09:46:42 +0100 Subject: [PATCH 107/172] fix bug in the position of the first and last word of each segment --- stt/processing/decoding.py | 32 ++++++++++++++------------------ 1 file changed, 14 insertions(+), 18 deletions(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 5353bcd..41143d3 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -114,27 +114,21 @@ def decode(audio: torch.Tensor, sub_audio, sub_text, spec_alignment_model) ratio = len(sub_audio) / (trellis.size(0) * SAMPLE_RATE) sub_words = sub_text.split() - if len(sub_words) == len(word_segments): - for word, seg in zip(sub_words, word_segments): - result["words"].append({ - "word": word, - "start": seg.start * ratio + offset, - "end": seg.end * ratio + offset, - "conf": seg.score, - }) - else: + words = [] + use_original_words = True + if len(sub_words) != len(word_segments): logger.warn( f"Alignment failed. 
Some words might be mis-rendered.\nNumber of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}") - for seg in word_segments: - result["words"].append({ - "word": seg.label, - "start": seg.start * ratio + offset, - "end": seg.end * ratio + offset, - "conf": seg.score, - }) + assert len(word_segments) < len(sub_words) + use_original_words = False + for word, seg in zip(sub_words, word_segments): + words.append({ + "word": word if use_original_words else seg.label, + "start": seg.start * ratio + offset, + "end": seg.end * ratio + offset, + "conf": seg.score, + }) # Glue the words inside a segment - previous_start = offset - words = result["words"] for i, word in enumerate(words): if i == 0: word["start"] = segment["start"] @@ -144,5 +138,7 @@ def decode(audio: torch.Tensor, word["end"] = segment["end"] else: word["end"] = .5 * (words[i+1]["start"] + word["end"]) + # Accumulate results + result["words"] += words return result From 2cf0799247b228bb7c6c9e3efbad728414caf58a Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 9 Jan 2023 09:47:11 +0100 Subject: [PATCH 108/172] fix alignment model specified with env variable ALIGNMENT_MODEL --- stt/processing/load_model.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 7fae195..b4b5738 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -45,7 +45,8 @@ def get_alignment_model(language): source = os.environ.get("ALIGNMENT_MODEL") if not source: - return ALIGNMENT_MODELS.get(language, None) + source = ALIGNMENT_MODELS.get(language, None) + return source def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): From b875897b8847c46ed62be4ff390fa187386d4d8f Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 9 Jan 2023 09:47:41 +0100 Subject: [PATCH 109/172] better text normalization for numbers/symbols before punctuation marks --- stt/processing/text_normalize.py | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index 7621199..fa9933c 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -162,21 +162,27 @@ def normalize_text(text: str, lang: str) -> str: else: word = " / ".join([undigit(s, lang=lang) for s in digitf.split('/')]) - if " " in digit: - text = re.sub(r'\b'+str(digit)+r'\b', " "+word+" ", text) - else: - text = re.sub(str(digit), " "+word+" ", text) + text = replace_keeping_word_boundaries(digit, word, text) # Symbols (currencies, percent...) 
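# Illustration with made-up arguments of what the helper introduced in this
# patch does: replace_keeping_word_boundaries substitutes a symbol with its
# spoken form while keeping the replacement separated from the surrounding
# text by spaces, which collapse_whitespace() cleans up at the end of
# normalize_text.
#
#     replace_keeping_word_boundaries("€", "euros", "ça coûte 3 € aujourd'hui")
#     # -> "ça coûte 3 euros aujourd'hui"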
symbol_table = _symbol_to_word.get(lang, {}) for k, v in symbol_table.items(): - text = re.sub(k, " "+v+" ", text) + text = replace_keeping_word_boundaries(k, v, text) - text = re.sub(r" \.",".", text) + # Remove extra spaces before punctuation + # text = re.sub(r" ([\.,!:;])",r"\1",text) return collapse_whitespace(text) +def replace_keeping_word_boundaries(orig, dest, text): + if orig in text: + text = re.sub(r"(\W)"+orig+r"(\W)", r"\1"+dest+r"\2", text) + text = re.sub(orig+r"(\W)", " "+dest+r"\1", text) + text = re.sub(r"(\W)"+orig, r"\1"+dest+" ", text) + text = re.sub(orig, " "+dest+" ", text) + return text + def undigit(str, lang, to="cardinal"): str = re.sub(" ", "", str) if to == "denominator": From 838ab1058d0a9ce54b3cf9dcdb525c2da3dabd86 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 16 Jan 2023 16:53:56 +0100 Subject: [PATCH 110/172] integrate another approach to get word timestamps from Whisper transcription, based on cross-attention weights (no need to wav2vec model) --- .envdefault | 6 ++-- README.md | 2 +- requirements.txt | 3 +- stt/processing/__init__.py | 24 ++++++++------ stt/processing/decoding.py | 61 ++++++++++++++++++++++++++++++------ stt/processing/load_model.py | 20 +++++++++--- 6 files changed, 87 insertions(+), 29 deletions(-) diff --git a/.envdefault b/.envdefault index ce8ca21..b2105da 100644 --- a/.envdefault +++ b/.envdefault @@ -1,9 +1,11 @@ # SERVING PARAMETERS SERVICE_MODE=http -LANGUAGE=fr +STT_LANGUAGE=fr MODEL=/opt/model.pt -#ALIGNMENT_MODEL=/opt/alignment_model #DEVICE=cuda:0 +#ALIGNMENT_MODEL=fr +#ALIGNMENT_MODEL=wav2vec +#ALIGNMENT_MODEL=/opt/alignment_model # TASK PARAMETERS SERVICE_NAME=stt diff --git a/README.md b/README.md index 4d997ba..df6e3cb 100644 --- a/README.md +++ b/README.md @@ -85,7 +85,7 @@ cp .envdefault .env | SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http \| task | | MODEL | Path to the Whisper model, or type of Whisper model used. | \ \| medium \| large-v1 \| ... | | ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | \ \| WAV2VEC2_ASR_BASE_960H \| jonatasgrosman/wav2vec2-large-xlsr-53-english \| ... | -| LANGUAGE | (Optional) Language to recognize | fr \| en \| ... | +| STT_LANGUAGE | (Optional) Language to recognize | fr \| en \| ... 
| | SERVICE_NAME | Using the task mode, set the queue's name for task processing | my-stt | | SERVICE_BROKER | Using the task mode, URL of the message broker | redis://my-broker:6379 | | BROKER_PASS | Using the task mode, broker password | my-password | diff --git a/requirements.txt b/requirements.txt index 6b9b488..b53c4be 100644 --- a/requirements.txt +++ b/requirements.txt @@ -12,4 +12,5 @@ speechbrain transformers wavio>=0.0.4 websockets -git+https://github.com/openai/whisper.git \ No newline at end of file +# git+https://github.com/openai/whisper.git +git+https://github.com/Jeronymous/whisper-timestamped.git \ No newline at end of file diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 492aded..757a182 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -27,11 +27,14 @@ # Check language language = get_language() -available_languages = [ - k for k, v in whisper.tokenizer.LANGUAGES.items()] + [None] +available_languages = \ + list(whisper.tokenizer.LANGUAGES.keys()) + \ + [k.title() for k in whisper.tokenizer.TO_LANGUAGE_CODE.keys()] + \ + [None] if language not in available_languages: - raise RuntimeError( - f"Language {get_language()} is not available. Available languages are: {available_languages}") + raise ValueError(f"Language {get_language()} is not available. Available languages are: {available_languages}") +if isinstance(language, str): + language = whisper.tokenizer.TO_LANGUAGE_CODE.get(language.lower(), language) logger.info(f"Using language {language}") # Load ASR model @@ -44,12 +47,13 @@ "Failed to load transcription model: {}".format(str(err))) from err # Load alignment model -alignment_model_name = get_alignment_model(language) -if alignment_model_name: +alignment_model = get_alignment_model(os.environ.get("ALIGNMENT_MODEL"), language) +if alignment_model: logger.info( - f"Loading alignment model {alignment_model_name} ({'local' if os.path.exists(alignment_model_name) else 'remote'})...") - alignment_model = load_alignment_model( - alignment_model_name, device=device, download_root="/opt") + f"Loading alignment model {alignment_model} ({'local' if os.path.exists(alignment_model) else 'remote'})...") + alignment_model = load_alignment_model(alignment_model, device=device, download_root="/opt") +elif alignment_model is None: + logger.info("Alignment will be done using Whisper cross-attention weights") else: - logger.info("No alignment model preloaded") + logger.info("No alignment model preloaded. 
It will be loaded on the fly depending on the detected language.") alignment_model = {} # Alignement model(s) will be loaded on the fly diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 41143d3..2f83b4f 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -2,6 +2,7 @@ import whisper from whisper.audio import SAMPLE_RATE +import whisper_timestamped import numpy as np import torch @@ -16,7 +17,7 @@ def get_language(): - return os.environ.get("LANGUAGE", None) + return os.environ.get("STT_LANGUAGE", None) def decode(audio: torch.Tensor, @@ -41,15 +42,23 @@ def decode(audio: torch.Tensor, logger.info(f"Transcribing audio with language {language}...") - whisper_res = model.transcribe(audio, - language=language, - fp16=fp16, - temperature=0.0, # For deterministic results - beam_size=beam_size, - no_speech_threshold=no_speech_threshold, - logprob_threshold=logprob_threshold, - compression_ratio_threshold=compression_ratio_threshold - ) + kwargs = dict( + language=language, + fp16=fp16, + temperature=0.0, # For deterministic results + beam_size=beam_size, + no_speech_threshold=no_speech_threshold, + logprob_threshold=logprob_threshold, + compression_ratio_threshold=compression_ratio_threshold + ) + + if alignment_model is None: + # Use Whisper cross-attention weights + return format_whisper_timestamped_response( + whisper_timestamped.transcribe(model, audio, **kwargs) + ) + + whisper_res = model.transcribe(audio, **kwargs) text = whisper_res["text"] text = remove_emoji(text).strip() @@ -71,6 +80,7 @@ def decode(audio: torch.Tensor, else: spec_alignment_model = alignment_model + result["text"] = text result["confidence-score"] = np.exp(np.array([r["avg_logprob"] for r in segments])).mean() if len(segments) else 0.0 @@ -142,3 +152,34 @@ def decode(audio: torch.Tensor, result["words"] += words return result + +def format_whisper_timestamped_response(transcription): + """Format Whisper response.""" + + # NOCOMMIT + import json + print(json.dumps(transcription, indent=2, ensure_ascii=False)) + + for i, seg in enumerate(transcription["segments"][:-1]): + for expected_keys in ["start", "end", "words", "avg_logprob"]: + assert expected_keys in seg, f"Missing '{expected_keys}' in segment {i} (that has keys {list(seg.keys())})" + + text = transcription["text"].strip() + + segments = [] + + for seg in transcription["segments"]: + seg_proba = np.exp(seg["avg_logprob"]) + for word in seg["words"]: + segments.append({ + "text": word["text"], + "start": word["start"], + "end": word["end"], + "conf": seg_proba, # Same proba for all words within the segment + }) + + return { + "text": text, + "confidence-score": np.mean([np.exp(seg["avg_logprob"]) for seg in transcription["segments"]]), + "segments": segments + } \ No newline at end of file diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index b4b5738..9c3ff29 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -42,11 +42,21 @@ } -def get_alignment_model(language): - source = os.environ.get("ALIGNMENT_MODEL") - if not source: - source = ALIGNMENT_MODELS.get(language, None) - return source +def get_alignment_model(alignment_model_name, language, force = False): + if alignment_model_name in ["wav2vec", "wav2vec2"]: + if language is None: + # Will load alignment model on the fly depending on detected language + return {} + elif language in ALIGNMENT_MODELS: + return ALIGNMENT_MODELS[language] + elif force: + raise ValueError(f"No wav2vec alignment model for language 
'{language}'.") + else: + logger.warn(f"No wav2vec alignment model for language '{language}'. Fallback to English.") + return ALIGNMENT_MODELS["en"] + elif alignment_model_name in whisper.tokenizer.LANGUAGES.keys(): + return get_alignment_model("wav2vec", alignment_model_name, force = True) + return alignment_model_name def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): From 358a64d22c58dca5db9b9fdbaa289abcc0f98905 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 16 Jan 2023 19:02:39 +0100 Subject: [PATCH 111/172] remove unwanted print --- stt/processing/decoding.py | 4 ---- 1 file changed, 4 deletions(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 2f83b4f..55c1636 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -155,10 +155,6 @@ def decode(audio: torch.Tensor, def format_whisper_timestamped_response(transcription): """Format Whisper response.""" - - # NOCOMMIT - import json - print(json.dumps(transcription, indent=2, ensure_ascii=False)) for i, seg in enumerate(transcription["segments"][:-1]): for expected_keys in ["start", "end", "words", "avg_logprob"]: From c0491b3af89bed443344551b12f99cb10010fb05 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 15 Feb 2023 17:53:00 +0100 Subject: [PATCH 112/172] fix text normalization --- stt/processing/text_normalize.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index fa9933c..da2675e 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -127,7 +127,7 @@ def normalize_text(text: str, lang: str) -> str: lang != "fr" and first[-1] in ["1", "2", "3"]) first = undigit(first, lang=lang, to="ordinal" if use_ordinal else "cardinal") - second = _int_to_month[second] + second = _int_to_month.get(lang, {}).get(second,digitf[i+1:]) else: first = undigit(digitf[:i], lang=lang) second = undigit(digitf[i+1:], to="denominator", lang=lang) @@ -148,7 +148,7 @@ def normalize_text(text: str, lang: str) -> str: pass third = undigit(digitf[i2+1:], lang=lang) if is_date: - first = digitf[:i].lstrip("0") + first = digitf[:i1].lstrip("0") use_ordinal = (lang == "fr" and first == "1") or ( lang != "fr" and first[-1] in ["1", "2", "3"]) first = undigit(first, lang=lang, @@ -370,3 +370,4 @@ def value(r): "¥": "yens", } } + From 20ce809987eee3a26bca2da27c4638b503c9fda2 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 15 Feb 2023 18:26:58 +0100 Subject: [PATCH 113/172] update repo url --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index b53c4be..bb3bebf 100644 --- a/requirements.txt +++ b/requirements.txt @@ -13,4 +13,4 @@ transformers wavio>=0.0.4 websockets # git+https://github.com/openai/whisper.git -git+https://github.com/Jeronymous/whisper-timestamped.git \ No newline at end of file +git+https://github.com/linto-ai/whisper-timestamped.git \ No newline at end of file From 65527f2bf56e3deb6fbb14507860908dd94d1873 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 15 Feb 2023 18:27:17 +0100 Subject: [PATCH 114/172] tune default Whisper options --- stt/processing/decoding.py | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 55c1636..fbf531e 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -26,6 +26,9 @@ def decode(audio: torch.Tensor, with_word_timestamps: 
bool, language: str = None, beam_size: int = None, + best_of: int = None, + temperature: float = 0.0, + condition_on_previous_text: bool = False, no_speech_threshold: float = 0.6, logprob_threshold: float = -1.0, compression_ratio_threshold: float = 2.4, @@ -45,8 +48,10 @@ def decode(audio: torch.Tensor, kwargs = dict( language=language, fp16=fp16, - temperature=0.0, # For deterministic results + temperature=temperature, beam_size=beam_size, + best_of=best_of, + condition_on_previous_text=condition_on_previous_text, no_speech_threshold=no_speech_threshold, logprob_threshold=logprob_threshold, compression_ratio_threshold=compression_ratio_threshold @@ -58,6 +63,10 @@ def decode(audio: torch.Tensor, whisper_timestamped.transcribe(model, audio, **kwargs) ) + # Force deterministic results + torch.manual_seed(1234) + torch.cuda.manual_seed_all(1234) + whisper_res = model.transcribe(audio, **kwargs) text = whisper_res["text"] From d192382d9d4795b3fe8a09d3c978c21ab612830b Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 27 Feb 2023 15:19:05 +0100 Subject: [PATCH 115/172] Use LANGUAGE env variable, and accept more values ("*", "fr-FR", ...) --- .envdefault | 8 +++++++- README.md | 23 +++++++++++++---------- stt/processing/decoding.py | 13 +++++++++++-- 3 files changed, 31 insertions(+), 13 deletions(-) diff --git a/.envdefault b/.envdefault index b2105da..d4bb2e8 100644 --- a/.envdefault +++ b/.envdefault @@ -1,8 +1,14 @@ # SERVING PARAMETERS SERVICE_MODE=http -STT_LANGUAGE=fr MODEL=/opt/model.pt + +# LANGUAGE can be in different formats: en, en-US, English, ... +# If not set or "*", the language will be detected automatically. +LANGUAGE=* + #DEVICE=cuda:0 + +# Only used for alignement using wav2vec models #ALIGNMENT_MODEL=fr #ALIGNMENT_MODEL=wav2vec #ALIGNMENT_MODEL=/opt/alignment_model diff --git a/README.md b/README.md index df6e3cb..164db6f 100644 --- a/README.md +++ b/README.md @@ -82,16 +82,19 @@ cp .envdefault .env | PARAMETER | DESCRIPTION | EXEMPLE | |---|---|---| -| SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | http \| task | -| MODEL | Path to the Whisper model, or type of Whisper model used. | \ \| medium \| large-v1 \| ... | -| ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | \ \| WAV2VEC2_ASR_BASE_960H \| jonatasgrosman/wav2vec2-large-xlsr-53-english \| ... | -| STT_LANGUAGE | (Optional) Language to recognize | fr \| en \| ... | -| SERVICE_NAME | Using the task mode, set the queue's name for task processing | my-stt | -| SERVICE_BROKER | Using the task mode, URL of the message broker | redis://my-broker:6379 | -| BROKER_PASS | Using the task mode, broker password | my-password | -| CONCURRENCY | Maximum number of parallel requests | 3 | - -The language is a code of two or three letters. The list of languages supported by Whisper are: +| SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | `http` \| `task` | +| MODEL | Path to the Whisper model, or type of Whisper model used. | \ \| `medium` \| `large-v1` \| ... | +| ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | \ \| `WAV2VEC2_ASR_BASE_960H` \| `jonatasgrosman/wav2vec2-large-xlsr-53-english` \| ... | +| LANGUAGE | (Optional) Language to recognize | `*` \| `fr` \| `fr-FR` \| `French` \| `en` \| `en-US` \| `English` \| ... 
| +| SERVICE_NAME | Using the task mode, set the queue's name for task processing | `my-stt` | +| SERVICE_BROKER | Using the task mode, URL of the message broker | `redis://my-broker:6379` | +| BROKER_PASS | Using the task mode, broker password | `my-password` | +| CONCURRENCY | Maximum number of parallel requests | `3` | + +If `*` is used for the `LANGUAGE` environment variable, or if `LANGUAGE` is not defined, +automatic language detection will be performed by Whisper. + +The language can be a code of two or three letters. The list of languages supported by Whisper are: ``` af(afrikaans), am(amharic), ar(arabic), as(assamese), az(azerbaijani), ba(bashkir), be(belarusian), bg(bulgarian), bn(bengali), bo(tibetan), br(breton), bs(bosnian), diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index fbf531e..407b2be 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -17,8 +17,17 @@ def get_language(): - return os.environ.get("STT_LANGUAGE", None) - + """ + Get the language from the environment variable LANGUAGE, and format as expected by Whisper. + """ + language = os.environ.get("LANGUAGE", "*") + # "fr-FR" -> "fr" (language-country code to ISO 639-1 code) + if len(language) > 2 and language[2] == "-": + language = language.split("-")[0] + # "*" means "all languages" + if language == "*": + language = None + return language def decode(audio: torch.Tensor, model: whisper.model.Whisper, From 858cf88240aa9696d14fbbcf797ecdd69a741a01 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 4 Apr 2023 14:50:29 +0200 Subject: [PATCH 116/172] do not use --force-reinstall which could override torch custom installation (CPU...) --- Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Dockerfile b/Dockerfile index 844f7ac..cb584ca 100644 --- a/Dockerfile +++ b/Dockerfile @@ -25,7 +25,7 @@ RUN rm -rf /var/lib/apt/lists/* # Install python dependencies COPY requirements.txt ./ -RUN pip install --force-reinstall --no-cache-dir -r requirements.txt && rm requirements.txt +RUN pip install --no-cache-dir -r requirements.txt && rm requirements.txt WORKDIR /usr/src/app From ab5f3f436cef89249f0d323e9399ada898f1c1c8 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 4 Apr 2023 14:50:41 +0200 Subject: [PATCH 117/172] ignore pycache folders --- .gitignore | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index c7b414a..06b349b 100644 --- a/.gitignore +++ b/.gitignore @@ -1,4 +1,5 @@ start_container.sh .env* test/* -tmp* \ No newline at end of file +tmp* +__pycache__ \ No newline at end of file From 944f3ec0e08a162ad3a5076e48999c46b3153ee3 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 4 Apr 2023 14:51:15 +0200 Subject: [PATCH 118/172] can load more types of models --- stt/processing/__init__.py | 8 ++++---- stt/processing/load_model.py | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 757a182..d0da27a 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -2,13 +2,13 @@ import logging import torch -import whisper +import whisper_timestamped as whisper from stt import logger from stt.processing.decoding import decode, get_language from stt.processing.utils import load_wave_buffer, load_audiofile -from .load_model import load_whisper_model, load_alignment_model, get_alignment_model, get_model_type +from .load_model import load_whisper_model, load_alignment_model, get_alignment_model __all__ = ["logger", 
"use_gpu", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] @@ -32,7 +32,7 @@ [k.title() for k in whisper.tokenizer.TO_LANGUAGE_CODE.keys()] + \ [None] if language not in available_languages: - raise ValueError(f"Language {get_language()} is not available. Available languages are: {available_languages}") + raise ValueError(f"Language '{get_language()}' is not available. Available languages are: {available_languages}") if isinstance(language, str): language = whisper.tokenizer.TO_LANGUAGE_CODE.get(language.lower(), language) logger.info(f"Using language {language}") @@ -46,7 +46,7 @@ raise Exception( "Failed to load transcription model: {}".format(str(err))) from err -# Load alignment model +# Load alignment model (if any) alignment_model = get_alignment_model(os.environ.get("ALIGNMENT_MODEL"), language) if alignment_model: logger.info( diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 9c3ff29..c8754d4 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -1,4 +1,4 @@ -import whisper +import whisper_timestamped as whisper import os import requests From 9d4b1f56186c8135da18fb9aa5f3d9fd90790d17 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 4 Apr 2023 15:19:03 +0200 Subject: [PATCH 119/172] fix output format, and add language key --- stt/processing/decoding.py | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 407b2be..98558be 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -45,7 +45,6 @@ def decode(audio: torch.Tensor, remove_punctuation_from_words=False, ) -> dict: """Transcribe the audio data using Whisper with the defined model.""" - result = {"text": "", "confidence-score": 0.0, "words": []} fp16 = model.device != torch.device("cpu") @@ -99,9 +98,11 @@ def decode(audio: torch.Tensor, spec_alignment_model = alignment_model + result = {} result["text"] = text - result["confidence-score"] = np.exp(np.array([r["avg_logprob"] - for r in segments])).mean() if len(segments) else 0.0 + result["language"] = language + result["confidence-score"] = np.exp(np.array([r["avg_logprob"] for r in segments])).mean() if len(segments) else 0.0 + if not with_word_timestamps: if not normalize_text_as_words: text = normalize_text(text, language) @@ -175,25 +176,27 @@ def format_whisper_timestamped_response(transcription): """Format Whisper response.""" for i, seg in enumerate(transcription["segments"][:-1]): - for expected_keys in ["start", "end", "words", "avg_logprob"]: - assert expected_keys in seg, f"Missing '{expected_keys}' in segment {i} (that has keys {list(seg.keys())})" + for expected_keys in ["start", "end", "words", "avg_logprob"]: + assert expected_keys in seg, f"Missing '{expected_keys}' in segment {i} (that has keys {list(seg.keys())})" text = transcription["text"].strip() - segments = [] + words = [] + + segments = transcription.get("segments", []) - for seg in transcription["segments"]: - seg_proba = np.exp(seg["avg_logprob"]) - for word in seg["words"]: - segments.append({ - "text": word["text"], + for seg in segments: + for word in seg.get("words", []): + words.append({ + "word": word["text"], "start": word["start"], "end": word["end"], - "conf": seg_proba, # Same proba for all words within the segment + "conf": word["confidence"], }) return { "text": text, - "confidence-score": np.mean([np.exp(seg["avg_logprob"]) for seg in transcription["segments"]]), - "segments": 
segments + "language": transcription["language"], + "confidence-score": np.exp(np.array([r["avg_logprob"] for r in segments])).mean() if len(segments) else 0.0, + "words": words, } \ No newline at end of file From 06ce05ac818136051ddfeb0b6f267f66ec2d9a11 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 4 Apr 2023 15:54:28 +0200 Subject: [PATCH 120/172] log the detected language (when automatic) --- stt/processing/decoding.py | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 98558be..cac8ab9 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -67,9 +67,11 @@ def decode(audio: torch.Tensor, if alignment_model is None: # Use Whisper cross-attention weights - return format_whisper_timestamped_response( - whisper_timestamped.transcribe(model, audio, **kwargs) - ) + whisper_res = whisper_timestamped.transcribe(model, audio, **kwargs) + if language is None: + language = whisper_res["language"] + logger.info(f"Detected language: {language}") + return format_whisper_timestamped_response(whisper_res) # Force deterministic results torch.manual_seed(1234) From 2a85bc69ec5fea47d217ce46af78b0ab3a708758 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 12 Apr 2023 10:41:16 +0200 Subject: [PATCH 121/172] add support of faster_whisper --- Dockerfile.ctranslate2 | 43 ++++++ Dockerfile => Dockerfile.torch | 4 +- Dockerfile.torch.cpu | 49 ++++++ http_server/ingress.py | 28 ++-- requirements.ctranslate2.txt | 12 ++ requirements.txt => requirements.torch.txt | 2 +- stt/__init__.py | 20 ++- stt/processing/__init__.py | 24 +-- stt/processing/alignment_model.py | 15 +- stt/processing/decoding.py | 169 +++++++++++++++++---- stt/processing/load_model.py | 78 +++++++--- stt/processing/text_normalize.py | 3 +- stt/processing/utils.py | 150 ++++++++++++++++-- stt/processing/word_alignment.py | 8 +- 14 files changed, 497 insertions(+), 108 deletions(-) create mode 100644 Dockerfile.ctranslate2 rename Dockerfile => Dockerfile.torch (88%) create mode 100644 Dockerfile.torch.cpu create mode 100644 requirements.ctranslate2.txt rename requirements.txt => requirements.torch.txt (85%) diff --git a/Dockerfile.ctranslate2 b/Dockerfile.ctranslate2 new file mode 100644 index 0000000..5989b3f --- /dev/null +++ b/Dockerfile.ctranslate2 @@ -0,0 +1,43 @@ +FROM python:3.9 +LABEL maintainer="jlouradour@linagora.com" + +RUN apt-get update && \ + apt-get install -y --no-install-recommends \ + wget \ + nano \ + bzip2 \ + unzip \ + xz-utils \ + sox \ + ffmpeg \ + g++ \ + make \ + cmake \ + git \ + zlib1g-dev \ + automake \ + autoconf \ + libtool \ + pkg-config \ + ca-certificates + +RUN rm -rf /var/lib/apt/lists/* + +# Install python dependencies +COPY requirements.ctranslate2.txt ./ +RUN pip install --no-cache-dir -r requirements.ctranslate2.txt && rm requirements.ctranslate2.txt + +WORKDIR /usr/src/app + +COPY stt /usr/src/app/stt +COPY celery_app /usr/src/app/celery_app +COPY http_server /usr/src/app/http_server +COPY websocket /usr/src/app/websocket +COPY document /usr/src/app/document +COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ + +ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" + +HEALTHCHECK CMD ./healthcheck.sh + +ENTRYPOINT ["./docker-entrypoint.sh"] \ No newline at end of file diff --git a/Dockerfile b/Dockerfile.torch similarity index 88% rename from Dockerfile rename to Dockerfile.torch index cb584ca..9db2a58 100644 --- a/Dockerfile +++ b/Dockerfile.torch @@ -24,8 +24,8 @@ RUN apt-get update && \ 
RUN rm -rf /var/lib/apt/lists/* # Install python dependencies -COPY requirements.txt ./ -RUN pip install --no-cache-dir -r requirements.txt && rm requirements.txt +COPY requirements.torch.txt ./ +RUN pip install --no-cache-dir -r requirements.torch.txt && rm requirements.torch.txt WORKDIR /usr/src/app diff --git a/Dockerfile.torch.cpu b/Dockerfile.torch.cpu new file mode 100644 index 0000000..68ceda1 --- /dev/null +++ b/Dockerfile.torch.cpu @@ -0,0 +1,49 @@ +FROM python:3.9 +LABEL maintainer="jlouradour@linagora.com" + +RUN apt-get update && \ + apt-get install -y --no-install-recommends \ + wget \ + nano \ + bzip2 \ + unzip \ + xz-utils \ + sox \ + ffmpeg \ + g++ \ + make \ + cmake \ + git \ + zlib1g-dev \ + automake \ + autoconf \ + libtool \ + pkg-config \ + ca-certificates + +RUN rm -rf /var/lib/apt/lists/* + +# Force CPU versions of torch +RUN pip3 install \ + torch==1.13.1+cpu \ + torchaudio==0.13.1+cpu \ + -f https://download.pytorch.org/whl/torch_stable.html + +# Install python dependencies +COPY requirements.torch.txt ./ +RUN pip install --no-cache-dir -r requirements.torch.txt && rm requirements.torch.txt + +WORKDIR /usr/src/app + +COPY stt /usr/src/app/stt +COPY celery_app /usr/src/app/celery_app +COPY http_server /usr/src/app/http_server +COPY websocket /usr/src/app/websocket +COPY document /usr/src/app/document +COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ + +ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" + +HEALTHCHECK CMD ./healthcheck.sh + +ENTRYPOINT ["./docker-entrypoint.sh"] \ No newline at end of file diff --git a/http_server/ingress.py b/http_server/ingress.py index ce12e53..fae8c2f 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -5,13 +5,13 @@ import time from confparser import createParser -from flask import Flask, Response, abort, json, request -from flask_sock import Sock -from serving import GeventServing, GunicornServing +from flask import Flask, json, request +from serving import GunicornServing, GeventServing from swagger import setupSwaggerUI -from stt.processing import decode, load_wave_buffer, model, alignment_model, use_gpu +from stt.processing import decode, load_wave_buffer, model, alignment_model from stt import logger as stt_logger +from stt import SHOULD_USE_GEVENT app = Flask("__stt-standalone-worker__") app.config["JSON_AS_ASCII"] = False @@ -37,7 +37,7 @@ def oas_docs(): @app.route("/transcribe", methods=["POST"]) def transcribe(): try: - logger.info("Transcribe request received") + logger.info(f"Transcribe request received {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))}") # get response content type # logger.debug(request.headers.get("accept").lower()) @@ -46,33 +46,31 @@ def transcribe(): elif request.headers.get("accept").lower() == "text/plain": join_metadata = False else: - raise ValueError("Not accepted header") + raise ValueError(f"Not accepted header (accept={request.headers.get('accept')} should be either application/json or text/plain)") # logger.debug("Metadata: {}".format(join_metadata)) # get input file if "file" not in request.files.keys(): - raise ValueError("No audio file was uploaded") + raise ValueError(f"No audio file was uploaded (missing 'file' key)") file_buffer = request.files["file"].read() - audio_data = load_wave_buffer(file_buffer) start_t = time.time() + audio_data = load_wave_buffer(file_buffer) # Transcription transcription = decode( audio_data, model, alignment_model, join_metadata) logger.debug("Transcription complete (t={}s)".format(time.time() - start_t)) - 
logger.debug(f"END {id}: {time.time()}") - if join_metadata: return json.dumps(transcription, ensure_ascii=False), 200 return transcription["text"], 200 - except ValueError as error: - return str(error), 400 except Exception as error: - logger.error(error) - return "Server Error: {}".format(str(error)), 500 + import traceback + print(traceback.format_exc()) + logger.error(repr(error)) + return "Server Error: {}".format(str(error)), 400 if isinstance(error, ValueError) else 500 @app.errorhandler(405) @@ -109,7 +107,7 @@ def server_error(error): logger.info(f"Using {args.workers} workers") - if use_gpu: + if SHOULD_USE_GEVENT: # TODO: get rid of this serving_type = GeventServing logger.debug("Serving with gevent") else: diff --git a/requirements.ctranslate2.txt b/requirements.ctranslate2.txt new file mode 100644 index 0000000..2cf4b8d --- /dev/null +++ b/requirements.ctranslate2.txt @@ -0,0 +1,12 @@ +celery[redis,auth,msgpack]>=4.4.7 +flask>=1.1.2 +flask-cors>=3.0.10 +flask-sock +flask-swagger-ui>=3.36.0 +gevent +gunicorn +pyyaml>=5.4.1 +requests>=2.26.0 +wavio>=0.0.4 +websockets +faster_whisper \ No newline at end of file diff --git a/requirements.txt b/requirements.torch.txt similarity index 85% rename from requirements.txt rename to requirements.torch.txt index bb3bebf..1b15744 100644 --- a/requirements.txt +++ b/requirements.torch.txt @@ -12,5 +12,5 @@ speechbrain transformers wavio>=0.0.4 websockets -# git+https://github.com/openai/whisper.git +# openai-whisper git+https://github.com/linto-ai/whisper-timestamped.git \ No newline at end of file diff --git a/stt/__init__.py b/stt/__init__.py index 73c3a1a..43c2725 100644 --- a/stt/__init__.py +++ b/stt/__init__.py @@ -1,8 +1,26 @@ import logging -import os logging.basicConfig( format="%(asctime)s %(name)s %(levelname)s: %(message)s", datefmt="%d/%m/%Y %H:%M:%S", ) logger = logging.getLogger("__stt__") + +try: + import faster_whisper + USE_CTRANSLATE2 = True +except ImportError: + USE_CTRANSLATE2 = False + +try: + import torch, torchaudio + USE_TORCH = True +except ImportError: + USE_TORCH = False + +# TODO: Get rid of that +if USE_TORCH: + SHOULD_USE_GEVENT = torch.cuda.is_available() + torch.set_num_threads(1) +else: + SHOULD_USE_GEVENT = USE_CTRANSLATE2 diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index d0da27a..86d6fde 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -1,40 +1,32 @@ import os import logging -import torch -import whisper_timestamped as whisper - from stt import logger -from stt.processing.decoding import decode, get_language -from stt.processing.utils import load_wave_buffer, load_audiofile +from .decoding import decode, get_language +from .utils import get_device, LANGUAGES, load_wave_buffer, load_audiofile from .load_model import load_whisper_model, load_alignment_model, get_alignment_model -__all__ = ["logger", "use_gpu", "decode", "model", "alignment_model", +__all__ = ["logger", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] # Set informative log logger.setLevel(logging.INFO) # Set device -device = os.environ.get("DEVICE", "cuda:0" if torch.cuda.is_available() else "cpu") -try: - device = torch.device(device) -except Exception as err: - raise Exception("Failed to set device: {}".format(str(err))) from err -use_gpu = device.type == "cuda" +device, use_gpu = get_device() logger.info(f"Using device {device}") # Check language language = get_language() available_languages = \ - list(whisper.tokenizer.LANGUAGES.keys()) + \ - [k.title() for k 
in whisper.tokenizer.TO_LANGUAGE_CODE.keys()] + \ + list(LANGUAGES.keys()) + \ + [k.lower() for k in LANGUAGES.values()] + \ [None] if language not in available_languages: raise ValueError(f"Language '{get_language()}' is not available. Available languages are: {available_languages}") -if isinstance(language, str): - language = whisper.tokenizer.TO_LANGUAGE_CODE.get(language.lower(), language) +if isinstance(language, str) and language not in LANGUAGES: + language = {v: k for k, v in LANGUAGES.items()}[language.lower()] logger.info(f"Using language {language}") # Load ASR model diff --git a/stt/processing/alignment_model.py b/stt/processing/alignment_model.py index 8a7c39f..f026f7a 100644 --- a/stt/processing/alignment_model.py +++ b/stt/processing/alignment_model.py @@ -1,11 +1,12 @@ -import math -import torch -import torch.nn.utils.rnn as rnn_utils - -from stt import logger +from stt import logger, USE_TORCH +from .utils import SAMPLE_RATE from .load_model import get_model_type -import whisper +import math + +if USE_TORCH: + import torch + import torch.nn.utils.rnn as rnn_utils ################################################################################ # Get list of labes (and blank_id) from model @@ -135,7 +136,7 @@ def compute_logits_transformers(model_and_processor, audios, max_len): model, processor = model_and_processor # can be different from processor.feature_extractor.sampling_rate - sample_rate = whisper.audio.SAMPLE_RATE + sample_rate = SAMPLE_RATE device = model.device audios = [audio.numpy() for audio in audios] diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index cac8ab9..77b8881 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -1,19 +1,17 @@ import os -import whisper -from whisper.audio import SAMPLE_RATE -import whisper_timestamped - import numpy as np -import torch +import copy -from stt import logger -from .word_alignment import compute_alignment -from .text_normalize import remove_punctuation, normalize_text, remove_emoji +from stt import logger, USE_CTRANSLATE2 +from .utils import SAMPLE_RATE from .load_model import load_alignment_model, get_alignment_model +from .text_normalize import remove_punctuation, normalize_text, remove_emoji, _punctuations +from .word_alignment import compute_alignment -# This is to avoid hanging in a multi-threaded environment -torch.set_num_threads(1) +if not USE_CTRANSLATE2: + import torch + import whisper_timestamped def get_language(): @@ -29,30 +27,83 @@ def get_language(): language = None return language -def decode(audio: torch.Tensor, - model: whisper.model.Whisper, + +def decode(audio, + model, alignment_model: "Any", with_word_timestamps: bool, language: str = None, + remove_punctuation_from_words=False, beam_size: int = None, best_of: int = None, temperature: float = 0.0, condition_on_previous_text: bool = False, no_speech_threshold: float = 0.6, - logprob_threshold: float = -1.0, compression_ratio_threshold: float = 2.4, - normalize_text_as_words=False, - remove_punctuation_from_words=False, ) -> dict: - """Transcribe the audio data using Whisper with the defined model.""" - - fp16 = model.device != torch.device("cpu") if language is None: language = get_language() + kwargs = copy.copy(locals()) + logger.info(f"Transcribing audio with language {language}...") + if USE_CTRANSLATE2: + kwargs.pop("alignment_model") + return decode_ct2(**kwargs) + else: + return decode_torch(**kwargs) + + +def decode_ct2(audio, + model, + with_word_timestamps, + language, + 
remove_punctuation_from_words, + **kwargs + ): + + kwargs["no_speech_threshold"] = 1 # To avoid empty output + if kwargs.get("beam_size") is None: + kwargs["beam_size"] = 1 + if kwargs.get("best_of") is None: + kwargs["best_of"] = 1 + + segments, info = model.transcribe( + audio, + word_timestamps=with_word_timestamps, + language=language, + # Careful with the following options + max_initial_timestamp=10000.0, + **kwargs) + + segments = list(segments) + + return format_faster_whisper_response( + segments, info, + remove_punctuation_from_words=remove_punctuation_from_words + ) + + +def decode_torch(audio, + model, + alignment_model, + with_word_timestamps, + language, + remove_punctuation_from_words, + beam_size, + best_of, + temperature, + condition_on_previous_text, + no_speech_threshold, + compression_ratio_threshold, + normalize_text_as_words=False, + ): + """Transcribe the audio data using Whisper with the defined model.""" + + fp16 = model.device != torch.device("cpu") + kwargs = dict( language=language, fp16=fp16, @@ -61,7 +112,6 @@ def decode(audio: torch.Tensor, best_of=best_of, condition_on_previous_text=condition_on_previous_text, no_speech_threshold=no_speech_threshold, - logprob_threshold=logprob_threshold, compression_ratio_threshold=compression_ratio_threshold ) @@ -71,7 +121,7 @@ def decode(audio: torch.Tensor, if language is None: language = whisper_res["language"] logger.info(f"Detected language: {language}") - return format_whisper_timestamped_response(whisper_res) + return format_whisper_timestamped_response(whisper_res, remove_punctuation_from_words=remove_punctuation_from_words) # Force deterministic results torch.manual_seed(1234) @@ -99,11 +149,12 @@ def decode(audio: torch.Tensor, else: spec_alignment_model = alignment_model - result = {} result["text"] = text result["language"] = language - result["confidence-score"] = np.exp(np.array([r["avg_logprob"] for r in segments])).mean() if len(segments) else 0.0 + result["confidence-score"] = np.exp( + np.array([r["avg_logprob"] for r in segments]) + ).mean() if len(segments) else 0.0 if not with_word_timestamps: if not normalize_text_as_words: @@ -174,31 +225,91 @@ def decode(audio: torch.Tensor, return result -def format_whisper_timestamped_response(transcription): + +def format_whisper_timestamped_response(transcription, remove_punctuation_from_words=False): """Format Whisper response.""" for i, seg in enumerate(transcription["segments"][:-1]): for expected_keys in ["start", "end", "words", "avg_logprob"]: assert expected_keys in seg, f"Missing '{expected_keys}' in segment {i} (that has keys {list(seg.keys())})" - text = transcription["text"].strip() - words = [] segments = transcription.get("segments", []) for seg in segments: for word in seg.get("words", []): + text = word["text"] + if remove_punctuation_from_words: + text = remove_punctuation(text) words.append({ - "word": word["text"], + "word": text, "start": word["start"], "end": word["end"], "conf": word["confidence"], }) return { - "text": text, + "text": transcription["text"].strip(), "language": transcription["language"], - "confidence-score": np.exp(np.array([r["avg_logprob"] for r in segments])).mean() if len(segments) else 0.0, + "confidence-score": round(np.exp(np.array([r["avg_logprob"] for r in segments])).mean(), 2) if len(segments) else 0.0, "words": words, - } \ No newline at end of file + } + + +def format_faster_whisper_response(segments, info, + remove_punctuation_from_words=False): + + language = info.language + duration = info.duration + + def 
checked_timestamps(start, end=None): + if start > duration or (end is not None and end > duration): + print("WARNING, timestamp %f is greater than duration %f" % (max(start, end if end else start), duration)) + if end and end <= start: + if end == start: + pass # end = start + 0.01 + else: + print("WARNING, end timestamp %f is smaller than start timestamp %f" % (end, start)) + if end is None: + return start + return (start, end) + + segments_list = [] + for segment in segments: + start, end = checked_timestamps(segment.start, segment.end) + + words = [] + if segment.words: + for word in segment.words: + if len(words) and (not(word.word.strip()) or word.word.strip()[0] in _punctuations): + words[-1]["text"] += word.word + if word.word.strip() not in _punctuations: + words[-1]["confidence"].append(word.probability) + _, words[-1]["end"] = checked_timestamps(words[-1]["end"], word.end) + continue + words.append( + {"text": word.word, "confidence": [word.probability]} | dict(zip(("start", "end"), checked_timestamps(word.start, word.end))) + ) + + for word in words: + word["text"] = word["text"].strip() + word["confidence"] = round(np.mean([c for c in word["confidence"]]), 2) + + segments_list.append({ + "text": segment.text.strip(), + "start": start, + "end": end, + "avg_logprob": segment.avg_log_prob, + "words": words + }) + + assert len(segments_list) + + transcription = { + "text": " ".join(segment["text"] for segment in segments_list), + "language": language, + "confidence": round(np.exp(np.mean([segment.avg_log_prob for segment in segments])), 2), + "segments": segments_list, + } + return format_whisper_timestamped_response(transcription, remove_punctuation_from_words=remove_punctuation_from_words) \ No newline at end of file diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index c8754d4..dc2e9fc 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -1,14 +1,20 @@ -import whisper_timestamped as whisper - import os import requests -import huggingface_hub -import speechbrain as sb -import transformers -import torchaudio - import time -from stt import logger + +from stt import logger, USE_CTRANSLATE2, USE_TORCH +from .utils import LANGUAGES + +if USE_CTRANSLATE2: + import faster_whisper as whisper +else: + import whisper_timestamped as whisper + +if USE_TORCH: + import huggingface_hub + import speechbrain as sb + import transformers + import torchaudio # Sources: # * https://github.com/m-bain/whisperX (in whisperx/transcribe.py) @@ -42,20 +48,24 @@ } -def get_alignment_model(alignment_model_name, language, force = False): +def get_alignment_model(alignment_model_name, language, force=False): if alignment_model_name in ["wav2vec", "wav2vec2"]: if language is None: - # Will load alignment model on the fly depending on detected language + # Will load alignment model on the fly depending + # on detected language return {} elif language in ALIGNMENT_MODELS: return ALIGNMENT_MODELS[language] elif force: - raise ValueError(f"No wav2vec alignment model for language '{language}'.") + raise ValueError( + f"No wav2vec alignment model for language '{language}'.") else: - logger.warn(f"No wav2vec alignment model for language '{language}'. Fallback to English.") + logger.warn( + f"No wav2vec alignment model for language '{language}'. Fallback to English." 
+ ) return ALIGNMENT_MODELS["en"] - elif alignment_model_name in whisper.tokenizer.LANGUAGES.keys(): - return get_alignment_model("wav2vec", alignment_model_name, force = True) + elif alignment_model_name in LANGUAGES.keys(): + return get_alignment_model("wav2vec", alignment_model_name, force=True) return alignment_model_name @@ -63,11 +73,27 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): start = time.time() - model = whisper.load_model(model_type_or_file, device=device, - download_root=os.path.join(download_root, "whisper")) + if USE_CTRANSLATE2: + if not os.path.isdir(model_type_or_file): + # To specify the cache directory + model_type_or_file = whisper.utils.download_model( + model_type_or_file, + output_dir=os.path.join(download_root, "huggingface/hub") + ) + model = whisper.WhisperModel(model_type_or_file, device=device, + # vvv TODO + compute_type="default", + # cpu_threads=0, + # num_workers=1, + ) - model.eval() - model.requires_grad_(False) + else: + model = whisper.load_model( + model_type_or_file, device=device, + download_root=os.path.join(download_root, "whisper") + ) + model.eval() + model.requires_grad_(False) logger.info("Whisper Model loaded. (t={}s)".format(time.time() - start)) @@ -76,21 +102,29 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): def load_alignment_model(source, device="cpu", download_root="/opt"): + if not USE_TORCH: + raise NotImplementedError( + "Alignement model not available without Torch") + start = time.time() if source in torchaudio.pipelines.__all__: - model = load_torchaudio_model(source, device=device, download_root=download_root) + model = load_torchaudio_model( + source, device=device, download_root=download_root) else: try: - model = load_transformers_model(source, device=device, download_root=download_root) + model = load_transformers_model( + source, device=device, download_root=download_root) except Exception as err1: try: - model = load_speechbrain_model(source, device=device, download_root=download_root) + model = load_speechbrain_model( + source, device=device, download_root=download_root) except Exception as err2: raise Exception( f"Failed to load alignment model:\n<<< transformers <<<\n{str(err1)}\n<<< speechbrain <<<\n{str(err2)}") from err2 - logger.info(f"Alignment Model of type {get_model_type(model)} loaded. (t={time.time() - start}s)") + logger.info( + f"Alignment Model of type {get_model_type(model)} loaded. (t={time.time() - start}s)") return model diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index da2675e..a4037bd 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -2,7 +2,6 @@ import re # import string import unicodedata -from num2words import num2words from stt import logger from .utils import flatten @@ -44,7 +43,6 @@ def remove_emoji(text): def normalize_text(text: str, lang: str) -> str: """ Transform digits into characters... """ - # Reorder currencies (1,20€ -> 1 € 20) coma = "," if lang in ["fr"] else "\." 
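As a side note on the currency reordering mentioned in the comment just above, the idea can be illustrated with a rough standalone sketch (an approximation for French text with a decimal comma, not the project's actual normalize_text implementation, which covers several currencies and languages):

# Illustrative approximation of the "1,20€ -> 1 € 20" reordering described above.
import re

def reorder_currency(text, coma=","):
    # Move the currency symbol between the integer and decimal parts.
    return re.sub(rf"(\d+){coma}(\d+) *€", r"\1 € \2", text)

print(reorder_currency("ça coûte 1,20€"))  # -> "ça coûte 1 € 20"
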
for c in _currencies: @@ -215,6 +213,7 @@ def robust_num2words(x, lang, to="cardinal", orig=""): """ Bugfix for num2words """ + from num2words import num2words try: res = num2words(x, lang=lang, to=to) if lang == "fr" and to == "ordinal": diff --git a/stt/processing/utils.py b/stt/processing/utils.py index 5ff706a..e8b2bd8 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -1,17 +1,42 @@ +from stt import USE_CTRANSLATE2, USE_TORCH + import io import wavio import os import numpy as np -import torch -import torchaudio -import whisper +SAMPLE_RATE = 16000 # whisper.audio.SAMPLE_RATE + +if USE_CTRANSLATE2: + import ctranslate2 + import faster_whisper +else: + import torch + import torchaudio + import whisper + +def has_cuda(): + if USE_CTRANSLATE2: + return ctranslate2.get_cuda_device_count() > 0 + else: + return torch.cuda.is_available() + +def get_device(): + device = os.environ.get("DEVICE", "cuda" if has_cuda() else "cpu") + use_gpu = "cuda" in device + if not USE_CTRANSLATE2: + try: + device = torch.device(device) + except Exception as err: + raise Exception("Failed to set device: {}".format(str(err))) from err + return device, use_gpu def conform_audio(audio, sample_rate=16_000): - if sample_rate != whisper.audio.SAMPLE_RATE: + if sample_rate != SAMPLE_RATE: + if not USE_TORCH: + raise NotImplementedError("Resampling not available without Torch") # Down or Up sample to the right sampling rate - audio = torchaudio.transforms.Resample( - sample_rate, whisper.audio.SAMPLE_RATE)(audio) + audio = torchaudio.transforms.Resample(sample_rate, SAMPLE_RATE)(audio) if audio.shape[0] > 1: # Stereo to mono # audio = torchaudio.transforms.DownmixMono()(audio, channels_first = True) @@ -26,8 +51,8 @@ def load_audiofile(path): raise RuntimeError("File not found: %s" % path) elif not os.access(path, os.R_OK): raise RuntimeError("Missing reading permission for: %s" % path) - # audio, sample_rate = torchaudio.load(path) - # return conform_audio(audio, sample_rate) + if USE_CTRANSLATE2: + return faster_whisper.decode_audio(path, sampling_rate=SAMPLE_RATE) audio = whisper.load_audio(path) audio = torch.from_numpy(audio) return audio @@ -36,10 +61,13 @@ def load_audiofile(path): def load_wave_buffer(file_buffer): """ Formats audio from a wavFile buffer to a torch array for processing. 
""" file_buffer_io = io.BytesIO(file_buffer) + if USE_CTRANSLATE2: + return faster_whisper.decode_audio(file_buffer_io, sampling_rate=SAMPLE_RATE) file_content = wavio.read(file_buffer_io) sample_rate = file_content.rate - audio = torch.from_numpy(file_content.data.astype(np.float32)/32768) - audio = audio.transpose(0, 1) + audio = file_content.data.astype(np.float32)/32768 + audio = audio.transpose() + audio = torch.from_numpy(audio) return conform_audio(audio, sample_rate) @@ -48,3 +76,105 @@ def flatten(l): flatten a list of lists """ return [item for sublist in l for item in sublist] + +LANGUAGES = { # whisper.tokenizer.LANGUAGES + 'en': 'english', + 'zh': 'chinese', + 'de': 'german', + 'es': 'spanish', + 'ru': 'russian', + 'ko': 'korean', + 'fr': 'french', + 'ja': 'japanese', + 'pt': 'portuguese', + 'tr': 'turkish', + 'pl': 'polish', + 'ca': 'catalan', + 'nl': 'dutch', + 'ar': 'arabic', + 'sv': 'swedish', + 'it': 'italian', + 'id': 'indonesian', + 'hi': 'hindi', + 'fi': 'finnish', + 'vi': 'vietnamese', + 'he': 'hebrew', + 'uk': 'ukrainian', + 'el': 'greek', + 'ms': 'malay', + 'cs': 'czech', + 'ro': 'romanian', + 'da': 'danish', + 'hu': 'hungarian', + 'ta': 'tamil', + 'no': 'norwegian', + 'th': 'thai', + 'ur': 'urdu', + 'hr': 'croatian', + 'bg': 'bulgarian', + 'lt': 'lithuanian', + 'la': 'latin', + 'mi': 'maori', + 'ml': 'malayalam', + 'cy': 'welsh', + 'sk': 'slovak', + 'te': 'telugu', + 'fa': 'persian', + 'lv': 'latvian', + 'bn': 'bengali', + 'sr': 'serbian', + 'az': 'azerbaijani', + 'sl': 'slovenian', + 'kn': 'kannada', + 'et': 'estonian', + 'mk': 'macedonian', + 'br': 'breton', + 'eu': 'basque', + 'is': 'icelandic', + 'hy': 'armenian', + 'ne': 'nepali', + 'mn': 'mongolian', + 'bs': 'bosnian', + 'kk': 'kazakh', + 'sq': 'albanian', + 'sw': 'swahili', + 'gl': 'galician', + 'mr': 'marathi', + 'pa': 'punjabi', + 'si': 'sinhala', + 'km': 'khmer', + 'sn': 'shona', + 'yo': 'yoruba', + 'so': 'somali', + 'af': 'afrikaans', + 'oc': 'occitan', + 'ka': 'georgian', + 'be': 'belarusian', + 'tg': 'tajik', + 'sd': 'sindhi', + 'gu': 'gujarati', + 'am': 'amharic', + 'yi': 'yiddish', + 'lo': 'lao', + 'uz': 'uzbek', + 'fo': 'faroese', + 'ht': 'haitian creole', + 'ps': 'pashto', + 'tk': 'turkmen', + 'nn': 'nynorsk', + 'mt': 'maltese', + 'sa': 'sanskrit', + 'lb': 'luxembourgish', + 'my': 'myanmar', + 'bo': 'tibetan', + 'tl': 'tagalog', + 'mg': 'malagasy', + 'as': 'assamese', + 'tt': 'tatar', + 'haw': 'hawaiian', + 'ln': 'lingala', + 'ha': 'hausa', + 'ba': 'bashkir', + 'jw': 'javanese', + 'su': 'sundanese' +} diff --git a/stt/processing/word_alignment.py b/stt/processing/word_alignment.py index ba94a14..229fb43 100644 --- a/stt/processing/word_alignment.py +++ b/stt/processing/word_alignment.py @@ -1,14 +1,16 @@ """ -source: https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html +Credits: https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html """ +from stt import logger, USE_TORCH from dataclasses import dataclass -import torch -from stt import logger from .alignment_model import compute_logprobas, get_vocab from .utils import flatten from .text_normalize import transliterate +if USE_TORCH: + import torch + _unknown_chars = [] def compute_alignment(audio, transcript, model): From 888227ea274b18719318c4027b008794d68a0f06 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 12 Apr 2023 12:49:51 +0200 Subject: [PATCH 122/172] better format in logger timing --- http_server/ingress.py | 6 +----- stt/__init__.py | 4 ++-- 2 files changed, 3 
insertions(+), 7 deletions(-) diff --git a/http_server/ingress.py b/http_server/ingress.py index fae8c2f..d5524ca 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -17,10 +17,6 @@ app.config["JSON_AS_ASCII"] = False app.config["JSON_SORT_KEYS"] = False -logging.basicConfig( - format="%(asctime)s %(name)s %(levelname)s: %(message)s", - datefmt="%d/%m/%Y %H:%M:%S", -) logger = logging.getLogger("__stt-standalone-worker__") @@ -37,7 +33,7 @@ def oas_docs(): @app.route("/transcribe", methods=["POST"]) def transcribe(): try: - logger.info(f"Transcribe request received {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))}") + logger.info(f"Transcribe request received") # get response content type # logger.debug(request.headers.get("accept").lower()) diff --git a/stt/__init__.py b/stt/__init__.py index 43c2725..5460088 100644 --- a/stt/__init__.py +++ b/stt/__init__.py @@ -1,8 +1,8 @@ import logging logging.basicConfig( - format="%(asctime)s %(name)s %(levelname)s: %(message)s", - datefmt="%d/%m/%Y %H:%M:%S", + format="[%(asctime)s,%(msecs)03d %(name)s] %(levelname)s: %(message)s", + datefmt="%Y-%m-%d %H:%M:%S", ) logger = logging.getLogger("__stt__") From ff1bf622267b14b15d99c05c6e7ca0e665bff86d Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 12 Apr 2023 13:08:37 +0200 Subject: [PATCH 123/172] reorganize code: move alignment model related stuff in alignment_model.py --- stt/processing/__init__.py | 3 +- stt/processing/alignment_model.py | 182 +++++++++++++++++++++++++++++- stt/processing/decoding.py | 4 +- stt/processing/load_model.py | 180 +---------------------------- 4 files changed, 185 insertions(+), 184 deletions(-) diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 86d6fde..f891984 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -5,7 +5,8 @@ from .decoding import decode, get_language from .utils import get_device, LANGUAGES, load_wave_buffer, load_audiofile -from .load_model import load_whisper_model, load_alignment_model, get_alignment_model +from .load_model import load_whisper_model +from .alignment_model import load_alignment_model, get_alignment_model __all__ = ["logger", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] diff --git a/stt/processing/alignment_model.py b/stt/processing/alignment_model.py index f026f7a..08a5e45 100644 --- a/stt/processing/alignment_model.py +++ b/stt/processing/alignment_model.py @@ -1,15 +1,191 @@ from stt import logger, USE_TORCH -from .utils import SAMPLE_RATE -from .load_model import get_model_type +from .utils import SAMPLE_RATE, LANGUAGES +import os import math +import time +import requests if USE_TORCH: import torch import torch.nn.utils.rnn as rnn_utils + import huggingface_hub + import speechbrain as sb + import transformers + import torchaudio ################################################################################ -# Get list of labes (and blank_id) from model +# Load models + +# Sources: +# * https://github.com/m-bain/whisperX (in whisperx/transcribe.py) +# * https://pytorch.org/audio/stable/pipelines.html +# * https://huggingface.co/jonatasgrosman + +ALIGNMENT_MODELS = { + "en": "WAV2VEC2_ASR_BASE_960H", + # "en": "jonatasgrosman/wav2vec2-large-xlsr-53-english", + "fr": "VOXPOPULI_ASR_BASE_10K_FR", + # "fr": "jonatasgrosman/wav2vec2-large-xlsr-53-french", + "de": "VOXPOPULI_ASR_BASE_10K_DE", + # "de": "jonatasgrosman/wav2vec2-large-xlsr-53-german", + "es": "VOXPOPULI_ASR_BASE_10K_ES", + # "it": 
"jonatasgrosman/wav2vec2-large-xlsr-53-spanish", + "it": "VOXPOPULI_ASR_BASE_10K_IT", + # "it": "jonatasgrosman/wav2vec2-large-xlsr-53-italian", + "pt": "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese", + "nl": "jonatasgrosman/wav2vec2-large-xlsr-53-dutch", + "pl": "jonatasgrosman/wav2vec2-large-xlsr-53-polish", + "fi": "jonatasgrosman/wav2vec2-large-xlsr-53-finnish", + "hu": "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian", + "el": "jonatasgrosman/wav2vec2-large-xlsr-53-greek", + "fa": "jonatasgrosman/wav2vec2-large-xlsr-53-persian", + "ar": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic", + "ru": "jonatasgrosman/wav2vec2-large-xlsr-53-russian", + "uk": "Yehor/wav2vec2-xls-r-300m-uk-with-small-lm", + "ja": "jonatasgrosman/wav2vec2-large-xlsr-53-japanese", + "zh": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn", + "vi": "nguyenvulebinh/wav2vec2-base-vietnamese-250h", +} + + +def get_alignment_model(alignment_model_name, language, force=False): + if alignment_model_name in ["wav2vec", "wav2vec2"]: + if language is None: + # Will load alignment model on the fly depending + # on detected language + return {} + elif language in ALIGNMENT_MODELS: + return ALIGNMENT_MODELS[language] + elif force: + raise ValueError( + f"No wav2vec alignment model for language '{language}'.") + else: + logger.warn( + f"No wav2vec alignment model for language '{language}'. Fallback to English." + ) + return ALIGNMENT_MODELS["en"] + elif alignment_model_name in LANGUAGES.keys(): + return get_alignment_model("wav2vec", alignment_model_name, force=True) + return alignment_model_name + +def load_alignment_model(source, device="cpu", download_root="/opt"): + + if not USE_TORCH: + raise NotImplementedError( + "Alignement model not available without Torch") + + start = time.time() + + if source in torchaudio.pipelines.__all__: + model = load_torchaudio_model( + source, device=device, download_root=download_root) + else: + try: + model = load_transformers_model( + source, device=device, download_root=download_root) + except Exception as err1: + try: + model = load_speechbrain_model( + source, device=device, download_root=download_root) + except Exception as err2: + raise Exception( + f"Failed to load alignment model:\n<<< transformers <<<\n{str(err1)}\n<<< speechbrain <<<\n{str(err2)}") from err2 + + logger.info( + f"Alignment Model of type {get_model_type(model)} loaded. 
(t={time.time() - start}s)") + + return model + + +def load_speechbrain_model(source, device="cpu", download_root="/opt"): + + if os.path.isdir(source): + yaml_file = os.path.join(source, "hyperparams.yaml") + assert os.path.isfile( + yaml_file), f"Hyperparams file {yaml_file} not found" + else: + try: + yaml_file = huggingface_hub.hf_hub_download( + repo_id=source, filename="hyperparams.yaml", cache_dir=os.path.join(download_root, "huggingface/hub")) + except requests.exceptions.HTTPError: + yaml_file = None + overrides = make_yaml_overrides( + yaml_file, {"save_path": os.path.join(download_root, "speechbrain")}) + + savedir = os.path.join(download_root, "speechbrain") + try: + model = sb.pretrained.EncoderASR.from_hparams( + source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides) + except ValueError: + model = sb.pretrained.EncoderDecoderASR.from_hparams( + source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides) + + model.train(False) + model.requires_grad_(False) + return model + + +def load_transformers_model(source, device="cpu", download_root="/opt"): + + model = transformers.Wav2Vec2ForCTC.from_pretrained(source).to(device) + processor = transformers.Wav2Vec2Processor.from_pretrained(source) + + model.eval() + model.requires_grad_(False) + return model, processor + + +def load_torchaudio_model(source, device="cpu", download_root="/opt"): + + bundle = torchaudio.pipelines.__dict__[source] + model = bundle.get_model().to(device) + labels = bundle.get_labels() + + model.eval() + model.requires_grad_(False) + return model, labels + + +def get_model_type(model): + if not isinstance(model, tuple): + return "speechbrain" + assert len(model) == 2, "Invalid model type" + if isinstance(model[0], transformers.Wav2Vec2ForCTC): + return "transformers" + return "torchaudio" + + +def make_yaml_overrides(yaml_file, key_values): + """ + return a dictionary of overrides to be used with speechbrain (hyperyaml files) + yaml_file: path to yaml file + key_values: dict of key values to override + """ + if yaml_file is None: + return None + + override = {} + with open(yaml_file, "r") as f: + parent = None + for line in f: + if line.strip() == "": + parent = None + elif line == line.lstrip(): + if ":" in line: + parent = line.split(":")[0].strip() + if parent in key_values: + override[parent] = key_values[parent] + elif ":" in line: + child = line.strip().split(":")[0].strip() + if child in key_values: + override[parent] = override.get(parent, {}) | { + child: key_values[child]} + return override + + +################################################################################ +# Get list of labels (and blank_id) from model def get_vocab(model): diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 77b8881..f6d00e5 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -5,8 +5,8 @@ from stt import logger, USE_CTRANSLATE2 from .utils import SAMPLE_RATE -from .load_model import load_alignment_model, get_alignment_model from .text_normalize import remove_punctuation, normalize_text, remove_emoji, _punctuations +from .alignment_model import get_alignment_model, load_alignment_model from .word_alignment import compute_alignment if not USE_CTRANSLATE2: @@ -312,4 +312,4 @@ def checked_timestamps(start, end=None): "confidence": round(np.exp(np.mean([segment.avg_log_prob for segment in segments])), 2), "segments": segments_list, } - return format_whisper_timestamped_response(transcription, 
remove_punctuation_from_words=remove_punctuation_from_words) \ No newline at end of file + return format_whisper_timestamped_response(transcription, remove_punctuation_from_words=remove_punctuation_from_words) diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index dc2e9fc..4ba9f0e 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -1,74 +1,13 @@ import os -import requests import time -from stt import logger, USE_CTRANSLATE2, USE_TORCH -from .utils import LANGUAGES +from stt import logger, USE_CTRANSLATE2 if USE_CTRANSLATE2: import faster_whisper as whisper else: import whisper_timestamped as whisper -if USE_TORCH: - import huggingface_hub - import speechbrain as sb - import transformers - import torchaudio - -# Sources: -# * https://github.com/m-bain/whisperX (in whisperx/transcribe.py) -# * https://pytorch.org/audio/stable/pipelines.html -# * https://huggingface.co/jonatasgrosman - -ALIGNMENT_MODELS = { - "en": "WAV2VEC2_ASR_BASE_960H", - # "en": "jonatasgrosman/wav2vec2-large-xlsr-53-english", - "fr": "VOXPOPULI_ASR_BASE_10K_FR", - # "fr": "jonatasgrosman/wav2vec2-large-xlsr-53-french", - "de": "VOXPOPULI_ASR_BASE_10K_DE", - # "de": "jonatasgrosman/wav2vec2-large-xlsr-53-german", - "es": "VOXPOPULI_ASR_BASE_10K_ES", - # "it": "jonatasgrosman/wav2vec2-large-xlsr-53-spanish", - "it": "VOXPOPULI_ASR_BASE_10K_IT", - # "it": "jonatasgrosman/wav2vec2-large-xlsr-53-italian", - "pt": "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese", - "nl": "jonatasgrosman/wav2vec2-large-xlsr-53-dutch", - "pl": "jonatasgrosman/wav2vec2-large-xlsr-53-polish", - "fi": "jonatasgrosman/wav2vec2-large-xlsr-53-finnish", - "hu": "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian", - "el": "jonatasgrosman/wav2vec2-large-xlsr-53-greek", - "fa": "jonatasgrosman/wav2vec2-large-xlsr-53-persian", - "ar": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic", - "ru": "jonatasgrosman/wav2vec2-large-xlsr-53-russian", - "uk": "Yehor/wav2vec2-xls-r-300m-uk-with-small-lm", - "ja": "jonatasgrosman/wav2vec2-large-xlsr-53-japanese", - "zh": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn", - "vi": "nguyenvulebinh/wav2vec2-base-vietnamese-250h", -} - - -def get_alignment_model(alignment_model_name, language, force=False): - if alignment_model_name in ["wav2vec", "wav2vec2"]: - if language is None: - # Will load alignment model on the fly depending - # on detected language - return {} - elif language in ALIGNMENT_MODELS: - return ALIGNMENT_MODELS[language] - elif force: - raise ValueError( - f"No wav2vec alignment model for language '{language}'.") - else: - logger.warn( - f"No wav2vec alignment model for language '{language}'. Fallback to English." - ) - return ALIGNMENT_MODELS["en"] - elif alignment_model_name in LANGUAGES.keys(): - return get_alignment_model("wav2vec", alignment_model_name, force=True) - return alignment_model_name - - def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): start = time.time() @@ -97,119 +36,4 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): logger.info("Whisper Model loaded. 
(t={}s)".format(time.time() - start)) - return model - - -def load_alignment_model(source, device="cpu", download_root="/opt"): - - if not USE_TORCH: - raise NotImplementedError( - "Alignement model not available without Torch") - - start = time.time() - - if source in torchaudio.pipelines.__all__: - model = load_torchaudio_model( - source, device=device, download_root=download_root) - else: - try: - model = load_transformers_model( - source, device=device, download_root=download_root) - except Exception as err1: - try: - model = load_speechbrain_model( - source, device=device, download_root=download_root) - except Exception as err2: - raise Exception( - f"Failed to load alignment model:\n<<< transformers <<<\n{str(err1)}\n<<< speechbrain <<<\n{str(err2)}") from err2 - - logger.info( - f"Alignment Model of type {get_model_type(model)} loaded. (t={time.time() - start}s)") - - return model - - -def load_speechbrain_model(source, device="cpu", download_root="/opt"): - - if os.path.isdir(source): - yaml_file = os.path.join(source, "hyperparams.yaml") - assert os.path.isfile( - yaml_file), f"Hyperparams file {yaml_file} not found" - else: - try: - yaml_file = huggingface_hub.hf_hub_download( - repo_id=source, filename="hyperparams.yaml", cache_dir=os.path.join(download_root, "huggingface/hub")) - except requests.exceptions.HTTPError: - yaml_file = None - overrides = make_yaml_overrides( - yaml_file, {"save_path": os.path.join(download_root, "speechbrain")}) - - savedir = os.path.join(download_root, "speechbrain") - try: - model = sb.pretrained.EncoderASR.from_hparams( - source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides) - except ValueError: - model = sb.pretrained.EncoderDecoderASR.from_hparams( - source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides) - - model.train(False) - model.requires_grad_(False) - return model - - -def load_transformers_model(source, device="cpu", download_root="/opt"): - - model = transformers.Wav2Vec2ForCTC.from_pretrained(source).to(device) - processor = transformers.Wav2Vec2Processor.from_pretrained(source) - - model.eval() - model.requires_grad_(False) - return model, processor - - -def load_torchaudio_model(source, device="cpu", download_root="/opt"): - - bundle = torchaudio.pipelines.__dict__[source] - model = bundle.get_model().to(device) - labels = bundle.get_labels() - - model.eval() - model.requires_grad_(False) - return model, labels - - -def get_model_type(model): - if not isinstance(model, tuple): - return "speechbrain" - assert len(model) == 2, "Invalid model type" - if isinstance(model[0], transformers.Wav2Vec2ForCTC): - return "transformers" - return "torchaudio" - - -def make_yaml_overrides(yaml_file, key_values): - """ - return a dictionary of overrides to be used with speechbrain (hyperyaml files) - yaml_file: path to yaml file - key_values: dict of key values to override - """ - if yaml_file is None: - return None - - override = {} - with open(yaml_file, "r") as f: - parent = None - for line in f: - if line.strip() == "": - parent = None - elif line == line.lstrip(): - if ":" in line: - parent = line.split(":")[0].strip() - if parent in key_values: - override[parent] = key_values[parent] - elif ":" in line: - child = line.strip().split(":")[0].strip() - if child in key_values: - override[parent] = override.get(parent, {}) | { - child: key_values[child]} - return override + return model \ No newline at end of file From 16b5743960b3ece59bb69fe8c0f902fa7373f04c Mon Sep 17 00:00:00 2001 From: 
Jeronymous Date: Wed, 12 Apr 2023 18:32:51 +0200 Subject: [PATCH 124/172] cosm --- stt/processing/load_model.py | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 4ba9f0e..14ec3e2 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -19,11 +19,11 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): model_type_or_file, output_dir=os.path.join(download_root, "huggingface/hub") ) - model = whisper.WhisperModel(model_type_or_file, device=device, - # vvv TODO + model = whisper.WhisperModel(model_type_or_file, + device=device, compute_type="default", - # cpu_threads=0, - # num_workers=1, + cpu_threads=0, # Can be controled with OMP_NUM_THREADS + num_workers=1, ) else: From 916e1292454a9967023f8968f0cf19b2522fcbe1 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 12 Apr 2023 18:34:35 +0200 Subject: [PATCH 125/172] log processing time at a common place --- http_server/ingress.py | 3 +-- stt/processing/decoding.py | 12 +++++++++--- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/http_server/ingress.py b/http_server/ingress.py index d5524ca..bab7b65 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -50,13 +50,12 @@ def transcribe(): raise ValueError(f"No audio file was uploaded (missing 'file' key)") file_buffer = request.files["file"].read() - start_t = time.time() + audio_data = load_wave_buffer(file_buffer) # Transcription transcription = decode( audio_data, model, alignment_model, join_metadata) - logger.debug("Transcription complete (t={}s)".format(time.time() - start_t)) if join_metadata: return json.dumps(transcription, ensure_ascii=False), 200 diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index f6d00e5..6572053 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -1,5 +1,5 @@ import os - +import time import numpy as np import copy @@ -49,11 +49,17 @@ def decode(audio, logger.info(f"Transcribing audio with language {language}...") + start_t = time.time() + if USE_CTRANSLATE2: kwargs.pop("alignment_model") - return decode_ct2(**kwargs) + res = decode_ct2(**kwargs) else: - return decode_torch(**kwargs) + res = decode_torch(**kwargs) + + logger.info("Transcription complete (t={}s)".format(time.time() - start_t)) + + return res def decode_ct2(audio, From 182443be94e92cb3707d65211c8cd5b4593034ac Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 13 Apr 2023 11:22:33 +0200 Subject: [PATCH 126/172] Simplify Dockerfile --- Dockerfile.ctranslate2 | 24 ++---------------------- Dockerfile.ctranslate2.cpu | 23 +++++++++++++++++++++++ Dockerfile.torch | 22 +--------------------- Dockerfile.torch.cpu | 22 +--------------------- 4 files changed, 27 insertions(+), 64 deletions(-) create mode 100644 Dockerfile.ctranslate2.cpu diff --git a/Dockerfile.ctranslate2 b/Dockerfile.ctranslate2 index 5989b3f..e2e0008 100644 --- a/Dockerfile.ctranslate2 +++ b/Dockerfile.ctranslate2 @@ -1,27 +1,7 @@ -FROM python:3.9 +FROM ghcr.io/opennmt/ctranslate2:latest-ubuntu20.04-cuda11.2 LABEL maintainer="jlouradour@linagora.com" -RUN apt-get update && \ - apt-get install -y --no-install-recommends \ - wget \ - nano \ - bzip2 \ - unzip \ - xz-utils \ - sox \ - ffmpeg \ - g++ \ - make \ - cmake \ - git \ - zlib1g-dev \ - automake \ - autoconf \ - libtool \ - pkg-config \ - ca-certificates - -RUN rm -rf /var/lib/apt/lists/* +RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg # Install 
python dependencies COPY requirements.ctranslate2.txt ./ diff --git a/Dockerfile.ctranslate2.cpu b/Dockerfile.ctranslate2.cpu new file mode 100644 index 0000000..46c148e --- /dev/null +++ b/Dockerfile.ctranslate2.cpu @@ -0,0 +1,23 @@ +FROM python:3.9 +LABEL maintainer="jlouradour@linagora.com" + +RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg + +# Install python dependencies +COPY requirements.ctranslate2.txt ./ +RUN pip install --no-cache-dir -r requirements.ctranslate2.txt && rm requirements.ctranslate2.txt + +WORKDIR /usr/src/app + +COPY stt /usr/src/app/stt +COPY celery_app /usr/src/app/celery_app +COPY http_server /usr/src/app/http_server +COPY websocket /usr/src/app/websocket +COPY document /usr/src/app/document +COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ + +ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" + +HEALTHCHECK CMD ./healthcheck.sh + +ENTRYPOINT ["./docker-entrypoint.sh"] \ No newline at end of file diff --git a/Dockerfile.torch b/Dockerfile.torch index 9db2a58..37480c0 100644 --- a/Dockerfile.torch +++ b/Dockerfile.torch @@ -1,27 +1,7 @@ FROM python:3.9 LABEL maintainer="jlouradour@linagora.com" -RUN apt-get update && \ - apt-get install -y --no-install-recommends \ - wget \ - nano \ - bzip2 \ - unzip \ - xz-utils \ - sox \ - ffmpeg \ - g++ \ - make \ - cmake \ - git \ - zlib1g-dev \ - automake \ - autoconf \ - libtool \ - pkg-config \ - ca-certificates - -RUN rm -rf /var/lib/apt/lists/* +RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg # Install python dependencies COPY requirements.torch.txt ./ diff --git a/Dockerfile.torch.cpu b/Dockerfile.torch.cpu index 68ceda1..72582b6 100644 --- a/Dockerfile.torch.cpu +++ b/Dockerfile.torch.cpu @@ -1,27 +1,7 @@ FROM python:3.9 LABEL maintainer="jlouradour@linagora.com" -RUN apt-get update && \ - apt-get install -y --no-install-recommends \ - wget \ - nano \ - bzip2 \ - unzip \ - xz-utils \ - sox \ - ffmpeg \ - g++ \ - make \ - cmake \ - git \ - zlib1g-dev \ - automake \ - autoconf \ - libtool \ - pkg-config \ - ca-certificates - -RUN rm -rf /var/lib/apt/lists/* +RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg # Force CPU versions of torch RUN pip3 install \ From 51ffca54cf8db6f33cc8921737bac435e250e429 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 13 Apr 2023 11:23:18 +0200 Subject: [PATCH 127/172] tune and document default .env file --- .envdefault | 45 +++++++++++++++++++++++++++++++-------------- 1 file changed, 31 insertions(+), 14 deletions(-) diff --git a/.envdefault b/.envdefault index d4bb2e8..1dbc2b1 100644 --- a/.envdefault +++ b/.envdefault @@ -1,22 +1,39 @@ +############################################ # SERVING PARAMETERS +############################################ +# "http" or "task" SERVICE_MODE=http -MODEL=/opt/model.pt -# LANGUAGE can be in different formats: en, en-US, English, ... -# If not set or "*", the language will be detected automatically. +# Below: used when SERVICE_MODE=task +SERVICE_NAME=stt +SERVICES_BROKER=redis://172.17.0.1:6379 +BROKER_PASS= + +############################################ +# STT MODELING PARAMETERS +############################################ + +# The model can be a path to a model, or a model name ("tiny", "base", "small", "medium", "large-v1" or "large-v2") +MODEL=medium + +# The language can be in different formats: "en", "en-US", "English", ... +# If not set or set to "*", the language will be detected automatically. 
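The accepted LANGUAGE formats documented above are normalized once at startup; the following is a condensed standalone sketch of that resolution logic (mirroring get_language() in stt/processing/utils.py, with the language table truncated for brevity):

# Condensed sketch of how the LANGUAGE setting is resolved.
LANGUAGES = {"en": "english", "fr": "french"}  # truncated for the example

def resolve_language(value):
    if len(value) > 2 and value[2] == "-":      # "en-US" -> "en"
        value = value.split("-")[0]
    if value == "*":                            # "*" or unset -> automatic detection
        return None
    if value not in LANGUAGES:                  # "English"/"French" -> ISO 639-1 code
        value = {v: k for k, v in LANGUAGES.items()}.get(value.lower(), value)
    return value

print(resolve_language("en-US"), resolve_language("French"), resolve_language("*"))  # en fr None
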
LANGUAGE=* -#DEVICE=cuda:0 +# An alignment wav2vec model can be used to get word timestamps. +# It can be a path to a model, a language code (fr, en, ...), or "wav2vec" to automatically chose a model for the language +# This option is experimental (and not implemented with ctranslate2). +# ALIGNMENT_MODEL=wav2vec -# Only used for alignement using wav2vec models -#ALIGNMENT_MODEL=fr -#ALIGNMENT_MODEL=wav2vec -#ALIGNMENT_MODEL=/opt/alignment_model +############################################ +# EFFICIENCY PARAMETERS +############################################ -# TASK PARAMETERS -SERVICE_NAME=stt -SERVICES_BROKER=redis://192.168.0.1:6379 -BROKER_PASS=password +# Device to use. It can be "cuda" to force/check GPU, "cpu" to force computation on CPU, or a specific GPU ("cuda:0", "cuda:1", ...) +# DEVICE=cuda:0 + +# Number of threads per worker when running on CPU +OMP_NUM_THREADS=4 -# CONCURRENCY -CONCURRENCY=2 \ No newline at end of file +# Number of workers +CONCURRENCY=2 From 0d47486708b0866c55d695a72efa3781808442d2 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 13 Apr 2023 12:39:27 +0200 Subject: [PATCH 128/172] use --pool=solo option in celery on GPU to avoid CUDA initialization error --- docker-entrypoint.sh | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh index 5014d8f..97a3804 100755 --- a/docker-entrypoint.sh +++ b/docker-entrypoint.sh @@ -1,5 +1,5 @@ #!/bin/bash -set -ea +set -a echo "RUNNING STT" @@ -20,7 +20,7 @@ else if [ "$SERVICE_MODE" = "http" ] then echo "RUNNING STT HTTP SERVER" - python http_server/ingress.py --debug + python3 http_server/ingress.py --debug elif [ "$SERVICE_MODE" == "task" ] then if [[ -z "$SERVICES_BROKER" ]] @@ -28,12 +28,22 @@ else echo "ERROR: SERVICES_BROKER variable not specified, cannot start celery worker." exit -1 fi - /usr/src/app/wait-for-it.sh $(echo $SERVICES_BROKER | cut -d'/' -f 3) --timeout=20 --strict -- echo " $SERVICES_BROKER (Service Broker) is up" + nvidia-smi 2> /dev/null > /dev/null + if [ $? 
-eq 0 ];then + echo "GPU detected" + GPU=1 + OPT="--pool=solo" + else + echo "No GPU detected" + GPU=0 + OPT="" + fi + /usr/src/app/wait-for-it.sh $(echo $SERVICES_BROKER | cut -d'/' -f 3) --timeout=20 --strict -- echo " $SERVICES_BROKER (Service Broker) is up" || exit 1 echo "RUNNING STT CELERY WORKER" - celery --app=celery_app.celeryapp worker -Ofair --queues=${SERVICE_NAME} -c ${CONCURRENCY} -n ${SERVICE_NAME}_worker@%h + celery --app=celery_app.celeryapp worker $OPT -Ofair --queues=${SERVICE_NAME} -c ${CONCURRENCY} -n ${SERVICE_NAME}_worker@%h else - echo "ERROR: Wrong serving command: $1" + echo "ERROR: Wrong serving command: $SERVICE_MODE" exit -1 fi fi From e1c7ecda643cd2b6d04d001052c37329927b7cf3 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 13 Apr 2023 12:40:27 +0200 Subject: [PATCH 129/172] Lazy loading of the model (to avoid deadlocks on multithreaded processes) + misc: - do not necessarily use torchaudio - clarify cache folder business with faster_whisper - little fixes - move get_language into utils.py --- http_server/ingress.py | 5 ++-- requirements.ctranslate2.txt | 1 + requirements.torch.txt | 1 + stt/__init__.py | 19 +++++++----- stt/processing/__init__.py | 35 ++++++++++++++-------- stt/processing/alignment_model.py | 18 ++++++++---- stt/processing/decoding.py | 32 +++++++------------- stt/processing/load_model.py | 49 ++++++++++++++++++++++--------- stt/processing/utils.py | 33 ++++++++++++++++++--- 9 files changed, 126 insertions(+), 67 deletions(-) diff --git a/http_server/ingress.py b/http_server/ingress.py index bab7b65..b55bb03 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -9,9 +9,8 @@ from serving import GunicornServing, GeventServing from swagger import setupSwaggerUI -from stt.processing import decode, load_wave_buffer, model, alignment_model +from stt.processing import decode, load_wave_buffer, model, alignment_model, use_gpu from stt import logger as stt_logger -from stt import SHOULD_USE_GEVENT app = Flask("__stt-standalone-worker__") app.config["JSON_AS_ASCII"] = False @@ -102,7 +101,7 @@ def server_error(error): logger.info(f"Using {args.workers} workers") - if SHOULD_USE_GEVENT: # TODO: get rid of this + if use_gpu: # TODO: get rid of this? 
serving_type = GeventServing logger.debug("Serving with gevent") else: diff --git a/requirements.ctranslate2.txt b/requirements.ctranslate2.txt index 2cf4b8d..84547ac 100644 --- a/requirements.ctranslate2.txt +++ b/requirements.ctranslate2.txt @@ -5,6 +5,7 @@ flask-sock flask-swagger-ui>=3.36.0 gevent gunicorn +lockfile pyyaml>=5.4.1 requests>=2.26.0 wavio>=0.0.4 diff --git a/requirements.torch.txt b/requirements.torch.txt index 1b15744..9c40b6b 100644 --- a/requirements.torch.txt +++ b/requirements.torch.txt @@ -5,6 +5,7 @@ flask-sock flask-swagger-ui>=3.36.0 gevent gunicorn +lockfile num2words pyyaml>=5.4.1 requests>=2.26.0 diff --git a/stt/__init__.py b/stt/__init__.py index 5460088..6c57bb2 100644 --- a/stt/__init__.py +++ b/stt/__init__.py @@ -9,18 +9,21 @@ try: import faster_whisper USE_CTRANSLATE2 = True -except ImportError: +except ImportError as err: + try: + import whisper + except: + raise err USE_CTRANSLATE2 = False try: - import torch, torchaudio + import torch USE_TORCH = True except ImportError: USE_TORCH = False -# TODO: Get rid of that -if USE_TORCH: - SHOULD_USE_GEVENT = torch.cuda.is_available() - torch.set_num_threads(1) -else: - SHOULD_USE_GEVENT = USE_CTRANSLATE2 +try: + import torchaudio + USE_TORCHAUDIO = True +except ImportError: + USE_TORCHAUDIO = False diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index f891984..5e72252 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -1,9 +1,10 @@ import os import logging +from lockfile import FileLock -from stt import logger -from .decoding import decode, get_language -from .utils import get_device, LANGUAGES, load_wave_buffer, load_audiofile +from stt import logger, USE_CTRANSLATE2 +from .decoding import decode +from .utils import get_device, get_language, load_wave_buffer, load_audiofile from .load_model import load_whisper_model from .alignment_model import load_alignment_model, get_alignment_model @@ -11,6 +12,23 @@ __all__ = ["logger", "decode", "model", "alignment_model", "load_audiofile", "load_wave_buffer"] +class LazyLoadedModel: + + def __init__(self, model_type, device): + self.model_type = model_type + self.device = device + self._model = None + if USE_CTRANSLATE2: + # May download model here + load_whisper_model(self.model_type, device=self.device) + + def __getattr__(self, name): + if self._model is None: + lockfile = os.path.basename(self.model_type) + with FileLock(lockfile): + self._model = load_whisper_model(self.model_type, device=self.device) + return getattr(self._model, name) + # Set informative log logger.setLevel(logging.INFO) @@ -20,21 +38,14 @@ # Check language language = get_language() -available_languages = \ - list(LANGUAGES.keys()) + \ - [k.lower() for k in LANGUAGES.values()] + \ - [None] -if language not in available_languages: - raise ValueError(f"Language '{get_language()}' is not available. 
Available languages are: {available_languages}") -if isinstance(language, str) and language not in LANGUAGES: - language = {v: k for k, v in LANGUAGES.items()}[language.lower()] logger.info(f"Using language {language}") # Load ASR model model_type = os.environ.get("MODEL", "medium") logger.info(f"Loading Whisper model {model_type} ({'local' if os.path.exists(model_type) else 'remote'})...") try: - model = load_whisper_model(model_type, device=device) + model = LazyLoadedModel(model_type, device=device) + # model = load_whisper_model(model_type, device=device) except Exception as err: raise Exception( "Failed to load transcription model: {}".format(str(err))) from err diff --git a/stt/processing/alignment_model.py b/stt/processing/alignment_model.py index 08a5e45..a8e6e79 100644 --- a/stt/processing/alignment_model.py +++ b/stt/processing/alignment_model.py @@ -1,4 +1,4 @@ -from stt import logger, USE_TORCH +from stt import logger, USE_TORCH, USE_TORCHAUDIO from .utils import SAMPLE_RATE, LANGUAGES import os @@ -9,9 +9,17 @@ if USE_TORCH: import torch import torch.nn.utils.rnn as rnn_utils - import huggingface_hub - import speechbrain as sb - import transformers + try: + import speechbrain as sb + import huggingface_hub + except ImportError: + pass + try: + import transformers + except ImportError: + pass + +if USE_TORCHAUDIO: import torchaudio ################################################################################ @@ -77,7 +85,7 @@ def load_alignment_model(source, device="cpu", download_root="/opt"): start = time.time() - if source in torchaudio.pipelines.__all__: + if (source in torchaudio.pipelines.__all__) if USE_TORCHAUDIO else False: model = load_torchaudio_model( source, device=device, download_root=download_root) else: diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 6572053..c8a5380 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -4,7 +4,7 @@ import copy from stt import logger, USE_CTRANSLATE2 -from .utils import SAMPLE_RATE +from .utils import SAMPLE_RATE, get_language from .text_normalize import remove_punctuation, normalize_text, remove_emoji, _punctuations from .alignment_model import get_alignment_model, load_alignment_model from .word_alignment import compute_alignment @@ -14,20 +14,6 @@ import whisper_timestamped -def get_language(): - """ - Get the language from the environment variable LANGUAGE, and format as expected by Whisper. 
- """ - language = os.environ.get("LANGUAGE", "*") - # "fr-FR" -> "fr" (language-country code to ISO 639-1 code) - if len(language) > 2 and language[2] == "-": - language = language.split("-")[0] - # "*" means "all languages" - if language == "*": - language = None - return language - - def decode(audio, model, alignment_model: "Any", @@ -47,7 +33,7 @@ def decode(audio, kwargs = copy.copy(locals()) - logger.info(f"Transcribing audio with language {language}...") + logger.info("Transcribing audio with " + (f"language {language}" if language else "automatic language detection") + "...") start_t = time.time() @@ -123,7 +109,7 @@ def decode_torch(audio, if alignment_model is None: # Use Whisper cross-attention weights - whisper_res = whisper_timestamped.transcribe(model, audio, **kwargs) + whisper_res = whisper_timestamped.transcribe(model, audio, verbose=None, **kwargs) if language is None: language = whisper_res["language"] logger.info(f"Detected language: {language}") @@ -133,7 +119,7 @@ def decode_torch(audio, torch.manual_seed(1234) torch.cuda.manual_seed_all(1234) - whisper_res = model.transcribe(audio, **kwargs) + whisper_res = model.transcribe(audio, verbose=None, **kwargs) text = whisper_res["text"] text = remove_emoji(text).strip() @@ -294,9 +280,13 @@ def checked_timestamps(start, end=None): words[-1]["confidence"].append(word.probability) _, words[-1]["end"] = checked_timestamps(words[-1]["end"], word.end) continue - words.append( - {"text": word.word, "confidence": [word.probability]} | dict(zip(("start", "end"), checked_timestamps(word.start, word.end))) - ) + start, end = checked_timestamps(word.start, word.end) + words.append({ + "text": word.word, + "confidence": [word.probability], + "start": start, + "end": end + }) for word in words: word["text"] = word["text"].strip() diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 14ec3e2..1476d60 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -4,27 +4,48 @@ from stt import logger, USE_CTRANSLATE2 if USE_CTRANSLATE2: - import faster_whisper as whisper + import faster_whisper else: import whisper_timestamped as whisper -def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): +def load_whisper_model(model_type_or_file, device="cpu", download_root=None): start = time.time() + logger.info("Loading Whisper model {}...".format(model_type_or_file)) + + default_cache_root = os.path.join(os.path.expanduser("~"), ".cache") + if download_root is None: + download_root = default_cache_root + if USE_CTRANSLATE2: if not os.path.isdir(model_type_or_file): - # To specify the cache directory - model_type_or_file = whisper.utils.download_model( - model_type_or_file, - output_dir=os.path.join(download_root, "huggingface/hub") - ) - model = whisper.WhisperModel(model_type_or_file, - device=device, - compute_type="default", - cpu_threads=0, # Can be controled with OMP_NUM_THREADS - num_workers=1, - ) + # Note: There is no good way to set the root cache directory + # with the current version of faster_whisper: + # if "download_root" is specified to faster_whisper.WhisperModel + # (or "output_dir" in faster_whisper.utils.download_model), + # then files are downloaded directly in it without symbolic links + # to the cache directory. So it's different from the behavior + # of the huggingface_hub. + # So we try to create a symbolic link to the cache directory that will be used by HuggingFace... 
+ if not os.path.exists(download_root): + if not os.path.exists(default_cache_root): + os.makedirs(download_root) + if default_cache_root != download_root: + os.symlink(download_root, default_cache_root) + else: + os.symlink(default_cache_root, download_root) + elif not os.path.exists(default_cache_root): + os.symlink(download_root, default_cache_root) + + model = faster_whisper.WhisperModel( + model_type_or_file, + device=device, + compute_type="default", + cpu_threads=0, # Can be controled with OMP_NUM_THREADS + num_workers=1, + # download_root=os.path.join(download_root, f"huggingface/hub/models--guillaumekln--faster-whisper-{model_type_or_file}"), + ) else: model = whisper.load_model( @@ -34,6 +55,6 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root="/opt"): model.eval() model.requires_grad_(False) - logger.info("Whisper Model loaded. (t={}s)".format(time.time() - start)) + logger.info("Whisper model loaded. (t={}s)".format(time.time() - start)) return model \ No newline at end of file diff --git a/stt/processing/utils.py b/stt/processing/utils.py index e8b2bd8..a3719b0 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -1,4 +1,4 @@ -from stt import USE_CTRANSLATE2, USE_TORCH +from stt import USE_CTRANSLATE2, USE_TORCH, USE_TORCHAUDIO import io import wavio @@ -12,9 +12,11 @@ import faster_whisper else: import torch - import torchaudio import whisper +if USE_TORCHAUDIO: + import torchaudio + def has_cuda(): if USE_CTRANSLATE2: return ctranslate2.get_cuda_device_count() > 0 @@ -31,10 +33,33 @@ def get_device(): raise Exception("Failed to set device: {}".format(str(err))) from err return device, use_gpu +def get_language(): + """ + Get the language from the environment variable LANGUAGE, and format as expected by Whisper. + """ + language = os.environ.get("LANGUAGE", "*") + # "fr-FR" -> "fr" (language-country code to ISO 639-1 code) + if len(language) > 2 and language[2] == "-": + language = language.split("-")[0] + # "*" means "all languages" + if language == "*": + language = None + # Convert French -> fr + if isinstance(language, str) and language not in LANGUAGES: + language = {v: k for k, v in LANGUAGES.items()}.get(language.lower(), language) + # Raise an exception for unknown languages + if language not in LANGUAGES: + available_languages = \ + list(LANGUAGES.keys()) + \ + [k[0].upper() + k[1:] for k in LANGUAGES.values()] + \ + ["*", None] + raise ValueError(f"Language '{language}' is not available. 
Available languages are: {available_languages}") + return language + def conform_audio(audio, sample_rate=16_000): if sample_rate != SAMPLE_RATE: - if not USE_TORCH: - raise NotImplementedError("Resampling not available without Torch") + if not USE_TORCHAUDIO: + raise NotImplementedError("Resampling not available without torchaudio") # Down or Up sample to the right sampling rate audio = torchaudio.transforms.Resample(sample_rate, SAMPLE_RATE)(audio) if audio.shape[0] > 1: From adc1cf1d8e6abbf92065d7f24de7331137374966 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 13 Apr 2023 13:36:06 +0200 Subject: [PATCH 130/172] use upper case letters for global variables --- celery_app/tasks.py | 4 ++-- http_server/ingress.py | 6 +++--- stt/processing/__init__.py | 16 ++++++++-------- 3 files changed, 13 insertions(+), 13 deletions(-) diff --git a/celery_app/tasks.py b/celery_app/tasks.py index 3b7251f..26ae18a 100644 --- a/celery_app/tasks.py +++ b/celery_app/tasks.py @@ -3,7 +3,7 @@ from celery_app.celeryapp import celery from stt import logger -from stt.processing import decode, model, alignment_model +from stt.processing import decode, MODEL, ALIGNMENT_MODEL from stt.processing.utils import load_audiofile @@ -22,7 +22,7 @@ def transcribe_task(file_name: str, with_metadata: bool): # Decode try: - result = decode(file_content, model, alignment_model, with_metadata) + result = decode(file_content, MODEL, ALIGNMENT_MODEL, with_metadata) except Exception as err: logger.error(f"Failed to decode: {repr(err)}") raise Exception(f"Failed to decode {file_path}") from err diff --git a/http_server/ingress.py b/http_server/ingress.py index b55bb03..afed5d0 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -9,7 +9,7 @@ from serving import GunicornServing, GeventServing from swagger import setupSwaggerUI -from stt.processing import decode, load_wave_buffer, model, alignment_model, use_gpu +from stt.processing import decode, load_wave_buffer, MODEL, ALIGNMENT_MODEL, USE_GPU from stt import logger as stt_logger app = Flask("__stt-standalone-worker__") @@ -54,7 +54,7 @@ def transcribe(): # Transcription transcription = decode( - audio_data, model, alignment_model, join_metadata) + audio_data, MODEL, ALIGNMENT_MODEL, join_metadata) if join_metadata: return json.dumps(transcription, ensure_ascii=False), 200 @@ -101,7 +101,7 @@ def server_error(error): logger.info(f"Using {args.workers} workers") - if use_gpu: # TODO: get rid of this? + if USE_GPU: # TODO: get rid of this? 
serving_type = GeventServing logger.debug("Serving with gevent") else: diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 5e72252..c4f9e55 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -33,7 +33,7 @@ def __getattr__(self, name): logger.setLevel(logging.INFO) # Set device -device, use_gpu = get_device() +device, USE_GPU = get_device() logger.info(f"Using device {device}") # Check language @@ -44,20 +44,20 @@ def __getattr__(self, name): model_type = os.environ.get("MODEL", "medium") logger.info(f"Loading Whisper model {model_type} ({'local' if os.path.exists(model_type) else 'remote'})...") try: - model = LazyLoadedModel(model_type, device=device) + MODEL = LazyLoadedModel(model_type, device=device) # model = load_whisper_model(model_type, device=device) except Exception as err: raise Exception( "Failed to load transcription model: {}".format(str(err))) from err # Load alignment model (if any) -alignment_model = get_alignment_model(os.environ.get("ALIGNMENT_MODEL"), language) -if alignment_model: +ALIGNMENT_MODEL = get_alignment_model(os.environ.get("ALIGNMENT_MODEL"), language) +if ALIGNMENT_MODEL: logger.info( - f"Loading alignment model {alignment_model} ({'local' if os.path.exists(alignment_model) else 'remote'})...") - alignment_model = load_alignment_model(alignment_model, device=device, download_root="/opt") -elif alignment_model is None: + f"Loading alignment model {ALIGNMENT_MODEL} ({'local' if os.path.exists(alignment_model) else 'remote'})...") + ALIGNMENT_MODEL = load_alignment_model(ALIGNMENT_MODEL, device=device, download_root="/opt") +elif ALIGNMENT_MODEL is None: logger.info("Alignment will be done using Whisper cross-attention weights") else: logger.info("No alignment model preloaded. 
It will be loaded on the fly depending on the detected language.") - alignment_model = {} # Alignement model(s) will be loaded on the fly + ALIGNMENT_MODEL = {} # Alignement model(s) will be loaded on the fly From e2a9292507abe8564dd23a4efca653f61c63f007 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 14 Apr 2023 18:57:49 +0200 Subject: [PATCH 131/172] use lower precision when possible + use accurate decoding --- stt/processing/__init__.py | 4 +--- stt/processing/decoding.py | 16 +++++++++++++--- stt/processing/load_model.py | 31 +++++++++++++++++++++++-------- 3 files changed, 37 insertions(+), 14 deletions(-) diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index c4f9e55..e650424 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -18,9 +18,6 @@ def __init__(self, model_type, device): self.model_type = model_type self.device = device self._model = None - if USE_CTRANSLATE2: - # May download model here - load_whisper_model(self.model_type, device=self.device) def __getattr__(self, name): if self._model is None: @@ -33,6 +30,7 @@ def __getattr__(self, name): logger.setLevel(logging.INFO) # Set device +os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID' # GPU in the right order device, USE_GPU = get_device() logger.info(f"Using device {device}") diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index c8a5380..ebcd052 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -2,6 +2,7 @@ import time import numpy as np import copy +from typing import Tuple, Union from stt import logger, USE_CTRANSLATE2 from .utils import SAMPLE_RATE, get_language @@ -13,6 +14,15 @@ import torch import whisper_timestamped +if "USE_ACCURATE": + default_beam_size = 5 + default_best_of = 5 + default_temperature = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) +else: + default_beam_size = None + default_best_of = None + default_temperature = 0.0 + def decode(audio, model, @@ -20,9 +30,9 @@ def decode(audio, with_word_timestamps: bool, language: str = None, remove_punctuation_from_words=False, - beam_size: int = None, - best_of: int = None, - temperature: float = 0.0, + beam_size: int = default_beam_size, + best_of: int = default_best_of, + temperature: Union[float, Tuple[float, ...]] = default_temperature, condition_on_previous_text: bool = False, no_speech_threshold: float = 0.6, compression_ratio_threshold: float = 2.4, diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 1476d60..ce3dbdd 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -38,14 +38,29 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): elif not os.path.exists(default_cache_root): os.symlink(download_root, default_cache_root) - model = faster_whisper.WhisperModel( - model_type_or_file, - device=device, - compute_type="default", - cpu_threads=0, # Can be controled with OMP_NUM_THREADS - num_workers=1, - # download_root=os.path.join(download_root, f"huggingface/hub/models--guillaumekln--faster-whisper-{model_type_or_file}"), - ) + if device == "cpu": + compute_types = ["int8", "float32"] + else: + compute_types = ["int8_float16", "float16", "float32"] + + model = None + for i, compute_type in enumerate(compute_types): + try: + model = faster_whisper.WhisperModel( + model_type_or_file, + device=device, + compute_type=compute_type, + cpu_threads=0, # Can be controled with OMP_NUM_THREADS + num_workers=1, + # download_root=os.path.join(download_root, 
f"huggingface/hub/models--guillaumekln--faster-whisper-{model_type_or_file}"), + ) + break + except ValueError as err: + # On some old GPU we may have the error + # "ValueError: Requested int8_float16 compute type, + # but the target device or backend do not support efficient int8_float16 computation." + if i == len(compute_types) - 1: + raise err else: model = whisper.load_model( From be66eb58c47bedccff5961fe7da1bf1b5b10b63c Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Fri, 14 Apr 2023 20:06:44 +0200 Subject: [PATCH 132/172] multi-GPU and specification of the right GPU index with faster_whisper --- stt/processing/__init__.py | 4 +++- stt/processing/load_model.py | 10 ++++++++-- stt/processing/utils.py | 16 +++++++++++++++- 3 files changed, 26 insertions(+), 4 deletions(-) diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index e650424..40f98fc 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -18,6 +18,9 @@ def __init__(self, model_type, device): self.model_type = model_type self.device = device self._model = None + if USE_CTRANSLATE2: + # This may download the model, and test the device + load_whisper_model(self.model_type, device=self.device) def __getattr__(self, name): if self._model is None: @@ -30,7 +33,6 @@ def __getattr__(self, name): logger.setLevel(logging.INFO) # Set device -os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID' # GPU in the right order device, USE_GPU = get_device() logger.info(f"Using device {device}") diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index ce3dbdd..e4b1f58 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -43,15 +43,21 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): else: compute_types = ["int8_float16", "float16", "float32"] + device_index = 0 + if device.startswith("cuda:"): + device_index = [int(dev) for dev in device[5:].split(",")] + device = "cuda" + model = None for i, compute_type in enumerate(compute_types): try: model = faster_whisper.WhisperModel( model_type_or_file, device=device, + device_index=device_index, compute_type=compute_type, - cpu_threads=0, # Can be controled with OMP_NUM_THREADS - num_workers=1, + # cpu_threads=0, # Can be controled with OMP_NUM_THREADS + # num_workers=1, # download_root=os.path.join(download_root, f"huggingface/hub/models--guillaumekln--faster-whisper-{model_type_or_file}"), ) break diff --git a/stt/processing/utils.py b/stt/processing/utils.py index a3719b0..787c433 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -26,7 +26,21 @@ def has_cuda(): def get_device(): device = os.environ.get("DEVICE", "cuda" if has_cuda() else "cpu") use_gpu = "cuda" in device - if not USE_CTRANSLATE2: + + # The following is to have GPU in the right order (as nvidia-smi show them) + # But somehow it does not work with ctranslate2: + # see https://github.com/guillaumekln/faster-whisper/issues/150 + os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID' # GPU in the right order + + if USE_CTRANSLATE2: + try: + if device.startswith("cuda:"): + _ = [int(dev) for dev in device[5:].split(",")] + else: + assert device in ["cpu", "cuda"] + except: + raise ValueError(f"Invalid DEVICE '{device}' (should be 'cpu' or 'cuda' or 'cuda: or 'cuda:,,...')") + else: try: device = torch.device(device) except Exception as err: From 3afa5f4575c920ee86dfe2006c5b9d7e86e905eb Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 17 Apr 2023 22:03:03 +0200 Subject: [PATCH 133/172] fix GPU order (when several 
GPUs) --- stt/__init__.py | 6 ++++++ stt/processing/__init__.py | 3 --- stt/processing/utils.py | 7 +------ 3 files changed, 7 insertions(+), 9 deletions(-) diff --git a/stt/__init__.py b/stt/__init__.py index 6c57bb2..aa3e314 100644 --- a/stt/__init__.py +++ b/stt/__init__.py @@ -1,3 +1,4 @@ +import os import logging logging.basicConfig( @@ -6,6 +7,11 @@ ) logger = logging.getLogger("__stt__") +# The following is to have GPU in the right order (as nvidia-smi show them) +# It is important to set that before loading ctranslate2 +# see https://github.com/guillaumekln/faster-whisper/issues/150 +os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID' # GPU in the right order + try: import faster_whisper USE_CTRANSLATE2 = True diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 40f98fc..095f91b 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -18,9 +18,6 @@ def __init__(self, model_type, device): self.model_type = model_type self.device = device self._model = None - if USE_CTRANSLATE2: - # This may download the model, and test the device - load_whisper_model(self.model_type, device=self.device) def __getattr__(self, name): if self._model is None: diff --git a/stt/processing/utils.py b/stt/processing/utils.py index 787c433..0352de4 100644 --- a/stt/processing/utils.py +++ b/stt/processing/utils.py @@ -26,12 +26,7 @@ def has_cuda(): def get_device(): device = os.environ.get("DEVICE", "cuda" if has_cuda() else "cpu") use_gpu = "cuda" in device - - # The following is to have GPU in the right order (as nvidia-smi show them) - # But somehow it does not work with ctranslate2: - # see https://github.com/guillaumekln/faster-whisper/issues/150 - os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID' # GPU in the right order - + if USE_CTRANSLATE2: try: if device.startswith("cuda:"): From e970615d07d511f9ac3217621566c29c10e50be7 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 18 Apr 2023 09:39:23 +0200 Subject: [PATCH 134/172] add compute_type=int8 for GPU --- stt/processing/load_model.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index e4b1f58..18541e2 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -41,7 +41,7 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): if device == "cpu": compute_types = ["int8", "float32"] else: - compute_types = ["int8_float16", "float16", "float32"] + compute_types = ["int8", "int8_float16", "float16", "float32"] device_index = 0 if device.startswith("cuda:"): From 6cc5b3d9375b37ee087d8708714064595423f45b Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 2 May 2023 16:13:28 +0200 Subject: [PATCH 135/172] fix typo that was fixed in faster_whisper --- requirements.ctranslate2.txt | 2 +- stt/processing/decoding.py | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/requirements.ctranslate2.txt b/requirements.ctranslate2.txt index 84547ac..e50d462 100644 --- a/requirements.ctranslate2.txt +++ b/requirements.ctranslate2.txt @@ -10,4 +10,4 @@ pyyaml>=5.4.1 requests>=2.26.0 wavio>=0.0.4 websockets -faster_whisper \ No newline at end of file +faster_whisper>=0.5.1 \ No newline at end of file diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index ebcd052..f5a3222 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -306,7 +306,7 @@ def checked_timestamps(start, end=None): "text": segment.text.strip(), "start": start, "end": end, - 
"avg_logprob": segment.avg_log_prob, + "avg_logprob": segment.avg_logprob, "words": words }) @@ -315,7 +315,7 @@ def checked_timestamps(start, end=None): transcription = { "text": " ".join(segment["text"] for segment in segments_list), "language": language, - "confidence": round(np.exp(np.mean([segment.avg_log_prob for segment in segments])), 2), + "confidence": round(np.exp(np.mean([segment["avg_logprob"] for segment in segments_list])), 2), "segments": segments_list, } return format_whisper_timestamped_response(transcription, remove_punctuation_from_words=remove_punctuation_from_words) From 5fe83692e86064e2e210d5c36af71a75332369db Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 10 May 2023 17:44:28 +0200 Subject: [PATCH 136/172] give more information in case of error --- celery_app/tasks.py | 12 ++++++++---- http_server/ingress.py | 2 +- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/celery_app/tasks.py b/celery_app/tasks.py index 26ae18a..3fc38f2 100644 --- a/celery_app/tasks.py +++ b/celery_app/tasks.py @@ -17,14 +17,18 @@ def transcribe_task(file_name: str, with_metadata: bool): try: file_content = load_audiofile(file_path) except Exception as err: - logger.error(f"Failed to load ressource: {repr(err)}") - raise Exception(f"Could not open ressource {file_path}") from err + import traceback + msg = f"{traceback.format_exc()}\nFailed to load ressource {file_path}" + logger.error(msg) + raise Exception(msg) # from err # Decode try: result = decode(file_content, MODEL, ALIGNMENT_MODEL, with_metadata) except Exception as err: - logger.error(f"Failed to decode: {repr(err)}") - raise Exception(f"Failed to decode {file_path}") from err + import traceback + msg = f"{traceback.format_exc()}\nFailed to decode {file_path}" + logger.error(msg) + raise Exception(msg) # from err return result diff --git a/http_server/ingress.py b/http_server/ingress.py index afed5d0..0e8a640 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -62,7 +62,7 @@ def transcribe(): except Exception as error: import traceback - print(traceback.format_exc()) + logger.error(traceback.format_exc()) logger.error(repr(error)) return "Server Error: {}".format(str(error)), 400 if isinstance(error, ValueError) else 500 From d045a66d9aa5d3d05c6668b3ce60280e8d5ce6e0 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 10 May 2023 17:45:27 +0200 Subject: [PATCH 137/172] remove dangerous useless assert + add VAD --- requirements.torch.txt | 4 +++- stt/processing/decoding.py | 13 ++++++++----- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/requirements.torch.txt b/requirements.torch.txt index 9c40b6b..75e747c 100644 --- a/requirements.torch.txt +++ b/requirements.torch.txt @@ -14,4 +14,6 @@ transformers wavio>=0.0.4 websockets # openai-whisper -git+https://github.com/linto-ai/whisper-timestamped.git \ No newline at end of file +git+https://github.com/linto-ai/whisper-timestamped.git +onnxruntime +torchaudio \ No newline at end of file diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index f5a3222..37bccee 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -14,7 +14,10 @@ import torch import whisper_timestamped -if "USE_ACCURATE": +USE_ACCURATE = True +USE_VAD = True + +if USE_ACCURATE: default_beam_size = 5 default_best_of = 5 default_temperature = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) @@ -71,13 +74,14 @@ def decode_ct2(audio, kwargs["beam_size"] = 1 if kwargs.get("best_of") is None: kwargs["best_of"] = 1 - + segments, info = model.transcribe( 
audio, word_timestamps=with_word_timestamps, language=language, # Careful with the following options max_initial_timestamp=10000.0, + vad_filter=USE_VAD, **kwargs) segments = list(segments) @@ -114,7 +118,8 @@ def decode_torch(audio, best_of=best_of, condition_on_previous_text=condition_on_previous_text, no_speech_threshold=no_speech_threshold, - compression_ratio_threshold=compression_ratio_threshold + compression_ratio_threshold=compression_ratio_threshold, + vad=USE_VAD, ) if alignment_model is None: @@ -309,8 +314,6 @@ def checked_timestamps(start, end=None): "avg_logprob": segment.avg_logprob, "words": words }) - - assert len(segments_list) transcription = { "text": " ".join(segment["text"] for segment in segments_list), From 28f3809262d70964ecebb234fca307a36bedd0c8 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 11 May 2023 11:13:23 +0200 Subject: [PATCH 138/172] fix call of the model (lazy) wrapping --- stt/processing/__init__.py | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/stt/processing/__init__.py b/stt/processing/__init__.py index 095f91b..9bb51bc 100644 --- a/stt/processing/__init__.py +++ b/stt/processing/__init__.py @@ -19,13 +19,20 @@ def __init__(self, model_type, device): self.device = device self._model = None - def __getattr__(self, name): + def check_loaded(self): if self._model is None: lockfile = os.path.basename(self.model_type) with FileLock(lockfile): self._model = load_whisper_model(self.model_type, device=self.device) + + def __getattr__(self, name): + self.check_loaded() return getattr(self._model, name) + def __call__(self, *args, **kwargs): + self.check_loaded() + return self._model(*args, **kwargs) + # Set informative log logger.setLevel(logging.INFO) From be64669856d873ccb2563a80be6b041a3960e254 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 31 May 2023 09:04:40 +0200 Subject: [PATCH 139/172] build image with tag whisper-latest / 4.0.0 --- Jenkinsfile | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/Jenkinsfile b/Jenkinsfile index 572c1c5..75a09bd 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -48,25 +48,25 @@ pipeline { } } - // stage('Docker build for whisper branch'){ - // when{ - // branch 'feature/whisper' - // } - // steps { - // echo 'Publishing whisper' - // script { - // image = docker.build(env.DOCKER_HUB_REPO) - // VERSION = sh( - // returnStdout: true, - // script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" - // ).trim() + stage('Docker build for whisper branch'){ + when{ + branch 'feature/whisper' + } + steps { + echo 'Publishing faster_whisper' + script { + image = docker.build(env.DOCKER_HUB_REPO, "-f Dockerfile.ctranslate2 .") + VERSION = sh( + returnStdout: true, + script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" + ).trim() - // docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - // image.push("${VERSION}") - // image.push('whisper') - // } - // } - // } - // } + docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { + image.push("${VERSION}") + image.push('whisper-latest') + } + } + } + } }// end stages } \ No newline at end of file From 46c3e8e157651678037bb365e3eb6ebfce523e13 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 12 Jun 2023 17:42:17 +0200 Subject: [PATCH 140/172] fix: spaces that were added before "-" and "'" --- RELEASE.md | 3 +++ stt/processing/decoding.py | 6 +++--- 
stt/processing/text_normalize.py | 1 + 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index 4ea4e01..df928db 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,6 @@ +# 4.0.1 +- Fix punctuations + # 4.0.0 - Integration of Whisper diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 37bccee..e49e208 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -6,7 +6,7 @@ from stt import logger, USE_CTRANSLATE2 from .utils import SAMPLE_RATE, get_language -from .text_normalize import remove_punctuation, normalize_text, remove_emoji, _punctuations +from .text_normalize import remove_punctuation, normalize_text, remove_emoji, _punctuations_plus from .alignment_model import get_alignment_model, load_alignment_model from .word_alignment import compute_alignment @@ -289,9 +289,9 @@ def checked_timestamps(start, end=None): words = [] if segment.words: for word in segment.words: - if len(words) and (not(word.word.strip()) or word.word.strip()[0] in _punctuations): + if len(words) and (not(word.word.strip()) or word.word.strip()[0] in _punctuations_plus): words[-1]["text"] += word.word - if word.word.strip() not in _punctuations: + if word.word.strip() not in _punctuations_plus: words[-1]["confidence"].append(word.probability) _, words[-1]["end"] = checked_timestamps(words[-1]["end"], word.end) continue diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index a4037bd..a7e0495 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -8,6 +8,7 @@ # string.punctuation, plus Whisper specific "«»¿", minus apostrophe "'", dash "-", and dot "." (which will be processed as special) _punctuations = '!"#$%&()*+,/:;<=>?@[\\]^_`{|}~«»¿' +_punctuations_plus = _punctuations + "'-" def remove_punctuation(text: str) -> str: From a6289a77635e3c7e6fd13b47b59216c6c9bfa007 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 27 Jun 2023 17:51:58 +0200 Subject: [PATCH 141/172] Small list of punctuations --- stt/processing/text_normalize.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index a7e0495..427d22c 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -6,8 +6,8 @@ from stt import logger from .utils import flatten -# string.punctuation, plus Whisper specific "«»¿", minus apostrophe "'", dash "-", and dot "." 
(which will be processed as special) -_punctuations = '!"#$%&()*+,/:;<=>?@[\\]^_`{|}~«»¿' +# Punctuation marks +_punctuations = '.!?,:;¿。,!?:、…؟،؛' _punctuations_plus = _punctuations + "'-" From 658d77d1b46b87f4abf56b900fe8d1ac8ad70a16 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 27 Jun 2023 17:53:59 +0200 Subject: [PATCH 142/172] update release notes --- RELEASE.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/RELEASE.md b/RELEASE.md index df928db..e9850b1 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,6 @@ +# 4.0.2 +- Do not considers symbols like "$" as punctuation marks + # 4.0.1 - Fix punctuations From 14e1c55522b6db09eb63a31ad4e965716a705fc2 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 27 Jun 2023 18:07:12 +0200 Subject: [PATCH 143/172] cosm --- stt/processing/text_normalize.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index 427d22c..9077be6 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -7,7 +7,7 @@ from .utils import flatten # Punctuation marks -_punctuations = '.!?,:;¿。,!?:、…؟،؛' +_punctuations = '!,.:;?¿،؛؟…、。!,:?' # + '"”' + ')]}' _punctuations_plus = _punctuations + "'-" From 9dbd02a606badf8c0c6c6ab4cac18321d81dcb5e Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 3 Jul 2023 13:39:53 +0200 Subject: [PATCH 144/172] fix corner cases with punctuations and symbols --- RELEASE.md | 3 ++ stt/processing/decoding.py | 21 +++++++------ stt/processing/text_normalize.py | 54 +++++++++++++++++++++----------- 3 files changed, 51 insertions(+), 27 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index e9850b1..174ae1b 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,6 @@ +# 4.0.3 +- Tune punctuation heuristics + # 4.0.2 - Do not considers symbols like "$" as punctuation marks diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index e49e208..624255c 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -6,7 +6,7 @@ from stt import logger, USE_CTRANSLATE2 from .utils import SAMPLE_RATE, get_language -from .text_normalize import remove_punctuation, normalize_text, remove_emoji, _punctuations_plus +from .text_normalize import remove_punctuation, normalize_text, remove_emoji from .alignment_model import get_alignment_model, load_alignment_model from .word_alignment import compute_alignment @@ -264,8 +264,11 @@ def format_whisper_timestamped_response(transcription, remove_punctuation_from_w } -def format_faster_whisper_response(segments, info, - remove_punctuation_from_words=False): +def format_faster_whisper_response( + segments, info, + remove_punctuation_from_words=False, + glue_punctuations="'-&@.,", + ): language = info.language duration = info.duration @@ -289,13 +292,13 @@ def checked_timestamps(start, end=None): words = [] if segment.words: for word in segment.words: - if len(words) and (not(word.word.strip()) or word.word.strip()[0] in _punctuations_plus): - words[-1]["text"] += word.word - if word.word.strip() not in _punctuations_plus: - words[-1]["confidence"].append(word.probability) - _, words[-1]["end"] = checked_timestamps(words[-1]["end"], word.end) - continue start, end = checked_timestamps(word.start, word.end) + word_strip = word.word.strip() + if glue_punctuations and len(word_strip)>1 and word_strip[0] in glue_punctuations: + words[-1]["text"] += word.word.lstrip() + words[-1]["confidence"].append(word.probability) + words[-1]["end"] = end + continue words.append({ 
"text": word.word, "confidence": [word.probability], diff --git a/stt/processing/text_normalize.py b/stt/processing/text_normalize.py index 9077be6..a5f3d04 100644 --- a/stt/processing/text_normalize.py +++ b/stt/processing/text_normalize.py @@ -6,24 +6,42 @@ from stt import logger from .utils import flatten -# Punctuation marks -_punctuations = '!,.:;?¿،؛؟…、。!,:?' # + '"”' + ')]}' -_punctuations_plus = _punctuations + "'-" - - -def remove_punctuation(text: str) -> str: - text = text.translate(str.maketrans("", "", _punctuations)) - # We don't remove dots inside words (e.g. "ab@gmail.com") - text = re.sub(r"\.(\s)", r"\1", text+" ").strip() - return collapse_whitespace(text) - - -_whitespace_re = re.compile(r'[^\S\r\n]+') - - -def collapse_whitespace(text): - return re.sub(_whitespace_re, ' ', text).strip() - +# All punctuations and symbols EXCEPT: +# * apostrophe (') and hyphen (-) +# * underscore (_) +# * currency symbols ($, €, £, ...) -> \p{Sc} +# * math symbols (%, +, ×). ex: C++ +# * misc (#, @). ex: C#, @user +# and the space character (which can separate several series of punctuation marks) +# Example of punctuations that can output models like Whisper: !,.:;?¿،؛؟…、。!,:?>/]:!(~\u200b[ா「«»“”"< ?;…,*」.)' +_punctuation_regex = r"[^\w\p{Sc}" + re.escape("'-_%+×#@&") + "]" +_leading_punctuations_regex = r"^" + _punctuation_regex + r"+" +_trailing_punctuations_regex = _punctuation_regex + r"+$" + +# A list of symbols that can be an isolated words and not in the exclusion list above +# * & +# * candidates not retained: §, <, =, >, ≤, ≥ +_maybe_word_regex = None # r"[" + re.escape("&") + r"]$" + + +def remove_punctuation(text: str, ensure_no_spaces_in_words: bool=False) -> str: + text = text.strip() + # Note: we don't remove dots inside words (e.g. "ab@gmail.com") + new_text = re.sub(_leading_punctuations_regex, "", text) #.lstrip() + new_text = re.sub(_trailing_punctuations_regex, "", new_text) #.rstrip() + # Let punctuation marks that are alone + if not new_text: + if _maybe_word_regex and re.match(_maybe_word_regex, text): + new_text = text + else: + new_text = "" + # Ensure that there is no space in the middle of a word + if ensure_no_spaces_in_words and " " in new_text: + new_text, tail = new_text.split(" ", 1) + # OK if the tail only contains non alphanumeric characters (then we just keep the first part) + assert not re.search(r"[^\W\d\'\-_]", tail), f"Got unexpected word containing space: {text}" + return remove_punctuation(new_text, ensure_no_spaces_in_words=ensure_no_spaces_in_words) + return new_text def transliterate(c): # Transliterates a character to its closest ASCII equivalent. 
From 4aef237b9102abe85dd41b025d17adbfc5983d70 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 3 Jul 2023 13:45:31 +0200 Subject: [PATCH 145/172] safety --- stt/processing/decoding.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 624255c..4b1e1d5 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -297,7 +297,7 @@ def checked_timestamps(start, end=None): if glue_punctuations and len(word_strip)>1 and word_strip[0] in glue_punctuations: words[-1]["text"] += word.word.lstrip() words[-1]["confidence"].append(word.probability) - words[-1]["end"] = end + words[-1]["end"] = max(words[-1]["end"], end) continue words.append({ "text": word.word, From a36deec24a33df899764a6ab1c4f057b61535291 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 10 Jul 2023 12:13:56 +0200 Subject: [PATCH 146/172] update version of faster_whisper, and fix the version of ctranslate2 --- requirements.ctranslate2.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/requirements.ctranslate2.txt b/requirements.ctranslate2.txt index e50d462..f1d0e5a 100644 --- a/requirements.ctranslate2.txt +++ b/requirements.ctranslate2.txt @@ -10,4 +10,5 @@ pyyaml>=5.4.1 requests>=2.26.0 wavio>=0.0.4 websockets -faster_whisper>=0.5.1 \ No newline at end of file +ctranslate2==3.16.1 +faster_whisper==0.6.0 From e6ea125e602f7c62cf6f0ca5827e6e62697aa3f6 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 10 Jul 2023 12:14:27 +0200 Subject: [PATCH 147/172] fix timeout issues in celery --- celery_app/celeryapp.py | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/celery_app/celeryapp.py b/celery_app/celeryapp.py index e04d73b..b432831 100644 --- a/celery_app/celeryapp.py +++ b/celery_app/celeryapp.py @@ -10,9 +10,14 @@ if os.environ.get("BROKER_PASS", False): components = broker_url.split("//") broker_url = f'{components[0]}//:{os.environ.get("BROKER_PASS")}@{components[1]}' + celery.conf.broker_url = f"{broker_url}/0" celery.conf.result_backend = f"{broker_url}/1" -celery.conf.update(result_expires=3600, task_acks_late=True, task_track_started=True) +celery.conf.task_acks_late = False +celery.conf.task_track_started = True +celery.conf.broker_transport_options = {"visibility_timeout": float("inf")} +# celery.conf.result_backend_transport_options = {"visibility_timeout": float("inf")} +# celery.conf.result_expires = 3600 * 24 # Queues celery.conf.update( From f9435e7a6ee8b89a16f6377940f7df02d9cd71ef Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 10 Jul 2023 12:14:47 +0200 Subject: [PATCH 148/172] cosm --- http_server/ingress.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/http_server/ingress.py b/http_server/ingress.py index 0e8a640..e78967d 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -113,7 +113,7 @@ def server_error(error): { "bind": f"0.0.0.0:{args.service_port}", "workers": args.workers, - "timeout": 3600, + "timeout": 3600 * 24, }, ) logger.info(args) From 00ac19a3dc142787b520ec34dbf0c6b9c258492a Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 21 Aug 2023 17:27:22 +0200 Subject: [PATCH 149/172] update to latest --- requirements.ctranslate2.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/requirements.ctranslate2.txt b/requirements.ctranslate2.txt index f1d0e5a..054d595 100644 --- a/requirements.ctranslate2.txt +++ b/requirements.ctranslate2.txt @@ -10,5 +10,5 @@ pyyaml>=5.4.1 requests>=2.26.0 wavio>=0.0.4 
websockets -ctranslate2==3.16.1 -faster_whisper==0.6.0 +ctranslate2==3.18.0 +faster_whisper==0.7.1 \ No newline at end of file From 9d385f7c1df02895bb87bd36b4fd03c611dd43a3 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 22 Aug 2023 15:29:31 +0200 Subject: [PATCH 150/172] support of Whisper models finetuned with transformers Python package (or in HuggingFace formats) --- stt/processing/load_model.py | 258 ++++++++++++++++++++++++++++++++++- 1 file changed, 253 insertions(+), 5 deletions(-) diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 18541e2..8a91b75 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -1,5 +1,8 @@ import os +import sys import time +import shutil +import subprocess from stt import logger, USE_CTRANSLATE2 @@ -48,6 +51,65 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): device_index = [int(dev) for dev in device[5:].split(",")] device = "cuda" + if not os.path.isfile(os.path.join(model_type_or_file, "model.bin")) and \ + model_type_or_file not in ["tiny.en", "tiny", "base.en", "base", "small.en", "small", "medium.en", "medium", "large-v1", "large-v2"]: + + # Convert transformer model + + output_dir = os.path.join(download_root, f"ctranslate2/converters/transformers--{model_type_or_file.replace('/', '--')}") + logger.info(f"CTranslate2 model in {output_dir}") + if not os.path.isdir(output_dir): + + import huggingface_hub + + delete_hf_path = False + if not os.path.isdir(model_type_or_file): + + hf_path = huggingface_hub.hf_hub_download(repo_id=model_type_or_file, filename="pytorch_model.bin") + hf_path = os.path.dirname(os.path.dirname(os.path.dirname(hf_path))) + + delete_hf_path = not os.path.exists(hf_path) + else: + assert os.path.isfile(os.path.join(model_type_or_file, "pytorch_model.bin")), f"Could not find pytorch_model.bin in {model_type_or_file}" + + check_torch_installed() + + # from ctranslate2.converters.transformers import TransformersConverter + # converter = TransformersConverter( + # model_type_or_file, + # activation_scales=None, # Path to the pre-computed activation scales, see https://github.com/mit-han-lab/smoothquant + # copy_files=[], # Note: "tokenizer.json" does not always exist, we will copy it separately + # load_as_float16=False, + # revision=None, + # low_cpu_mem_usage=False, + # trust_remote_code=False, + # ) + + try: + # converter.convert( + # output_dir, + # force=False + # ) + + subprocess.check_call([ + "ct2-transformers-converter", + "--model", model_type_or_file, + "--output_dir", os.path.realpath(output_dir), + "--quantization", "float16", + ]) + except Exception as err: + shutil.rmtree(output_dir, ignore_errors=True) + raise err + + finally: + if delete_hf_path: + logger.info(f"Deleting {hf_path}") + shutil.rmtree(hf_path, ignore_errors=True) + + assert os.path.isdir(output_dir), f"Failed to build {output_dir}" + + model_type_or_file = output_dir + model = None for i, compute_type in enumerate(compute_types): try: @@ -62,6 +124,7 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): ) break except ValueError as err: + logger.info("WARNING: failed to load model with compute_type={}".format(compute_type)) # On some old GPU we may have the error # "ValueError: Requested int8_float16 compute type, # but the target device or backend do not support efficient int8_float16 computation." 
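Editorial aside on the compute-type fallback above: CTranslate2 can also report which compute types a device supports, which would avoid catching ValueError for each candidate. The sketch below is hypothetical (pick_compute_type and its preference order are assumptions for illustration, not part of the patch).

    import ctranslate2

    def pick_compute_type(device: str, device_index: int = 0) -> str:
        # Candidates mirror the ones tried in the loop above, lightest first.
        preferred = ["int8", "int8_float16", "float16", "float32"] if device == "cuda" else ["int8", "float32"]
        supported = ctranslate2.get_supported_compute_types(device, device_index)
        # Fall back to CTranslate2's own "default" selection if none is reported.
        return next((c for c in preferred if c in supported), "default")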
@@ -69,13 +132,198 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): raise err else: - model = whisper.load_model( - model_type_or_file, device=device, - download_root=os.path.join(download_root, "whisper") - ) + + extension = os.path.splitext(model_type_or_file)[-1] if os.path.isfile(model_type_or_file) else None + + if model_type_or_file in whisper.available_models() or extension == ".pt": + + model = whisper.load_model( + model_type_or_file, device=device, + download_root=os.path.join(download_root, "whisper") + ) + + else: + + # Convert HuggingFace model + import torch + + peft_folder = None + + if extension in [".ckpt", ".bin"]: + model_path = model_type_or_file + else: + # Search for the cached file (download if necessary) + if os.path.isdir(model_type_or_file): + for root, _, files in os.walk(model_type_or_file): + if "adapter_config.json" in files: + peft_folder = root + break + try: + import transformers + except ImportError: + raise ImportError(f"If you are trying to download a HuggingFace model with {model_type_or_file}, please install first the transformers library") + from transformers.utils import cached_file + + try: + model_path = cached_file(model_type_or_file, "pytorch_model.bin", cache_dir=download_root, use_auth_token=None, revision=None) + except Exception as e: + try: + if isinstance(e, OSError): + model_path = cached_file(model_type_or_file, "whisper.ckpt", cache_dir=download_root, use_auth_token=None, revision=None) + else: + raise e + except: + if peft_folder is None: + raise RuntimeError(f"Original error: {e}\nCould not find model {model_type_or_file} from HuggingFace nor local folders.") + + # Load HF Model + if peft_folder is not None: + from peft import PeftConfig, PeftModel + import transformers + + peft_config = PeftConfig.from_pretrained(peft_folder) + base_model = peft_config.base_model_name_or_path + + model = transformers.WhisperForConditionalGeneration.from_pretrained(base_model) + model = PeftModel.from_pretrained(model, peft_folder) + hf_state_dict = model.state_dict() + del model + else: + hf_state_dict = torch.load(model_path, map_location="cpu") + + # Rename layers + for key in list(hf_state_dict.keys()): + new_key = hf_to_whisper_states(key) + if new_key is None: + hf_state_dict.pop(key) + elif new_key != key: + hf_state_dict[new_key] = hf_state_dict.pop(key) + + # Init Whisper Model and replace model weights + dims = whisper.model.ModelDimensions(**states_to_dim(hf_state_dict)) + if "proj_out.weight" in hf_state_dict: + hf_state_dict["decoder.proj_out.weight"] = hf_state_dict.pop("proj_out.weight") + print("WARNING: Using untied projection layer") + whisper_model = WhisperUntied(dims) + else: + whisper_model = whisper.model.Whisper(dims) + whisper_model.load_state_dict(hf_state_dict) + del hf_state_dict + whisper_model = whisper_model.to(device) + return whisper_model + model.eval() model.requires_grad_(False) logger.info("Whisper model loaded. 
(t={}s)".format(time.time() - start)) - return model \ No newline at end of file + return model + + +def check_torch_installed(): + try: + import torch + except ImportError: + # Install transformers with torch + subprocess.check_call([sys.executable, "-m", "pip", "install", "transformers[torch]>=4.23"]) + + # # Re-load ctranslate2 + # import importlib + # import ctranslate2 + # importlib.reload(ctranslate2) + # importlib.reload(ctranslate2.converters.transformers) + + # import torch + +# Credit: https://github.com/openai/whisper/discussions/830 +def hf_to_whisper_states(text): + import re + + # From Speechbrain + if text == "_mel_filters": + return None + + # From PEFT + if "default" in text: + # print(f"WARNING: Ignoring {text}") + return None + if text.startswith("base_model.model."): + text = text[len("base_model.model."):] + + text = re.sub('.layers.', '.blocks.', text) + text = re.sub('.self_attn.', '.attn.', text) + text = re.sub('.q_proj.', '.query.', text) + text = re.sub('.k_proj.', '.key.', text) + text = re.sub('.v_proj.', '.value.', text) + text = re.sub('.out_proj.', '.out.', text) + text = re.sub('.fc1.', '.mlp.0.', text) + text = re.sub('.fc2.', '.mlp.2.', text) + text = re.sub('.fc3.', '.mlp.3.', text) + text = re.sub('.fc3.', '.mlp.3.', text) + text = re.sub('.encoder_attn.', '.cross_attn.', text) + text = re.sub('.cross_attn.ln.', '.cross_attn_ln.', text) + text = re.sub('.embed_positions.weight', '.positional_embedding', text) + text = re.sub('.embed_tokens.', '.token_embedding.', text) + text = re.sub('model.', '', text) + text = re.sub('attn.layer_norm.', 'attn_ln.', text) + text = re.sub('.final_layer_norm.', '.mlp_ln.', text) + text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text) + text = re.sub('decoder.layer_norm.', 'decoder.ln.', text) + return text + +def states_to_dim(state_dict): + n_audio_state = len(state_dict['encoder.ln_post.bias']) + n_text_state = len(state_dict["decoder.ln.bias"]) + return { + "n_mels": state_dict["encoder.conv1.weight"].shape[1], # 80 + "n_vocab": state_dict["decoder.token_embedding.weight"].shape[0], # 51864 / 51865 + "n_audio_ctx": state_dict["encoder.positional_embedding"].shape[0], # 1500 + "n_audio_state": n_audio_state, # 384 / 512 / 768 / 1024 / 1280 + "n_audio_head": n_audio_state // 64, # 6 / 8 / 12 / 16 / 20 + "n_audio_layer": len(set([".".join(k.split(".")[:3]) for k in state_dict.keys() if "encoder.blocks." in k])), # 4 / 6 / 12 / 24 / 32 + "n_text_ctx": state_dict["decoder.positional_embedding"].shape[0], # 448 + "n_text_state": n_text_state, # 384 / 512 / 768 / 1024 / 1280 + "n_text_head": n_text_state // 64, # 6 / 8 / 12 / 16 / 20 + "n_text_layer": len(set([".".join(k.split(".")[:3]) for k in state_dict.keys() if "decoder.blocks." 
in k])), # 4 / 6 / 12 / 24 / 32 + } + +if not USE_CTRANSLATE2: + + class TextDecoderUntied(whisper.model.TextDecoder): + """ + Same as TextDecoder but with untied weights + """ + def __init__(self, *args, **kwargs): + import torch + super().__init__(*args, **kwargs) + + n_vocab, n_state = self.token_embedding.weight.shape + + self.proj_out = torch.nn.Linear(n_state, n_vocab, bias=False) + + def forward(self, x, xa, kv_cache = None): + offset = next(iter(kv_cache.values())).shape[1] if kv_cache else 0 + x = self.token_embedding(x) + self.positional_embedding[offset : offset + x.shape[-1]] + x = x.to(xa.dtype) + + for block in self.blocks: + x = block(x, xa, mask=self.mask, kv_cache=kv_cache) + + x = self.ln(x) + + # logits = self.proj_out(x).float() + # logits = (x @ torch.transpose(self.proj_out.weight.to(x.dtype), 0, 1)).float() + logits = self.proj_out.to(x.dtype)(x).float() + + return logits + + class WhisperUntied(whisper.model.Whisper): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.decoder = TextDecoderUntied( + self.dims.n_vocab, + self.dims.n_text_ctx, + self.dims.n_text_state, + self.dims.n_text_head, + self.dims.n_text_layer, + ) From c7cf09597bfa6819173f89513a9620496d405489 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 22 Aug 2023 15:29:31 +0200 Subject: [PATCH 151/172] support of Whisper models finetuned with transformers Python package (or in HuggingFace formats) --- RELEASE.md | 3 + stt/processing/load_model.py | 258 ++++++++++++++++++++++++++++++++++- 2 files changed, 256 insertions(+), 5 deletions(-) diff --git a/RELEASE.md b/RELEASE.md index 174ae1b..5380da5 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,6 @@ +# 4.0.4 +- Add integration of Whisper models from transformers + # 4.0.3 - Tune punctuation heuristics diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 18541e2..8a91b75 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -1,5 +1,8 @@ import os +import sys import time +import shutil +import subprocess from stt import logger, USE_CTRANSLATE2 @@ -48,6 +51,65 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): device_index = [int(dev) for dev in device[5:].split(",")] device = "cuda" + if not os.path.isfile(os.path.join(model_type_or_file, "model.bin")) and \ + model_type_or_file not in ["tiny.en", "tiny", "base.en", "base", "small.en", "small", "medium.en", "medium", "large-v1", "large-v2"]: + + # Convert transformer model + + output_dir = os.path.join(download_root, f"ctranslate2/converters/transformers--{model_type_or_file.replace('/', '--')}") + logger.info(f"CTranslate2 model in {output_dir}") + if not os.path.isdir(output_dir): + + import huggingface_hub + + delete_hf_path = False + if not os.path.isdir(model_type_or_file): + + hf_path = huggingface_hub.hf_hub_download(repo_id=model_type_or_file, filename="pytorch_model.bin") + hf_path = os.path.dirname(os.path.dirname(os.path.dirname(hf_path))) + + delete_hf_path = not os.path.exists(hf_path) + else: + assert os.path.isfile(os.path.join(model_type_or_file, "pytorch_model.bin")), f"Could not find pytorch_model.bin in {model_type_or_file}" + + check_torch_installed() + + # from ctranslate2.converters.transformers import TransformersConverter + # converter = TransformersConverter( + # model_type_or_file, + # activation_scales=None, # Path to the pre-computed activation scales, see https://github.com/mit-han-lab/smoothquant + # copy_files=[], # Note: "tokenizer.json" does not always 
exist, we will copy it separately + # load_as_float16=False, + # revision=None, + # low_cpu_mem_usage=False, + # trust_remote_code=False, + # ) + + try: + # converter.convert( + # output_dir, + # force=False + # ) + + subprocess.check_call([ + "ct2-transformers-converter", + "--model", model_type_or_file, + "--output_dir", os.path.realpath(output_dir), + "--quantization", "float16", + ]) + except Exception as err: + shutil.rmtree(output_dir, ignore_errors=True) + raise err + + finally: + if delete_hf_path: + logger.info(f"Deleting {hf_path}") + shutil.rmtree(hf_path, ignore_errors=True) + + assert os.path.isdir(output_dir), f"Failed to build {output_dir}" + + model_type_or_file = output_dir + model = None for i, compute_type in enumerate(compute_types): try: @@ -62,6 +124,7 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): ) break except ValueError as err: + logger.info("WARNING: failed to load model with compute_type={}".format(compute_type)) # On some old GPU we may have the error # "ValueError: Requested int8_float16 compute type, # but the target device or backend do not support efficient int8_float16 computation." @@ -69,13 +132,198 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): raise err else: - model = whisper.load_model( - model_type_or_file, device=device, - download_root=os.path.join(download_root, "whisper") - ) + + extension = os.path.splitext(model_type_or_file)[-1] if os.path.isfile(model_type_or_file) else None + + if model_type_or_file in whisper.available_models() or extension == ".pt": + + model = whisper.load_model( + model_type_or_file, device=device, + download_root=os.path.join(download_root, "whisper") + ) + + else: + + # Convert HuggingFace model + import torch + + peft_folder = None + + if extension in [".ckpt", ".bin"]: + model_path = model_type_or_file + else: + # Search for the cached file (download if necessary) + if os.path.isdir(model_type_or_file): + for root, _, files in os.walk(model_type_or_file): + if "adapter_config.json" in files: + peft_folder = root + break + try: + import transformers + except ImportError: + raise ImportError(f"If you are trying to download a HuggingFace model with {model_type_or_file}, please install first the transformers library") + from transformers.utils import cached_file + + try: + model_path = cached_file(model_type_or_file, "pytorch_model.bin", cache_dir=download_root, use_auth_token=None, revision=None) + except Exception as e: + try: + if isinstance(e, OSError): + model_path = cached_file(model_type_or_file, "whisper.ckpt", cache_dir=download_root, use_auth_token=None, revision=None) + else: + raise e + except: + if peft_folder is None: + raise RuntimeError(f"Original error: {e}\nCould not find model {model_type_or_file} from HuggingFace nor local folders.") + + # Load HF Model + if peft_folder is not None: + from peft import PeftConfig, PeftModel + import transformers + + peft_config = PeftConfig.from_pretrained(peft_folder) + base_model = peft_config.base_model_name_or_path + + model = transformers.WhisperForConditionalGeneration.from_pretrained(base_model) + model = PeftModel.from_pretrained(model, peft_folder) + hf_state_dict = model.state_dict() + del model + else: + hf_state_dict = torch.load(model_path, map_location="cpu") + + # Rename layers + for key in list(hf_state_dict.keys()): + new_key = hf_to_whisper_states(key) + if new_key is None: + hf_state_dict.pop(key) + elif new_key != key: + hf_state_dict[new_key] = hf_state_dict.pop(key) + + # Init 
Whisper Model and replace model weights + dims = whisper.model.ModelDimensions(**states_to_dim(hf_state_dict)) + if "proj_out.weight" in hf_state_dict: + hf_state_dict["decoder.proj_out.weight"] = hf_state_dict.pop("proj_out.weight") + print("WARNING: Using untied projection layer") + whisper_model = WhisperUntied(dims) + else: + whisper_model = whisper.model.Whisper(dims) + whisper_model.load_state_dict(hf_state_dict) + del hf_state_dict + whisper_model = whisper_model.to(device) + return whisper_model + model.eval() model.requires_grad_(False) logger.info("Whisper model loaded. (t={}s)".format(time.time() - start)) - return model \ No newline at end of file + return model + + +def check_torch_installed(): + try: + import torch + except ImportError: + # Install transformers with torch + subprocess.check_call([sys.executable, "-m", "pip", "install", "transformers[torch]>=4.23"]) + + # # Re-load ctranslate2 + # import importlib + # import ctranslate2 + # importlib.reload(ctranslate2) + # importlib.reload(ctranslate2.converters.transformers) + + # import torch + +# Credit: https://github.com/openai/whisper/discussions/830 +def hf_to_whisper_states(text): + import re + + # From Speechbrain + if text == "_mel_filters": + return None + + # From PEFT + if "default" in text: + # print(f"WARNING: Ignoring {text}") + return None + if text.startswith("base_model.model."): + text = text[len("base_model.model."):] + + text = re.sub('.layers.', '.blocks.', text) + text = re.sub('.self_attn.', '.attn.', text) + text = re.sub('.q_proj.', '.query.', text) + text = re.sub('.k_proj.', '.key.', text) + text = re.sub('.v_proj.', '.value.', text) + text = re.sub('.out_proj.', '.out.', text) + text = re.sub('.fc1.', '.mlp.0.', text) + text = re.sub('.fc2.', '.mlp.2.', text) + text = re.sub('.fc3.', '.mlp.3.', text) + text = re.sub('.fc3.', '.mlp.3.', text) + text = re.sub('.encoder_attn.', '.cross_attn.', text) + text = re.sub('.cross_attn.ln.', '.cross_attn_ln.', text) + text = re.sub('.embed_positions.weight', '.positional_embedding', text) + text = re.sub('.embed_tokens.', '.token_embedding.', text) + text = re.sub('model.', '', text) + text = re.sub('attn.layer_norm.', 'attn_ln.', text) + text = re.sub('.final_layer_norm.', '.mlp_ln.', text) + text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text) + text = re.sub('decoder.layer_norm.', 'decoder.ln.', text) + return text + +def states_to_dim(state_dict): + n_audio_state = len(state_dict['encoder.ln_post.bias']) + n_text_state = len(state_dict["decoder.ln.bias"]) + return { + "n_mels": state_dict["encoder.conv1.weight"].shape[1], # 80 + "n_vocab": state_dict["decoder.token_embedding.weight"].shape[0], # 51864 / 51865 + "n_audio_ctx": state_dict["encoder.positional_embedding"].shape[0], # 1500 + "n_audio_state": n_audio_state, # 384 / 512 / 768 / 1024 / 1280 + "n_audio_head": n_audio_state // 64, # 6 / 8 / 12 / 16 / 20 + "n_audio_layer": len(set([".".join(k.split(".")[:3]) for k in state_dict.keys() if "encoder.blocks." in k])), # 4 / 6 / 12 / 24 / 32 + "n_text_ctx": state_dict["decoder.positional_embedding"].shape[0], # 448 + "n_text_state": n_text_state, # 384 / 512 / 768 / 1024 / 1280 + "n_text_head": n_text_state // 64, # 6 / 8 / 12 / 16 / 20 + "n_text_layer": len(set([".".join(k.split(".")[:3]) for k in state_dict.keys() if "decoder.blocks." 
in k])), # 4 / 6 / 12 / 24 / 32 + } + +if not USE_CTRANSLATE2: + + class TextDecoderUntied(whisper.model.TextDecoder): + """ + Same as TextDecoder but with untied weights + """ + def __init__(self, *args, **kwargs): + import torch + super().__init__(*args, **kwargs) + + n_vocab, n_state = self.token_embedding.weight.shape + + self.proj_out = torch.nn.Linear(n_state, n_vocab, bias=False) + + def forward(self, x, xa, kv_cache = None): + offset = next(iter(kv_cache.values())).shape[1] if kv_cache else 0 + x = self.token_embedding(x) + self.positional_embedding[offset : offset + x.shape[-1]] + x = x.to(xa.dtype) + + for block in self.blocks: + x = block(x, xa, mask=self.mask, kv_cache=kv_cache) + + x = self.ln(x) + + # logits = self.proj_out(x).float() + # logits = (x @ torch.transpose(self.proj_out.weight.to(x.dtype), 0, 1)).float() + logits = self.proj_out.to(x.dtype)(x).float() + + return logits + + class WhisperUntied(whisper.model.Whisper): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.decoder = TextDecoderUntied( + self.dims.n_vocab, + self.dims.n_text_ctx, + self.dims.n_text_state, + self.dims.n_text_head, + self.dims.n_text_layer, + ) From 80922dfce999b22c323d911e2a8698fb328f2e21 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 22 Aug 2023 17:09:13 +0200 Subject: [PATCH 152/172] add option for prompt --- RELEASE.md | 1 + stt/processing/decoding.py | 2 ++ 2 files changed, 3 insertions(+) diff --git a/RELEASE.md b/RELEASE.md index 5380da5..3917468 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,5 +1,6 @@ # 4.0.4 - Add integration of Whisper models from transformers +- Add support of prompt from Whisper models (env variable PROMPT) # 4.0.3 - Tune punctuation heuristics diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 4b1e1d5..8b81c5d 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -26,6 +26,7 @@ default_best_of = None default_temperature = 0.0 +default_initial_prompt = os.environ.get("PROMPT", None) def decode(audio, model, @@ -39,6 +40,7 @@ def decode(audio, condition_on_previous_text: bool = False, no_speech_threshold: float = 0.6, compression_ratio_threshold: float = 2.4, + initial_prompt: str = default_initial_prompt, ) -> dict: if language is None: From bc2ca6380b8b0bc5ec8fe19b0dc5dac8beff8797 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 22 Aug 2023 17:13:29 +0200 Subject: [PATCH 153/172] Use persistent prompt (when there is an initial_prompt and when condition_on_previous_text is False --- Dockerfile.ctranslate2 | 2 +- Dockerfile.ctranslate2.cpu | 2 +- requirements.ctranslate2.txt | 3 ++- 3 files changed, 4 insertions(+), 3 deletions(-) diff --git a/Dockerfile.ctranslate2 b/Dockerfile.ctranslate2 index e2e0008..64afb50 100644 --- a/Dockerfile.ctranslate2 +++ b/Dockerfile.ctranslate2 @@ -1,7 +1,7 @@ FROM ghcr.io/opennmt/ctranslate2:latest-ubuntu20.04-cuda11.2 LABEL maintainer="jlouradour@linagora.com" -RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg +RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ffmpeg git # Install python dependencies COPY requirements.ctranslate2.txt ./ diff --git a/Dockerfile.ctranslate2.cpu b/Dockerfile.ctranslate2.cpu index 46c148e..fc30d21 100644 --- a/Dockerfile.ctranslate2.cpu +++ b/Dockerfile.ctranslate2.cpu @@ -1,7 +1,7 @@ FROM python:3.9 LABEL maintainer="jlouradour@linagora.com" -RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg +RUN apt-get update && 
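The helpers defined above (hf_to_whisper_states and states_to_dim) are what turn a HuggingFace Whisper checkpoint into an openai-whisper compatible model. A minimal sketch of how they fit together, assuming torch and whisper are installed and reusing the names from load_model.py (the checkpoint path is hypothetical):

import torch
import whisper

# Load the HuggingFace checkpoint and rename its keys to the openai-whisper layout
hf_state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # hypothetical local path
for key in list(hf_state_dict.keys()):
    new_key = hf_to_whisper_states(key)
    if new_key is None:              # keys such as "_mel_filters" or PEFT defaults are dropped
        hf_state_dict.pop(key)
    elif new_key != key:
        hf_state_dict[new_key] = hf_state_dict.pop(key)

# Infer the model dimensions from the renamed weights and build the Whisper model
dims = whisper.model.ModelDimensions(**states_to_dim(hf_state_dict))
whisper_model = whisper.model.Whisper(dims)   # tied-embedding case (no "proj_out.weight")
whisper_model.load_state_dict(hf_state_dict)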
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ffmpeg git # Install python dependencies COPY requirements.ctranslate2.txt ./ diff --git a/requirements.ctranslate2.txt b/requirements.ctranslate2.txt index 054d595..ef4e8cc 100644 --- a/requirements.ctranslate2.txt +++ b/requirements.ctranslate2.txt @@ -11,4 +11,5 @@ requests>=2.26.0 wavio>=0.0.4 websockets ctranslate2==3.18.0 -faster_whisper==0.7.1 \ No newline at end of file +#faster_whisper==0.7.1 +git+https://github.com/linto-ai/faster-whisper.git@d9cffcaad763def754124977cc66150f0efcd7ea \ No newline at end of file From 7d386cf2f6b4d5472231b06125958446d70d05d2 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 23 Aug 2023 10:39:05 +0200 Subject: [PATCH 154/172] fix possible failure when a segment starts with a punctuation --- RELEASE.md | 1 + stt/processing/decoding.py | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/RELEASE.md b/RELEASE.md index 3917468..a53454e 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,6 +1,7 @@ # 4.0.4 - Add integration of Whisper models from transformers - Add support of prompt from Whisper models (env variable PROMPT) +- Fix possible failure when a Whisper segment starts with a punctuation # 4.0.3 - Tune punctuation heuristics diff --git a/stt/processing/decoding.py b/stt/processing/decoding.py index 8b81c5d..42b3c35 100644 --- a/stt/processing/decoding.py +++ b/stt/processing/decoding.py @@ -296,7 +296,7 @@ def checked_timestamps(start, end=None): for word in segment.words: start, end = checked_timestamps(word.start, word.end) word_strip = word.word.strip() - if glue_punctuations and len(word_strip)>1 and word_strip[0] in glue_punctuations: + if glue_punctuations and len(words) and len(word_strip)>1 and word_strip[0] in glue_punctuations: words[-1]["text"] += word.word.lstrip() words[-1]["confidence"].append(word.probability) words[-1]["end"] = max(words[-1]["end"], end) From 65e2ab0436e88424d3845ff49a927056cd763f85 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 30 Aug 2023 12:09:00 +0200 Subject: [PATCH 155/172] improve README --- README.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 164db6f..8e6b04d 100644 --- a/README.md +++ b/README.md @@ -84,12 +84,13 @@ cp .envdefault .env |---|---|---| | SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | `http` \| `task` | | MODEL | Path to the Whisper model, or type of Whisper model used. | \ \| `medium` \| `large-v1` \| ... | -| ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | \ \| `WAV2VEC2_ASR_BASE_960H` \| `jonatasgrosman/wav2vec2-large-xlsr-53-english` \| ... | | LANGUAGE | (Optional) Language to recognize | `*` \| `fr` \| `fr-FR` \| `French` \| `en` \| `en-US` \| `English` \| ... | -| SERVICE_NAME | Using the task mode, set the queue's name for task processing | `my-stt` | -| SERVICE_BROKER | Using the task mode, URL of the message broker | `redis://my-broker:6379` | -| BROKER_PASS | Using the task mode, broker password | `my-password` | +| PROMPT | (Optional) Prompt to use for the Whisper model | `some free text to encourage a certain transcription style (disfluencies, no punctuation, ...)` | +| ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | \ \| `WAV2VEC2_ASR_BASE_960H` \| `jonatasgrosman/wav2vec2-large-xlsr-53-english` \| ... 
| | CONCURRENCY | Maximum number of parallel requests | `3` | +| SERVICE_NAME | (For the task mode) queue's name for task processing | `my-stt` | +| SERVICE_BROKER | (For the task mode) URL of the message broker | `redis://my-broker:6379` | +| BROKER_PASS | (For the task mode only) broker password | `my-password` | If `*` is used for the `LANGUAGE` environment variable, or if `LANGUAGE` is not defined, automatic language detection will be performed by Whisper. From 5eb31eae80b72c9afe8d3762084e6b41aac7c242 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 30 Aug 2023 15:01:14 +0200 Subject: [PATCH 156/172] do not publish numbered tag on whisper branch --- Jenkinsfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Jenkinsfile b/Jenkinsfile index 75a09bd..e96f25d 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -62,7 +62,7 @@ pipeline { ).trim() docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - image.push("${VERSION}") + // image.push("${VERSION}") image.push('whisper-latest') } } From 97320ce2bd6d130e3168665d9693e899aa1d7f45 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Mon, 13 Nov 2023 15:34:17 +0100 Subject: [PATCH 157/172] Support of new Whisper model large-v3 --- requirements.ctranslate2.txt | 6 +++--- stt/processing/load_model.py | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/requirements.ctranslate2.txt b/requirements.ctranslate2.txt index ef4e8cc..2ddc118 100644 --- a/requirements.ctranslate2.txt +++ b/requirements.ctranslate2.txt @@ -10,6 +10,6 @@ pyyaml>=5.4.1 requests>=2.26.0 wavio>=0.0.4 websockets -ctranslate2==3.18.0 -#faster_whisper==0.7.1 -git+https://github.com/linto-ai/faster-whisper.git@d9cffcaad763def754124977cc66150f0efcd7ea \ No newline at end of file +#faster_whisper==0.10.0 +# This is version faster_whisper==0.9.0 + prompt propagation + fix for large-v3 +git+https://github.com/linto-ai/faster-whisper.git@aad9e7508b528e79be2a9975ac79ef8317f02a6d \ No newline at end of file diff --git a/stt/processing/load_model.py b/stt/processing/load_model.py index 8a91b75..3790593 100644 --- a/stt/processing/load_model.py +++ b/stt/processing/load_model.py @@ -52,7 +52,7 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): device = "cuda" if not os.path.isfile(os.path.join(model_type_or_file, "model.bin")) and \ - model_type_or_file not in ["tiny.en", "tiny", "base.en", "base", "small.en", "small", "medium.en", "medium", "large-v1", "large-v2"]: + not max([model_type_or_file.startswith(prefix) for prefix in ["tiny", "base", "small", "medium", "large"]]): # Convert transformer model From 586f1d60b9a00e722c00cbaef2fe01bbb58acb4a Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 22 Nov 2023 14:43:22 +0100 Subject: [PATCH 158/172] Update release note --- .envdefault | 2 +- RELEASE.md | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/.envdefault b/.envdefault index 1dbc2b1..88c27ea 100644 --- a/.envdefault +++ b/.envdefault @@ -13,7 +13,7 @@ BROKER_PASS= # STT MODELING PARAMETERS ############################################ -# The model can be a path to a model, or a model name ("tiny", "base", "small", "medium", "large-v1" or "large-v2") +# The model can be a path to a model, or a model name ("tiny", "base", "small", "medium", "large-v1", "large-v2" or "large-v3") MODEL=medium # The language can be in different formats: "en", "en-US", "English", ... 
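For reference, the variables documented in the README table above are read inside the service roughly as follows; this is a minimal sketch based on the defaults used elsewhere in this patch series (MODEL falls back to "medium", PROMPT is optional):

import os

model_type = os.environ.get("MODEL", "medium")            # model name ("large-v3", ...) or local path
language = os.environ.get("LANGUAGE")                     # unset or "*" -> automatic language detection
default_initial_prompt = os.environ.get("PROMPT", None)   # optional free text passed to Whisper as initial prompt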
diff --git a/RELEASE.md b/RELEASE.md index a53454e..f507fa5 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -1,3 +1,6 @@ +# 4.0.5 +- Support of Whisper large-v3 model + # 4.0.4 - Add integration of Whisper models from transformers - Add support of prompt from Whisper models (env variable PROMPT) From b2fc9f072cd7616a33f585e0a566274785902d65 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 22 Nov 2023 15:15:29 +0100 Subject: [PATCH 159/172] publish tag 4.0.4 --- Jenkinsfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Jenkinsfile b/Jenkinsfile index e96f25d..75a09bd 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -62,7 +62,7 @@ pipeline { ).trim() docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - // image.push("${VERSION}") + image.push("${VERSION}") image.push('whisper-latest') } } From 74cb4b424fabab2adf28c2887f31e6c7d5b00ce2 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 29 Nov 2023 18:36:38 +0100 Subject: [PATCH 160/172] Isolate what is specific to Whisper in a folder --- RELEASE.md | 72 ------------------- .envdefault => whisper/.envdefault | 0 .../Dockerfile.ctranslate2 | 6 +- .../Dockerfile.ctranslate2.cpu | 6 +- Dockerfile.torch => whisper/Dockerfile.torch | 6 +- .../Dockerfile.torch.cpu | 6 +- README.md => whisper/README.md | 0 whisper/RELEASE.md | 16 +++++ .../docker-entrypoint.sh | 0 .../requirements.ctranslate2.txt | 0 .../requirements.torch.txt | 0 {stt => whisper/stt}/__init__.py | 0 {stt => whisper/stt}/processing/__init__.py | 0 .../stt}/processing/alignment_model.py | 0 {stt => whisper/stt}/processing/decoding.py | 3 + {stt => whisper/stt}/processing/load_model.py | 0 .../stt}/processing/text_normalize.py | 0 {stt => whisper/stt}/processing/utils.py | 0 .../stt}/processing/word_alignment.py | 0 19 files changed, 31 insertions(+), 84 deletions(-) delete mode 100644 RELEASE.md rename .envdefault => whisper/.envdefault (100%) rename Dockerfile.ctranslate2 => whisper/Dockerfile.ctranslate2 (81%) rename Dockerfile.ctranslate2.cpu => whisper/Dockerfile.ctranslate2.cpu (80%) rename Dockerfile.torch => whisper/Dockerfile.torch (79%) rename Dockerfile.torch.cpu => whisper/Dockerfile.torch.cpu (83%) rename README.md => whisper/README.md (100%) create mode 100644 whisper/RELEASE.md rename docker-entrypoint.sh => whisper/docker-entrypoint.sh (100%) rename requirements.ctranslate2.txt => whisper/requirements.ctranslate2.txt (100%) rename requirements.torch.txt => whisper/requirements.torch.txt (100%) rename {stt => whisper/stt}/__init__.py (100%) rename {stt => whisper/stt}/processing/__init__.py (100%) rename {stt => whisper/stt}/processing/alignment_model.py (100%) rename {stt => whisper/stt}/processing/decoding.py (99%) rename {stt => whisper/stt}/processing/load_model.py (100%) rename {stt => whisper/stt}/processing/text_normalize.py (100%) rename {stt => whisper/stt}/processing/utils.py (100%) rename {stt => whisper/stt}/processing/word_alignment.py (100%) diff --git a/RELEASE.md b/RELEASE.md deleted file mode 100644 index f507fa5..0000000 --- a/RELEASE.md +++ /dev/null @@ -1,72 +0,0 @@ -# 4.0.5 -- Support of Whisper large-v3 model - -# 4.0.4 -- Add integration of Whisper models from transformers -- Add support of prompt from Whisper models (env variable PROMPT) -- Fix possible failure when a Whisper segment starts with a punctuation - -# 4.0.3 -- Tune punctuation heuristics - -# 4.0.2 -- Do not considers symbols like "$" as punctuation marks - -# 4.0.1 -- Fix punctuations - -# 4.0.0 -- Integration of Whisper - -# 3.3.2 -- 
Fixed use of stereo audio in http serving mode - -# 3.3.1 -- Fixed lin_to_vosk throwing an error on a already existing container. -- Corrected an error on the README regarding mounting model volumes. -- Code styling (PEP 8) - -# 3.3.0 -- Added optional streaming route to the http serving mode -- Added serving mode: websocket -- Added Dynamic model conversion allowing to use either Vosk Models or Linagora AM/LM models -- Changer Vosk dependency to alphacep/vosk -- Updated README.md - -# 3.2.1 -- Repository total rework. The goal being to have a simple transcription service embeddable within a micro-service infrastructure. -- Changed repository name from linto-platform-stt-standalone-worker to linto-platform-stt. -- Added celery connector for microservice integration. -- Added launch option to specify serving mode between task and http. -- Removed diarization functionnality. -- Removed punctuation functionnality. -- Removed Async requests/Job management. -- Updated README to reflect those changes. - -# 3.1.1 -- Change Pykaldi with vosk-API (no python wrapper for decoding function, no extrat packages during installation, c++ implementation based on kaldi functions) -- New feature: Compute a confidence score per transcription -- Fix minor bugs - -# 2.2.1 -- Fix minor bugs -- put SWAGGER_PATH parameter as optional -- Generate the word_boundary file if it does not exist - -# 2.2.0 -- Speaker diarization feature: pyBK package -- Mulithreading feature: Speech decoding and Speaker diarization processes -- Optional parameter: real number of speaker in the audio - -# 2.0.0 -- Reimplement LinTO-Platform-stt-standalone-worker using Pykaldi package - -# 1.1.2 -- New features: - - Word timestamp computing - - Response type: plain/text: simple text output and application/json: the transcription and the words timestamp. 
- - Swagger: integrate swagger in the service using a python package - - Fix minor bugs - -# 1.0.0 -- First build of LinTO-Platform-stt-standalone-worker \ No newline at end of file diff --git a/.envdefault b/whisper/.envdefault similarity index 100% rename from .envdefault rename to whisper/.envdefault diff --git a/Dockerfile.ctranslate2 b/whisper/Dockerfile.ctranslate2 similarity index 81% rename from Dockerfile.ctranslate2 rename to whisper/Dockerfile.ctranslate2 index 64afb50..52fbc44 100644 --- a/Dockerfile.ctranslate2 +++ b/whisper/Dockerfile.ctranslate2 @@ -4,17 +4,17 @@ LABEL maintainer="jlouradour@linagora.com" RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ffmpeg git # Install python dependencies -COPY requirements.ctranslate2.txt ./ +COPY whisper/requirements.ctranslate2.txt ./ RUN pip install --no-cache-dir -r requirements.ctranslate2.txt && rm requirements.ctranslate2.txt WORKDIR /usr/src/app -COPY stt /usr/src/app/stt COPY celery_app /usr/src/app/celery_app COPY http_server /usr/src/app/http_server COPY websocket /usr/src/app/websocket COPY document /usr/src/app/document -COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ +COPY whisper/stt /usr/src/app/stt +COPY whisper/docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" diff --git a/Dockerfile.ctranslate2.cpu b/whisper/Dockerfile.ctranslate2.cpu similarity index 80% rename from Dockerfile.ctranslate2.cpu rename to whisper/Dockerfile.ctranslate2.cpu index fc30d21..c8d6972 100644 --- a/Dockerfile.ctranslate2.cpu +++ b/whisper/Dockerfile.ctranslate2.cpu @@ -4,17 +4,17 @@ LABEL maintainer="jlouradour@linagora.com" RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ffmpeg git # Install python dependencies -COPY requirements.ctranslate2.txt ./ +COPY whisper/requirements.ctranslate2.txt ./ RUN pip install --no-cache-dir -r requirements.ctranslate2.txt && rm requirements.ctranslate2.txt WORKDIR /usr/src/app -COPY stt /usr/src/app/stt COPY celery_app /usr/src/app/celery_app COPY http_server /usr/src/app/http_server COPY websocket /usr/src/app/websocket COPY document /usr/src/app/document -COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ +COPY whisper/stt /usr/src/app/stt +COPY whisper/docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" diff --git a/Dockerfile.torch b/whisper/Dockerfile.torch similarity index 79% rename from Dockerfile.torch rename to whisper/Dockerfile.torch index 37480c0..2f3a0d0 100644 --- a/Dockerfile.torch +++ b/whisper/Dockerfile.torch @@ -4,17 +4,17 @@ LABEL maintainer="jlouradour@linagora.com" RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg # Install python dependencies -COPY requirements.torch.txt ./ +COPY whisper/requirements.torch.txt ./ RUN pip install --no-cache-dir -r requirements.torch.txt && rm requirements.torch.txt WORKDIR /usr/src/app -COPY stt /usr/src/app/stt COPY celery_app /usr/src/app/celery_app COPY http_server /usr/src/app/http_server COPY websocket /usr/src/app/websocket COPY document /usr/src/app/document -COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ +COPY whisper/stt /usr/src/app/stt +COPY whisper/docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" diff --git a/Dockerfile.torch.cpu b/whisper/Dockerfile.torch.cpu similarity index 83% rename from Dockerfile.torch.cpu rename to 
whisper/Dockerfile.torch.cpu index 72582b6..e9198d5 100644 --- a/Dockerfile.torch.cpu +++ b/whisper/Dockerfile.torch.cpu @@ -10,17 +10,17 @@ RUN pip3 install \ -f https://download.pytorch.org/whl/torch_stable.html # Install python dependencies -COPY requirements.torch.txt ./ +COPY whisper/requirements.torch.txt ./ RUN pip install --no-cache-dir -r requirements.torch.txt && rm requirements.torch.txt WORKDIR /usr/src/app -COPY stt /usr/src/app/stt COPY celery_app /usr/src/app/celery_app COPY http_server /usr/src/app/http_server COPY websocket /usr/src/app/websocket COPY document /usr/src/app/document -COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ +COPY whisper/stt /usr/src/app/stt +COPY whisper/docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ ENV PYTHONPATH="${PYTHONPATH}:/usr/src/app/stt" diff --git a/README.md b/whisper/README.md similarity index 100% rename from README.md rename to whisper/README.md diff --git a/whisper/RELEASE.md b/whisper/RELEASE.md new file mode 100644 index 0000000..2d57069 --- /dev/null +++ b/whisper/RELEASE.md @@ -0,0 +1,16 @@ +# 1.0.0 +- Support of Whisper (including large-v3 model) +- Add integration of Whisper models from transformers +- Add support of prompt from Whisper models (env variable PROMPT) +- Fix possible failure when a Whisper segment starts with a punctuation +- Tune punctuation heuristics + +# 0.0.0 +- Added optional streaming route to the http serving mode +- Added serving mode: websocket +- Added Dynamic model conversion allowing to use either Vosk Models or Linagora AM/LM models +- Added celery connector for microservice integration. +- Added launch option to specify serving mode between task and http. +- Removed Async requests/Job management. +- New feature: Compute a confidence score per transcription +- put SWAGGER_PATH parameter as optional diff --git a/docker-entrypoint.sh b/whisper/docker-entrypoint.sh similarity index 100% rename from docker-entrypoint.sh rename to whisper/docker-entrypoint.sh diff --git a/requirements.ctranslate2.txt b/whisper/requirements.ctranslate2.txt similarity index 100% rename from requirements.ctranslate2.txt rename to whisper/requirements.ctranslate2.txt diff --git a/requirements.torch.txt b/whisper/requirements.torch.txt similarity index 100% rename from requirements.torch.txt rename to whisper/requirements.torch.txt diff --git a/stt/__init__.py b/whisper/stt/__init__.py similarity index 100% rename from stt/__init__.py rename to whisper/stt/__init__.py diff --git a/stt/processing/__init__.py b/whisper/stt/processing/__init__.py similarity index 100% rename from stt/processing/__init__.py rename to whisper/stt/processing/__init__.py diff --git a/stt/processing/alignment_model.py b/whisper/stt/processing/alignment_model.py similarity index 100% rename from stt/processing/alignment_model.py rename to whisper/stt/processing/alignment_model.py diff --git a/stt/processing/decoding.py b/whisper/stt/processing/decoding.py similarity index 99% rename from stt/processing/decoding.py rename to whisper/stt/processing/decoding.py index 42b3c35..9dd6855 100644 --- a/stt/processing/decoding.py +++ b/whisper/stt/processing/decoding.py @@ -56,6 +56,7 @@ def decode(audio, kwargs.pop("alignment_model") res = decode_ct2(**kwargs) else: + print("OK") res = decode_torch(**kwargs) logger.info("Transcription complete (t={}s)".format(time.time() - start_t)) @@ -107,6 +108,7 @@ def decode_torch(audio, no_speech_threshold, compression_ratio_threshold, normalize_text_as_words=False, + initial_prompt=None, ): 
"""Transcribe the audio data using Whisper with the defined model.""" @@ -122,6 +124,7 @@ def decode_torch(audio, no_speech_threshold=no_speech_threshold, compression_ratio_threshold=compression_ratio_threshold, vad=USE_VAD, + initial_prompt=initial_prompt, ) if alignment_model is None: diff --git a/stt/processing/load_model.py b/whisper/stt/processing/load_model.py similarity index 100% rename from stt/processing/load_model.py rename to whisper/stt/processing/load_model.py diff --git a/stt/processing/text_normalize.py b/whisper/stt/processing/text_normalize.py similarity index 100% rename from stt/processing/text_normalize.py rename to whisper/stt/processing/text_normalize.py diff --git a/stt/processing/utils.py b/whisper/stt/processing/utils.py similarity index 100% rename from stt/processing/utils.py rename to whisper/stt/processing/utils.py diff --git a/stt/processing/word_alignment.py b/whisper/stt/processing/word_alignment.py similarity index 100% rename from stt/processing/word_alignment.py rename to whisper/stt/processing/word_alignment.py From 8272a5f43fa51bc67ec614e471cc099caeccd908 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 30 Nov 2023 16:03:59 +0100 Subject: [PATCH 161/172] Isolate what is specific to Kaldi in a folder --- .envdefault => kaldi/.envdefault | 0 Dockerfile => kaldi/Dockerfile | 8 ++++---- README.md => kaldi/README.md | 0 RELEASE.md => kaldi/RELEASE.md | 0 docker-entrypoint.sh => kaldi/docker-entrypoint.sh | 0 lin_to_vosk.py => kaldi/lin_to_vosk.py | 0 requirements.txt => kaldi/requirements.txt | 0 {stt => kaldi/stt}/__init__.py | 0 {stt => kaldi/stt}/processing/__init__.py | 0 {stt => kaldi/stt}/processing/decoding.py | 0 {stt => kaldi/stt}/processing/streaming.py | 0 {stt => kaldi/stt}/processing/utils.py | 0 12 files changed, 4 insertions(+), 4 deletions(-) rename .envdefault => kaldi/.envdefault (100%) rename Dockerfile => kaldi/Dockerfile (91%) rename README.md => kaldi/README.md (100%) rename RELEASE.md => kaldi/RELEASE.md (100%) rename docker-entrypoint.sh => kaldi/docker-entrypoint.sh (100%) rename lin_to_vosk.py => kaldi/lin_to_vosk.py (100%) rename requirements.txt => kaldi/requirements.txt (100%) rename {stt => kaldi/stt}/__init__.py (100%) rename {stt => kaldi/stt}/processing/__init__.py (100%) rename {stt => kaldi/stt}/processing/decoding.py (100%) rename {stt => kaldi/stt}/processing/streaming.py (100%) rename {stt => kaldi/stt}/processing/utils.py (100%) diff --git a/.envdefault b/kaldi/.envdefault similarity index 100% rename from .envdefault rename to kaldi/.envdefault diff --git a/Dockerfile b/kaldi/Dockerfile similarity index 91% rename from Dockerfile rename to kaldi/Dockerfile index bdf65c0..f062951 100644 --- a/Dockerfile +++ b/kaldi/Dockerfile @@ -45,7 +45,7 @@ RUN git clone -b vosk --single-branch https://github.com/alphacep/kaldi /opt/kal && make -j $(nproc) online2 lm rnnlm # Install python dependencies -COPY requirements.txt ./ +COPY kaldi/requirements.txt ./ RUN pip install --no-cache-dir -r requirements.txt # Install Custom Vosk API @@ -57,13 +57,13 @@ RUN git clone --depth 1 https://github.com/alphacep/vosk-api /opt/vosk-api && cd WORKDIR /usr/src/app -COPY stt /usr/src/app/stt COPY celery_app /usr/src/app/celery_app COPY http_server /usr/src/app/http_server COPY websocket /usr/src/app/websocket COPY document /usr/src/app/document -COPY docker-entrypoint.sh wait-for-it.sh healthcheck.sh ./ -COPY lin_to_vosk.py /usr/src/app/lin_to_vosk.py +COPY kaldi/stt /usr/src/app/stt +COPY kaldi/docker-entrypoint.sh wait-for-it.sh 
healthcheck.sh ./ +COPY kaldi/lin_to_vosk.py /usr/src/app/lin_to_vosk.py RUN mkdir -p /var/log/supervisor/ diff --git a/README.md b/kaldi/README.md similarity index 100% rename from README.md rename to kaldi/README.md diff --git a/RELEASE.md b/kaldi/RELEASE.md similarity index 100% rename from RELEASE.md rename to kaldi/RELEASE.md diff --git a/docker-entrypoint.sh b/kaldi/docker-entrypoint.sh similarity index 100% rename from docker-entrypoint.sh rename to kaldi/docker-entrypoint.sh diff --git a/lin_to_vosk.py b/kaldi/lin_to_vosk.py similarity index 100% rename from lin_to_vosk.py rename to kaldi/lin_to_vosk.py diff --git a/requirements.txt b/kaldi/requirements.txt similarity index 100% rename from requirements.txt rename to kaldi/requirements.txt diff --git a/stt/__init__.py b/kaldi/stt/__init__.py similarity index 100% rename from stt/__init__.py rename to kaldi/stt/__init__.py diff --git a/stt/processing/__init__.py b/kaldi/stt/processing/__init__.py similarity index 100% rename from stt/processing/__init__.py rename to kaldi/stt/processing/__init__.py diff --git a/stt/processing/decoding.py b/kaldi/stt/processing/decoding.py similarity index 100% rename from stt/processing/decoding.py rename to kaldi/stt/processing/decoding.py diff --git a/stt/processing/streaming.py b/kaldi/stt/processing/streaming.py similarity index 100% rename from stt/processing/streaming.py rename to kaldi/stt/processing/streaming.py diff --git a/stt/processing/utils.py b/kaldi/stt/processing/utils.py similarity index 100% rename from stt/processing/utils.py rename to kaldi/stt/processing/utils.py From bbe0c2b511e5940a1378f99a5aa9931ba4a19664 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 30 Nov 2023 17:38:55 +0100 Subject: [PATCH 162/172] uniformize calls (by simplifying decode function) to make both worlds work --- celery_app/tasks.py | 4 ++-- http_server/ingress.py | 9 +++++---- kaldi/stt/processing/__init__.py | 15 ++++++++++++--- kaldi/stt/processing/decoding.py | 4 +++- kaldi/stt/processing/utils.py | 6 +++--- whisper/stt/processing/__init__.py | 26 ++++++++++++++++---------- whisper/stt/processing/decoding.py | 5 +++-- 7 files changed, 44 insertions(+), 25 deletions(-) diff --git a/celery_app/tasks.py b/celery_app/tasks.py index 3fc38f2..4b9a7d6 100644 --- a/celery_app/tasks.py +++ b/celery_app/tasks.py @@ -3,7 +3,7 @@ from celery_app.celeryapp import celery from stt import logger -from stt.processing import decode, MODEL, ALIGNMENT_MODEL +from stt.processing import decode, MODEL from stt.processing.utils import load_audiofile @@ -24,7 +24,7 @@ def transcribe_task(file_name: str, with_metadata: bool): # Decode try: - result = decode(file_content, MODEL, ALIGNMENT_MODEL, with_metadata) + result = decode(file_content, MODEL, with_metadata) except Exception as err: import traceback msg = f"{traceback.format_exc()}\nFailed to decode {file_path}" diff --git a/http_server/ingress.py b/http_server/ingress.py index da28e8d..3d3e306 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -10,7 +10,7 @@ from serving import GunicornServing, GeventServing from swagger import setupSwaggerUI -from stt.processing import decode, load_wave_buffer, MODEL, ALIGNMENT_MODEL, USE_GPU +from stt.processing import decode, load_wave_buffer, MODEL, USE_GPU from stt import logger as stt_logger app = Flask("__stt-standalone-worker__") @@ -25,13 +25,15 @@ # If websocket streaming route is enabled if os.environ.get("ENABLE_STREAMING", False) in [True, "true", 1]: + from flask_sock import Sock + from 
stt.processing.streaming import ws_streaming logger.info("Init websocket serving ...") sock = Sock(app) logger.info("Streaming is enabled") @sock.route("/streaming") def streaming(web_socket): - ws_streaming(web_socket, model) + ws_streaming(web_socket, MODEL) @app.route("/healthcheck", methods=["GET"]) @@ -68,8 +70,7 @@ def transcribe(): audio_data = load_wave_buffer(file_buffer) # Transcription - transcription = decode( - audio_data, MODEL, ALIGNMENT_MODEL, join_metadata) + transcription = decode(audio_data, MODEL, join_metadata) if join_metadata: return json.dumps(transcription, ensure_ascii=False), 200 diff --git a/kaldi/stt/processing/__init__.py b/kaldi/stt/processing/__init__.py index 2a3eca5..fc32781 100644 --- a/kaldi/stt/processing/__init__.py +++ b/kaldi/stt/processing/__init__.py @@ -6,9 +6,15 @@ from stt import logger from stt.processing.decoding import decode -from stt.processing.utils import formatAudio, load_wave +from stt.processing.utils import load_wave_buffer, load_audiofile -__all__ = ["model", "logger", "decode", "load_wave", "formatAudio"] +__all__ = [ + "logger", + "decode", + "load_audiofile", "load_wave_buffer", + "MODEL", + "USE_GPU", +] # Model locations (should be mounted) MODEL_PATH = "/opt/model" @@ -17,8 +23,11 @@ logger.info("Loading acoustic model and decoding graph ...") start = time() try: - model = Model(MODEL_PATH) + MODEL = Model(MODEL_PATH) except Exception as err: raise Exception("Failed to load transcription model: {}".format(str(err))) from err sys.exit(-1) logger.info("Acoustic model and decoding graph loaded. (t={}s)".format(time() - start)) + +# Not implemented yet in Kaldi +USE_GPU = False \ No newline at end of file diff --git a/kaldi/stt/processing/decoding.py b/kaldi/stt/processing/decoding.py index 2e1fb7c..8c06007 100644 --- a/kaldi/stt/processing/decoding.py +++ b/kaldi/stt/processing/decoding.py @@ -4,10 +4,12 @@ from vosk import KaldiRecognizer, Model -def decode(audio_data: bytes, model: Model, sampling_rate: int, with_metadata: bool) -> dict: +def decode(audio: tuple[bytes, int], model: Model, with_metadata: bool) -> dict: """Transcribe the audio data using the vosk library with the defined model.""" result = {"text": "", "confidence-score": 0.0, "words": []} + audio_data, sampling_rate = audio + recognizer = KaldiRecognizer(model, sampling_rate) recognizer.SetMaxAlternatives(0) # Set confidence per words recognizer.SetWords(with_metadata) diff --git a/kaldi/stt/processing/utils.py b/kaldi/stt/processing/utils.py index b81cc5d..4de66c7 100644 --- a/kaldi/stt/processing/utils.py +++ b/kaldi/stt/processing/utils.py @@ -4,13 +4,13 @@ from numpy import int16, squeeze, mean -def load_wave(file_path): +def load_audiofile(file_path): """Formats audio from a wavFile buffer to a bytebuffer""" audio = squeeze(wavio.read(file_path).data) - return audio.tobytes() + return (audio.tobytes(), 16000) -def formatAudio(file_buffer): +def load_wave_buffer(file_buffer): """Formats audio from a wavFile buffer to a numpy array for processing.""" file_buffer_io = io.BytesIO(file_buffer) file_content = wavio.read(file_buffer_io) diff --git a/whisper/stt/processing/__init__.py b/whisper/stt/processing/__init__.py index 9bb51bc..6faaab0 100644 --- a/whisper/stt/processing/__init__.py +++ b/whisper/stt/processing/__init__.py @@ -9,9 +9,13 @@ from .load_model import load_whisper_model from .alignment_model import load_alignment_model, get_alignment_model -__all__ = ["logger", "decode", "model", "alignment_model", - "load_audiofile", "load_wave_buffer"] - 
+__all__ = [ + "logger", + "decode", + "load_audiofile", "load_wave_buffer", + "MODEL", + "USE_GPU", +] class LazyLoadedModel: def __init__(self, model_type, device): @@ -48,20 +52,22 @@ def __call__(self, *args, **kwargs): model_type = os.environ.get("MODEL", "medium") logger.info(f"Loading Whisper model {model_type} ({'local' if os.path.exists(model_type) else 'remote'})...") try: - MODEL = LazyLoadedModel(model_type, device=device) + model = LazyLoadedModel(model_type, device=device) # model = load_whisper_model(model_type, device=device) except Exception as err: raise Exception( "Failed to load transcription model: {}".format(str(err))) from err # Load alignment model (if any) -ALIGNMENT_MODEL = get_alignment_model(os.environ.get("ALIGNMENT_MODEL"), language) -if ALIGNMENT_MODEL: +alignment_model = get_alignment_model(os.environ.get("alignment_model"), language) +if alignment_model: logger.info( - f"Loading alignment model {ALIGNMENT_MODEL} ({'local' if os.path.exists(alignment_model) else 'remote'})...") - ALIGNMENT_MODEL = load_alignment_model(ALIGNMENT_MODEL, device=device, download_root="/opt") -elif ALIGNMENT_MODEL is None: + f"Loading alignment model {alignment_model} ({'local' if os.path.exists(alignment_model) else 'remote'})...") + alignment_model = load_alignment_model(alignment_model, device=device, download_root="/opt") +elif alignment_model is None: logger.info("Alignment will be done using Whisper cross-attention weights") else: logger.info("No alignment model preloaded. It will be loaded on the fly depending on the detected language.") - ALIGNMENT_MODEL = {} # Alignement model(s) will be loaded on the fly + alignment_model = {} # Alignement model(s) will be loaded on the fly + +MODEL = (model, alignment_model) diff --git a/whisper/stt/processing/decoding.py b/whisper/stt/processing/decoding.py index 9dd6855..b78c4db 100644 --- a/whisper/stt/processing/decoding.py +++ b/whisper/stt/processing/decoding.py @@ -29,8 +29,7 @@ default_initial_prompt = os.environ.get("PROMPT", None) def decode(audio, - model, - alignment_model: "Any", + model_and_alignementmodel, # Tuple[model, alignment_model] with_word_timestamps: bool, language: str = None, remove_punctuation_from_words=False, @@ -47,6 +46,8 @@ def decode(audio, language = get_language() kwargs = copy.copy(locals()) + kwargs.pop("model_and_alignementmodel") + kwargs["model"], kwargs["alignment_model"] = model_and_alignementmodel logger.info("Transcribing audio with " + (f"language {language}" if language else "automatic language detection") + "...") From bde383d30baed0eaa038805cc3520b37e25332c3 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 30 Nov 2023 17:39:11 +0100 Subject: [PATCH 163/172] Update Jenkinsfile --- Jenkinsfile | 43 ++++++++++++++++++++++--------------------- 1 file changed, 22 insertions(+), 21 deletions(-) diff --git a/Jenkinsfile b/Jenkinsfile index 75a09bd..81d8ec8 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -1,10 +1,9 @@ pipeline { agent any environment { - DOCKER_HUB_REPO = "lintoai/linto-platform-stt" + DOCKER_HUB_REPO_KALDI = "lintoai/linto-platform-stt-kaldi" + DOCKER_HUB_REPO_WHISPER = "lintoai/linto-platform-stt-whisper" DOCKER_HUB_CRED = 'docker-hub-credentials' - - VERSION = '' } stages{ @@ -15,10 +14,22 @@ pipeline { steps { echo 'Publishing latest' script { - image = docker.build(env.DOCKER_HUB_REPO) + image = docker.build(env.DOCKER_HUB_REPO_KALDI, "-f kaldi/Dockerfile .") + VERSION = sh( + returnStdout: true, + script: "awk -v RS='' '/#/ {print; exit}' kaldi/RELEASE.md | head -1 | sed 
's/#//' | sed 's/ //'" + ).trim() + + docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { + image.push("${VERSION}") + image.push('latest') + } + } + script { + image = docker.build(env.DOCKER_HUB_REPO_WHISPER, "-f whisper/Dockerfile.ctranslate2 .") VERSION = sh( returnStdout: true, - script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" + script: "awk -v RS='' '/#/ {print; exit}' whisper/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" ).trim() docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { @@ -36,37 +47,27 @@ pipeline { steps { echo 'Publishing unstable' script { - image = docker.build(env.DOCKER_HUB_REPO) + image = docker.build(env.DOCKER_HUB_REPO_KALDI, "-f kaldi/Dockerfile .") VERSION = sh( returnStdout: true, - script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" + script: "awk -v RS='' '/#/ {print; exit}' kaldi/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" ).trim() docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { image.push('latest-unstable') } } - } - } - - stage('Docker build for whisper branch'){ - when{ - branch 'feature/whisper' - } - steps { - echo 'Publishing faster_whisper' script { - image = docker.build(env.DOCKER_HUB_REPO, "-f Dockerfile.ctranslate2 .") + image = docker.build(env.DOCKER_HUB_REPO_WHISPER, "-f whisper/Dockerfile.ctranslate2 .") VERSION = sh( returnStdout: true, - script: "awk -v RS='' '/#/ {print; exit}' RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" + script: "awk -v RS='' '/#/ {print; exit}' whisper/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" ).trim() - docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - image.push("${VERSION}") - image.push('whisper-latest') + image.push('latest-unstable') } } } } + }// end stages } \ No newline at end of file From 3f88b73ea7b3e830d766675bf9108939736aea8b Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Thu, 30 Nov 2023 17:49:13 +0100 Subject: [PATCH 164/172] fix coding style --- Makefile | 2 +- celery_app/celeryapp.py | 1 - celery_app/tasks.py | 11 +- http_server/ingress.py | 19 +- http_server/serving.py | 16 +- kaldi/stt/processing/__init__.py | 10 +- kaldi/stt/processing/streaming.py | 3 +- kaldi/stt/processing/utils.py | 2 +- whisper/stt/__init__.py | 9 +- whisper/stt/processing/__init__.py | 33 +-- whisper/stt/processing/alignment_model.py | 115 ++++++----- whisper/stt/processing/decoding.py | 229 ++++++++++++--------- whisper/stt/processing/load_model.py | 176 +++++++++------- whisper/stt/processing/text_normalize.py | 142 ++++++------- whisper/stt/processing/utils.py | 237 +++++++++++----------- whisper/stt/processing/word_alignment.py | 64 +++--- 16 files changed, 595 insertions(+), 474 deletions(-) diff --git a/Makefile b/Makefile index 71be1a8..24db387 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ .DEFAULT_GOAL := help -target_dirs := stt http_server celery_app +target_dirs := kaldi/stt whisper/stt http_server celery_app help: @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}' diff --git a/celery_app/celeryapp.py b/celery_app/celeryapp.py index b432831..d1c4099 100644 --- a/celery_app/celeryapp.py +++ b/celery_app/celeryapp.py @@ -1,7 +1,6 @@ import os from celery import Celery - from stt import logger celery = Celery(__name__, include=["celery_app.tasks"]) diff --git a/celery_app/tasks.py b/celery_app/tasks.py index 
4b9a7d6..114df2a 100644 --- a/celery_app/tasks.py +++ b/celery_app/tasks.py @@ -1,11 +1,12 @@ import asyncio import os -from celery_app.celeryapp import celery from stt import logger -from stt.processing import decode, MODEL +from stt.processing import MODEL, decode from stt.processing.utils import load_audiofile +from celery_app.celeryapp import celery + @celery.task(name="transcribe_task") def transcribe_task(file_name: str, with_metadata: bool): @@ -18,17 +19,19 @@ def transcribe_task(file_name: str, with_metadata: bool): file_content = load_audiofile(file_path) except Exception as err: import traceback + msg = f"{traceback.format_exc()}\nFailed to load ressource {file_path}" logger.error(msg) - raise Exception(msg) # from err + raise Exception(msg) # from err # Decode try: result = decode(file_content, MODEL, with_metadata) except Exception as err: import traceback + msg = f"{traceback.format_exc()}\nFailed to decode {file_path}" logger.error(msg) - raise Exception(msg) # from err + raise Exception(msg) # from err return result diff --git a/http_server/ingress.py b/http_server/ingress.py index 3d3e306..6c71478 100644 --- a/http_server/ingress.py +++ b/http_server/ingress.py @@ -7,11 +7,10 @@ from confparser import createParser from flask import Flask, json, request -from serving import GunicornServing, GeventServing -from swagger import setupSwaggerUI - -from stt.processing import decode, load_wave_buffer, MODEL, USE_GPU +from serving import GeventServing, GunicornServing from stt import logger as stt_logger +from stt.processing import MODEL, USE_GPU, decode, load_wave_buffer +from swagger import setupSwaggerUI app = Flask("__stt-standalone-worker__") app.config["JSON_AS_ASCII"] = False @@ -27,6 +26,7 @@ if os.environ.get("ENABLE_STREAMING", False) in [True, "true", 1]: from flask_sock import Sock from stt.processing.streaming import ws_streaming + logger.info("Init websocket serving ...") sock = Sock(app) logger.info("Streaming is enabled") @@ -58,7 +58,9 @@ def transcribe(): elif request.headers.get("accept").lower() == "text/plain": join_metadata = False else: - raise ValueError(f"Not accepted header (accept={request.headers.get('accept')} should be either application/json or text/plain)") + raise ValueError( + f"Not accepted header (accept={request.headers.get('accept')} should be either application/json or text/plain)" + ) # logger.debug("Metadata: {}".format(join_metadata)) # get input file @@ -66,7 +68,7 @@ def transcribe(): raise ValueError(f"No audio file was uploaded (missing 'file' key)") file_buffer = request.files["file"].read() - + audio_data = load_wave_buffer(file_buffer) # Transcription @@ -78,6 +80,7 @@ def transcribe(): except Exception as error: import traceback + logger.error(traceback.format_exc()) logger.error(repr(error)) return "Server Error: {}".format(str(error)), 400 if isinstance(error, ValueError) else 500 @@ -116,8 +119,8 @@ def server_error(error): logger.warning("Could not setup swagger: {}".format(str(err))) logger.info(f"Using {args.workers} workers") - - if USE_GPU: # TODO: get rid of this? + + if USE_GPU: # TODO: get rid of this? 
serving_type = GeventServing logger.debug("Serving with gevent") else: diff --git a/http_server/serving.py b/http_server/serving.py index 725f763..9230eb4 100644 --- a/http_server/serving.py +++ b/http_server/serving.py @@ -1,6 +1,7 @@ -import gunicorn.app.base -import gevent.pywsgi import gevent.monkey +import gevent.pywsgi +import gunicorn.app.base + gevent.monkey.patch_all() @@ -22,22 +23,21 @@ def load_config(self): def load(self): return self.application -class GeventServing(): +class GeventServing: def __init__(self, app, options=None): self.options = options or {} self.application = app def run(self): - bind = self.options.get('bind', "0.0.0.0:8080") - workers = self.options.get('workers', 1) - listener = bind.split(':') + bind = self.options.get("bind", "0.0.0.0:8080") + workers = self.options.get("workers", 1) + listener = bind.split(":") try: assert len(listener) == 2 listener = (listener[0], int(listener[1])) except: print(f"Invalid bind address {bind}") - server = gevent.pywsgi.WSGIServer(listener, self.application, spawn = workers) + server = gevent.pywsgi.WSGIServer(listener, self.application, spawn=workers) server.serve_forever() - diff --git a/kaldi/stt/processing/__init__.py b/kaldi/stt/processing/__init__.py index fc32781..9f99406 100644 --- a/kaldi/stt/processing/__init__.py +++ b/kaldi/stt/processing/__init__.py @@ -2,16 +2,16 @@ import sys from time import time -from vosk import Model - from stt import logger from stt.processing.decoding import decode -from stt.processing.utils import load_wave_buffer, load_audiofile +from stt.processing.utils import load_audiofile, load_wave_buffer +from vosk import Model __all__ = [ "logger", "decode", - "load_audiofile", "load_wave_buffer", + "load_audiofile", + "load_wave_buffer", "MODEL", "USE_GPU", ] @@ -30,4 +30,4 @@ logger.info("Acoustic model and decoding graph loaded. 
(t={}s)".format(time() - start)) # Not implemented yet in Kaldi -USE_GPU = False \ No newline at end of file +USE_GPU = False diff --git a/kaldi/stt/processing/streaming.py b/kaldi/stt/processing/streaming.py index 28274b8..a33ecfc 100644 --- a/kaldi/stt/processing/streaming.py +++ b/kaldi/stt/processing/streaming.py @@ -3,11 +3,10 @@ from typing import Union from simple_websocket.ws import Server as WSServer +from stt import logger from vosk import KaldiRecognizer, Model from websockets.legacy.server import WebSocketServerProtocol -from stt import logger - async def wssDecode(ws: WebSocketServerProtocol, model: Model): """Async Decode function endpoint""" diff --git a/kaldi/stt/processing/utils.py b/kaldi/stt/processing/utils.py index 4de66c7..eb3349d 100644 --- a/kaldi/stt/processing/utils.py +++ b/kaldi/stt/processing/utils.py @@ -1,7 +1,7 @@ import io import wavio -from numpy import int16, squeeze, mean +from numpy import int16, mean, squeeze def load_audiofile(file_path): diff --git a/whisper/stt/__init__.py b/whisper/stt/__init__.py index aa3e314..f5551af 100644 --- a/whisper/stt/__init__.py +++ b/whisper/stt/__init__.py @@ -1,5 +1,5 @@ -import os import logging +import os logging.basicConfig( format="[%(asctime)s,%(msecs)03d %(name)s] %(levelname)s: %(message)s", @@ -8,12 +8,13 @@ logger = logging.getLogger("__stt__") # The following is to have GPU in the right order (as nvidia-smi show them) -# It is important to set that before loading ctranslate2 +# It is important to set that before loading ctranslate2 # see https://github.com/guillaumekln/faster-whisper/issues/150 -os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID' # GPU in the right order +os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" # GPU in the right order try: import faster_whisper + USE_CTRANSLATE2 = True except ImportError as err: try: @@ -24,12 +25,14 @@ try: import torch + USE_TORCH = True except ImportError: USE_TORCH = False try: import torchaudio + USE_TORCHAUDIO = True except ImportError: USE_TORCHAUDIO = False diff --git a/whisper/stt/processing/__init__.py b/whisper/stt/processing/__init__.py index 6faaab0..b0e7f6d 100644 --- a/whisper/stt/processing/__init__.py +++ b/whisper/stt/processing/__init__.py @@ -1,23 +1,25 @@ -import os import logging +import os + from lockfile import FileLock +from stt import USE_CTRANSLATE2, logger -from stt import logger, USE_CTRANSLATE2 +from .alignment_model import get_alignment_model, load_alignment_model from .decoding import decode -from .utils import get_device, get_language, load_wave_buffer, load_audiofile - from .load_model import load_whisper_model -from .alignment_model import load_alignment_model, get_alignment_model +from .utils import get_device, get_language, load_audiofile, load_wave_buffer __all__ = [ "logger", "decode", - "load_audiofile", "load_wave_buffer", + "load_audiofile", + "load_wave_buffer", "MODEL", "USE_GPU", ] -class LazyLoadedModel: + +class LazyLoadedModel: def __init__(self, model_type, device): self.model_type = model_type self.device = device @@ -32,11 +34,12 @@ def check_loaded(self): def __getattr__(self, name): self.check_loaded() return getattr(self._model, name) - + def __call__(self, *args, **kwargs): self.check_loaded() return self._model(*args, **kwargs) + # Set informative log logger.setLevel(logging.INFO) @@ -50,24 +53,28 @@ def __call__(self, *args, **kwargs): # Load ASR model model_type = os.environ.get("MODEL", "medium") -logger.info(f"Loading Whisper model {model_type} ({'local' if os.path.exists(model_type) else 'remote'})...") 
+logger.info( + f"Loading Whisper model {model_type} ({'local' if os.path.exists(model_type) else 'remote'})..." +) try: model = LazyLoadedModel(model_type, device=device) # model = load_whisper_model(model_type, device=device) except Exception as err: - raise Exception( - "Failed to load transcription model: {}".format(str(err))) from err + raise Exception("Failed to load transcription model: {}".format(str(err))) from err # Load alignment model (if any) alignment_model = get_alignment_model(os.environ.get("alignment_model"), language) if alignment_model: logger.info( - f"Loading alignment model {alignment_model} ({'local' if os.path.exists(alignment_model) else 'remote'})...") + f"Loading alignment model {alignment_model} ({'local' if os.path.exists(alignment_model) else 'remote'})..." + ) alignment_model = load_alignment_model(alignment_model, device=device, download_root="/opt") elif alignment_model is None: logger.info("Alignment will be done using Whisper cross-attention weights") else: - logger.info("No alignment model preloaded. It will be loaded on the fly depending on the detected language.") + logger.info( + "No alignment model preloaded. It will be loaded on the fly depending on the detected language." + ) alignment_model = {} # Alignement model(s) will be loaded on the fly MODEL = (model, alignment_model) diff --git a/whisper/stt/processing/alignment_model.py b/whisper/stt/processing/alignment_model.py index a8e6e79..ea958db 100644 --- a/whisper/stt/processing/alignment_model.py +++ b/whisper/stt/processing/alignment_model.py @@ -1,17 +1,19 @@ -from stt import logger, USE_TORCH, USE_TORCHAUDIO -from .utils import SAMPLE_RATE, LANGUAGES - -import os import math +import os import time + import requests +from stt import USE_TORCH, USE_TORCHAUDIO, logger + +from .utils import LANGUAGES, SAMPLE_RATE if USE_TORCH: import torch import torch.nn.utils.rnn as rnn_utils + try: - import speechbrain as sb import huggingface_hub + import speechbrain as sb except ImportError: pass try: @@ -66,8 +68,7 @@ def get_alignment_model(alignment_model_name, language, force=False): elif language in ALIGNMENT_MODELS: return ALIGNMENT_MODELS[language] elif force: - raise ValueError( - f"No wav2vec alignment model for language '{language}'.") + raise ValueError(f"No wav2vec alignment model for language '{language}'.") else: logger.warn( f"No wav2vec alignment model for language '{language}'. Fallback to English." 
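When a torchaudio pipeline name such as WAV2VEC2_ASR_BASE_960H is given for ALIGNMENT_MODEL, loading boils down to the following sketch, mirroring load_torchaudio_model below (assumes torchaudio is installed):

import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H   # one of the names accepted by ALIGNMENT_MODEL
model = bundle.get_model().to("cpu")                    # wav2vec2 acoustic model used for word alignment
labels = bundle.get_labels()                            # character vocabulary used for CTC alignment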
@@ -77,57 +78,59 @@ def get_alignment_model(alignment_model_name, language, force=False): return get_alignment_model("wav2vec", alignment_model_name, force=True) return alignment_model_name -def load_alignment_model(source, device="cpu", download_root="/opt"): +def load_alignment_model(source, device="cpu", download_root="/opt"): if not USE_TORCH: - raise NotImplementedError( - "Alignement model not available without Torch") + raise NotImplementedError("Alignement model not available without Torch") start = time.time() if (source in torchaudio.pipelines.__all__) if USE_TORCHAUDIO else False: - model = load_torchaudio_model( - source, device=device, download_root=download_root) + model = load_torchaudio_model(source, device=device, download_root=download_root) else: try: - model = load_transformers_model( - source, device=device, download_root=download_root) + model = load_transformers_model(source, device=device, download_root=download_root) except Exception as err1: try: - model = load_speechbrain_model( - source, device=device, download_root=download_root) + model = load_speechbrain_model(source, device=device, download_root=download_root) except Exception as err2: raise Exception( - f"Failed to load alignment model:\n<<< transformers <<<\n{str(err1)}\n<<< speechbrain <<<\n{str(err2)}") from err2 + f"Failed to load alignment model:\n<<< transformers <<<\n{str(err1)}\n<<< speechbrain <<<\n{str(err2)}" + ) from err2 logger.info( - f"Alignment Model of type {get_model_type(model)} loaded. (t={time.time() - start}s)") + f"Alignment Model of type {get_model_type(model)} loaded. (t={time.time() - start}s)" + ) return model def load_speechbrain_model(source, device="cpu", download_root="/opt"): - if os.path.isdir(source): yaml_file = os.path.join(source, "hyperparams.yaml") - assert os.path.isfile( - yaml_file), f"Hyperparams file {yaml_file} not found" + assert os.path.isfile(yaml_file), f"Hyperparams file {yaml_file} not found" else: try: yaml_file = huggingface_hub.hf_hub_download( - repo_id=source, filename="hyperparams.yaml", cache_dir=os.path.join(download_root, "huggingface/hub")) + repo_id=source, + filename="hyperparams.yaml", + cache_dir=os.path.join(download_root, "huggingface/hub"), + ) except requests.exceptions.HTTPError: yaml_file = None overrides = make_yaml_overrides( - yaml_file, {"save_path": os.path.join(download_root, "speechbrain")}) + yaml_file, {"save_path": os.path.join(download_root, "speechbrain")} + ) savedir = os.path.join(download_root, "speechbrain") try: model = sb.pretrained.EncoderASR.from_hparams( - source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides) + source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides + ) except ValueError: model = sb.pretrained.EncoderDecoderASR.from_hparams( - source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides) + source=source, run_opts={"device": device}, savedir=savedir, overrides=overrides + ) model.train(False) model.requires_grad_(False) @@ -135,7 +138,6 @@ def load_speechbrain_model(source, device="cpu", download_root="/opt"): def load_transformers_model(source, device="cpu", download_root="/opt"): - model = transformers.Wav2Vec2ForCTC.from_pretrained(source).to(device) processor = transformers.Wav2Vec2Processor.from_pretrained(source) @@ -145,7 +147,6 @@ def load_transformers_model(source, device="cpu", download_root="/opt"): def load_torchaudio_model(source, device="cpu", download_root="/opt"): - bundle = torchaudio.pipelines.__dict__[source] 
model = bundle.get_model().to(device) labels = bundle.get_labels() @@ -187,8 +188,7 @@ def make_yaml_overrides(yaml_file, key_values): elif ":" in line: child = line.strip().split(":")[0].strip() if child in key_values: - override[parent] = override.get(parent, {}) | { - child: key_values[child]} + override[parent] = override.get(parent, {}) | {child: key_values[child]} return override @@ -205,15 +205,18 @@ def get_vocab(model): else: labels, blank_id = get_vocab_torchaudio(model) assert isinstance(labels, list) and min( - [isinstance(l, str) for l in labels]), "labels must be a list of strings" + [isinstance(l, str) for l in labels] + ), "labels must be a list of strings" return norm_labels(labels, blank_id), blank_id def get_vocab_speechbrain(model): tokenizer = model.tokenizer # Is this general enough? - labels = [{'': " ", ' ⁇ ': ""}.get(i, i) for i in tokenizer.decode( - [[i] for i in range(tokenizer.get_piece_size())])] + labels = [ + {"": " ", " ⁇ ": ""}.get(i, i) + for i in tokenizer.decode([[i] for i in range(tokenizer.get_piece_size())]) + ] blank_id = labels.index("") return labels, blank_id @@ -228,8 +231,7 @@ def get_vocab_torchaudio(model_and_labels): def get_vocab_transformers(model_and_processor): _, processor = model_and_processor - labels_dict = dict((v, k) - for k, v in processor.tokenizer.get_vocab().items()) + labels_dict = dict((v, k) for k, v in processor.tokenizer.get_vocab().items()) labels = [labels_dict[i] for i in range(len(labels_dict))] blank_id = labels.index("") return labels, blank_id @@ -239,6 +241,7 @@ def norm_labels(labels, blank_id): labels[blank_id] = "" return [l if l != "|" else " " for l in labels] + ################################################################################ # Compute log-probabilities from model @@ -250,7 +253,6 @@ def norm_labels(labels, blank_id): def compute_logprobas(model, audios, max_len=MAX_LEN): - # Single audio if not isinstance(audios, list): audios = [audios] @@ -280,22 +282,22 @@ def compute_logits_speechbrain(model, audios, max_len): chunks = [] i_audio = [] for a in audios: - chunks.extend([a[i:min(i+max_len, len(a))] - for i in range(0, len(a), max_len)]) + chunks.extend([a[i : min(i + max_len, len(a))] for i in range(0, len(a), max_len)]) i_audio.append(len(chunks)) if len(chunks) > 1: logger.warning( - "Audio too long, splitting into {} chunks for alignment".format(len(chunks))) + "Audio too long, splitting into {} chunks for alignment".format(len(chunks)) + ) # Decode chunks of audio and concatenate results log_probas = [[] for i in range(len(audios))] for i in range(0, len(chunks), batch_size): - chunk = chunks[i:min(i+batch_size, len(chunks))] + chunk = chunks[i : min(i + batch_size, len(chunks))] log_probas_tmp = compute_logits_speechbrain(model, chunk) - for j in range(i, i+len(chunk)): + for j in range(i, i + len(chunk)): k = 0 while j >= i_audio[k]: k += 1 - log_probas[k].append(log_probas_tmp[j-i]) + log_probas[k].append(log_probas_tmp[j - i]) log_probas = [torch.cat(p, dim=0) for p in log_probas] log_probas, wav_lens = pack_sequences(log_probas, device=model.device) else: @@ -307,16 +309,15 @@ def compute_logits_speechbrain(model, audios, max_len): def pack_sequences(tensors, device="cpu"): if len(tensors) == 1: - return tensors[0].unsqueeze(0).to(device), torch.Tensor([1.]).to(device) + return tensors[0].unsqueeze(0).to(device), torch.Tensor([1.0]).to(device) tensor = rnn_utils.pad_sequence(tensors, batch_first=True) wav_lens = [len(x) for x in tensors] maxwav_lens = max(wav_lens) - wav_lens = 
torch.Tensor([l/maxwav_lens for l in wav_lens]) + wav_lens = torch.Tensor([l / maxwav_lens for l in wav_lens]) return tensor.to(device), wav_lens.to(device) def compute_logits_transformers(model_and_processor, audios, max_len): - model, processor = model_and_processor # can be different from processor.feature_extractor.sampling_rate @@ -342,19 +343,28 @@ def compute_logits_transformers(model_and_processor, audios, max_len): if l > max_len: # Split batch in smaller chunks logger.warning( - "Audio too long, splitting into {} chunks for alignment".format(math.ceil(l / max_len))) + "Audio too long, splitting into {} chunks for alignment".format( + math.ceil(l / max_len) + ) + ) logits = [] for i in range(0, l, max_len): j = min(i + max_len, l) if use_mask: - logits.append(model(padded_batch.input_values[:, i:j].to(device), - attention_mask=padded_batch.attention_mask[:, i:j].to(device)).logits) + logits.append( + model( + padded_batch.input_values[:, i:j].to(device), + attention_mask=padded_batch.attention_mask[:, i:j].to(device), + ).logits + ) else: logits.append(model(padded_batch.input_values[:, i:j].to(device)).logits) logits = torch.cat(logits, dim=1) elif use_mask: - logits = model(padded_batch.input_values.to(device), - attention_mask=padded_batch.attention_mask.to(device)).logits + logits = model( + padded_batch.input_values.to(device), + attention_mask=padded_batch.attention_mask.to(device), + ).logits else: logits = model(padded_batch.input_values.to(device)).logits @@ -371,7 +381,7 @@ def compute_logits_torchaudio(model_and_labels, audios, max_len): for p in model.parameters(): device = p.device break - + all_logits = [] with torch.inference_mode(): @@ -380,7 +390,10 @@ def compute_logits_torchaudio(model_and_labels, audios, max_len): if l > max_len: # Split audio in smaller chunks logger.warning( - "Audio too long, splitting into {} chunks for alignment".format(math.ceil(l / max_len))) + "Audio too long, splitting into {} chunks for alignment".format( + math.ceil(l / max_len) + ) + ) logits = [] for i in range(0, l, max_len): j = min(i + max_len, l) diff --git a/whisper/stt/processing/decoding.py b/whisper/stt/processing/decoding.py index b78c4db..9f8411f 100644 --- a/whisper/stt/processing/decoding.py +++ b/whisper/stt/processing/decoding.py @@ -1,17 +1,18 @@ +import copy import os import time -import numpy as np -import copy from typing import Tuple, Union -from stt import logger, USE_CTRANSLATE2 -from .utils import SAMPLE_RATE, get_language -from .text_normalize import remove_punctuation, normalize_text, remove_emoji +import numpy as np +from stt import USE_CTRANSLATE2, logger + from .alignment_model import get_alignment_model, load_alignment_model +from .text_normalize import normalize_text, remove_emoji, remove_punctuation +from .utils import SAMPLE_RATE, get_language from .word_alignment import compute_alignment if not USE_CTRANSLATE2: - import torch + import torch import whisper_timestamped USE_ACCURATE = True @@ -28,20 +29,21 @@ default_initial_prompt = os.environ.get("PROMPT", None) -def decode(audio, - model_and_alignementmodel, # Tuple[model, alignment_model] - with_word_timestamps: bool, - language: str = None, - remove_punctuation_from_words=False, - beam_size: int = default_beam_size, - best_of: int = default_best_of, - temperature: Union[float, Tuple[float, ...]] = default_temperature, - condition_on_previous_text: bool = False, - no_speech_threshold: float = 0.6, - compression_ratio_threshold: float = 2.4, - initial_prompt: str = default_initial_prompt, - ) -> 
dict: +def decode( + audio, + model_and_alignementmodel, # Tuple[model, alignment_model] + with_word_timestamps: bool, + language: str = None, + remove_punctuation_from_words=False, + beam_size: int = default_beam_size, + best_of: int = default_best_of, + temperature: Union[float, Tuple[float, ...]] = default_temperature, + condition_on_previous_text: bool = False, + no_speech_threshold: float = 0.6, + compression_ratio_threshold: float = 2.4, + initial_prompt: str = default_initial_prompt, +) -> dict: if language is None: language = get_language() @@ -49,7 +51,11 @@ def decode(audio, kwargs.pop("model_and_alignementmodel") kwargs["model"], kwargs["alignment_model"] = model_and_alignementmodel - logger.info("Transcribing audio with " + (f"language {language}" if language else "automatic language detection") + "...") + logger.info( + "Transcribing audio with " + + (f"language {language}" if language else "automatic language detection") + + "..." + ) start_t = time.time() @@ -61,24 +67,19 @@ def decode(audio, res = decode_torch(**kwargs) logger.info("Transcription complete (t={}s)".format(time.time() - start_t)) - - return res + return res -def decode_ct2(audio, - model, - with_word_timestamps, - language, - remove_punctuation_from_words, - **kwargs - ): - kwargs["no_speech_threshold"] = 1 # To avoid empty output +def decode_ct2( + audio, model, with_word_timestamps, language, remove_punctuation_from_words, **kwargs +): + kwargs["no_speech_threshold"] = 1 # To avoid empty output if kwargs.get("beam_size") is None: kwargs["beam_size"] = 1 if kwargs.get("best_of") is None: kwargs["best_of"] = 1 - + segments, info = model.transcribe( audio, word_timestamps=with_word_timestamps, @@ -86,31 +87,32 @@ def decode_ct2(audio, # Careful with the following options max_initial_timestamp=10000.0, vad_filter=USE_VAD, - **kwargs) + **kwargs, + ) segments = list(segments) return format_faster_whisper_response( - segments, info, - remove_punctuation_from_words=remove_punctuation_from_words + segments, info, remove_punctuation_from_words=remove_punctuation_from_words ) -def decode_torch(audio, - model, - alignment_model, - with_word_timestamps, - language, - remove_punctuation_from_words, - beam_size, - best_of, - temperature, - condition_on_previous_text, - no_speech_threshold, - compression_ratio_threshold, - normalize_text_as_words=False, - initial_prompt=None, - ): +def decode_torch( + audio, + model, + alignment_model, + with_word_timestamps, + language, + remove_punctuation_from_words, + beam_size, + best_of, + temperature, + condition_on_previous_text, + no_speech_threshold, + compression_ratio_threshold, + normalize_text_as_words=False, + initial_prompt=None, +): """Transcribe the audio data using Whisper with the defined model.""" fp16 = model.device != torch.device("cpu") @@ -134,12 +136,14 @@ def decode_torch(audio, if language is None: language = whisper_res["language"] logger.info(f"Detected language: {language}") - return format_whisper_timestamped_response(whisper_res, remove_punctuation_from_words=remove_punctuation_from_words) + return format_whisper_timestamped_response( + whisper_res, remove_punctuation_from_words=remove_punctuation_from_words + ) # Force deterministic results torch.manual_seed(1234) torch.cuda.manual_seed_all(1234) - + whisper_res = model.transcribe(audio, verbose=None, **kwargs) text = whisper_res["text"] @@ -156,8 +160,12 @@ def decode_torch(audio, # Load alignment model on the fly if language not in alignment_model: alignment_model_name = get_alignment_model(language) - 
logger.info(f"Loading alignment model {alignment_model_name} ({'local' if os.path.exists(alignment_model_name) else 'remote'})...") - alignment_model[language] = load_alignment_model(alignment_model_name, device=model.device, download_root="/opt") + logger.info( + f"Loading alignment model {alignment_model_name} ({'local' if os.path.exists(alignment_model_name) else 'remote'})..." + ) + alignment_model[language] = load_alignment_model( + alignment_model_name, device=model.device, download_root="/opt" + ) spec_alignment_model = alignment_model[language] else: spec_alignment_model = alignment_model @@ -165,9 +173,9 @@ def decode_torch(audio, result = {} result["text"] = text result["language"] = language - result["confidence-score"] = np.exp( - np.array([r["avg_logprob"] for r in segments]) - ).mean() if len(segments) else 0.0 + result["confidence-score"] = ( + np.exp(np.array([r["avg_logprob"] for r in segments])).mean() if len(segments) else 0.0 + ) if not with_word_timestamps: if not normalize_text_as_words: @@ -202,37 +210,40 @@ def decode_torch(audio, if remove_punctuation_from_words: sub_text = remove_punctuation(sub_text) if not sub_text: - logger.warn( - f"Lost text in segment {segment['start']}-{segment['end']}") + logger.warn(f"Lost text in segment {segment['start']}-{segment['end']}") continue labels, emission, trellis, segments, word_segments = compute_alignment( - sub_audio, sub_text, spec_alignment_model) + sub_audio, sub_text, spec_alignment_model + ) ratio = len(sub_audio) / (trellis.size(0) * SAMPLE_RATE) sub_words = sub_text.split() words = [] use_original_words = True if len(sub_words) != len(word_segments): logger.warn( - f"Alignment failed. Some words might be mis-rendered.\nNumber of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}") + f"Alignment failed. 
Some words might be mis-rendered.\nNumber of words: {len(sub_words)} != {len(word_segments)}\n>>>\n{sub_words}\n<<<\n{[segment.label for segment in word_segments]}" + ) assert len(word_segments) < len(sub_words) use_original_words = False for word, seg in zip(sub_words, word_segments): - words.append({ - "word": word if use_original_words else seg.label, - "start": seg.start * ratio + offset, - "end": seg.end * ratio + offset, - "conf": seg.score, - }) + words.append( + { + "word": word if use_original_words else seg.label, + "start": seg.start * ratio + offset, + "end": seg.end * ratio + offset, + "conf": seg.score, + } + ) # Glue the words inside a segment for i, word in enumerate(words): if i == 0: word["start"] = segment["start"] else: - word["start"] = words[i-1]["end"] + word["start"] = words[i - 1]["end"] if i == len(words) - 1: word["end"] = segment["end"] else: - word["end"] = .5 * (words[i+1]["start"] + word["end"]) + word["end"] = 0.5 * (words[i + 1]["start"] + word["end"]) # Accumulate results result["words"] += words @@ -244,7 +255,9 @@ def format_whisper_timestamped_response(transcription, remove_punctuation_from_w for i, seg in enumerate(transcription["segments"][:-1]): for expected_keys in ["start", "end", "words", "avg_logprob"]: - assert expected_keys in seg, f"Missing '{expected_keys}' in segment {i} (that has keys {list(seg.keys())})" + assert ( + expected_keys in seg + ), f"Missing '{expected_keys}' in segment {i} (that has keys {list(seg.keys())})" words = [] @@ -255,36 +268,43 @@ def format_whisper_timestamped_response(transcription, remove_punctuation_from_w text = word["text"] if remove_punctuation_from_words: text = remove_punctuation(text) - words.append({ - "word": text, - "start": word["start"], - "end": word["end"], - "conf": word["confidence"], - }) + words.append( + { + "word": text, + "start": word["start"], + "end": word["end"], + "conf": word["confidence"], + } + ) return { "text": transcription["text"].strip(), "language": transcription["language"], - "confidence-score": round(np.exp(np.array([r["avg_logprob"] for r in segments])).mean(), 2) if len(segments) else 0.0, + "confidence-score": round(np.exp(np.array([r["avg_logprob"] for r in segments])).mean(), 2) + if len(segments) + else 0.0, "words": words, } def format_faster_whisper_response( - segments, info, + segments, + info, remove_punctuation_from_words=False, glue_punctuations="'-&@.,", - ): - +): language = info.language duration = info.duration def checked_timestamps(start, end=None): if start > duration or (end is not None and end > duration): - print("WARNING, timestamp %f is greater than duration %f" % (max(start, end if end else start), duration)) + print( + "WARNING, timestamp %f is greater than duration %f" + % (max(start, end if end else start), duration) + ) if end and end <= start: if end == start: - pass # end = start + 0.01 + pass # end = start + 0.01 else: print("WARNING, end timestamp %f is smaller than start timestamp %f" % (end, start)) if end is None: @@ -300,34 +320,47 @@ def checked_timestamps(start, end=None): for word in segment.words: start, end = checked_timestamps(word.start, word.end) word_strip = word.word.strip() - if glue_punctuations and len(words) and len(word_strip)>1 and word_strip[0] in glue_punctuations: + if ( + glue_punctuations + and len(words) + and len(word_strip) > 1 + and word_strip[0] in glue_punctuations + ): words[-1]["text"] += word.word.lstrip() words[-1]["confidence"].append(word.probability) words[-1]["end"] = max(words[-1]["end"], end) continue - 
words.append({ - "text": word.word, - "confidence": [word.probability], - "start": start, - "end": end - }) + words.append( + { + "text": word.word, + "confidence": [word.probability], + "start": start, + "end": end, + } + ) for word in words: word["text"] = word["text"].strip() word["confidence"] = round(np.mean([c for c in word["confidence"]]), 2) - segments_list.append({ - "text": segment.text.strip(), - "start": start, - "end": end, - "avg_logprob": segment.avg_logprob, - "words": words - }) - + segments_list.append( + { + "text": segment.text.strip(), + "start": start, + "end": end, + "avg_logprob": segment.avg_logprob, + "words": words, + } + ) + transcription = { "text": " ".join(segment["text"] for segment in segments_list), "language": language, - "confidence": round(np.exp(np.mean([segment["avg_logprob"] for segment in segments_list])), 2), + "confidence": round( + np.exp(np.mean([segment["avg_logprob"] for segment in segments_list])), 2 + ), "segments": segments_list, } - return format_whisper_timestamped_response(transcription, remove_punctuation_from_words=remove_punctuation_from_words) + return format_whisper_timestamped_response( + transcription, remove_punctuation_from_words=remove_punctuation_from_words + ) diff --git a/whisper/stt/processing/load_model.py b/whisper/stt/processing/load_model.py index 3790593..b87a414 100644 --- a/whisper/stt/processing/load_model.py +++ b/whisper/stt/processing/load_model.py @@ -1,18 +1,18 @@ import os -import sys -import time import shutil import subprocess +import sys +import time -from stt import logger, USE_CTRANSLATE2 +from stt import USE_CTRANSLATE2, logger if USE_CTRANSLATE2: import faster_whisper else: import whisper_timestamped as whisper -def load_whisper_model(model_type_or_file, device="cpu", download_root=None): +def load_whisper_model(model_type_or_file, device="cpu", download_root=None): start = time.time() logger.info("Loading Whisper model {}...".format(model_type_or_file)) @@ -51,26 +51,34 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): device_index = [int(dev) for dev in device[5:].split(",")] device = "cuda" - if not os.path.isfile(os.path.join(model_type_or_file, "model.bin")) and \ - not max([model_type_or_file.startswith(prefix) for prefix in ["tiny", "base", "small", "medium", "large"]]): - + if not os.path.isfile(os.path.join(model_type_or_file, "model.bin")) and not max( + [ + model_type_or_file.startswith(prefix) + for prefix in ["tiny", "base", "small", "medium", "large"] + ] + ): # Convert transformer model - output_dir = os.path.join(download_root, f"ctranslate2/converters/transformers--{model_type_or_file.replace('/', '--')}") + output_dir = os.path.join( + download_root, + f"ctranslate2/converters/transformers--{model_type_or_file.replace('/', '--')}", + ) logger.info(f"CTranslate2 model in {output_dir}") if not os.path.isdir(output_dir): - import huggingface_hub delete_hf_path = False if not os.path.isdir(model_type_or_file): - - hf_path = huggingface_hub.hf_hub_download(repo_id=model_type_or_file, filename="pytorch_model.bin") + hf_path = huggingface_hub.hf_hub_download( + repo_id=model_type_or_file, filename="pytorch_model.bin" + ) hf_path = os.path.dirname(os.path.dirname(os.path.dirname(hf_path))) delete_hf_path = not os.path.exists(hf_path) else: - assert os.path.isfile(os.path.join(model_type_or_file, "pytorch_model.bin")), f"Could not find pytorch_model.bin in {model_type_or_file}" + assert os.path.isfile( + os.path.join(model_type_or_file, "pytorch_model.bin") + ), 
f"Could not find pytorch_model.bin in {model_type_or_file}" check_torch_installed() @@ -91,16 +99,21 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): # force=False # ) - subprocess.check_call([ - "ct2-transformers-converter", - "--model", model_type_or_file, - "--output_dir", os.path.realpath(output_dir), - "--quantization", "float16", - ]) + subprocess.check_call( + [ + "ct2-transformers-converter", + "--model", + model_type_or_file, + "--output_dir", + os.path.realpath(output_dir), + "--quantization", + "float16", + ] + ) except Exception as err: shutil.rmtree(output_dir, ignore_errors=True) raise err - + finally: if delete_hf_path: logger.info(f"Deleting {hf_path}") @@ -124,26 +137,28 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): ) break except ValueError as err: - logger.info("WARNING: failed to load model with compute_type={}".format(compute_type)) + logger.info( + "WARNING: failed to load model with compute_type={}".format(compute_type) + ) # On some old GPU we may have the error - # "ValueError: Requested int8_float16 compute type, + # "ValueError: Requested int8_float16 compute type, # but the target device or backend do not support efficient int8_float16 computation." if i == len(compute_types) - 1: raise err else: - - extension = os.path.splitext(model_type_or_file)[-1] if os.path.isfile(model_type_or_file) else None + extension = ( + os.path.splitext(model_type_or_file)[-1] if os.path.isfile(model_type_or_file) else None + ) if model_type_or_file in whisper.available_models() or extension == ".pt": - model = whisper.load_model( - model_type_or_file, device=device, - download_root=os.path.join(download_root, "whisper") + model_type_or_file, + device=device, + download_root=os.path.join(download_root, "whisper"), ) else: - # Convert HuggingFace model import torch @@ -161,25 +176,41 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): try: import transformers except ImportError: - raise ImportError(f"If you are trying to download a HuggingFace model with {model_type_or_file}, please install first the transformers library") + raise ImportError( + f"If you are trying to download a HuggingFace model with {model_type_or_file}, please install first the transformers library" + ) from transformers.utils import cached_file try: - model_path = cached_file(model_type_or_file, "pytorch_model.bin", cache_dir=download_root, use_auth_token=None, revision=None) + model_path = cached_file( + model_type_or_file, + "pytorch_model.bin", + cache_dir=download_root, + use_auth_token=None, + revision=None, + ) except Exception as e: try: if isinstance(e, OSError): - model_path = cached_file(model_type_or_file, "whisper.ckpt", cache_dir=download_root, use_auth_token=None, revision=None) + model_path = cached_file( + model_type_or_file, + "whisper.ckpt", + cache_dir=download_root, + use_auth_token=None, + revision=None, + ) else: raise e except: if peft_folder is None: - raise RuntimeError(f"Original error: {e}\nCould not find model {model_type_or_file} from HuggingFace nor local folders.") + raise RuntimeError( + f"Original error: {e}\nCould not find model {model_type_or_file} from HuggingFace nor local folders." 
+ ) # Load HF Model if peft_folder is not None: - from peft import PeftConfig, PeftModel import transformers + from peft import PeftConfig, PeftModel peft_config = PeftConfig.from_pretrained(peft_folder) base_model = peft_config.base_model_name_or_path @@ -191,7 +222,7 @@ def load_whisper_model(model_type_or_file, device="cpu", download_root=None): else: hf_state_dict = torch.load(model_path, map_location="cpu") - # Rename layers + # Rename layers for key in list(hf_state_dict.keys()): new_key = hf_to_whisper_states(key) if new_key is None: @@ -235,73 +266,82 @@ def check_torch_installed(): # import torch + # Credit: https://github.com/openai/whisper/discussions/830 def hf_to_whisper_states(text): import re - + # From Speechbrain if text == "_mel_filters": return None - + # From PEFT if "default" in text: # print(f"WARNING: Ignoring {text}") return None if text.startswith("base_model.model."): - text = text[len("base_model.model."):] - - text = re.sub('.layers.', '.blocks.', text) - text = re.sub('.self_attn.', '.attn.', text) - text = re.sub('.q_proj.', '.query.', text) - text = re.sub('.k_proj.', '.key.', text) - text = re.sub('.v_proj.', '.value.', text) - text = re.sub('.out_proj.', '.out.', text) - text = re.sub('.fc1.', '.mlp.0.', text) - text = re.sub('.fc2.', '.mlp.2.', text) - text = re.sub('.fc3.', '.mlp.3.', text) - text = re.sub('.fc3.', '.mlp.3.', text) - text = re.sub('.encoder_attn.', '.cross_attn.', text) - text = re.sub('.cross_attn.ln.', '.cross_attn_ln.', text) - text = re.sub('.embed_positions.weight', '.positional_embedding', text) - text = re.sub('.embed_tokens.', '.token_embedding.', text) - text = re.sub('model.', '', text) - text = re.sub('attn.layer_norm.', 'attn_ln.', text) - text = re.sub('.final_layer_norm.', '.mlp_ln.', text) - text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text) - text = re.sub('decoder.layer_norm.', 'decoder.ln.', text) + text = text[len("base_model.model.") :] + + text = re.sub(".layers.", ".blocks.", text) + text = re.sub(".self_attn.", ".attn.", text) + text = re.sub(".q_proj.", ".query.", text) + text = re.sub(".k_proj.", ".key.", text) + text = re.sub(".v_proj.", ".value.", text) + text = re.sub(".out_proj.", ".out.", text) + text = re.sub(".fc1.", ".mlp.0.", text) + text = re.sub(".fc2.", ".mlp.2.", text) + text = re.sub(".fc3.", ".mlp.3.", text) + text = re.sub(".fc3.", ".mlp.3.", text) + text = re.sub(".encoder_attn.", ".cross_attn.", text) + text = re.sub(".cross_attn.ln.", ".cross_attn_ln.", text) + text = re.sub(".embed_positions.weight", ".positional_embedding", text) + text = re.sub(".embed_tokens.", ".token_embedding.", text) + text = re.sub("model.", "", text) + text = re.sub("attn.layer_norm.", "attn_ln.", text) + text = re.sub(".final_layer_norm.", ".mlp_ln.", text) + text = re.sub("encoder.layer_norm.", "encoder.ln_post.", text) + text = re.sub("decoder.layer_norm.", "decoder.ln.", text) return text + def states_to_dim(state_dict): - n_audio_state = len(state_dict['encoder.ln_post.bias']) + n_audio_state = len(state_dict["encoder.ln_post.bias"]) n_text_state = len(state_dict["decoder.ln.bias"]) return { - "n_mels": state_dict["encoder.conv1.weight"].shape[1], # 80 - "n_vocab": state_dict["decoder.token_embedding.weight"].shape[0], # 51864 / 51865 - "n_audio_ctx": state_dict["encoder.positional_embedding"].shape[0], # 1500 - "n_audio_state": n_audio_state, # 384 / 512 / 768 / 1024 / 1280 - "n_audio_head": n_audio_state // 64, # 6 / 8 / 12 / 16 / 20 - "n_audio_layer": len(set([".".join(k.split(".")[:3]) for k in 
state_dict.keys() if "encoder.blocks." in k])), # 4 / 6 / 12 / 24 / 32 - "n_text_ctx": state_dict["decoder.positional_embedding"].shape[0], # 448 - "n_text_state": n_text_state, # 384 / 512 / 768 / 1024 / 1280 - "n_text_head": n_text_state // 64, # 6 / 8 / 12 / 16 / 20 - "n_text_layer": len(set([".".join(k.split(".")[:3]) for k in state_dict.keys() if "decoder.blocks." in k])), # 4 / 6 / 12 / 24 / 32 + "n_mels": state_dict["encoder.conv1.weight"].shape[1], # 80 + "n_vocab": state_dict["decoder.token_embedding.weight"].shape[0], # 51864 / 51865 + "n_audio_ctx": state_dict["encoder.positional_embedding"].shape[0], # 1500 + "n_audio_state": n_audio_state, # 384 / 512 / 768 / 1024 / 1280 + "n_audio_head": n_audio_state // 64, # 6 / 8 / 12 / 16 / 20 + "n_audio_layer": len( + set([".".join(k.split(".")[:3]) for k in state_dict.keys() if "encoder.blocks." in k]) + ), # 4 / 6 / 12 / 24 / 32 + "n_text_ctx": state_dict["decoder.positional_embedding"].shape[0], # 448 + "n_text_state": n_text_state, # 384 / 512 / 768 / 1024 / 1280 + "n_text_head": n_text_state // 64, # 6 / 8 / 12 / 16 / 20 + "n_text_layer": len( + set([".".join(k.split(".")[:3]) for k in state_dict.keys() if "decoder.blocks." in k]) + ), # 4 / 6 / 12 / 24 / 32 } + if not USE_CTRANSLATE2: class TextDecoderUntied(whisper.model.TextDecoder): """ Same as TextDecoder but with untied weights """ + def __init__(self, *args, **kwargs): import torch + super().__init__(*args, **kwargs) n_vocab, n_state = self.token_embedding.weight.shape self.proj_out = torch.nn.Linear(n_state, n_vocab, bias=False) - def forward(self, x, xa, kv_cache = None): + def forward(self, x, xa, kv_cache=None): offset = next(iter(kv_cache.values())).shape[1] if kv_cache else 0 x = self.token_embedding(x) + self.positional_embedding[offset : offset + x.shape[-1]] x = x.to(xa.dtype) diff --git a/whisper/stt/processing/text_normalize.py b/whisper/stt/processing/text_normalize.py index a5f3d04..cde8f38 100644 --- a/whisper/stt/processing/text_normalize.py +++ b/whisper/stt/processing/text_normalize.py @@ -4,6 +4,7 @@ import unicodedata from stt import logger + from .utils import flatten # All punctuations and symbols EXCEPT: @@ -21,14 +22,14 @@ # A list of symbols that can be an isolated words and not in the exclusion list above # * & # * candidates not retained: §, <, =, >, ≤, ≥ -_maybe_word_regex = None # r"[" + re.escape("&") + r"]$" +_maybe_word_regex = None # r"[" + re.escape("&") + r"]$" -def remove_punctuation(text: str, ensure_no_spaces_in_words: bool=False) -> str: +def remove_punctuation(text: str, ensure_no_spaces_in_words: bool = False) -> str: text = text.strip() # Note: we don't remove dots inside words (e.g. "ab@gmail.com") - new_text = re.sub(_leading_punctuations_regex, "", text) #.lstrip() - new_text = re.sub(_trailing_punctuations_regex, "", new_text) #.rstrip() + new_text = re.sub(_leading_punctuations_regex, "", text) # .lstrip() + new_text = re.sub(_trailing_punctuations_regex, "", new_text) # .rstrip() # Let punctuation marks that are alone if not new_text: if _maybe_word_regex and re.match(_maybe_word_regex, text): @@ -43,6 +44,7 @@ def remove_punctuation(text: str, ensure_no_spaces_in_words: bool=False) -> str: return remove_punctuation(new_text, ensure_no_spaces_in_words=ensure_no_spaces_in_words) return new_text + def transliterate(c): # Transliterates a character to its closest ASCII equivalent. 
# Example: transliterate("à ß œ fl") = "a ss oe fl" @@ -56,66 +58,62 @@ def transliterate(c): def remove_emoji(text): # Remove emojis - return re.sub(r"[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+", "", text) + return re.sub( + r"[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+", + "", + text, + ) def normalize_text(text: str, lang: str) -> str: - """ Transform digits into characters... """ + """Transform digits into characters...""" # Reorder currencies (1,20€ -> 1 € 20) coma = "," if lang in ["fr"] else "\." for c in _currencies: if c in text: - text = re.sub(r"\b(\d+)" + coma + r"(\d+)\s*" + - c, r"\1 " + c + r" \2", text) + text = re.sub(r"\b(\d+)" + coma + r"(\d+)\s*" + c, r"\1 " + c + r" \2", text) # Roman digits if re.search(r"[IVX]", text): if lang == "en": - digits = re.findall( - r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|st|nd|rd|th)?\b", text) + digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|st|nd|rd|th)?\b", text) digits = ["".join(d) for d in digits] elif lang == "fr": digits = re.findall( - r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|ème|eme|e|er|ère)?\b", text) + r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})(º|ème|eme|e|er|ère)?\b", text + ) digits = ["".join(d) for d in digits] else: - digits = re.findall( - r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})\b", text) + digits = re.findall(r"\b(?=[XVI])M*(XX{0,3})(I[XV]|V?I{0,3})\b", text) digits = ["".join(d) for d in digits] if digits: - digits = sorted(list(set(digits)), reverse=True, - key=lambda x: (len(x), x)) + digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) for s in digits: filtered = re.sub("[a-zèº]", "", s) ordinal = filtered != s digit = roman_to_decimal(filtered) - v = undigit(str(digit), lang=lang, - to="ordinal" if ordinal else "cardinal") + v = undigit(str(digit), lang=lang, to="ordinal" if ordinal else "cardinal") text = re.sub(r"\b" + s + r"\b", v, text) # Ordinal digits if lang == "en": - digits = re.findall( - r"\b\d*1(?:st)|\d*2(?:nd)|\d*3(?:rd)|\d+(?:º|th)\b", text) + digits = re.findall(r"\b\d*1(?:st)|\d*2(?:nd)|\d*3(?:rd)|\d+(?:º|th)\b", text) elif lang == "fr": - digits = re.findall( - r"\b1(?:ère|ere|er|re|r)|2(?:nd|nde)|\d+(?:º|ème|eme|e)\b", text) + digits = re.findall(r"\b1(?:ère|ere|er|re|r)|2(?:nd|nde)|\d+(?:º|ème|eme|e)\b", text) else: logger.warn( - f"Language {lang} not supported for some normalization. Some words might be mis-localized.") + f"Language {lang} not supported for some normalization. Some words might be mis-localized." 
+ ) digits = [] if digits: - digits = sorted(list(set(digits)), reverse=True, - key=lambda x: (len(x), x)) + digits = sorted(list(set(digits)), reverse=True, key=lambda x: (len(x), x)) for digit in digits: - word = undigit(re.findall(r"\d+", digit) - [0], to="ordinal", lang=lang) - text = re.sub(r'\b'+str(digit)+r'\b', word, text) + word = undigit(re.findall(r"\d+", digit)[0], to="ordinal", lang=lang) + text = re.sub(r"\b" + str(digit) + r"\b", word, text) # Cardinal digits - digits = re.findall( - r"(?:\-?\b[\d/]*\d+(?: \d\d\d)+\b)|(?:\-?\d[/\d]*)", text) + digits = re.findall(r"(?:\-?\b[\d/]*\d+(?: \d\d\d)+\b)|(?:\-?\d[/\d]*)", text) digits = list(map(lambda s: s.strip(r"[/ ]"), digits)) digits = list(set(digits)) digits = digits + flatten([c.split() for c in digits if " " in c]) @@ -131,54 +129,55 @@ def normalize_text(text: str, lang: str) -> str: elif numslash == 1: # Fraction or date i = digitf.index("/") is_date = False - if len(digitf[i+1:]) == 2: + if len(digitf[i + 1 :]) == 2: try: first = int(digitf[:i]) - second = int(digitf[i+1:]) + second = int(digitf[i + 1 :]) is_date = first > 0 and first < 32 and second > 0 and second < 13 except: pass if is_date: first = digitf[:i].lstrip("0") use_ordinal = (lang == "fr" and first == "1") or ( - lang != "fr" and first[-1] in ["1", "2", "3"]) - first = undigit(first, lang=lang, - to="ordinal" if use_ordinal else "cardinal") - second = _int_to_month.get(lang, {}).get(second,digitf[i+1:]) + lang != "fr" and first[-1] in ["1", "2", "3"] + ) + first = undigit(first, lang=lang, to="ordinal" if use_ordinal else "cardinal") + second = _int_to_month.get(lang, {}).get(second, digitf[i + 1 :]) else: first = undigit(digitf[:i], lang=lang) - second = undigit(digitf[i+1:], to="denominator", lang=lang) - if float(digitf[:i]) > 2. 
and second[-1] != "s": + second = undigit(digitf[i + 1 :], to="denominator", lang=lang) + if float(digitf[:i]) > 2.0 and second[-1] != "s": second += "s" word = first + " " + second elif numslash == 2: # Maybe a date i1 = digitf.index("/") - i2 = digitf.index("/", i1+1) + i2 = digitf.index("/", i1 + 1) is_date = False - if len(digitf[i1+1:i2]) == 2 and len(digitf[i2+1:]) == 4: + if len(digitf[i1 + 1 : i2]) == 2 and len(digitf[i2 + 1 :]) == 4: try: first = int(digitf[:i1]) - second = int(digitf[i1+1:i2]) - third = int(digitf[i2+1:]) - is_date = first > 0 and first < 32 and second > 0 and second < 13 and third > 1000 + second = int(digitf[i1 + 1 : i2]) + third = int(digitf[i2 + 1 :]) + is_date = ( + first > 0 and first < 32 and second > 0 and second < 13 and third > 1000 + ) except: pass - third = undigit(digitf[i2+1:], lang=lang) + third = undigit(digitf[i2 + 1 :], lang=lang) if is_date: first = digitf[:i1].lstrip("0") use_ordinal = (lang == "fr" and first == "1") or ( - lang != "fr" and first[-1] in ["1", "2", "3"]) - first = undigit(first, lang=lang, - to="ordinal" if use_ordinal else "cardinal") + lang != "fr" and first[-1] in ["1", "2", "3"] + ) + first = undigit(first, lang=lang, to="ordinal" if use_ordinal else "cardinal") second = _int_to_month.get(lang, {}).get( - int(digitf[i1+1:i2]), digitf[i1+1:i2]) + int(digitf[i1 + 1 : i2]), digitf[i1 + 1 : i2] + ) word = " ".join([first, second, third]) else: - word = " / ".join([undigit(s, lang=lang) - for s in digitf.split('/')]) + word = " / ".join([undigit(s, lang=lang) for s in digitf.split("/")]) else: - word = " / ".join([undigit(s, lang=lang) - for s in digitf.split('/')]) + word = " / ".join([undigit(s, lang=lang) for s in digitf.split("/")]) text = replace_keeping_word_boundaries(digit, word, text) # Symbols (currencies, percent...) @@ -194,12 +193,13 @@ def normalize_text(text: str, lang: str) -> str: def replace_keeping_word_boundaries(orig, dest, text): if orig in text: - text = re.sub(r"(\W)"+orig+r"(\W)", r"\1"+dest+r"\2", text) - text = re.sub(orig+r"(\W)", " "+dest+r"\1", text) - text = re.sub(r"(\W)"+orig, r"\1"+dest+" ", text) - text = re.sub(orig, " "+dest+" ", text) + text = re.sub(r"(\W)" + orig + r"(\W)", r"\1" + dest + r"\2", text) + text = re.sub(orig + r"(\W)", " " + dest + r"\1", text) + text = re.sub(r"(\W)" + orig, r"\1" + dest + " ", text) + text = re.sub(orig, " " + dest + " ", text) return text + def undigit(str, lang, to="cardinal"): str = re.sub(" ", "", str) if to == "denominator": @@ -224,7 +224,9 @@ def undigit(str, lang, to="cardinal"): if str.startswith("0") and to == "cardinal": numZeros = len(re.findall(r"0+", str)[0]) if numZeros < len(str): - return numZeros * (robust_num2words(0, lang=lang)+" ") + robust_num2words(float(str), lang=lang, to=to) + return numZeros * (robust_num2words(0, lang=lang) + " ") + robust_num2words( + float(str), lang=lang, to=to + ) return robust_num2words(float(str), lang=lang, to=to) @@ -233,6 +235,7 @@ def robust_num2words(x, lang, to="cardinal", orig=""): Bugfix for num2words """ from num2words import num2words + try: res = num2words(x, lang=lang, to=to) if lang == "fr" and to == "ordinal": @@ -244,34 +247,34 @@ def robust_num2words(x, lang, to="cardinal", orig=""): if x == -math.inf: # ! 
return "moins " + robust_num2words(-x, lang=lang, to=to, orig=orig.replace("-", "")) # TODO: print a warning - return robust_num2words(x//10, lang=lang, to=to) + return robust_num2words(x // 10, lang=lang, to=to) def roman_to_decimal(str): def value(r): - if (r == 'I'): + if r == "I": return 1 - if (r == 'V'): + if r == "V": return 5 - if (r == 'X'): + if r == "X": return 10 - if (r == 'L'): + if r == "L": return 50 - if (r == 'C'): + if r == "C": return 100 - if (r == 'D'): + if r == "D": return 500 - if (r == 'M'): + if r == "M": return 1000 return -1 res = 0 i = 0 - while (i < len(str)): + while i < len(str): s1 = value(str[i]) - if (i + 1 < len(str)): + if i + 1 < len(str): s2 = value(str[i + 1]) - if (s1 >= s2): + if s1 >= s2: # Value of current symbol is greater or equal to the next symbol res = res + s1 i = i + 1 @@ -313,7 +316,7 @@ def value(r): 10: "october", 11: "november", 12: "december", - } + }, } _currencies = ["€", "$", "£", "¥"] @@ -386,6 +389,5 @@ def value(r): "\$": "dollars", "£": "pounds", "¥": "yens", - } + }, } - diff --git a/whisper/stt/processing/utils.py b/whisper/stt/processing/utils.py index 0352de4..106167a 100644 --- a/whisper/stt/processing/utils.py +++ b/whisper/stt/processing/utils.py @@ -1,32 +1,35 @@ -from stt import USE_CTRANSLATE2, USE_TORCH, USE_TORCHAUDIO - import io -import wavio import os + import numpy as np +import wavio +from stt import USE_CTRANSLATE2, USE_TORCH, USE_TORCHAUDIO -SAMPLE_RATE = 16000 # whisper.audio.SAMPLE_RATE +SAMPLE_RATE = 16000 # whisper.audio.SAMPLE_RATE if USE_CTRANSLATE2: import ctranslate2 import faster_whisper else: import torch + import whisper if USE_TORCHAUDIO: import torchaudio + def has_cuda(): if USE_CTRANSLATE2: return ctranslate2.get_cuda_device_count() > 0 else: return torch.cuda.is_available() + def get_device(): device = os.environ.get("DEVICE", "cuda" if has_cuda() else "cpu") use_gpu = "cuda" in device - + if USE_CTRANSLATE2: try: if device.startswith("cuda:"): @@ -34,7 +37,9 @@ def get_device(): else: assert device in ["cpu", "cuda"] except: - raise ValueError(f"Invalid DEVICE '{device}' (should be 'cpu' or 'cuda' or 'cuda: or 'cuda:,,...')") + raise ValueError( + f"Invalid DEVICE '{device}' (should be 'cpu' or 'cuda' or 'cuda: or 'cuda:,,...')" + ) else: try: device = torch.device(device) @@ -42,6 +47,7 @@ def get_device(): raise Exception("Failed to set device: {}".format(str(err))) from err return device, use_gpu + def get_language(): """ Get the language from the environment variable LANGUAGE, and format as expected by Whisper. @@ -58,13 +64,17 @@ def get_language(): language = {v: k for k, v in LANGUAGES.items()}.get(language.lower(), language) # Raise an exception for unknown languages if language not in LANGUAGES: - available_languages = \ - list(LANGUAGES.keys()) + \ - [k[0].upper() + k[1:] for k in LANGUAGES.values()] + \ - ["*", None] - raise ValueError(f"Language '{language}' is not available. Available languages are: {available_languages}") + available_languages = ( + list(LANGUAGES.keys()) + + [k[0].upper() + k[1:] for k in LANGUAGES.values()] + + ["*", None] + ) + raise ValueError( + f"Language '{language}' is not available. Available languages are: {available_languages}" + ) return language + def conform_audio(audio, sample_rate=16_000): if sample_rate != SAMPLE_RATE: if not USE_TORCHAUDIO: @@ -93,13 +103,13 @@ def load_audiofile(path): def load_wave_buffer(file_buffer): - """ Formats audio from a wavFile buffer to a torch array for processing. 
""" + """Formats audio from a wavFile buffer to a torch array for processing.""" file_buffer_io = io.BytesIO(file_buffer) if USE_CTRANSLATE2: return faster_whisper.decode_audio(file_buffer_io, sampling_rate=SAMPLE_RATE) file_content = wavio.read(file_buffer_io) sample_rate = file_content.rate - audio = file_content.data.astype(np.float32)/32768 + audio = file_content.data.astype(np.float32) / 32768 audio = audio.transpose() audio = torch.from_numpy(audio) return conform_audio(audio, sample_rate) @@ -111,104 +121,105 @@ def flatten(l): """ return [item for sublist in l for item in sublist] -LANGUAGES = { # whisper.tokenizer.LANGUAGES - 'en': 'english', - 'zh': 'chinese', - 'de': 'german', - 'es': 'spanish', - 'ru': 'russian', - 'ko': 'korean', - 'fr': 'french', - 'ja': 'japanese', - 'pt': 'portuguese', - 'tr': 'turkish', - 'pl': 'polish', - 'ca': 'catalan', - 'nl': 'dutch', - 'ar': 'arabic', - 'sv': 'swedish', - 'it': 'italian', - 'id': 'indonesian', - 'hi': 'hindi', - 'fi': 'finnish', - 'vi': 'vietnamese', - 'he': 'hebrew', - 'uk': 'ukrainian', - 'el': 'greek', - 'ms': 'malay', - 'cs': 'czech', - 'ro': 'romanian', - 'da': 'danish', - 'hu': 'hungarian', - 'ta': 'tamil', - 'no': 'norwegian', - 'th': 'thai', - 'ur': 'urdu', - 'hr': 'croatian', - 'bg': 'bulgarian', - 'lt': 'lithuanian', - 'la': 'latin', - 'mi': 'maori', - 'ml': 'malayalam', - 'cy': 'welsh', - 'sk': 'slovak', - 'te': 'telugu', - 'fa': 'persian', - 'lv': 'latvian', - 'bn': 'bengali', - 'sr': 'serbian', - 'az': 'azerbaijani', - 'sl': 'slovenian', - 'kn': 'kannada', - 'et': 'estonian', - 'mk': 'macedonian', - 'br': 'breton', - 'eu': 'basque', - 'is': 'icelandic', - 'hy': 'armenian', - 'ne': 'nepali', - 'mn': 'mongolian', - 'bs': 'bosnian', - 'kk': 'kazakh', - 'sq': 'albanian', - 'sw': 'swahili', - 'gl': 'galician', - 'mr': 'marathi', - 'pa': 'punjabi', - 'si': 'sinhala', - 'km': 'khmer', - 'sn': 'shona', - 'yo': 'yoruba', - 'so': 'somali', - 'af': 'afrikaans', - 'oc': 'occitan', - 'ka': 'georgian', - 'be': 'belarusian', - 'tg': 'tajik', - 'sd': 'sindhi', - 'gu': 'gujarati', - 'am': 'amharic', - 'yi': 'yiddish', - 'lo': 'lao', - 'uz': 'uzbek', - 'fo': 'faroese', - 'ht': 'haitian creole', - 'ps': 'pashto', - 'tk': 'turkmen', - 'nn': 'nynorsk', - 'mt': 'maltese', - 'sa': 'sanskrit', - 'lb': 'luxembourgish', - 'my': 'myanmar', - 'bo': 'tibetan', - 'tl': 'tagalog', - 'mg': 'malagasy', - 'as': 'assamese', - 'tt': 'tatar', - 'haw': 'hawaiian', - 'ln': 'lingala', - 'ha': 'hausa', - 'ba': 'bashkir', - 'jw': 'javanese', - 'su': 'sundanese' + +LANGUAGES = { # whisper.tokenizer.LANGUAGES + "en": "english", + "zh": "chinese", + "de": "german", + "es": "spanish", + "ru": "russian", + "ko": "korean", + "fr": "french", + "ja": "japanese", + "pt": "portuguese", + "tr": "turkish", + "pl": "polish", + "ca": "catalan", + "nl": "dutch", + "ar": "arabic", + "sv": "swedish", + "it": "italian", + "id": "indonesian", + "hi": "hindi", + "fi": "finnish", + "vi": "vietnamese", + "he": "hebrew", + "uk": "ukrainian", + "el": "greek", + "ms": "malay", + "cs": "czech", + "ro": "romanian", + "da": "danish", + "hu": "hungarian", + "ta": "tamil", + "no": "norwegian", + "th": "thai", + "ur": "urdu", + "hr": "croatian", + "bg": "bulgarian", + "lt": "lithuanian", + "la": "latin", + "mi": "maori", + "ml": "malayalam", + "cy": "welsh", + "sk": "slovak", + "te": "telugu", + "fa": "persian", + "lv": "latvian", + "bn": "bengali", + "sr": "serbian", + "az": "azerbaijani", + "sl": "slovenian", + "kn": "kannada", + "et": "estonian", + "mk": "macedonian", + "br": "breton", + 
"eu": "basque", + "is": "icelandic", + "hy": "armenian", + "ne": "nepali", + "mn": "mongolian", + "bs": "bosnian", + "kk": "kazakh", + "sq": "albanian", + "sw": "swahili", + "gl": "galician", + "mr": "marathi", + "pa": "punjabi", + "si": "sinhala", + "km": "khmer", + "sn": "shona", + "yo": "yoruba", + "so": "somali", + "af": "afrikaans", + "oc": "occitan", + "ka": "georgian", + "be": "belarusian", + "tg": "tajik", + "sd": "sindhi", + "gu": "gujarati", + "am": "amharic", + "yi": "yiddish", + "lo": "lao", + "uz": "uzbek", + "fo": "faroese", + "ht": "haitian creole", + "ps": "pashto", + "tk": "turkmen", + "nn": "nynorsk", + "mt": "maltese", + "sa": "sanskrit", + "lb": "luxembourgish", + "my": "myanmar", + "bo": "tibetan", + "tl": "tagalog", + "mg": "malagasy", + "as": "assamese", + "tt": "tatar", + "haw": "hawaiian", + "ln": "lingala", + "ha": "hausa", + "ba": "bashkir", + "jw": "javanese", + "su": "sundanese", } diff --git a/whisper/stt/processing/word_alignment.py b/whisper/stt/processing/word_alignment.py index 229fb43..e7a9256 100644 --- a/whisper/stt/processing/word_alignment.py +++ b/whisper/stt/processing/word_alignment.py @@ -1,24 +1,26 @@ """ Credits: https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html """ -from stt import logger, USE_TORCH from dataclasses import dataclass +from stt import USE_TORCH, logger + from .alignment_model import compute_logprobas, get_vocab -from .utils import flatten from .text_normalize import transliterate +from .utils import flatten if USE_TORCH: import torch _unknown_chars = [] + def compute_alignment(audio, transcript, model): - """ Compute the alignment of the audio and a transcript, for a given model that returns log-probabilities on the charset defined the transcript.""" + """Compute the alignment of the audio and a transcript, for a given model that returns log-probabilities on the charset defined the transcript.""" emission = compute_logprobas(model, audio) labels, blank_id = get_vocab(model) - labels = labels[:emission.shape[1]] + labels = labels[: emission.shape[1]] dictionary = {c: i for i, c in enumerate(labels)} default = labels.index("-") if "-" in labels else None @@ -30,8 +32,7 @@ def compute_alignment(audio, transcript, model): if len(tokens) + num_repetitions > num_emissions: # It will be impossible to find a path... # It can happen when Whisper is lost in a loop (ex: "Ha ha ha ha ...") - logger.warn( - f"Got too many characters from Whisper. Shrinking to the first characters.") + logger.warn(f"Got too many characters from Whisper. 
Shrinking to the first characters.") tokens = tokens[:num_emissions] num_repetitions = count_repetitions(tokens) while len(tokens) + num_repetitions > num_emissions: @@ -62,8 +63,7 @@ def loose_get_char_index(dictionary, c, default=None): if i is None: # Try with alternative versions of the character tc = transliterate(c) - other_char = list( - set([c.lower(), c.upper(), tc, tc.lower(), tc.upper()])) + other_char = list(set([c.lower(), c.upper(), tc, tc.lower(), tc.upper()])) for c2 in other_char: i = dictionary.get(c2, None) if i is not None: @@ -73,15 +73,17 @@ def loose_get_char_index(dictionary, c, default=None): if i is None: for c2 in other_char: if len(c2) > 1: - candidate = [dictionary[c3] - for c3 in c2 if c3 in dictionary] + candidate = [dictionary[c3] for c3 in c2 if c3 in dictionary] if len(candidate) > 0 and (i is None or len(candidate) > len(i)): i = candidate # If still not found if i is None: if c not in _unknown_chars: - logger.warn("Character not correctly handled by alignment model: '" + - "' / '".join(list(set([c] + other_char))) + "'") + logger.warn( + "Character not correctly handled by alignment model: '" + + "' / '".join(list(set([c] + other_char))) + + "'" + ) _unknown_chars.append(c) i = [default] if default is not None else [] else: @@ -103,16 +105,23 @@ def get_trellis(emission, tokens, blank_id=0, use_max=False): trellis[-num_tokens:, 0] = float("inf") for t in range(num_frame): - trellis[t + 1, 1:] = torch.maximum( - # Score for staying at the same token - trellis[t, 1:] + emission[t, blank_id], - torch.maximum(trellis[t, 1:] + emission[t, tokens], - # Score for changing to the next token - trellis[t, :-1] + emission[t, tokens]) - ) if use_max else torch.logaddexp( - trellis[t, 1:] + emission[t, blank_id], - torch.logaddexp(trellis[t, 1:] + emission[t, tokens], - trellis[t, :-1] + emission[t, tokens]) + trellis[t + 1, 1:] = ( + torch.maximum( + # Score for staying at the same token + trellis[t, 1:] + emission[t, blank_id], + torch.maximum( + trellis[t, 1:] + emission[t, tokens], + # Score for changing to the next token + trellis[t, :-1] + emission[t, tokens], + ), + ) + if use_max + else torch.logaddexp( + trellis[t, 1:] + emission[t, blank_id], + torch.logaddexp( + trellis[t, 1:] + emission[t, tokens], trellis[t, :-1] + emission[t, tokens] + ), + ) ) return trellis @@ -146,8 +155,7 @@ def backtrack(trellis, emission, tokens, blank_id=0): changed = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]] # 2. Store the path with frame-wise probability. - prob = emission[t - 1, tokens[j - 1] - if changed > stayed else 0].exp().item() + prob = emission[t - 1, tokens[j - 1] if changed > stayed else 0].exp().item() # Return token index and time index in non-trellis coordinate. 
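        # Note: `prob` above is the frame-wise probability of the emitted symbol;
        # when the path stays on the current token, emission index 0 is used,
        # which matches the blank token only under the default blank_id == 0.
        # Each Point stores (token index, time index, score) in frame units.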
path.append(Point(j - 1, t - 1, prob)) @@ -205,10 +213,10 @@ def merge_words(segments, separator=" "): if i1 != i2: segs = segments[i1:i2] word = "".join([seg.label for seg in segs]) - score = sum(seg.score * seg.length for seg in segs) / \ - sum(seg.length for seg in segs) - words.append( - Segment(word, segments[i1].start, segments[i2 - 1].end, score)) + score = sum(seg.score * seg.length for seg in segs) / sum( + seg.length for seg in segs + ) + words.append(Segment(word, segments[i1].start, segments[i2 - 1].end, score)) i1 = i2 + 1 i2 = i1 else: From 38d866b9d51035f9defd1d0b724739591192d3a6 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 12 Dec 2023 17:25:28 +0100 Subject: [PATCH 165/172] Update READMEs (and add a main one) --- README.md | 12 ++++++++++ kaldi/README.md | 26 ++++++++++++---------- whisper/README.md | 56 ++++++++++++++++++++++++----------------------- 3 files changed, 55 insertions(+), 39 deletions(-) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..ac846d3 --- /dev/null +++ b/README.md @@ -0,0 +1,12 @@ +# LinTO-Platform-STT + +LinTO-Platform-STT is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack), +which can currently work with Speech-To-Text (STT) models. +The following families of STT models are currently supported (please refer to respective documentation for more details): +* [Kaldi models](kaldi/README.md) +* [Whisper models](whisper/README.md) + +LinTO-Platform-STT can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. + +## License +This project is developped under the AGPLv3 License (see LICENSE). diff --git a/kaldi/README.md b/kaldi/README.md index ec70060..9ab215f 100644 --- a/kaldi/README.md +++ b/kaldi/README.md @@ -1,7 +1,9 @@ -# LINTO-PLATFORM-STT -LinTO-platform-stt is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack). +# LinTO-Platform-STT-Kaldi -LinTO-platform-stt can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. +LinTO-Platform-STT-Kaldi is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack) +based on Speech-To-Text (STT) models trained with [Kaldi](https://github.com/kaldi-asr/kaldi). + +LinTO-Platform-STT-Kaldi can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. ## Pre-requisites @@ -12,7 +14,7 @@ To run the transcription models you'll need: * One CPU per worker. Inference time scales on CPU performances. ### Model -LinTO-Platform-STT accepts two kinds of models: +LinTO-Platform-STT-Kaldi accepts two kinds of models: * LinTO Acoustic and Languages models. * Vosk models. @@ -26,19 +28,19 @@ The transcription service requires docker up and running. The STT only entry point in task mode are tasks posted on a message broker. Supported message broker are RabbitMQ, Redis, Amazon SQS. On addition, as to prevent large audio from transiting through the message broker, STT-Worker use a shared storage folder (SHARED_FOLDER). -## Deploy linto-platform-stt +## Deploy LinTO-Platform-STT-Kaldi **1- First step is to build or pull the image:** ```bash git clone https://github.com/linto-ai/linto-platform-stt.git cd linto-platform-stt -docker build . -t linto-platform-stt:latest +docker build . 
-f kaldi/Dockerfile -t linto-platform-stt-kaldi:latest ``` or ```bash -docker pull lintoai/linto-platform-stt +docker pull lintoai/linto-platform-stt-kaldi ``` **2- Download the models** @@ -48,7 +50,7 @@ Have the acoustic and language model ready at AM_PATH and LM_PATH if you are usi **3- Fill the .env** ```bash -cp .envdefault .env +cp kaldi/.envdefault kaldi/.env ``` | PARAMETER | DESCRIPTION | EXEMPLE | @@ -84,8 +86,8 @@ docker run --rm \ -p HOST_SERVING_PORT:80 \ -v AM_PATH:/opt/AM \ -v LM_PATH:/opt/LM \ ---env-file .env \ -linto-platform-stt:latest +--env-file kaldi/.env \ +linto-platform-stt-kaldi:latest ``` This will run a container providing an [HTTP API](#http-api) binded on the host HOST_SERVING_PORT port. @@ -114,8 +116,8 @@ docker run --rm \ -v AM_PATH:/opt/AM \ -v LM_PATH:/opt/LM \ -v SHARED_AUDIO_FOLDER:/opt/audio \ ---env-file .env \ -linto-platform-stt:latest +--env-file kaldi/.env \ +linto-platform-stt-kaldi:latest ``` **Parameters:** diff --git a/whisper/README.md b/whisper/README.md index 8e6b04d..f460093 100644 --- a/whisper/README.md +++ b/whisper/README.md @@ -1,7 +1,9 @@ -# LINTO-PLATFORM-STT -LinTO-platform-stt is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack). +# LinTO-Platform-STT-Whisper -LinTO-platform-stt can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. +LinTO-Platform-STT-Whisper is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack) +based on Speech-To-Text (STT) [Whisper models](https://openai.com/research/whisper). + +LinTO-Platform-STT-Whisper can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. ## Pre-requisites @@ -11,24 +13,22 @@ To run the transcription models you'll need: * Up to 7GB of RAM depending on the model used. * One CPU per worker. Inference time scales on CPU performances. -### Model -LinTO-Platform-STT works with two models: -* A Whisper model to perform Automatic Speech Recognition, which must be in the PyTorch format. -* A wav2vec model to perform word alignment, which can be in the format of SpeechBrain, HuggingFace's Transformers or TorchAudio +### Model(s) + +LinTO-Platform-STT-Whisper works with a Whisper model to perform Automatic Speech Recognition, which must be in the PyTorch format. + +#### Optional alignment model (deprecated) +It can also work with a wav2vec model to perform word alignment. The wav2vec model can be specified either -* with a string corresponding to a `torchaudio` pipeline (e.g. "WAV2VEC2_ASR_BASE_960H") or -* with a string corresponding to a HuggingFace repository of a wav2vec model (e.g. "jonatasgrosman/wav2vec2-large-xlsr-53-english"), or -* with a path corresponding to a folder with a SpeechBrain model - -Default models are provided for the following languages: -* French (fr) -* English (en) -* Spanish (es) -* German (de) -* Dutch (nl) -* Japanese (ja) -* Chinese (zh) +* (TorchAudio) with a string corresponding to a `torchaudio` pipeline (e.g. "WAV2VEC2_ASR_BASE_960H") or +* (HuggingFace's Transformers) with a string corresponding to a HuggingFace repository of a wav2vec model (e.g. 
"jonatasgrosman/wav2vec2-large-xlsr-53-english"), or +* (SpeechBrain) with a path corresponding to a folder with a SpeechBrain model + +Default wav2vec models are provided for French (fr), English (en), Spanish (es), German (de), Dutch (nl), Japanese (ja), Chinese (zh). + +But we advise not to use a companion wav2vec alignment model. +This is not needed neither tested anymore. ### Docker The transcription service requires docker up and running. @@ -37,19 +37,19 @@ The transcription service requires docker up and running. The STT only entry point in task mode are tasks posted on a message broker. Supported message broker are RabbitMQ, Redis, Amazon SQS. On addition, as to prevent large audio from transiting through the message broker, STT-Worker use a shared storage folder (SHARED_FOLDER). -## Deploy linto-platform-stt +## Deploy LinTO-Platform-STT-Whisper **1- First step is to build or pull the image:** ```bash git clone https://github.com/linto-ai/linto-platform-stt.git cd linto-platform-stt -docker build . -t linto-platform-stt:latest +docker build . -f whisper/Dockerfile.ctranslate2 -t linto-platform-stt-whisper:latest ``` or ```bash -docker pull lintoai/linto-platform-stt +docker pull lintoai/linto-platform-stt-whisper ``` **2- Download the models** @@ -77,7 +77,7 @@ If may also want to download a specific wav2vec model for word alignment. **3- Fill the .env** ```bash -cp .envdefault .env +cp whisper/.envdefault whisper/.env ``` | PARAMETER | DESCRIPTION | EXEMPLE | @@ -134,8 +134,8 @@ The SERVICE_MODE value in the .env should be set to ```http```. docker run --rm \ -p HOST_SERVING_PORT:80 \ -v ASR_PATH:/opt/model.pt \ ---env-file .env \ -linto-platform-stt:latest +--env-file whisper/.env \ +linto-platform-stt-whisper:latest ``` This will run a container providing an [HTTP API](#http-api) binded on the host HOST_SERVING_PORT port. @@ -169,8 +169,8 @@ You need a message broker up and running at MY_SERVICE_BROKER. docker run --rm \ -v ASR_PATH:/opt/model.pt \ -v SHARED_AUDIO_FOLDER:/opt/audio \ ---env-file .env \ -linto-platform-stt:latest +--env-file whisper/.env \ +linto-platform-stt-whisper:latest ``` You may also want to mount your cache folder CACHE_PATH (e.g. "~/.cache") ```-v CACHE_PATH:/root/.cache``` @@ -267,7 +267,9 @@ This project is developped under the AGPLv3 License (see LICENSE). ## Acknowlegment. +* [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) * [OpenAI Whisper](https://github.com/openai/whisper) +* [Ctranslate2](https://github.com/OpenNMT/CTranslate2) * [SpeechBrain](https://github.com/speechbrain/speechbrain). * [TorchAudio](https://github.com/pytorch/audio) * [HuggingFace Transformers](https://github.com/huggingface/transformers) \ No newline at end of file From 4c8f1c9db10451ae015b3912e2b0c525c5ed4ac4 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Tue, 12 Dec 2023 17:29:50 +0100 Subject: [PATCH 166/172] restart release history for the two new images --- kaldi/RELEASE.md | 55 +++------------------------------------------- whisper/RELEASE.md | 17 ++------------ 2 files changed, 5 insertions(+), 67 deletions(-) diff --git a/kaldi/RELEASE.md b/kaldi/RELEASE.md index 9966250..4bd02f5 100644 --- a/kaldi/RELEASE.md +++ b/kaldi/RELEASE.md @@ -1,52 +1,3 @@ -# 3.3.2 -- Fixed use of stereo audio in http serving mode - -# 3.3.1 -- Fixed lin_to_vosk throwing an error on a already existing container. -- Corrected an error on the README regarding mounting model volumes. 
-- Code styling (PEP 8) - -# 3.3.0 -- Added optional streaming route to the http serving mode -- Added serving mode: websocket -- Added Dynamic model conversion allowing to use either Vosk Models or Linagora AM/LM models -- Changer Vosk dependency to alphacep/vosk -- Updated README.md - -# 3.2.1 -- Repository total rework. The goal being to have a simple transcription service embeddable within a micro-service infrastructure. -- Changed repository name from linto-platform-stt-standalone-worker to linto-platform-stt. -- Added celery connector for microservice integration. -- Added launch option to specify serving mode between task and http. -- Removed diarization functionnality. -- Removed punctuation functionnality. -- Removed Async requests/Job management. -- Updated README to reflect those changes. - -# 3.1.1 -- Change Pykaldi with vosk-API (no python wrapper for decoding function, no extrat packages during installation, c++ implementation based on kaldi functions) -- New feature: Compute a confidence score per transcription -- Fix minor bugs - -# 2.2.1 -- Fix minor bugs -- put SWAGGER_PATH parameter as optional -- Generate the word_boundary file if it does not exist - -# 2.2.0 -- Speaker diarization feature: pyBK package -- Mulithreading feature: Speech decoding and Speaker diarization processes -- Optional parameter: real number of speaker in the audio - -# 2.0.0 -- Reimplement LinTO-Platform-stt-standalone-worker using Pykaldi package - -# 1.1.2 -- New features: - - Word timestamp computing - - Response type: plain/text: simple text output and application/json: the transcription and the words timestamp. - - Swagger: integrate swagger in the service using a python package - - Fix minor bugs - -# 1.0.0 -- First build of LinTO-Platform-stt-standalone-worker \ No newline at end of file +# 1.0.0 +- First build of linto-platform-stt-kaldi +- Based on 3.3.2 of linto-platform-stt (https://github.com/linto-ai/linto-platform-stt/blob/4361300a4463c90cec0bf3fa2975d7cc2ddf8d36/RELEASE.md) diff --git a/whisper/RELEASE.md b/whisper/RELEASE.md index 2d57069..53f203c 100644 --- a/whisper/RELEASE.md +++ b/whisper/RELEASE.md @@ -1,16 +1,3 @@ # 1.0.0 -- Support of Whisper (including large-v3 model) -- Add integration of Whisper models from transformers -- Add support of prompt from Whisper models (env variable PROMPT) -- Fix possible failure when a Whisper segment starts with a punctuation -- Tune punctuation heuristics - -# 0.0.0 -- Added optional streaming route to the http serving mode -- Added serving mode: websocket -- Added Dynamic model conversion allowing to use either Vosk Models or Linagora AM/LM models -- Added celery connector for microservice integration. -- Added launch option to specify serving mode between task and http. -- Removed Async requests/Job management. 
-- New feature: Compute a confidence score per transcription -- put SWAGGER_PATH parameter as optional +- First build of linto-platform-stt-whisper +- Based on 4.0.5 of linto-platform-stt https://github.com/linto-ai/linto-platform-stt/blob/a54b7b7ac2bc491a1795bb6dfb318a39c8b76d63/RELEASE.md From 4bb3c1d5a32199b5b531640c23c2a84804730013 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 13 Dec 2023 15:06:25 +0100 Subject: [PATCH 167/172] rename linto-platform-stt -> linto-plaftorm --- .github/workflows/dockerhub-description.yml | 17 +++++++++++--- Jenkinsfile | 4 ++-- README.md | 6 ++--- document/swagger.yml | 2 +- http_server/swagger.py | 2 +- kaldi/README.md | 26 ++++++++++----------- kaldi/RELEASE.md | 4 ++-- whisper/README.md | 26 ++++++++++----------- whisper/RELEASE.md | 4 ++-- 9 files changed, 51 insertions(+), 40 deletions(-) diff --git a/.github/workflows/dockerhub-description.yml b/.github/workflows/dockerhub-description.yml index 0367b21..1301449 100644 --- a/.github/workflows/dockerhub-description.yml +++ b/.github/workflows/dockerhub-description.yml @@ -7,7 +7,7 @@ on: - README.md - .github/workflows/dockerhub-description.yml jobs: - dockerHubDescription: + dockerHubDescriptionKaldi: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 @@ -16,5 +16,16 @@ jobs: with: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_PASSWORD }} - repository: lintoai/linto-platform-stt - readme-filepath: ./README.md + repository: lintoai/linto-stt-kaldi + readme-filepath: ./kaldi/README.md + dockerHubDescriptionWhisper: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: Docker Hub Description + uses: peter-evans/dockerhub-description@v3 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_PASSWORD }} + repository: lintoai/linto-stt-whisper + readme-filepath: ./whisper/README.md diff --git a/Jenkinsfile b/Jenkinsfile index 81d8ec8..99a4886 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -1,8 +1,8 @@ pipeline { agent any environment { - DOCKER_HUB_REPO_KALDI = "lintoai/linto-platform-stt-kaldi" - DOCKER_HUB_REPO_WHISPER = "lintoai/linto-platform-stt-whisper" + DOCKER_HUB_REPO_KALDI = "lintoai/linto-stt-kaldi" + DOCKER_HUB_REPO_WHISPER = "lintoai/linto-stt-whisper" DOCKER_HUB_CRED = 'docker-hub-credentials' } diff --git a/README.md b/README.md index ac846d3..09009fe 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,12 @@ -# LinTO-Platform-STT +# LinTO-STT -LinTO-Platform-STT is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack), +LinTO-STT is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack), which can currently work with Speech-To-Text (STT) models. The following families of STT models are currently supported (please refer to respective documentation for more details): * [Kaldi models](kaldi/README.md) * [Whisper models](whisper/README.md) -LinTO-Platform-STT can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. +LinTO-STT can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. ## License This project is developped under the AGPLv3 License (see LICENSE). 
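As a quick orientation on the rename above: once both images are published under their new names, either flavour can be pulled directly from Docker Hub instead of being built locally. A minimal sketch, assuming the `latest` tags pushed by the CI pipeline are available:

```bash
# Kaldi-based image (LinTO acoustic/language models or Vosk models)
docker pull lintoai/linto-stt-kaldi:latest

# Whisper-based image
docker pull lintoai/linto-stt-whisper:latest
```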
diff --git a/document/swagger.yml b/document/swagger.yml index 70bc9fc..6da4ed6 100644 --- a/document/swagger.yml +++ b/document/swagger.yml @@ -2,7 +2,7 @@ swagger: "2.0" info: version: "1.0.0" - title: LinTo-Platform-STT + title: LinTo-STT description: Speech To Text API contact: email: "support@linto.ai" diff --git a/http_server/swagger.py b/http_server/swagger.py index a9b93d0..31344cd 100644 --- a/http_server/swagger.py +++ b/http_server/swagger.py @@ -11,7 +11,7 @@ def setupSwaggerUI(app, args): args.swagger_prefix + args.swagger_url, args.swagger_path, config={ # Swagger UI config overrides - "app_name": "LinTO Platform STT", + "app_name": "LinTO STT", "spec": swagger_yml, }, ) diff --git a/kaldi/README.md b/kaldi/README.md index 9ab215f..444d3b3 100644 --- a/kaldi/README.md +++ b/kaldi/README.md @@ -1,9 +1,9 @@ -# LinTO-Platform-STT-Kaldi +# LinTO-STT-Kaldi -LinTO-Platform-STT-Kaldi is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack) +LinTO-STT-Kaldi is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack) based on Speech-To-Text (STT) models trained with [Kaldi](https://github.com/kaldi-asr/kaldi). -LinTO-Platform-STT-Kaldi can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. +LinTO-STT-Kaldi can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. ## Pre-requisites @@ -14,7 +14,7 @@ To run the transcription models you'll need: * One CPU per worker. Inference time scales on CPU performances. ### Model -LinTO-Platform-STT-Kaldi accepts two kinds of models: +LinTO-STT-Kaldi accepts two kinds of models: * LinTO Acoustic and Languages models. * Vosk models. @@ -28,19 +28,19 @@ The transcription service requires docker up and running. The STT only entry point in task mode are tasks posted on a message broker. Supported message broker are RabbitMQ, Redis, Amazon SQS. On addition, as to prevent large audio from transiting through the message broker, STT-Worker use a shared storage folder (SHARED_FOLDER). -## Deploy LinTO-Platform-STT-Kaldi +## Deploy LinTO-STT-Kaldi **1- First step is to build or pull the image:** ```bash -git clone https://github.com/linto-ai/linto-platform-stt.git -cd linto-platform-stt -docker build . -f kaldi/Dockerfile -t linto-platform-stt-kaldi:latest +git clone https://github.com/linto-ai/linto-stt.git +cd linto-stt +docker build . -f kaldi/Dockerfile -t linto-stt-kaldi:latest ``` or ```bash -docker pull lintoai/linto-platform-stt-kaldi +docker pull lintoai/linto-stt-kaldi ``` **2- Download the models** @@ -87,7 +87,7 @@ docker run --rm \ -v AM_PATH:/opt/AM \ -v LM_PATH:/opt/LM \ --env-file kaldi/.env \ -linto-platform-stt-kaldi:latest +linto-stt-kaldi:latest ``` This will run a container providing an [HTTP API](#http-api) binded on the host HOST_SERVING_PORT port. @@ -105,8 +105,8 @@ The HTTP serving mode connect a celery worker to a message broker. The SERVICE_MODE value in the .env should be set to ```task```. ->LinTO-platform-stt can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for transcription task on a message broker. ->LinTO-platform-stt in task mode is not intended to be launch manually. 
+>LinTO-STT-Kaldi can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for transcription task on a message broker. +>LinTO-STT-Kaldi in task mode is not intended to be launch manually. >However, if you intent to connect it to your custom message's broker here are the parameters: You need a message broker up and running at MY_SERVICE_BROKER. @@ -117,7 +117,7 @@ docker run --rm \ -v LM_PATH:/opt/LM \ -v SHARED_AUDIO_FOLDER:/opt/audio \ --env-file kaldi/.env \ -linto-platform-stt-kaldi:latest +linto-stt-kaldi:latest ``` **Parameters:** diff --git a/kaldi/RELEASE.md b/kaldi/RELEASE.md index 4bd02f5..e11f89a 100644 --- a/kaldi/RELEASE.md +++ b/kaldi/RELEASE.md @@ -1,3 +1,3 @@ # 1.0.0 -- First build of linto-platform-stt-kaldi -- Based on 3.3.2 of linto-platform-stt (https://github.com/linto-ai/linto-platform-stt/blob/4361300a4463c90cec0bf3fa2975d7cc2ddf8d36/RELEASE.md) +- First build of linto-stt-kaldi +- Based on 3.3.2 of linto-stt (https://github.com/linto-ai/linto-stt/blob/4361300a4463c90cec0bf3fa2975d7cc2ddf8d36/RELEASE.md) diff --git a/whisper/README.md b/whisper/README.md index f460093..f1eccd1 100644 --- a/whisper/README.md +++ b/whisper/README.md @@ -1,9 +1,9 @@ -# LinTO-Platform-STT-Whisper +# LinTO-STT-Whisper -LinTO-Platform-STT-Whisper is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack) +LinTO-STT-Whisper is the transcription service within the [LinTO stack](https://github.com/linto-ai/linto-platform-stack) based on Speech-To-Text (STT) [Whisper models](https://openai.com/research/whisper). -LinTO-Platform-STT-Whisper can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. +LinTO-STT-Whisper can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector. ## Pre-requisites @@ -15,7 +15,7 @@ To run the transcription models you'll need: ### Model(s) -LinTO-Platform-STT-Whisper works with a Whisper model to perform Automatic Speech Recognition, which must be in the PyTorch format. +LinTO-STT-Whisper works with a Whisper model to perform Automatic Speech Recognition, which must be in the PyTorch format. #### Optional alignment model (deprecated) @@ -37,19 +37,19 @@ The transcription service requires docker up and running. The STT only entry point in task mode are tasks posted on a message broker. Supported message broker are RabbitMQ, Redis, Amazon SQS. On addition, as to prevent large audio from transiting through the message broker, STT-Worker use a shared storage folder (SHARED_FOLDER). -## Deploy LinTO-Platform-STT-Whisper +## Deploy LinTO-STT-Whisper **1- First step is to build or pull the image:** ```bash -git clone https://github.com/linto-ai/linto-platform-stt.git -cd linto-platform-stt -docker build . -f whisper/Dockerfile.ctranslate2 -t linto-platform-stt-whisper:latest +git clone https://github.com/linto-ai/linto-stt.git +cd linto-stt +docker build . 
-f whisper/Dockerfile.ctranslate2 -t linto-stt-whisper:latest ``` or ```bash -docker pull lintoai/linto-platform-stt-whisper +docker pull lintoai/linto-stt-whisper ``` **2- Download the models** @@ -135,7 +135,7 @@ docker run --rm \ -p HOST_SERVING_PORT:80 \ -v ASR_PATH:/opt/model.pt \ --env-file whisper/.env \ -linto-platform-stt-whisper:latest +linto-stt-whisper:latest ``` This will run a container providing an [HTTP API](#http-api) binded on the host HOST_SERVING_PORT port. @@ -159,8 +159,8 @@ The HTTP serving mode connect a celery worker to a message broker. The SERVICE_MODE value in the .env should be set to ```task```. ->LinTO-platform-stt can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for transcription task on a message broker. ->LinTO-platform-stt in task mode is not intended to be launch manually. +>LinTO-STT-Whisper can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for transcription task on a message broker. +>LinTO-STT-Whisper in task mode is not intended to be launch manually. >However, if you intent to connect it to your custom message's broker here are the parameters: You need a message broker up and running at MY_SERVICE_BROKER. @@ -170,7 +170,7 @@ docker run --rm \ -v ASR_PATH:/opt/model.pt \ -v SHARED_AUDIO_FOLDER:/opt/audio \ --env-file whisper/.env \ -linto-platform-stt-whisper:latest +linto-stt-whisper:latest ``` You may also want to mount your cache folder CACHE_PATH (e.g. "~/.cache") ```-v CACHE_PATH:/root/.cache``` diff --git a/whisper/RELEASE.md b/whisper/RELEASE.md index 53f203c..2967139 100644 --- a/whisper/RELEASE.md +++ b/whisper/RELEASE.md @@ -1,3 +1,3 @@ # 1.0.0 -- First build of linto-platform-stt-whisper -- Based on 4.0.5 of linto-platform-stt https://github.com/linto-ai/linto-platform-stt/blob/a54b7b7ac2bc491a1795bb6dfb318a39c8b76d63/RELEASE.md +- First build of linto-stt-whisper +- Based on 4.0.5 of linto-stt https://github.com/linto-ai/linto-stt/blob/a54b7b7ac2bc491a1795bb6dfb318a39c8b76d63/RELEASE.md From ff42bd4ad09c740a13f8faa8d6d6ab1665082aa2 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 13 Dec 2023 15:37:30 +0100 Subject: [PATCH 168/172] remove wrong part --- kaldi/README.md | 4 ---- whisper/README.md | 4 ---- 2 files changed, 8 deletions(-) diff --git a/kaldi/README.md b/kaldi/README.md index 444d3b3..3bfa6c8 100644 --- a/kaldi/README.md +++ b/kaldi/README.md @@ -105,10 +105,6 @@ The HTTP serving mode connect a celery worker to a message broker. The SERVICE_MODE value in the .env should be set to ```task```. ->LinTO-STT-Kaldi can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for transcription task on a message broker. ->LinTO-STT-Kaldi in task mode is not intended to be launch manually. ->However, if you intent to connect it to your custom message's broker here are the parameters: - You need a message broker up and running at MY_SERVICE_BROKER. ```bash diff --git a/whisper/README.md b/whisper/README.md index f1eccd1..015eb42 100644 --- a/whisper/README.md +++ b/whisper/README.md @@ -159,10 +159,6 @@ The HTTP serving mode connect a celery worker to a message broker. The SERVICE_MODE value in the .env should be set to ```task```. 
->LinTO-STT-Whisper can be deployed within the linto-platform-stack through the use of linto-platform-services-manager. Used this way, the container spawn celery worker waiting for transcription task on a message broker. ->LinTO-STT-Whisper in task mode is not intended to be launch manually. ->However, if you intent to connect it to your custom message's broker here are the parameters: - You need a message broker up and running at MY_SERVICE_BROKER. ```bash From 690b7ee813c78b7ea82c9e3c9a8a643ef5ecc2a3 Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 13 Dec 2023 16:07:01 +0100 Subject: [PATCH 169/172] update the README with most common usage --- whisper/README.md | 67 +++++++++++++++++++++++++++-------------------- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/whisper/README.md b/whisper/README.md index 015eb42..d22fd72 100644 --- a/whisper/README.md +++ b/whisper/README.md @@ -15,11 +15,13 @@ To run the transcription models you'll need: ### Model(s) -LinTO-STT-Whisper works with a Whisper model to perform Automatic Speech Recognition, which must be in the PyTorch format. +LinTO-STT-Whisper works with a Whisper model to perform Automatic Speech Recognition. +If not downloaded already, the model will be downloaded when calling the first transcription, +and can occupy several GB of disk space. #### Optional alignment model (deprecated) -It can also work with a wav2vec model to perform word alignment. +LinTO-STT-Whisper has also the option to work with a wav2vec model to perform word alignment. The wav2vec model can be specified either * (TorchAudio) with a string corresponding to a `torchaudio` pipeline (e.g. "WAV2VEC2_ASR_BASE_960H") or * (HuggingFace's Transformers) with a string corresponding to a HuggingFace repository of a wav2vec model (e.g. "jonatasgrosman/wav2vec2-large-xlsr-53-english"), or @@ -39,7 +41,7 @@ On addition, as to prevent large audio from transiting through the message broke ## Deploy LinTO-STT-Whisper -**1- First step is to build or pull the image:** +### 1- First step is to build or pull the image ```bash git clone https://github.com/linto-ai/linto-stt.git @@ -52,29 +54,7 @@ or docker pull lintoai/linto-stt-whisper ``` -**2- Download the models** - -Have the Whisper model file ready at ASR_PATH. - -If you already used Whisper in the past, you may have models in ~/.cache/whisper. 
- -You can download mutli-lingual Whisper models with the following links: -* tiny: "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt -* base: https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt -* small: https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt -* medium: https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt -* large-v1: https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt -* large-v2: https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt - -Whisper models specialized for English can also be found here: -* tiny.en: "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt -* base.en: https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt -* small.en: https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt -* medium.en: https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt - -If may also want to download a specific wav2vec model for word alignment. - -**3- Fill the .env** +### 2- Fill the .env ```bash cp whisper/.envdefault whisper/.env @@ -83,7 +63,7 @@ cp whisper/.envdefault whisper/.env | PARAMETER | DESCRIPTION | EXEMPLE | |---|---|---| | SERVICE_MODE | STT serving mode see [Serving mode](#serving-mode) | `http` \| `task` | -| MODEL | Path to the Whisper model, or type of Whisper model used. | \ \| `medium` \| `large-v1` \| ... | +| MODEL | Path to a Whisper model, type of Whisper model used, or HuggingFace identifier of a Whisper model. | \ \| `large-v3` \| `distil-whisper/distil-large-v2` \| ... | | LANGUAGE | (Optional) Language to recognize | `*` \| `fr` \| `fr-FR` \| `French` \| `en` \| `en-US` \| `English` \| ... | | PROMPT | (Optional) Prompt to use for the Whisper model | `some free text to encourage a certain transcription style (disfluencies, no punctuation, ...)` | | ALIGNMENT_MODEL | (Optional) Path to the wav2vec model for word alignment, or name of HuggingFace repository or torchaudio pipeline | \ \| `WAV2VEC2_ASR_BASE_960H` \| `jonatasgrosman/wav2vec2-large-xlsr-53-english` \| ... | @@ -92,6 +72,36 @@ cp whisper/.envdefault whisper/.env | SERVICE_BROKER | (For the task mode) URL of the message broker | `redis://my-broker:6379` | | BROKER_PASS | (For the task mode only) broker password | `my-password` | +#### MODEL environment variable + +**Warning:** +The model will be (downloaded if required and) loaded in memory when calling the first transcription. +When using a Whisper model from Hugging Face (transformers) along with ctranslate2 (faster_whisper), +it will also download torch library to make the conversion from torch to ctranslate2. 
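One way to avoid pulling torch inside the container at run time is to convert the Hugging Face model to the CTranslate2 format beforehand and point `MODEL` at the resulting folder. The sketch below is only illustrative and is not part of this repository's tooling; the `ct2-transformers-converter` command and its flags come from the CTranslate2 / faster-whisper ecosystem and should be checked against their documentation:

```bash
# Illustrative offline conversion (assumes CTranslate2's converter is installed locally)
pip install ctranslate2 "transformers[torch]"

ct2-transformers-converter \
    --model distil-whisper/distil-large-v2 \
    --output_dir ./distil-large-v2-ct2 \
    --quantization float16

# The output folder could then be passed as MODEL (e.g. mounted into the container),
# assuming CTranslate2-format folders are accepted by the service.
```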
+ +If you want to preload the model (and later specify a path `ASR_PATH` as `MODEL`), +you may want to download one of OpenAI Whisper models: +* Mutli-lingual Whisper models can be downloaded with the following links: + * [tiny](https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt) + * [base](https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt) + * [small](https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt) + * [medium](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt) + * [large-v1](https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt) + * [large-v2](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt) + * [large-v3](https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt) +* Whisper models specialized for English can also be found here: + * [tiny.en](https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt) + * [base.en](https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt) + * [small.en](https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt) + * [medium.en](https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt) + +If you already used Whisper in the past locally using [OpenAI-Whipser](https://github.com/openai/whisper), models can be found under ~/.cache/whisper. + +The same apply for Whisper models from Hugging Face (transformers), as for instance https://huggingface.co/distil-whisper/distil-large-v2 +(you can either download the model or use the Hugging Face identifier `distil-whisper/distil-large-v2`). + +#### LANGUAGE + If `*` is used for the `LANGUAGE` environment variable, or if `LANGUAGE` is not defined, automatic language detection will be performed by Whisper. @@ -113,6 +123,7 @@ sv(swedish), sw(swahili), ta(tamil), te(telugu), tg(tajik), th(thai), tk(turkmen tr(turkish), tt(tatar), uk(ukrainian), ur(urdu), uz(uzbek), vi(vietnamese), yi(yiddish), yo(yoruba), zh(chinese) ``` +and also `yue(cantonese)` since large-v3. ### Serving mode ![Serving Modes](https://i.ibb.co/qrtv3Z6/platform-stt.png) @@ -266,6 +277,6 @@ This project is developped under the AGPLv3 License (see LICENSE). * [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) * [OpenAI Whisper](https://github.com/openai/whisper) * [Ctranslate2](https://github.com/OpenNMT/CTranslate2) -* [SpeechBrain](https://github.com/speechbrain/speechbrain). 
+* [SpeechBrain](https://github.com/speechbrain/speechbrain) * [TorchAudio](https://github.com/pytorch/audio) * [HuggingFace Transformers](https://github.com/huggingface/transformers) \ No newline at end of file From cfaaaf0a8835cd1a2b3e5cbc08ac37e253ba9048 Mon Sep 17 00:00:00 2001 From: Houpert Date: Wed, 13 Dec 2023 16:09:19 +0100 Subject: [PATCH 170/172] Update Jenkinsfile with new structure --- Jenkinsfile | 123 +++++++++++++++++++++++++++++++++++----------------- 1 file changed, 84 insertions(+), 39 deletions(-) diff --git a/Jenkinsfile b/Jenkinsfile index 99a4886..2548775 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -1,73 +1,118 @@ +def buildWhisper(image_name, version) { + echo "Building Dockerfile for ${image_name}... with version ${version}" + + script { + def image = docker.build(image_name, "-f whisper/Dockerfile.ctranslate2 .") + + docker.withRegistry('https://registry.hub.docker.com', 'docker-hub-credentials') { + if (version == 'latest-unstable') { + image.push('latest-unstable') + } else { + image.push('latest') + image.push(version) + } + } + } +} + +def buildKaldi(image_name, version) { + echo "Building Dockerfile for ${image_name}... with version ${version}" + + script { + def image = docker.build(image_name, "-f kaldi/Dockerfile .") + + docker.withRegistry('https://registry.hub.docker.com', 'docker-hub-credentials') { + if (version == 'latest-unstable') { + image.push('latest-unstable') + } else { + image.push('latest') + image.push(version) + } + } + } +} + pipeline { agent any environment { DOCKER_HUB_REPO_KALDI = "lintoai/linto-stt-kaldi" DOCKER_HUB_REPO_WHISPER = "lintoai/linto-stt-whisper" - DOCKER_HUB_CRED = 'docker-hub-credentials' + + VERSION_KALDI = '' + VERSION_WHISPER = '' } - stages{ - stage('Docker build for master branch'){ - when{ + stages { + stage('Docker build for master branch') { + when { branch 'master' } steps { echo 'Publishing latest' script { - image = docker.build(env.DOCKER_HUB_REPO_KALDI, "-f kaldi/Dockerfile .") - VERSION = sh( + def changedFiles = sh(returnStdout: true, script: 'git diff --name-only HEAD^ HEAD').trim() + echo "My changed files: ${changedFiles}" + + VERSION_KALDI = sh( returnStdout: true, script: "awk -v RS='' '/#/ {print; exit}' kaldi/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" ).trim() - docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - image.push("${VERSION}") - image.push('latest') - } - } - script { - image = docker.build(env.DOCKER_HUB_REPO_WHISPER, "-f whisper/Dockerfile.ctranslate2 .") - VERSION = sh( + VERSION_WHISPER = sh( returnStdout: true, script: "awk -v RS='' '/#/ {print; exit}' whisper/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" ).trim() + + if (changedFiles.contains('celery_app') || changedFiles.contains('http_server') || changedFiles.contains('websocket') || changedFiles.contains('document')) { + echo "Build kaldi version ${VERSION_KALDI}" + buildKaldi(env.DOCKER_HUB_REPO_KALDI, VERSION_KALDI) - docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - image.push("${VERSION}") - image.push('latest') + echo "Build whisper version ${VERSION_WHISPER}" + buildWhisper(env.DOCKER_HUB_REPO_WHISPER, VERSION_WHISPER) + }else { + if (changedFiles.contains('kaldi')) { + echo "Build kaldi version ${VERSION_KALDI}" + buildKaldi(env.DOCKER_HUB_REPO_KALDI, VERSION_KALDI) + } + if (changedFiles.contains('whisper')) { + echo "Build whisper version ${VERSION_WHISPER}" + buildWhisper(env.DOCKER_HUB_REPO_WHISPER, VERSION_WHISPER) + } } } } } - stage('Docker build 
for next (unstable) branch'){ - when{ + stage('Docker build for next (unstable) branch') { + when { branch 'next' } steps { echo 'Publishing unstable' script { - image = docker.build(env.DOCKER_HUB_REPO_KALDI, "-f kaldi/Dockerfile .") - VERSION = sh( - returnStdout: true, - script: "awk -v RS='' '/#/ {print; exit}' kaldi/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" - ).trim() - docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - image.push('latest-unstable') - } - } - script { - image = docker.build(env.DOCKER_HUB_REPO_WHISPER, "-f whisper/Dockerfile.ctranslate2 .") - VERSION = sh( - returnStdout: true, - script: "awk -v RS='' '/#/ {print; exit}' whisper/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" - ).trim() - docker.withRegistry('https://registry.hub.docker.com', env.DOCKER_HUB_CRED) { - image.push('latest-unstable') + def changedFiles = sh(returnStdout: true, script: 'git diff --name-only HEAD^ HEAD').trim() + echo "My changed files: ${changedFiles}" + + VERSION = 'latest-unstable' + + if (changedFiles.contains('celery_app') || changedFiles.contains('http_server') || changedFiles.contains('websocket') || changedFiles.contains('document')) { + echo 'Files in studio-api path are modified. Running specific build steps for studio-api...' + echo "Build whisper and kaldi version ${VERSION}" + + buildKaldi(env.DOCKER_HUB_REPO_KALDI, VERSION) + buildWhisper(env.DOCKER_HUB_REPO_WHISPER, VERSION) + }else { + if (changedFiles.contains('kaldi')) { + echo "Build kaldi version ${VERSION}" + buildKaldi(env.DOCKER_HUB_REPO_KALDI, VERSION) + } + if (changedFiles.contains('whisper')) { + echo "Build whisper version ${VERSION}" + buildWhisper(env.DOCKER_HUB_REPO_WHISPER, VERSION) + } } } } } - - }// end stages -} \ No newline at end of file + } +} From 9035609fca366d5e0ad3b82f1284b68131dfc07d Mon Sep 17 00:00:00 2001 From: Jeronymous Date: Wed, 13 Dec 2023 17:00:29 +0100 Subject: [PATCH 171/172] code factoring --- Jenkinsfile | 88 +++++++++++++---------------------------------------- 1 file changed, 21 insertions(+), 67 deletions(-) diff --git a/Jenkinsfile b/Jenkinsfile index 2548775..cd1ad07 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -1,32 +1,17 @@ -def buildWhisper(image_name, version) { - echo "Building Dockerfile for ${image_name}... with version ${version}" +def buildDockerfile(main_folder, dockerfilePath, image_name, version, changedFiles) { + if (changedFiles.contains(main_folder) || changedFiles.contains('celery_app') || changedFiles.contains('http_server') || changedFiles.contains('websocket') || changedFiles.contains('document')) { + echo "Building Dockerfile for ${image_name} with version ${version} (using ${dockerfilePath})" - script { - def image = docker.build(image_name, "-f whisper/Dockerfile.ctranslate2 .") + script { + def image = docker.build(image_name, "-f ${dockerfilePath} .") - docker.withRegistry('https://registry.hub.docker.com', 'docker-hub-credentials') { - if (version == 'latest-unstable') { - image.push('latest-unstable') - } else { - image.push('latest') - image.push(version) - } - } - } -} - -def buildKaldi(image_name, version) { - echo "Building Dockerfile for ${image_name}... 
with version ${version}" - - script { - def image = docker.build(image_name, "-f kaldi/Dockerfile .") - - docker.withRegistry('https://registry.hub.docker.com', 'docker-hub-credentials') { - if (version == 'latest-unstable') { - image.push('latest-unstable') - } else { - image.push('latest') - image.push(version) + docker.withRegistry('https://registry.hub.docker.com', 'docker-hub-credentials') { + if (version == 'latest-unstable') { + image.push('latest-unstable') + } else { + image.push('latest') + image.push(version) + } } } } @@ -37,11 +22,8 @@ pipeline { environment { DOCKER_HUB_REPO_KALDI = "lintoai/linto-stt-kaldi" DOCKER_HUB_REPO_WHISPER = "lintoai/linto-stt-whisper" - - VERSION_KALDI = '' - VERSION_WHISPER = '' } - + stages { stage('Docker build for master branch') { when { @@ -53,32 +35,18 @@ pipeline { def changedFiles = sh(returnStdout: true, script: 'git diff --name-only HEAD^ HEAD').trim() echo "My changed files: ${changedFiles}" - VERSION_KALDI = sh( + version_kaldi = sh( returnStdout: true, script: "awk -v RS='' '/#/ {print; exit}' kaldi/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" ).trim() - VERSION_WHISPER = sh( + version_whisper = sh( returnStdout: true, script: "awk -v RS='' '/#/ {print; exit}' whisper/RELEASE.md | head -1 | sed 's/#//' | sed 's/ //'" ).trim() - - if (changedFiles.contains('celery_app') || changedFiles.contains('http_server') || changedFiles.contains('websocket') || changedFiles.contains('document')) { - echo "Build kaldi version ${VERSION_KALDI}" - buildKaldi(env.DOCKER_HUB_REPO_KALDI, VERSION_KALDI) - echo "Build whisper version ${VERSION_WHISPER}" - buildWhisper(env.DOCKER_HUB_REPO_WHISPER, VERSION_WHISPER) - }else { - if (changedFiles.contains('kaldi')) { - echo "Build kaldi version ${VERSION_KALDI}" - buildKaldi(env.DOCKER_HUB_REPO_KALDI, VERSION_KALDI) - } - if (changedFiles.contains('whisper')) { - echo "Build whisper version ${VERSION_WHISPER}" - buildWhisper(env.DOCKER_HUB_REPO_WHISPER, VERSION_WHISPER) - } - } + buildDockerfile('kaldi', 'kaldi/Dockerfile', env.DOCKER_HUB_REPO_KALDI, version_kaldi, changedFiles) + buildDockerfile('whisper', 'whisper/Dockerfile.ctranslate2', env.DOCKER_HUB_REPO_WHISPER, version_whisper, changedFiles) } } } @@ -93,26 +61,12 @@ pipeline { def changedFiles = sh(returnStdout: true, script: 'git diff --name-only HEAD^ HEAD').trim() echo "My changed files: ${changedFiles}" - VERSION = 'latest-unstable' - - if (changedFiles.contains('celery_app') || changedFiles.contains('http_server') || changedFiles.contains('websocket') || changedFiles.contains('document')) { - echo 'Files in studio-api path are modified. Running specific build steps for studio-api...' 
- echo "Build whisper and kaldi version ${VERSION}" + version = 'latest-unstable' - buildKaldi(env.DOCKER_HUB_REPO_KALDI, VERSION) - buildWhisper(env.DOCKER_HUB_REPO_WHISPER, VERSION) - }else { - if (changedFiles.contains('kaldi')) { - echo "Build kaldi version ${VERSION}" - buildKaldi(env.DOCKER_HUB_REPO_KALDI, VERSION) - } - if (changedFiles.contains('whisper')) { - echo "Build whisper version ${VERSION}" - buildWhisper(env.DOCKER_HUB_REPO_WHISPER, VERSION) - } - } + buildDockerfile('kaldi', 'kaldi/Dockerfile', env.DOCKER_HUB_REPO_KALDI, version, changedFiles) + buildDockerfile('whisper', 'whisper/Dockerfile.ctranslate2', env.DOCKER_HUB_REPO_WHISPER, version, changedFiles) } } } } -} +} \ No newline at end of file From 3b29d7163c0ad25856e5589932f1f43605cceefc Mon Sep 17 00:00:00 2001 From: gaydmi Date: Mon, 18 Dec 2023 10:21:17 +0100 Subject: [PATCH 172/172] Fixed small typos in README file --- kaldi/README.md | 2 +- whisper/README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/kaldi/README.md b/kaldi/README.md index 3bfa6c8..0e3a31a 100644 --- a/kaldi/README.md +++ b/kaldi/README.md @@ -101,7 +101,7 @@ This will run a container providing an [HTTP API](#http-api) binded on the host | MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model | ### Micro-service within LinTO-Platform stack -The HTTP serving mode connect a celery worker to a message broker. +The TASK serving mode connect a celery worker to a message broker. The SERVICE_MODE value in the .env should be set to ```task```. diff --git a/whisper/README.md b/whisper/README.md index d22fd72..20a3c7d 100644 --- a/whisper/README.md +++ b/whisper/README.md @@ -166,7 +166,7 @@ you can add option ```-v WAV2VEC_PATH:/opt/wav2vec``` and environment variable ` | WAV2VEC_PATH | (Optional) Path to a folder to a custom wav2vec alignment model | /my/path/to/models/wav2vec | ### Micro-service within LinTO-Platform stack -The HTTP serving mode connect a celery worker to a message broker. +The TASK serving mode connect a celery worker to a message broker. The SERVICE_MODE value in the .env should be set to ```task```.
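
For reference, a hedged sketch of what a `whisper/.env` could contain for this task serving mode, using only variables documented above (all values are placeholders to adapt):

```bash
# Hypothetical whisper/.env for task mode; adapt broker URL, password and model choice
SERVICE_MODE=task
MODEL=large-v3
LANGUAGE=fr
SERVICE_BROKER=redis://my-broker:6379
BROKER_PASS=my-password
```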