Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable orc code generation at build-time for iOS #26

Open
sdroege opened this issue Feb 17, 2015 · 21 comments
Open

Enable orc code generation at build-time for iOS #26

sdroege opened this issue Feb 17, 2015 · 21 comments

Comments

@sdroege
Copy link
Contributor

sdroege commented Feb 17, 2015

This would be needed for using videoconvert and videoscale properly on iOS, as it's impossible to allocate executable memory on iOS nowadays for Apple reasons.

https://bugzilla.gnome.org/show_bug.cgi?id=742843 is about this issue too. Basically orcc can output NEON assembly at build time already, but there are some problems with getting that to run and also with outputting something for orc programs that contain unsupported opcodes for the specific target.

Ideally we would change orcc to output backup C code plus inline assembly (or in a separate file is ok too), and then at runtime it selects which one to use. As that is still possible on iOS.

@ijsf
Copy link

ijsf commented Feb 24, 2015

Just a comment on current performance: I just tested the latest master of everything on iPad Mini 1, and performance is crawling Bowser to a near halt. Using Time Analyzer, the main bottlenecks seem to be the orc_* related functions involved with video (notably orc_executor_emulate). Not really surprising.

P.S. Another bottleneck seems to be Opus and all the audio analysis it's performing, but I believe that was mentioned in another issue already, as well as aes_encrypt taking some time.

@ford-prefect
Copy link
Contributor

I ran Bowser on an iPod Touch, and the top threads for CPU consumption were:

  1. intervideosrc, 8.5%, most time spent in gst_buffer_make_writable performing a copy
  2. audioconvert, 8.3%
  3. opusenc, 8.2%
  4. videoconvert, 8.1%
  5. audioconvert, 6.1%
  6. nicesrc, 6.1%, 20% in debug code, reported in libnice performance in networking I/O openwebrtc#196)
  7. vp8dec, 5.8%
  8. videoconvert, 5.3%
  9. audioresample, 4.0%

There are a number of issues to tackle here:

  1. Merging some of @ikonst's work to decrease the amount of audio conversion, and possibly extending that to move the work to the RemoteIO AudioUnit might help
  2. The fact that the video encoder doesn't even turn up in the top 10 might need investigation
  3. If we get Orc to work, that should help perf on 4 of those *convert threads
  4. Opus git suggests that some NEON optimisations were added since 1.0.3 which we are using -- this might help as well
  5. I wonder if we can/should look at the possibility of pushing out non-writable buffers from intervideosrc (since we only modify metadata)

@superdump
Copy link
Contributor

do you know if we're sending h264 in that case? I expect so... In which case maybe that isn't weird at all.

The memory copy in intervideosrc was surprising. I think zero copy and queue like behavior are really needed for those to have low latency and high performance/low power drain applications.

videoconvert is more prominent in Bowser because of the rendering mechanism which needs to convert both the self and remote views to BGRA to send as bitmaps to the WebView for rendering. In a real app using glimagesink or similar, those would not be visible. As such and in general, we should perhaps try to optimize the colour space of the video for encoders and rely on having fast colour space conversion in renderers.

As for audio, absolutely we should try to avoid conversions that are unnecessary. Noops for the win.

@ford-prefect
Copy link
Contributor

Correction: we're already on Opus 1.1, but it doesn't seem like we're actually using the appropriate optimisations. Will take a look at this.

Update: the recipe's already supposed to be setting things up to use intrinsics, so bumping investigating this down.

@ford-prefect
Copy link
Contributor

We're using vp8enc, which has 1.2% CPU usage. It's possible the frames just aren't reaching the encoder.

@superdump
Copy link
Contributor

Or that they're perfect black frames?

@sdroege
Copy link
Contributor Author

sdroege commented Feb 24, 2015

intervideosrc and gst_buffer_make_writable() will only copy the underlying memory if it wasn't shareable, which is probably the problem here. We can't push non-writable buffers out there because we need to update the timestamps

@ijsf
Copy link

ijsf commented Feb 24, 2015

So here's the full analysis.

Scenario: testing with demo.openwebrtc.io:38080, A-side: Bowser (master), B-side: Chrome (OS X 10.9.5), local LAN.
Hardware: iPad Mini 1 (A5, armv7)

Remarks:

  • Audio latency B to A ~2 seconds.
  • Audio latency A to B ~8 seconds.
  • Local video latency A ~1 second.
  • Remote video from B on A-side black most of the time, ~0.5 fps and latency ~5 seconds.

Time Analysis (running ~10 minutes, calling ~9 minutes):

  • 5.7% _platform_memmove, gst_buffer_copy_into (gst_inter_video)
  • 4.1% gst_audio_quantize_quantize_signed_tpdf_none (gst_audio_convert_transform)
  • 3.3% gst_audio_quantize_quantize_signed_tpdf_none (gst_audio_convert_transform)
  • 1.7% speex_resampler_process_native (gst_audio_resample)
  • 1.4% emulate_mulhsw (orc, videoconverter)
  • 1.3% orc_executor_emulate (orc, videoconverter)
  • 1.1% _platform_memmove (Bowser, WebCore::SharedBuffer::createPurgeableBuffer)
  • 1.1% _platform_memmove, _owr_image_renderer_pull_bmp_image (openwebrtc)
  • 1.0% opus_fft (opus)
  • 1.0% celt_pitch_xcorr_c (opus)
  • 0.8% aes_encrypt (srtp)
  • 0.7% tonality_analysis (opus)
  • 0.7% emulate_addw (orc, videoconverter)
  • 0.7% emulate_convssswb (orc, videoconverter)
  • 0.5% opus_ifft (opus)
  • 0.4% alg_quant (opus)
    ...

(Excludes minor unavoidable functions involved with sockets and iOS libraries such as QuartzCore.)

Quite a few interesting bottlenecks apart from orc. What's also interesting is the fact that the video encoder itself doesn't really show up here, while audio and video conversion does seem to be a significant hit. Seems a little backwards.

@superdump
Copy link
Contributor

  • It looks like a recent commit from GStreamer should fix the first 5.7% - on IRC @sdroege said:

http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/commit/?id=08b1d27e14e28ed028d73de1e5f7ec2e9c0a0d71 this might fix the make_writable() copying in intervideosrc

  • @ikonst is working on some stuff to correctly configure the source that should either eliminate or reduce (depending on what Core Audio decides to do) the audioresample stuff
  • All of the audioconvert, videoconvert, orc stuff should be improved by fixing this issue
  • Opus has improved optimisations for ARM and others on the way in their 1.1.1 release for which there is a beta. We updated to 1.1 recently I think.
  • The two memmoves are due to the rendering method used in Bowser where we create a bitmap from the raw BGRA frame (_owr_image_renderer_pull_bmp_image) and then send it over to UIWebView for drawing (WebCore::SharedBuffer::createPurgeableBuffer)

I don't think there's so much left after that. 😄 Crypto pops up but maybe addressing EricssonResearch/openwebrtc#199 will fix that

@stefanalund
Copy link
Contributor

@ijsf you mentioned earlier that you were interested in helping out. How about looking at this issue (ORC)?

@ijsf
Copy link

ijsf commented Feb 24, 2015

Sure, I'll try and have a look at this once my rebuild is done. I've forked the orc repository at https://github.com/ijsf/OpenWebRTC-orc

@ijsf
Copy link

ijsf commented Feb 26, 2015

Currently working on adding a --static-implementation option in the above fork.

@ijsf
Copy link

ijsf commented Feb 27, 2015

This fork is now capable of compiling static functions with inline asm code (and backup C functions for those that don't have an asm implementation). It's almost compilable.

The drawback is that I also had to clone gstreamer-common, the gstreamer and all gst* repositories, as common/orc.mak (which contains the arguments for orcc) is a submodule (gstreamer-common) in the gst* and gstreamer git repositories.

Furthermore, I found out there's also a "--target " (neon for iOS) command-line argument missing there (just hardcoded to neon for now, needs to be fixed).

So in short, the command-line needs to change from "... --implementation ..." to "... -target --static-implementation". Could somebody give me some pointers on how to do this as easy as possible?

@sdroege
Copy link
Contributor Author

sdroege commented Feb 27, 2015

I think there should be a configure parameter to select this:
--enable-static-orc=targets
where targets can be a list of targets.

Then orc.mak would do its thing depending on that.

@ijsf
Copy link

ijsf commented Apr 12, 2015

This is done as far as I'm concerned.

@superdump
Copy link
Contributor

@ijsf I'll keep it open until it lands and the above bugzilla ticket is closed. Should we create a separate issue about 64-bit ARM support? As far as I recall, that's a requirement for new App Store submissions. Well, we can support 64-but ARM with fallback C code but the performance is horrible.

@ijsf
Copy link

ijsf commented Apr 12, 2015

I think 64-bit ARM support is also important, but I won't be able to do this as I do not have any arm64 devices available.

The move to 64-bit ARM probably won't require a huge amount of changes. The most important changes are described here: http://community.arm.com/docs/DOC-8453

@superdump
Copy link
Contributor

We do and we can test and report back:

@ijsf
Copy link

ijsf commented Apr 12, 2015

One of the most important changes is the reordering of the s/d/q registers between 32-bit and 64-bit. It's not really an option for me as I believe this will require a lot of back and forth testing/verification on arm64.

@stefanalund
Copy link
Contributor

I was under the impression that ARM 64 bit instruction set is backwards compatible with 32 bit. Maybe that is not relevant here?

@ijsf
Copy link

ijsf commented Apr 12, 2015

Correct, but as far as I know only in 32-bit mode. Check the above docs for more information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants